llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-04-19 13:06:10 +00:00

vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)

When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each; now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
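
The workgroup arithmetic in the example above can be sketched as follows. This is an illustration only, not code from the shader or the Vulkan backend: the names FaDispatch and fa_dispatch are hypothetical, and the sketch assumes the number of Q heads is an exact multiple of the number of K/V heads and ignores any other conditions the real implementation may check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not part of llama.cpp.
struct FaDispatch {
    uint32_t workgroups;            // number of workgroups launched
    uint32_t results_per_workgroup; // Q batches handled by each workgroup
};

// q_heads:   size of dimension 2 of q   (32 in the example)
// kv_heads:  size of dimension 2 of k/v (8 in the example)
// group_gqa: false = one workgroup per Q batch (old behaviour),
//            true  = Q batches sharing the same K/V go to one workgroup
inline FaDispatch fa_dispatch(uint32_t q_heads, uint32_t kv_heads, bool group_gqa) {
    // Assumption: Q heads divide evenly across K/V heads.
    assert(kv_heads > 0 && q_heads % kv_heads == 0);

    if (!group_gqa) {
        return FaDispatch{q_heads, 1};
    }
    const uint32_t gqa_ratio = q_heads / kv_heads; // 32 / 8 = 4
    return FaDispatch{kv_heads, gqa_ratio};
}
```

With q_heads = 32 and kv_heads = 8, the ungrouped case gives 32 workgroups with 1 result each and the grouped case gives 8 workgroups with 4 results each, matching the numbers in the commit message.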

2025-04-02 19:40:32 +02:00

cmake

scripts : update sync + fix cmake merge

2025-03-27 10:09:29 +02:00

include

metal : improve FA + improve MoE (#12612)

2025-03-28 20:21:59 +02:00

src

vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)

2025-04-02 19:40:32 +02:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : add logging for native build options/vars (whisper/2935)

2025-03-30 08:33:31 +03:00