llama.cpp

Mirror of https://github.com/ggerganov/llama.cpp.git, synced 2025-04-22 22:56:05 +00:00

vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.
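
The restructuring described above can be illustrated with a CPU-side sketch. This is not the shader code from mul_mm_cm2.comp. `Block`, `decode_scale_min`, `dot_redundant`, and `dot_hoisted` are hypothetical names, the 12-byte packing of 6-bit scales and mins is modeled on the K-quant format and should be treated as an assumption, and the nibble order is simplified. A local array stands in for shared memory, and the pipelining of the copy is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kSubBlocks = 8;      // sub-blocks per block
constexpr int kSubBlockSize = 32;  // values per sub-block
constexpr int kBlockSize = kSubBlocks * kSubBlockSize;

// Hypothetical, simplified block: 4 B of block-level scales plus 12 B of
// packed 6-bit sub-block scales/mins (16 B of scale data), then 4-bit values.
struct Block {
    float d;                          // block-level scale (fp16 in practice)
    float dmin;                       // block-level min scale
    std::array<uint8_t, 12> scales;   // 8 scales + 8 mins, 6 bits each
    std::array<uint8_t, 128> qs;      // 256 nibbles, simplified ordering
};

struct ScaleMin { float scale; float min; };

// Unpack the 6-bit scale and min of sub-block j (assumed bit layout).
inline ScaleMin decode_scale_min(const Block& b, int j) {
    const auto& q = b.scales;
    uint8_t sc, m;
    if (j < 4) {
        sc = q[j] & 63;
        m  = q[j + 4] & 63;
    } else {
        sc = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        m  = (q[j + 4] >> 4)  | ((q[j] >> 6) << 4);
    }
    return { b.d * sc, b.dmin * m };
}

inline int nibble(const Block& b, int k) {
    const uint8_t byte = b.qs[k / 2];
    return (k % 2 == 0) ? (byte & 0xF) : (byte >> 4);
}

// Before: the scale data is re-read and re-decoded for every element.
float dot_redundant(const std::vector<Block>& blocks, const std::vector<float>& x) {
    assert(x.size() == blocks.size() * kBlockSize);
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const Block& b = blocks[i / kBlockSize];
        const int k = static_cast<int>(i % kBlockSize);
        const ScaleMin sm = decode_scale_min(b, k / kSubBlockSize);
        acc += (sm.scale * nibble(b, k) - sm.min) * x[i];
    }
    return acc;
}

// After: the outer loop walks whole blocks and decodes the scales once into
// a small cache (the stand-in for shared memory); the inner loops reuse it.
float dot_hoisted(const std::vector<Block>& blocks, const std::vector<float>& x) {
    assert(x.size() == blocks.size() * kBlockSize);
    float acc = 0.0f;
    for (std::size_t ib = 0; ib < blocks.size(); ++ib) {
        const Block& b = blocks[ib];
        std::array<ScaleMin, kSubBlocks> cache;
        for (int j = 0; j < kSubBlocks; ++j) cache[j] = decode_scale_min(b, j);
        for (int j = 0; j < kSubBlocks; ++j) {
            for (int l = 0; l < kSubBlockSize; ++l) {
                const int k = j * kSubBlockSize + l;
                acc += (cache[j].scale * nibble(b, k) - cache[j].min)
                       * x[ib * kBlockSize + k];
            }
        }
    }
    return acc;
}
```

Both functions visit elements in the same order and apply the same arithmetic; the only difference is how often the scale bytes are decoded (once per element versus eight times per block).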

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.

2025-04-09 07:25:08 +02:00

cmake

scripts : update sync + fix cmake merge

2025-03-27 10:09:29 +02:00

include

metal : improve FA + improve MoE (#12612)

2025-03-28 20:21:59 +02:00

src

vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)

2025-04-09 07:25:08 +02:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : add logging for native build options/vars (whisper/2935)

2025-03-30 08:33:31 +03:00