q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.
This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.
The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
* Update ChatScreen.tsx
* useAutosizeTextarea.ts
useAutosizeTextarea to encapsulate the logic.
* Implement responsive auto-sizing chat textarea
Replaces the manual textarea resizing with an automatic height adjustment based on content.
- `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
- Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
- Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
- Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.
* -update compressed index.html.gz after npm run build
-refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook
* chore: normalize line endings to LF
refactor: AutosizeTextareaApi -> chatTextareaApi
* refactor: Rename interface to PascalCase
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* gguf-py : support lazy tensor splitting
Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.
* gguf-py : fix flake8 lint
* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr
* cpu: refactor SIMD mappings and vectorized op functions into separate files
* Fix warning for ggml_float to float
* Fix warnings
* cpu: move all the operations (except mul_mat) to a separate c++ file
* fix whitespace
* Update ggml/src/ggml-cpu/vec.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp
* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
* common: custom hf endpoint support
Add support for custom huggingface endpoints via HF_ENDPOINT environment variable
You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.
Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"
The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"
* Update common/arg.cpp
readability Improvement
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Apply suggestions from code review
---------
Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Upgrade daisyui, tailwindcss.
* Switch to all themes.
* Revert a change.
* Update formatting.
* Install packages before npm build.
* Revert "Install packages before npm build."
This reverts commit 336c5147e614e60993162794ba9d9d4629a916f8.
* Add index.html.gz
* run build
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This commit adds a new section to the README.md file, detailing the
usage of the XCFramework.
The motivation for this is that it might not be immediately clear to
users how to use the XCFramework in their projects and hopefully this
will help.
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
* Prefer vector flash decoding kernel for Gemma models
Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by upto 12% in gen phase throughput for Gemm models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers
Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.
Fixes#12152
* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors
---------
Co-authored-by: slaren <slarengh@gmail.com>