5061 Commits

Author SHA1 Message Date
R0CKSTAR
916c83bfe7
musa: fix compilation warnings in mp_22/31 (#12780)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5061
2025-04-06 15:23:54 +02:00
Jeff Bolz
0c74b04376
vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
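Why the initial value matters: for a fully masked row every score is -inf, so with an -inf starting maximum the softmax shift computes -inf - (-inf) = NaN. A minimal C++ sketch of the failure mode (illustrative only; the actual fix is in the Vulkan shader):

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // A fully masked row: every attention score is -inf.
    const float score = -INFINITY;

    const float bad_init  = -INFINITY;    // old initial maximum
    const float good_init = -FLT_MAX / 2; // new: finite, with headroom left

    const float bad_max  = fmaxf(bad_init,  score); // still -inf
    const float good_max = fmaxf(good_init, score); // -FLT_MAX/2

    printf("%f\n", expf(score - bad_max));  // -inf - (-inf) = NaN -> nan
    printf("%f\n", expf(score - good_max)); // exp(-inf) = 0, well-defined
    return 0;
}
```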
b5060
2025-04-06 11:03:47 +02:00
Jeff Bolz
80b717d493
vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
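A small sketch of the invariants that make unclamped loads safe; the tile dimensions Br/Bc and the function itself are illustrative assumptions, not identifiers from the shader:

```cpp
#include <cassert>

// Br/Bc = rows/columns of one mask tile (assumed names).
void check_unclamped_load_safety(int nem1, int kv_dim, int Br, int Bc,
                                 int mask_pad /* GGML_KQ_MASK_PAD */) {
    assert(nem1 % mask_pad == 0); // mask rows padded to GGML_KQ_MASK_PAD
    assert(mask_pad % Br == 0);   // pad is a multiple of the tile row count
    assert(kv_dim % Bc == 0);     // aligned shader: KV dim covers whole tiles
    // Together these guarantee every Br x Bc mask tile is fully in-bounds,
    // so the loads need no per-element clamping.
}
```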
b5059
2025-04-06 10:47:13 +02:00
0cc4m
6bf28f0111
Vulkan: Tune Vulkan mmq int dot shader for performance (#12767)
b5058
2025-04-05 18:04:03 +02:00
Sergey Fedorov
f1e3eb4249
common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include
b5057
2025-04-05 17:46:00 +02:00
Xuan-Son Nguyen
0364178ca2
clip : refactor clip_init, add tests (#12757)
* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5056
2025-04-05 17:17:40 +02:00
エシュナヴァリシア
c6ff5d2a8d
common: custom hf endpoint support (#12769)
* common: custom hf endpoint support

Add support for custom huggingface endpoints via HF_ENDPOINT environment variable

You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.

Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

* Update common/arg.cpp

Readability improvement

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5055
2025-04-05 15:31:42 +02:00
Olivier Chafik
7a84777f42
sync: minja (#12739)
* sync: minja

https://github.com/google/minja/pull/57

* fix json include
b5054
2025-04-04 21:16:39 +01:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
b5053
2025-04-04 21:48:10 +03:00
bandoti
1be76e4620
ci: add Linux cross-compile build (#12428)
b5052
2025-04-04 14:05:12 -03:00
Nauful Shaikh
b772394297
server : webui : Upgrade daisyui, tailwindcss. (#12735)
* Upgrade daisyui, tailwindcss.

* Switch to all themes.

* Revert a change.

* Update formatting.

* Install packages before npm build.

* Revert "Install packages before npm build."

This reverts commit 336c5147e614e60993162794ba9d9d4629a916f8.

* Add index.html.gz

* run build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-04 16:09:52 +02:00
nick huang
23106f94ea
gguf-split : --merge now respects --dry-run option (#12681)
* gguf-split now respects dry-run option

* removing trailing space
b5050
2025-04-04 16:09:12 +02:00
Nicolò Scipione
94148ba330
sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625)
b5049
2025-04-04 16:00:46 +02:00
Ronny Brendel
9ac4d611d0
cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
fixes error for compiler paths with spaces
2025-04-04 10:12:40 -03:00
Daniel Bevenius
348888e0dc
docs : add XCFramework section to README.md [no ci] (#12746)
This commit adds a new section to the README.md file, detailing the
usage of the XCFramework.

The motivation for this is that it might not be immediately clear to
users how to use the XCFramework in their projects, and hopefully this
will help.
2025-04-04 10:24:12 +02:00
Jeff Bolz
74d4f5b041
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
There seems to be a bubble waking up from waitForFences, which costs a few
percent of performance and also increases variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete, and we
waitForFences for the almost_ready fence and then spin (with _mm_pause) waiting
for the final fence to be signaled.
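A hedged C++ sketch of the hybrid wait; the fence roles and the 80% split point come from the description above, everything else is illustrative:

```cpp
#include <cstdint>
#include <immintrin.h>
#include <vulkan/vulkan.h>

// Assumes "almost_ready" was submitted ~80% of the way through the graph.
void hybrid_wait(VkDevice device, VkFence almost_ready, VkFence final_fence) {
    // Sleep-wait for the bulk of the graph...
    vkWaitForFences(device, 1, &almost_ready, VK_TRUE, UINT64_MAX);
    // ...then spin for the final ~20% to avoid the wake-up bubble.
    while (vkGetFenceStatus(device, final_fence) == VK_NOT_READY) {
        _mm_pause();
    }
}
```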
b5046
2025-04-04 07:54:35 +02:00
Jeff Bolz
35e592eb30
vulkan: set cmake minimum and project name in vulkan-shaders (#12744)
b5045
2025-04-04 07:53:20 +02:00
lhez
7d7b1bafa7
opencl: update doc for OpenCL (#12702)
* opencl: add OpenCL to build.md

* opencl: remove fixed issue/TODO

* opencl: add link to OPENCL.md

* opencl: update doc - refine tools requirement for Windows 11 arm64
2025-04-03 22:18:17 -07:00
Gaurav Garg
c262beddf2
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in generation-phase throughput for Gemma models (see the sketch after this commit message).

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
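A sketch of the kind of dispatch predicate this change relaxes; the function name and the exact supported-dimension set are assumptions for illustration, not the actual fattn.cu logic:

```cpp
// Hypothetical kernel-selection predicate.
bool prefer_vector_flash_decoding(int head_dim, int n_tokens) {
    // Before this change, head_dim == 256 (Gemma) was excluded and fell
    // back to a slower kernel; removing the cap lets it take this path.
    const bool dim_ok = head_dim == 64 || head_dim == 128 || head_dim == 256;
    return dim_ok && n_tokens == 1; // decode/generation phase
}
```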
b5043
2025-04-03 18:20:29 +02:00
yumeyao
5dd5d1ab00
vocab : use string_view::find() to avoid unnecessary looking up beyond the fragment range (#12706)
2025-04-03 18:32:54 +03:00
Jeff Bolz
1c059995e0
vulkan: Fix missing cmake logic for dot product extension (#12721)
b5041
2025-04-03 10:08:26 -05:00
Atharva Dubey
2004644b7a
ci : add env variable in ggml-ci and document the same in SYCL.md (#12736)
2025-04-03 15:12:39 +03:00
R0CKSTAR
5f696e88e0
sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5039
2025-04-03 13:51:35 +02:00
a3sh
193c3e03a6
fix MUSA compiler warning (#12704)
* fix MUSA compiler warning

* replace (void) with GGML_UNUSED
b5038
2025-04-03 09:32:55 +02:00
Chenguang Li
65cfe136a0
CANN: Support operator SIN COS ARGMAX (#12709)
* [CANN]support sin cos argmax

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]codestyle adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]Remove redundant code

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
b5037
2025-04-03 15:18:08 +08:00
Alan Gray
3f9da22c2b
Simplify and improve CUDA graphs through use of indirect copy pointers (#9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due to
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.

Fixes #12152
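A hedged CUDA sketch of the indirection idea; the kernel and names are illustrative, not the actual ggml-cuda code:

```cpp
#include <cuda_runtime.h>

// The kernel parameter is a fixed pointer to a table of destination
// pointers; only the table contents change between tokens, so the
// captured graph never needs its kernel parameters patched.
__global__ void cpy_indirect(const char * src, char ** dst_table,
                             int idx, size_t n) {
    const size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
    char * dst = dst_table[idx]; // destination resolved at run time
    if (i < n) {
        dst[i] = src[i];
    }
}

// Host side (sketch): allocate dst_table once with cudaMalloc, capture the
// graph, then per token update the table entry and replay the graph:
//   cudaMemcpyAsync(dst_table + idx, &new_kv_ptr, sizeof(char *),
//                   cudaMemcpyHostToDevice, stream);
//   cudaGraphLaunch(graph_exec, stream);
```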

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
b5036
2025-04-03 03:31:15 +02:00
hipudding
2a0dc97e56
CANN: Fix failed test cases (#12708)
* CANN: Fix memory waste in aclnn_tensor

* CANN: fix backend ops fail

* CANN: fix acl_tensor memory alloc.

* CANN: format

* CANN: remove trailing whitespace
b5035
2025-04-03 08:49:51 +08:00
lhez
97a20c012b
opencl: use max_alloc_size in backend ctx instead of querying again (#12705)
b5034
2025-04-02 17:01:42 -07:00
Jeff Bolz
f01bd02376
vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using grouped query attention, we have one workgroup per KV batch, and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
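A sketch of one way such a split_k factor could be chosen; the heuristic, names, and cap are assumptions, not the shader's actual selection logic:

```cpp
// Hypothetical heuristic: split the KV dimension until there is roughly
// one workgroup per SM; partial results are reduced in a second pass.
int choose_split_k(int num_workgroups, int num_sms) {
    int split_k = 1;
    while (num_workgroups * split_k < num_sms && split_k < 16) {
        split_k *= 2;
    }
    return split_k;
}
```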
b5033
2025-04-02 14:25:08 -05:00
bandoti
6f3bd38640
cmake: remove caching from vulkan coopmat checks (#12719)
b5032
2025-04-02 14:56:26 -03:00
Jeff Bolz
be0a0f8cae
vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each; now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k, which will scale much
better with 4x fewer workgroups.
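The workgroup arithmetic above, as a runnable C++ check (values taken from the example shapes):

```cpp
#include <cstdio>

// Worked example: q(128,1,32,1) vs k/v(128,16640,8,1).
int main() {
    const int q_heads  = 32;
    const int kv_heads = 8;
    const int group    = q_heads / kv_heads; // 4 Q batches share one KV batch
    const int wgs      = q_heads / group;    // 8 workgroups, 4 results each
    printf("group=%d workgroups=%d\n", group, wgs); // group=4 workgroups=8
    return 0;
}
```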
b5031
2025-04-02 19:40:32 +02:00
0cc4m
92e3006bb6
Vulkan: Fix mmq int dot float cache size (#12722)
b5030
2025-04-02 19:12:30 +02:00
Georgi Gerganov
833e2b7409
model : print tensor size during load (#12711)
* model : print tensor size during load

* cont : fix units MB -> MiB

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5029
2025-04-02 16:38:54 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
b5028
2025-04-02 14:52:01 +02:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Sigbjørn Skjæret
83a88bd6af
vocab : BailingMoE : change possessive quantifiers to greedy (#12677)
b5026
2025-04-02 11:21:48 +02:00
Xuan-Son Nguyen
42eb248f46
common : remove json.hpp from common.cpp (#12697)
* common : remove json.hpp from common.cpp

* fix comment
b5025
2025-04-02 09:58:34 +02:00
Chenguang Li
9bacd6b374
[CANN] get_rows and dup optimization (#12671)
* [CANN]get_rows and dup optimization.

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]GET_ROWS and CPY/DUP optimization

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-04-02 15:22:13 +08:00
Xuan-Son Nguyen
267c1399f1
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Junil Kim
f423981ac8
opencl : fix memory allocation size (#12649)
issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283

This patch fixes the memory allocation size so that it does not exceed
the maximum allocation size of the OpenCL device.
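A minimal sketch of clamping a requested allocation to the device limit, using the standard OpenCL query (illustrative, not the backend's actual code):

```cpp
#include <CL/cl.h>
#include <algorithm>
#include <cstddef>

size_t clamp_alloc_size(cl_device_id dev, size_t requested) {
    cl_ulong max_alloc = 0;
    // Query the device's per-allocation limit.
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, nullptr);
    return std::min(requested, (size_t) max_alloc);
}
```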
b5022
2025-04-01 09:54:34 -07:00
jklincn
e39e727e9a
llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672)
b5021
2025-04-01 14:54:28 +02:00
Sigbjørn Skjæret
5936a616e4
convert : BailingMoE : fix qkv split when head_dim is 0 (#12687)
NOTE: Ling-lite-base is broken, see https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2
2025-04-01 14:37:13 +02:00
Georgi Gerganov
3fd072a540
metal : use F32 prec in FA kernels (#12688)
* metal : use F32 prec in FA kernels

ggml-ci

* cont : fix FA vec kernel

ggml-ci
b5019
2025-04-01 14:57:19 +03:00
R0CKSTAR
a6f32f0b34
Fix clang warning in gguf_check_reserved_keys (#12686)
* Fix clang warning in gguf_check_reserved_keys

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix typo

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5018
2025-04-01 13:12:53 +02:00
Wagner Bruna
2bb3597e42
vulkan: fix build when glslc doesn't support coopmat (#12683)
b5017
2025-04-01 11:38:07 +02:00
Romain Biessy
8293970542
SYCL: Rename oneMKL to oneMath (#12192)
* Rename oneMKL Interface to oneMath

* Use oneMath for Intel vendor

* Rename occurrences to mkl

* clang-format

* Silence verbose warnings

* Set oneMath HIP_TARGETS

* Fix silence warnings

* Remove step to build oneMath from build instructions

* Use fixed oneMath version

* Remove INTEL_CPU

* Fold CMake oneDNN conditions

* Use Intel oneMKL for Intel devices

* Improve CMake message

* Link against MKL::MKL_SYCL::BLAS only

* Move oneMath documentation to Nvidia and AMD sections
b5016
2025-04-01 16:24:29 +08:00
Akarshan Biswas
8bbf26083d
SYCL: switch to SYCL namespace (#12674)
b5015
2025-04-01 10:11:39 +02:00
Sigbjørn Skjæret
35782aeedb
convert : BailingMoE : avoid setting rope_dim to 0 (#12678)
2025-03-31 23:09:48 +02:00
Daniel Bevenius
c80a7759da
vocab : add special infill tokens for CodeLlama (#11850)
* vocab : add special infill tokens for CodeLlama

The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`

The motivation for this is that currently the infill example uses
CodeLlama as a suggested model. But when using this model the following
error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted                 (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
  -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
  --in-prefix "def helloworld():\n    print(\"hell" \
  --in-suffix "\n   print(\"goodbye world\")\n    "
```

* squash! vocab : add special infill tokens for CodeLlama

Add _<EOT> as well.
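For context, a hedged sketch of how these tokens compose an infill prompt, assuming CodeLlama's documented fill-in-the-middle layout; in llama.cpp the special tokens are added by token id via the vocab, not as literal strings:

```cpp
#include <string>

// Illustrative only: builds prefix/suffix into CodeLlama's FIM layout.
// The model generates the middle after <MID> and emits <EOT> when done.
std::string make_infill_prompt(const std::string & prefix,
                               const std::string & suffix) {
    return "▁<PRE>" + prefix + "▁<SUF>" + suffix + "▁<MID>";
}
```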
b5013
2025-03-31 18:40:56 +02:00
a3sh
250d7953e8
ggml : faster ssm scan (#10558)
* faster ssm_scan

* delete unused comment

* clang format

* add space

* modify unnecessary calculations

* faster ssm conv implementation

* modify file name with dash
b5012
2025-03-31 18:05:13 +02:00