4651 Commits

Author SHA1 Message Date
Adrien Gallouët
c0d4843225
build : fix llama.pc (#11658)
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
b4651
2025-02-06 13:08:13 +02:00
junchao-zhao
8d4d2be143
ggml : fix LoongArch compile error with 128-bit SIMD (#11701) 2025-02-06 11:20:00 +02:00
Jeff Bolz
2c6c8df56d
vulkan: optimize coopmat2 iq2/iq3 callbacks (#11521)
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
b4649
2025-02-06 07:15:30 +01:00
Rémy O
8a7e3bf17a
vulkan: initial support for IQ4_XS quantization (#11501) b4648 2025-02-06 07:09:59 +01:00
Jeff Bolz
1b598b3058
vulkan: use smaller combined allocations to avoid fragmentation (#11551) b4647 2025-02-06 07:02:18 +01:00
Charles Duffy
902368a06b
metal : avoid breaking build when metal API predates TARGET_OS_VISION (#11690)
Avoids breakage in nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9
b4646
2025-02-06 09:52:31 +08:00
Matvey Soloviev
c3db0480bb
readme : add link to Autopen under UIs (#11684)
Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits and allow you to switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
2025-02-06 01:55:25 +01:00
Georgi Gerganov
d774ab3acc
metal : adjust support conditions for norm operators (#11671)
cont #11659

ggml-ci
b4644
2025-02-05 10:57:42 +02:00
Johannes Gäßler
fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 (#11656) b4643 2025-02-05 08:58:31 +01:00
SAMI
1ec208083c
llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644)
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file

* Fixed the gcc warning regarding minor linting

* Removed trailing whitespace
b4642
2025-02-05 10:45:40 +03:00
Olivier Chafik
9f4cc8f8d3
sync: minja (#11641)
* `sync`: minja

182de30cda

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
b4641
2025-02-05 01:00:12 +00:00
Johannes Gäßler
fd08255d0d
CUDA: non-contiguous (RMS) norm support (#11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4640
2025-02-04 22:21:42 +01:00
fxzjshm
3ec9fd4b77
HIP: force max threads per block to be 1024 (#11621)
Some old/vendor forked version of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.

Signed-off-by: fxzjshm <fxzjshm@163.com>
b4639
2025-02-04 19:18:38 +01:00
Xuan-Son Nguyen
3962fc1a79
server : add try..catch to places not covered by set_exception_handler (#11620)
* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Radoslav Gerganov
1bef571f6a
arg : list RPC devices first when using --list-devices (#11655)
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
b4637
2025-02-04 18:16:20 +02:00
Olivier Chafik
db288b60cb
tool-call: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
b4636
2025-02-04 15:48:53 +00:00
Shelby Jenkins
106045e7bb
readme : add llm_client Rust crate to readme bindings (#11628)
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite awhile, so I figured now is fair to add it.

It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.

It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face.

So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
2025-02-04 13:20:55 +02:00
Jhen-Jie Hong
f117d84b48
swift : fix llama-vocab api usage (#11645)
* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage
b4634
2025-02-04 13:15:24 +02:00
Jhen-Jie Hong
534c46b53c
metal : use residency set for other platforms (#11648) b4633 2025-02-04 13:07:18 +02:00
Georgi Gerganov
387a1598ca
authors : update 2025-02-04 13:04:10 +02:00
Georgi Gerganov
7c9e0ca520
sync : ggml b4631 2025-02-04 12:59:21 +02:00
Christian Kastner
8f8290ada9
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.

This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem much value factor it out, or
even require it.
2025-02-04 12:59:15 +02:00
Georgi Gerganov
b34aedd558
ci : do not stale-close roadmap issues 2025-02-04 09:31:01 +02:00
Olivier Chafik
cde3833239
tool-call: allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
b4628
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen
b3451785ac
server : (webui) revert hacky solution from #11626 (#11634) 2025-02-04 00:10:52 +01:00
Woof Dog
1d1e6a90bc
server : (webui) allow typing and submitting during llm response (#11626) 2025-02-03 23:16:27 +01:00
Daniel Bevenius
5598f475be
server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was throws and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
Georgi Gerganov
8ec05832fa
sync : ggml 2025-02-03 14:57:08 +02:00
Johannes Gäßler
21c84b5d2d
CUDA: fix Volta FlashAttention logic (#11615) b4623 2025-02-03 14:25:56 +02:00
mashdragon
d92cb67e37
server : (webui) Fix Shift+Enter handling (#11609)
* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Johannes Gäßler
6eecde3cc8
HIP: fix flash_attn_stream_k_fixup warning (#11604) b4621 2025-02-02 23:48:29 +01:00
uvos
396856b400
CUDA/HIP: add support for selectable warp size to mmv (#11519)
CUDA/HIP: add support for selectable warp size to mmv
b4620
2025-02-02 22:40:09 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other (#11601)
This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly
b4619
2025-02-02 22:08:05 +01:00
Olivier Chafik
90f9b88afb
nit: more informative crash when grammar sampler fails (#11593) b4618 2025-02-02 19:58:34 +00:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention (#11583)
* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4617
2025-02-02 19:31:09 +01:00
Eric Curtin
84ec8a58f7
Name colors (#11573)
It's more descriptive, use #define's so we can use compile-time
concatenations.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4616
2025-02-02 15:14:48 +00:00
Olivier Chafik
bfcce4d693
tool-call: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
b4615
2025-02-02 09:25:38 +00:00
Olivier Chafik
69804487e0
Fix exotic ci env that lacks ostringstream::str (#11581) b4614 2025-02-02 09:10:15 +00:00
Michał Moskal
ff227703d6
sampling : support for llguidance grammars (#10224)
* initial porting of previous LLG patch

* update for new APIs

* build: integrate llguidance as an external project

* use '%llguidance' as marker to enable llg lark syntax

* add some docs

* clarify docs

* code style fixes

* remove llguidance.h from .gitignore

* fix tests when llg is enabled

* pass vocab not model to llama_sampler_init_llg()

* copy test-grammar-integration.cpp to test-llguidance.cpp

* clang fmt

* fix ref-count bug

* build and run test

* gbnf -> lark syntax

* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag

* rename llguidance test file to test-grammar-llguidance.cpp

* add gh action for llg test

* align tests with LLG grammar syntax and JSON Schema spec

* llama_tokenizer() in fact requires valid utf8

* update llg

* format file

* add $LLGUIDANCE_LOG_LEVEL support

* fix whitespace

* fix warning

* include <cmath> for INFINITY

* add final newline

* fail llama_sampler_init_llg() at runtime

* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes

* simplify #includes

* improve doc string for LLAMA_LLGUIDANCE

* typo in merge

* bump llguidance to 0.6.12
b4613
2025-02-02 09:55:32 +02:00
piDack
0cec062a63
llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix confict

* fix confict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
Olivier Chafik
53debe6f3c
ci: use sccache on windows HIP jobs (#11553) b4611 2025-02-01 18:22:38 +00:00
Olivier Chafik
cfd74c86db
sync: minja (418a2364b5) (#11574) b4610 2025-02-01 12:24:51 +00:00
Eric Curtin
ecef206ccb
Implement s3:// protocol (#11511)
For those that want to pull from s3

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4609
2025-02-01 10:30:54 +00:00
Olivier Chafik
5bbc7362cb
ci: simplify cmake build commands (#11548) b4608 2025-02-01 00:01:20 +00:00
Olivier Chafik
aa6fb13213
ci: use sccache on windows instead of ccache (#11545)
* Use sccache on ci for windows

* Detect sccache in cmake
b4607
2025-01-31 17:12:40 +00:00
Olivier Chafik
a83f528688
tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4606
2025-01-31 14:15:25 +00:00
Olivier Chafik
b1bcd309fc
fix stop regression (#11543) b4605 2025-01-31 13:48:31 +00:00
Olivier Chafik
5783575c9d
Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533) b4604 2025-01-31 08:24:29 +00:00
Olivier Chafik
4a2b196d03
server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531) b4603 2025-01-31 10:12:40 +02:00
Steve Grubb
1bd3047a93
common: Add missing va_end (#11529)
The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementaions, not calling va_end
has no consequences. For others it could leak memory.
2025-01-31 07:58:55 +02:00