1341 Commits

Author SHA1 Message Date
Georgi Gerganov
68ff663a04
repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
2025-02-15 16:40:57 +02:00
theraininsky
a7b8ce2260
llama-bench : fix unexpected global variable initialize sequence issue (#11832)
* llama-bench : fix unexpected global variable initialize sequence issue

* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-14 02:13:43 +01:00
Reza Rahemtola
c1f958c038
server : (docs) Update wrong tool calling example (#11809)
Call updated to match the tool used in the output just below, following the example in https://github.com/ggerganov/llama.cpp/pull/9639
2025-02-13 17:22:44 +01:00
Olivier Chafik
c7f460ab88
server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless --reasoning-format none (#11607)
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 10:05:16 +00:00
Vinesh Janarthanan
27e8a23300
sampling: add Top-nσ sampler (#11223)
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
Oleksandr Kuvshynov
e4376270d9
llama.cpp: fix warning message (#11839)
There was a typo-like error, which would print the same number twice if
request is received with n_predict > server-side config.

Before the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
2025-02-13 08:25:34 +02:00
Woof Dog
31afcbee0e
server : (webui) Give copy button back to all message bubbles (#11814)
* All messages get the copy button

* Update index.html.gz
2025-02-12 23:47:11 +01:00
JC
bfd11a2344
Fix: Compile failure due to Microsoft STL breaking change (#11836) 2025-02-12 21:36:11 +01:00
Daniel Bevenius
a18f481f99
server : use common_token_to_piece instead of common_detokenize (#11740)
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.

Resolves: https://github.com/ggerganov/llama.cpp/issues/11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
Xuan-Son Nguyen
507f9174fe
server : (webui) introduce conversation branching + idb storage (#11792)
* server : (webui) introduce conversation branching + idb storage

* mark old conv as "migrated" instead deleting them

* improve migration

* add more comments

* more clarification
2025-02-10 21:23:17 +01:00
Xuan-Son Nguyen
0893e0114e
server : correct signal handler (#11795) 2025-02-10 18:03:28 +01:00
pascal-lc
9ac3457b39
Update README.md [no ci] (#11781)
typo: `\` -> `/`
Change the UNIX path separator to` \`.
2025-02-10 09:05:57 +01:00
Eric Curtin
19d3c8293b
There's a better way of clearing lines (#11756)
Use the ANSI escape code for clearing a line.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
Xuan-Son Nguyen
55ac8c7791
server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759)
* redo Settings modal UI

* add python code interpreter

* fix auto scroll

* build

* fix overflow for long output lines

* bring back sticky copy button

* adapt layout on mobile view

* fix multiple lines output and color scheme

* handle python exception

* better state management

* add webworker

* add headers

* format code

* speed up by loading pyodide on page load

* (small tweak) add small animation to make it feels like claude
2025-02-08 21:54:50 +01:00
Woof Dog
e6e6583199
server : (webui) increase edit textarea size (#11763) 2025-02-08 20:09:55 +01:00
Georgi Gerganov
aaa5505307
server : minor log updates (#11760)
ggml-ci
2025-02-08 18:08:43 +02:00
Nikolaos Pothitos
3ab410f55f
readme : update front-end framework (#11753)
After the migration to React with #11688
2025-02-08 10:43:04 +01:00
Xuan-Son Nguyen
0cf867160c
server : (webui) fix numeric settings being saved as string (#11739)
* server : (webui) fix numeric settings being saved as string

* add some more comments
2025-02-08 10:42:34 +01:00
Eric Curtin
d2fe216fb2
Make logging more verbose (#11714)
Debugged an issue with a user who was on a read-only filesystem.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-07 14:42:46 +00:00
Xuan-Son Nguyen
2fb3c32a16
server : (webui) migrate project to ReactJS with typescript (#11688)
* init version

* fix auto scroll

* bring back copy btn

* bring back thought process

* add lint and format check on CI

* remove lang from html tag

* allow multiple generations at the same time

* lint and format combined

* fix unused var

* improve MarkdownDisplay

* fix more latex

* fix code block cannot be selected while generating
2025-02-06 17:32:29 +01:00
SAMI
1ec208083c
llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644)
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file

* Fixed the gcc warning regarding minor linting

* Removed trailing whitespace
2025-02-05 10:45:40 +03:00
Olivier Chafik
9f4cc8f8d3
sync: minja (#11641)
* `sync`: minja

182de30cda

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Xuan-Son Nguyen
3962fc1a79
server : add try..catch to places not covered by set_exception_handler (#11620)
* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Olivier Chafik
db288b60cb
tool-call: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Jhen-Jie Hong
f117d84b48
swift : fix llama-vocab api usage (#11645)
* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage
2025-02-04 13:15:24 +02:00
Olivier Chafik
cde3833239
tool-call: allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen
b3451785ac
server : (webui) revert hacky solution from #11626 (#11634) 2025-02-04 00:10:52 +01:00
Woof Dog
1d1e6a90bc
server : (webui) allow typing and submitting during llm response (#11626) 2025-02-03 23:16:27 +01:00
Daniel Bevenius
5598f475be
server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was throws and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
mashdragon
d92cb67e37
server : (webui) Fix Shift+Enter handling (#11609)
* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Eric Curtin
84ec8a58f7
Name colors (#11573)
It's more descriptive, use #define's so we can use compile-time
concatenations.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-02 15:14:48 +00:00
Olivier Chafik
bfcce4d693
tool-call: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
piDack
0cec062a63
llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix confict

* fix confict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
Eric Curtin
ecef206ccb
Implement s3:// protocol (#11511)
For those that want to pull from s3

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-01 10:30:54 +00:00
Olivier Chafik
a83f528688
tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-31 14:15:25 +00:00
Olivier Chafik
b1bcd309fc
fix stop regression (#11543) 2025-01-31 13:48:31 +00:00
Olivier Chafik
5783575c9d
Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533) 2025-01-31 08:24:29 +00:00
Olivier Chafik
4a2b196d03
server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531) 2025-01-31 10:12:40 +02:00
Daniel Bevenius
a2df2787b3
server : update help metrics processing/deferred (#11512)
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
\# HELP llamacpp:requests_processing Number of request processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of request deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
\# HELP llamacpp:requests_processing Number of requests processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of requests deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
2025-01-31 06:04:53 +01:00
Olivier Chafik
8b576b6c55
Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)
---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-30 19:13:58 +00:00
Daniel Bevenius
4314e56c4f
server : use lambda instead of std::bind (#11507)
This commit replaces the two usages of `std::bind` in favor of lambdas for
the callback functions for `callback_new_task` and
`callback_update_slots`.

The motivation for this changes is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Also
lambdas are more readable (perhaps this is subjective) but also they are
recommended over `std::bind` in modern C++.

Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
2025-01-30 11:05:00 +01:00
Isaac McFadyen
496e5bf46b
server : (docs) added response format for /apply-template [no ci] (#11503) 2025-01-30 10:11:53 +01:00
Daniel Bevenius
e0449763a4
server : update json snippets in README.md [no ci] (#11492)
This commit updates some of JSON snippets in README.md file and
removes the `json` language tag from the code blocks.

The motivation for this changes is that if there is invalid json in a
code snippet these are highlighted in red which can make it somewhat
difficult to read and can be a little distracting.
2025-01-30 05:48:14 +01:00
Nigel Bosch
eb7cf15a80
server : add /apply-template endpoint for additional use cases of Minja functionality (#11489)
* add /apply-template endpoint to server

* remove unnecessary line

* add /apply-template documentation

* return only "prompt" field in /apply-template

* use suggested idea instead of my overly verbose way
2025-01-29 19:45:44 +01:00
Daniel Bevenius
e51c47b401
server : update auto gen files comments [no ci] (#11484)
* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.

The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269bca75b2d08119c653512cd20b4ea2ba ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information
can be found in the README.md file.
2025-01-29 16:34:18 +01:00
Eric Curtin
f0d4b29edf
Parse https://ollama.com/library/ syntax (#11480)
People search for ollama models using the web ui, this change
allows one to copy the url from the browser and for it to be
compatible with llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-29 11:23:10 +00:00
peidaqi
cf8cc856d7
server : Fixed wrong function name in llamacpp server unit test (#11473)
The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True
2025-01-29 00:03:42 +01:00
Eric Curtin
7fee2889e6
Add github protocol pulling and http:// (#11465)
As pulling protocols to llama-run

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-28 14:45:41 +00:00
Michael Engel
2b8525d5c8
Handle missing model in CLI parameters for llama-run (#11399)
The HTTP client in llama-run only prints an error in case the download of
a resource failed. If the model name in the CLI parameter list is missing,
this causes the application to crash.
In order to prevent this, a check for the required model parameter has been
added and errors for resource downloads get propagated to the caller.

Signed-off-by: Michael Engel <mengel@redhat.com>
2025-01-28 08:32:40 +00:00
Eric Curtin
a4417ddda9
Add new hf protocol for ollama (#11449)
https://huggingface.co/docs/hub/en/ollama

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-27 19:36:10 +01:00