* (wip) refactor downloading system [no ci]
* fix all examples
* fix mmproj with -hf
* gemma3: update readme
* only handle mmproj in llava example
* fix multi-shard download
* windows: fix problem with std::min and std::max
* fix 2
* Include speculative decoding stats when timings_per_token is true
New fields added to the `timings` object:
- draft_n : number of draft tokens generated
- draft_accepted_n : number of draft tokens accepted
- draft_accept_ratio: ratio of accepted/generated
* Remove redundant draft_accept_ratio var
* add draft acceptance rate to server console output
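To make the new timings fields concrete, here is a minimal sketch of requesting per-token timings and reading the draft statistics, assuming a local server at http://localhost:8080 and the /completion endpoint; only the field names come from the list above, everything else (host, prompt, endpoint) is illustrative:
```python
import json
import urllib.request

# Hypothetical request to a local llama-server; endpoint and host are assumptions.
body = json.dumps({
    "prompt": "Hello",
    "n_predict": 32,
    "timings_per_token": True,   # enables the extended timings object
}).encode()
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    timings = json.load(resp).get("timings", {})

# Draft statistics are only meaningful when a draft model is loaded
# for speculative decoding.
print("draft tokens generated:", timings.get("draft_n"))
print("draft tokens accepted :", timings.get("draft_accepted_n"))
```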
* server : Bump cpp-httplib to include AF_UNIX windows support
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
* server : Allow running the server example on a unix socket
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
---------
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
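Since the server example can now listen on a unix socket, here is a rough client-side sketch of tunnelling HTTP over AF_UNIX with the Python standard library; the socket path and the way the server is started are assumptions, only the AF_UNIX transport itself comes from the entries above:
```python
import http.client
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects over an AF_UNIX socket instead of TCP."""

    def __init__(self, path: str):
        super().__init__("localhost")  # host is only used for the Host header
        self.unix_path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.unix_path)
        self.sock = sock

# Hypothetical socket path; depends on how the server was launched.
conn = UnixHTTPConnection("/tmp/llama-server.sock")
conn.request("GET", "/health")
print(conn.getresponse().read().decode())
```
The cpp-httplib bump is what adds AF_UNIX support on the server side on Windows; the client sketch above targets unix-like systems where Python exposes socket.AF_UNIX.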
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it consistent with server_task_result_cmpl_final::to_json_oaicompat_chat and the other to_json methods.
* Fix DOS index bug
* Remove new APIs
* remove extra line
* Remove from API
* Add extra newline
* Update examples/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* sampler: turn lazy grammar trigger words to regexes
* add scripts/tool_bench.sh & .py
* constrain llama json output regardless of function name if it matches at the beginning
* update relaxed newline space rule in grammar tests
* support add_generation_prompt query parameter (useful for /apply_template)
* Update src/llama-grammar.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The first kv shift offsets the positions of all tokens after head_c.
If llama_kv_cache_seq_rm is then called with head_c, it removes the valid tokens because their positions have already been offset.
* server : add TEI API format for /rerank endpoint
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix
* also gitignore examples/server/*.gz.hpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
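For the TEI-compatible /rerank format mentioned above, a hedged request sketch follows; the "query"/"texts" field names follow the TEI convention, and the host, port and documents are illustrative:
```python
import json
import urllib.request

# TEI-style rerank request (field names assumed from the TEI API convention).
payload = {
    "query": "What is a panda?",
    "texts": [
        "The giant panda is a bear species endemic to China.",
        "Paris is the capital of France.",
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/rerank",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for item in json.load(resp):
        print(item)  # expected shape: {"index": ..., "score": ...}
```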
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B
* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + accommodate (with a hack) the broken official template
* tool-calls: accommodate the variety of malformed tool call opening tags that both the R1 Qwen 32B and 7B distills like to spit out
* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability
* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
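As an illustration of the ordering and nulling rules described above, here is a sketch of the assistant message shape the server is expected to produce when the model both reasons and calls a tool; the concrete values are made up:
```python
import json

# Illustrative message shape only; values are invented, field names follow the entries above.
assistant_message = {
    "role": "assistant",
    # the extracted thoughts come first, for readability
    "reasoning_content": "The user wants the weather, so I should call get_weather.",
    # content is null whenever tool_calls are present
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"},
    }],
}

print(json.dumps(assistant_message, indent=2))
```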
There was a typo-like error that would print the same number twice when a
request was received with n_predict greater than the server-side configuration.
Before the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```
After the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
* server : use common_token_to_piece instead of common_detokenize
This commit replaces the call to common_detokenize with
common_token_to_piece in populate_token_probs.
The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.
Resolves: https://github.com/ggerganov/llama.cpp/issues/11728
* squash! server : use common_token_to_piece instead of common_detokenize
Use common_token_to_piece for post_sampling_probs as well.
* redo Settings modal UI
* add python code interpreter
* fix auto scroll
* build
* fix overflow for long output lines
* bring back sticky copy button
* adapt layout on mobile view
* fix multiple lines output and color scheme
* handle python exception
* better state management
* add webworker
* add headers
* format code
* speed up by loading pyodide on page load
* (small tweak) add small animation to make it feel like Claude
* An empty tool_call_id is better than none!
* sync: minja (tool call name optional https://github.com/google/minja/pull/36)
* Force-disable parallel_tool_calls if template doesn't support it
* More debug logs
* Llama 3.x tools: accept / trigger on outputs with more varied spacing
* Fix empty content for functionary v3.2 tool call
* Add proper tool call docs to server README
* readme: function calling *is* supported now
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
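A hedged request-side sketch of exercising the tool-calling support summarized above via the OpenAI-compatible endpoint; the host, the get_weather tool and the model behaviour are illustrative, and parallel_tool_calls is the flag that gets force-disabled when the template cannot support it:
```python
import json
import urllib.request

# Hypothetical request; host and tool definition are illustrative.
payload = {
    "messages": [{"role": "user", "content": "Weather in Tokyo and in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # Force-disabled server-side when the chat template cannot express it.
    "parallel_tool_calls": True,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for call in json.load(resp)["choices"][0]["message"].get("tool_calls") or []:
        print(call["function"]["name"], call["function"]["arguments"])
```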
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.
Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This commit replaces the two usages of `std::bind` with lambdas for the
callback functions `callback_new_task` and `callback_update_slots`.
The motivation for this change is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Lambdas
are also arguably more readable, and they are recommended over `std::bind`
in modern C++.
Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
* add /apply-template endpoint to server
* remove unnecessary line
* add /apply-template documentation
* return only "prompt" field in /apply-template
* use suggested idea instead of my overly verbose way
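A small sketch of exercising the new endpoint, assuming the usual local host/port; the /apply-template path, the add_generation_prompt option (described earlier as a query parameter; its placement in the body here is an assumption) and the single "prompt" response field come from the entries above:
```python
import json
import urllib.request

# Hypothetical host/port; endpoint and response field come from the entries above.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "add_generation_prompt": True,  # append the assistant turn prefix
}
req = urllib.request.Request(
    "http://localhost:8080/apply-template",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["prompt"])  # the endpoint returns only this field
```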
* server : update auto gen files comments
This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.
The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269bca75b2d08119c653512cd20b4ea2ba ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").
* squash! server : update auto gen files comments [no ci]
Move comments about file generation to README.md.
* squash! server : update auto gen files comments [no ci]
Remove the comments in server.cpp that mention that information
can be found in the README.md file.