1356 Commits

Author SHA1 Message Date
Eric Curtin
f777a73e18
Some llama-run cleanups (#11973)
Use consolidated open function call from File class. Change
read_all to to_string(). Remove exclusive locking, the intent for
that lock is to avoid multiple processes writing to the same file,
it's not an issue for readers, although we may want to consider
adding a shared lock. Remove passing nullptr as reference,
references are never supposed to be null. clang-format the code
for consistent styling.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-23 13:14:32 +00:00
Ting Lou
36c258ee92
llava: build clip image from pixels (#11999)
* llava: export function `clip_build_img_from_pixels` to build image from pixels decoded by other libraries instead of stb_image.h for better performance

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-02-22 15:28:28 +01:00
Georgi Gerganov
cf756d6e0a
server : disable Nagle's algorithm (#12020) 2025-02-22 11:46:31 +01:00
Daniel Bevenius
de8b5a3624
llama.swiftui : add "Done" dismiss button to help view (#11998)
The commit updates the help view in the llama.swiftui example to use a
NavigationView and a Done button to dismiss the help view.

The motivation for this is that without this change there is now way to
dimiss the help view.
2025-02-22 06:33:29 +01:00
Alex Brooks
ee02ad02c5
clip : fix visual encoders with no CLS (#11982)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-21 08:11:03 +02:00
momonga
c392e5094d
server (webui): Fix Premature Submission During IME Conversion (#11971)
* fix skip ime composing

* fix npm rebuild

* fix warn

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-20 19:43:22 +01:00
Michael Engel
0d559580a0
run : add --chat-template-file (#11961)
Relates to: https://github.com/ggml-org/llama.cpp/issues/11178

Added --chat-template-file CLI option to llama-run. If specified, the file
will be read and the content passed for overwriting the chat template of
the model to common_chat_templates_from_model.

Signed-off-by: Michael Engel <mengel@redhat.com>
2025-02-20 10:35:11 +02:00
Georgi Gerganov
abd4d0bc4f
speculative : update default params (#11954)
* speculative : update default params

* speculative : do not discard the last drafted token
2025-02-19 13:29:42 +02:00
igardev
b58934c183
server : (webui) Enable communication with parent html (if webui is in iframe) (#11940)
* Webui: Enable communication with parent html (if webui is in iframe):
- Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg, "context" is used as hidden context on the following requests to the llama.cpp server
- On pressing na Escape button sends command "escapePressed" to the parent

Example handling from the parent html side:
- Send command "setText" from parent html to webui in iframe:
const iframe = document.getElementById('askAiIframe');
if (iframe) {
	iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
}

- Listen for Escape key from webui on parent html:
// Listen for escape key event in the iframe
window.addEventListener('keydown', (event) => {
	if (event.key === 'Escape') {
		// Process case when Escape is pressed inside webui
	}
});

* Move the extraContext from storage to app.context.

* Fix formatting.

* add Message.extra

* format + build

* MessageExtraContext

* build

* fix display

* rm console.log

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-18 23:01:44 +01:00
Olivier Chafik
63e489c025
tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900)
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type

* addressed clang-tidy lints in [test-]chat.*

* rm minja deps from util & common & move it to common/minja/

* add name & tool_call_id to common_chat_msg

* add common_chat_tool

* added json <-> tools, msgs conversions to chat.h

* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)

* fix deepseek r1 slow test (no longer <think> opening w/ new template)

* allow empty tools w/ auto + grammar

* fix & test server grammar & json_schema params w/ & w/o --jinja
2025-02-18 18:03:23 +00:00
Xuan-Son Nguyen
63ac128563
server : add TEI API format for /rerank endpoint (#11942)
* server : add TEI API format for /rerank endpoint

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix

* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-18 14:21:41 +01:00
xiaobing318
09aaf4f1f5
docs : Fix duplicated file extension in test command (#11935)
This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension. The original command was:
./tests.sh unit/test_chat_completion.py.py -v -x
It has been corrected to:
./tests.sh unit/test_chat_completion.py -v -x
This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.
2025-02-18 10:12:49 +01:00
Antoine Viallon
c4d29baf32
server : fix divide-by-zero in metrics reporting (#11915) 2025-02-17 11:25:12 +01:00
Xuan-Son Nguyen
0f2bbe6564
server : bump httplib to 0.19.0 (#11908) 2025-02-16 17:11:22 +00:00
708-145
fc10c38ded
examples: fix typo in imatrix/README.md (#11884)
* simple typo fixed

* Update examples/imatrix/README.md

---------

Co-authored-by: Tobias Bergmann <tobias.bergmann@gmx.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-15 21:03:30 +02:00
Georgi Gerganov
68ff663a04
repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
2025-02-15 16:40:57 +02:00
theraininsky
a7b8ce2260
llama-bench : fix unexpected global variable initialize sequence issue (#11832)
* llama-bench : fix unexpected global variable initialize sequence issue

* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-14 02:13:43 +01:00
Reza Rahemtola
c1f958c038
server : (docs) Update wrong tool calling example (#11809)
Call updated to match the tool used in the output just below, following the example in https://github.com/ggerganov/llama.cpp/pull/9639
2025-02-13 17:22:44 +01:00
Olivier Chafik
c7f460ab88
server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless --reasoning-format none (#11607)
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 10:05:16 +00:00
Vinesh Janarthanan
27e8a23300
sampling: add Top-nσ sampler (#11223)
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
Oleksandr Kuvshynov
e4376270d9
llama.cpp: fix warning message (#11839)
There was a typo-like error, which would print the same number twice if
request is received with n_predict > server-side config.

Before the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
2025-02-13 08:25:34 +02:00
Woof Dog
31afcbee0e
server : (webui) Give copy button back to all message bubbles (#11814)
* All messages get the copy button

* Update index.html.gz
2025-02-12 23:47:11 +01:00
JC
bfd11a2344
Fix: Compile failure due to Microsoft STL breaking change (#11836) 2025-02-12 21:36:11 +01:00
Daniel Bevenius
a18f481f99
server : use common_token_to_piece instead of common_detokenize (#11740)
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.

Resolves: https://github.com/ggerganov/llama.cpp/issues/11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
Xuan-Son Nguyen
507f9174fe
server : (webui) introduce conversation branching + idb storage (#11792)
* server : (webui) introduce conversation branching + idb storage

* mark old conv as "migrated" instead deleting them

* improve migration

* add more comments

* more clarification
2025-02-10 21:23:17 +01:00
Xuan-Son Nguyen
0893e0114e
server : correct signal handler (#11795) 2025-02-10 18:03:28 +01:00
pascal-lc
9ac3457b39
Update README.md [no ci] (#11781)
typo: `\` -> `/`
Change the UNIX path separator to` \`.
2025-02-10 09:05:57 +01:00
Eric Curtin
19d3c8293b
There's a better way of clearing lines (#11756)
Use the ANSI escape code for clearing a line.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
Xuan-Son Nguyen
55ac8c7791
server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759)
* redo Settings modal UI

* add python code interpreter

* fix auto scroll

* build

* fix overflow for long output lines

* bring back sticky copy button

* adapt layout on mobile view

* fix multiple lines output and color scheme

* handle python exception

* better state management

* add webworker

* add headers

* format code

* speed up by loading pyodide on page load

* (small tweak) add small animation to make it feels like claude
2025-02-08 21:54:50 +01:00
Woof Dog
e6e6583199
server : (webui) increase edit textarea size (#11763) 2025-02-08 20:09:55 +01:00
Georgi Gerganov
aaa5505307
server : minor log updates (#11760)
ggml-ci
2025-02-08 18:08:43 +02:00
Nikolaos Pothitos
3ab410f55f
readme : update front-end framework (#11753)
After the migration to React with #11688
2025-02-08 10:43:04 +01:00
Xuan-Son Nguyen
0cf867160c
server : (webui) fix numeric settings being saved as string (#11739)
* server : (webui) fix numeric settings being saved as string

* add some more comments
2025-02-08 10:42:34 +01:00
Eric Curtin
d2fe216fb2
Make logging more verbose (#11714)
Debugged an issue with a user who was on a read-only filesystem.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-07 14:42:46 +00:00
Xuan-Son Nguyen
2fb3c32a16
server : (webui) migrate project to ReactJS with typescript (#11688)
* init version

* fix auto scroll

* bring back copy btn

* bring back thought process

* add lint and format check on CI

* remove lang from html tag

* allow multiple generations at the same time

* lint and format combined

* fix unused var

* improve MarkdownDisplay

* fix more latex

* fix code block cannot be selected while generating
2025-02-06 17:32:29 +01:00
SAMI
1ec208083c
llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644)
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file

* Fixed the gcc warning regarding minor linting

* Removed trailing whitespace
2025-02-05 10:45:40 +03:00
Olivier Chafik
9f4cc8f8d3
sync: minja (#11641)
* `sync`: minja

182de30cda

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Xuan-Son Nguyen
3962fc1a79
server : add try..catch to places not covered by set_exception_handler (#11620)
* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Olivier Chafik
db288b60cb
tool-call: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Jhen-Jie Hong
f117d84b48
swift : fix llama-vocab api usage (#11645)
* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage
2025-02-04 13:15:24 +02:00
Olivier Chafik
cde3833239
tool-call: allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen
b3451785ac
server : (webui) revert hacky solution from #11626 (#11634) 2025-02-04 00:10:52 +01:00
Woof Dog
1d1e6a90bc
server : (webui) allow typing and submitting during llm response (#11626) 2025-02-03 23:16:27 +01:00
Daniel Bevenius
5598f475be
server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was throws and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
mashdragon
d92cb67e37
server : (webui) Fix Shift+Enter handling (#11609)
* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Eric Curtin
84ec8a58f7
Name colors (#11573)
It's more descriptive, use #define's so we can use compile-time
concatenations.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-02 15:14:48 +00:00
Olivier Chafik
bfcce4d693
tool-call: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
piDack
0cec062a63
llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix confict

* fix confict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
Eric Curtin
ecef206ccb
Implement s3:// protocol (#11511)
For those that want to pull from s3

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-01 10:30:54 +00:00
Olivier Chafik
a83f528688
tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-31 14:15:25 +00:00