This commit adds a new option `--completion-bash` to llama.cpp, which
outputs a source-able bash completion script.
The motivation for this change is to provide a more user-friendly
experience for users who use the command-line interface of llama.cpp.
This is currently only basic: all options are offered for every llama
executable, but this can be refined in the future if needed.
Example usage:
```console
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
$ ./build/bin/llama-server --m<TAB>
--main-gpu --mirostat --mirostat-lr --model --multiline-input
--min-p --mirostat-ent --mlock --model-url
```
* musa: Update MUSA SDK version to rc3.1.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: Remove workaround in PR #10042
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B
* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template
* tool-calls: accommodate the variety of wrong tool call opening tags that both the R1 Qwen 32B and 7B distills like to spit out
* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability
* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
There was a typo-like error that printed the same number twice when a
request was received with n_predict greater than the server-side limit.
Before the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```
After the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
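A minimal sketch of this class of bug, using hypothetical names (`clamp_result`, `clamp_n_predict`) rather than the actual server code. The point is only that the message has to be built from the requested value before that value is overwritten by the limit:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the actual server code: clamps a requested
// n_predict to the server-side limit and reports what happened.
struct clamp_result {
    int         n_predict; // value that will actually be used
    std::string message;   // empty when no clamping was needed
};

clamp_result clamp_n_predict(int requested, int server_limit) {
    assert(server_limit > 0);
    clamp_result res{requested, ""};
    if (requested > server_limit) {
        // Build the message from the *requested* value before overwriting it;
        // printing the limit in both places is the bug described above.
        res.message = "n_predict = " + std::to_string(requested) +
                      " exceeds server configuration, setting to " +
                      std::to_string(server_limit);
        res.n_predict = server_limit;
    }
    return res;
}
```

Calling `clamp_n_predict(8192, 4096)` in this sketch yields a message that names both 8192 and 4096, matching the "after the fix" log line.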
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row
* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
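The chunking idea can be illustrated with a small, generic sketch. It assumes a shared atomic counter from which each thread claims the next chunk. The function name `process_in_chunks` and its parameters are hypothetical, and this is not the ggml-cpu implementation (which, per the notes above, keeps its counter in wdata):

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative only: process `n_items` work items in chunks of `chunk_size`.
// Each thread claims the next unprocessed chunk from a shared atomic counter,
// so faster threads naturally pick up more chunks.
std::vector<int> process_in_chunks(int n_items, int chunk_size, int n_threads) {
    assert(n_items >= 0 && chunk_size > 0 && n_threads > 0);
    std::vector<int> hits(n_items, 0);
    std::atomic<int> next_chunk{0}; // must be initialized before any thread runs
    const int n_chunks = (n_items + chunk_size - 1) / chunk_size;

    auto worker = [&]() {
        for (;;) {
            const int chunk = next_chunk.fetch_add(1, std::memory_order_relaxed);
            if (chunk >= n_chunks) break;
            const int begin = chunk * chunk_size;
            const int end   = std::min(begin + chunk_size, n_items);
            for (int i = begin; i < end; ++i) hits[i] += 1; // ranges are disjoint
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto & th : threads) th.join();
    return hits;
}
```

Every element of the returned vector should be 1, whatever the thread count, because each chunk is claimed exactly once.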
* Bug fix for clamp_f32
For tensors with more than one dimension the clamp operation does not work, because the function returns early whenever ith is not 0.
* Bug fix for clamp_f32
* Bug fix for clamp_f32
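A sketch of the row-splitting pattern involved, assuming the fix amounts to letting every thread process its share of rows instead of returning early. `clamp_rows` and its parameters are hypothetical and do not mirror ggml's actual kernel signature:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-wise clamp, not ggml's actual kernel. `ith` is this
// thread's index and `nth` the thread count; each thread handles rows
// ith, ith + nth, ith + 2*nth, ... so no thread has to bail out early.
void clamp_rows(std::vector<float> & data, int n_rows, int n_cols,
                float lo, float hi, int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    assert(data.size() == static_cast<std::size_t>(n_rows) * n_cols);
    for (int r = ith; r < n_rows; r += nth) {
        float * row = data.data() + static_cast<std::size_t>(r) * n_cols;
        for (int c = 0; c < n_cols; ++c) {
            row[c] = std::min(std::max(row[c], lo), hi);
        }
    }
}
```

If only the call with `ith == 0` did any work, rows `1, 2, ..., nth - 1` of every stride would be left unclamped, which is the symptom described above.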
* server : use common_token_to_piece instead of common_detokenize
This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.
The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.
Resolves: https://github.com/ggerganov/llama.cpp/issues/11728
* squash! server : use common_token_to_piece instead of common_detokenize
Use common_token_to_piece for post_sampling_probs as well.
* server : (webui) introduce conversation branching + idb storage
* mark old conv as "migrated" instead of deleting them
* improve migration
* add more comments
* more clarification
Technically the fixed-width types come only from the iostream and
cstdint/stdint.h headers. The memory and vector headers should not provide
them. In GCC 15 the headers were cleaned up, so the proper header,
cstdint, must be included explicitly.
src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
26 | uint32_t read_u32() const;
| ^~~~~~~~
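As an illustration, a translation unit that uses fixed-width types should include `<cstdint>` itself rather than rely on another standard header pulling it in. The helper below (`read_u32_le`) is hypothetical and unrelated to the actual `read_u32` in llama-mmap.h:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint> // fixed-width integer types; do not rely on <vector> or <memory>
#include <vector>

// Hypothetical helper: reads a little-endian 32-bit value from a byte buffer.
std::uint32_t read_u32_le(const std::vector<std::uint8_t> & buf, std::size_t offset) {
    assert(offset + 4 <= buf.size());
    return  static_cast<std::uint32_t>(buf[offset])
         | (static_cast<std::uint32_t>(buf[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(buf[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(buf[offset + 3]) << 24);
}
```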
* redo Settings modal UI
* add python code interpreter
* fix auto scroll
* build
* fix overflow for long output lines
* bring back sticky copy button
* adapt layout on mobile view
* fix multiple lines output and color scheme
* handle python exception
* better state management
* add webworker
* add headers
* format code
* speed up by loading pyodide on page load
* (small tweak) add small animation to make it feel like Claude
After the barrier in the last iteration has executed, the loop termination
condition is still evaluated. However, the main thread may already have destroyed the cgraph object
and its nodes by then, so another thread can access memory that is already gone.
Trouble can also occur when n_nodes == 0 or abort is called, but I'm not sure whether the
former situation is possible.
The last synchronization should be done after the loop to ensure that the cgraph/cplan is not
accessed after the main thread exits from the function.
Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.
This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)
Returning a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.
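A simplified sketch of lossy decoding with U+FFFD, to show the general technique. `decode_utf8_lossy` is a hypothetical name, this is not llama.cpp's tokenizer code, and the sketch deliberately omits checks for overlong encodings and surrogates:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative decoder, not llama.cpp's implementation: converts UTF-8 bytes
// to code points without throwing. A byte that cannot start or complete a
// sequence produces U+FFFD, and decoding resumes at the following byte.
// Simplification: overlong encodings and surrogates are not rejected.
std::vector<std::uint32_t> decode_utf8_lossy(const std::string & s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t   len = 0; // stays 0 if b0 is not a valid lead byte
        std::uint32_t cp  = 0;
        if      (b0 < 0x80)           { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        bool ok = len != 0 && i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(s[i + k]);
            if ((bk & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (bk & 0x3F);
        }
        if (ok) { out.push_back(cp);     i += len; }
        else    { out.push_back(0xFFFD); i += 1;   }  // skip one byte and resync
    }
    return out;
}
```

Because the function reports problems through its output rather than by throwing, no exception can cross a C API boundary from it.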
* Update llama.cpp
To display progress dots in the terminal.
Without this, no progress dots were displayed while loading a model from a file.
* Update llama.cpp
removed trailing spaces
The C API in llama.h claims users can implement `llama_sampler_i` to
create custom `llama_sampler`. The sampler chain takes ownership and
calls `llama_sampler_free` on them. However, `llama_sampler_free` is
hard-coded to use `delete`. This is undefined behavior if the object
wasn't also allocated via `new` from libllama's C++ runtime. Callers
in C and C-compatible languages do not use C++'s `new` operator. C++
callers may not be sharing the same heap as libllama.
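One common way to avoid this kind of mismatch is to have the library allocate and free the wrapper object itself, while user-owned state is released through a user-supplied callback. The sketch below uses hypothetical types (`my_sampler`, `my_sampler_iface`, `my_sampler_init`, `my_sampler_free`). They are not the llama.h definitions, and this is not necessarily how the issue is resolved in libllama:

```cpp
#include <cassert>

// Hypothetical types for illustration; these are NOT the llama.h definitions.
struct my_sampler_iface {
    // Releases user-owned state; supplied by whoever allocated that state.
    void (*free_ctx)(void * ctx);
};

struct my_sampler {
    const my_sampler_iface * iface;
    void *                   ctx;
};

// The library allocates the wrapper, so the matching `delete` below always
// runs in the same C++ runtime (and on the same heap) as this `new`.
my_sampler * my_sampler_init(const my_sampler_iface * iface, void * ctx) {
    assert(iface != nullptr);
    return new my_sampler{iface, ctx};
}

void my_sampler_free(my_sampler * s) {
    if (s == nullptr) return;
    // User state is released by the user's own callback, not by the library.
    if (s->iface->free_ctx != nullptr) s->iface->free_ctx(s->ctx);
    delete s;
}
```

With this shape, a C caller never needs `new` at all: it passes a plain struct of function pointers plus an opaque context pointer.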