* redo Settings modal UI
* add python code interpreter
* fix auto scroll
* build
* fix overflow for long output lines
* bring back sticky copy button
* adapt layout on mobile view
* fix multiple lines output and color scheme
* handle python exception
* better state management
* add webworker
* add headers
* format code
* speed up by loading pyodide on page load
* (small tweak) add small animation to make it feels like claude
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.
Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.
This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)
Returing a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.
* Update llama.cpp
For display progress dots in terminal.
Without this it didn't display dots progress during loading model from file.
* Update llama.cpp
removed trailing spaces
The C API in llama.h claims users can implement `llama_sampler_i` to
create custom `llama_sampler`. The sampler chain takes ownership and
calls `llama_sampler_free` on them. However, `llama_sampler_free` is
hard-coded to use `delete`. This is undefined behavior if the object
wasn't also allocated via `new` from libllama's C++ runtime. Callers
in C and C-compatible languages do not use C++'s `new` operator. C++
callers may not be sharing the same heap as libllama.
* common : add default embeddings presets
This commit adds default embeddings presets for the following models:
- bge-small-en-v1.5
- e5-small-v2
- gte-small
These can be used with llama-embedding and llama-server.
For example, with llama-embedding:
```console
./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?"
```
And with llama-server:
```console
./build/bin/llama-server --embd-gte-small-default
```
And the embeddings endpoint can then be called with a POST request:
```console
curl --request POST \
--url http://localhost:8080/embeddings \
--header "Content-Type: application/json" \
--data '{"input": "Hello, how are you?"}'
```
I'm not sure if these are the most common embedding models but hopefully
this can be a good starting point for discussion and further
improvements.
Refs: https://github.com/ggerganov/llama.cpp/issues/10932
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when run cpu CI locally on LoongArch
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
* init version
* fix auto scroll
* bring back copy btn
* bring back thought process
* add lint and format check on CI
* remove lang from html tag
* allow multiple generations at the same time
* lint and format combined
* fix unused var
* improve MarkdownDisplay
* fix more latex
* fix code block cannot be selected while generating
Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits and allow you to switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file
* Fixed the gcc warning regarding minor linting
* Removed trailing whitespace
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.
ref #11435
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite awhile, so I figured now is fair to add it.
It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.
It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face.
So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem much value factor it out, or
even require it.