* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
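For context, a minimal sketch of driving the quantizer through the extended parameter struct; the field values below are illustrative choices, and the fields themselves should be checked against the current llama.h.

```cpp
// Hedged sketch: quantize a model via llama_model_quantize_params.
// The values chosen here are examples, not recommendations.
#include "llama.h"

int quantize_example(const char * fname_inp, const char * fname_out) {
    llama_model_quantize_params params = llama_model_quantize_default_params();

    params.ftype                = LLAMA_FTYPE_MOSTLY_Q4_K_M; // overall target type
    params.nthread              = 8;                         // worker threads
    params.output_tensor_type   = GGML_TYPE_Q6_K;            // per-tensor override: output tensor
    params.token_embedding_type = GGML_TYPE_Q4_K;            // per-tensor override: token embeddings

    // llama_model_quantize returns 0 on success
    return llama_model_quantize(fname_inp, fname_out, &params) == 0 ? 0 : 1;
}
```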
* Implement a general --tensor-type option instead of tensor-specific command options
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
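A rough illustration of the regex-based tensor selection described above; the struct and function names here are placeholders, not the actual llama-quantize internals.

```cpp
// Illustrative sketch of --tensor-type style overrides: a user-supplied regex
// is matched against tensor names to pick a per-tensor quantization type.
#include <regex>
#include <string>
#include <vector>
#include "ggml.h"

struct tensor_type_rule {
    std::string pattern; // regex matched against the tensor name, e.g. "attn_v"
    ggml_type   type;    // override type, e.g. GGML_TYPE_Q5_K
};

static bool tensor_type_override(const std::string & tensor_name,
                                 const std::vector<tensor_type_rule> & rules,
                                 ggml_type & out_type) {
    for (const auto & rule : rules) {
        // regex_search so a short pattern like "attn_v" also matches "blk.0.attn_v.weight"
        if (std::regex_search(tensor_name, std::regex(rule.pattern))) {
            out_type = rule.type;
            return true;
        }
    }
    return false; // fall back to the default type for this tensor
}
```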
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
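A rough sketch of the graph-side change described above; the helper and its arguments are assumptions, only ggml_cast is the real ggml API.

```cpp
// Sketch only: when the KV cache is not used (e.g. embeddings-only runs),
// V arrives in F32; casting it to F16 keeps the usual flash-attention layout,
// while the updated FA kernels also accept F32 V directly.
#include "ggml.h"

static struct ggml_tensor * maybe_cast_v(struct ggml_context * ctx,
                                         struct ggml_tensor  * v,
                                         bool kv_cache_used) {
    if (!kv_cache_used && v->type == GGML_TYPE_F32) {
        v = ggml_cast(ctx, v, GGML_TYPE_F16);
    }
    return v;
}
```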
* Update ChatScreen.tsx
* useAutosizeTextarea.ts
Add a useAutosizeTextarea hook to encapsulate the logic.
* Implement responsive auto-sizing chat textarea
Replaces the manual textarea resizing with an automatic height adjustment based on content.
- `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
- Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
- Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
- Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.
* update compressed index.html.gz after npm run build
* refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook
* chore: normalize line endings to LF
refactor: AutosizeTextareaApi -> chatTextareaApi
* refactor: Rename interface to PascalCase
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Upgrade daisyui, tailwindcss.
* Switch to all themes.
* Revert a change.
* Update formatting.
* Install packages before npm build.
* Revert "Install packages before npm build."
This reverts commit 336c5147e614e60993162794ba9d9d4629a916f8.
* Add index.html.gz
* run build
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* (wip) refactor downloading system [no ci]
* fix all examples
* fix mmproj with -hf
* gemma3: update readme
* only handle mmproj in llava example
* fix multi-shard download
* windows: fix problem with std::min and std::max
* fix 2
* tts.cpp : llama tokens console output now uses LOG_INF instead of printf(), so the options '--log-disable' and '--log-file' have a uniform effect on all output.
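Roughly what the change amounts to; the function and variable names are made up for illustration.

```cpp
// Illustrative only: emit the generated llama tokens through the common
// logging macros so --log-disable and --log-file apply to this output too.
#include "log.h" // common/log.h in the llama.cpp tree

static void print_tokens(const int32_t * tokens, int n_tokens) {
    for (int i = 0; i < n_tokens; i++) {
        LOG_INF("%d, ", tokens[i]); // previously: printf("%d, ", tokens[i]);
    }
    LOG_INF("\n");
}
```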
* Include speculative decoding stats when timings_per_token is true
New fields added to the `timings` object:
- draft_n : number of draft tokens generated
- draft_accepted_n : number of draft tokens accepted
- draft_accept_ratio: ratio of accepted/generated
* Remove redundant draft_accept_ratio var
* add draft acceptance rate to server console output
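A sketch of how the new fields might be attached to the timings object on the server side; only the field names come from the description above, the surrounding code and variable names are assumptions.

```cpp
// Illustrative sketch using nlohmann::json (which the server already uses):
// attach the speculative-decoding counters to the per-request timings object.
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

static json make_draft_timings(int32_t draft_n, int32_t draft_accepted_n) {
    return json {
        // ... existing timing fields (token counts, per-token ms, etc.) ...
        {"draft_n",          draft_n},          // draft tokens generated
        {"draft_accepted_n", draft_accepted_n}, // draft tokens accepted
        // the accepted/generated ratio can be derived by the client, which is
        // why the redundant draft_accept_ratio field was dropped
    };
}
```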
* rpc : send hash when tensor data is above some fixed threshold
ref #10095
* rpc : put cache under $HOME/.cache/llama.cpp
* try to fix win32 build
* another try to fix win32 build
* remove llama as dependency
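The idea behind the hash-based transfer, sketched with made-up helper names; the threshold value and the cache lookup are assumptions.

```cpp
// Sketch only: for tensors above a fixed size the client sends a content hash
// first; the server checks its cache (under $HOME/.cache/llama.cpp) and the
// full payload is only transferred on a cache miss. rpc_* helpers are hypothetical.
#include <cstddef>
#include <cstdint>

static const size_t HASH_THRESHOLD = 10u * 1024 * 1024; // assumed value

static uint64_t fnv1a_64(const uint8_t * data, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static void send_tensor_data(const uint8_t * data, size_t n) {
    if (n >= HASH_THRESHOLD) {
        const uint64_t h = fnv1a_64(data, n);
        (void) h;
        // if (!rpc_server_has_blob(h)) rpc_send_blob(h, data, n); // hypothetical calls
    } else {
        // rpc_send_blob(/*hash=*/0, data, n);                     // small tensors go directly
    }
}
```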
* server : Bump cpp-httplib to include AF_UNIX windows support
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
* server : Allow running the server example on a unix socket
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
---------
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
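For reference, the cpp-httplib mechanism this relies on, as a minimal standalone sketch; the socket path and handler are examples.

```cpp
// Minimal sketch: serving over an AF_UNIX socket with cpp-httplib.
// With AF_UNIX the "host" argument to listen() is treated as a filesystem path
// and the port value is ignored.
#include "httplib.h"
#include <sys/socket.h> // AF_UNIX

int main() {
    httplib::Server svr;
    svr.set_address_family(AF_UNIX);
    svr.Get("/health", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("ok", "text/plain");
    });
    return svr.listen("/tmp/llama-server.sock", 8080) ? 0 : 1;
}
```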
* [Fix] Compiling clip-quantize-cli and running it in a CUDA environment makes ggml_fp16_to_fp32 report an error when it tries to access video memory; quantization needs to run on the CPU backend.
After the fix, it automatically runs on the CPU backend and is no longer bound to CUDA.
* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
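A heavily hedged sketch of what the fix amounts to; the clip_context_params fields shown should be verified against clip.h, they are partly an assumption.

```cpp
// Sketch only: load the CLIP model on the CPU backend for quantization so
// ggml_fp16_to_fp32 never touches CUDA memory. Field layout is an assumption.
#include "ggml.h"
#include "clip.h"

static struct clip_ctx * load_clip_for_quantize(const char * fname) {
    struct clip_context_params ctx_params;
    ctx_params.use_gpu   = false;              // force the CPU backend
    ctx_params.verbosity = GGML_LOG_LEVEL_INFO;
    return clip_init(fname, ctx_params);
}
```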
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
* webui: Make textarea uncontrolled to eliminate devastating lag
* Update index.html.gz
* use signal-style implementation
* rm console log
* no duplicated savedInitValue set
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* added -o option to specify an output file name
* llama-tts returns ENOENT in case of file write error
note : PR #12042 is closed as superseded by this one.
* Fix DOS index bug
* Remove new APIs
* remove extra line
* Remove from API
* Add extra newline
* Update examples/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
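A minimal sketch of the write-error path described above; the function name and surrounding details are illustrative.

```cpp
// Illustrative sketch: open the -o output file and propagate ENOENT when the
// write target cannot be opened, matching the behaviour described above.
#include <cerrno>
#include <cstdio>

static int save_output(const char * path) {
    FILE * f = fopen(path, "wb");
    if (f == NULL) {
        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, path);
        return ENOENT; // llama-tts uses this as its return code on write failure
    }
    // ... write the generated audio ...
    fclose(f);
    return 0;
}
```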
As discussed in PR 'llama-tts : add -o option' (#12042):
* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.
* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
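A small sketch of the pattern this implies for the example programs; the helper and the default file name are examples, not the actual code.

```cpp
// Hypothetical helper mirroring the described pattern: common_params only
// carries the user-supplied 'out_file', and each tool falls back to its own
// default in main() when -o was not given.
#include <string>

static std::string resolve_out_file(const std::string & out_file_arg,
                                    const std::string & tool_default) {
    return out_file_arg.empty() ? tool_default : out_file_arg;
}

// e.g. in imatrix: const std::string fname = resolve_out_file(params.out_file, "imatrix.dat");
```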