llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-04-14 10:36:07 +00:00

Author	SHA1	Message	Date
0cc4m	92e3006bb6	Vulkan: Fix mmq int dot float cache size (#12722 ) b5030	2025-04-02 19:12:30 +02:00
Georgi Gerganov	833e2b7409	model : print tensor size during load (#12711 ) * model : print tensor size during load * cont : fix units MB -> MiB Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> b5029	2025-04-02 16:38:54 +03:00
Diego Devesa	e0e912f49b	llama : add option to override model tensor buffers (#11397 ) * llama : add option to override tensor buffers * ggml : fix possible underflow in ggml_nbytes b5028	2025-04-02 14:52:01 +02:00
Georgi Gerganov	a10b36c91a	llama : refactor kv cache guard (#12695 ) * llama : refactor kv cache guard ggml-ci * cont : fix comment [no ci] * llama : fix kv_cache restore logic ggml-ci * context : simplify kv cache updates ggml-ci * cont : better name [no ci] * llama : fix llama_decode return code when could not find KV slot ggml-ci * context : change log err -> warn [no ci] * kv-cache : add comment + warning	2025-04-02 14:32:59 +03:00
Sigbjørn Skjæret	83a88bd6af	vocab : BailingMoE : change possessive quantifiers to greedy (#12677 ) b5026	2025-04-02 11:21:48 +02:00
Xuan-Son Nguyen	42eb248f46	common : remove json.hpp from common.cpp (#12697 ) * common : remove json.hpp from common.cpp * fix comment b5025	2025-04-02 09:58:34 +02:00
Chenguang Li	9bacd6b374	[CANN] get_rows and dup optimization (#12671 ) * [CANN]get_rows and dup optimization. Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]GET_ROWS and CPY/DUP optimization Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-04-02 15:22:13 +08:00
Xuan-Son Nguyen	267c1399f1	common : refactor downloading system, handle mmproj with -hf option (#12694 ) * (wip) refactor downloading system [no ci] * fix all examples * fix mmproj with -hf * gemma3: update readme * only handle mmproj in llava example * fix multi-shard download * windows: fix problem with std::min and std::max * fix 2	2025-04-01 23:44:05 +02:00
Junil Kim	f423981ac8	opencl : fix memory allocation size (#12649 ) issue: https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283 This patch fixes the memory allocation size so that it does not exceed the maximum size of the OpenCL device. b5022	2025-04-01 09:54:34 -07:00
jklincn	e39e727e9a	llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672 ) b5021	2025-04-01 14:54:28 +02:00
Sigbjørn Skjæret	5936a616e4	convert : BailingMoE : fix qkv split when head_dim is 0 (#12687 ) NOTE: Ling-lite-base is broken, see https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2	2025-04-01 14:37:13 +02:00
Georgi Gerganov	3fd072a540	metal : use F32 prec in FA kernels (#12688 ) * metal : use F32 prec in FA kernels ggml-ci * cont : fix FA vec kernel ggml-ci b5019	2025-04-01 14:57:19 +03:00
R0CKSTAR	a6f32f0b34	Fix clang warning in gguf_check_reserved_keys (#12686 ) * Fix clang warning in gguf_check_reserved_keys Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix typo Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b5018	2025-04-01 13:12:53 +02:00
Wagner Bruna	2bb3597e42	vulkan: fix build when glslc doesn't support coopmat (#12683 ) b5017	2025-04-01 11:38:07 +02:00
Romain Biessy	8293970542	SYCL: Rename oneMKL to oneMath (#12192 ) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurrences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections b5016	2025-04-01 16:24:29 +08:00
Akarshan Biswas	8bbf26083d	SYCL: switch to SYCL namespace (#12674 ) b5015	2025-04-01 10:11:39 +02:00
Sigbjørn Skjæret	35782aeedb	convert : BailingMoE : avoid setting rope_dim to 0 (#12678 )	2025-03-31 23:09:48 +02:00
Daniel Bevenius	c80a7759da	vocab : add special infill tokens for CodeLlama (#11850 ) * vocab : add special infill tokens for CodeLlama The commit adds the following special tokens for CodeLlama infill: - `▁<PRE>` - `▁<SUF>` - `▁<MID>` The motivation for this is that currently the infill example uses CodeLlama as a suggested model. But when using this model the following error is generated: ```console /llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf ptrace: Operation not permitted. No stack. The program is not being run. 305251 Aborted (core dumped) ./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \ -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \ --in-prefix "def helloworld():\n print(\"hell" \ --in-suffix "\n print(\"goodbye world\")\n " ``` * squash! vocab : add special infill tokens for CodeLlama Add _<EOT> as well. b5013	2025-03-31 18:40:56 +02:00
a3sh	250d7953e8	ggml : faster ssm scan (#10558 ) * faster ssm_scan * delete unused comment * clang format * add space * modify unnecessary calculations * faster ssm conv implementation * modify file name with dash b5012	2025-03-31 18:05:13 +02:00
Sigbjørn Skjæret	403fbacbbc	convert : Qwerky : use lora_rank_tokenshift and lora_rank_decay if present (#12667 )	2025-03-31 16:36:25 +02:00
0cc4m	a8a1f33567	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135 ) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version b5010	2025-03-31 14:37:01 +02:00
Georgi Gerganov	1790e73157	cmake : fix whitespace (#0 ) b5009	2025-03-31 15:07:32 +03:00
Georgi Gerganov	0114a32da0	sync : ggml ggml-ci	2025-03-31 15:07:32 +03:00
Sandro Hanea	a7724480fd	cmake: improve Vulkan cooperative matrix support checks (whisper/2966) Co-authored-by: Sandro Hanea <me@sandro.rocks>	2025-03-31 15:07:32 +03:00
Sigbjørn Skjæret	1a85949067	llava : proper description fix (#12668 ) b5006	2025-03-31 11:28:30 +02:00
Akarshan Biswas	6c02a032fa	SYCL: Remove misleading ggml_sycl_op_flatten function (#12387 ) * SYCL: Remove misleading ggml_sycl_op_flatten function * remove trailing whitespace * Fix L2 norm from rebase * remove try catch block from element_wise.cpp * remove comment from common.hp * ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward * norm.cpp: remove try catch exception block b5005	2025-03-31 11:25:24 +02:00
Sigbjørn Skjæret	f52d59d771	llava : fix clip loading GGUFs with missing description (#12660 ) b5004	2025-03-31 11:07:07 +02:00
marcoStocchi	52de2e5949	tts : remove printfs (#12640 ) * tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' have now uniform impact on all output. b5003	2025-03-31 11:20:30 +03:00
Sigbjørn Skjæret	2c3f8b850a	llama : support BailingMoE (Ling) (#12634 ) b5002	2025-03-30 22:21:03 +02:00
Georgi Gerganov	4663bd353c	metal : use constexpr in FA kernels + fix typedef (#12659 ) * metal : use constexpr in FA kernels ggml-ci * cont ggml-ci * cont : fix typedef ggml-ci b5001	2025-03-30 22:04:04 +03:00
Juyoung Suk	b3de7cac73	llama : add Trillion 7B model support (#12556 ) * Support Trillion 7B * Update llama.h * Update llama.h * Update llama-vocab.cpp for Trillion * Update llama-vocab.cpp	2025-03-30 20:38:33 +02:00
Sergei Vorobyov	7242dd9675	llama-chat : Add Yandex instruct model template support (#12621 ) * add yandex template * update yandex chat template * fix tests * adjust chat template * fix style * fix tool macro in template * add clarify comment --------- Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b4999	2025-03-30 20:12:03 +02:00
R0CKSTAR	492d7f1ff7	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611 ) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b4998	2025-03-30 10:59:38 +02:00
Georgi Gerganov	d3f1f0acfb	sync : ggml ggml-ci b4997	2025-03-30 08:33:31 +03:00
Xuan-Son Nguyen	360dc22c00	cpu : rm unused variable (ggml/1166)	2025-03-30 08:33:31 +03:00
cmdr2	a62d7fa7a9	cpu: de-duplicate some of the operators and refactor (ggml/1144) * cpu: de-duplicate some of the operators and refactor * Fix PR comments * Fix PR comments	2025-03-30 08:33:31 +03:00
Daniel Bevenius	e408d4351a	ggml : add logging for native build options/vars (whisper/2935) This commit adds debug level logging for the native build options and variables to ggml/CMakeLists.txt. The motivation for this is that it can be useful to see the effective result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a cmake build. I've found myself adding similar logging a few times now, so I thought it might be a good idea to add this. Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when running cmake produces the following output: ```console -- GGML_NATIVE : OFF -- GGML_NATIVE_DEFAULT : OFF -- INS_ENB : OFF ```	2025-03-30 08:33:31 +03:00
Daniel Bevenius	3891e183c6	examples : command.wasm updates (whisper/2904) This commit updates the command.wasm example by adding a server.py script to make it easy to start a local http server to try out the example, updates the build instructions, and also addresses some of the compiler warnings that were being generated. * emscripten : fix TOTAL_STACK for wasm This commit moves the TOTAL_STACK setting from the compile flags to the linker flags. This is because the TOTAL_STACK setting is a linker setting. The motivation for this change is that currently the following warnings are generated when building: ```console em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] ``` * examples : suppress C++17 deprecation warning for std::codecvt_utf8 This commit suppresses the C++17 deprecation warning for std::codecvt_utf8 similar to what is done in examples/talk-llama/unicode.cpp. 
The motivation for this change is to suppress these warnings: ```console /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ 
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ 4 warnings generated. 
``` * ggml : suppress double-promotion warning in GGML_F16x4_REDUCE This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro to suppress a double-promotion warning. Currently the following warning is generated when compiling the command.wasm example: ```console /whisper-work/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1592 \| GGML_F16_VEC_REDUCE(sumf, sum); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /whisper-work/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1640 \| GGML_F16_VEC_REDUCE(sumf[k], sum[k]); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| 
wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2 warnings generated. ``` wasm_f32x4_extract_lane returns a 32-bit float and this is what the addition is performed on. But there is an implicit conversion from 32-bit float to 64-bit double when the result is assigned to `res`, which is of type `ggml_float`. My understanding here is that this is intentional and adding a cast to `ggml_float` should suppress the warning. * emscripten : add -Wno-deprecated to for emscripten This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten builds. The motivation for this is that currently there a number of warnings generated like the following: ```console warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. 
Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] ``` The downside of this is that we might miss other deprecation warnings in the future so I'm not sure if this is acceptable. But it makes the wasm examples cleaner without the warnings. * examples : fix tautological-compare warning in stb_vorbis.c [no ci] This commit applies a fix to address a tautological-compare warning in stb_vorbis.c. The motivation for this is that currently the following warning is generated when compiling the command.wasm example: ```console /Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare] 1404 \| if (f->stream_start + loc >= f->stream_end \|\| f->stream_start + loc < f->stream_start) { \| ^ 1 warning generated. ``` This fix was taken from an open pull request on the stb repository that addresses this issue: https://github.com/nothings/stb/pull/1746 * squash! examples : update command.wasm instructions [no ci] This commit adds a Python script to serve the wasm examples built in the `build-em` directory. Initially I thought that it would be enough to start a simple python server but I did not notice that there was an error in the browser console when I did that: ```console command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated.
at command.js:1:1206224 at new Promise (<anonymous>) at loadWasmModuleToWorker (command.js:1:1204981) at Array.map (<anonymous>) at Object.loadWasmModuleToAllWorkers (command.js:1:1206428) at command.js:1:1204318 at callRuntimeCallbacks (command.js:1:1202062) at preRun (command.js:1:6136) at run (command.js:1:1294094) at removeRunDependency (command.js:1:7046) ``` We need a few CORS headers to be set, and to hopefully make this easy for users a Python script is added to the examples directory. This should be able to serve all the wasm examples provided they have been built. command.wasm's README.md is updated to reflect this change. * examples : remove unused functions This commit removed the unused functions convert_to_utf8 and convert_to_wstring from examples/common.cpp. * Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]" This reverts commit 8e3c47d96141c7675c985562ebdc705e839e338a. We should not make this change here and instead when the upstream PR is merged we can sync with it. Refs: https://github.com/ggerganov/whisper.cpp/issues/2784	2025-03-30 08:33:31 +03:00
Xuan-Son Nguyen	af6ae1efb2	llama : fix non-causal mask for gemma 3 (#12615 ) b4992	2025-03-30 00:07:37 +01:00
Djip007	0bb2919335	llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632 ) This allows the GPU host buffer to be used, when possible, in preference to CPU repack. It has the same effect of resolving the issue (#12498) without completely disabling the CPU extra buffer. Co-authored-by: philou <philou@framework> b4991	2025-03-29 14:07:37 +01:00
Jay	a69f846351	cmake : fix ccache conflict (#12522 ) If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in cmake again will lead to a conflict and a compile failure. Signed-off-by: Jay <BusyJay@users.noreply.github.com> b4990	2025-03-29 11:04:58 +01:00
hipudding	d07a0d7a79	CANN : remove clang-format in ggml-cann (#12607 )	2025-03-29 11:03:28 +01:00
Sigbjørn Skjæret	3714c3ee1a	llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631 ) b4988	2025-03-28 22:13:02 +01:00
Georgi Gerganov	b4ae50810e	metal : improve FA + improve MoE (#12612 ) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci b4987	2025-03-28 20:21:59 +02:00
Icenowy Zheng	b86f600723	vulkan: fix coopmat shader generation when cross-compiling (#12272 ) * vulkan: fix coopmat shader generation when cross-compiling Previously the status of coopmat{,2} support isn't passed to the vulkan-shaders-gen project building on the host, which leads to build failure because of the cross-compiling code expecting coopmat{,2} shaders that didn't get generated. Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> * Only call coop-mat shaders once * Fix whitespace --------- Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com> b4986	2025-03-28 14:51:06 -03:00
Johannes Gäßler	dd373dd3bf	llama: fix error on bad grammar (#12628 ) b4985	2025-03-28 18:08:52 +01:00
Benson Wong	5d01670266	server : include speculative decoding stats when timings_per_token is enabled (#12603 ) * Include speculative decoding stats when timings_per_token is true New fields added to the `timings` object: - draft_n : number of draft tokens generated - draft_accepted_n : number of draft tokens accepted - draft_accept_ratio: ratio of accepted/generated * Remove redundant draft_accept_ratio var * add draft acceptance rate to server console output b4984	2025-03-28 10:05:44 +02:00
Radoslav Gerganov	ef03229ff4	rpc : update README for cache usage (#12620 )	2025-03-28 09:44:13 +02:00
amritahs-ibm	13731766db	llamafile : ppc64le GEMV forwarding for FP32. (#12594 ) This patch enables usage of MMA when one of the dimensions of the matrix (i.e. either M or N) is 1. This is useful for token generation, where N < 2. The concept of 'GEMV Forwarding' is used: when one of the matrices has a single row/column, its elements are broadcast instead of being prepacked by the packing routine. This change results in a 5% - 15% improvement in total speed (i.e. all tokens/total time) across various batch sizes, compared with the corresponding dot product implementation. The patch was tested with FP32 models of Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com> b4982	2025-03-28 09:43:22 +02:00
Radoslav Gerganov	ab6ab8f809	rpc : send hash when tensor data is above some fixed threshold (#12496 ) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency b4981	2025-03-28 08:18:04 +02:00
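
The last entry above (ab6ab8f809, "rpc : send hash when tensor data is above some fixed threshold") describes sending a hash in place of large tensor payloads so that the receiving side can reuse a cached copy. The following is a minimal Python sketch of that general idea only, not the rpc-server code. The threshold value, the choice of SHA-256, and the names `TensorCache`, `content_hash` and `send_tensor` are all assumptions made for illustration; the commit message states only that a fixed threshold exists and that the cache lives under `$HOME/.cache/llama.cpp`.

```python
import hashlib
from pathlib import Path

# Hypothetical value: the commit message only says "some fixed threshold".
HASH_THRESHOLD_BYTES = 10 * 1024 * 1024


def content_hash(data: bytes) -> str:
    # SHA-256 is an assumption; the real implementation may use another hash.
    return hashlib.sha256(data).hexdigest()


class TensorCache:
    """Illustrative on-disk store keyed by the hash of the tensor bytes."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def has(self, digest: str) -> bool:
        return (self.cache_dir / digest).exists()

    def put(self, data: bytes) -> str:
        digest = content_hash(data)
        (self.cache_dir / digest).write_bytes(data)
        return digest


def send_tensor(data: bytes, cache: TensorCache,
                threshold: int = HASH_THRESHOLD_BYTES) -> str:
    """Return which path was taken: 'data', 'hash-hit' or 'hash-miss'."""
    if len(data) <= threshold:
        # Small payloads are sent as-is; hashing would not save anything.
        return "data"
    digest = content_hash(data)
    if cache.has(digest):
        # The receiver already holds these bytes, so only the hash travels.
        return "hash-hit"
    # Cache miss: the full payload is sent and stored for next time.
    cache.put(data)
    return "hash-miss"
```

In this sketch the first transfer of a large tensor pays the full cost and later transfers of identical bytes send only the digest, which is the benefit the commit title suggests.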


5130 Commits
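
The "examples : command.wasm updates (whisper/2904)" entry (3891e183c6) mentions a `server.py` helper that was added because a plain Python HTTP server triggered the `SharedArrayBuffer transfer requires self.crossOriginIsolated` error. The script itself is not shown in the log, so the following is only a sketch of how such a server could look. It assumes the two response headers browsers require for cross-origin isolation (`Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`); the names `IsolatedHandler` and `make_server` are hypothetical.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Headers that make a page cross-origin isolated, which browsers require
# before SharedArrayBuffer can be shared with workers. Whether the real
# server.py sets exactly these is an assumption.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()


def make_server(directory=".", port=8000):
    """Build (but do not start) a local server rooted at `directory`."""
    handler = partial(IsolatedHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (assuming the wasm examples were built into build-em, as the
# commit message describes):
#   make_server("build-em", 8000).serve_forever()
```

Overriding `end_headers` keeps the sketch short because every response, including directory listings and error pages, passes through it.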