llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-04-16 11:36:08 +00:00

Author	SHA1	Message	Date
Tei Home	3361e2deba	docs: update: improve the Fedora CUDA guide (#12536 ) * docs: update fedora-cuda guide - Rename and place into Backend Folder. - Update Host-Supplied Packages. - Expand Recommended Users Section. * docs: improve the flow of CUDA-FEDORA.md	2025-03-24 11:02:26 +00:00
compilade	00d53800e0	llama-vocab : add SuperBPE pre-tokenizer (#12532 ) b4948	2025-03-24 11:47:24 +01:00
R0CKSTAR	7ea75035b6	CUDA: Fix clang warnings (#12540 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b4947	2025-03-24 11:28:34 +01:00
Prajwal B Mehendarkar	c54f6b7988	mmap : skip resource limit checks on AIX (#12541 ) b4946	2025-03-24 12:17:10 +02:00
Jeff Bolz	9b169a4d4e	vulkan: fix mul_mat_vec failure in backend tests (#12529 ) The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs. b4945	2025-03-24 07:56:17 +01:00
Marius Gerdes	77f9c6bbe5	server : Add verbose output to OAI compatible chat endpoint. (#12246 ) Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods. b4944	2025-03-23 19:30:26 +01:00
Lars Sonchocky-Helldorf	18b663d8e4	install : add macports (#12518 ) MacPorts section added	2025-03-23 10:21:48 +02:00
Xuan-Son Nguyen	fbdfefe74e	llama : gemma3 : use output tensor if it exists in model weight (#12506 ) * llama : gemma3 : use output tensor if it exists in model weight * also add to the llm_tensor_names b4942	2025-03-22 23:28:19 +01:00
Georgi Gerganov	ba932dfb50	ggml : fix quantized cpy op (#12310 ) * ggml : fix quantized cpy op ggml-ci * tests : add cpy tests for all types ggml-ci * tests : add BF16 copy tests ggml-ci * tests : fix loop for same-type copy ggml-ci * tests : add option to permute the dst tensor ggml-ci	2025-03-22 16:23:26 +02:00
R0CKSTAR	fac63a3d78	musa: refine compute capability (#12493 ) * musa: refine compute capability Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b4940	2025-03-22 10:11:37 +01:00
Jeff Bolz	eddfb43850	vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505 ) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally. b4939	2025-03-22 09:40:11 +01:00
stduhpf	4375415b4a	Vulkan: RTE rounding for cpy to quant (#12480 ) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com> b4938	2025-03-21 20:34:50 +01:00
Eve	30c42ef5cb	vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472 ) b4937	2025-03-21 20:27:47 +01:00
Georgi Gerganov	af04481e6b	model : do not repack if a GPU device is present (#12498 ) ggml-ci b4936	2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret	960e726077	chore : cleanup llama_model_loader::TENSOR_ usage (#12492 ) b4935	2025-03-21 10:21:36 +01:00
marcoStocchi	ea1518e839	llama-tts : avoid crashes related to bad model file paths (#12482 ) b4934	2025-03-21 11:12:45 +02:00
蕭澧邦	1aa87ee53d	[SYCL] Fix build on Windows when ccache enabled (#9954 ) (#9976 ) * [SYCL] Fix build on Windows when ccache enabled (#9954) * take effect only on windows and force it to icl --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> b4933	2025-03-21 14:58:47 +08:00
Svetlozar Georgiev	9ffcc9e374	sycl: cleanup oneDNN related code (#12097 ) b4932	2025-03-21 10:15:56 +08:00
Woof Dog	e04643063b	webui : Prevent rerendering on textarea input (#12299 ) * webui: Make textarea uncontrolled to eliminate devastating lag * Update index.html.gz * use signal-style implementation * rm console log * no duplicated savedInitValue set --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-03-20 15:57:43 +01:00
Sigbjørn Skjæret	dbb3a4739e	llama : make Qwen2MoE QKV bias optional (#12477 ) b4930	2025-03-20 12:49:59 +01:00
Srihari-mcw	3d82dbcbce	ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332 ) * Add block interleaving support for Q4_K quantization * Remove whitespaces and fix CI/CD issues * Update pointer of bsums from int16_t to const int16_t * Add vector version of quantize_q8_K_4x8 function * Update code formatting based on review comments b4929	2025-03-20 13:35:34 +02:00
Bartowski	732b5fbf5e	convert : avoid calls to tokenizer.added_tokens_decoder (#12473 ) tokenizer.added_tokens_decoder returns a fresh dict every time relatively slowly (~0.04s on average) which results in massive slowdowns when we have a huge number of added tokens	2025-03-20 08:36:37 +02:00
fairydreaming	568013d0cd	context : clear sets containing encoder output sequence ids before storing new values (#12470 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> b4927	2025-03-19 21:01:57 +01:00
Gaurav Garg	517b5ddbf0	CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183 ) - Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value. - Prefer vector flash attention kernels over MMA kernel for BS=1 Fixes Issue: #12182 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b4926	2025-03-19 20:52:06 +01:00
Jeff Bolz	a9b59288e2	vulkan: optimize iq1 coopmat2 dequant functions (#12427 ) b4925	2025-03-19 19:56:23 +01:00
Guus Waals	0fd8487b14	Fix visionOS build and add CI (#12415 ) * ci: add visionOS build workflow Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode. * ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs * ci: remove define hacks for u_xxx system types --------- Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> b4924	2025-03-19 11:15:23 +01:00
Sigbjørn Skjæret	108e53c2f1	llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456 ) * Add support for GPT2, Bloom and CodeShell tied word embeddings * Deduplicate tied word embeddings weights * Workaround for incorrect weight map It appears transformer.wte.weight is in the weight map even though the weights are not there, remove it if output weights are encountered first. * check++ * fatfingers-- b4923	2025-03-19 09:08:49 +01:00
Sigbjørn Skjæret	a686171ea7	convert : Support chat_template.json (#12460 )	2025-03-19 08:58:13 +01:00
Jeff Bolz	c446b2edd2	vulkan: Submit once enough matmul work has been recorded (#12406 ) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit. b4921	2025-03-19 08:26:26 +01:00
lhez	d84635b1b0	opencl: improve profiling (#12442 ) * opencl: more profiling timing * opencl: generate trace for profiling * opencl: reduce profiling overhead * Populate profiling timing info at the end rather than after each kernel run * opencl: fix for chrome tracing b4920	2025-03-18 12:54:55 -07:00
Georgi Gerganov	75422e8bc4	graph : normalize Q, K, V shapes + sync cross attention (#12449 ) * graph : normalize Q, K, V shapes and add comments ggml-ci * context : synchronize before getting cross attention data * model : fix command-r attention norm check b4919	2025-03-18 21:35:19 +02:00
R0CKSTAR	bb115d2bf7	musa: override warp_size of musa device to 32 (#12445 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-18 19:28:26 +01:00
Xuan-Son Nguyen	29fff308c7	llama : support converting Mistral Small text-only (#12450 )	2025-03-18 19:16:19 +01:00
Georgi Gerganov	c6af2161b2	speculative : fix seg fault in certain cases (#12454 ) b4916	2025-03-18 19:35:11 +02:00
Xuan-Son Nguyen	99aa304fb9	llama : add support for EXAONE tied word embeddings (#12451 ) b4915	2025-03-18 17:24:33 +01:00
Georgi Gerganov	8551c44d84	context : always use non-causal attention for encoder graphs (#12447 ) * context : always use non-causal attention for encoder graphs ggml-ci * context : move the change to llama_context::encode() ggml-ci b4914	2025-03-18 13:05:49 +02:00
Łukasz Ślusarczyk	35cae5ba05	SYCL: using graphs is configurable by environment variable and compile option (#12371 ) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> b4913	2025-03-18 11:16:31 +01:00
Georgi Gerganov	810e0af3f5	server : fix warmup draft cache type (#12446 ) ggml-ci b4912	2025-03-18 12:05:42 +02:00
Prajwal B Mehendarkar	eba92d64c3	cmake : fix PowerPC build (#12241 ) Closes #12240 b4911	2025-03-18 11:37:33 +02:00
fj-y-saito	d9a14523bb	ggml : add SVE support for q6_K_q8_K (#12361 ) b4910	2025-03-18 10:14:39 +02:00
0cc4m	fd123cfead	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434 ) b4909	2025-03-18 07:21:40 +01:00
Łukasz Ślusarczyk	a53f7f7b88	fixed compilation warnings in ggml-sycl (#12424 ) b4908	2025-03-18 08:51:25 +08:00
Molly Sophia	7dfad387e3	llama: Add support for RWKV v7 architecture (#12412 ) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> b4907	2025-03-18 07:27:50 +08:00
Sigbjørn Skjæret	60c902926c	docs : bring llama-cli conversation/template docs up-to-date (#12426 )	2025-03-17 21:14:32 +01:00
Gaurav Garg	b1b132efcb	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394 ) * Enable CUDA Graph on CTK < 12.x `cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA Graph support in CTK version < 12.x by using older API if CTK < 12.x. * Fix compilation errors with MUSA * Disable CUDA Graph for MUSA b4905	2025-03-17 20:25:13 +02:00
Guus Waals	01e8f2138b	ggml-vulkan: remove unused find_program(glslc) (#12416 ) It's already found by FindVulkan.cmake in the parent CMakeLists	2025-03-17 13:35:43 -03:00
Jeff Bolz	484a8ab513	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312 ) b4903	2025-03-17 09:26:18 -05:00
Daniele	cf2270e4d3	vulkan: subgroup size tuning (#12087 ) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de> b4902	2025-03-17 12:42:33 +01:00
Jeff Bolz	f07690c930	vulkan: use fp32 in coopmat2 q4_k dequant function (#12309 ) b4901	2025-03-17 10:43:35 +01:00
Jeff Bolz	891c63956d	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273 ) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking b4900	2025-03-17 10:41:59 +01:00

4949 Commits