llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-04-18 04:26:09 +00:00

Author	SHA1	Message	Date
xiaofei	776f9e59cc	cmake : fix undefined reference errors for std::filesystem in ggml (#12092 ) (#12094 ) Signed-off-by: Ray Lee <hburaylee@gmail.com> Co-authored-by: Ray Lee <hburaylee@gmail.com>	2025-03-06 22:58:25 +00:00
Johannes Gäßler	5220a16d18	CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222 )	2025-03-06 18:45:09 +01:00
uvos	e721c05c93	HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. (#12209 ) This avoids conflict with internal cuda/hip runtimes memory managment behavior.	2025-03-06 08:20:52 +01:00
Henry Linjamäki	94bb63e4f0	opencl : fix buffer alignment (#12197 ) Fix the following error: ``` ggml-alloc.c:99: not enough space in the buffer ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024) ``` which occurs when `ggml_backend_opencl_context::alignment` is larger than `cl_ptr_base` (hard-coded to `0x1000`). Also, fix `ggml_backend_opencl_context::alignment` was set to `CL_DEVICE_MEM_BASE_ADDR_ALIGN` which was treated as bytes but the value is reported in bits.	2025-03-06 02:33:40 +01:00
Henry Linjamäki	f79243992c	opencl : fix `ulong` kernel args were set from `int` variables (#12174 ) ... which left garbage bits in the upper half of the kernel args. This caused segmentation faults when running PoCL.	2025-03-06 02:31:14 +01:00
simon886212	ed4ce0dda2	opencl : fix profile-related errors (#12095 ) Co-authored-by: ubuntu <ubuntu@localhost.localdomain>	2025-03-06 02:30:05 +01:00
Rémy O	07d1572347	ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154 ) * ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions * cmake: Add GGML_BMI2 build option * ggml: enable BMI2 on relevant CPU variants * ggml-cpu: include BMI2 in backend score * ggml-cpu: register BMI2 in ggml_backend_cpu_get_features * ggml-cpu: add __BMI2__ define when using MSVC	2025-03-06 02:26:10 +01:00
Akarshan Biswas	5e43f104cc	SYCL: Disable f16 Unary OPs as not supported by the kernels (#12201 )	2025-03-05 16:58:23 +01:00
Plamen Minev	16e4b22c5e	ggml : fix GGMLMetalClass ODR (#12200 ) -- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.	2025-03-05 17:16:01 +02:00
mgroeber9110	5bbe6a9fe9	ggml : portability fixes for VS 2017 (#12150 ) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-04 18:53:26 +02:00
David Huang	becade5de7	HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032 ) Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check Adds rocWMMA support to fattn-wmma-f16 --- Signed-off-by: Carl Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Ben Jackson <ben@ben.com>	2025-03-03 22:10:54 +01:00
cmdr2	b64d7cc272	cuda: unary ops as float + de-duplicate (ggml/1130)	2025-03-03 18:18:11 +02:00
cmdr2	0cbee131ad	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) ggml-ci	2025-03-03 18:18:11 +02:00
cmdr2	87abb7e903	cuda/cpu: Increase support for fp16 unary operations (ggml/1125) * Support fp16 unary operations in the CUDA backend * cpu: increase fp16 support for unary operators in the CPU backend * cuda: increase fp16 support for unary operators in the CUDA backend * Add test cases for fp16 unary operators * metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing * metal: fix PR comments for unary op support after fp16 unary tests	2025-03-03 18:18:11 +02:00
Diego Devesa	6d4c23b81b	whisper : support GGML_BACKEND_DL (whisper/2843) * whisper : support GGML_BACKEND_DL * fix DTW crash * whisper.objc : fix build - add ggml-cpp.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-03 18:18:11 +02:00
midnight	6512a90037	cmake : fix compile assumptions for power9/etc (whisper/2777) * Add small comment re: VSX to readme Co-authored-by: midnight <midnight@example.com>	2025-03-03 18:18:11 +02:00
petterreinholdtsen	4512055792	Told cmake to install ggml-cpp.h as a public header file. (ggml/1126) It is used by Whisper talk-llama example. Co-authored-by: Petter Reinholdtsen <pere@debian.org>	2025-03-03 18:18:11 +02:00
cmdr2	f54a4ba11e	Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) * Support float16-to-float16 add/sub/mul/div operations in the CUDA backend * Add fp16 support for add/sub/mul/div on the CPU backend * Add test cases for fp16 add/sub/mul/div	2025-03-03 18:18:11 +02:00
ag2s20150909	9660ffef58	ggml : fix kleidiai build (#12159 ) The libggml API has changed, but this has not been updated.	2025-03-03 13:54:08 +01:00
Akarshan Biswas	ece9745bb8	SYCL: Move CPY kernels to a separate file and add few missing kernels (#12133 ) * SYCL: refactor and move cpy kernels to a separate file * Add few missing cpy kernels * refactor and add debug logs	2025-03-03 11:07:22 +01:00
Diego Devesa	cc473cac7c	ggml-backend : keep paths in native string type when possible (#12144 )	2025-03-02 22:11:00 +01:00
Erik Scholz	80c41ddd8f	CUDA: compress mode option and default to size (#12029 ) cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".	2025-03-01 12:57:22 +01:00
William Tambellini	70680c48e5	ggml : upgrade init_tensor API to return a ggml_status (#11854 ) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return a OOM status), as agreeed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-02-28 14:41:47 +01:00
Rémy O	438a83926a	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595 ) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-02-28 09:42:52 +01:00
Johannes Gäßler	9c42b1718c	CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098 )	2025-02-28 09:26:43 +01:00
Prashant Vithule	05e6f5aad0	ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064 ) * Added SVE Support for Q2_K Quantized Models * Use 4-space indentation in the switch cases * removed comments lines * Remove the loop Retain the curly bracess for better understanding of code * Remove the comment like added for q3_k_q8_k kernel --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com>	2025-02-28 09:36:12 +02:00
hipudding	673cfef9aa	CANN: Fix build error with GCC 13 (#11990 ) Remove unused header file that causes compilation failure on ARM platform with GCC 13.	2025-02-28 15:23:47 +08:00
Eve	fbeda9002d	vulkan: matmul dequantization improvements (#12015 ) * faster dequant for old quants * dont use unpack for iq4_nl * vec2 unpack for q8	2025-02-28 08:20:08 +01:00
Daniele	581650b7ca	vulkan: improve im2col (#11826 ) * vulkan: improve im2col performance	2025-02-28 07:52:51 +01:00
Vladimir Vuksanovic	b95c8af37c	cmake: Fix ggml backend dependencies and installation (#11818 ) * Fix dependencies between ggml and backends ggml backends link only to ggml-base and ggml links to all backends. * Fix installation of ggml backends Set up GNUInstallDirs before setting the installation directory of ggml backends	2025-02-27 09:42:48 +02:00
Jeff Bolz	a82c9e7c23	vulkan: fix assertion when qy_needs_dequant (#12068 ) Looks like a copy/paste bug from qx_needs_dequant.	2025-02-25 16:30:21 +01:00
Judd	c132239bfb	add OP sigmoid (#12056 ) Co-authored-by: Judd <foldl@boxvest.com>	2025-02-25 12:32:20 +01:00
Molly Sophia	393fca629e	ggml-cpu: Fix build with sve (#12059 ) * ggml-cpu: Fix build with sve Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml-cpu: Remove unused variable in sve q3_k vec dot Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-02-25 19:28:22 +08:00
Rémy O	61d4f39dfe	vulkan: implement more backpropagation operators (#11914 ) * vulkan: implement GGML_OP_ROPE_BACK * vulkan: implement GGML_OP_RMS_NORM_BACK * vulkan: implement GGML_OP_SILU_BACK * vulkan: implement GGML_OP_SOFTMAX_BACK	2025-02-25 12:04:45 +01:00
Gian-Carlo Pascutto	58d07a8043	metal : copy kernels for quant to F32/F16 conversions (#12017 ) metal: use dequantize_q templates --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-25 11:27:58 +02:00
lhez	34a846b584	opencl: fix for small models (#11950 ) * opencl: fix small shape gemv, remove unused extensions * opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size * opencl: fix for token length < 4 * opencl: use wave size of 64 for all Adreno GPUs --------- Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com> Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>	2025-02-24 14:47:07 -07:00
Neo Zhang Jianyu	08d5986290	[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035 ) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-24 22:33:23 +08:00
Akarshan Biswas	8303e8b0fb	SYCL: Fix GGML_SYCL_DEBUG macro (#11995 )	2025-02-24 10:18:25 +00:00
Aaron Teo	af7747c95a	ggml-cpu: Support s390x SIMD Instruction Set (#12019 ) * ggml: add s390x ARCH_FLAGS for compilation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add SIMD for s390x using vector intrinsics SIMD is activated for: * ggml_vec_dot_f32 * ggml_vec_dot_f16 * ggml_vec_mad_f32 * ggml_vec_mad_f16 * ggml_vec_mad_f32_unroll * ggml_vec_scale_f32 * ggml_vec_scale_f16 SIMD is NOT activated for: * ggml_vec_dot_f16_unroll (pending bugfix) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing escape character in GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix s390x GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: full SIMD activation for F32,F16 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add option to disable s390x VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change vecintrin.h include to ggml-cpu-impl * add __VXE__ and __VXE2__ macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * cmake: add s390x target detection for VX/VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move s390x vector intrinsics to ggml-cpu-impl.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x Q8_0 SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct documentation for Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x reduce code complexity Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x bugfix typo Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activated for Q4_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x inline vec_reve Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add VXE backend feature Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove test.py Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: bugfix iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add float, double, and long vector data type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: clean up iq4_xs SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix improper use of restrict keyword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update warning message for ggml_vec_tbl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: switch to restrict for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: slight dot product speed improvement for q4_1_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for q6_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add missing `_t` to ggml_int8x16x4_t Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing `_t` for ggml_vec_xl_s8x4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix more missing `_t` Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q8_0 increase of 3.86% for prompt processing and 32.22% for token generation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: patch Q8_0 to use proper vector sizes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: optimise Q8_0 dot prod compute kernel further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q4_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor Q6_K variable naming for readability Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q6_K typos Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q5_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix wrong charx16_t naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml: Q5_K y0 wrong signness Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q4_K invalid vector intrinsics Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: simplify ggml_padd_s16 compute kernel Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct ggml-cpu vxe wording Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change ggml_aligned_malloc alignment to 256 256 is the cache line size for s390x platforms Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: resolve pr merge via cherry-pick 225bbbf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml : fix LoongArch compile error with 128-bit SIMD (#11701) * ggml: resolve pr merge via cherry-pick 4571953 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: cmake remove fork when determining s390x machine type thank you @ericcurtin Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Jinyang He <hejinyang@loongson.cn> Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>	2025-02-22 21:39:24 +00:00
Johannes Gäßler	a28e0d5eb1	CUDA: app option to compile without FlashAttention (#12025 )	2025-02-22 20:44:34 +01:00
Johannes Gäßler	5fa07c2f93	CUDA: optimize FA for GQA + large batches (#12014 )	2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto	d70908421f	cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000 )	2025-02-22 09:43:24 +01:00
PureJourney	ecc8e3aeff	CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984 ) * CUDA: correct the lowest Maxwell supported by CUDA 12 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-21 12:21:05 +01:00
Bodhi	0b3863ff95	MUSA: support ARM64 and enable dp4a .etc (#11843 ) * MUSA: support ARM64 and enable __dp4a .etc * fix cross entropy loss op for musa * update * add cc info log for musa * add comment for the MUSA .cc calculation block --------- Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>	2025-02-21 09:46:23 +02:00
Charles Xu	c5d91a7400	ggml-cpu: Add CPU backend support for KleidiAI library (#11390 ) * ggml-cpu: Add CPU backend support for KleidiAI library * Add environmental variable GGML_KLEIDIAI_SME * Add support for multithread LHS conversion * Switch kernel selection order to dotprod and i8mm * updates for review comments * More updates for review comments * Reorganize and rename KleidiAI files * Move ggml-cpu-traits.h to source file * Update cmake for SME build and add alignment for SME * Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list	2025-02-20 15:06:51 +02:00
Prashant Vithule	4806498bf1	ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917 ) * Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file * Improved Formating of code in ggml-cpu-quants.c file * style : minor fixes * style : less whitespaces * style : ptr spaceing --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-20 12:08:32 +02:00
Johannes Gäßler	73e2ed3ce3	CUDA: use async data loading for FlashAttention (#11894 ) * CUDA: use async data loading for FlashAttention --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-17 14:03:24 +01:00
Rémy O	2eea03d86a	vulkan: implement several ops relevant for ggml_opt (#11769 ) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-17 07:55:57 +01:00
Jeff Bolz	bf42a23d0a	vulkan: support multi/vision rope, and noncontiguous rope (#11902 )	2025-02-16 08:52:23 +01:00
Hale Chan	c2ea16f260	metal : fix the crash caused by the lack of residency set support on Intel Macs. (#11904 )	2025-02-16 08:50:26 +02:00

... 2 3 4 5 6 ...

754 Commits