Age Commit message Author
2025-04-14imatrix: collect layer influence statistics (#328)Kawrakow
* imatrix: collect layer influence statistics * imatrix: collect layer influence statistics also for the last layer For the last layer we need to use the input for the output.weight tensor. Last layer(s) tend(s) to be important, so it is useful to also have its influence metric. * imatrix: separate metric for attention and ffn importance * Use stripped tensor name, not src0->name --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14Add ability to hide imatrix details in llama-quantize (#329)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13Improved IQ1_M quantization (#327)Kawrakow
* Much faster and it looks like better iq1_m quantization * Cleanup * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12Fix KLD precision (#325)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11Correct L4 rms_norm (#324)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10LLaMA-4 support (text only) (#321)Kawrakow
* llama4: WIP * llama4: this seems to be working --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08Guard against attempts to use MLA for non-MLA models (#320)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Update AUTHORSKawrakow
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07Update AUTHORSKawrakow
Forgot to add @Nexesenex
2025-04-07Use links for ggml/llama.cpp authors (#318)Kawrakow
* Use links for ggml/llama.cpp authors * This file is not html * More --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Better iq2_xs quantization (#312)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Add copyright notices (#317)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Update LICENSEKawrakow
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). This PR corrects my mistake.
2025-04-05We need to synchronize before using device to host async memcpy (#313)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04Add -flax-vector-conversions for GCC on ARM (#311)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Metal: FA and FlashMLA (#310)Kawrakow
* Metal: WIP to update Metal FA implementation Dk=192, Dv=128 works, but not Dk = 576, Dv = 512 * Metal FA: go to float * WIP * Metal FA: MLA options now all work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Fix GCC compilation errors on ARM (#309)Kawrakow
* Fix GCC compilation errors on ARM * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Metal: much faster MoE prompt processing (#307)Kawrakow
* MoE improvements on Metal This version beats mainline, but there are things I don't understand: * Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the same, but we are 30% slower. Why? * Using actual GEMM, we beat mainline with a u-batch size of 128. But then performance degrades. Why? * Some cleanup * Much better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01docs: update README.md (#304)Ikko Eltociear Ashimine
2025-04-01Fix ARM_NEON build failure due to q8_2 (#303)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Quantization improvements (2) (#302)Kawrakow
* iq3_k: slightly better quantization Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main). * Small improvement for type-1 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Additional guards for interleaved quants (#299)Kawrakow
* Make sure no interleaved quants are being used for token embeddings also with `--pure` and/or `--custom-q`. * Simplify --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Fix #300 (#301)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29Quantization improvements (#295)Kawrakow
* Better make_qx_quants Tested with q4_0 and q3_K (pure, imatrix), and the improvement is quite significant. * Same for iq4_nl, iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27Make sure tensor row size is multiple of block size also when quantizing with --pure (#294)Kawrakow
* WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * Add check if selected type is possible with --pure I often want to quantize with --pure to see quantization performance without quantization mixes. But for models where there are tensors with row sizes that are not a multiple of 256, this results in a crash for k- and i-quants. Hence, let's add a check if the quant selected via --pure is applicable, and change it if not. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
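The applicability check described in #294 can be sketched as follows. This is a minimal illustration, not the code from the PR: `QuantType`, `row_fits`, and `choose_pure_type` are hypothetical names, and the fallback order (requested type, then a caller-supplied fallback with a smaller block, then unquantized f16) is an assumption, since the commit message does not say which type is substituted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical description of a quantization type: a name and its block size
// (number of elements that are quantized together).
struct QuantType {
    std::string name;
    int64_t     block_size;
};

// True if a row of n_per_row elements splits into whole blocks of this type.
inline bool row_fits(int64_t n_per_row, const QuantType & type) {
    return type.block_size > 0 && n_per_row % type.block_size == 0;
}

// Keep the requested type when the row size allows it; otherwise try the
// fallback; otherwise leave the tensor unquantized (assumed policy).
inline QuantType choose_pure_type(int64_t n_per_row,
                                  const QuantType & requested,
                                  const QuantType & fallback) {
    assert(n_per_row > 0);
    if (row_fits(n_per_row, requested)) return requested;
    if (row_fits(n_per_row, fallback))  return fallback;
    return QuantType{"f16", 1};
}
```

With a 256-element block requested and a 32-element block as the fallback, a row of 4096 elements keeps the requested type, while a row of 2880 elements (not a multiple of 256) drops to the fallback instead of crashing.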
2025-03-27Use bf16 instead of fp16 block scales for q8_1 (#292)Kawrakow
* WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * q8_0_r8 on avx2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
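For readers unfamiliar with bf16, the sketch below shows a generic float-to-bf16 round trip. It is not taken from the PR, and the function names are illustrative. bf16 keeps the 8 exponent bits of a 32-bit float and only 7 mantissa bits, so it covers the float range at lower precision, whereas fp16 tops out at 65504. The commit message does not state the motivation, so treating range as the reason is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Convert a float to bf16 by keeping the upper 16 bits, with
// round-to-nearest-even on the discarded lower 16 bits.
// NaN inputs are out of scope for this sketch.
inline uint16_t f32_to_bf16(float f) {
    assert(f == f);  // precondition: not NaN
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// Convert bf16 back to float by zero-filling the lower 16 bits.
inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

A scale such as 100000.0f survives the round trip with reduced precision, while the same value would not be representable as a finite fp16 number.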
2025-03-25llama-bench: enable having different number of threads for tg and pp (#284)Kawrakow
* llama-bench: enable having different number of threads for tg and pp * Add -tgb to usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25Update sweep bench (deprecating .jsonl support) (#289)saood06
* Update sweep bench (deprecating .jsonl support) * Fix README.md
2025-03-25CUDA: better MoE implementation (#283)Kawrakow
* Make fused MoE reproducible As a bonus, peak performance at pp2048 with u_batch = 2048 is ~8% better. * Slightly better * Also do it for non-fused mul_mat_id --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23Improve DeepSeek batched processing speed (#282)Kawrakow
* Improve DeepSeek batched processing speed * Revert the commented out section in iqk_mul_mat.cpp It does have some benefit at long contexts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23Attempt to improve FlashMLA on the CPU (#277)Kawrakow
* Fix it for nth > rk2 * Handle rk2%nth_k != 0 * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23Test transparent huge pages on Linux (#278)Kawrakow
* Adding ability to use THP on Linux * Use the actual page size used for mmap also in munmap * Add -thp to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
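One plausible reading of the mmap/munmap item is that the length passed to `munmap` has to match the length that was actually mapped, including any rounding to the page size in use. The sketch below shows only that bookkeeping. The names are hypothetical, the 2 MiB huge page size is an assumption (it is typical on x86-64 Linux but not stated in the commit), and the actual `mmap`, `madvise`, and `munmap` calls are left out.

```cpp
#include <cassert>
#include <cstddef>

// Assumed huge page size: 2 MiB. Real code should query the system.
constexpr std::size_t kAssumedHugePageSize = std::size_t(2) << 20;

// Round size up to the next multiple of page_size.
inline std::size_t round_up_to_page(std::size_t size, std::size_t page_size) {
    assert(page_size > 0);
    return (size + page_size - 1) / page_size * page_size;
}

// Hypothetical record of a mapping: remembering mapped_size at map time lets
// the unmap path pass exactly the same length instead of recomputing it with
// a different page size.
struct MappingInfo {
    void *      addr;
    std::size_t mapped_size;
};

inline MappingInfo describe_mapping(void * addr, std::size_t file_size, std::size_t page_size) {
    return MappingInfo{addr, round_up_to_page(file_size, page_size)};
}
```

Storing the rounded length next to the address keeps the map and unmap paths consistent even if the page size choice changes between them.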
2025-03-22Native build option for CUDA when GGML_NATIVE is set (#280)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22Fighting with cmake (#279)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22Add Gemma3 support (text only) (#276)Kawrakow
* WIP Gemma3: not working * gemma3: build_gemma3 seems to be working now * Revert changes to convert_hf_to_gguf.py It wasn't working, so I guess it is better to leave the conversion up to upstream. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21Fix bug: missing parentheses in logical expression (#275)Kawrakow
This results in GGGGGGGGGGGGG when generating with mla = 3, fa = 0. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
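The commit message does not show the offending expression, so the following is only a generic illustration of this class of bug, with hypothetical flag names. In C++, `&&` binds more tightly than `||`, so `a && b || c` parses as `(a && b) || c`, which differs from `a && (b || c)` whenever `a` is false and `c` is true.

```cpp
// Hypothetical condition, written with the parentheses the compiler implies
// for the unparenthesized form `mla == 3 && fa || is_prompt`.
constexpr bool as_parsed(int mla, bool fa, bool is_prompt) {
    return (mla == 3 && fa) || is_prompt;
}

// What the author of such an expression may have intended instead.
constexpr bool as_intended(int mla, bool fa, bool is_prompt) {
    return mla == 3 && (fa || is_prompt);
}

// The two forms disagree when mla != 3 and is_prompt is true.
static_assert(as_parsed(1, false, true) != as_intended(1, false, true),
              "operator precedence changes the result");
```

GCC and Clang can flag the unparenthesized form with `-Wparentheses`, which is enabled by `-Wall`.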
2025-03-21Specify tensor name regex for tensors to be repacked (#274)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21FlashMLA-3: the best of both worlds (CPU only) (#273)Kawrakow
* Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include * FlashMLA-3: the best of both worlds - CPU only --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21Convert models to row-interleaved quants using the quantize tool (#272)Kawrakow
* Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19Honor mmap setting when using tensor overrides (#270)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19Fix ggml_compute_forward_dup_q (#269)Kawrakow
I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19Prevent FlashMLA-1 from running on CUDA (#268)Kawrakow
as it is not supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18Allow q8_0 cache on the CPU for FlashMLA-2 (#265)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18Make Q8_0 KV cache work with mla=2,fa on CUDA (#264)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18Fix #261 (#262)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18Compile time option to use bf16 for quants without MMQ kernels (#261)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260)Kawrakow
* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. * FlashMLA-2: avoid conversions to f32 also on CUDA * Be able to compute for more than 65535 tokens On CUDA just a quick hack that allows us to concatenate tensors with more than 65535 rows along zeroth dimension as needed by FlashMLA-2. Also needed some care in the perplexity tool to avoid int overflows when evaluating the computed logits. * Reduce memory usage for FlashMLA-2 Oh, also fix int overflow in the CUDA concat implementation. It is funny how the llama.cpp 64-bit police has gone (almost) everywhere and replaced 32-bit ints with 64-bit ints, needed or not, but hasn't done it where it is actually needed. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
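The int overflow mentioned for the perplexity tool and the CUDA concat is the familiar case of an element offset exceeding the 32-bit range once the token count grows. The sketch below is illustrative only: the function names are hypothetical, and the vocabulary size in the usage note is an assumed value, not one given in the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Offset of the first logit of a given token in a flat [n_tokens x n_vocab]
// array, computed in 64-bit arithmetic.
inline int64_t logit_offset(int64_t token_index, int64_t n_vocab) {
    assert(token_index >= 0 && n_vocab >= 0);
    return token_index * n_vocab;
}

// True if the value could be stored in a 32-bit signed int without overflow.
inline bool fits_in_int32(int64_t value) {
    return value >= std::numeric_limits<int32_t>::min() &&
           value <= std::numeric_limits<int32_t>::max();
}
```

With an assumed vocabulary of 129280 entries, token 65536 starts at element 8472494080, well beyond the largest 32-bit signed value of 2147483647. Computing that product in `int` would overflow, which is undefined behaviour for signed integers in C++.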
2025-03-17Prepare wk_b tensors of DeepSeek models on the fly (#259)Kawrakow
* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-13FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)Kawrakow
* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-12MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252)Kawrakow
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache * WIP * FlashMLA(CUDA) - allow q8_0 for KV cache This works, and PP is not bad, but TG is still quite a bit slower. * FlashMLA(CUDA) - allow q8_0 for KV cache This is better. ~9% slower than f16 cache for short contexts, nearly on par at 16k tokens. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>