ik_llama.cpp.git

Age	Commit message	Author
2025-04-14	imatrix: collect layer influence statistics (#328)	Kawrakow
	* imatrix: collect layer influence statistics * imatrix: collect layer influence statistics also for the last layer For the last layer we need to use the input for the output.weight tensor. Last layer(s) tend(s) to be important, so it is useful to also have its influence metric. * imatrix: separate metric for attention and ffn importance * Use stripped tensor name, not src0->name --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14	Add ability to hide imatrix details in llama-quantize (#329)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13	Improved IQ1_M quantization (#327)	Kawrakow
	* Much faster and it looks like better iq1_m quantization * Cleanup * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12	Fix KLD precision (#325)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11	Correct L4 rms_norm (#324)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10	Llama-4 support (text only) (#321)	Kawrakow
	* llama4: WIP * llama4: this seems to be working --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08	Guard against attempts to use MLA for non-MLA models (#320)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Update AUTHORS	Kawrakow
	Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07	Update AUTHORS	Kawrakow
	Forgot to add @Nexesenex
2025-04-07	Use links for ggml/llama.cpp authors (#318)	Kawrakow
	* Use links for ggml/llama.cpp authors * This file is not html * More --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Better iq2_xs quantization (#312)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Add copyright notices (#317)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Update LICENSE	Kawrakow
	I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). This PR corrects my mistake.
2025-04-05	We need to synchronize before using device to host async memcpy (#313)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04	Add -flax-vector-conversions for GCC on ARM (#311)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03	Metal: FA and FlashMLA (#310)	Kawrakow
	* Metal: WIP to update Metal FA implementation Dk=192, Dv=128 works, but not Dk = 576, Dv = 512 * Metal FA: go to float * WIP * Metal FA: MLA options now all work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03	Fix GCC compilation errors on ARM (#309)	Kawrakow
	* Fix GCC compilation errors on ARM * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03	Metal: much faster MoE prompt processing (#307)	Kawrakow
	* MoE improvements on Metal This version beats mainline, but there are things I don't understand: * Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the same, but we are 30% slower. Why? * Using actual GEMM, we beat mainline with ubatch size of 128. But then performance degrades. Why? * Some cleanup * Much better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01	docs: update README.md (#304)	Ikko Eltociear Ashimine

2025-04-01	Fix ARM_NEON build failure due to q8_2 (#303)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01	Quantization improvements (2) (#302)	Kawrakow
	* iq3_k: slightly better quantization Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main). * Small improvement for type-1 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01	Additional guards for interleaved quants (#299)	Kawrakow
	* Make sure no interleaved quants are being used for token embeddings also with `--pure` and/or `--custom-q`. * Simplify --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01	Fix #300 (#301)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29	Quantization improvements (#295)	Kawrakow
	* Better make_qx_quants Tested with q4_0 and q3_K (pure, imatrix), and the improvement is quite significant. * Same for iq4_nl, iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27	Make sure tensor row size is multiple of block size also when quantizing with --pure (#294)	Kawrakow
	* WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * Add check if selected type is possible with --pure I often want to quantize with --pure to see quantization performance without quantization mixes. But for models where there are tensors with row sizes that are not a multiple of 256, this results in a crash for k- and i-quants. Hence, let's add a check if the quant selected via --pure is applicable, and change it if not. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
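The applicability check described in the commit above can be sketched as follows. This is a minimal illustration, not the repository's code: `QuantType`, `block_size`, `row_fits`, and `choose_pure_type` are hypothetical names, and falling back to a 32-weight-block type is an assumption. The commit only says the selected type is changed when it is not applicable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of quantization types, for illustration only.
enum class QuantType { Q8_0, Q4_K };

// Weights per quantization block: 32 for q8_0, 256 for k-quants.
inline int64_t block_size(QuantType t) {
    return t == QuantType::Q4_K ? 256 : 32;
}

// True if a tensor row of n_per_row weights splits into whole blocks of type t.
inline bool row_fits(QuantType t, int64_t n_per_row) {
    return n_per_row % block_size(t) == 0;
}

// Keep the requested type when the row fits; otherwise switch to the fallback.
// The fallback choice is an assumption, not the repository's actual rule.
inline QuantType choose_pure_type(QuantType requested, int64_t n_per_row,
                                  QuantType fallback = QuantType::Q8_0) {
    if (row_fits(requested, n_per_row)) {
        return requested;
    }
    assert(row_fits(fallback, n_per_row)); // the fallback itself must be applicable
    return fallback;
}
```

With these definitions, a row of 4096 weights keeps a 256-block type, while a row of 2880 weights (a multiple of 32 but not of 256) is switched to the fallback instead of crashing.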
2025-03-27	Use bf16 instead of fp16 block scales for q8_1 (#292)	Kawrakow
	* WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * q8_0_r8 on avx2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
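For readers unfamiliar with the format named in the commit above: bf16 is the upper 16 bits of an IEEE float32, so it keeps float32's exponent range with fewer mantissa bits, whereas fp16 tops out at 65504. The sketch below only illustrates the format; the commit does not state its motivation, and `f32_to_bf16`, `bf16_to_f32`, and `bf16_round` are hypothetical helper names, not functions from the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Convert float32 to bf16 with round-to-nearest, ties-to-even.
inline uint16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate and set the quiet bit so it stays a NaN.
        return static_cast<uint16_t>((u >> 16) | 0x0040u);
    }
    u += 0x7fffu + ((u >> 16) & 1u); // rounding bias; ties go to the even result
    return static_cast<uint16_t>(u >> 16);
}

// Convert bf16 back to float32 by restoring the low 16 bits as zeros.
inline float bf16_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Round-trip helper: the value a float scale takes after being stored as bf16.
inline float bf16_round(float f) {
    return bf16_to_f32(f32_to_bf16(f));
}
```

A scale of `1.0f` survives the round trip exactly, and a value such as `100000.0f`, which lies beyond the largest finite fp16 value, remains finite in bf16 at reduced precision.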
2025-03-25	llama-bench: enable having different number of threads for tg and pp (#284)	Kawrakow
	* llama-bench: enable having different number of threads for tg and pp * Add -tgb to usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25	Update sweep bench (deprecating .jsonl support) (#289)	saood06
	* Update sweep bench (deprecating .jsonl support) * Fix README.md
2025-03-25	CUDA: better MoE implementation (#283)	Kawrakow
	* Make fused MoE reproducible As a bonus, peak performance at pp2048 with u_batch = 2048 is ~8% better. * Slightly better * Also do it for non-fused mul_mat_id --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23	Improve DeepSeek batched processing speed (#282)	Kawrakow
	* Improve DeepSeek batched processing speed * Revert the commented out section in iqk_mul_mat.cpp It does have some benefit at long contexts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23	Attempt to improve FlashMLA on the CPU (#277)	Kawrakow
	* Fix it for nth > rk2 * Handle rk2%nth_k != 0 * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23	Test transparent huge pages on Linux (#278)	Kawrakow
	* Adding ability to use THP on Linux * Use the actual page size used for mmap also in munmap * Add -thp to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22	Native build option for CUDA when GGML_NATIVE is set (#280)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22	Fighting with cmake (#279)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22	Add Gemma3 support (text only) (#276)	Kawrakow
	* WIP Gemma3: not working * gemma3: build_gemma3 seems to be working now * Revert changes to convert_hf_to_gguf.py It wasn't working, so I guess it is better to leave the conversion up to upstream. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21	Fix bug: missing parentheses in logical expression (#275)	Kawrakow
	This results in GGGGGGGGGGGGG when generating with mla = 3, fa = 0. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21	Specify tensor name regex for tensors to be repacked (#274)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21	FlashMLA-3: the best of both worlds (CPU only) (#273)	Kawrakow
	* Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include * FlashMLA-3: the best of both worlds - CPU only --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21	Convert models to row-interleaved quants using the quantize tool (#272)	Kawrakow
	* Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19	Honor mmap setting when using tensor overrides (#270)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19	Fix ggml_compute_forward_dup_q (#269)	Kawrakow
	I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19	Prevent FlashMLA-1 from running on CUDA (#268)	Kawrakow
	as it is not supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18	Allow q8_0 cache on the CPU for FlashMLA-2 (#265)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18	Make Q8_0 KV cache work with mla=2,fa on CUDA (#264)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18	Fix #261 (#262)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18	Compile time option to use bf16 for quants without MMQ kernels (#261)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18	FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260)	Kawrakow
	* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. * FlashMLA-2: avoid conversions to f32 also on CUDA * Be able to compute for more than 65535 tokens On CUDA just a quick hack that allows us to concatenate tensors with more than 65535 rows along zeroth dimension as needed by FlashMLA-2. Also needed some care in the perplexity tool to avoid int overflows when evaluating the computed logits. * Reduce memory usage for FlashMLA-2 Oh, also fix int overflow in the CUDA concat implementation. It is funny how the llama.cpp 64-bit police has gone (almost) everywhere and replaced 32-bit ints with 64-bit ints, needed or not, but hasn't done it where it is actually needed. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
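The int overflows mentioned in the commit above arise when a row count is multiplied by a row length in 32-bit arithmetic. The following is a minimal sketch of the general fix, doing the index arithmetic in 64 bits; `overflows_int32` and `flat_index` are hypothetical helpers, and the sizes in the usage note are illustrative, not taken from the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if rows * cols cannot be represented in a signed 32-bit int.
// The division avoids performing the overflowing multiplication itself.
inline bool overflows_int32(int64_t rows, int64_t cols) {
    assert(rows >= 0 && cols >= 0);
    return cols != 0 && rows > std::numeric_limits<int32_t>::max() / cols;
}

// Flat element index of (row, col) in a row-major matrix, computed in 64 bits.
inline int64_t flat_index(int64_t row, int64_t n_cols, int64_t col) {
    assert(col >= 0 && col < n_cols);
    return row * n_cols + col;
}
```

For instance, a logits matrix with 65536 rows and 129280 columns has more elements than a 32-bit int can index, so `overflows_int32(65536, 129280)` is true, while `flat_index` still returns the correct offset because every operand is `int64_t`.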
2025-03-17	Prepare wk_b tensors of DeepSeek models on the fly (#259)	Kawrakow
	* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-13	FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)	Kawrakow
	* FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-12	MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252)	Kawrakow
	* FlashMLA(CUDA): WIP to allow q8_0 quantized cache * WIP * FlashMLA(CUDA) - allow q8_0 for KV cache This works, and PP is not bad, but TG is still quite a bit slower. * FlashMLA(CUDA) - allow q8_0 for KV cache This is better. ~9% slower than f16 cache for short contexts, nearly on par at 16k tokens. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>