ik_llama.cpp.git

Age	Commit message	Author
2025-05-20	Bug fixes from mainline (#439)	Kawrakow
	* Add __syncthreads() to the new FA kernel * Clearing padding --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-18	Forgotten MMQ ref and typo (#431)	Nexes the Elder

2025-05-17	Option to enable/disable the IQK CPU FA kernels (#429)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17	Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428)	Kawrakow
	* Zen4: faster PP for iq4_ks and iq5_ks * Zen4: faster PP for iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17	IQ5_KS_R4: row-interleaved IQ5_KS (#426)	Kawrakow
	* iq5_ks_r4: basics * iq5_ks_r4: Zen4 works * iq5_ks_r4: AVX2 works * iq5_ks_r4: NEON * Fix iq5_ks on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16	Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427)	Kawrakow
	* Fix IQ4_K on AVX2 * Fix IQ4_KS on AVX2 * Fix IQ5_K on AVX2 * Fix IQ6_K on AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15	Adding forgotten template instance for iq5_ks (#424)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15	Adding IQ5_KS - 5.25 bpw quants (#422)	Kawrakow
	* iq5_ks: basics * iq5_ks: quantize * iq5_ks: CUDA dequantize works * iq5_ks: dot product works on CUDA * iq5_ks: MMQ works * iq5_ks: Zen4 * iq5_ks: AVX2 But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2. * iq5_ks: NEON * iq5_ks: Metal dequantize * iq5_ks: Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15	Fix standard attention on the CPU (#421)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15	CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)	Kawrakow
	* MMQ for iq2_k * This works * MMQ for iq3_k * MMQ for iq2_ks * Fix iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14	CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)	Kawrakow
	* MMQ for iq4_k: WIP (not working) * MMQ for iq4_k: working now * MMQ for iq5_k * Cleanup * MMQ for iq5_k: slightly faster * MMQ for iq6_k --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14	Fix SER (CUDA) (#416)	Kawrakow
	* Fixing SER bugs * Cleanup * This seems to fix it. * This seems to work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13	Fix SER (CPU) (#415)	Kawrakow
	* Fixing SER bugs * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13	Fix imatrix calculation for MLA models (#411)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13	Better CPU FA performance for DeepSeek-Lite (#410)	Kawrakow
	* Better CPU FA performance for DeepSeek-Lite * It must be like this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12	Update README.md	Kawrakow

2025-05-12	Fix new CUDA FA on Turing (#413)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12	Add batch warmup to sweep-bench (#375)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12	Enable faster prompt processing with mainline llama.cpp GGUFs (#409)	Kawrakow
	* Enable MLA-3 in crippled GGUFs: WIP * Enable MLA-3 in crippled GGUFs: seems to work * Add newly created tensors to model.tensors_by_name Else they don't get run-time repacked. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12	Faster DeepSeek FA on CUDA (#408)	Kawrakow
	* New DeepSeek FlashMLA Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that. * Rearrange MLA K cache so it fits the new CUDA FA implementation * constexpr and minor changes --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12	GPU offload policy (#405)	Kawrakow
	* Adding GPU offload policy * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-11	Revert "Fix race in the CUDA DeepSeek FA kernel (#406)"	Iwan Kawrakow
	This reverts commit 36e6e888b75ae93fb5aac212bb0e147d8379ae23. I should have tested. We get NaNs.
2025-05-11	Fix race in the CUDA DeepSeek FA kernel (#406)	Kawrakow
	Reference: https://github.com/ggml-org/llama.cpp/pull/13438 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10	TG improvements for MoE models (#404)	Kawrakow
	* cuda: Remove unnecessary device to host copy of row ids We get 3-4% TG speed improvement for DeepSeek-Lite just from that. * CPU: fix get_rows when SER is used With smart experts reduction (SER), one potentially uses fewer experts than specified by the model. This is accomplished by setting the ID of the not-selected experts to -1. Most of the necessary stuff was implemented when I added the SER option, but I forgot to update get_rows() for non-quantized tensors. As a result, we get random garbage for the weights of the not-selected experts, which leads to garbage output. This commit fixes it on the CPU. I'm not quite sure yet why the GPU is not working. * CUDA: fix TG with SER --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
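	The get_rows() problem described in this commit can be illustrated with a small sketch. It is not the repository's code: `get_rows_f32` is a hypothetical helper, and zero-filling rows whose ID is -1 is an assumption about one reasonable way to keep dropped experts from carrying stale data.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not the ik_llama.cpp API: copy the rows of `src`
// selected by `ids` into `dst`, one output row per entry in `ids`.
// A negative id marks an expert that SER decided not to use.
// Assumption: such a row is zero-filled instead of being left uninitialized.
void get_rows_f32(const float* src, int64_t row_size,
                  const std::vector<int32_t>& ids, float* dst) {
    const int64_t n = static_cast<int64_t>(ids.size());
    const size_t row_bytes = static_cast<size_t>(row_size) * sizeof(float);
    for (int64_t i = 0; i < n; ++i) {
        float* out = dst + i * row_size;
        if (ids[i] < 0) {
            // Not-selected expert: write zeros so no garbage reaches the output.
            std::memset(out, 0, row_bytes);
            continue;
        }
        // Selected expert: copy its row unchanged.
        std::memcpy(out, src + static_cast<int64_t>(ids[i]) * row_size, row_bytes);
    }
}
```

	Without the negative-ID branch, the copy would either read out of bounds or leave the destination row untouched, which matches the "random garbage" symptom the commit message describes.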
2025-05-09	Handle incompatible DeepSeek GGUFs (#394)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09	Fix missing rope_freqs with convert_hf_to_gguf (#402)	saood06
	* lora : fix llama conversion script with ROPE_FREQS * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09	Update README.md	Kawrakow
	@saood06 Thanks!
2025-05-09	Fix CUDA FlashMLA-3 with quantized KV cache (#400)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09	Update README.md	Kawrakow

2025-05-09	Support for Llama-3-Nemotron models (#377)	saood06
	* conflict resolution * Changes to make work and add longrope support * Changes to n_attention_wv rule * Untested support of 253B * DeciLMCausalModel now reads rope_theta from config.json properly * Remove errant Granite mentions * Better n_attention_wv rule * Update vocab.py --------- Co-authored-by: Yee Man Chan <ymchan@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07	Update README.md	Kawrakow

2025-05-07	FlashMLA-3 for DeepSeek models on CUDA (#386)	Kawrakow
	* CUDA WIP: support for FlashMLA-3 * Much better The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite) * Sadly, the previous commit was wrong * Finalizing * Also add these * Minor * Minor tweak --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07	Fix some MSVC build problems (#392)	Gaolingx
	* cmake: force MSVC compiler charset to utf-8 * build: apply MSVC /bigobj option to c/cpp files only * Update CMakeLists.txt
2025-05-07	Fix DeepSeek q8_0 cache (#391)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07	Fix build for Xeon Gold 6226R (#390)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-06	Update README.md	Kawrakow

2025-05-05	Fix DeepSeek FA (#382)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	CUDA: MMQ for IQ4_KS (#374)	Kawrakow
	* WIP * WIP: still getting illegal memory access * CUDA: MMQ for iq4_ks now works ~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	Update README.md	Kawrakow

2025-05-04	Update README.md	Kawrakow

2025-05-04	CUDA: faster FA TG for GQA models (#370)	Kawrakow
	* cuda: WIP MMA FA * Use MMA for TG also when quantized --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	Another attempt to fix #367 (#371)	Kawrakow
	* Another attempt to fix #367 * Yet another --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03	cmake: force MSVC compiler charset to utf-8 (#369)	Gaolingx

2025-05-03	Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02	Fix FA bug on AVX2 (#364)	Kawrakow
	* Fix FA bug on AVX2 * Also this was wrong --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02	Fix model architecture name (#366)	saood06
	Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-04-30	Update README.md (#352)	Kawrakow
	* Update README.md * Edits * Updates --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30	Fix IQK_FA_ALL_QUANTS on AVX2 (#360)	Kawrakow
	* Fix IQK_FA_ALL_QUANTS on AVX2 * Make it also work, not just compile --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29	Add missing enum values for qwen3 and qwen3moe (#356)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29	Apply Qwen3 PR from llama.cpp (#355)	Ben Harris