ik_llama.cpp.git

Age	Commit message	Author
2025-05-07	FlashMLA-3 for DeepSeek models on CUDA (#386)	Kawrakow
	* CUDA WIP: support for FlashMLA-3
	* Much better
	  The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite)
	* Sadly, the previous commit was wrong
	* Finalizing
	* Also add these
	* Minor
	* Minor tweak
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07	Fix some MSVC build problems (#392)	Gaolingx
	* cmake: force MSVC compiler charset to utf-8 * build: apply MSVC /bigobj option to c/cpp files only * Update CMakeLists.txt
2025-05-07	Fix DeepSeek q8_0 cache (#391)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07	Fix build for Xeon Gold 6226R (#390)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-06	Update README.md	Kawrakow

2025-05-05	Fix DeepSeek FA (#382)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	CUDA: MMQ for IQ4_KS (#374)	Kawrakow
	* WIP * WIP: still getting illegal memory access * CUDA: MMQ for iq4_ks now works ~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	Update README.md	Kawrakow

2025-05-04	Update README.md	Kawrakow

2025-05-04	CUDA: faster FA TG for GQA models (#370)	Kawrakow
	* cuda: WIP MMA FA * Use MMA for TG also when quantized --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04	Another attempt to fix #367 (#371)	Kawrakow
	* Another attempt to fix #367 * Yet another --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03	cmake: force MSVC compiler charset to utf-8 (#369)	Gaolingx

2025-05-03	Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02	Fix FA bug on AVX2 (#364)	Kawrakow
	* Fix FA bug on AVX2 * Also this was wrong --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02	Fix model architecture name (#366)	saood06
	Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-04-30	Update README.md (#352)	Kawrakow
	* Update README.md * Edits * Updates --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30	Fix IQK_FA_ALL_QUANTS on AVX2 (#360)	Kawrakow
	* Fix IQK_FA_ALL_QUANTS on AVX2 * Make it also work, not just compile --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29	Add missing enum values for qwen3 and qwen3moe (#356)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29	Apply Qwen3 PR from llama.cpp (#355)	Ben Harris

2025-04-29	Update AUTHORS	Kawrakow
	Add @ubergarm
2025-04-29	CPU FA improvements (#351)	Kawrakow
	* FA: provide work buffer for K repacking
	* Add header to avoid compiler warnings
	* WIP
	* WIP
	* WIP
	* WIP
	* Slightly better
	* WIP (Zen4)
	* WIP
	* Try to improve for unusual number of heads/number of threads
	* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
	* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
	* Use Sum4q4 for q4_0
	* WIP
	* WIP
	* Much better FA TG with q8_0 KV cache
	  Just repack it even for TG. But do the repacking for k_step rows, not the whole K tensor.
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26	Add GLM-4-0414 Model Support (#344)	ubergarm
	* Add GLM-4-0414 model support
	  Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp. Still some issues where it doesn't work:
	  * offloading >=60 layers to GPU
	  * no flash attention
	* Remove seemingly unused llm_tensor enums
	  Both of these seem unused and LLM_TENSOR_ATTN_POST_NORM already existed which seems pretty similar? Don't think they were used in the python code either... So removed these as possibly just cruft:
	  * LLM_TENSOR_POST_ATTN_NORM
	  * LLM_TENSOR_POST_MLP_NORM
	* Set flash attention precision to f32 on GLM4 arch
	* Set non flash attention precision to f32 on GLM4
	* Remove reshape_3d() for Vcur in build_glm4()
	  This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26	Fix division by zero bug (#349)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26	Add support for Cohere2 (#341)	Kawrakow
	* Add support for Cohere2 * Fix IQ4_NL on AVX2 * Command-A needs fp32 precision for K*Q --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25	Fix q4_1 and q5_1 on Arm (#348)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25	Add ability to manually set arch flags (#347)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25	Fix FA on ARM (#346)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25	Fix LLaMA-4 attention (#342)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24	cuda: use switch in constexpr funcs (#343)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24	Update gguf-py constants (#298)	saood06
	* Update GGMLQuantizationType * Update LlamaFileType * Update GGML_QUANT_SIZES
2025-04-22	BitNet adjustments (#338)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22	Add support for bitnet2b_2501 model (#337)	saood06
	* add support for bitnet2b_2501 model * Fixes * Support both model names --------- Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-21	Fix termux/android build (#336)	saood06
	* Attempt fix * Attempt fix 2 * Attempt fix 3 * Attempt fix 4 * Attempt fix 5 * Attempt fix 6 * Attempt fix 7 * Attempt fix 8 * Attempt fix 9 * Attempt fix 10 * Attempt fix 11 * Attempt fix 12 * Attempt fix 13
2025-04-17	Better TG performance for GQA models (CPU) (#332)	Kawrakow
	* Slightly better CPU TG performance for GQA * Better CPU FA implementation for TG when GQA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15	Better gemm/gemv on AVX2 for q4_0_r8 (#331)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15	Allow q8_0 KV cache for head size 256 (#330)	Kawrakow
	* Allow q8_0 KV cache for head size 256 * We need also these --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14	imatrix: collect layer influence statistics (#328)	Kawrakow
	* imatrix: collect layer influence statistics
	* imatrix: collect layer influence statistics also for the last layer
	  For the last layer we need to use the input for the output.weight tensor. Last layer(s) tend(s) to be important, so it is useful to also have its influence metric.
	* imatrix: separate metric for attention and ffn importance
	* Use stripped tensor name, not src0->name
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14	Add ability to hide imatrix details in llama-quantize (#329)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13	Improved IQ1_M quantization (#327)	Kawrakow
	* Much faster and it looks like better iq1_m quantization * Cleanup * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12	Fix KLD precision (#325)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11	Correct L4 rms_norm (#324)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10	LLaMA-4 support (text only) (#321)	Kawrakow
	* llama4: WIP * llama4: this seems to be working --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08	Guard against attempts to use MLA for non-MLA models (#320)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Update AUTHORS	Kawrakow
	Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07	Update AUTHORS	Kawrakow
	Forgot to add @Nexesenex
2025-04-07	Use links for ggml/llama.cpp authors (#318)	Kawrakow
	* Use links for ggml/llama.cpp authors * This file is not html * More --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Better iq2_xs quantization (#312)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Add copyright notices (#317)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07	Update LICENSE	Kawrakow
	I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). This PR corrects my mistake.
2025-04-05	We need to synchronize before using device to host async memcpy (#313)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>