Age  Commit message  Author
2025-05-03  cmake: force MSVC compiler charset to utf-8 (#369)  Gaolingx
2025-05-03  Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02  Fix FA bug on AVX2 (#364)  Kawrakow
    * Fix FA bug on AVX2
    * Also this was wrong
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02  Fix model architecture name (#366)  saood06
    Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-04-30  Update README.md (#352)  Kawrakow
    * Update README.md
    * Edits
    * Updates
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30  Fix IQK_FA_ALL_QUANTS on AVX2 (#360)  Kawrakow
    * Fix IQK_FA_ALL_QUANTS on AVX2
    * Make it also work, not just compile
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29  Add missing enum values for qwen3 and qwen3moe (#356)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29  Apply Qwen3 PR from llama.cpp (#355)  Ben Harris
2025-04-29  Update AUTHORS  Kawrakow
    Add @ubergarm
2025-04-29  CPU FA improvements (#351)  Kawrakow
    * FA: provide work buffer for K repacking
    * Add header to avoid compiler warnings
    * WIP (several intermediate steps, including Zen4)
    * Slightly better
    * Try to improve for unusual numbers of heads/threads
    * Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
    * Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
    * Use Sum4q4 for q4_0
    * Much better FA TG with q8_0 KV cache: just repack it even for TG, but do the repacking for k_step rows, not the whole K tensor (see the sketch below).
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
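The last item above is the interesting one for CPU token generation: instead of repacking the whole q8_0 K cache up front, repack it k_step rows at a time into a reused work buffer. A minimal sketch of that idea, with hypothetical helper names (repack_q8_0_rows, attn_dot_on_repacked) standing in for the repo's actual kernels:

```cpp
// Sketch only: process the q8_0 K cache k_step rows at a time, repacking each
// chunk into a reused work buffer instead of repacking the whole K tensor.
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the actual repack and dot-product kernels.
static void repack_q8_0_rows(const uint8_t * src, uint8_t * dst, int64_t n_rows) { (void)src; (void)dst; (void)n_rows; }
static void attn_dot_on_repacked(const uint8_t * repacked, int64_t n_rows)       { (void)repacked; (void)n_rows; }

void flash_attn_tg_chunked(const uint8_t * K_q8_0, int64_t n_kv, int64_t k_step, size_t row_size) {
    std::vector<uint8_t> work(k_step * row_size);   // work buffer, reused for every chunk
    for (int64_t i0 = 0; i0 < n_kv; i0 += k_step) {
        const int64_t n = std::min(k_step, n_kv - i0);
        repack_q8_0_rows(K_q8_0 + i0*row_size, work.data(), n);  // repack only this chunk
        attn_dot_on_repacked(work.data(), n);                    // Q*K^T over the repacked rows
    }
}
```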
2025-04-26  Add GLM-4-0414 Model Support (#344)  ubergarm
    * Add GLM-4-0414 model support, based on zRzRzRzRzRzRzR's PR on mainline llama.cpp. Still some cases where it doesn't work: offloading >= 60 layers to GPU, and no flash attention.
    * Remove seemingly unused llm_tensor enums. Both appear unused, and LLM_TENSOR_ATTN_POST_NORM already existed and looks very similar; they don't seem to be used in the Python code either, so remove them as likely cruft: LLM_TENSOR_POST_ATTN_NORM, LLM_TENSOR_POST_MLP_NORM.
    * Set flash attention precision to f32 on the GLM4 arch.
    * Set non-flash-attention precision to f32 on GLM4 (see the precision sketch below).
    * Remove reshape_3d() for Vcur in build_glm4(); this fixes non-flash-attention inference on both CPU and CUDA.
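The two precision items above amount to requesting f32 accumulation for the attention K*Q matmul when the GLM4 graph is built. A minimal sketch using ggml's public ggml_mul_mat_set_prec() helper; the surrounding function is illustrative, not the actual build_glm4() code:

```cpp
#include "ggml.h"

// Sketch: build K*Q^T and request f32 accumulation, the kind of change the
// "set precision to f32 on GLM4" items refer to. Graph-building context omitted.
static struct ggml_tensor * build_kq_f32(struct ggml_context * ctx,
                                         struct ggml_tensor * k,
                                         struct ggml_tensor * q) {
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);   // K*Q^T
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);            // accumulate in f32 instead of f16
    return kq;
}
```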
2025-04-26  Fix division by zero bug (#349)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26  Add support for Cohere2 (#341)  Kawrakow
    * Add support for Cohere2
    * Fix IQ4_NL on AVX2
    * Command-A needs fp32 precision for K*Q
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25  Fix q4_1 and q5_1 on Arm (#348)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25  Add ability to manually set arch flags (#347)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25  Fix FA on ARM (#346)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25  Fix LLaMA-4 attention (#342)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24  cuda: use switch in constexpr funcs (#343)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
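The constexpr-switch change is a C++ style point: a switch inside a constexpr function lets the compiler resolve a type-to-value mapping at compile time, replacing chains of ternaries or if/else. An illustrative example (the enum and sizes here are made up, not the repo's tables):

```cpp
// Illustrative only: a constexpr function using a switch, the pattern the
// commit title refers to. Enum values and block sizes are examples.
enum class quant_type { q4_0, q8_0, f16 };

constexpr int block_size(quant_type t) {
    switch (t) {
        case quant_type::q4_0: return 32;
        case quant_type::q8_0: return 32;
        case quant_type::f16:  return 1;
    }
    return -1; // unreachable; keeps the compiler happy
}

static_assert(block_size(quant_type::q8_0) == 32, "evaluated at compile time");
```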
2025-04-24  Update gguf-py constants (#298)  saood06
    * Update GGMLQuantizationType
    * Update LlamaFileType
    * Update GGML_QUANT_SIZES
2025-04-22  BitNet adjustments (#338)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22  Add support for bitnet2b_2501 model (#337)  saood06
    * Add support for bitnet2b_2501 model
    * Fixes
    * Support both model names
    Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-21  Fix termux/android build (#336)  saood06
    * Attempt fix (attempts 1 through 13)
2025-04-17  Better TG performance for GQA models (CPU) (#332)  Kawrakow
    * Slightly better CPU TG performance for GQA
    * Better CPU FA implementation for TG with GQA
    * Minor
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15  Better gemm/gemv on AVX2 for q4_0_r8 (#331)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15  Allow q8_0 KV cache for head size 256 (#330)  Kawrakow
    * Allow q8_0 KV cache for head size 256
    * We also need these
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14  imatrix: collect layer influence statistics (#328)  Kawrakow
    * imatrix: collect layer influence statistics (see the sketch below)
    * imatrix: collect layer influence statistics also for the last layer. For the last layer we need to use the input for the output.weight tensor. The last layer(s) tend to be important, so it is useful to also have their influence metric.
    * imatrix: separate metrics for attention and ffn importance
    * Use the stripped tensor name, not src0->name
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
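Roughly, "layer influence statistics" means accumulating a per-layer importance metric while imatrix data is collected, keyed by the stripped tensor name and split into attention and ffn groups. The sketch below uses a simple mean absolute activation as the metric purely for illustration; the actual metric and data structures in the imatrix tool differ:

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Sketch: per-layer influence accumulator. This only illustrates "accumulate a
// statistic per layer, keyed by stripped tensor name, with attention and ffn
// tracked separately"; the real tool computes a different influence metric.
struct layer_influence {
    double sum = 0.0;
    size_t n   = 0;
    void add(const float * x, size_t nx) {
        for (size_t i = 0; i < nx; ++i) sum += std::fabs(x[i]);
        n += nx;
    }
    double mean() const { return n ? sum / n : 0.0; }
};

static std::map<std::string, layer_influence> g_attn_influence; // attention tensors
static std::map<std::string, layer_influence> g_ffn_influence;  // ffn tensors

void collect(const std::string & stripped_name, bool is_attn, const float * x, size_t nx) {
    (is_attn ? g_attn_influence : g_ffn_influence)[stripped_name].add(x, nx);
}
```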
2025-04-14  Add ability to hide imatrix details in llama-quantize (#329)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13  Improved IQ1_M quantization (#327)  Kawrakow
    * Much faster, and it looks like better, iq1_m quantization
    * Cleanup
    * Minor
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12  Fix KLD precision (#325)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11  Correct L4 rms_norm (#324)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10  LLaMA-4 support (text only) (#321)  Kawrakow
    * llama4: WIP
    * llama4: this seems to be working
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08  Guard against attempts to use MLA for non-MLA models (#320)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07  Update AUTHORS  Kawrakow
    Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07  Update AUTHORS  Kawrakow
    Forgot to add @Nexesenex
2025-04-07  Use links for ggml/llama.cpp authors (#318)  Kawrakow
    * Use links for ggml/llama.cpp authors
    * This file is not html
    * More
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07  Better iq2_xs quantization (#312)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07  Add copyright notices (#317)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07  Update LICENSE  Kawrakow
    I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) are not the same as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). This PR corrects my mistake.
2025-04-05  We need to synchronize before using device to host async memcpy (#313)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
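One common reading of the #313 title: the destination of an asynchronous device-to-host copy is only valid after the stream has been synchronized, so synchronize before the host uses the data. A minimal host-side sketch of that pattern (not the repo's actual code path):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch: after cudaMemcpyAsync(DeviceToHost), the host buffer is not valid
// until the stream has been synchronized; reading it earlier is a race.
void copy_result_to_host(const float * d_result, float * h_result, size_t n, cudaStream_t stream) {
    cudaMemcpyAsync(h_result, d_result, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);               // wait for the copy (and prior work on the stream)
    printf("first element: %f\n", h_result[0]);  // safe to read only after the sync
}
```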
2025-04-04  Add -flax-vector-conversions for GCC on ARM (#311)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03  Metal: FA and FlashMLA (#310)  Kawrakow
    * Metal: WIP to update the Metal FA implementation. Dk=192, Dv=128 works, but not Dk=576, Dv=512.
    * Metal FA: go to float
    * WIP
    * Metal FA: MLA options now all work
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03  Fix GCC compilation errors on ARM (#309)  Kawrakow
    * Fix GCC compilation errors on ARM
    * One more
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03  Metal: much faster MoE prompt processing (#307)  Kawrakow
    * MoE improvements on Metal. This version beats mainline, but there are things I don't understand:
      * Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the same, but we are 30% slower. Why?
      * Using actual GEMM, we beat mainline with a ubatch size of 128, but then performance degrades. Why?
    * Some cleanup
    * Much better
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01  docs: update README.md (#304)  Ikko Eltociear Ashimine
2025-04-01  Fix ARM_NEON build failure due to q8_2 (#303)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01  Quantization improvements (2) (#302)  Kawrakow
    * iq3_k: slightly better quantization. Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main).
    * Small improvement for type-1 quants
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01  Additional guards for interleaved quants (#299)  Kawrakow
    * Make sure no interleaved quants are being used for token embeddings, also with `--pure` and/or `--custom-q` (see the sketch below).
    * Simplify
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
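The guard in #299 boils down to: if the requested type is an interleaved (row-repacked, e.g. *_r4/*_r8) quant and the tensor is the token embedding, fall back to the non-interleaved counterpart, even when --pure or --custom-q forced the type. A hedged sketch with hypothetical helpers (is_interleaved, base_type) and an illustrative enum:

```cpp
#include <string>

// Sketch of the guard idea: never use an interleaved quant for token_embd.weight.
// The enum is an illustrative subset; the real code works on ggml types.
enum class qtype { q4_0, q4_0_r8, q8_0, q8_0_r8 };

static bool  is_interleaved(qtype t) { return t == qtype::q4_0_r8 || t == qtype::q8_0_r8; }
static qtype base_type(qtype t) {
    switch (t) {
        case qtype::q4_0_r8: return qtype::q4_0;
        case qtype::q8_0_r8: return qtype::q8_0;
        default:             return t;
    }
}

qtype pick_type(const std::string & tensor_name, qtype requested) {
    if (tensor_name == "token_embd.weight" && is_interleaved(requested)) {
        return base_type(requested);   // fall back to the non-interleaved variant
    }
    return requested;
}
```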
2025-04-01  Fix #300 (#301)  Kawrakow
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29  Quantization improvements (#295)  Kawrakow
    * Better make_qx_quants. Tested with q4_0 and q3_K (pure, with imatrix), and the improvement is quite significant.
    * Same for iq4_nl, iq4_xs
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
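make_qx_quants is the helper that chooses a per-block scale for these quant types; schematically, it searches candidate scales and keeps the one with the smallest (optionally imatrix-weighted) squared error. The simplified illustration below shows that kind of search; it is not the actual implementation, and the candidate grid and rounding details are made up:

```cpp
#include <algorithm>
#include <cmath>

// Simplified illustration of a weighted per-block scale search in the spirit
// of make_qx_quants: try a few candidate scales around amax/nmax and keep the
// one minimizing the weighted quantization error. The real code is more elaborate.
float best_scale(const float * x, const float * w, int n, int nmax) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.f) return 0.f;
    float best = amax / nmax, best_err = INFINITY;
    for (int is = -4; is <= 4; ++is) {                   // candidate scales around amax/nmax
        const float d = amax / nmax * (1.f + 0.05f*is);
        float err = 0.f;
        for (int i = 0; i < n; ++i) {
            int   q = std::max(-nmax, std::min(nmax - 1, (int)std::lround(x[i] / d)));
            float e = x[i] - d*q;
            err += (w ? w[i] : 1.f) * e * e;             // imatrix weight if available
        }
        if (err < best_err) { best_err = err; best = d; }
    }
    return best;
}
```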
2025-03-27  Make sure tensor row size is multiple of block size also when quantizing with --pure (#294)  Kawrakow
    * WIP - not working
    * q8_0 without bells and whistles works
    * It works for q8_0
    * Use bf16 instead of f16,int16
    * q4_0_r8
    * q5_0_r4
    * q6_0_r4
    * Also q4_1 and q5_1
    * Add a check whether the selected type is possible with --pure. I often want to quantize with --pure to see quantization performance without quantization mixes. But for models where there are tensors with row sizes that are not a multiple of 256, this results in a crash for k- and i-quants. Hence, let's add a check whether the quant selected via --pure is applicable, and change it if not (see the sketch below).
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
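The --pure check described above reduces to: if a tensor's row length is not a multiple of the selected quant's block size (256 for k- and i-quants), swap in a type that fits instead of crashing. A small sketch using ggml_blck_size(); the q8_0 fallback is an illustrative choice, not necessarily what the repo picks:

```cpp
#include <cstdint>
#include "ggml.h"

// Sketch of the --pure applicability check: if a row is not a multiple of the
// selected type's block size, fall back to a type that fits.
static ggml_type pure_type_for_row(ggml_type requested, int64_t n_per_row) {
    const int64_t blck = ggml_blck_size(requested);
    if (n_per_row % blck != 0) {
        return GGML_TYPE_Q8_0;   // fallback for rows not divisible by the block size
    }
    return requested;
}
```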