path: root/ggml/src
Age | Commit message | Author
2025-05-15Adding IQ5_KS - 5.25 bpw quants (#422)Kawrakow
* iq5_ks: basics * iq5_ks: quantize * iq5_ks: CUDA dequantize works * iq5_ks: dot product works on CUDA * iq5_ks: MMQ works * iq5_ks: Zen4 * iq5_ks: AVX2 But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2. * iq5_ks: NEON * iq5_ks: Metal dequantize * iq5_ks: Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15Fix standard attention on the CPU (#421)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)Kawrakow
* MMQ for iq2_k * This works * MMQ for iq3_k * MMQ for iq2_ks * Fix iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)Kawrakow
* MMQ for iq4_k: WIP (not working) * MMQ for iq4_k: working now * MMQ for iq5_k * Cleanup * MMQ for iq5_k: slightly faster * MMQ for iq6_k --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14Fix SER (CUDA) (#416)Kawrakow
* Fixing SER bugs * Cleanup * This seems to fix it. * This seems to work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13Fix SER (CPU) (#415)Kawrakow
* Fixing SER bugs * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13Better CPU FA performance for DeepSeek-Lite (#410)Kawrakow
* Better CPU FA performance for DeepSeek-Lite * It must be like this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12Fix new CUDA FA on Turing (#413)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12Faster DeepSeek FA on CUDA (#408)Kawrakow
* New DeepSeek FlashMLA Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that. * Rearrange MLA K cache so it fits the new CUDA FA implementation * constexpr and minor changes --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12GPU offload policy (#405)Kawrakow
* Adding GPU offload policy * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-11Revert "Fix race in the CUDA DeepSeek FA kernel (#406)"Iwan Kawrakow
This reverts commit 36e6e888b75ae93fb5aac212bb0e147d8379ae23. I should have tested. We get NaNs.
2025-05-11Fix race in the CUDA DeepSeek FA kernel (#406)Kawrakow
Reference: https://github.com/ggml-org/llama.cpp/pull/13438 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10TG improvements for MoE models (#404)Kawrakow
* cuda: Remove unnecessary device to host copy of row ids We get 3-4% TG speed improvement for DeepSeek-Lite just from that. * CPU: fix get_rows when SER is used With smart experts reduction (SER), one potentially uses fewer experts than specified by the model. This is accomplished by setting the ID of the not-selected experts to -1. Most of the necessary stuff was implemented when I added the SER option, but I forgot to update get_rows() for non-quantized tensors. As a result, we get random garbage for the weights of the not-selected experts, which leads to garbage output. This commit fixes it on the CPU. I'm not quite sure yet why the GPU is not working. * CUDA: fix TG with SER --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09Fix CUDA FlashMLA-3 with quantized KV cache (#400)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07FlashMLA-3 for DeepSeek models on CUDA (#386)Kawrakow
* CUDA WIP: support for FlashMLA-3 * Much better The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite) * Sadly, the previous commit was wrong * Finalizing * Also add these * Minor * Minor tweak --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07Fix DeepSeek q8_0 cache (#391)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07Fix build for Xeon Gold 6226R (#390)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05Fix DeepSeek FA (#382)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04CUDA: MMQ for IQ4_KS (#374)Kawrakow
* WIP * WIP: still getting illegal memory access * CUDA: MMQ for iq4_ks now works ~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04CUDA: faster FA TG for GQA models (#370)Kawrakow
* cuda: WIP MMA FA * Use MMA for TG also when quantized --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04Another attempt to fix #367 (#371)Kawrakow
* Another attempt to fix #367 * Yet another --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02Fix FA bug on AVX2 (#364)Kawrakow
* Fix FA bug on AVX2 * Also this was wrong --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30Fix IQK_FA_ALL_QUANTS on AVX2 (#360)Kawrakow
* Fix IQK_FA_ALL_QUANTS on AVX2 * Make it also work, not just compile --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29CPU FA improvements (#351)Kawrakow
* FA: provide work buffer for K repacking * Add header to avoid compiler warnings * WIP * WIP * WIP * WIP * Slightly better * WIP (Zen4) * WIP * Try to improve for unusual number of heads/number of threads * Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA * Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA * Use Sum4q4 for q4_0 * WIP * WIP * Much better FA TG with q8_0 KV cache Just repack it even for TG. But do the repacking for k_step rows, not the whole K tensor. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26Fix division by zero bug (#349)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26Add support for Cohere2 (#341)Kawrakow
* Add support for Cohere2 * Fix IQ4_NL on AVX2 * Command-A needs fp32 precision for K*Q --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25Fix q4_1 and q5_1 on Arm (#348)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25Add ability to manually set arch flags (#347)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25Fix FA on ARM (#346)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24cuda: use switch in constexpr funcs (#343)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-21Fix termux/android build (#336)saood06
* Attempt fix * Attempt fix 2 * Attempt fix 3 * Attempt fix 4 * Attempt fix 5 * Attempt fix 6 * Attempt fix 7 * Attempt fix 8 * Attempt fix 9 * Attempt fix 10 * Attempt fix 11 * Attempt fix 12 * Attempt fix 13
2025-04-17Better TG performance for GQA models (CPU) (#332)Kawrakow
* Slightly better CPU TG performance for GQA * Better CPU FA implementation for TG when GQA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15Better gemm/gemv on AVX2 for q4_0_r8 (#331)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15Allow q8_0 KV cache for head size 256 (#330)Kawrakow
* Allow q8_0 KV cache for head size 256 * We need also these --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13Improved IQ1_M quantization (#327)Kawrakow
* Much faster and it looks like better iq1_m quantization * Cleanup * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Better iq2_xs quantization (#312)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07Add copyright notices (#317)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05We need to synchronize before using device to host async memcpy (#313)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04Add -flax-vector-conversions for GCC on ARM (#311)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Metal: FA and FlashMLA (#310)Kawrakow
* Metal: WIP to update Metal FA implementation Dk=192, Dv=128 works, but not Dk = 576, Dv = 512 * Metal FA: go to float * WIP * Metal FA: MLA options now all work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Fix GCC compilation errors on ARM (#309)Kawrakow
* Fix GCC compilation errors on ARM * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03Metal: much faster MoE prompt processing (#307)Kawrakow
* MoE improvements on Metal This version beats mainline, but there are things I don't understand: * Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the same, but we are 30% slower. Why? * Using actual GEMM, we beat mainline with ubatch size of 128. But then performance degrades. Why? * Some cleanup * Much better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Fix ARM_NEON build failure due to q8_2 (#303)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Quantization improvements (2) (#302)Kawrakow
* iq3_k: slightly better quantization Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main). * Small improvement for type-1 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01Fix #300 (#301)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29Quantization improvements (#295)Kawrakow
* Better make_qx_quants Tested with q4_0 and q3_K (pure, imatrix), and the improvement is quite significant. * Same for iq4_nl, iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27Use bf16 instead of fp16 block scales for q8_1 (#292)Kawrakow
* WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * q8_0_r8 on avx2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25CUDA: better MoE implementation (#283)Kawrakow
* Make fused MoE reproducible As a bonus, peak performance at pp2048 with u_batch = 2048 is ~8% better. * Slightly better * Also do it for non-fused mul_mat_id --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23Improve DeepSeek batched processing speed (#282)Kawrakow
* Improve DeepSeek batched processing speed * Revert the commented out section in iqk_mul_mat.cpp It does have some benefit at long contexts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>