Age  Commit message  Author
2025-03-05DeepSeek CUDA Flash Attention (#241)Kawrakow
* WIP CUDA FA with Dk != Dv * WIP * CUDA FA WIP - It actually works! No TG yet, but for PP I can run FA with fp16 cache and it gets the same answer. * CUDA FA WIP - it now works for Q8_0 + Q8_0 for KV cache * CUDA FA WIP - TG, not working yet. * CUDA FA with Dk != Dv: it works now for DeepSeek --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-03Flash MLA (CPU only) (#240)Kawrakow
* FlashMLA - it finally works (on the CPU) * FlashMLA: allow for f16 and bf16 cache in addition to q8_0 * It works with ggml FA, not with iqk FA * WIP * FlashMLA: it now works with iqk I had forgotten to divide the Q stride by sizeof(float) and that's why, very confusingly, it was working for TG but not for PP. * WIP * FlashMLA: that should be it for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
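For readers unfamiliar with ggml's stride conventions, here is a minimal sketch of the bug class mentioned above (hypothetical helper; ggml stores strides in bytes in `nb[]`, while a `float *` is indexed in elements, so the byte stride must be divided by `sizeof(float)`):

```cpp
#include <cstddef>

// Sketch only: read element (row, col) of a float tensor whose row stride is
// given in bytes. Forgetting the division by sizeof(float) happens to work when
// only a single row is read (TG), but breaks multi-row access (PP).
float at(const float * data, size_t nb1_bytes, int row, int col) {
    const size_t row_stride = nb1_bytes / sizeof(float);
    return data[row * row_stride + col];
}
```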
2025-03-02SER - Smart Expert Reduction (#239)Kawrakow
* A better way to measure the cost of ggml_barrier * Smart expert selection * Add ser option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-01A better way to measure the cost of ggml_barrier (#238)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-01Reduce size of compute buffers (#237)Kawrakow
* This reduces compute buffer size for MLA * This should accomplish it for standard attention * Much better * Better concat for contiguous tensors If all the op does is to concatenate the second tensor to the first, why would we want to have a loop? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
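A minimal sketch of the contiguous-concat shortcut described above (illustrative only, not the actual ggml kernel): when both inputs and the destination are contiguous and the op simply appends the second tensor to the first, two bulk copies do the job without any per-element loop.

```cpp
#include <cstring>
#include <cstdint>
#include <cstddef>

// Illustrative: concatenate two contiguous byte buffers into dst.
void concat_contiguous(const uint8_t * a, size_t a_bytes,
                       const uint8_t * b, size_t b_bytes,
                       uint8_t * dst) {
    std::memcpy(dst,           a, a_bytes);  // first tensor in front
    std::memcpy(dst + a_bytes, b, b_bytes);  // second tensor appended right after
}
```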
2025-02-27Option to use MLA without a transposed cache (#235)Kawrakow
The `-mla` command line option changes from a bool to an int. mla = 0: use standard attention mla = 1: use MLA with transposed cache mla > 1: use MLA without transposed cache Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
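A small sketch of how the three values could be dispatched (hypothetical enum and helper, not the actual llama.cpp code path):

```cpp
enum class AttnKind { Standard, MLATransposedCache, MLANoTransposedCache };

// -mla = 0 -> standard attention, 1 -> MLA with transposed cache, >1 -> MLA without it.
AttnKind pick_attention(int mla) {
    if (mla == 0) return AttnKind::Standard;
    if (mla == 1) return AttnKind::MLATransposedCache;
    return AttnKind::MLANoTransposedCache;
}
```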
2025-02-27Faster MLA on CUDA (#234)Kawrakow
* Slight MLA TG performance improvement on CUDA The low MLA performance on CUDA is due to the wk_b * q_nope operation. It turns into n_head matrix multiplications with n_head separate quantization and GEMV steps. The associated overhead is just too much for TG where each GEMV is very fast (512 x 128 = 131 KFLOP for DeepSeek-Lite, 4X that for DeepSeekV3/R1). The way it was done, there was also a copy of each q_nope row before quantization, which I have now eliminated. This results in a ~2.5% speedup. What needs to happen instead is to launch a single computation that quantizes all heads, and then have a kernel that does the GEMV for all heads instead of n_head sequential GEMVs. * Slightly better * CUDA: Quantize non-contiguous tensors * Much better MLA It is a total hack, but it works. * Cleanup Remove duplicated gemv's. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
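For context, the quoted per-head cost is just the GEMV work for one head, counting a multiply and an add per matrix element; a tiny check of that arithmetic:

```cpp
#include <cstdio>

// Back-of-the-envelope cost of one head's wk_b * q_nope GEMV for DeepSeek-Lite:
// a 512 x 128 matrix-vector product, two FLOP (mul + add) per element.
int main() {
    const long rows = 512, cols = 128;
    const long flop = 2 * rows * cols;   // = 131072, i.e. ~131 KFLOP
    std::printf("per-head GEMV: %ld FLOP\n", flop);
    // With n_head such GEMVs launched one after another, launch and quantization
    // overhead dominates at TG batch size 1 -- hence the plan described above to
    // quantize all heads in one launch and batch the GEMVs into one kernel.
    return 0;
}
```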
2025-02-25Give the user the option to override where model weights are stored (#232)Kawrakow
* Give the user the option to override where model weights are stored * Fix ggml_nbytes() problem and cleanup For a tensor with zero elements ggml_nbytes() was returning uint64_t::max, and this was causing graph allocation failure. * Add timing info to CUDA graph evaluation * Add more timing info --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-24Fix #230 (#231)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23Fused MoE ffn_up and ffn_gate (#229)Kawrakow
* Fusing MoE up * unary(gate) * Fusing MoE up * unary(gate): CUDA We get ~13% speedup for PP-512 and ~2% for TG-128 for DeepSeek-Lite * On CUDA also fuse MoE down * (up * unary(gate)) in case the MUL_MAT_ID op for the down experts is the next op in the graph. * Command line option to enable fused MoE up*unary(gate) * Add fmoe option to llama-bench * Adding forgotten gelu, relu, silu on ARM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
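A scalar reference for the fused up*unary(gate) expert FFN path being described (a sketch under the usual SwiGLU formulation; the real code runs this as fused ggml/CUDA ops, with the down projection optionally fused in as well):

```cpp
#include <cmath>
#include <vector>

static float silu(float x) { return x / (1.0f + std::exp(-x)); }   // typical MoE gate activation

// One expert's up and gate projections evaluated in a single pass over x,
// combined as unary(gate) * up. W_up and W_gate are [n_ff][n_embd].
std::vector<float> fused_up_gate(const std::vector<std::vector<float>> & W_up,
                                 const std::vector<std::vector<float>> & W_gate,
                                 const std::vector<float> & x) {
    std::vector<float> out(W_up.size());
    for (size_t i = 0; i < W_up.size(); ++i) {
        float up = 0.0f, gate = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            up   += W_up[i][j]   * x[j];
            gate += W_gate[i][j] * x[j];
        }
        out[i] = silu(gate) * up;   // the fused result that feeds the down projection
    }
    return out;
}
```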
2025-02-23Add new sweep-bench benchmark (#225)saood06
* examples : add new sweep-bench benchmark * Change documentation to reference ik_llama.cpp * Made it compile with ik_llama * Fix JSONL output --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-23Fix compilation error with IQK_FA_ALL_QUANTS enabled (#226)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22Fix #217 (#220)Kawrakow
* Fix #217 * Remove stuff commited by mistake --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22Fuse MoE up and gate matrix multiplications (#219)Kawrakow
* This seems to be a better way to do the attention matrix multiplications in the TG case. * Cleanup * Fuse up and gate gemms in MoE models Small (~1-2%) but measurable performance gain --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-22Better strategy for attention matrix multiplications when generating tokens (#218)Kawrakow
* This seems to be a better way to do the attention matrix multiplications in the TG case. * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-21Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD (#216)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-20Honor attn_output specified in the command line also for low-bit quantsIwan Kawrakow
2025-02-20Fix NEON gemm/gemv for legacy quants when row size is not divisible by 128 (#213)Kawrakow
* Fix gemm/gemv for legacy quants when row size is not divisible by 128 * Fix typo --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-20Optimized GEMM/GEMV for IQ1_S (#212)Kawrakow
* Adding iq1_s to iqk_mul_mat (Zen4) * iq1_s: slightly better on Zen4 * iq1_s: AVX2 * iq1s: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19Q8_KV: 8-bit quantization type targeting the KV cache (#208)Kawrakow
* Adding q8_KV - Basics + AVX2 gemm/gemv * q8_KV: Better AVX2 gemm * q8_KV: Better Zen4 gemm We get 225.7 t/s for L3-8B. In comparison q8_0 without run-time-repacking is at 169 t/s. * q8_KV: AVX2 gemm/gemv We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr. * q8_KV: be able to use it for K cache This required quite a few fixes in ggml and llama.cpp: * ggml: do not calculate row size as n/block_size*type_size. I had removed most of it when implementing the quants with per row scale, but it was still lurking in ggml_copy. Not sure if these were the last remnants of ggml-style row sizes, or if there are still places left * llama.cpp: get rid of the 1d K cache assumption. Create and manage the K-cache as a 2D tensor so we can have per row meta data as needed by q8_KV. Using q8_KV for K-cache results in non-negligible performance gains. More details to follow, but for DeepSeek-Lite with MLA, we get 18% speedup for PP-8192 compared to q8_0 K-cache. * q8_KV: be able to use it for K cache in FA * q8_KV: repack it for K*Q in FA * q8_KV: slightly faster gemv on Zen4 * q8_KV: slightly faster gemv on Zen4 * q8_KV: ARM_NEON We get PP-512 = 167 t/s for L3-8B without interleaving! We do the interleaving on the fly, so I wonder if this could be done for other quants as well. * q8_KV: use it in FA on NEON * q8_KV_r8 - repacked q8_KV On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s) This makes no sense whatsoever as the q8_KV_r8 GEMM is basically the q8_k_r8 GEMM with the unnecessary block stuff removed (so, one would think that it would be faster). * q8_KV_r8: don't use nrc_y = 16 on Zen4 This is faster - 350 t/s. Why? Much better than the 290 t/s we had before, but still slower than the 370 t/s for q8_k_r8. * q8_KV: nrc_y = 16 also doesn't pay off in FA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
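A hedged sketch of the row-size point above (the layout here is assumed for illustration and need not match the actual q8_KV format): once a type carries per-row metadata such as a single float scale, the row size can no longer be derived as n/block_size*type_size, so places like ggml_copy must ask for the true row size.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed-for-illustration layout: one float scale per row + n int8 quants.
size_t row_size_with_per_row_scale(int64_t n_per_row) {
    return sizeof(float) + (size_t)n_per_row * sizeof(int8_t);
}

// The old block-wise assumption that silently breaks for such a type.
size_t row_size_blockwise(int64_t n_per_row, int64_t block_size, size_t type_size) {
    return (size_t)(n_per_row / block_size) * type_size;
}
```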
2025-02-19Repack also experts (#210)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-15Bug fix in activation quantizationIwan Kawrakow
I added a change in the last PR to how activations are quantized. It looked like it was working and slightly improving performance, but I have now hit an edge case where I get gibberish that goes away if I remove the change. I absolutely don't see what goes wrong, so I'm leaving the change in, commented out, for now.
2025-02-15Moving 4D gemm logic from ggml.c to iqk_mul_mat.cpp (#207)Kawrakow
This allows us to optimize TG performance for GQA models. E.g., for IQ4_XS L3-8B with 8k context, TG-64 goes from 8.6 to 10.26 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13MLA: allow Q8_0 K-cache for MLA (#206)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13Faster MLA prompt processing (#205)Kawrakow
* Do not allocate / report caches that are not used It is either the standard KV cache or MLA cache, not both. * Rename X_pe to X_rope Much easier to follow, at least for my brain, when we have X_rope : rotational position encoding X_nope : no position encoding instead of X_pe and X_nope, where I was wondering wtf is 'pe' and 'nope'. * WIP * WIP * WIP * WIP * Warn user when disabling MLA * MLA: compile time option to not use transposed KV cache Cuts KV cache size in nearly half at the expense of slower TG performance for long contexts (it becomes similar to no-MLA). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12Fix iqk_mul_mat on AVX512 systems that are missing BF16 support (#204)Kawrakow
* Fix iqk_mul_mat on AVX512 systems that are missing BF16 support * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12Fix imatrix overprotectiveness (#202)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-11DeepSeek FA support (CPU only) (#200)Kawrakow
* Adding support for K head size != V head size This is relevant for DeepSeek models. At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv. * iqk support for K head size != V head size To not have compilation time explode, just Dk = 192, Dv = 128 for now (DeepSeek) * FA: very slightly faster for nq = 1 (TG) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-10 Load all MoE experts during warmup and make warmup 1 token (#198)saood06
* Load all MoE experts during warmup Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Unify warmup to one token --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-09Add optional MLA (#188)Kawrakow
* Deepseek MLA Optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make MLA optional * Remove some unnecessary copies in the MLA attention * Deepseek MLA Optimizations V2 (#195) * Avoid allocating MHA KV cache when MLA is turned on * Added missing gguf-py file * Added final optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make sure we do have wk_b and wv_b before enabling MLA --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Use type_k and type_v to set the types of the MLA caches They were hard-coded at f16. On my Ryzen-7950X with native bf16 support I get a fairly significant PP performance boost with bf16 KV-cache: PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache. * Better gemm strategy when nth > nhead It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads (with or without MLA). Before this commit, when nth > nhead, heads were processed sequentially with all nth threads participating in each matrix multiplication. Now we find the gcd of nhead and nth and split threads into nth/gcd groups, each group processing nhead/gcd heads. --------- Co-authored-by: Saood Karim <saood05@gmail.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
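The grouping arithmetic described above, spelled out with illustrative numbers (DeepSeek-Lite has 16 heads; the 32-thread figure is the one quoted in the commit):

```cpp
#include <cstdio>
#include <numeric>   // std::gcd (C++17)

int main() {
    const int nhead = 16, nth = 32;        // heads and threads, as in the example above
    const int g = std::gcd(nhead, nth);    // g = 16
    // nth/g groups of g threads each, every group handling nhead/g head(s) per
    // matrix multiplication instead of all nth threads sharing a single head.
    std::printf("groups = %d, threads/group = %d, heads/group = %d\n",
                nth / g, g, nhead / g);
    return 0;
}
```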
2025-02-09FA: Add option to build all FA kernels (#197)Kawrakow
Similar to the CUDA situation. It is OFF by default. If OFF, only F16, Q8_0, Q6_0, and, if the CPU provides native BF16 support, BF16 FA kernels will be included. To enable all, cmake -DGGML_IQK_FA_ALL_QUANTS=1 ... This cuts compilation time for iqk_mul_mat.cpp by almost half (45 seconds vs 81 seconds on my Ryzen-7950X). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194)Kawrakow
* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon) * iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon) * Simdify q8_K128 quantization also on Neon * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-08Revert #79 (#192)Kawrakow
* Revert "Do not quantize activations if not necessary (#79)" This reverts commit 0bf4d99774aa3b6d00ef564acbc4dc211e45db33. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07cuda: non-contiguous rms norm (#190)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07Add additional checks for iq1_s_r4 quantization (#191)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189)Kawrakow
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving * Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving * Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06IQ1_M_R4: better 1.75 bpw quants (#187)Kawrakow
* iq1_m_r4: basics (quantize/dequantize) * iq1_m_r4: Zen4 gemm * iq1_m_r4: neon gemm * iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4 With the deltas being per group of 8, we cannot make use of the q8 sums stored in q8_1, so we get a tiny gain by using q8_0_x4. * iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05iq1_s_r4: slightly faster NEON gemm/gemv (#186)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05IQ1_S_R4: better 1.5 bpw quants (#185)Kawrakow
* iq1_s_r4: basics - quantize/dequantize * iq1_s_r4: gemm/gemv works on AVX2/Zen4 * Don't forget to make sure we have a multiple of 4 rows per thread * iq1_s_r4: this is better * iq1_s_r4: fix Zen4 after AVX2 changes * iq1_s_r4: NEON gemm/gemv * iq1_s_r4: more bits for shared experts With this mix we arrive at PPL(512) = 9.4140 for Deepseek-Lite using 1.766 bpw for the repeating layers. On the Ryzen-7950X we get PP-512 = 494 t/s and TG-128 = 52 t/s @ 16 threads. * Forgotten counter increment * iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv * Compiler warnings --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30Deepseek-Lite (#184)Kawrakow
* Quantization mixes tweaks * Make iq4_nl_r4 work with row sizes that are not a multiple of 128 ... on Zen4 * Make iq4_nl_r4 work with row sizes that are not a multiple of 128 ... on AVX2 * Make iq4_nl_r4 work with row sizes that are not a multiple of 128 ... on AVX2 * Make q6_0_r4 work with row sizes that are not a multiple of 128 ... on Zen4 * Make q6_0_r4 work with row sizes that are not a multiple of 128 ... on Zen4 * Make q5_0_r4 work with row sizes that are not a multiple of 128 ... on Zen4 and AVX2 * Make q5_0_r4, q6_0_r4, iq4_nl_r4 work with row sizes that are not a multiple of 128 also on NEON. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4 (#182)Kawrakow
* Slightly faster AVX2 implementation for q4_k_r4 * Even better AVX2 implementation for q4_k_r4 We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 291 t/s when I last measured on 3c5f8722. With FA and Q8_0 K-cache we get to 339.5 t/s. * Fix llama-bench labels that I broke with #181 * Faster AVX2 implementation for q5_k_r4 We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 273 t/s. * Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4 After the changes I made to AVX2, it ends up being slightly faster compared to what I had for Zen4. * Minor tweak * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-29Various (#181)Kawrakow
* Adding gp option to llama-bench Similar to pg, but it only looks at TG speed with a given prompt length. * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 They still need to be divisible by 32. * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 ... on NEON * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 ... on AVX2 * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 ... on AVX2 * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 ... on NEON * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 ... on Zen4. Also fix q8_0 K-cache for head sizes that are not a multiple of 128. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27Minor performance improvements (#179)Kawrakow
* Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is a tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without an associated NEON regression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches a peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point in having the special purpose implementation. * q4_0_r8 (AVX2) * q4_0_r8 (NEON) Tiny bit faster PP (~128 vs ~126 t/s), same TG. * q4_0_r8 (Zen4) Somehow only marginally faster? 268 t/s vs 261 t/s * q4_0_r8 (Zen4) - slightly better 282 t/s for a pure q4_0 L3-8B quantization. * Apply platform specific modifications when repacking E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0. This results in a ~3% performance improvement. Hence, * Changed the signature of the repack_X functions to take a bool argument indicating if the repacking is done online and, if so, apply modifications as appropriate while repacking. * Added iqk_modify_tensor to apply modifications to models that have already been repacked while loading the model. Caveat: just like rtr, this needs to have mmap disabled (else one would need to move the data to a non-mmap-ed buffer, which would be much more complicated). * Apply platform specific modifications when repacking On Zen4 we can pre-convert the signed quants in q8_0_r4 and q8_k_r8 to unsigned, thus avoiding these operations in matrix multiplications. With this change we hit PP-512 = 382.40 t/s (q8_k_r8) PP-512 = 306.92 t/s (q8_0_r4) for L3-8B on a Ryzen-7950X using q8_0 KV-cache. * Process up to 16 columns per kernel call for q8_k_r8 This brings PP-512 up to 389 t/s. * Be able to load Deepseek-v2-Lite --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
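To illustrate the "pre-apply q ^ 0x88" remark (a sketch under our reading of the q4_0 layout, where each byte packs two 4-bit quants stored with a +8 offset): XOR-ing every packed byte with 0x88 flips the high bit of both nibbles, turning the offset values into 4-bit two's-complement ones, so the hot GEMM loop no longer has to subtract 8 per element.

```cpp
#include <cstdint>
#include <cstddef>

// Sketch: apply q ^ 0x88 once while repacking, instead of subtracting 8 from
// every nibble inside the matrix-multiplication kernel.
void pre_xor_q4_nibbles(uint8_t * packed, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; ++i) {
        packed[i] ^= 0x88;   // flip the sign bit of both 4-bit quants in this byte
    }
}
```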
2025-01-27Interleave 8 rows (Q8_0, IQ4_XS) (#178)Kawrakow
* Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is a tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without an associated NEON regression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches a peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point in having the special purpose implementation. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24Update chat templates (#177)Kawrakow
* Adopting chat template stuff from llama.cpp * Removing missed conflict marker --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-23Deepseek V3 support added (#176)saood06
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23Add Deepseek-R1-Distill pre-tokenizerIwan Kawrakow
2025-01-22Better BF16 support on AVX2 (#175)Kawrakow
* Adding BF16 support for AVX2 PP performance is the same as fp16 (~153 t/s on Ryzen-5975WX), but TG is quite a bit lower (3.65 t/s vs 4.72 t/s at 8 threads). Why? * Slightly faster fp16/bf16 gemv on AVX2 It still saturates at the same lower performance for bf16 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-21On Zen4 repack fp16 models to bf16_r16 when run-time-repacking is requested (#174)Kawrakow
This massively improves performance. As this is opt-in, we do not worry about possible precision loss in the f16 -> bf16 conversion. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-20More Flash Attention improvements (#173)Kawrakow
* FA: slightly faster V*softmax(K*Q) on Zen4 * FA: it is also faster on AVX2 and ARM_NEON * Deleted forgotten commented out code * FA: slightly faster V*softmax(K*Q) also for fp16 K-cache * FA: slightly faster V*softmax(K*Q) on Zen4 We now get 130.9 t/s for a context of 32k tokens. * FA: don't store sum scaling factor in SIMD registers * FA: timing * FA: faster q8_0 cache via run-time-repacking On Zen4 q8_0 KV-cache now slightly outperforms BF16. We get 134 t/s for 32k tokens, which is ~30% better than the main branch, and ~18% better than the last commit. We simply repack the K-cache to q8_0_r4 before the K*Q multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication template. * FA: Fix AVX2 * FA: fix ARM_NEON * FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON * FA: dedicated mat mul for D = 128 also for ARM_NEON * FA: turn off performance timer --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
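A rough sketch of the q8_0 -> q8_0_r4 repacking idea referenced above (block layout and interleaving order are assumed for illustration; the real kernels interleave quants in a SIMD-friendly pattern that differs from this naive one):

```cpp
#include <cstdint>

// Assumed-for-illustration q8_0-style block: fp16 scale + 32 int8 quants.
struct blk_q8   { uint16_t d;    int8_t qs[32];   };
// Four rows' blocks bundled together so the GEMM can read 4 rows per pass.
struct blk_q8x4 { uint16_t d[4]; int8_t qs[4*32]; };

void repack_4_rows(const blk_q8 * rows[4], int nblocks, blk_q8x4 * out) {
    for (int ib = 0; ib < nblocks; ++ib) {
        for (int r = 0; r < 4; ++r) {
            out[ib].d[r] = rows[r][ib].d;
            for (int j = 0; j < 32; ++j) {
                out[ib].qs[r*32 + j] = rows[r][ib].qs[j];  // naive interleave
            }
        }
    }
}
```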