summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-27Interleave 8 rows (Q8_0, IQ4_XS) (#178)Kawrakow
* Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without associated NEON egression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point to have the special purpose implementation. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24Update chat templates (#177)Kawrakow
* Adopting chat template stuff from llama.cpp * Removing missed conflict marker --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-23Deepseek V3 support added (#176)saood06
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23Add Deepseek-R1-Distill pre-tokenizerIwan Kawrakow
2025-01-22Better BF16 support on AVX2 (#175)Kawrakow
* Adding BF16 support for AVX2 PP performance is the same as fp16 (~153 t/s on Ryzen-5975WX), but TG is quite a bit lower (3.65 t/s vs 4.72 t/s at 8 threads). Why? * Slightly faster fp16/bf16 gemv on AVX2 It still saturates at the same lower peformance for bf16 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-21On Zen4 repack fp16 models to bf16_r16 when run-time-repacking is requested ↵Kawrakow
(#174) This massively improves performance. As this is opt-in, we do not worry about possible precision loss in the f16 -> bf16 conversion. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-20More Flash Attention improvements (#173)Kawrakow
* FA: slightly faster V*softmax(K*Q)) on Zen4 * FA: it is also faster on AVX2 and ARM_NEON * Deleted forgotten commented out code * FA: slightly faster V*softmax(K*Q)) also for fp16 K-cache * FA: slightly faster V*softmax(K*Q)) on Zen4 We now get 130.9 t/s for a context of 32k tokens. * FA: don't store sum scaling factor in SIMD registers * FA: timing * FA: faster q8_0 cache via run-time-repacking On Zen4 q8_0 KV-cache now slightly outperforms BF16. We get 134 t/s for 32k tokens, which is ~30% better than the main branch, and ~18% better than the last commit. We simply repack the K-cache to q8_0_r4 before the K*Q multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication template. * FA: Fix AVX2 * FA: fix ARN_NEON * FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON * FA: dedicated mat mul for D = 128 also for ARM_NEON * FA: turn off performance timer --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-15CPU Flash Attention improvements (#172)Kawrakow
* Slightly faster FA for bf16 KV cache ~2-3% sort of thing. Sadly, when we go beyond 8k tokens, the advantage kind of goes away. * Slightly faster FA for Q8_0 KV cache * FA: allow bf16 for V-cache with any supported K-cache E.g., -ctk q8_0 -ctv bf16 is slightly faster than -ctk q8_0 -ctv q8_0 on Zen4 for not too long context lengths (say, <= 4096). * FA: much better bf16 kv-cache speed for large contexts We now hit 122 t/s for LLaMA-3.1-8B (quantized as iq4_xs and run-time-repacked) with a context of 32768. IIRC, the previous best for such large context was ~90 t/s. Non-negligible improvement at 16384 and 8192 as well: 173.4 and 214 t/s. * FA: slightly better quantized kv-cache speed for large contexts E.g., for q8_0 and context of 32768, we are now at 113 t/s for LLaMA-3.1-8B. Also simplified the quantized K*Q multiplication. * Fix q8_0 KV cache when not using FA - WIP (AVX2) 1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use those to quantize activations for quants that use Q8_0 or Q8_1 as their vec_dot type. 2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1 3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type 4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than GGML_TYPE_Q8_0_X4 as the K and V types 5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4 in iqk_mul_mat Also added an optimization in ggml_compute_forward_mul_mat when ne12*ne13 > 1 (K*Q and V*softmax(K*Q)) to process n12*ne13/GCD(n12*ne13, nthread) threads simultaneously using nthread/GCD(n12*ne13, nthread) threads per head. This results in a non-negligible performance gain for large contexts. Question: why is it not allowed to use quantized V-cache when not using FA? * Fix q8_0 KV cache when not using FA - NEON * Fix AVX2 Again the issue with _mm256_maddubs_epi16 overflowing that I keep forgetting. * FA: don't use large Q steps on AVX2 for fp16 K-cache * On Zen4 it is also better to not use large Q steps for fp16 K-cache --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12Fix the strange FA behavior with odd/even batch sizes (#171)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12MoE fix for R4 quants (#170)Kawrakow
* Fix bug in iqk_mul_mat I recently added the possibility to have a matrix multiplication kernel that processes 16 columns in the right matrix per iteration. This introduced a bug that shows up when batch size is greater than 16, is not a multiple of 16, and the remainder is not a multiple of the maximum columns being processed by the regular kernels (and so, never showed up in my testing using TG-128 and PP-512). This commit fixes the issue. * Make sure rows per thread is a multiple of 4 also for MoE when using _r4 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10Be able to re-quantize MS BitNet I2_S models (#169)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10Falcon3 changes (#168)Kawrakow
* Add Falcon3 pre-tokinizer (same as llama3) * q8_k16: use integer arithmetic to sum row values The existing implementation that just sums up the f32 quantizations works fine for the original BitNet models and also for the TriLM ternary models. But for Falcon3 I see a significant difference between the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum up the values in a row, then the CPU-GPU PPL difference becomes much smaller, and we get a lower PPL than Microsoft BitNet, which claims to be "losless". --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23iq4_0_r4: Use AVX2 version for matrix x vector (#163)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23IQ3_S_R4 (#162)Kawrakow
* iq3_s_r4: WIP * iq3_s_r4: Zen4 * iq3_s_r4: slightly better Zen4 * iq3_s_r4: AVX2 * iq3_s_r4: NEON * iq3_s_r4: rearrange quants * iq3_s_r4: rearranged quants - AVX2 * iq3_s_r4: rearranged quants - NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23MSVC fixes (#161)Kawrakow
Closes #160 * MSVC fixes * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22Faster R4 legacy quants (#158)Kawrakow
* q4_0_r4(avx2): convert q8_1 scales with SIMD instrinsics PP-512 goes to 283 t/s from 265 t/s * qx_0_r4(AVX2): convert scales with SIMD instrinsics Also fix q8_0_r4 to not overflow. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22R4 i-quants improvements (#157)Kawrakow
* Add nrc_y = 16 implementation. Here just iq2_s on Zen4. We get PP-512 go up to 169.5 t/s from 148.5 t/s. As we are sure that we will be multiplying with 16 columns, we can spend the time to add the mins and make the iq2_s quants unsigned. * nrc_y = 16: AVX2 iq2_s We go from 176.8 to 203.3 t/s. * nrc_y = 16: NEON iq2_s We go from 50.4 to 62.3 t/s. We didn't need to do anything other than to set func16 to mul_mat_iq2_s_r4_q8_k<16>. Even though we absolutely don't have so many vector registers for all accumulators, unpacking and preparing the iq2_s quants is so expensive that we still gain ~23% in performance by reusing the unpacked quants 16 times instead of just 8, despite having to load/unload the accumulated results to/from the available vector registers. * nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs iq2_xxs: 76.34 -> 85.33 t/s iq2_xs: 54.13 -> 67.99 t/s iq3_xxs: 67.45 -> 73.56 t/s * nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs iq2_xxs: 195.7 -> 221.8 t/s iq2_xs : 192.6 -> 220.6 t/s iq3_xxs: 184.4 -> 206.9 t/s * r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21IQ2_S_R4 (#156)Kawrakow
* iq2_s_r4: Zen4 * Minor * iq2_s_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21IQ2_XS_R4 (#155)Kawrakow
* iq2_xs_r4: Zen4 * iq2_xs_r4: AVX2 * iq2_xs_r4: slightly better matrix x vector on AVX2 * iq2_xs_r4: NEON - not much better than iq2_xs * iq2_xs_r4: slightly better NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20IQ2_XXS_R4 (#154)Kawrakow
* iq2_xxs_r4: Zen4 Disapointing gain: 134.7 t/s -> 151.1 t/s for PP-512 TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread * Minor * iq2_xxs_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20fix typo (#151)Nexes the Elder
2024-12-20IQ3_XXS_R4 (#153)Kawrakow
* iq3_xxs_r4: 1st shot on Zen4 PP-512: 107 t/s -> 137 t/s TG-128(1 thread): 2.64 t/s -> 3.44 t/s * iq4_xxs_r4: WIP * iq4_xxs_r4: 1st shot at AVX2 Note: there is a bug in the AVX2 implementation for nrc_y = 1 for IQ quants with blocks of 32. I have fixed it for now by using the nrc_y > 1 implementation (which works) also for nrc_y = 1. * iq3_xxs_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18IQ4_KS_R4 (#150)Kawrakow
* iq4_ks_r4: Zen4 * iq4_ks_r4: AVX2 * iq4_ks_r4: WIP * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: NEON * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18IQ5_K_R4 (#149)Kawrakow
* iq5_k_r4: Zen4 Much slower than the others. * iq5_k_r5: WIP * Minor * iq5_k_r4: fix AVX2 nrc_y = 1 case * iq5_k_r4: better Zen4 But TG is still slower than iq5_k * iq5_k_r4: slightly better AVX2 * iq5_k_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, ↵Kawrakow
iq4_k_r4 (#148) * Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4 More importantly: simplify. * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17Be able to repack tensors at run time (#147)Kawrakow
* Be able to repack tensors at run time * Repack: also add bf16 as repackable type * Repack: make sure number of rows is a multiple of the packing --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17IQ2_K_R4 (#146)Kawrakow
* iq2_k_r4: Zen4 * iq2_k_r4: NEON * iq2_k_r4: better matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17IQ3_K_R4 (#145)Kawrakow
* iq3_k_r4 WIP * iq3_k_r4: Zen4 * iq3_k_r4: AVX2 * iq3_k_r4: NEON * iq3_k_r4: faster matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16Slightly faster IQ4_K_R4 on AVX2/Zen4 (#144)Kawrakow
* iq4_k_r4: slightly better AVX2 227 t/s -> 249 t/s * iq4_k_r4: slightly better Zen4 232 t/s -> 251 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16Slightly faster IQ4_XS_R4 on AVX2 (#143)Kawrakow
* iq4_xs_r4: slightly faster and correct AVX2 implementation * Minor * Delete unused stuff --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-16q8_k_r8: this change for NEON got lost?Iwan Kawrakow
2024-12-15BF16_R16 - 16 interleaved bf16 rows (#142)Kawrakow
* Not working bf16_r4 * Adding bf16_r8 Small performance gain compared to bf16 - 258 t/s vs 234 t/s. I guess, this is still sub-obtimal. * bf16_rx: Very slightly faster by interleaving 16 rows 258 t/s -> 263 t/s * Rename bf16_r4 to bf16_r16 We are interleaving 16 rows now. * Cleanup unused stuff --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-14Q8_K_R8: Fastest quantized matrix multiplications (#141)Kawrakow
* q8_k_r8: fastest matrix multiplication known to human kind We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X! * q8_k_r8: AVX2 I was worried that we don't have enough vector registrers on AVX2, but it looks like it handles it just fine. We get PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX. Slightly slower than the Zen4 version with double the threads, but still a huge upgrade compared to Q8_0_R4. * q8_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 159.2 t/s. Compare this to the 128 t/s we have fr Q8_0_R4. * q8_k_r4: go to signed ints Why? * On AVX2 _mm256_maddubs_epi16() may overflow, so we need to stay within the signed int range and use _mm256_sign_epi8. Not yet tested on the AVX2 comp, vut expect major slowdown. * It is almost 10% faster on ARM_NEON. Somehow the veorrq_u8() needed tto convert from unsigned to signed seems to be extremely slow on the M2-Max * We only lose ~0.5% in oerformance on Zen4 (there the exclusive or that we now use to convert fro signed to unsigned seems to be much faster than on M2-Max) * Shutup useless compiler warnings --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-13Faster R4 quants on Zen4 (#139)Kawrakow
* q3_k_r4: faster Zen4 * q3_k_r4: faster Zen4 256.2 -> 272.7 t/s for PP-512 * q6_k_r4: faster Zen4 243.2 -> 261.3 t/s for PP-512 * q4_k_r4: slightly faster Zen4 262.4 t/s -> 268.1 t/s * q5_k_r4: slightly faster Zen4 248.3 t/s -> 256.7 t/s * iq4_xs_r4: slightly faster Zen4 256.8 t/s -> 272.0 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-13Another fixIwan Kawrakow
2024-12-13Adding lost q4_k_r4 caseIwan Kawrakow
Not sure how it got lost.
2024-12-12IQ4_K_R4 (#138)Kawrakow
* iq4_k_r4: WIP * iq4_k_r4: Zen4 and hopefully AVX2 On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s for iq4_k. Applying the extra shift costs a ~6 performance penalty. * iq4_k_r4: AVX2 PP-512 = 227.60 t/s. The shifts are really costly. * iq4_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11Fix AVX2 implementation of iq4_nl_r4 (#137)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11Q2_K_R4 (#136)Kawrakow
* q2_k_r4: Zen4 PP-512(LLaMA-3.1-8B) = 256 t/s * q3_k_r4: AVX2 * q2_k_r4: AVX2 We get PP-512(LLaMA-3.1-8B) = 287 t/s. Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow forgot to push upstream. * q2_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 106.2 t/s. TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S. * Make sure rows per thread are a multiple of 4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11Better ARM_NEON implementation for R4 quants (#135)Kawrakow
* q6_k_r4: Better ARM implementation PP-512(LLaMA-3.1-8B) is now 104.2 t/s up from 83.2 t/s. I.e., q6_k_r4 now beats q6_0_r4. * q5_k_r4: Better ARM implementation PP-512(LLaMA-3.1-8B) is now 107.8 t/s up from 96.9 t/s. I.e., q5_k_r4 now beats q5_0_r4. * q4_k_r4: Better ARM implementation PP-512(LLaMA-3.1-8B) is now 122.1 t/s up from 110 t/s. I.e., q4_k_r4 is now (nearly) on par with q4_0_r4. * iq4_xs_r4: Better ARM implementation PP-512(LLaMA-3.1-8B) is now 131.3 t/s up from 115.8 t/s. iq4_xs_r4 is now the prompt processing champion on ARM. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11Q3_K_R4 (#134)Kawrakow
* q3_k_r4: Zen4 works, but not as good as it should be 238 t/s, so sloghtly slower than q6_k_r4. * q3_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 106.9 t/s. This is 1.93X faster than q3_K_S! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10Q5_K_R4 (#132)Kawrakow
* q5_k_r4: WIP * q5_k_r4: Zen4 and AVX2 We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4. Q5_K_S has PP-512 = 190 t/s. * q5_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 96.1 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4 (#131)Kawrakow
* iq4_k_r4: slightly faster on Zen4 * iq4_xs_r4: very slightly faster Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10Q6_K_R4 (#130)Kawrakow
* Adding q6_k_r4 * q6_k_r4: 1st functional AVX2 version * q6_k_r4: AVX2 and simple Zen4 "Simple" as in processing 4 instead of 8 rows at once. On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs 195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s vs 5.38 t/s for Q6_K. * q6_k_r4: 1st NEON version PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K. TG-128 is slightly lower rthan q6_K for low number of threads, becomes very slightly better at 8 threads. * q6_k_r4: slightly faster NEON PP-512(LLaMA-3.1-8B) = 83.25 t/s * q6_k_r4: slightly faster Zen4 238.3 t/s -> 243.2 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-09Q4_K_R4 (#129)Kawrakow
* Something is still wrong * Simply don't see what is wrong * q4_k_r4: finally works on Zen4 I had forgotten to prevent token_embd.weight being quantized with q4_k_r4! * q4_k_r4: AVX2 We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX. This is ~30% better than Q4_K_S. * q4_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 110 t/s. Not quite as good as q4_0_r4, but still a massive improvement compared to he 69 t/s for q4_K. * q4_k_r4: slightly better AVX2 PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX * Minor * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08Faster IQ4_XS_R4 on Zen4 (#128)Kawrakow
* Faster iq4_xs_r4 on Zen4 The trick is to simply prepare the Q8 block sums for blocks of 32 as floats. This brings PP-512 up to 254.6 t/s from 224 t/s. * Fix broken matrix x vector product on Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08Rename iq4_nl_x4 to iq4_nl_r4 (#126)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08R4 improvements on ARM_NEON (#125)Kawrakow
* q4_0_r4: 6% faster PP on NEON * qx_0_r4_q8_0 template Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP ~7% faster. * Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4 * Simplify * Minor iq4_xs_r4 improvement on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)Kawrakow
* Adding iq2_bn_r4 This Zen4-only implementation achieves PP-512 = 826 t/s (!!!) for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn. * Make sure rows per thread are a multiple of the number of interleaved rows With this I can run iq2_bn_r4 with 32 threads and this increases PP-512 to 872 t/s. * iq2_bn_r4: 1st shot at NEON PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s for Bitnet-1.58b-3B). TG-128 is ~5% slower. * iq2_bn_r4: NEON PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn for 1 thread, but saturates to about the same 93 t/s at 8 threads. * iq2_bn_r4: Experimenting on NEON The matrix x vvector multiplication is erratic. iq2_bn_r4 is faster at 1, 2, and 4 threads, but saturates to a lower t/s at 8 threads compared to iq2_bn. iq2_bn actually manages 99 t/s at 8 threads and not 93 as I wrore in the last commit. iq2_bn_r4 performance has huge fluctuations at 4 and 8 threads. * Some cleanup * iq2_bn_r4: AVX2 As expected, PP is slightly slower as we just don;t have enough vector registers (690 vs 710 t/s). TG is slightly faster (18.2 vs 16.7 t/s at 1 thread). * iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn. * iq2_bn_r4: simdify q8_K16 quantization (AVX2) PP-512 becomes 834 t/s and TG-128 now saturates to the same performance as iq2_bn for 4 threads. * iq2_bn_r4: simdify q8_K16 quantization (NEON) PP-512 is now 304.7 t/s, and TG-128 @ 8 threads very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s) * iq2_bn_r4: fix AVX2 after breaking it two commits ago * iq2_bn_r4: better AVX2 As we don't have enough vector registers on AVX2, it is better to do two passes per row needing only half of the accumulator registers that way. With this, we now beat iq2_bn PP also on AVX2 by a small margin. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04IQ4_XS_R4 (#123)Kawrakow
* Adding iq4_xs_r4 This is a 1st working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4. * iq4_xs_r4: WIP * iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4 * iq4_xs_r4: NEON We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs! * DRY --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>