ik_llama.cpp.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2025-02-15	Bug fix in activation quantization	Iwan Kawrakow
	I added a change in the last PR how activations are quantized. It looked like it is working and slightly improving performance. But I now hit an edge case where I get gibberish that goes away if I remove the change. I absolutely don't see what goes wrong, so leaving the change in commented out for now.
2025-02-15	Moving 4D gemm logic from ggml.c to iqk_mul_mat.cpp (#207)	Kawrakow
	This allows us to optimize TG performance for GQA models. E.g., for IQ4_XS L3-8B with 8k TG-64 goes from 8.6 to 10.26 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13	MLA: allow Q8_0 K-cache for MLA (#206)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-13	Faster MLA prompt processing (#205)	Kawrakow
	* Do not allocate / report caches that are not used It is either the standard KV cache or MLA cache, not both. * Rename X_pe to X_rope Much easier to follow, at least for my brain, when we have X_rope : rotational position encoding X_nope : no position encoding instead of X_pe and X_nope, where I was wondering wtf is 'pe' and 'nope'. * WIP * WIP * WIP * WIP * Warn user when disabling MLA * MLA: compile time option to not use transposed KV cache Cuts KV cache size in nearly half at the expense of slower TG performance for long contexts (it becomes similar to no-MLA). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12	Fix iqk_mul_mat on AVX512 systems that are missing BF16 support (#204)	Kawrakow
	* Fix iqk_mul_mat on AVX512 systems that are missing BF16 support * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-12	Fix imatrix overprotectiveness (#202)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-11	DeepSeek FA support (CPU only) (#200)	Kawrakow
	* Adding support for K head size != V head size This is relevant for DeepSeek models. At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv. * iqk support for K head size != V head size To not have compilation time explode, just Dk = 192, Dv = 128 for now (DeepSeek) * FA: very slightly faster for nq = 1 (TG) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-10	Load all MoE experts during warmup and make warmup 1 token (#198)	saood06
	* Load all MoE experts during warmup Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Unify warmup to one token --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-02-09	Add optional MLA (#188)	Kawrakow
	* Deepseek MLA Optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make MLA optional * Remove some unnecessary copies in the MLA attention * Deepseek MLA Optimizations V2 (#195) * Avoid allocating MHA KV cache when MLA is turned on * Added missing gguf-py file * Added final optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make sure we do have wk_b and wv_b before enabling MLA --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Use type_k and type_v to set the types of the MLA caches They were hard-coded at f16. On my Ryzen-7950X with native bf16 support I get a fairly significant PP performance boost with bf16 KV-cache: PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache. * Better gemm strategy when nth > nhead It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads (with or without MLA). Before this commit, when nth > nhead heads were processed sequentially with all nth threads participating in each matrix multiplication. Now we ind the gcd of nhead and nth and split threads into nth/gcd groups, each group processing nhead/gcd heads. --------- Co-authored-by: Saood Karim <saood05@gmail.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09	FA: Add option to build all FA kernels (#197)	Kawrakow
	Similar to the CUDA situation. It is OFF by default. If OFF, only F16, Q8_0, Q6_0, and, if the CPU provides native BF16 support, BF16 FA kernels will be included. To enable all, cmake -DGGML_IQK_FA_ALL_QUANTS=1 ... This cuts compilation time for iqk_mul_mat.cpp by almost half (45 seconds vs 81 seconds on my Ryzen-7950X). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09	Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194)	Kawrakow
	* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon) * iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon) * Simdify q8_K128 quantization also on Neon * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-08	Revert #79 (#192)	Kawrakow
	* Revert "Do not quantize activations if not necessary (#79)" This reverts commit 0bf4d99774aa3b6d00ef564acbc4dc211e45db33. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07	cuda: non-contiguous rms norm (#190)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-07	Add additional checks for iq1_s_r4 quantization (#191)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06	Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189)	Kawrakow
	* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving * Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving * Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06	IQ1_M_R4: better 1.75 bpw quants (#187)	Kawrakow
	* iq1_m_r4: basics (quantize/dequantize) * iq1_m_r4: Zen4 gemm * iq1_m_r4: neon gemm * iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4 With the deltas being per group of 8, we cannot make use of the q8 sums stored in q8_1, so we get a tiny gain by using q8_0_x4. * iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05	iq1_s_r4: slightly faster NEON gemm/gemv (#186)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05	IQ1_S_R4: better 1.5 bpw quants (#185)	Kawrakow
	* iq1_s_r4: basics - quantize/dequantize * iq1_s_r4: gemm/gemv works on AVX2/Zen4 * Don't forget to make sure we have a multiple of 4 rows per thread * iq1_s_r4: this is better * iq1_s_r4: fix Zen4 after AVX2 changes * iq1_s_r4: NEON gemm/gemv * iq1_s_r4: more bits for shared experts With this mix we arrive at PPL(512) = 9.4140 for Deepseek-Lite using 1.766 bpw for the repeating layers. On the Ryzen-7950X we get PP-512 = 494 t/s and TG-128 = 52 t/s @ 16 threads. * Forgotten counter increment * iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv * Compiler warnings --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30	Deepseek-Lite (#184)	Kawrakow
	* Quantization mixes tweaks * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on Zen4 * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on AVX2 * Make iq4_nl_r4 work with row size that are not a multiple of 128 ... on AVX2 * Make q6_0_w4 work with row size that are not a multiple of 128 ... on Zen4 * Make q6_0_w4 work with row size that are not a multiple of 128 ... on Zen4 * Make q5_0_r4 work with row size that are not a multiple of 128 ... on Zen4 and AVX2 * Make q5,6_0_r4, iq4_nl_e4 work with row size that are not a multiple of 128 also on NEON. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-30	Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4 (#182)	Kawrakow
	* Slightly faster AVX2 implementation for q4_k_r4 * Even better AVX2 implementation for q4_k_r4 We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 291 t/s when I last measured on 3c5f8722. With FA and Q8_0 K-cache we get to 339.5 t/s. * Fix llama-bench labels that I broke with #181 * Faster AVX2 implementation for q5_k_q4 We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 273 t/s. * Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4 After the changes I made to AVX2, it ends up being slightly faster compared to what I had for Zen4. * Minor tweak * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-29	Various (#181)	Kawrakow
	* Adding gp option to llama-bench Similar to pg, but it only looks at TG speed with a given prompt length. * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 They still need to be divisible by 32. * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 .. on NEON * Make q8_0_r4 work with tensor row sizes that are not a multiple of 128 .., on AVX2 * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 .., on AVX2 * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 ... on NEON * Make q4_0_r4 work with tensor row sizes that are not a multiple of 128 ... on Zen4. Also fix q8_0 K-cache for head sizes that are not multiple of 128. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27	Minor performance improvements (#179)	Kawrakow
	* Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without associated NEON egression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point to have the special purpose implementation. * q4_0_r8 (AVX2) * q4_0_r8 (NEON) Tiny bit faster PP (~128 vs ~126 t/s), same TG. * q4_0_r8 (Zen4) Somehow only marginally faster? 268 t/s vs 261 t/s * q4_0_r8 (Zen4) - slightly better 282 t/s for a pure q4_0 L3-8B quantization. * Apply platform specific modifications when repacking E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0. This results in a ~3% performance improvement. Hence, * Changed the signature of the repack_X functions to take a bool argument indicating if the repacking is done online and, if so, apply modifications as appropriate while repacking. * Added iqk_modify_tensor to apply modifications to models that have already been repacked while loading the model. Caveat: just like rtr, this needs to have mmap disabled (else one would need to move the data to a not mmap-ed buffer, so much more complicated). * Apply platform specific modifications when repacking On Zen4 we can pre-convert the signed quants in q8_0_r4 and q8_k_r8 to unsigned thus avoiding these operations in matrix multiplications. With this change we hit PP-512 = 382.40 t/s (q8_k_r8) PP-512 = 306.92 t/s (q8_0_r4) for L3-8B on a Ryzen-7950X using q8_0 KV-cache. * Process up to 16 columns per kernel call for q8_k_r8 This brings PP-512 up to 389 t/s. * Be able to load Deepseek-v2-Lite --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-27	Interleave 8 rows (Q8_0, IQ4_XS) (#178)	Kawrakow
	* Try interleaving 8 rows for iq4_xs On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches max. performance at 2 threads and is slightly higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14/28 t/s @ 4 threads). * Try interleaving 8 iq4_xs rows It is also faster on AVX2. This is the NEON implementation. It is tiny bit faster than 4 interleaved rows (~0.5%). So, this looks like a winner given the Zen4/AVX2 improvement without associated NEON egression. * Cleanup * 8-rows interleaved q8_0 (AVX2) * 8-rows interleaved q8_0 (Zen4) * 8-rows interleaved q8_0 (Zen4) - slightly better PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved. TG-128 reaches peak of 8.16 t/s at just 2 threads compared to 7.95 t/s @ 4 threads before. * 8-rows interleaved q8_0 (NEON) PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same. * FA: repack Q8_0 to Q8_0_R8 * Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4) * FA: repack Q8_0 to Q8_0_R8 (NEON) Very slightly faster than the general purpose gemm, slightly slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point to have the special purpose implementation. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-24	Update chat templates (#177)	Kawrakow
	* Adopting chat template stuff from llama.cpp * Removing missed conflict marker --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-23	Deepseek V3 support added (#176)	saood06
	Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23	Add Deepseek-R1-Distill pre-tokenizer	Iwan Kawrakow

2025-01-22	Better BF16 support on AVX2 (#175)	Kawrakow
	* Adding BF16 support for AVX2 PP performance is the same as fp16 (~153 t/s on Ryzen-5975WX), but TG is quite a bit lower (3.65 t/s vs 4.72 t/s at 8 threads). Why? * Slightly faster fp16/bf16 gemv on AVX2 It still saturates at the same lower peformance for bf16 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-21	On Zen4 repack fp16 models to bf16_r16 when run-time-repacking is requested ↵	Kawrakow
	(#174) This massively improves performance. As this is opt-in, we do not worry about possible precision loss in the f16 -> bf16 conversion. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-20	More Flash Attention improvements (#173)	Kawrakow
	* FA: slightly faster Vsoftmax(KQ)) on Zen4 * FA: it is also faster on AVX2 and ARM_NEON * Deleted forgotten commented out code * FA: slightly faster Vsoftmax(KQ)) also for fp16 K-cache * FA: slightly faster Vsoftmax(KQ)) on Zen4 We now get 130.9 t/s for a context of 32k tokens. * FA: don't store sum scaling factor in SIMD registers * FA: timing * FA: faster q8_0 cache via run-time-repacking On Zen4 q8_0 KV-cache now slightly outperforms BF16. We get 134 t/s for 32k tokens, which is ~30% better than the main branch, and ~18% better than the last commit. We simply repack the K-cache to q8_0_r4 before the KQ multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication template. FA: Fix AVX2 * FA: fix ARN_NEON * FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON * FA: dedicated mat mul for D = 128 also for ARM_NEON * FA: turn off performance timer --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-15	CPU Flash Attention improvements (#172)	Kawrakow
	* Slightly faster FA for bf16 KV cache ~2-3% sort of thing. Sadly, when we go beyond 8k tokens, the advantage kind of goes away. * Slightly faster FA for Q8_0 KV cache * FA: allow bf16 for V-cache with any supported K-cache E.g., -ctk q8_0 -ctv bf16 is slightly faster than -ctk q8_0 -ctv q8_0 on Zen4 for not too long context lengths (say, <= 4096). * FA: much better bf16 kv-cache speed for large contexts We now hit 122 t/s for LLaMA-3.1-8B (quantized as iq4_xs and run-time-repacked) with a context of 32768. IIRC, the previous best for such large context was ~90 t/s. Non-negligible improvement at 16384 and 8192 as well: 173.4 and 214 t/s. * FA: slightly better quantized kv-cache speed for large contexts E.g., for q8_0 and context of 32768, we are now at 113 t/s for LLaMA-3.1-8B. Also simplified the quantized KQ multiplication. Fix q8_0 KV cache when not using FA - WIP (AVX2) 1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use those to quantize activations for quants that use Q8_0 or Q8_1 as their vec_dot type. 2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1 3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type 4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than GGML_TYPE_Q8_0_X4 as the K and V types 5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4 in iqk_mul_mat Also added an optimization in ggml_compute_forward_mul_mat when ne12ne13 > 1 (KQ and Vsoftmax(KQ)) to process n12ne13/GCD(n12ne13, nthread) threads simultaneously using nthread/GCD(n12ne13, nthread) threads per head. This results in a non-negligible performance gain for large contexts. Question: why is it not allowed to use quantized V-cache when not using FA? Fix q8_0 KV cache when not using FA - NEON * Fix AVX2 Again the issue with _mm256_maddubs_epi16 overflowing that I keep forgetting. * FA: don't use large Q steps on AVX2 for fp16 K-cache * On Zen4 it is also better to not use large Q steps for fp16 K-cache --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12	Fix the strange FA behavior with odd/even batch sizes (#171)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-12	MoE fix for R4 quants (#170)	Kawrakow
	* Fix bug in iqk_mul_mat I recently added the possibility to have a matrix multiplication kernel that processes 16 columns in the right matrix per iteration. This introduced a bug that shows up when batch size is greater than 16, is not a multiple of 16, and the remainder is not a multiple of the maximum columns being processed by the regular kernels (and so, never showed up in my testing using TG-128 and PP-512). This commit fixes the issue. * Make sure rows per thread is a multiple of 4 also for MoE when using _r4 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10	Be able to re-quantize MS BitNet I2_S models (#169)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-10	Falcon3 changes (#168)	Kawrakow
	* Add Falcon3 pre-tokinizer (same as llama3) * q8_k16: use integer arithmetic to sum row values The existing implementation that just sums up the f32 quantizations works fine for the original BitNet models and also for the TriLM ternary models. But for Falcon3 I see a significant difference between the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum up the values in a row, then the CPU-GPU PPL difference becomes much smaller, and we get a lower PPL than Microsoft BitNet, which claims to be "losless". --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23	iq4_0_r4: Use AVX2 version for matrix x vector (#163)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23	IQ3_S_R4 (#162)	Kawrakow
	* iq3_s_r4: WIP * iq3_s_r4: Zen4 * iq3_s_r4: slightly better Zen4 * iq3_s_r4: AVX2 * iq3_s_r4: NEON * iq3_s_r4: rearrange quants * iq3_s_r4: rearranged quants - AVX2 * iq3_s_r4: rearranged quants - NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23	MSVC fixes (#161)	Kawrakow
	Closes #160 * MSVC fixes * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22	Faster R4 legacy quants (#158)	Kawrakow
	* q4_0_r4(avx2): convert q8_1 scales with SIMD instrinsics PP-512 goes to 283 t/s from 265 t/s * qx_0_r4(AVX2): convert scales with SIMD instrinsics Also fix q8_0_r4 to not overflow. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22	R4 i-quants improvements (#157)	Kawrakow
	* Add nrc_y = 16 implementation. Here just iq2_s on Zen4. We get PP-512 go up to 169.5 t/s from 148.5 t/s. As we are sure that we will be multiplying with 16 columns, we can spend the time to add the mins and make the iq2_s quants unsigned. * nrc_y = 16: AVX2 iq2_s We go from 176.8 to 203.3 t/s. * nrc_y = 16: NEON iq2_s We go from 50.4 to 62.3 t/s. We didn't need to do anything other than to set func16 to mul_mat_iq2_s_r4_q8_k<16>. Even though we absolutely don't have so many vector registers for all accumulators, unpacking and preparing the iq2_s quants is so expensive that we still gain ~23% in performance by reusing the unpacked quants 16 times instead of just 8, despite having to load/unload the accumulated results to/from the available vector registers. * nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs iq2_xxs: 76.34 -> 85.33 t/s iq2_xs: 54.13 -> 67.99 t/s iq3_xxs: 67.45 -> 73.56 t/s * nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs iq2_xxs: 195.7 -> 221.8 t/s iq2_xs : 192.6 -> 220.6 t/s iq3_xxs: 184.4 -> 206.9 t/s * r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21	IQ2_S_R4 (#156)	Kawrakow
	* iq2_s_r4: Zen4 * Minor * iq2_s_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21	IQ2_XS_R4 (#155)	Kawrakow
	* iq2_xs_r4: Zen4 * iq2_xs_r4: AVX2 * iq2_xs_r4: slightly better matrix x vector on AVX2 * iq2_xs_r4: NEON - not much better than iq2_xs * iq2_xs_r4: slightly better NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20	IQ2_XXS_R4 (#154)	Kawrakow
	* iq2_xxs_r4: Zen4 Disapointing gain: 134.7 t/s -> 151.1 t/s for PP-512 TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread * Minor * iq2_xxs_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20	fix typo (#151)	Nexes the Elder

2024-12-20	IQ3_XXS_R4 (#153)	Kawrakow
	* iq3_xxs_r4: 1st shot on Zen4 PP-512: 107 t/s -> 137 t/s TG-128(1 thread): 2.64 t/s -> 3.44 t/s * iq4_xxs_r4: WIP * iq4_xxs_r4: 1st shot at AVX2 Note: there is a bug in the AVX2 implementation for nrc_y = 1 for IQ quants with blocks of 32. I have fixed it for now by using the nrc_y > 1 implementation (which works) also for nrc_y = 1. * iq3_xxs_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18	IQ4_KS_R4 (#150)	Kawrakow
	* iq4_ks_r4: Zen4 * iq4_ks_r4: AVX2 * iq4_ks_r4: WIP * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: NEON * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18	IQ5_K_R4 (#149)	Kawrakow
	* iq5_k_r4: Zen4 Much slower than the others. * iq5_k_r5: WIP * Minor * iq5_k_r4: fix AVX2 nrc_y = 1 case * iq5_k_r4: better Zen4 But TG is still slower than iq5_k * iq5_k_r4: slightly better AVX2 * iq5_k_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17	Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, ↵	Kawrakow
	iq4_k_r4 (#148) * Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4 More importantly: simplify. * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17	Be able to repack tensors at run time (#147)	Kawrakow
	* Be able to repack tensors at run time * Repack: also add bf16 as repackable type * Repack: make sure number of rows is a multiple of the packing --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17	IQ2_K_R4 (#146)	Kawrakow
	* iq2_k_r4: Zen4 * iq2_k_r4: NEON * iq2_k_r4: better matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17	IQ3_K_R4 (#145)	Kawrakow
	* iq3_k_r4 WIP * iq3_k_r4: Zen4 * iq3_k_r4: AVX2 * iq3_k_r4: NEON * iq3_k_r4: faster matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>