2024-10-14  Minor iq3_k tweak  (Iwan Kawrakow)

2024-10-14  iq3_k: fix and optimize Metal dot product (#87)  (Kawrakow)
  * iq3_k: fix Metal dot product
    I was accessing the scales as 4-byte aligned, but iq3_k is not 4-byte aligned. Instead of throwing an error (as happens on CUDA when one makes this mistake), Metal silently accepts the access and we get garbage.
  * iq3_k: slightly faster Metal dot product
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

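Roughly what that alignment pitfall looks like in portable C++ terms (a hypothetical block layout for illustration, not the actual iq3_k struct or the Metal kernel):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical block: 2-byte aligned overall, so a uint32_t view of `scales`
// is NOT guaranteed to be 4-byte aligned.
struct block_iq3_k_like {
    uint16_t d;          // fp16 block scale (bits)
    uint8_t  scales[8];  // packed 4-bit scales
    uint8_t  qs[96];     // quantized weights
};

// Unsafe: assumes 4-byte alignment; UB on the CPU, silently wrong on Metal.
inline uint32_t load_scales_aligned(const block_iq3_k_like * b) {
    return *reinterpret_cast<const uint32_t *>(b->scales);  // may misread
}

// Safe: a byte-wise load works for any alignment.
inline uint32_t load_scales_unaligned(const block_iq3_k_like * b) {
    uint32_t v;
    std::memcpy(&v, b->scales, sizeof v);
    return v;
}
```
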
2024-10-13  Fix and optimize iq2_k Metal implementation (#86)  (Kawrakow)
  * I somehow broke iq2_k on Metal? - fix dequantize
  * I somehow broke iq2_k on Metal? - fix dot product
  * iq2_k: optimize Metal dot product (42.6 t/s -> 46.2 t/s)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-13  IQ2_KS: 2.1875 bpw non-linear quantization (#85)  (Kawrakow)
  * Experimenting
  * iq2k: Try make_qx_quants for the scale
    Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5.
  * iq2k with make_qx_quants: adjust scale
  * iq2ks: basics
  * iq2_ks: CUDA works
  * iq2_ks: WIP
  * iq2_ks: WIP
  * iq2_ks: Zen4
  * iq2_ks: AVX2
  * iq2_ks: scalar dot product
  * iq2_ks: ARM_NEON
  * iq2_ks: Metal
  * iq2_ks: faster Metal
    LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s, TG-128 = 45.32 ± 0.03 t/s
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-11  Minor: printf -> LLAMA_LOG_INFO  (Iwan Kawrakow)

2024-10-10  Better model info (#84)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-09  New SOTA quantization: 4.25 bpw IQ4_KS (#83)  (Kawrakow)
  * iq4_k_xxs: basics
  * WIP + adding iq3_kl quantization mix
  * iq4_xxs: this looks very viable compared to iq4_xs
    At the same 4.25 bpw, PPL is always better, for some models significantly better. I'll rename it to iq4_ks and keep it.
  * iq4_xxs: CUDA dot product
    We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.
  * iq4_xxs: scalar CPU dot product
    Also fix the breakage I caused in the dedicated work-buffer quantization path when the multiplication is not done via iqk_mul_mat.
  * iq4_xxs: Zen4
    I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). It is the same mistake again: packing int32_t back to int16_t, which overflows occasionally (just occasionally, which is why the result doesn't look completely wrong, so I didn't notice).
  * Fix iq4_xs (Zen4)
  * iq4_xxs: AVX2
  * iq4_xxs: ARM_NEON
  * iq4_xxs: Metal
  * iq4_xxs: slightly faster TG on Metal
  * iq4_xxs: rename to iq4_ks
    After all, it is a smaller variant of iq4_k.
  * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

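The int16 issue mentioned above can be illustrated with a small, purely hypothetical scalar sketch (the value ranges are assumptions for illustration, not the exact iq4_xs code path; the SIMD code saturates in the pack instruction rather than wrapping, but the magnitude problem is the same):

```cpp
#include <cstdint>
#include <cstdio>

// Why per-block dot products must be accumulated in int32: with 8-bit
// activations (|q8| <= 127) and dequantized 4-bit LUT values that can also
// reach ~127, one 32-element block can sum to roughly 32*127*127 ~= 516k,
// far outside the int16 range of +-32767.
int main() {
    int8_t q8[32], q4[32];
    for (int i = 0; i < 32; ++i) { q8[i] = 127; q4[i] = 127; }  // worst case

    int16_t acc16 = 0;  // wrong: silently wraps (or saturates with packs)
    int32_t acc32 = 0;  // correct
    for (int i = 0; i < 32; ++i) {
        acc16 += (int16_t)(q8[i] * q4[i]);
        acc32 += q8[i] * q4[i];
    }
    std::printf("int16 accumulator: %d, int32 accumulator: %d\n", acc16, acc32);
    return 0;
}
```
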
2024-10-04  Fix compiler warnings  (Iwan Kawrakow)

2024-10-04  Move scale fudge factors to quantization (#81)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-04  Move to C++17 project-wide (#80)  (Kawrakow)
  * Slightly better
  * Make the entire project C++17
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-04  Do not quantize activations if not necessary (#79)  (Kawrakow)
  * Do not quantize activations if not necessary
  * Do not quantize activations if not necessary, also for MoE models
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-02  q6_0: Slightly faster Zen4/AVX2 (#78)  (Kawrakow)
  * Faster q6_0 on AVX2
    PP-512 goes up by 3.4%.
  * q6_0: this is slightly better
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-02  Fused unary(x)*y (#70)  (Kawrakow)
  * Adding fused y*unary(x) op
  * Fused y*unary(x) op: CUDA
  * Fused y*unary(x) op: dedicated CPU implementation for silu and gelu
  * Fused y*unary(x) op: Metal
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

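For orientation, a minimal scalar sketch of what a fused y*silu(x) kernel computes (a generic illustration, not the actual ik_llama.cpp implementation): the point of the fusion is to avoid materializing silu(x) into a temporary tensor and then running a separate element-wise multiply.

```cpp
#include <cmath>
#include <cstddef>

// Fused y * silu(x): one pass over the data instead of two ops and a temp.
static void fused_mul_silu_f32(const float * x, const float * y,
                               float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float s = x[i] / (1.0f + std::exp(-x[i]));  // silu(x) = x*sigmoid(x)
        dst[i] = y[i] * s;                                // fused multiply
    }
}
```
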
2024-10-02  Adding Q6_0 (#77)  (Kawrakow)
  * Adding q6_0 - basics + AVX2/Zen4 working
  * Adding q6_0: CUDA dequantize works, but not mmvq
  * Adding q6_0: CUDA mmvq works
  * Adding q6_0: CUDA cpy, so Q6_0 can be used for the KV cache
  * Add q6_0 to CPU flash attention
    Disappointing result: for LLaMA-3.2-1B, a q6_0 K- and V-cache gives about the same PPL as a q8_0 K-cache with a q4_0 V-cache, while needing exactly the same RAM. I.e., what was the point?
  * q6_0: slightly better kv-cache result
    Better than q8_0+q4_0, but not as good as q8_0+iq4_nl.
  * q6_0: works on ARM_NEON
  * q6_0: dequantize works on Metal, but not the vector dot product
  * q6_0: it now works on Metal
    Outperforms q5_0 by a significant margin. E.g.:

    | model         |     size | params | backend | ngl | threads |  test |           t/s |
    | ------------- | -------: | -----: | ------- | --: | ------: | ----: | ------------: |
    | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  44.02 ± 0.08 |
    | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  40.13 ± 0.12 |
    | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 500.55 ± 0.32 |
    | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 448.02 ± 0.27 |

  * q6_0: can now be used for kv-cache on Metal
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-02  iq4_nl: faster quantization (#76)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-01  Fix Q5_0 flash attention (#75)  (Kawrakow)
  When I changed iqk_mul_mat to use type-1 dot products for type-0 legacy quants, I forgot to also change the vec_dot_type used when the dot product is done via ggml, as in flash attention. This commit fixes that.
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-01  Fix last commit  (Iwan Kawrakow)
  Did not re-check on AVX2/Zen4 after the NEON-related changes and, sure enough, I broke AVX2/Zen4.

2024-10-01  IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74)  (Kawrakow)
  * Be able to use IQ4_NL for the KV cache on AVX2/Zen4
  * Be able to use IQ4_NL for the KV cache on ARM_NEON
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-01  CUDA: faster float -> iq4_nl conversion (#73)  (Kawrakow)
  * iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2
    PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up from 133.2 t/s.
  * Speed up float -> iq4_nl conversion on CUDA
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-01  iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72)  (Kawrakow)
  * iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2
    PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up from 133.2 t/s.
  * Fix AVX2
    In addition to fixing iq4_nl, it seems I never adjusted the AVX2 implementation of iq2_tn to the block-scale removal. This commit also fixes that.
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-01  iqk_mul_mat: better strategy when nrc_y is not divisible by ny (#71)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-29  Allow bf16 kv-cache (#69)  (Kawrakow)
  On the CPU I get exactly the same PPL with and without FA when using bf16 for the kv-cache. But on CUDA the bf16 kv-cache result is about the same as the fp16 kv-cache result on the CPU, so I'm missing some conversion somewhere.
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-28  Time to fix replace_all (#68)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-28  CUDA non-contiguous RoPE (#66)  (Kawrakow)
  This way we can avoid the Q, K, V copies that are otherwise made after the multiplication with the QKV tensor in, e.g., Phi-3.5-mini. This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080).
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

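A rough scalar sketch of the idea (generic code, not the actual CUDA kernel, and using the simple adjacent-pair rotation rather than any model-specific variant): apply the rotation directly to a strided slice of the fused QKV output instead of first copying Q or K into contiguous buffers.

```cpp
#include <cmath>
#include <cstddef>

// Apply RoPE in place to one head's slice of a fused QKV activation.
// `row_stride` is the distance (in floats) between consecutive tokens in the
// fused buffer, so no contiguous copy of Q or K is needed beforehand.
static void rope_inplace_strided(float * qkv, size_t head_offset,
                                 size_t head_dim, size_t n_tokens,
                                 size_t row_stride, float freq_base) {
    for (size_t t = 0; t < n_tokens; ++t) {
        float * v = qkv + t * row_stride + head_offset;  // strided view, no copy
        for (size_t i = 0; i < head_dim; i += 2) {
            const float theta = (float)t * std::pow(freq_base, -(float)i / head_dim);
            const float c = std::cos(theta), s = std::sin(theta);
            const float x0 = v[i], x1 = v[i + 1];
            v[i]     = x0 * c - x1 * s;
            v[i + 1] = x0 * s + x1 * c;
        }
    }
}
```
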
2024-09-28  Adding SWIGLU unary op (#65)  (Kawrakow)
  * Adding GGML_UNARY_OP_SWIGLU
    This commit implements the ggml op and CPU compute forward. I see ~3-4% speedup of PP-512 for Phi-3.5-mini.
  * GGML_UNARY_OP_SWIGLU: CUDA implementation
    I observe ~12% speedup for PP-512(Phi-3.5-mini).
  * GGML_UNARY_OP_SWIGLU: Metal implementation
    We get ~2% speedup for PP-512(Phi-3.5-mini).
  * GGML_UNARY_OP_SWIGLU: minor improvement on Metal
  * GGML_UNARY_OP_SWIGLU: cleanup
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

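Similar in spirit to the fused y*silu(x) sketch above, but as a unary op: the gate and up projections are assumed to live in one fused row, so a single input tensor is enough. The layout (gate in the first half, up in the second) is an assumption for illustration, not the exact ggml kernel.

```cpp
#include <cmath>
#include <cstddef>

// SWIGLU over a fused row: first half holds the gate activations x,
// second half holds the "up" activations g; output is silu(x) * g.
static void swiglu_row_f32(const float * src, float * dst, size_t n_half) {
    const float * x = src;           // gate half (assumed layout)
    const float * g = src + n_half;  // up half (assumed layout)
    for (size_t i = 0; i < n_half; ++i) {
        const float s = x[i] / (1.0f + std::exp(-x[i]));  // silu
        dst[i] = s * g[i];
    }
}
```
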
2024-09-28  Better sub-3-bit quantization mixes with a qkv tensor (#64)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-27  Adding the ability to have meta data per tensor row (#61)  (Kawrakow)
  * POC: per-row scale
    This is a POC of how to work around opinionated ggml to have scales per row rather than per block. Only implemented for Zen4 and only for iq2_tn.
  * POC per-row scale: iq2_tn on NEON
  * POC per-row scale: iq2_tn on Metal
  * Per-row scale Metal templates
  * iq1_tn: shrink to 1.625 bpw (NEON and Metal)
  * POC per-row scale: CUDA
  * POC per-row scale: add CUDA TODOs
    There are two places left in ggml-cuda.cu where it is assumed that type_size * n_per_row / block_size is the way to compute and handle row sizes. This does not affect simple usage, but will lead to issues when tensors are split between GPUs.
  * Per-row scales - CUDA
    The only place left where unnecessary assumptions are being made is the flash attention code. As we are not using any quants with per-row scales for the quantized KV cache, it should be OK for now.
  * Update the IQ1_TN and IQ2_TN bpw shown to the user
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

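For intuition, a hypothetical row layout with per-row metadata and the corresponding row-size computation (names, block size, and bytes per block are illustrative, not the actual iq2_tn definitions):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout of one quantized row with per-row metadata:
//   [ 2-byte f16 row scale ][ (n_per_row / kBlockSize) * kBytesPerBlk quant data ]
constexpr size_t kBlockSize   = 256;  // weights per quant block (illustrative)
constexpr size_t kBytesPerBlk = 64;   // 2 bpw -> 256 * 2 / 8 = 64 bytes (illustrative)

// With per-row metadata, the row size is NOT type_size * n_per_row / block_size;
// the fixed per-row prefix has to be added on top of the block data.
inline size_t row_size_bytes(size_t n_per_row) {
    const size_t prefix = sizeof(uint16_t);  // f16 row scale
    return prefix + (n_per_row / kBlockSize) * kBytesPerBlk;
}
```
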
2024-09-25  Use fp32 for K*Q in Metal FA implementation (#62)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-19  Minor  (Iwan Kawrakow)

2024-09-17  Fix compiler warnings (#58)  (Kawrakow)
  * Fix C++ compilation warnings caused by ggml-common.h
  * Disable c99-extensions warning
    I get tons of those on macOS due to the arm_neon.h header.
  * Disable c99-extensions warning only for APPLE
  * Fix warnings in iqk_quantize.cpp
    Also add GGML_ABORT when an implementation is missing.
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-17  BF16 support on Metal (#56)  (Kawrakow)
  * BF16 support on Metal
  * Faster BF16 Metal dot product
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-16  iqk_mul_mat(ARM_NEON): adding bf16 support (#41)  (Kawrakow)
  It looks like the ARMv8 ISA has support for bf16, but my M2 Max does not have it, so I'm resorting to bf16 -> f32 conversion and doing the computations in f32. This is 2x slower than f16, but 8x better than what I get if I try to run a bf16 model on the M2 (NEON and Metal).
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

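The bf16 -> f32 conversion is cheap because bf16 is simply the upper 16 bits of an IEEE-754 float. A minimal scalar sketch (generic, not the NEON path used in iqk_mul_mat):

```cpp
#include <cstdint>
#include <cstring>

// bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of a
// float, so conversion to f32 is a 16-bit left shift of the bit pattern.
inline float bf16_to_f32(uint16_t h) {
    const uint32_t bits = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```
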
2024-09-15  Minor  (Iwan Kawrakow)

2024-09-14  Adding bf16 support to CUDA (#40)  (Kawrakow)
  * Adding bf16 support to CUDA - matrix multiplications
  * Adding bf16 support to CUDA - cleanup
  * Adapt to latest master
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-14  Improve Q5_0 performance (#55)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-14  Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-14  Quantization mixes tweaks (#53)  (Kawrakow)
  * Some tweaks for i-quants
    Improve Gemma2 PPL while reducing size.
  * Some tweaks for iq2_k and iq3_k
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-13  Minor  (Iwan Kawrakow)

2024-09-13  Fix bug and D < 128 case for Q8_0 k-cache (#52)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-12  Quantized Flash Attention for all supported CPU platforms (#51)  (Kawrakow)
  * NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1
  * NEON Flash Attention: quantized K*Q for q4_0
    I could finally take advantage of the matrix multiplication templates. We get quite a bit of speedup that way for q4_0: for Gemma-2b, using mul_mat_qX_0_q8_0<DequantizerQ40, q_step> results in PP-2048 = 287 t/s vs 268 t/s when converting the q4_0 k-cache and Q to fp16 and using fp16 multiplication.
  * NEON Flash Attention: quantized K*Q for q4_1
  * NEON Flash Attention: quantized K*Q for q8_0
    This makes quite a bit of difference: for Gemma2-2b, PP-8192 is 228 t/s with quantized K*Q vs 178 t/s when converting things to fp16 and using fp16 matrix multiplication. We have PP-512 = 307 t/s, so PP-8192 now runs at ~75% of the PP-512 performance. In contrast, llama.cpp with a Q8_0 cache runs at 38% of PP-512.
  * Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
  * AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
  * Tidy up FlashMS
  * Delete no longer used stuff
    With the use of quantized matrix multiplications for a quantized k- and/or v-cache, we no longer need the helper methods that load entire rows.
  * Disallow mixing bf16 with other types for the kv caches
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

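To make "quantized K*Q" concrete: the query rows are quantized on the fly to 8-bit blocks (one scale plus int8 values), so the dot product against the quantized K cache runs on integers instead of dequantizing K to fp16/fp32 first. A minimal, generic sketch of the q8_0 K-cache case (block size and struct layout are assumptions, not the ik_llama.cpp types):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One q8_0-style block: 32 values, one f32 scale (illustrative layout).
struct Q8Block { float d; int8_t qs[32]; };

// Quantize 32 floats of a query row to int8 with a per-block scale.
static Q8Block quantize_q8_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    Q8Block b;
    b.d = amax / 127.0f;
    const float id = b.d ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 32; ++i) b.qs[i] = (int8_t)std::lround(x[i] * id);
    return b;
}

// K*Q contribution of one block pair: integer dot product, scales applied once.
static float block_dot(const Q8Block & q, const Q8Block & k) {
    int32_t sum = 0;
    for (int i = 0; i < 32; ++i) sum += (int32_t)q.qs[i] * k.qs[i];
    return q.d * k.d * (float)sum;
}
```
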
2024-09-11  AVX2 Flash Attention 2 (#50)  (Kawrakow)
  * AVX2 Flash Attention: add ability to use Q8_0 for kv-cache
  * AVX2 Flash Attention: add ability to use Q4_0 for kv-cache
  * AVX2 Flash Attention: add ability to use Q4_1 for kv-cache
  * Fix Zen4
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-11  ARM_NEON Flash Attention (#49)  (Kawrakow)
  * NEON Flash Attention - first working version
    Simply reuse the Zen4/AVX2 implementation, but use f16 for the K*Q multiplication and the V*softmax(K*Q) accumulation. This makes the FlashMS portion somewhat awkward, because we do not have fast f16 implementations of expf (and of tanh when softcap is enabled), so we need to convert back and forth to f32. FA is slightly faster than no-FA for the 4B TriLM model, but slightly slower for Gemma-2b.
  * NEON Flash Attention - convert Q to f16 before computing Q*K
  * NEON Flash Attention - use fp32 for K*Q operations
    Otherwise I get wrong results for LLaMA-3.1-8B (but it works for Gemma-2b).
  * Delete commented out stuff
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-10  AVX2 Flash Attention (#48)  (Kawrakow)
  * First version of AVX2 Flash Attention
    I simply took the Zen4 implementation and converted the platform-specific stuff into methods of a struct providing data loading/storing, conversions, multiply, add, etc. Most likely not optimal, as the Zen4 strategy was designed around having 32 512-bit registers, so we can basically keep 4x more data in vector registers than on AVX2 with its 16 x 256-bit registers. It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.
  * Fix Zen4 parts broken via the AVX2 change
  * Try smaller q_step - no improvement
  * Fix ARM_NEON
    I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__.
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

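A sketch of the kind of platform-abstraction struct described above, using AVX2 intrinsics (the name AVX2Traits and the method set are made up for illustration; this is not the actual iqk code). The shared flash-attention logic would only call these methods, so the same template can target Zen4 or AVX2 by swapping the traits type.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical per-platform traits struct for the shared FA template.
struct AVX2Traits {
    using vec_t = __m256;
    static constexpr int kWidth = 8;  // floats per register

    static vec_t load(const float * p)           { return _mm256_loadu_ps(p); }
    static void  store(float * p, vec_t v)       { _mm256_storeu_ps(p, v); }
    static vec_t set1(float x)                   { return _mm256_set1_ps(x); }
    static vec_t madd(vec_t a, vec_t b, vec_t c) { return _mm256_fmadd_ps(a, b, c); }
    // Load 8 fp16 values (e.g. from an f16 KV cache) and convert to f32.
    static vec_t load_f16(const uint16_t * p) {
        return _mm256_cvtph_ps(_mm_loadu_si128(reinterpret_cast<const __m128i *>(p)));
    }
};
```
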
2024-09-10  iq2_tn: slightly better performance on AVX2 (#47)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-10  IQ1_TN Metal implementation (#46)  (Kawrakow)
  * iq1_tn: Metal implementation
    Requires changing the get_rows and matrix multiplication kernels to use a dequantizer type rather than a dequantization function. But once this is done, we can simply reuse the iq1_bn implementation. This change will also allow adding other quantization types that have meta data (such as a row scale) stored at the beginning of a row (or changing existing quantization types to row-wise scales).
  * Some cleanup
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-09  Add CUDA support for IQ1_TN (#45)  (Kawrakow)
  * iq1_tn: adding CUDA dequantize
  * iq1_tn: adding CUDA dot product
  * Delete commented out stuff
  * Delete forgotten TODO
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-09  Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)  (Kawrakow)
  * Adding iq1_tn - 1.6875 bpw for TriLM ternary models
  * iq1_tn: NEON
  * iq1_tn: faster NEON
  * iq2_bn: improve performance on NEON
    We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!
  * iq1_tn: improve AVX2
    PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads.
  * iq1_tn: improve Zen4
    PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads.
  * iq2_bn: improve on Zen4
    We now get PP-512 = 614 t/s up from 542 t/s.
  * iq2_bn: improve AVX2 implementation
    We now get PP-512 = 753 t/s up from 680 t/s.
  * Remove unnecessary barrier in ggml_compute_forward_mul_mat
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-08  iq2_tn: slightly faster PP (#43)  (Kawrakow)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-08  Adding fused rms_norm (#42)  (Kawrakow)
  * Fused rms_norm: works on the CPU
  * Fused rms_norm WIP
  * Fused rms_norm WIP
  * Fused rms_norm WIP
  * Fused rms_norm WIP
  * Fused rms_norm WIP
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

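For context, "fused rms_norm" combines the normalization and the multiplication by the weight tensor into one pass over the row, instead of a norm op followed by a separate element-wise multiply. A minimal scalar sketch (generic, not the actual ggml kernel; eps is an assumed default):

```cpp
#include <cmath>
#include <cstddef>

// Fused RMS norm: y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps),
// computed without materializing the normalized row as a temporary.
static void fused_rms_norm_f32(const float * x, const float * w,
                               float * y, size_t n, float eps = 1e-6f) {
    double sumsq = 0.0;
    for (size_t i = 0; i < n; ++i) sumsq += (double)x[i] * x[i];
    const float scale = 1.0f / std::sqrt((float)(sumsq / n) + eps);
    for (size_t i = 0; i < n; ++i) y[i] = w[i] * x[i] * scale;
}
```
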
2024-09-05  Add support for bf16 to iqk_mul_mat (#39)  (Kawrakow)
  * WIP: adding BF16 support to iqk_mul_mat
  * Minor
  * Improve TG speed (when not memory bound)
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>