author | Kawrakow <iwankawrakow@gmail.com> | 2024-10-16 15:18:26 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-10-16 15:18:26 +0300 |
commit | 76b97c80645362ac65a2e33043fd8d46bdaf8c56 (patch) | |
tree | b2b8ab9efb91a6ce4dd9d0fccbc9e11141ca1d80 /ggml/src/ggml-common.h | |
parent | 993ca95e9e3108f0352fa2a3384cab0775c7f7c1 (diff) |
Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP
* iq4_kss: CUDA dequantize works
So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.
* iq4_kss: slightly better quantization
* iq4_kss: another small quantization improvement
* iq4_kss: CUDA works
TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.
* iq4_kss: new bit arrangement - CUDA and Zen4 work
Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.
* iq4_kss: ARM_NEON. Predictably very slow
* iq4_kss: Metal
PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.
* iq4_kss: somewhat faster Metal dot product
45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0
* iq4_kss: AVX2
Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.
* iq4_kss: very slightly faster Metal dot product
48.7 t/s -> 49.3 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-common.h')
-rw-r--r-- | ggml/src/ggml-common.h | 5 |
1 files changed, 5 insertions, 0 deletions
diff --git a/ggml/src/ggml-common.h b/ggml/src/ggml-common.h
index 3a7b8989..f8824b0e 100644
--- a/ggml/src/ggml-common.h
+++ b/ggml/src/ggml-common.h
@@ -448,6 +448,11 @@ typedef struct {
 static_assert(sizeof(block_iq4_ks) == QK_K/32 + QK_K/2, "wrong iq4_ks block size/padding");
 
 typedef struct {
+    uint32_t qs[QK_K/8];
+} block_iq4_kss;
+static_assert(sizeof(block_iq4_kss) == QK_K/8*sizeof(uint32_t), "wrong iq4_kss block size/padding");
+
+typedef struct {
     ggml_half d;
     uint16_t extra;
     uint8_t scales[QK_K/32];
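The "4.0 bpw" in the commit title can be checked directly from the block size in the diff. The sketch below is illustrative only: it assumes `QK_K` is 256 (the value is not shown in this diff), and `IQ4_KS_BLOCK_BYTES` and `bits_per_weight` are hypothetical names introduced here, not part of ggml.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Assumption: QK_K (weights per super-block) is 256; this diff does not show it. */
#define QK_K 256

/* Copied from the diff: the whole block is an array of packed 32-bit words. */
typedef struct {
    uint32_t qs[QK_K/8];
} block_iq4_kss;
static_assert(sizeof(block_iq4_kss) == QK_K/8*sizeof(uint32_t), "wrong iq4_kss block size/padding");

/* Size of block_iq4_ks, taken from its static_assert in the diff context. */
#define IQ4_KS_BLOCK_BYTES (QK_K/32 + QK_K/2)

/* Hypothetical helper: bits stored per weight for a block of the given size. */
static double bits_per_weight(size_t block_bytes) {
    return 8.0 * (double)block_bytes / QK_K;
}
```

Under that assumption, `block_iq4_kss` occupies 128 bytes for 256 weights, so `bits_per_weight(sizeof(block_iq4_kss))` is 4.0, while `bits_per_weight(IQ4_KS_BLOCK_BYTES)` is 4.25 for the 136-byte `block_iq4_ks`. These are per-block figures only; any data stored outside the block structs is not counted.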