diff options
author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-07-28 12:11:59 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-07-28 12:11:59 +0200 |
commit | 291066e6df5318c322a03e592483aae8820d3b19 (patch) | |
tree | 1c8cafa8d0bc73c3aa39c71ab53b53eb307d3774 /ggml/src/ggml-impl.h | |
parent | f62615b44f7df586cb58ed9fffca59b96820117b (diff) |
IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.
* iq4_k: NEON implementation
For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-impl.h')
0 files changed, 0 insertions, 0 deletions