diff options
author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-02-27 16:34:24 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-02-27 16:34:24 +0200 |
commit | 0becb22ac05b6542bd9d5f2235691aa1d3d4d307 (patch) | |
tree | 17d2a402b346779a611e917a71ce5b84e1aa43e8 /examples/quantize/quantize.cpp | |
parent | c24a2a6e6005e5d424301525a42ba45a4a362d30 (diff) |
IQ4_XS: a 4.25 bpw quantization (#5747)
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples/quantize/quantize.cpp')
-rw-r--r-- | examples/quantize/quantize.cpp | 3 |
1 files changed, 2 insertions, 1 deletions
diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp index 2d187823..7662ec80 100644 --- a/examples/quantize/quantize.cpp +++ b/examples/quantize/quantize.cpp @@ -36,7 +36,8 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = { { "Q3_K_S", LLAMA_FTYPE_MOSTLY_Q3_K_S, " 2.75G, +0.5551 ppl @ LLaMA-v1-7B", }, { "Q3_K_M", LLAMA_FTYPE_MOSTLY_Q3_K_M, " 3.07G, +0.2496 ppl @ LLaMA-v1-7B", }, { "Q3_K_L", LLAMA_FTYPE_MOSTLY_Q3_K_L, " 3.35G, +0.1764 ppl @ LLaMA-v1-7B", }, - { "IQ4_NL", LLAMA_FTYPE_MOSTLY_IQ4_NL, " 4.25 bpw non-linear quantization", }, + { "IQ4_NL", LLAMA_FTYPE_MOSTLY_IQ4_NL, " 4.50 bpw non-linear quantization", }, + { "IQ4_XS", LLAMA_FTYPE_MOSTLY_IQ4_XS, " 4.25 bpw non-linear quantization", }, { "Q4_K", LLAMA_FTYPE_MOSTLY_Q4_K_M, "alias for Q4_K_M", }, { "Q4_K_S", LLAMA_FTYPE_MOSTLY_Q4_K_S, " 3.59G, +0.0992 ppl @ LLaMA-v1-7B", }, { "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M, " 3.80G, +0.0532 ppl @ LLaMA-v1-7B", }, |