SOTA 3-bit quants (#5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-01-30 15:14:12 +0200
committer: GitHub <noreply@github.com> 2024-01-30 15:14:12 +0200
commit: f4d7e5497485ce6ce0e322533930b7da4657dd2d (patch)
tree: 78b30048cb4a9c78d5cf3e231a1ac3e9ed190577 /examples/quantize/quantize.cpp
parent: 2256f36b79a932a478d4dcdf02c1e5a60056e5f3 (diff)
1 files changed, 1 insertions, 0 deletions
diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp
index 0236f218..a9673f0d 100644
--- a/examples/quantize/quantize.cpp
+++ b/examples/quantize/quantize.cpp
@@ -25,6 +25,7 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "IQ2_XS", LLAMA_FTYPE_MOSTLY_IQ2_XS, " 2.31 bpw quantization",            },
     { "Q2_K",   LLAMA_FTYPE_MOSTLY_Q2_K,   " 2.63G, +0.6717 ppl @ LLaMA-v1-7B", },
     { "Q2_K_S", LLAMA_FTYPE_MOSTLY_Q2_K_S, " 2.16G, +9.0634 ppl @ LLaMA-v1-7B", },
+    { "IQ3_XXS",LLAMA_FTYPE_MOSTLY_IQ3_XXS," 3.06 bpw quantization",            },
     { "Q3_K",   LLAMA_FTYPE_MOSTLY_Q3_K_M, "alias for Q3_K_M" },
     { "Q3_K_XS",LLAMA_FTYPE_MOSTLY_Q3_K_XS,"3-bit extra small quantization"   , },
     { "Q3_K_S", LLAMA_FTYPE_MOSTLY_Q3_K_S, " 2.75G, +0.5551 ppl @ LLaMA-v1-7B", },
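
The hunk above adds one row to a table of `{ name, ftype, description }` entries. As a rough illustration of how such a table might be consulted when a user passes a quantization name on the command line, here is a minimal sketch. The names `example_ftype`, `example_quant_option`, `EXAMPLE_OPTIONS`, and `example_find_ftype` are hypothetical stand-ins invented for this sketch and are not taken from the patch; the real lookup in `quantize.cpp` may differ (for example in how it handles case or numeric ids).

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the real ftype enum; values are illustrative only.
enum example_ftype {
    EXAMPLE_FTYPE_Q2_K,
    EXAMPLE_FTYPE_IQ3_XXS,
    EXAMPLE_FTYPE_Q3_K_M,
};

// Mirrors the shape of the rows in the diff: name, ftype, description.
struct example_quant_option {
    std::string   name;
    example_ftype ftype;
    std::string   desc;
};

// A reduced copy of the table, including the row this commit adds.
static const std::vector<example_quant_option> EXAMPLE_OPTIONS = {
    { "Q2_K",    EXAMPLE_FTYPE_Q2_K,    " 2.63G, +0.6717 ppl @ LLaMA-v1-7B", },
    { "IQ3_XXS", EXAMPLE_FTYPE_IQ3_XXS, " 3.06 bpw quantization",            },
    { "Q3_K",    EXAMPLE_FTYPE_Q3_K_M,  "alias for Q3_K_M",                  },
};

// Linear search by exact name; returns true and sets `out` on a match.
static bool example_find_ftype(const std::string & name, example_ftype & out) {
    for (const auto & opt : EXAMPLE_OPTIONS) {
        if (opt.name == name) {
            out = opt.ftype;
            return true;
        }
    }
    return false;
}
```

With a table like this, supporting a new type on the command line only requires a new row, which is consistent with the one-line change in this file.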