path: root/gguf-py/gguf/constants.py
author	Kawrakow <iwankawrakow@gmail.com>	2025-07-14 18:55:08 +0200
committer	GitHub <noreply@github.com>	2025-07-14 18:55:08 +0200
commit	45fae1a14444622478774f9a417e1d417af1ca46 (patch)
tree	2609ef06be5640749834d4fc691446771ab29f42 /gguf-py/gguf/constants.py
parent	f5353047ef461e6fc9d527e09a06c9802c699929 (diff)
Adding IQ2_KL (#602)
* Experiments for 2.6875 bpw quants. At least according to RMSE, this is significantly better than q2_K, while using only 1/16 more bits per weight.
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL. Also check the two neighbouring values for the block scale and use the one that minimizes RMSE.
* iq2_kl: MMQ. Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ. We get PP-128(L3-8B) = 162 t/s, which is not quite as good as it should be, since q2_K at (almost) the same bpw is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV. Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON. The compiler started crashing!!!
* iq2_kl: NEON. Had to work around a compiler crash when using vzip2q_u8 by using vqtbl2q_u8 instead.
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize. PP-512 goes to 476 t/s, up from 466 t/s.
* iq2_kl: slightly better Metal dequantize. PP-512 goes to 492 t/s, up from 476 t/s.
* Add iq2_kl to constants.py

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
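As a quick cross-check of the figures above: the GGML_QUANT_SIZES entry added below, (256, 86), corresponds exactly to 2.6875 bits per weight, and the gap to q2_K's existing (256, 84) entry is the quoted 1/16 bit. A minimal Python sketch of that arithmetic (variable names are illustrative only):

    # IQ2_KL block geometry from the diff below: 256 weights stored in 86 bytes.
    iq2_kl_block, iq2_kl_bytes = 256, 86
    q2_k_block, q2_k_bytes = 256, 84               # q2_K entry already in constants.py

    iq2_kl_bpw = iq2_kl_bytes * 8 / iq2_kl_block   # 2.6875 bpw
    q2_k_bpw = q2_k_bytes * 8 / q2_k_block         # 2.625 bpw
    print(iq2_kl_bpw, iq2_kl_bpw - q2_k_bpw)       # 2.6875 0.0625 (= 1/16 bit)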
Diffstat (limited to 'gguf-py/gguf/constants.py')
-rw-r--r--	gguf-py/gguf/constants.py	2
1 file changed, 2 insertions, 0 deletions
diff --git a/gguf-py/gguf/constants.py b/gguf-py/gguf/constants.py
index 0fe3ed35..767637c5 100644
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -1321,6 +1321,7 @@ class GGMLQuantizationType(IntEnum):
IQ3_KT = 154
IQ4_KT = 155
IQ3_KS = 156
+ IQ2_KL = 157
Q4_0_R8 = 202
Q5_0_R4 = 206
Q8_0_R8 = 208
@@ -1537,6 +1538,7 @@ GGML_QUANT_SIZES: dict[GGMLQuantizationType, tuple[int, int]] = {
GGMLQuantizationType.IQ3_KT : ( 256, 100),
GGMLQuantizationType.IQ4_KT : ( 256, 128),
GGMLQuantizationType.IQ3_KS : ( 256, 102),
+ GGMLQuantizationType.IQ2_KL : ( 256, 86),
GGMLQuantizationType.Q4_0_R8 : ( 32, 18),
GGMLQuantizationType.Q5_0_R4 : ( 32, 22),
GGMLQuantizationType.Q8_0_R8 : ( 32, 34),
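For context, a minimal usage sketch of the two entries this diff adds, assuming a gguf-py checkout that already contains the change (the tensor shape is made up for illustration):

    # Look up the new IQ2_KL type and size a hypothetical tensor with it.
    from gguf.constants import GGMLQuantizationType, GGML_QUANT_SIZES

    qtype = GGMLQuantizationType.IQ2_KL                 # value 157, added above
    block_size, type_size = GGML_QUANT_SIZES[qtype]     # (256, 86)

    n_elements = 4096 * 14336                           # illustrative weight matrix
    assert n_elements % block_size == 0                 # must pack into whole blocks
    n_bytes = (n_elements // block_size) * type_size
    print(f"{qtype.name}: {type_size * 8 / block_size} bpw, {n_bytes} bytes")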