From 45fae1a14444622478774f9a417e1d417af1ca46 Mon Sep 17 00:00:00 2001
From: Kawrakow <iwankawrakow@gmail.com>
Date: Mon, 14 Jul 2025 18:55:08 +0200
Subject: Adding IQ2_KL (#602)

* Experiments for 2.6875 bpw quants

At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get PP-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be as
(almost) same bpq q2_K is at 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 using
vqtbl2q_u8.

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
---
 examples/quantize/quantize.cpp | 1 +
 1 file changed, 1 insertion(+)

(limited to 'examples/quantize/quantize.cpp')
diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp
index fd657373..7dd02313 100644
--- a/examples/quantize/quantize.cpp
+++ b/examples/quantize/quantize.cpp
@@ -76,6 +76,7 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "IQ2_K_R4", LLAMA_FTYPE_MOSTLY_IQ2_K_R4, "IQ2_K repacked",},
     { "IQ2_KS",   LLAMA_FTYPE_MOSTLY_IQ2_KS,   " 2.1875 bpw non-linear quantization",},
     { "IQ2_KT",   LLAMA_FTYPE_MOSTLY_IQ2_KT,   " 2.125 bpw trellis quantization",   },
+    { "IQ2_KL",   LLAMA_FTYPE_MOSTLY_IQ2_KL,   " 2.69 bpw non-linear quantization", },
     { "IQ3_KS",   LLAMA_FTYPE_MOSTLY_IQ3_KS,   " 3.19 bpw non-linear quantization", },
     { "IQ3_K",    LLAMA_FTYPE_MOSTLY_IQ3_K,    " 3.44 bpw non-linear quantization", },
     { "IQ3_K_R4", LLAMA_FTYPE_MOSTLY_IQ3_K_R4, "IQ3_K repacked", },
-- 
cgit v1.2.3