Better 1.5 bit quantization (#5971)

* Trying blocks of 16 for IQ1_S - seems slightly better
* iq1s_blocks16: Adjust scale fudge factor to 1.125
* iq1s_blocks16: going to blocks of 32
  with 2048 lattice points, so same bpw. This is even better than blocks of 16.
  Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice
  points, I need to remove blocks altogether and just have superblocks of 256 weights.
* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
* iq1s_blocks16: scalar and AVX2 dot products
* iq1s_blocks16: CUDA dot product
* iq1s_blocks16: Metal works, Neon does not
  Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
  Not seeing the bug in the Neon implementation for now.
* iq1s_blocks16: fixed Neon
* iq1s_blocks16: very slightly faster TG on Metal
  Still pathetic at 37 t/s
* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's
* Formatting
* iq1s_blocks16: uint32_t codebook is also better in CUDA
  TG-128 is now 204 t/s up from 194 t/s.
  PP-512 is 5890 t/s, so significantly better than other quants
* iq1s_blocks16: slightly faster Neon dot product
* iq1s_blocks16: faster AVX2 dot product
* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-03-11 07:51:49 +0100
committer: GitHub <noreply@github.com> 2024-03-11 07:51:49 +0100
commit: be858f620508385ad12d0e5e862010e666ca729c (patch)
tree: 4bdff142eba5a222bddeabf7f3e025550202cac3 /ggml-quants.h
parent: ef3ced26a3817d92890b97b83acaeb018ade02d0 (diff)
1 files changed, 2 insertions, 2 deletions
diff --git a/ggml-quants.h b/ggml-quants.h
index 47dd5285..74aabf41 100644
--- a/ggml-quants.h
+++ b/ggml-quants.h
@@ -217,8 +217,8 @@ static_assert(sizeof(block_iq3_s) == sizeof(ggml_fp16_t) + 13*(QK_K/32) + IQ3S_N
 
 typedef struct {
     ggml_fp16_t d;
-    uint8_t qs[QK_K/8];
-    uint8_t scales[QK_K/16];
+    uint8_t  qs[QK_K/8];
+    uint16_t qh[QK_K/32];
 } block_iq1_s;
 static_assert(sizeof(block_iq1_s) == sizeof(ggml_fp16_t) + QK_K/8 + QK_K/16, "wrong iq1_s block size/padding");
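
The diff swaps `uint8_t scales[QK_K/16]` for `uint16_t qh[QK_K/32]` but leaves the `static_assert` untouched, which works because both arrays occupy the same number of bytes. The sketch below checks that arithmetic and the resulting bits per weight. It assumes `QK_K` is 256 (the super-block size mentioned in the commit message) and uses a `uint16_t` stand-in for `ggml_fp16_t`; the `_old`/`_new` type names and the `bits_per_weight` helper are hypothetical and exist only for this illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

#define QK_K 256                  /* assumption: super-block of 256 weights */
typedef uint16_t ggml_fp16_t;     /* stand-in: 16-bit storage for a half float */

/* Layout before the commit (hypothetical name). */
typedef struct {
    ggml_fp16_t d;
    uint8_t qs[QK_K/8];
    uint8_t scales[QK_K/16];
} block_iq1_s_old;

/* Layout after the commit (hypothetical name). */
typedef struct {
    ggml_fp16_t d;
    uint8_t  qs[QK_K/8];
    uint16_t qh[QK_K/32];
} block_iq1_s_new;

/* QK_K/32 two-byte entries take as many bytes as QK_K/16 one-byte entries,
 * so the same size expression holds for both layouts. */
static_assert(sizeof(block_iq1_s_old) == sizeof(ggml_fp16_t) + QK_K/8 + QK_K/16,
              "wrong old iq1_s block size/padding");
static_assert(sizeof(block_iq1_s_new) == sizeof(ggml_fp16_t) + QK_K/8 + QK_K/16,
              "wrong new iq1_s block size/padding");

/* Storage cost in bits per weight for a block covering QK_K weights. */
static double bits_per_weight(size_t block_bytes) {
    return 8.0 * (double)block_bytes / (double)QK_K;
}
```

Under these assumptions both layouts are 50 bytes per 256 weights, so `bits_per_weight(sizeof(block_iq1_s_new))` evaluates to 1.5625, consistent with the "same bpw" remark in the commit message.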