IQ4_K: SOTA 4-bit quantization (#6)

* iq4_k: basics

  * quantize/dequantize works
  * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16.
  * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind
  * iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

  For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

  For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X.

* iq4_k: NEON implementation

  For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

  For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
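The perplexity comparison in the message is given only as percentages above fp16. As a rough sketch of the arithmetic, the helpers below recover the fp16 baseline implied by the IQ4_K figure and the q4_K_S perplexity implied by that baseline. The function names are hypothetical, and the derived values are back-of-the-envelope estimates rather than numbers reported in the commit.

```c
/* Hypothetical helpers, not part of ggml. */

/* Baseline implied by a value that sits pct_above percent above it. */
double baseline_from_relative(double value, double pct_above) {
    return value / (1.0 + pct_above / 100.0);
}

/* Value that sits pct_above percent above a given baseline. */
double value_from_baseline(double baseline, double pct_above) {
    return baseline * (1.0 + pct_above / 100.0);
}
```

Calling `baseline_from_relative(6.5258, 1.77)` gives an implied fp16 perplexity of roughly 6.41, and `value_from_baseline` with that baseline and 2.88 gives roughly 6.60 for q4_K_S, assuming both percentages refer to the same fp16 run.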
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-07-28 12:11:59 +0200
committer: GitHub <noreply@github.com> 2024-07-28 12:11:59 +0200
commit: 291066e6df5318c322a03e592483aae8820d3b19 (patch)
tree: 1c8cafa8d0bc73c3aa39c71ab53b53eb307d3774 /ggml/src/ggml-quants.c
parent: f62615b44f7df586cb58ed9fffca59b96820117b (diff)
1 files changed, 1 insertions, 0 deletions
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index da4c9b9a..fef124c3 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -14947,6 +14947,7 @@ bool ggml_validate_row_data(enum ggml_type type, const void * data, size_t nbyte
             {
                 VALIDATE_ROW_DATA_D_F16_IMPL(block_iq4_nl, data, nb);
             } break;
+        case GGML_TYPE_IQ4_K: break;
         case GGML_TYPE_Q4_0_4_4:
         case GGML_TYPE_Q4_0_4_8:
             {
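The added `case GGML_TYPE_IQ4_K: break;` appears to make the validator accept IQ4_K rows without inspecting their contents, whereas neighbouring cases check each block's scale. A minimal standalone sketch of that pattern follows. All `demo_*` names and the block layout are hypothetical stand-ins, not the real ggml definitions, and the sketch assumes that control reaching the end of the switch means success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ggml types; not the real ggml enum. */
enum demo_type { DEMO_TYPE_CHECKED, DEMO_TYPE_UNCHECKED, DEMO_TYPE_UNKNOWN };

/* Hypothetical block: one scale followed by packed quants. */
typedef struct {
    float   d;
    uint8_t qs[16];
} demo_block;

/* True for finite values: NaN fails x == x, infinities fail x - x == 0. */
static bool demo_is_finite(float x) {
    return x == x && x - x == 0.0f;
}

bool demo_validate_row_data(enum demo_type type, const void * data, size_t nbytes) {
    assert(data != NULL || nbytes == 0);
    switch (type) {
        case DEMO_TYPE_CHECKED:
            {
                /* Rows must be a whole number of blocks with finite scales. */
                if (nbytes % sizeof(demo_block) != 0) {
                    return false;
                }
                const demo_block * blocks = (const demo_block *) data;
                const size_t nb = nbytes / sizeof(demo_block);
                for (size_t i = 0; i < nb; ++i) {
                    if (!demo_is_finite(blocks[i].d)) {
                        return false;
                    }
                }
            } break;
        /* Mirrors the added line: accepted without looking at the data. */
        case DEMO_TYPE_UNCHECKED: break;
        default:
            return false;
    }
    return true;
}
```

With this layout, a row holding a non-finite scale is rejected for `DEMO_TYPE_CHECKED` but passes for `DEMO_TYPE_UNCHECKED`, which is the trade-off a bare `break` case implies until a dedicated check is written.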