SOTA 2-bit quants (#4773)

* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products
  Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change this later; for now this is what we have.

* iq2_xxs: ARM_NEON dot product
  Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal
  Dequantize works, something is still wrong with the dot product.

* iq2_xxs: Metal dot product now works
  We have PP-512 = 475 t/s, TG-128 = 47.3 t/s. Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product
  TG-128 is now 48.4 t/s.

* iq2_xxs: slightly faster dot product
  TG-128 is now 50.9 t/s.

* iq2_xxs: even faster Metal dot product
  TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)
  We get TG-128 = 153.1 t/s.

* iq2_xxs: slightly faster CUDA dot product
  TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS
  I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in dequantize_row_iq2_xxs
  The 0.25f factor was missing. Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
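
The Q8_K note above says the quants were moved to the -127...127 range. As an illustration only, the following is a minimal sketch of symmetric 8-bit quantization restricted to that range. The names `QuantizedBlock` and `quantize_symmetric_127` are hypothetical and this is not the ggml implementation. One possible reason the symmetric range helps (not stated in the commit message) is that int8 has no +128, so flipping the sign of -128 would wrap around.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container, not ggml's block_q8_K: one scale plus int8 quants.
struct QuantizedBlock {
    float d;                 // scale: x[i] is approximated by d * q[i]
    std::vector<int8_t> q;   // quantized values, always within [-127, 127]
};

// Sketch of symmetric quantization: the largest magnitude maps to +/-127,
// so every quant can be negated without overflowing int8.
QuantizedBlock quantize_symmetric_127(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) {
        amax = std::max(amax, std::fabs(v));
    }

    QuantizedBlock out;
    out.d = 0.0f;
    out.q.assign(x.size(), 0);
    if (amax == 0.0f) {
        return out;          // all-zero input: scale 0, quants 0
    }

    const float iscale = 127.0f / amax;
    out.d = 1.0f / iscale;
    for (std::size_t i = 0; i < x.size(); ++i) {
        long v = std::lround(x[i] * iscale);
        v = std::max(-127L, std::min(127L, v));   // defensive clamp
        out.q[i] = static_cast<int8_t>(v);
    }
    return out;
}
```

With this scheme, dequantization is simply `out.d * out.q[i]`, and the value -128 never appears in the output.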
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-01-08 16:02:32 +0100
committer: GitHub <noreply@github.com> 2024-01-08 16:02:32 +0100
commit: dd5ae06405c5565b99889bdb3f168f4351252cfb (patch)
tree: 4a7a3ca0dcf7acf48e2248503daa87d66002ab37 /tests/test-quantize-fns.cpp
parent: 668b31fc7d86245435ad6574e0e1126e734049e2 (diff)
1 files changed, 5 insertions, 0 deletions
diff --git a/tests/test-quantize-fns.cpp b/tests/test-quantize-fns.cpp
index a2459a28..cee71261 100644
--- a/tests/test-quantize-fns.cpp
+++ b/tests/test-quantize-fns.cpp
@@ -134,6 +134,11 @@ int main(int argc, char * argv[]) {
             continue;
         }
 
+        if ((ggml_type)i == GGML_TYPE_IQ2_XXS) {
+            printf("Skip %s due to missing quantization functionality\n", ggml_type_name((ggml_type) i));
+            continue;
+        }
+
         printf("Testing %s\n", ggml_type_name((ggml_type) i));
 
         if (qfns.from_float && qfns.to_float) {
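
The hunk in tests/test-quantize-fns.cpp skips a type that has no quantization routine yet instead of letting the round-trip test fail. The pattern can be sketched in isolation as below. `TypeInfo`, `can_quantize`, and `select_testable` are hypothetical names for illustration. The actual test compares the loop index against `GGML_TYPE_IQ2_XXS` directly rather than consulting a flag.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a type table entry; not ggml's type traits.
struct TypeInfo {
    std::string name;
    bool can_quantize;   // false when no quantization routine exists yet
};

// Walks the type table, reports skipped types, and returns the names of
// the types that would actually be exercised by the test loop.
std::vector<std::string> select_testable(const std::vector<TypeInfo> & types) {
    std::vector<std::string> tested;
    for (const TypeInfo & t : types) {
        if (!t.can_quantize) {
            std::printf("Skip %s due to missing quantization functionality\n",
                        t.name.c_str());
            continue;
        }
        std::printf("Testing %s\n", t.name.c_str());
        tested.push_back(t.name);
    }
    return tested;
}
```

Printing the skip, rather than silently continuing, keeps the gap visible in the test log until the missing functionality is added.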