diff options
author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-01-08 16:02:32 +0100 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-01-08 16:02:32 +0100 |
commit | dd5ae06405c5565b99889bdb3f168f4351252cfb (patch) | |
tree | 4a7a3ca0dcf7acf48e2248503daa87d66002ab37 /tests/test-quantize-fns.cpp | |
parent | 668b31fc7d86245435ad6574e0e1126e734049e2 (diff) |
SOTA 2-bit quants (#4773)
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.
* iq2_xxs: ARM_NEON dot product
Somehow strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slighty faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slighty faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call at the wrong place.
* Fix bug in qequantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'tests/test-quantize-fns.cpp')
-rw-r--r-- | tests/test-quantize-fns.cpp | 5 |
1 files changed, 5 insertions, 0 deletions
diff --git a/tests/test-quantize-fns.cpp b/tests/test-quantize-fns.cpp index a2459a28..cee71261 100644 --- a/tests/test-quantize-fns.cpp +++ b/tests/test-quantize-fns.cpp @@ -134,6 +134,11 @@ int main(int argc, char * argv[]) { continue; } + if ((ggml_type)i == GGML_TYPE_IQ2_XXS) { + printf("Skip %s due to missing quantization functionality\n", ggml_type_name((ggml_type) i)); + continue; + } + printf("Testing %s\n", ggml_type_name((ggml_type) i)); if (qfns.from_float && qfns.to_float) { |