author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-02-24 16:23:52 +0200
committer | GitHub <noreply@github.com> | 2024-02-24 16:23:52 +0200
commit | 4c4cb30736582cacb1a164a9d4bc8e17b1014be7 (patch)
tree | 5f953370b3124531d9cbb9b9d5cfdb264ddf60bc /tests/test-quantize-fns.cpp
parent | 525213d2f5da1eaf4b922b6b792cb52b2c613368 (diff)
IQ3_S: a much better alternative to Q3_K (#5676)
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iq3_xs: a 3.4375 bpw variant (see the bpw arithmetic sketched after this list)
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasant surprise: the code was super-block-size independent,
so all it took was deleting some QK_K == 256 guards.
* Will this fix ROCm?
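An aside on the 3.4375 bpw figure mentioned in the list: it is consistent with an IQ3_S super-block spending, per 256 weights, 16 bits on an fp16 super-block scale, 512 bits on packed grid indices, 64 bits on the high bits of those indices, 256 bits on per-weight signs, and 32 bits on 4-bit block scales, for 880 bits total and 880/256 = 3.4375. The struct below is a sketch modeled on ggml's block_iq3_s; treat the field sizes as assumptions used only for this arithmetic, not the canonical definition.

```cpp
#include <cstdint>
#include <cstdio>

constexpr int QK_K = 256;  // weights per super-block

// Sketch of an IQ3_S super-block, modeled on ggml's block_iq3_s
// (field sizes are assumptions for the bpw arithmetic only).
struct block_iq3_s_sketch {
    uint16_t d;               // fp16 super-block scale:         16 bits
    uint8_t  qs[QK_K/4];      // packed grid indices:           512 bits
    uint8_t  qh[QK_K/32];     // high bits of the grid indices:  64 bits
    uint8_t  signs[QK_K/8];   // one sign bit per weight:       256 bits
    uint8_t  scales[QK_K/64]; // 4-bit block scales:             32 bits
};

int main() {
    // (16 + 512 + 64 + 256 + 32) bits / 256 weights = 880/256 = 3.4375 bpw
    printf("%.4f bpw\n", 8.0 * sizeof(block_iq3_s_sketch) / QK_K);
}
```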
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'tests/test-quantize-fns.cpp')
-rw-r--r-- | tests/test-quantize-fns.cpp | 4
1 file changed, 3 insertions(+), 1 deletion(-)
```diff
diff --git a/tests/test-quantize-fns.cpp b/tests/test-quantize-fns.cpp
index 5e92d574..04656bb9 100644
--- a/tests/test-quantize-fns.cpp
+++ b/tests/test-quantize-fns.cpp
@@ -151,6 +151,7 @@ int main(int argc, char * argv[]) {
         const float max_quantization_error =
             type == GGML_TYPE_Q2_K ? MAX_QUANTIZATION_TOTAL_ERROR_2BITS :
             type == GGML_TYPE_Q3_K ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS :
+            type == GGML_TYPE_IQ3_S ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS :
             type == GGML_TYPE_IQ3_XXS ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS_XXS : MAX_QUANTIZATION_TOTAL_ERROR;
         failed = !(total_error < max_quantization_error);
         num_failed += failed;
@@ -167,7 +168,8 @@ int main(int argc, char * argv[]) {
 
         const float vec_dot_error = dot_product_error(qfns, test_size, test_data.data(), test_data2.data());
         const float max_allowed_error = type == GGML_TYPE_Q2_K || type == GGML_TYPE_IQ2_XS || type == GGML_TYPE_IQ2_XXS ||
-                                        type == GGML_TYPE_IQ3_XXS ? MAX_DOT_PRODUCT_ERROR_LOWBIT : MAX_DOT_PRODUCT_ERROR;
+                                        type == GGML_TYPE_IQ3_XXS || type == GGML_TYPE_IQ3_S ? MAX_DOT_PRODUCT_ERROR_LOWBIT
+                                        : MAX_DOT_PRODUCT_ERROR;
         failed = !(vec_dot_error < max_allowed_error);
         num_failed += failed;
         if (failed || verbose) {
```
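In plain terms, the test change registers the new IQ3_S type with the relaxed 3-bit error budgets: its RMS quantization error is checked against the same budget as Q3_K, and its vec_dot error against the looser low-bit threshold. Below is a minimal standalone sketch of that selection logic; the enum and threshold values are illustrative stand-ins (the real constants are defined at the top of test-quantize-fns.cpp and may differ).

```cpp
#include <cstdio>

// Stand-in subset of ggml's type enum; values here are assumptions
// for illustration, not the canonical ggml definitions.
enum Type { Q2_K, Q3_K, IQ2_XS, IQ2_XXS, IQ3_XXS, IQ3_S, OTHER };

// Illustrative error budgets, approximating the test's MAX_* constants.
constexpr float MAX_QUANTIZATION_TOTAL_ERROR           = 0.002f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_2BITS     = 0.0075f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_3BITS     = 0.0040f;
constexpr float MAX_QUANTIZATION_TOTAL_ERROR_3BITS_XXS = 0.0050f;
constexpr float MAX_DOT_PRODUCT_ERROR        = 0.02f;
constexpr float MAX_DOT_PRODUCT_ERROR_LOWBIT = 0.04f;

// Mirrors the ternary chain after this commit: IQ3_S gets the Q3_K budget.
constexpr float max_quantization_error(Type t) {
    return t == Q2_K    ? MAX_QUANTIZATION_TOTAL_ERROR_2BITS :
           t == Q3_K    ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS :
           t == IQ3_S   ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS :
           t == IQ3_XXS ? MAX_QUANTIZATION_TOTAL_ERROR_3BITS_XXS :
                          MAX_QUANTIZATION_TOTAL_ERROR;
}

// Low-bit types, now including IQ3_S, get the looser dot-product budget.
constexpr float max_dot_product_error(Type t) {
    return (t == Q2_K || t == IQ2_XS || t == IQ2_XXS ||
            t == IQ3_XXS || t == IQ3_S) ? MAX_DOT_PRODUCT_ERROR_LOWBIT
                                        : MAX_DOT_PRODUCT_ERROR;
}

int main() {
    printf("IQ3_S budgets: rms %.4f, dot %.2f\n",
           max_quantization_error(IQ3_S), max_dot_product_error(IQ3_S));
}
```

A test run fails a type when its measured error meets or exceeds the selected budget, so adding IQ3_S to both chains is what lets the new format pass alongside the other sub-4-bit quants.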