author | Kawrakow <iwankawrakow@gmail.com> | 2024-12-06 12:15:39 +0100
committer | GitHub <noreply@github.com> | 2024-12-06 12:15:39 +0100
commit | 3682e4700db6b8cb2ca8e3da365578078f21ab0c (patch)
tree | ea1680494ca00580b0a038cdef035c596e80e58c /examples
parent | f64de08203aaee95ca755336de3e1db85d990198 (diff)
iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)
* Adding iq2_bn_r4
This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.
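For illustration only, a minimal sketch of the "_r4" idea: repack 4 consecutive rows so that the same block position of all 4 rows sits contiguously in memory, letting a SIMD kernel process 4 rows per pass over the activations. The helper name and the flat byte-copy layout below are assumptions for clarity, not the actual iq2_bn_r4 block layout.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical helper: interleave 4 rows of quantized blocks so that
    // block k of rows r..r+3 becomes contiguous. The real iq2_bn_r4 layout
    // is more involved, but this is the general "4 interleaved rows" idea.
    std::vector<uint8_t> interleave_4_rows(const uint8_t * rows[4],
                                           size_t blocks_per_row, size_t block_size) {
        std::vector<uint8_t> out(4 * blocks_per_row * block_size);
        for (size_t k = 0; k < blocks_per_row; ++k) {
            for (int r = 0; r < 4; ++r) {
                std::memcpy(out.data() + (4*k + r) * block_size,
                            rows[r] + k * block_size, block_size);
            }
        }
        return out;
    }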
* Make sure rows per thread are a multiple of the number of interleaved rows
With this, I can run iq2_bn_r4 with 32 threads, which increases
PP-512 to 872 t/s.
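A minimal sketch of the work-splitting constraint, assuming hypothetical parameter names (nrows, nthreads, interleave, ith); the actual scheduling code differs, but the rounding idea is the same.

    #include <algorithm>
    #include <cstdint>

    // Hand each thread a row range whose size is a multiple of the interleave
    // factor (4 for the *_r4 types), so no thread starts in the middle of an
    // interleaved group of rows. Assumes nrows is a multiple of interleave.
    static void split_rows(int64_t nrows, int nthreads, int interleave, int ith,
                           int64_t & first_row, int64_t & last_row) {
        const int64_t ngroups = nrows / interleave;
        const int64_t groups_per_thread = (ngroups + nthreads - 1) / nthreads;
        first_row = std::min(nrows, ith * groups_per_thread * interleave);
        last_row  = std::min(nrows, first_row + groups_per_thread * interleave);
    }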
* iq2_bn_r4: 1st shot at NEON
PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.
* iq2_bn_r4: NEON
PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.
* iq2_bn_r4: Experimenting on NEON
Matrix x vector multiplication performance is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads,
not 93 as I wrote in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.
* Some cleanup
* iq2_bn_r4: AVX2
As expected, PP is slightly slower as we just don't have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).
* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector
It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.
* iq2_bn_r4: simdify q8_K16 quantization (AVX2)
PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.
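For reference, a rough AVX2 sketch of the kind of SIMD activation quantization meant here: a block of 16 floats reduced to int8 with one per-block scale. This is only an illustration with assumed names; it is not the actual q8_K16 format or code.

    #include <immintrin.h>
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Quantize 16 activations to int8 with a single per-block scale, doing the
    // scale/round/narrow steps with AVX2 instead of a scalar loop.
    static void quantize_block16_avx2(const float * x, int8_t * q, float & scale) {
        float amax = 0.f;
        for (int j = 0; j < 16; ++j) amax = std::max(amax, std::fabs(x[j]));
        scale = amax / 127.f;
        const __m256 vid = _mm256_set1_ps(scale > 0.f ? 1.f/scale : 0.f);
        // cvtps_epi32 rounds to nearest under the default MXCSR rounding mode.
        __m256i i0 = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 0), vid));
        __m256i i1 = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 8), vid));
        // Narrow 16 x int32 -> 16 x int16, fix the lane order, then -> 16 x int8.
        __m256i i16 = _mm256_permute4x64_epi64(_mm256_packs_epi32(i0, i1), 0xD8);
        __m128i i8  = _mm_packs_epi16(_mm256_castsi256_si128(i16),
                                      _mm256_extracti128_si256(i16, 1));
        _mm_storeu_si128((__m128i *)q, i8);
    }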
* iq2_bn_r4: simdify q8_K16 quantization (NEON)
PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s).
* iq2_bn_r4: fix AVX2 after breaking it two commits ago
* iq2_bn_r4: better AVX2
As we don't have enough vector registers on AVX2, it is better
to do two passes per row, which needs only half of the
accumulator registers.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.
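A hypothetical scalar sketch of the two-pass restructuring (kernel shape and names invented for illustration, not the real AVX2 kernel): sweeping the input twice over half of the output rows means only half as many accumulators are live at a time, so they can all stay in registers.

    // c[0..NCOLS) = dot products of vector a (length n) with NCOLS rows of b
    // (row-major, b[j*n + k]). Each pass handles NCOLS/2 outputs, so only
    // NCOLS/2 accumulators are needed at once.
    template <int NCOLS>
    void mul_mat_vec_two_pass(int n, const float * a, const float * b, float * c) {
        constexpr int HALF = NCOLS / 2;
        for (int pass = 0; pass < 2; ++pass) {
            float acc[HALF] = {0};   // half the accumulators per pass
            for (int k = 0; k < n; ++k) {
                for (int j = 0; j < HALF; ++j) {
                    acc[j] += a[k] * b[(pass*HALF + j)*n + k];
                }
            }
            for (int j = 0; j < HALF; ++j) c[pass*HALF + j] = acc[j];
        }
    }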
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
-rw-r--r-- | examples/quantize/quantize.cpp | 1 |
1 file changed, 1 insertion, 0 deletions
diff --git a/examples/quantize/quantize.cpp b/examples/quantize/quantize.cpp
index f8ce3edd..89fb8464 100644
--- a/examples/quantize/quantize.cpp
+++ b/examples/quantize/quantize.cpp
@@ -29,6 +29,7 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "IQ1_M",   LLAMA_FTYPE_MOSTLY_IQ1_M,   " 1.75 bpw quantization", },
     { "IQ1_BN",  LLAMA_FTYPE_MOSTLY_IQ1_BN,  " 1.62 bpw quantization (Bitnet)", },
     { "IQ2_BN",  LLAMA_FTYPE_MOSTLY_IQ2_BN,  " 2.00 bpw quantization (Bitnet)", },
+    { "IQ2_BN_R4",LLAMA_FTYPE_MOSTLY_IQ2_BN_R4," 2.00 bpw quantization (Bitnet)", },
     { "Q2_K",    LLAMA_FTYPE_MOSTLY_Q2_K,    " 2.63G, +0.6717 ppl @ LLaMA-v1-7B", },
     { "Q2_K_S",  LLAMA_FTYPE_MOSTLY_Q2_K_S,  " 2.16G, +9.0634 ppl @ LLaMA-v1-7B", },
     { "IQ3_XXS", LLAMA_FTYPE_MOSTLY_IQ3_XXS, " 3.06 bpw quantization", },