author | Kawrakow <iwankawrakow@gmail.com> | 2024-12-22 10:52:56 +0100 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-12-22 10:52:56 +0100 |
commit | 907cde6be257d295e720cece8b8fb999072befa1 (patch) | |
tree | 1437a8df713c17775590d90f88d211f54b1d9da3 /examples/llama-bench/CMakeLists.txt | |
parent | 93419de68f90fede135480a2717785d519df9f42 (diff) |
R4 i-quants improvements (#157)
* Add nrc_y = 16 implementation.
Here just iq2_s on Zen4. PP-512 goes up from 148.5 t/s to
169.5 t/s. Since we know that we will be multiplying with 16
columns, we can afford to spend the time to add the mins and make
the iq2_s quants unsigned.
* nrc_y = 16: AVX2 iq2_s
We go from 176.8 to 203.3 t/s.
* nrc_y = 16: NEON iq2_s
We go from 50.4 to 62.3 t/s.
We didn't need to do anything other than set func16 to
mul_mat_iq2_s_r4_q8_k<16>. Even though we do not have nearly enough
vector registers to hold all accumulators, unpacking and preparing
the iq2_s quants is so expensive that we still gain ~23% in performance
by reusing the unpacked quants 16 times instead of just 8, despite
having to move the accumulated results in and out of the
available vector registers.
* nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs
iq2_xxs: 76.34 -> 85.33 t/s
iq2_xs: 54.13 -> 67.99 t/s
iq3_xxs: 67.45 -> 73.56 t/s
* nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs
iq2_xxs: 195.7 -> 221.8 t/s
iq2_xs: 192.6 -> 220.6 t/s
iq3_xxs: 184.4 -> 206.9 t/s
* r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
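The reuse argument in the NEON iq2_s note (unpack once, multiply against 16 columns instead of 8) can be illustrated with a scalar sketch. This is not the kernel from this commit: `unpack_block`, `mul_row_tile`, `mul_row`, the call counter, and the toy 2-bit packing are all hypothetical stand-ins chosen only to show how the column tile width `nrc_y` controls how often the unpacking cost is paid.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how often the (assumed expensive) unpacking step runs.
static std::size_t g_unpack_calls = 0;

// Hypothetical stand-in for quant unpacking: one byte -> four 2-bit values.
static std::array<int8_t, 4> unpack_block(uint8_t packed) {
    ++g_unpack_calls;
    return { int8_t(packed & 3), int8_t((packed >> 2) & 3),
             int8_t((packed >> 4) & 3), int8_t((packed >> 6) & 3) };
}

// Dot products of one packed row with nrc_y activation columns.
// Each column holds 4 * n_blocks int8 values.
template <int nrc_y>
void mul_row_tile(const uint8_t* x, std::size_t n_blocks,
                  const int8_t* const* y, int32_t* out) {
    int32_t acc[nrc_y] = {};
    for (std::size_t ib = 0; ib < n_blocks; ++ib) {
        const auto q = unpack_block(x[ib]);      // unpacked once per block ...
        for (int iy = 0; iy < nrc_y; ++iy) {     // ... and reused nrc_y times
            const int8_t* yb = y[iy] + 4 * ib;
            for (int k = 0; k < 4; ++k) acc[iy] += int32_t(q[k]) * yb[k];
        }
    }
    for (int iy = 0; iy < nrc_y; ++iy) out[iy] = acc[iy];
}

// Processes all columns in tiles of nrc_y (assumes y.size() % nrc_y == 0).
template <int nrc_y>
std::vector<int32_t> mul_row(const std::vector<uint8_t>& x,
                             const std::vector<std::vector<int8_t>>& y) {
    std::vector<int32_t> out(y.size(), 0);
    for (std::size_t c0 = 0; c0 + nrc_y <= y.size(); c0 += nrc_y) {
        const int8_t* cols[nrc_y];
        for (int iy = 0; iy < nrc_y; ++iy) cols[iy] = y[c0 + iy].data();
        mul_row_tile<nrc_y>(x.data(), x.size(), cols, out.data() + c0);
    }
    return out;
}
```

With 16 columns, `mul_row<8>` unpacks every block twice while `mul_row<16>` unpacks it once, and both produce the same results. In a real SIMD kernel the 16 accumulators may not all fit in registers, which is the trade-off the commit message describes.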
Diffstat (limited to 'examples/llama-bench/CMakeLists.txt')
0 files changed, 0 insertions, 0 deletions
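The first item of the commit message mentions adding the mins to make the iq2_s quants unsigned. One common way to read this, offered here as an assumption rather than a description of the actual kernel, is the offset identity dot(q, y) = dot(q + m, y) - m * sum(y): the shifted values q + m are non-negative, and the correction term needs only the column sum of y. The function names and the quant values {-3, -1, 1, 3} in the usage note are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: signed quants times signed activations.
inline int32_t dot_signed(const int8_t* q, const int8_t* y, std::size_t n) {
    int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += int32_t(q[i]) * y[i];
    return sum;
}

// Offset form: u = q + offset is unsigned (assumes q[i] >= -offset).
// dot(q, y) = dot(u, y) - offset * sum(y)
inline int32_t dot_offset(const int8_t* q, const int8_t* y, std::size_t n,
                          int offset) {
    int32_t sum_uy = 0;  // unsigned quants times signed activations
    int32_t sum_y  = 0;  // correction term, depends only on the column
    for (std::size_t i = 0; i < n; ++i) {
        const uint8_t u = uint8_t(q[i] + offset);
        sum_uy += int32_t(u) * y[i];
        sum_y  += y[i];
    }
    return sum_uy - offset * sum_y;
}
```

For quants drawn from {-3, -1, 1, 3} and `offset = 3`, both functions return the same value for any int8 activations. An unsigned-times-signed product is the operand layout that x86 instructions such as `_mm256_maddubs_epi16` expect, which may be part of why the unsigned form pays off once the extra work is spread over 16 columns.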