author | Kawrakow <iwankawrakow@gmail.com> | 2025-01-27 18:53:47 +0200
---|---|---
committer | GitHub <noreply@github.com> | 2025-01-27 18:53:47 +0200
commit | f725576345582144dfebd7f1e6c8ac93eb1eb0ca (patch) |
tree | 12de4f7a7c4c9c75e1df955764200102e901a29d /examples |
parent | d9c4ea48d1e41d8f7215ff1c094d75e7229b65e2 (diff) |
Minor performance improvements (#179)
* Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B with the
8-row interleaving sketched below.
TG-128 reaches maximum performance at 2 threads and is slightly
higher than with 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).
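For context, a minimal sketch of the row-interleaving idea (my own
illustration, not the actual ik_llama.cpp code): blocks from 8 consecutive
rows are stored adjacently block-by-block, so one pass over the activations
can accumulate into 8 output rows at once. `block_t` stands in for the real
quantized block type (e.g. `block_iq4_xs`); the real `_r8` types additionally
fuse the 8 neighboring blocks into a single wider block.

```cpp
#include <cstddef>

// Interleave 8 consecutive rows of quantized blocks: block ib of rows
// row0..row0+7 becomes 8 adjacent blocks in the destination.
// Assumes nrows is a multiple of 8.
template <typename block_t>
void repack_8_rows(const block_t* src, block_t* dst,
                   size_t nrows, size_t nblocks_per_row) {
    for (size_t row0 = 0; row0 < nrows; row0 += 8) {
        for (size_t ib = 0; ib < nblocks_per_row; ++ib) {
            for (size_t r = 0; r < 8; ++r) {
                dst[row0*nblocks_per_row + 8*ib + r] =
                    src[(row0 + r)*nblocks_per_row + ib];
            }
        }
    }
}
```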
* Try interleaving 8 iq4_xs rows
It is also faster on AVX2.
This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.
* Cleanup
* 8-rows interleaved q8_0 (AVX2)
* 8-rows interleaved q8_0 (Zen4)
* 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches a peak of 8.16 t/s at just 2 threads, compared
to 7.95 t/s @ 4 threads before.
* 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s); TG-128 is about the
same.
* FA: repack Q8_0 to Q8_0_R8
* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)
* FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general-purpose gemm, slightly
slower than the D = 128 special-case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128, as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point in having the special-purpose implementation.
* q4_0_r8 (AVX2)
* q4_0_r8 (NEON)
A tiny bit faster PP (~128 vs ~126 t/s), same TG.
* q4_0_r8 (Zen4)
Somehow only marginally faster?
268 t/s vs 261 t/s
* q4_0_r8 (Zen4) - slightly better
282 t/s for a pure q4_0 L3-8B quantization.
* Apply platform-specific modifications when repacking
E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0
(see the sketch after this list).
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
bool argument indicating whether the repacking is done online and,
if so, applying modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply modifications to models that
have already been repacked while loading the model. Caveat:
just like rtr, this needs to have mmap disabled (otherwise one
would need to move the data to a non-mmap-ed buffer, which would
be much more complicated).
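A minimal sketch of what the NEON q4_0 pre-transform amounts to (an
assumption based on the description above, not the literal repacking code):
q4_0 packs two 4-bit quants per byte with an implicit -8 offset, and
XOR-ing each nibble with 0x8 (each byte with 0x88) maps a nibble value v in
[0,15] to (v - 8) mod 16, i.e. to the signed 4-bit value v - 8, so the GEMM
kernel no longer has to subtract 8 from every nibble at run time.

```cpp
#include <cstdint>
#include <cstddef>

// Pre-apply q ^ 0x88 to a run of packed q4_0 nibbles: both nibbles of each
// byte become signed 4-bit values, removing the -8 offset from the hot loop.
static inline void preapply_q4_0_xor(uint8_t* qs, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; ++i) qs[i] ^= 0x88;
}
```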
* Apply platform-specific modifications when repacking
On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned, thus avoiding these operations in the matrix
multiplications (see the sketch below). With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using q8_0 KV-cache.
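Again a hedged sketch of what the Zen4 pre-conversion does (my reading of
the description, not the actual kernel): AVX512-VNNI's
`_mm512_dpbusd_epi32` multiplies unsigned bytes with signed bytes, so
storing the q8 weight quants as x + 128 (x ^ 0x80 on the raw byte) lets
them serve directly as the unsigned operand; the spurious 128 * sum(y)
contribution can then be subtracted using per-block sums of the activation
quants.

```cpp
#include <cstdint>
#include <cstddef>

// Convert stored signed q8 quants to a biased-unsigned representation:
// x in [-128,127] becomes x + 128 in [0,255], i.e. x ^ 0x80 on the raw byte.
static inline void make_unsigned_q8(uint8_t* qs, size_t n) {
    for (size_t i = 0; i < n; ++i) qs[i] ^= 0x80;
}
```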
* Process up to 16 columns per kernel call for q8_k_r8
This brings PP-512 up to 389 t/s.
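The column batching presumably looks like the usual tile dispatch
(hypothetical helper name below; only the 16-column figure is from the
commit): handling more right-hand-side columns per call amortizes the cost
of loading and unpacking the interleaved 8-row weight data across more
outputs.

```cpp
#include <algorithm>

// Hypothetical driver: split the N right-hand-side columns into tiles of
// up to 16 and hand each tile to the q8_k_r8 kernel in one call, so the
// unpacked weight rows are reused across the whole tile.
void mul_mat_driver(int ncols /* , ... weights, activations ... */) {
    for (int col = 0; col < ncols; col += 16) {
        int tile = std::min(16, ncols - col);
        (void)tile;  // placeholder in this sketch
        // mul_mat_q8_k_r8_tile(col, tile, ...);  // hypothetical kernel entry
    }
}
```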
* Be able to load Deepseek-v2-Lite
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
0 files changed, 0 insertions, 0 deletions