Age | Commit message (Collapse) | Author |
|
* Custom quantization rules with regular expressions
* Add the --custom-q option to the help
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP CUDA FA with Dk != Dv
* WIP
* CUDA FA WIP - It actually works!
No TG yet, but for PP I can run FA with fp16 cache and it gets
the same answer.
* CUDA FA WIP - it now works for Q8_0 + Q8_0 for KV cache
* CUDA FA WIP - TG, not working yet.
* CUDA FA with Dk != Dv: it works now for DeepSeek
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* FlashMLA - it finally works (on the CPU)
* FlashMLA: allow for f16 and bf16 cache in addition to q8_0
* It works with ggml FA, not with iqk FA
* WIP
* FlashMLA: it now works with iqk
I had forgotten to divide the Q stride by sizeof(float) and
that's why, very cobfusingly, it was working for TG but not for PP.
* WIP
* FlashMLA: that should be it for now
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* A better way to measure the cost of ggml_barrier
* Smart expert selection
* Add ser option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* This reduces compute buffer size for MLA
* This should accomplish it for standard attention
* Much better
* Better concat for contiguous tensors
If all the op does is to concatenate the second tensor
to the first, why would we want to have a loop?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
The `-mla` command line option turns into an int from a bool.
mla = 0: use standard attention
mla = 1: use MLA with transposed cache
mla > 1: use MLA without transposed cache
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Slight MLA TG performance improvement on CUDA
The low MLA performance on CUDA is dues to
the wk_b * q_nope operation.
It turns into n_head matrix multiplications with
n_head separate quantization and GEMV steps.
The associated overhead is just too much for TG
where each GEMV is very fast (512 x 128 = 131 KFLOP
for DeepSeek-Lite, 4X that for DeepSeekV3/R1).
The way it was done there was also a copy of each q_nope
row before quantization, which I have now eliminated.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then have
a kernel that does the GEMV for all heads instead of
n_head sequential GEMVs.
* Slightly better
* CUDA: Quantize non-contiguous tensors
* Much better MLA
It is a total hack, but it works.
* Cleanup
Remove duplicated gemv's.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Give the user the option to override where model weights are stored
* Fix ggml_nbytes() problem and cleanup
For a tensor with zero elements ggml_nbytes() was returning
uint64_t::max, and this was causing graph allocation failure.
* Add timing info to CUDA graph evaluation
* Add more timing info
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate))
in case the MUL_MAT_ID op for the down experts is the next
op in the graph.
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-tinme-repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
removed most of it when implementing the quants with per row scale,
bit it was stull lurking in ggml_copy. Not sure if these were the last
remnants of ggmil-style row sizes, or if there are still places left
* llama.cpp: get rid of the the 1d K cache assumption. Create and manage
the K-cache as a 2D tensor so we can have per row meta data as needed
by q8_KV.
Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Do not allocate / report caches that are not used
It is either the standard KV cache or MLA cache, not both.
* Rename X_pe to X_rope
Much easier to follow, at least for my brain, when we have
X_rope : rotational position encoding
X_nope : no position encoding
instead of X_pe and X_nope, where I was wondering wtf is 'pe'
and 'nope'.
* WIP
* WIP
* WIP
* WIP
* Warn user when disabling MLA
* MLA: compile time option to not use transposed KV cache
Cuts KV cache size in nearly half at the expense of slower
TG performance for long contexts (it becomes similar to
no-MLA).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding support for K head size != V head size
This is relevant for DeepSeek models.
At this point ggml CPU FA works.
Now I need to go and change iqk FA to make it work
with Dk != Dv.
* iqk support for K head size != V head size
To not have compilation time explode, just
Dk = 192, Dv = 128 for now (DeepSeek)
* FA: very slightly faster for nq = 1 (TG)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Load all MoE experts during warmup
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Unify warmup to one token
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
|
|
* Deepseek MLA Optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make MLA optional
* Remove some unnecessary copies in the MLA attention
* Deepseek MLA Optimizations V2 (#195)
* Avoid allocating MHA KV cache when MLA is turned on
* Added missing gguf-py file
* Added final optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make sure we do have wk_b and wv_b before enabling MLA
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use type_k and type_v to set the types of the MLA caches
They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.
* Better gemm strategy when nth > nhead
It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we ind the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.
---------
Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving
* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving
* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_m_r4: basics (quantize/dequantize)
* iq1_m_r4: Zen4 gemm
* iq1_m_r4: neon gemm
* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4
With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.
* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s_r4: basics - quantize/dequantize
* iq1_s_r4: gemm/gemv works on AVX2/Zen4
* Don't forget to make sure we have a multiple of 4 rows per thread
* iq1_s_r4: this is better
* iq1_s_r4: fix Zen4 after AVX2 changes
* iq1_s_r4: NEON gemm/gemv
* iq1_s_r4: more bits for shared experts
With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.
On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.
* Forgotten counter increment
* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv
* Compiler warnings
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Quantization mixes tweaks
* Make iq4_nl_r4 work with row size that are not a multiple of 128
... on Zen4
* Make iq4_nl_r4 work with row size that are not a multiple of 128
... on AVX2
* Make iq4_nl_r4 work with row size that are not a multiple of 128
... on AVX2
* Make q6_0_w4 work with row size that are not a multiple of 128
... on Zen4
* Make q6_0_w4 work with row size that are not a multiple of 128
... on Zen4
* Make q5_0_r4 work with row size that are not a multiple of 128
... on Zen4 and AVX2
* Make q5,6_0_r4, iq4_nl_e4 work with row size that are not a multiple of 128
also on NEON.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14/28 t/s @ 4 threads).
* Try interleaving 8 iq4_xs rows
It is also faster on AVX2.
This is the NEON implementation. It is tiny bit faster than
4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without associated NEON egression.
* Cleanup
* 8-rows interleaved q8_0 (AVX2)
* 8-rows interleaved q8_0 (Zen4)
* 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.
* 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.
* FA: repack Q8_0 to Q8_0_R8
* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)
* FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.
* q4_0_r8 (AVX2)
* q4_0_r8 (NEON)
Tiny bit faster PP (~128 vs ~126 t/s), same TG.
* q4_0_r8 (Zen4)
Somehow only marginally faster?
268 t/s vs 261 t/s
* q4_0_r8 (Zen4) - slightly better
282 t/s for a pure q4_0 L3-8B quantization.
* Apply platform specific modifications when repacking
E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0.
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
bool argument indicating if the repacking is done online and,
if so, apply modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply modifications to models that
have already been repacked while loading the model. Caveat:
just like rtr, this needs to have mmap disabled (else one would
need to move the data to a not mmap-ed buffer, so much more
complicated).
* Apply platform specific modifications when repacking
On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned thus avoiding these operations in matrix
multiplications. With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using q8_0 KV-cache.
* Process up to 16 columns per kernel call for q8_k_r8
This brings PP-512 up to 389 t/s.
* Be able to load Deepseek-v2-Lite
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14/28 t/s @ 4 threads).
* Try interleaving 8 iq4_xs rows
It is also faster on AVX2.
This is the NEON implementation. It is tiny bit faster than
4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without associated NEON egression.
* Cleanup
* 8-rows interleaved q8_0 (AVX2)
* 8-rows interleaved q8_0 (Zen4)
* 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.
* 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.
* FA: repack Q8_0 to Q8_0_R8
* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)
* FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adopting chat template stuff from llama.cpp
* Removing missed conflict marker
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Add Falcon3 pre-tokinizer (same as llama3)
* q8_k16: use integer arithmetic to sum row values
The existing implementation that just sums up the f32 quantizations
works fine for the original BitNet models and also for the TriLM
ternary models. But for Falcon3 I see a significant difference between
the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum
up the values in a row, then the CPU-GPU PPL difference becomes much
smaller, and we get a lower PPL than Microsoft BitNet, which claims
to be "losless".
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_s_r4: WIP
* iq3_s_r4: Zen4
* iq3_s_r4: slightly better Zen4
* iq3_s_r4: AVX2
* iq3_s_r4: NEON
* iq3_s_r4: rearrange quants
* iq3_s_r4: rearranged quants - AVX2
* iq3_s_r4: rearranged quants - NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_s_r4: Zen4
* Minor
* iq2_s_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_xs_r4: Zen4
* iq2_xs_r4: AVX2
* iq2_xs_r4: slightly better matrix x vector on AVX2
* iq2_xs_r4: NEON - not much better than iq2_xs
* iq2_xs_r4: slightly better NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_xxs_r4: Zen4
Disapointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread
* Minor
* iq2_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_xxs_r4: 1st shot on Zen4
PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s
* iq4_xxs_r4: WIP
* iq4_xxs_r4: 1st shot at AVX2
Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.
* iq3_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_ks_r4: Zen4
* iq4_ks_r4: AVX2
* iq4_ks_r4: WIP
* iq4_ks_r4: slightly better Zen4
* iq4_ks_r4: slightly better Zen4
* iq4_ks_r4: NEON
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_k_r4: Zen4
Much slower than the others.
* iq5_k_r5: WIP
* Minor
* iq5_k_r4: fix AVX2 nrc_y = 1 case
* iq5_k_r4: better Zen4
But TG is still slower than iq5_k
* iq5_k_r4: slightly better AVX2
* iq5_k_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Be able to repack tensors at run time
* Repack: also add bf16 as repackable type
* Repack: make sure number of rows is a multiple of the packing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_k_r4: Zen4
* iq2_k_r4: NEON
* iq2_k_r4: better matrix x vector multiplication on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_k_r4 WIP
* iq3_k_r4: Zen4
* iq3_k_r4: AVX2
* iq3_k_r4: NEON
* iq3_k_r4: faster matrix x vector multiplication on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Not working bf16_r4
* Adding bf16_r8
Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess, this is still sub-obtimal.
* bf16_rx: Very slightly faster by interleaving 16 rows
258 t/s -> 263 t/s
* Rename bf16_r4 to bf16_r16
We are interleaving 16 rows now.
* Cleanup unused stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q8_k_r8: fastest matrix multiplication known to human kind
We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!
* q8_k_r8: AVX2
I was worried that we don't have enough vector registrers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.
* q8_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have fr Q8_0_R4.
* q8_k_r4: go to signed ints
Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
stay within the signed int range and use _mm256_sign_epi8.
Not yet tested on the AVX2 comp, vut expect major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorrq_u8()
needed tto convert from unsigned to signed seems to be extremely
slow on the M2-Max
* We only lose ~0.5% in oerformance on Zen4 (there the exclusive
or that we now use to convert fro signed to unsigned seems to be
much faster than on M2-Max)
* Shutup useless compiler warnings
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_k_r4: WIP
* iq4_k_r4: Zen4 and hopefully AVX2
On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6 performance penalty.
* iq4_k_r4: AVX2
PP-512 = 227.60 t/s. The shifts are really costly.
* iq4_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q2_k_r4: Zen4
PP-512(LLaMA-3.1-8B) = 256 t/s
* q3_k_r4: AVX2
* q2_k_r4: AVX2
We get PP-512(LLaMA-3.1-8B) = 287 t/s.
Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow
forgot to push upstream.
* q2_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 106.2 t/s.
TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.
* Make sure rows per thread are a multiple of 4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q3_k_r4: Zen4 works, but not as good as it should be
238 t/s, so sloghtly slower than q6_k_r4.
* q3_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 106.9 t/s.
This is 1.93X faster than q3_K_S!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q5_k_r4: WIP
* q5_k_r4: Zen4 and AVX2
We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4.
Q5_K_S has PP-512 = 190 t/s.
* q5_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 96.1 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q6_k_r4
* q6_k_r4: 1st functional AVX2 version
* q6_k_r4: AVX2 and simple Zen4
"Simple" as in processing 4 instead of 8 rows at once.
On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs
195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s
vs 5.38 t/s for Q6_K.
* q6_k_r4: 1st NEON version
PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K.
TG-128 is slightly lower rthan q6_K for low number of threads,
becomes very slightly better at 8 threads.
* q6_k_r4: slightly faster NEON
PP-512(LLaMA-3.1-8B) = 83.25 t/s
* q6_k_r4: slightly faster Zen4
238.3 t/s -> 243.2 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Something is still wrong
* Simply don't see what is wrong
* q4_k_r4: finally works on Zen4
I had forgotten to prevent token_embd.weight being quantized
with q4_k_r4!
* q4_k_r4: AVX2
We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX.
This is ~30% better than Q4_K_S.
* q4_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 110 t/s.
Not quite as good as q4_0_r4, but still a massive
improvement compared to he 69 t/s for q4_K.
* q4_k_r4: slightly better AVX2
PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX
* Minor
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q4_0_r4: 6% faster PP on NEON
* qx_0_r4_q8_0 template
Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP
~7% faster.
* Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4
* Simplify
* Minor iq4_xs_r4 improvement on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding iq2_bn_r4
This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.
* Make sure rows per thread are a multiple of the number of interleaved rows
With this I can run iq2_bn_r4 with 32 threads and this increases
PP-512 to 872 t/s.
* iq2_bn_r4: 1st shot at NEON
PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.
* iq2_bn_r4: NEON
PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.
* iq2_bn_r4: Experimenting on NEON
The matrix x vvector multiplication is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads
and not 93 as I wrore in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.
* Some cleanup
* iq2_bn_r4: AVX2
As expected, PP is slightly slower as we just don;t have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).
* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector
It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.
* iq2_bn_r4: simdify q8_K16 quantization (AVX2)
PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.
* iq2_bn_r4: simdify q8_K16 quantization (NEON)
PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)
* iq2_bn_r4: fix AVX2 after breaking it two commits ago
* iq2_bn_r4: better AVX2
As we don't have enough vector registers on AVX2, it is better
to do two passes per row needing only half of the accumulator
registers that way.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding iq4_xs_r4
This is a 1st working version on Zen4.
We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower
than iq4_nl_x4.
* iq4_xs_r4: WIP
* iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4
* iq4_xs_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max,
up from 68.2 t/s for iq4_xs!
* DRY
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|