Age | Commit message (Collapse) | Author |
|
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
|
|
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
|
|
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
|
|
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
|
|
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
|
|
169.2 t/s vs 167.8 t/s before.
|
|
Almost on par with iq2_xs (168 t/s vs 172 t/s).
|
|
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
|
|
|
|
I cannot possibly wait for a 5 minutes nvcc compilation
each time I touch vecdotq.cuh.
Also, cmake was adding --options-file X.rsp to the nvcc
compile commands, which confuses clangd, so I have turned
that off.
|
|
|
|
Performance is roughly on par with q5_0.
|
|
|
|
|
|
|
|
Quantize/dequantize, CUDA dequantize
|
|
|
|
|
|
|
|
|
|
|
|
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
|
|
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.
* iq4_k: NEON implementation
For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
It seemed Gemma-2 performance is lower than expected for its size.
Looking at the architecture, I noticed that tanh is used in each layer,
and then at the end for softcaping the final output. ggml had tanh
set to be computed with a single thread. Combined with tanh(x) being a
pretty expensive operation, this resulted in a significant fraction
of the time being spent in the tanh operation.
After multi-threading ggml_vec_soft_max_f32 and simd-ifying the
tanh computation, I observe a 33% gain in prompt processing speed (!!!)
TG is of course memory bound, but despite this, we still get a
~2% boost at 4 threads (which gives max TG performance on my
Ryzen-7950X).
Simd-ifying:
We have
tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
so we can just use Justine Tunney's SIMD exp implementation.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
OK, I should have checked how it was done for Gemma and do
the same for Bitnet. But better late than never.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* bitnet: put token embeddings on the GPU
* Update README with the new CUDA/Meat performance
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
Trying to avoid line breaks in table
|
|
|
|
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
|
|
|
|
|
|
|
|
Adding some more details
|
|
Adding MoE and Bitnet performance tables
|
|
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.
|
|
Added performance comparison tables
|
|
Else fp16 PP performance drops by nearly a factor of 2 compared to
what we had before.
|
|
For x86 slaren was genereous enough to add _mm_pause() to the busy
spin wait loop in ggml_barrier(), but everything else just busy
spins, loading an atomic int on every iteration, thus forcing cache
sync between the cores. This results in a massive drop in performance
on my M2-Max laptop when using 8 threads. The closest approximation
to _mm_pause() on Arm seems to be
__asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads
recovers back to expected levels.
|
|
|
|
|
|
|
|
|
|
Here the performance gain is more modest compared to AVX2: we get
PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B
running on M2 Max.
|
|
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat to perform
n_kv_embed x n_token 2D multiplications, each using nth threads.
Instead, in this PR, if n_head is a multiple of nth, each thread
does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices.
This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from
409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B,
we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from
139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
|
|
I was trying to understand where the Bitnet bottleneck is, and at
some point noticed the Q*K matrixt multiplication where Q and K
have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for
floats rerquiers that the row size is a multiple of the SIMD vector size
(so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this
matrix multiiplication was getting done with ggml. Changing the iqk_mul_mat
float kernel to handle row sizes that are a multiple of 4 (via __m128
for the last values in a row) resulted in nearly a 20% performance boost
for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance
increases by nearly 70%!
|
|
|
|
At the end of the day, lookup is still better when not using simd.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
|