|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
|
|
|
|
|
|
At the end of the day, lookup is still better when not using SIMD.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
|
|
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s)
but slower on the M2 (6.8 t/s vs 8.6 t/s before).
|
|
|
|
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).
This achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:
| model | size | params | backend | threads | test | t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | pp512 | 9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | tg128 | 229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | pp512 | 322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | tg128 | 59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 8 | tg128 | 57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 4 | tg128 | 33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 2 | tg128 | 18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | pp512 | 698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | tg128 | 68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | pp512 | 196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | tg128 | 51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 4 | tg128 | 30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 2 | tg128 | 16.89 ± 0.01 |
It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
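
For reference, here is a minimal sketch of the packing described above: 4 groups of 16 quants, each storing its first 15 quants as three base-3-packed groups of 5, with the four leftover 16th quants sharing a final byte, for 13 bytes per 64 quants = 1.625 bpw. The struct and function names, and the exact digit order within a byte, are illustrative assumptions rather than the actual ik_llama.cpp code.

```c++
#include <cstdint>

// Sketch only: pack 64 ternary quants (values 0,1,2) into 13 bytes (1.625 bpw).
struct block_iq1bn_sketch {
    uint8_t qs[12];  // 4 groups of 16 -> 3 bytes each, 5 quants per byte (base 3)
    uint8_t extra;   // the leftover 16th quant of each group, 4 base-3 digits in one byte
};

static void pack_block(const uint8_t * q /* 64 values in {0,1,2} */, block_iq1bn_sketch * b) {
    for (int g = 0; g < 4; ++g) {                 // 4 groups of 16 quants
        for (int k = 0; k < 3; ++k) {             // 3 groups of 5 per group of 16
            uint8_t byte = 0;
            for (int j = 4; j >= 0; --j) byte = 3*byte + q[16*g + 5*k + j];
            b->qs[3*g + k] = byte;                // byte = sum_j q_j * 3^j, max 242 < 256
        }
    }
    uint8_t extra = 0;                            // the four 16th quants share one byte
    for (int g = 3; g >= 0; --g) extra = 3*extra + q[16*g + 15];
    b->extra = extra;
}

static void unpack_block(const block_iq1bn_sketch * b, uint8_t * q) {
    for (int g = 0; g < 4; ++g) {
        for (int k = 0; k < 3; ++k) {
            uint8_t byte = b->qs[3*g + k];
            for (int j = 0; j < 5; ++j) { q[16*g + 5*k + j] = byte % 3; byte /= 3; }
        }
    }
    uint8_t extra = b->extra;
    for (int g = 0; g < 4; ++g) { q[16*g + 15] = extra % 3; extra /= 3; }
}
```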
|
|
|
|
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.
With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
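
A sketch of what a 0,1,2 lookup table of this kind could look like (the actual iq1bn_grid_u16 contents and layout in ik_llama.cpp may differ): each packed byte expands, via one uint16_t load, into five 2-bit fields holding values 0, 1, 2. One convenient property of unsigned 0,1,2 values, shown in the scalar dot product below, is that the signed result is recovered by subtracting the activation sum once; whether the AVX2 kernel uses exactly this identity is not stated in the commit.

```c++
#include <cstdint>

// Sketch only: a 256-entry table mapping a base-3 packed byte to five 2-bit fields.
static uint16_t grid_u16[256];

static void init_grid_u16() {
    for (int idx = 0; idx < 243; ++idx) {          // only 243 of the 256 codes are valid
        int v = idx; uint16_t entry = 0;
        for (int j = 0; j < 5; ++j) { entry |= uint16_t(v % 3) << (2*j); v /= 3; }
        grid_u16[idx] = entry;
    }
}

// With quants v in {0,1,2} standing for v-1 in {-1,0,+1}:
//   sum((v_i - 1) * x_i) = sum(v_i * x_i) - sum(x_i),
// so the kernel can accumulate unsigned products and subtract the activation sum once.
static int32_t dot_with_grid(const uint8_t * packed, const int8_t * x, int n_bytes, int32_t sum_x) {
    int32_t acc = 0;
    for (int i = 0; i < n_bytes; ++i) {
        const uint16_t entry = grid_u16[packed[i]];
        for (int j = 0; j < 5; ++j) acc += int32_t((entry >> (2*j)) & 3) * x[5*i + j];
    }
    return acc - sum_x;                            // sum_x = sum of the 5*n_bytes activations
}
```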
|
|
Doesn't make a real difference in performance.
|
|
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
|
|
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP: results are
jumping back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
|
|
I had not adjusted after going to 4 q8 scales per row.
|
|
Q8_0 fails because, by design, the reference quantization
is different from the vecdot quantization.
|
|
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:
| System       | quant | PP-512        | TG-128       | quant | PP-512        | TG-128       |
| ------------ | ----- | ------------: | -----------: | ----- | ------------: | -----------: |
| M2 Max       | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen-7950X  | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen-5975WX | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
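
A minimal sketch of what 4 scales per row can mean for the Q8 activations: one absmax-derived scale per quarter of the row, int8 quants otherwise. Function name, rounding, and clamping are assumptions for illustration, not the actual ik_llama.cpp routine.

```c++
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch only: quantize one row of n activations to int8 with 4 scales per row.
static void quantize_row_q8_4scales(const float * x, int n, int8_t * q, float * scales /* [4] */) {
    const int chunk = n / 4;                      // assume n is a multiple of 4
    for (int s = 0; s < 4; ++s) {
        float amax = 0.f;                         // max |x| over this quarter of the row
        for (int j = 0; j < chunk; ++j) amax = std::fmax(amax, std::fabs(x[s*chunk + j]));
        const float d  = amax / 127.f;            // one scale per chunk
        const float id = d > 0.f ? 1.f/d : 0.f;
        scales[s] = d;
        for (int j = 0; j < chunk; ++j) {
            const int v = (int)std::lround(x[s*chunk + j] * id);
            q[s*chunk + j] = (int8_t)std::max(-127, std::min(127, v));  // keep within int8
        }
    }
}
```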
|
|
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
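
A hypothetical layout for the blocks of 128 (struct name and scale type are assumptions, not the actual ik_llama.cpp definition): each block carries its own scale followed by 128 contiguous int8 quants, so the matmul kernel streams 128 quants per scale load.

```c++
#include <cstdint>

// Sketch only: Q8 activations arranged in blocks of 128 quants, one scale per block.
struct block_q8_128_sketch {
    float  d;          // block scale
    int8_t qs[128];    // 128 quantized activations
};
// A row of n activations is then n/128 such blocks laid out back to back.
```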
|
|
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
|
|
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (but the previous
version with 4 bits each for the exponent and mantissa was
good enough for all practical purposes).
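
A hedged sketch of an 8-bit float with a 3-bit exponent and 5-bit mantissa for a non-negative scale; the exponent bias, rounding, and zero handling below are assumptions for illustration, not the actual code.

```c++
#include <cmath>
#include <cstdint>

// Sketch only: encode a non-negative float into 8 bits as e3m5, and decode it back.
static uint8_t fp8_e3m5_from_float(float x) {
    if (!(x > 0.f)) return 0;                        // zero, negatives, NaN -> 0 (assumption)
    int e;
    const float m = std::frexp(x, &e);               // x = m * 2^e, m in [0.5, 1)
    int E = e - 1;                                   // x = (2m) * 2^E with 2m in [1, 2)
    int f = (int)std::lround((2.f*m - 1.f) * 32.f);  // 5-bit fraction of the mantissa
    if (f == 32) { f = 0; ++E; }                     // rounding carried into the exponent
    int be = E + 4;                                  // assumed exponent bias of 4
    if (be <= 0) return 0;                           // underflow -> 0
    if (be > 7) { be = 7; f = 31; }                  // overflow -> largest representable value
    return (uint8_t)((be << 5) | f);
}

static float fp8_e3m5_to_float(uint8_t q) {
    const int be = q >> 5, f = q & 31;
    if (be == 0) return 0.f;                         // only 0 is encoded with a zero exponent
    return std::ldexp(1.f + f/32.f, be - 4);         // (1 + f/32) * 2^(be - 4)
}
```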
|
|
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
|
|
The scalar dot product already achieves 37 t/s for TG!
|
|
|
|
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
|
|
|