Age | Commit message | Author |
|
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale when building the
graph. There must be a better way.
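For context, a minimal sketch of the failure mode (the function name is illustrative and the actual code being reverted is not shown here): reading the scale out of the weight tensor while the compute graph is being built assumes the tensor data is already host-resident, which is not the case with CUDA.

    #include "ggml.h"

    // Sketch only: pulling a stored scale out of the weight tensor at
    // graph-build time. With CUDA the tensor contents have not been loaded
    // yet (or live on the device), so this host-side read crashes; it only
    // happens to work for CPU / mmap-backed tensors.
    static float read_stored_scale(const struct ggml_tensor * w) {
        const float * data = (const float *) w->data;
        return data[0];
    }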
|
|
This recovers part of the performance loss. On Metal TG-128 is now
92 t/s, still short of the ~100 t/s with scales applied on the fly.
|
|
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
|
|
|
|
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
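For illustration, a minimal sketch of that graph shape, assuming the scale lives in its own tensor and is applied as an extra node right after the quantized matrix multiplication (the tensor names and the use of ggml_mul here are assumptions, not necessarily the exact ops in the commit):

    // hypothetical tensors: w_quant = scale-free quantized weights,
    // w_scale = the per-tensor Bitnet scale, x = activations
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w_quant, x); // quantized mat-mul
    cur = ggml_mul(ctx, cur, w_scale);                        // extra node applying the scale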
|
|
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (though the previous
version with 4 bits for the exponent and mantissa was
good enough for all practical purposes).
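A rough sketch of such an 8-bit float with a 3-bit exponent and 5-bit mantissa; the bit layout, bias, and nearest-value rounding below are assumptions for illustration, not the encoding actually used:

    #include <math.h>
    #include <stdint.h>

    // Assumed layout: eeemmmmm (no sign bit, scales taken as non-negative),
    // IEEE-style implicit leading 1 with an assumed exponent bias of 3.
    static float e3m5_to_fp32(uint8_t v) {
        int e = v >> 5, m = v & 0x1f;
        if (e == 0) return ldexpf((float)m/32.0f, -2);     // subnormal range
        return ldexpf(1.0f + (float)m/32.0f, e - 3);
    }

    // Encode by picking the nearest representable value (slow but clearly correct).
    static uint8_t fp32_to_e3m5(float x) {
        uint8_t best = 0;
        float best_err = fabsf(x - e3m5_to_fp32(0));
        for (int v = 1; v < 256; ++v) {
            float err = fabsf(x - e3m5_to_fp32((uint8_t)v));
            if (err < best_err) { best_err = err; best = (uint8_t)v; }
        }
        return best;
    }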
|
|
We get 205 t/s, so ~13% slower than 2 bit.
|
|
With this we get TG-128 = 97 t/s.
|
|
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(not to say unheard of).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
|
|
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
|
|
We get PP-512 = 192 t/s, TG-128 = 72 t/s
|
|
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on the
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
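A hedged sketch of what the AVX2 path can look like, assuming the 2-bit values are already expanded to one byte each and stored as 0/1/2 (meaning -1/0/+1); the actual block layout and scales in the repo are not reproduced here:

    #include <immintrin.h>
    #include <stdint.h>

    // Dot product of 32 ternary weights (bytes with values 0/1/2) with 32 int8
    // activations. Because the weights are tiny unsigned values, the pairwise
    // int16 sums in _mm256_maddubs_epi16 cannot saturate (max |pair| = 2*2*127).
    // The caller subtracts the activation sum once to undo the +1 offset.
    static int32_t dot32_ternary(const uint8_t * w, const int8_t * y) {
        __m256i wv  = _mm256_loadu_si256((const __m256i *)w);
        __m256i yv  = _mm256_loadu_si256((const __m256i *)y);
        __m256i p16 = _mm256_maddubs_epi16(wv, yv);                 // u8 * s8 -> i16 pairs
        __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen to i32
        __m128i s   = _mm_add_epi32(_mm256_castsi256_si128(p32),
                                    _mm256_extracti128_si256(p32, 1));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        return _mm_cvtsi128_si32(s);   // result still includes the +1 offset term
    }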
|
|
We get PP-512 = 190 t/s and TG-128 = 75 t/s.
2 bpw TG on the CPU beats 1.75 bpw on the GPU!
|
|
|
|
We get PP-512 = 322 t/s.
TG is already 51.6 t/s at 4 threads, then it saturates and
starts going down for more than 8 threads.
|
|
The scalar dot product already achieves 37 t/s for TG!
|
|
|
|
With the last change (which added the typo), I'm now getting
PP-512 = 300 t/s on the Ryzen-5975WX.
|
|
We now get 214 t/s on the Ryzen-7950X
|
|
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).
I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.
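In scalar form, the (0, 1, 2) trick amounts to accumulating with unsigned weights and correcting once with the activation sum, so there is no per-weight sign handling (a sketch of the identity, not the Metal kernel itself):

    #include <stdint.h>

    // sum_i (w_i - 1) * y_i  ==  sum_i w'_i * y_i  -  sum_i y_i,
    // where w'_i = w_i + 1 takes only the values 0, 1, 2.
    static int32_t dot_ternary_offset(const uint8_t * wp, const int8_t * y, int n) {
        int32_t dot = 0, ysum = 0;
        for (int i = 0; i < n; ++i) {
            dot  += wp[i] * y[i];
            ysum += y[i];
        }
        return dot - ysum;
    }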
|
|
Basically 2X slower than q4_0.
|
|
|
|
This should be good enough. One cannot ask
Apple Silicon to do too much work.
|
|
|
|
|
|
|
|
PP performance is decent (668 t/s vs 724 t/s for q4_0),
but TG is kind of low (60 t/s vs 81 t/s for q4_0).
|
|
|
|
|
|
I now get PP-512 = 270 t/s on the Ryzen-5975WX
|
|
We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
|
|
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
|
|
|
|
We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
|
|
|
|
|
|
|
|
Verified that it works on AVX2.
Also turned on any combination of f16 and f32
(i.e., added f16 x f16 and f32 x f32).
|
|
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat
It is marked WIP because it has only been tested on __aarch64__.
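A purely hypothetical sketch of what such an interface could look like once types and strides are passed through; the declaration below is illustrative only and is not the repo's actual signature:

    #include <stdbool.h>
    #include <stddef.h>

    // Hypothetical: the caller provides the ggml type ids and row strides of
    // both operands, and the routine does any needed conversion internally.
    bool iqk_mul_mat(long Nx, long Ny, long ne00,
                     int typeA, const void * A, size_t strideA,
                     int typeB, const void * B, size_t strideB,
                     float * C, size_t strideC,
                     int ith, int nth);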
|
|
|
|
|
|
But only turning on f16 x f32 and f32 x f16 for now.
|
|
It makes no difference on my Ryzen-7950X, but perhaps
it will be beneficial for CPUs with real AVX512.
|
|
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX,
up from 138 t/s. We use Nx registers to preload the fp16 weights,
so the total number of registers required is Nx * (Ny + 1): 15 in the case
of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare
register helps. But maybe it is just a matter of how things get
loaded into the cache. On the 7950X I did try 3 x 8 and it did
not perform as well as 5 x 5.
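To make the register accounting concrete, here is a stripped-down 2 x 6 micro-kernel in the same spirit (fp16 weights, fp32 activations, K assumed to be a multiple of 8); it uses Nx = 2 registers for the preloaded weights plus 2 x 6 = 12 accumulators, i.e. 14 of the 16 AVX2 vector registers. This is only an illustration, not the kernel in the repo:

    #include <immintrin.h>
    #include <stdint.h>

    // C[j*ldc + i] += dot(row i of A, row j of B), i < 2, j < 6
    static void tile_2x6(const uint16_t * A, int lda,   // fp16 weights, 2 rows
                         const float    * B, int ldb,   // fp32 activations, 6 rows
                         float * C, int ldc, int K) {   // K % 8 == 0 assumed
        __m256 acc[2][6];
        for (int i = 0; i < 2; ++i) for (int j = 0; j < 6; ++j) acc[i][j] = _mm256_setzero_ps();
        for (int k = 0; k < K; k += 8) {
            // Nx = 2 registers preload the fp16 weights for this k-slice
            __m256 a0 = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(A + 0*lda + k)));
            __m256 a1 = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(A + 1*lda + k)));
            for (int j = 0; j < 6; ++j) {
                __m256 b = _mm256_loadu_ps(B + j*ldb + k);
                acc[0][j] = _mm256_fmadd_ps(a0, b, acc[0][j]);
                acc[1][j] = _mm256_fmadd_ps(a1, b, acc[1][j]);
            }
        }
        for (int i = 0; i < 2; ++i) for (int j = 0; j < 6; ++j) {
            __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc[i][j]),
                                  _mm256_extractf128_ps(acc[i][j], 1));
            s = _mm_hadd_ps(s, s);
            s = _mm_hadd_ps(s, s);
            C[j*ldc + i] += _mm_cvtss_f32(s);
        }
    }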
|
|
Basically use what I did for Arm.
Improves PP performance to 141.7 t/s up from 136 t/s
on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling).
This is now 10% faster than tinyBLAS.
There is a minor improvement also on the Ryzen-5975WX
(16 vector registers, so we use 4x3 tiling): we get
138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
|
|
~2% slower than tinyBLAS - not sure why.
|
|
About 2% faster for q4_K.
|
|
I was happily using _mm256_packs_epi32() to pack the
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But theoretically this can overflow, so it is
better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine
the 4 dot products using int32_t additions. This is (almost)
as fast, unlike _mm256_hadd_epi32(), which seems excessively
slow on the Ryzen-7950X.
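Roughly, the pattern being described (the helper name is mine): reduce four 8-lane int32 accumulators to their four horizontal sums using interleaves and int32 additions, instead of the saturating int16 pack:

    #include <immintrin.h>

    // Combine four __m256i accumulators (8 int32 partial sums each) into the
    // four total sums, using only int32 adds so nothing can saturate the way
    // _mm256_packs_epi32 (int32 -> int16 with saturation) could.
    static __m128i sum_4_accumulators(__m256i v0, __m256i v1, __m256i v2, __m256i v3) {
        __m256i s01 = _mm256_add_epi32(_mm256_unpacklo_epi32(v0, v1),
                                       _mm256_unpackhi_epi32(v0, v1));
        __m256i s23 = _mm256_add_epi32(_mm256_unpacklo_epi32(v2, v3),
                                       _mm256_unpackhi_epi32(v2, v3));
        __m256i s   = _mm256_add_epi32(_mm256_unpacklo_epi64(s01, s23),
                                       _mm256_unpackhi_epi64(s01, s23));
        // fold the two 128-bit halves: lane i now holds the full sum of v_i
        return _mm_add_epi32(_mm256_castsi256_si128(s), _mm256_extracti128_si256(s, 1));
    }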
|
|
|
|
|