Age  Commit message  Author
2024-06-26  RoPE(Neox, Metal): don't use power functions in a loop  (Iwan Kawrakow)
Speeds up Bitnet by ~2% on Metal.
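For context, a minimal C++ sketch of the general trick (not the actual Metal kernel, and the names here are illustrative): since the RoPE angles form a geometric progression, the per-element powf() call can be replaced by one powf() before the loop and a running product inside it.

    #include <cmath>

    // Sketch only: replace a per-iteration powf() with a running product.
    // Names (rope_angles, freq_base, n_dims) are illustrative, not the kernel's.
    void rope_angles(float p, int n_dims, float freq_base, float * theta_out) {
        const float theta_scale = powf(freq_base, -2.0f/n_dims);  // one powf() up front
        float theta = p;
        for (int i = 0; i < n_dims/2; ++i) {
            theta_out[i] = theta;   // instead of p * powf(freq_base, -2.0f*i/n_dims)
            theta *= theta_scale;   // geometric progression, no powf() in the loop
        }
    }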
2024-06-25  Typo  (Iwan Kawrakow)
2024-06-25  bitnet: remove iq1_bn lookup table storing +/- signs  (Iwan Kawrakow)
The AVX2 implementation was the only one left using it, so I decided to see if we can get a performant implementation using the 0,1,2 lookup table. Turns out we can, and it is even slightly faster than the sign based table. We now get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads on the Ryzen-7950X. With only one lookup table left for iq1_bn, I renamed it to iq1bn_grid_u16.
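A hypothetical scalar sketch of the "0,1,2" table style (the actual iq1bn_grid_u16 packing in the repo may differ): with a sign table each weight needs a multiply by +/-1, while with a 0,1,2 table a single subtraction recovers the ternary value.

    #include <cstdint>

    // Hypothetical layout: a 16-bit grid entry holding eight 2-bit fields,
    // each storing 0, 1 or 2; the ternary weight is (field - 1).
    static inline void decode_012(uint16_t grid_entry, int8_t * w) {
        for (int i = 0; i < 8; ++i) {
            w[i] = (int8_t)((grid_entry >> (2*i)) & 3) - 1;   // 0,1,2 -> -1,0,+1
        }
    }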
2024-06-25  bitnet: simdify q8_K64 quantization on AVX  (Iwan Kawrakow)
Doesn't make a real difference in performance.
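As an illustration only, here is a generic AVX2 float -> int8 quantization loop with a given per-block scale; the real q8_K64 block layout and scale handling in the repo differ.

    #include <immintrin.h>
    #include <cstdint>

    // Hedged AVX2 sketch of float -> int8 quantization with a given scale.
    // Assumes n is a multiple of 8; the real q8_K64 layout differs.
    void quantize_block_q8_sketch(const float * x, int8_t * q, int n, float scale) {
        const __m256 id = _mm256_set1_ps(scale != 0.0f ? 1.0f/scale : 0.0f);
        for (int i = 0; i < n; i += 8) {
            __m256  v  = _mm256_mul_ps(_mm256_loadu_ps(x + i), id);
            __m256i vi = _mm256_cvtps_epi32(_mm256_round_ps(v,
                             _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
            __m128i lo  = _mm256_castsi256_si128(vi);
            __m128i hi  = _mm256_extracti128_si256(vi, 1);
            __m128i s16 = _mm_packs_epi32(lo, hi);      // 8 x int32 -> 8 x int16
            __m128i s8  = _mm_packs_epi16(s16, s16);    // 8 x int16 -> 8 x int8 (low half)
            _mm_storel_epi64((__m128i *)(q + i), s8);   // store the 8 bytes
        }
    }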
2024-06-25  bitnet: NEON improvements for iq1_bn  (Iwan Kawrakow)
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25  bitnet: remove the now unused iq1bn_grid_u16  (Iwan Kawrakow)
2024-06-25  Bitnet: adapt NEON and Metal to the alternative grid  (Iwan Kawrakow)
2024-06-25  Bitnet: trying an alternative iq1_bn grid  (Iwan Kawrakow)
Faster on CUDA. The scalar version is faster too. The issue with CUDA is that now I see wild performance fluctuations. Running llama-bench I can get 220 t/s for TG-128 one time, and 190 t/s another time, with uncertainties of 1-2 t/s. Same for PP, results are jumping back and forth between ~9500 t/s and ~8900 t/s. So, basically no reliable measurement at this point, but for sure faster than the previous version, which was at around 170-180 t/s.
2024-06-25  bitnet: fix scalar dot product for 1.625 bpw  (Iwan Kawrakow)
I had not adjusted after going to 4 q8 scales per row.
2024-06-25  Bitnet: slightly faster 1.625 bpw variant for AVX512VL  (Iwan Kawrakow)
2024-06-24  Bitnet: tiny bit faster 1.625 bpw variant on Metal  (Iwan Kawrakow)
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-24  Adding add_4, mul_4, div_4 kernels to Metal  (Iwan Kawrakow)
This gives ~2% speedup for Bitnet on Metal
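The idea, sketched here in plain C++ rather than Metal (kernel name and signature are illustrative): have each thread process a float4 instead of a single float, so the elementwise kernels dispatch a quarter of the threads.

    // Plain C++ stand-in for a Metal "add_4" kernel: one iteration ~ one thread,
    // each handling four contiguous values. Names are illustrative.
    struct float4_t { float x, y, z, w; };

    void add_4(const float4_t * src0, const float4_t * src1, float4_t * dst, int n4) {
        for (int tid = 0; tid < n4; ++tid) {
            dst[tid].x = src0[tid].x + src1[tid].x;
            dst[tid].y = src0[tid].y + src1[tid].y;
            dst[tid].z = src0[tid].z + src1[tid].z;
            dst[tid].w = src0[tid].w + src1[tid].w;
        }
    }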
2024-06-22  bitnet: qnfs tests  (Iwan Kawrakow)
Q8_0 fails because as per design the reference quantization is different from the vecdot quantization.
2024-06-22  bitnet: replace ggml_mul with ggml_scale to apply the scales  (Iwan Kawrakow)
Also save one scale operation in the ffn network by adjusting rms_eps. We gain up to 3% in performance by doing this, but it is a bit of a hack (we store the tensor scales in op_params while loading the model).
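The substitution itself is simple when building the graph; the sketch below is illustrative (variable and function names are not the repo's), and the op_params trick mentioned above is extra plumbing on top of it. ggml_scale multiplies by a plain float, whereas ggml_mul needs a second tensor and a full elementwise kernel.

    #include "ggml.h"

    // Hedged sketch: apply a per-tensor weight scale with ggml_scale instead of ggml_mul.
    struct ggml_tensor * apply_weight_scale(struct ggml_context * ctx,
                                            struct ggml_tensor  * cur,
                                            float                 weight_scale) {
        // before: cur = ggml_mul(ctx, cur, scale_tensor);
        return ggml_scale(ctx, cur, weight_scale);
    }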
2024-06-22  iqk_mul_mat: add IQ4_NL also on NEON  (Iwan Kawrakow)
PPL seems somewhat higher? For llama-v2-7B we are still ~0.04 higher compared to what we expect after ~30 batches.
2024-06-22  iqk_mul_mat: add IQ4_NL  (Iwan Kawrakow)
I never use it, so I had completely forgotten about it.
2024-06-22  bitnet(scale in a separate tensor): CPU tweaks  (Iwan Kawrakow)
A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22  bitnet(scale in a separate tensor): CPU tweaks  (Iwan Kawrakow)
I had ruined TG performance on AVX2 with the last commit. I was only testing at 8 threads, where we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen-7950X. Back to 51 t/s with this commit.
2024-06-22  bitnet(scale in a separate tensor): more CPU improvements  (Iwan Kawrakow)
It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance:

| System       | quant | PP-512        | TG-128       | quant | PP-512        | TG-128       |
| M2 Max       | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen-7950X  | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen-5975WX | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
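For illustration, a plain scalar sketch of "4 scales per row" (function name and layout are hypothetical; the actual Q8 block structure in the repo differs): the row is split into four chunks, each quantized with its own amax/127 scale.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical sketch: quantize a row of n activations to int8 using
    // 4 per-chunk scales. Assumes n is a multiple of 4.
    void quantize_row_4_scales(const float * x, int8_t * q, float * scales, int n) {
        const int chunk = n/4;
        for (int s = 0; s < 4; ++s) {
            float amax = 0.0f;
            for (int j = 0; j < chunk; ++j) amax = std::max(amax, std::fabs(x[s*chunk + j]));
            const float d  = amax/127.0f;
            const float id = d ? 1.0f/d : 0.0f;
            scales[s] = d;
            for (int j = 0; j < chunk; ++j) q[s*chunk + j] = (int8_t)std::lround(x[s*chunk + j]*id);
        }
    }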
2024-06-22  bitnet(scale in a separate tensor): CPU improvements  (Iwan Kawrakow)
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.
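A hypothetical layout sketch of what "blocks of 128" means for the Q8 activations (the actual struct name and fields in the repo may differ):

    #include <cstdint>

    // Hypothetical: Q8 activations grouped in blocks of 128 values,
    // with one fp32 scale per block.
    struct block_q8_128 {
        float  d;        // scale for these 128 quants
        int8_t qs[128];  // quantized values
    };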
2024-06-22  bitnet(scale in a separate tensor): mul -> scale on the CPU  (Iwan Kawrakow)
2024-06-22  bitnet(scale in a separate tensor): mul -> scale on CUDA  (Iwan Kawrakow)
On CUDA we do not have access to the tensor data until we hit the kernel. That's why this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we have without the extra scale operation. PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.
2024-06-22  bitnet(scale in a separate tensor): mul -> scale on Metal  (Iwan Kawrakow)
Do the mul -> scale replacement on the fly in the Metal backend. This recovers the PP performance and cuts the TG performance degradation in half.
2024-06-22  Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"  (Iwan Kawrakow)
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d. When using CUDA, the tensor contents have not been loaded yet, so we crash when trying to access the scale when building the graph. There must be a better way.
2024-06-22  bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale  (Iwan Kawrakow)
This recovers part of the performance loss. On Metal TG-128 is now 92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22  bitnet(scale in a separate tensor): Metal  (Iwan Kawrakow)
iq2_bn TG-128 drops to 84 t/s, while I see in the logs that we had 97 t/s. If true, that's a pretty massive performance penalty for TG. Let me guess: ggml_mul is not exactly the most performant operation on Metal.
2024-06-22  bitnet(scale in a separate tensor): CUDA  (Iwan Kawrakow)
2024-06-22  bitnet: put the scale in a separate tensor  (Iwan Kawrakow)
and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.
2024-06-22  Bitnet(1.75 bpw): higher precision fp8 scale  (Iwan Kawrakow)
Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (but the previous version with 4 bits each for the exponent and the mantissa was good enough for any practical purposes).
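A hedged sketch of what an unsigned fp8 value with a 3-bit exponent and 5-bit mantissa could look like; the bias, layout, and encoder here are assumptions for illustration, not necessarily what the repo uses.

    #include <cmath>
    #include <cstdint>

    // Assumed unsigned E3M5 format: high 3 bits exponent, low 5 bits mantissa.
    // Decode as (1 + m/32) * 2^(e - bias), with e == 0 treated as subnormal.
    static float fp8_e3m5_to_float(uint8_t v, int bias = 3) {
        const int e = v >> 5, m = v & 31;
        return e ? (1.0f + m/32.0f) * std::ldexp(1.0f, e - bias)
                 :        (m/32.0f) * std::ldexp(1.0f, 1 - bias);
    }

    // Simple (slow) encoder: pick the code whose decoded value is closest.
    static uint8_t float_to_fp8_e3m5(float x) {
        uint8_t best = 0; float best_err = std::fabs(x - fp8_e3m5_to_float(0));
        for (int v = 1; v < 256; ++v) {
            float err = std::fabs(x - fp8_e3m5_to_float((uint8_t)v));
            if (err < best_err) { best_err = err; best = (uint8_t)v; }
        }
        return best;
    }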
2024-06-22  Bitnet(1.75 bpw): slightly faster CUDA dot product  (Iwan Kawrakow)
We get 205 t/s, so ~13% slower than 2 bit.
2024-06-22  Bitnet(2.25 bpw): faster Metal dot product  (Iwan Kawrakow)
With this we get TG-128 = 97 t/s.
2024-06-22  Bitnet(2.25 bpw): Metal  (Iwan Kawrakow)
We get PP-512 = 702 t/s, TG-128 = 84 t/s. This is almost on par with q4_0, which is rare on Metal (not to say unheard of). For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet. TG is kind of funny because we hit 72 t/s on the CPU.
2024-06-22  Bitnet(2.25 bpw): CUDA  (Iwan Kawrakow)
We get PP-512 = 9600 t/s, TG-128 = 234 t/s (but we need to use 8 CPU threads, else results are lower, so clearly there is something being computed on the CPU). PP-512 is very close to PP-512(fp16) = 9800 t/s
2024-06-22  Bitnet(2.25 bpw): NEON  (Iwan Kawrakow)
We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22  Bitnet: 2.25 bpw version  (Iwan Kawrakow)
Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on the Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.
2024-06-22  bitnet 2 bpw: NEON implementation  (Iwan Kawrakow)
We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22  Removed extra column  (Iwan Kawrakow)
2024-06-22  bitnet 2 bpw: AVX2 implementation  (Iwan Kawrakow)
We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.
2024-06-22  bitnet: add 2 bpw quantization  (Iwan Kawrakow)
The scalar dot product already achieves 37 t/s for TG!
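A hedged scalar sketch of the kind of dot product involved (the actual iq2_bn block structure, packing order, and scales differ): 2-bit codes in {0,1,2} decode to ternary weights via a subtraction and are accumulated against int8 activations.

    #include <cstdint>

    // Hypothetical packing: each byte holds four 2-bit codes in {0,1,2};
    // the weight is (code - 1). Assumes n is a multiple of 4.
    int dot_ternary_2bpw(const uint8_t * packed, const int8_t * q8, int n) {
        int sum = 0;
        for (int i = 0; i < n/4; ++i) {
            for (int k = 0; k < 4; ++k) {
                const int w = ((packed[i] >> (2*k)) & 3) - 1;   // 0,1,2 -> -1,0,+1
                sum += w * q8[4*i + k];
            }
        }
        return sum;
    }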
2024-06-22  Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice  (Iwan Kawrakow)
2024-06-22  iqk_mul_mat(bitnet): fix typo  (Iwan Kawrakow)
With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.
2024-06-22  iqk_mul_mat(bitnet): slightly faster AVX2  (Iwan Kawrakow)
We now get 214 t/s on the Ryzen-7950X
2024-06-22  iq1_bn: better NEON implementation  (Iwan Kawrakow)
PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.
2024-06-22  iq1_bn(NEON): works now, but very slow  (Iwan Kawrakow)
Basically 2X slower than q4_0.
2024-06-22  iq1_bn(Metal): 66.2 -> 67.1 t/s  (Iwan Kawrakow)
2024-06-22  iq1_bn(Metal): 64 -> 66.2 t/s for TG  (Iwan Kawrakow)
This should be good enough. One cannot ask Apple Silicon to do too much work.
2024-06-22  iq1_bn(Metal): 64 -> 66.2 t/s for TG  (Iwan Kawrakow)
2024-06-22  iq1_bn(Metal): 60 -> 64 t/s for TG  (Iwan Kawrakow)
2024-06-22  iq1_bn: very slightly better Metal dot product  (Iwan Kawrakow)
2024-06-22  iq1_bn: Metal now works  (Iwan Kawrakow)
PP performance is decent (668 t/s vs 724 t/s for q4_0), but TG is kind of low (60 t/s vs 81 t/s for q4_0).