ik_llama.cpp.git

Age	Commit message	Author
2024-06-22	bitnet(scale in a separate tensor): more CPU improvements	Iwan Kawrakow
	It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance (t/s):
	System       | quant | PP-512        | TG-128       | quant | PP-512        | TG-128
	M2 Max       | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03
	Ryzen-7950X  | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02
	Ryzen-5975WX | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10
2024-06-22	bitnet(scale in a separate tensor): CPU improvements	Iwan Kawrakow
	Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.
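For readers unfamiliar with block-wise Q8, here is a minimal sketch of int8 quantization with one scale per block of 128 activations. It is a generic absmax scheme, not the actual layout used by iqk_mul_mat; the function names and the 127 divisor are assumptions made for illustration.

```python
def quantize_q8_blocks(row, block_size=128):
    """Quantize a row to int8 values with one scale per block (assumed absmax scheme)."""
    if len(row) % block_size:
        raise ValueError("row length must be a multiple of block_size")
    scales, quants = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(x) for x in block)
        d = amax / 127 if amax > 0 else 1.0   # scale maps the block to [-127, 127]
        scales.append(d)
        quants.append([round(x / d) for x in block])
    return scales, quants


def dequantize_q8_blocks(scales, quants):
    """Reconstruct approximate activations from per-block scales and int8 values."""
    return [d * q for d, block in zip(scales, quants) for q in block]
```

With this scheme the reconstruction error of each element is at most half the scale of its block, which is why the number of scales per row affects PPL.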
2024-06-22	bitnet(scale in a separate tensor): mul -> scale on the CPU	Iwan Kawrakow

2024-06-22	bitnet(scale in a separate tensor): mul -> scale on CUDA	Iwan Kawrakow
	On CUDA we do not have access to the tensor data until we hit the kernel. That's why this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we have without the extra scale operation. PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.
2024-06-22	bitnet(scale in a separate tensor): mul -> scale on Metal	Iwan Kawrakow
	Do the mul -> scale replacement on the fly in the Metal backend. This recovers the PP performance and cuts the TG performance degradation in half.
2024-06-22	Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"	Iwan Kawrakow
	This reverts commit f83381371b61e0863b55c60e5f5df139126a496d. When using CUDA, the tensor contents have not been loaded yet, so we crash when trying to access the scale when building the graph. There must be a better way.
2024-06-22	bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale	Iwan Kawrakow
	This recovers part of the performance loss. On Metal TG-128 is now 92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22	bitnet(scale in a separate tensor): Metal	Iwan Kawrakow
	iq2_bn TG-128 drops to 84 t/s, while I see in the logs that we had 97 t/s. If true, that's a pretty massive performance penalty for TG. Let me guess: ggml_mul is not exactly the most performant operation on Metal.
2024-06-22	bitnet(scale in a separate tensor): CUDA	Iwan Kawrakow

2024-06-22	bitnet: put the scale in a separate tensor	Iwan Kawrakow
	and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.
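A minimal sketch of what keeping the scale separate means numerically: the product is computed with the unscaled ternary weights and the scale is applied afterwards as its own step. A single scalar scale and the helper names below are assumptions for illustration; this is not the ggml graph code.

```python
def matvec(rows, x):
    """Plain matrix-vector product over Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def bitnet_linear(ternary_rows, scale, x):
    """Multiply by unscaled ternary weights, then apply the scale as a separate step."""
    return [scale * y for y in matvec(ternary_rows, x)]
```

The result equals a product with pre-scaled weights, so the extra operation adds cost but does not change the output.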
2024-06-22	Bitnet(1.75 bpw): higher precision fp8 scale	Iwan Kawrakow
	Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (but the previous version with 4 bits each for the exponent and mantissa was good enough for all practical purposes).
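To illustrate the exponent/mantissa trade-off, here is a sketch of an unsigned 8-bit float with a configurable bit split. The layout is an assumption (no sign bit, no zero or subnormals, a hypothetical bias chosen for the example) and is not taken from the repository's encoding.

```python
import math


def encode_fp8(x, exp_bits=3, man_bits=5, bias=4):
    """Encode a positive float as an unsigned exp_bits+man_bits code (assumed layout)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive scales")
    levels = 1 << man_bits
    max_exp = (1 << exp_bits) - 1
    frac, k = math.frexp(x)              # x = frac * 2**k with 0.5 <= frac < 1
    e = k - 1 + bias                     # biased exponent of the 1.m form
    m = round((2.0 * frac - 1.0) * levels)
    if m == levels:                      # mantissa rounded up to the next power of two
        e, m = e + 1, 0
    if e < 0:                            # below range: clamp to the smallest code
        e, m = 0, 0
    elif e > max_exp:                    # above range: clamp to the largest code
        e, m = max_exp, levels - 1
    return (e << man_bits) | m


def decode_fp8(code, exp_bits=3, man_bits=5, bias=4):
    """Decode a code produced by encode_fp8 with the same parameters."""
    e = (code >> man_bits) & ((1 << exp_bits) - 1)
    m = code & ((1 << man_bits) - 1)
    return math.ldexp(1.0 + m / (1 << man_bits), e - bias)
```

For in-range values, 5 mantissa bits bound the relative rounding error by 2^-6, versus 2^-5 with 4 mantissa bits, at the cost of a narrower exponent range.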
2024-06-22	Bitnet(1.75 bpw): slightly faster CUDA dot product	Iwan Kawrakow
	We get 205 t/s, so ~13% slower than 2 bit.
2024-06-22	Bitnet(2.25 bpw): faster Metal dot product	Iwan Kawrakow
	With this we get TG-128 = 97 t/s.
2024-06-22	Bitnet(2.25 bpw): Metal	Iwan Kawrakow
	We get PP-512 = 702 t/s, TG-128 = 84 t/s. This is almost on par with q4_0, which is rare on Metal (not to say nonexistent). For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet. TG is kind of funny because we hit 72 t/s on the CPU.
2024-06-22	Bitnet(2.25 bpw): CUDA	Iwan Kawrakow
	We get PP-512 = 9600 t/s, TG-128 = 234 t/s (but we need to use 8 CPU threads, else results are lower, so clearly there is something being computed on the CPU). PP-512 is very close to PP-512(fp16) = 9800 t/s.
2024-06-22	Bitnet(2.25 bpw): NEON	Iwan Kawrakow
	We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22	Bitnet: 2.25 bpw version	Iwan Kawrakow
	Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on the Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.
2024-06-22	bitnet 2 bpw: NEON implementation	Iwan Kawrakow
	We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22	Removed extra column	Iwan Kawrakow

2024-06-22	bitnet 2 bpw: AVX2 implementation	Iwan Kawrakow
	We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.
2024-06-22	bitnet: add 2 bpw quantization	Iwan Kawrakow
	The scalar dot product already achieves 37 t/s for TG!
2024-06-22	Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice	Iwan Kawrakow

2024-06-22	iqk_mul_mat(bitnet): fix typo	Iwan Kawrakow
	With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.
2024-06-22	iqk_mul_mat(bitnet): slightly faster AVX2	Iwan Kawrakow
	We now get 214 t/s on the Ryzen-7950X
2024-06-22	iq1_bn: better NEON implementation	Iwan Kawrakow
	PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.
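One way to read the (0, 1, 2) remark, sketched under assumptions: ternary weights {-1, 0, 1} can be stored shifted by one as unsigned values {0, 1, 2}, and the shift is removed once per dot product by subtracting the sum of the activations. The function names are hypothetical and this is not the NEON kernel itself.

```python
def dot_signed(weights, activations):
    """Reference dot product with ternary weights in {-1, 0, 1}."""
    return sum(w * a for w, a in zip(weights, activations))


def dot_unsigned(weights, activations):
    """Same result using unsigned codes in {0, 1, 2} plus one correction term."""
    codes = [w + 1 for w in weights]                   # {-1, 0, 1} -> {0, 1, 2}
    unsigned_sum = sum(q * a for q, a in zip(codes, activations))
    return unsigned_sum - sum(activations)             # undo the +1 shift
```

The inner loop then multiplies by non-negative values only, and the activation sum can be computed once per activation block and reused across rows.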
2024-06-22	iq1_bn(NEON): works now, but very slow	Iwan Kawrakow
	Basically 2X slower than q4_0.
2024-06-22	iq1_bn(Metal): 66.2 -> 67.1 t/s	Iwan Kawrakow

2024-06-22	iq1_bn(Metal): 64 -> 66.2 t/s for TG	Iwan Kawrakow
	This should be good enough. One cannot ask Apple Silicon to do too much work.
2024-06-22	iq1_bn(Metal): 64 -> 66.2 t/s for TG	Iwan Kawrakow

2024-06-22	iq1_bn(Metal): 60 -> 64 t/s for TG	Iwan Kawrakow

2024-06-22	iq1_bn: very slightly better Metal dot product	Iwan Kawrakow

2024-06-22	iq1_bn: Metal now works	Iwan Kawrakow
	PP performance is decent (668 t/s vs 724 t/s for q4_0), but TG is kind of low (60 t/s vs 81 t/s for q4_0).
2024-06-22	iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working	Iwan Kawrakow

2024-06-22	iqk_mul_mat(iq1_bn): WIP NEON (not working)	Iwan Kawrakow

2024-06-22	iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2	Iwan Kawrakow
	I now get PP-512 = 270 t/s on the Ryzen-5975WX
2024-06-22	iqk_mul_mat: improve iq1_bn (bitnet) on AVX2	Iwan Kawrakow
	We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
2024-06-22	bitnet: fix scalar dot product	Iwan Kawrakow
	I had forgotten to adjust for the change to q8_K64. On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22	bitnet: scale is per row, not per tensor	Iwan Kawrakow

2024-06-22	iqk_mul_mat: add iq1_bn (bitnet)	Iwan Kawrakow
	We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
2024-06-22	bitnet: CUDA, scalar, AVX2	Iwan Kawrakow

2024-06-22	bitnet: python + llama	Iwan Kawrakow

2024-06-22	iqk_mul_mat: cleanup	Iwan Kawrakow

2024-06-22	iqk_mul_mat: be independent of llamafile_sgemm	Iwan Kawrakow
	Verified that it works on AVX2. Also turned on any combination of f16 and f32 (i.e., added f16 x f16 and f32 x f32).
2024-06-22	iqk_mul_mat: be independent of llamafile_sgemm (WIP)	Iwan Kawrakow
	* Remove iqk_mul_mat from llamafile_sgemm
	* Pass tensor types and strides to iqk_mul_mat
	It is marked WIP because only tested on __aarch64__
2024-06-22	Fix nb4	Iwan Kawrakow

2024-06-22	iqk_mul_mat: add ability to disable it	Iwan Kawrakow

2024-06-22	iqk_mul_mat: be able to handle any f16/f32 combination on AVX2	Iwan Kawrakow
	But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22	iqk_mul_mat: turn on AVX512	Iwan Kawrakow
	It makes no difference on my Ryzen-7950X, but perhaps it will be beneficial for CPUs with real AVX512.
2024-06-22	iqk_mul_mat: slightly better fp16 with 16 vector registers	Iwan Kawrakow
	2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX up from 138 t/s. We use Nx registers to preload the fp16 weights, so total registers required is Nx * (Ny + 1), so 15 in the case of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare register helps. But maybe it is just a matter of how things get loaded into the cache. On the 7950X I did try 3 x 8 and it did not perform as well as 5 x 5.
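The register arithmetic above can be sketched as follows. The formula Nx * (Ny + 1) comes from the message; the function names and the enumeration helper are hypothetical additions for illustration.

```python
def registers_needed(nx, ny):
    """Nx registers hold preloaded fp16 weights, Nx * Ny hold accumulators."""
    return nx * (ny + 1)


def tiles_that_fit(num_registers, max_dim=8):
    """List every (nx, ny) tile shape whose register need fits the budget."""
    return [
        (nx, ny)
        for nx in range(1, max_dim + 1)
        for ny in range(1, max_dim + 1)
        if registers_needed(nx, ny) <= num_registers
    ]
```

For example, `registers_needed(3, 4)` gives 15 and `registers_needed(2, 6)` gives 14, so both fit in 16 vector registers, while 5 x 5 needs 30 and only fits when 32 registers are available.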
2024-06-22	iqk_mul_mat: better fp16 for AVX2	Iwan Kawrakow
	Basically use what I did for Arm. Improves PP performance to 141.7 t/s up from 136 t/s on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling). This is now 10% faster than tinyBLAS. There is a minor improvement also on the Ryzen-5975WX (16 vector registers, so we use 4x3 tiling): we get 138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.