ik_llama.cpp.git

Age	Commit message	Author
2024-07-23	Update README.md	Kawrakow

2024-07-19	When tokenizer info is missing in the model, use llama3 by default	Iwan Kawrakow

2024-07-18	iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON	Iwan Kawrakow
	Here the performance gain is more modest compared to AVX2: we get PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B running on M2 Max.
2024-07-18	iqk_mul_mat: attentions matrix multiplications	Iwan Kawrakow
	KQ and KQV are n_kv_embed x n_token x n_head matrix multiplications. Before this PR, this meant n_head calls to iqk_mul_mat to perform n_kv_embed x n_token 2D multiplications, each using nth threads. Instead, in this PR, if n_head is a multiple of nth, each thread does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices. This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from 409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B, we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from 139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
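	The work split described in this commit can be sketched as follows. This is only an illustration: the function name `heads_for_thread` and the contiguous assignment of heads to threads are assumptions, not taken from the actual iqk_mul_mat code.

```python
def heads_for_thread(n_head: int, nth: int, ith: int) -> range:
    """Heads owned by thread `ith` when n_head is a multiple of nth.

    Hypothetical helper: the contiguous head assignment is an assumption.
    """
    if n_head % nth != 0:
        # In this case the commit keeps the old scheme, where every thread
        # takes part in each of the n_head 2D multiplications.
        raise ValueError("n_head is not a multiple of nth")
    per_thread = n_head // nth
    return range(ith * per_thread, (ith + 1) * per_thread)


# Example: 32 heads on 16 threads, so each thread owns 2 complete heads
# and performs 2 full n_kv_embed x n_token multiplications on its own.
assignment = [list(heads_for_thread(32, 16, ith)) for ith in range(16)]
```

	With this split no two threads work on the same 2D multiplication.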
2024-07-18	iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2	Iwan Kawrakow
	I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrix multiplication where Q and K have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats requires that the row size is a multiple of the SIMD vector size (so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this matrix multiplication was getting done with ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in nearly a 20% performance boost for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance increases by nearly 70%!
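	The idea of full-width chunks followed by 4-wide chunks for the remainder can be sketched in plain Python. This only mirrors the loop structure. The real kernel uses SIMD intrinsics, and the names `dot_with_tail` and `chunk_counts` are hypothetical.

```python
def dot_with_tail(x, y, simd_width=8):
    """Dot product using full-width chunks plus 4-wide chunks for the rest."""
    n = len(x)
    if n != len(y) or n % 4 != 0:
        raise ValueError("row size must be a multiple of 4")
    total = 0.0
    n_full = n - n % simd_width  # part of the row covered by full-width chunks
    for i in range(0, n_full, simd_width):
        total += sum(a * b for a, b in zip(x[i:i + simd_width], y[i:i + simd_width]))
    for i in range(n_full, n, 4):  # leftover values, 4 at a time (the __m128 step)
        total += sum(a * b for a, b in zip(x[i:i + 4], y[i:i + 4]))
    return total


def chunk_counts(n, simd_width):
    """(full-width chunks, 4-wide tail chunks) for a row of n values."""
    return n // simd_width, (n % simd_width) // 4
```

	For the row size of 100 mentioned above, `chunk_counts(100, 8)` gives 12 full chunks plus one 4-wide chunk, and `chunk_counts(100, 16)` gives 6 full chunks plus one 4-wide chunk.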
2024-07-17	Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize	Iwan Kawrakow

2024-07-17	iq1bn: faster scalar dot product	Iwan Kawrakow
	At the end of the day, lookup is still better when not using simd. This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X with 16 threads (up from 10.5 t/s).
2024-07-17	iq1bn: fix scalar dot product	Iwan Kawrakow
	The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s) but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17	iq1bn: faster AVX2	Iwan Kawrakow
	Instead of shuffling quant data into a 128-bit register containing 8-bit ints and then converting to 16 bit, we directly shuffle into a 256-bit register containing 16-bit ints. TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads, getting 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s). I guess we amortize dequantization cost pretty well, so we don't gain much there. We get close to 100 GB/s single-threaded float32 throughput:
	./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
	iq1_bn
	  vec_dot_q
	    4096 values (0.02 MB)
	      min cycles/32 vals   :      3.87
	      avg cycles/32 vals   :      4.40
	      float32 throughput   :     98.27 GB/s
	      quantized throughput :      4.99 GB/s
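	The two throughput figures in the benchmark output are consistent with each other if one assumes that the quantized data takes 1.625 bits per weight (the figure quoted elsewhere in this log for iq1_bn) against 32 bits per float32 value. A quick arithmetic check under that assumption:

```python
FLOAT32_BITS = 32
# Assumption: bits per weight taken from the "1.625 bpw" figure used in this log.
IQ1_BN_BITS_PER_WEIGHT = 1.625

float32_throughput_gbs = 98.27  # reported by test-quantize-perf

# The same number of weights per second, counted in quantized bytes.
quantized_throughput_gbs = (
    float32_throughput_gbs * IQ1_BN_BITS_PER_WEIGHT / FLOAT32_BITS
)
print(f"{quantized_throughput_gbs:.2f} GB/s")
```

	The result rounds to the 4.99 GB/s reported for quantized throughput.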
2024-07-17	Remove the no longer used iq1bn_grid_u16	Iwan Kawrakow

2024-07-17	iq1bn: adjust scalar dot product and some cleanup	Iwan Kawrakow

2024-07-17	iq1bn(no lookup): better version	Iwan Kawrakow
	We have 4 groups of 16 in a block of 64 quants. For each group of 16 we have 3 groups of 5, each using 8 bits. The remaining 16th quants of the 4 groups of 16 are encoded with 8 bits using the same encoding as the groups of 5. The only kernel where we have complications is the CUDA dequantize kernel (because we are dequantizing 8 quants there, and we have different encoding for the 1st and 2nd group of 8 in a group of 16). This achieves better performance on all tested platforms than any previous 1.625 bpw attempt. We have:
	| model            |       size |     params | backend    | threads |          test |              t/s |
	| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
	| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |
	It is still slower than 2 bpw Bitnet, but the difference now is not as dramatic.
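	The block layout described in this commit (4 groups of 16, each stored as 3 bytes of 5 quants, plus one byte for the four 16th quants) adds up to 13 bytes per 64 quants, i.e. 1.625 bits per weight. The sketch below packs ternary values 0, 1, 2 as plain base-3 numbers, which works because 3^5 = 243 fits in 8 bits. The plain base-3 byte encoding and all function names are assumptions for illustration, and the actual iq1_bn encoding may differ.

```python
def pack5(trits):
    """Pack up to five values from {0, 1, 2} into one base-3 number."""
    assert len(trits) <= 5 and all(t in (0, 1, 2) for t in trits)
    value = 0
    for t in reversed(trits):  # first trit ends up as the least significant digit
        value = value * 3 + t
    return value  # at most 3**5 - 1 = 242, so it fits in 8 bits


def unpack(byte, count):
    """Recover `count` base-3 digits, least significant first."""
    out = []
    for _ in range(count):
        out.append(byte % 3)
        byte //= 3
    return out


def pack_block(quants):
    """Pack 64 ternary quants into 13 bytes: 4 x 3 bytes plus 1 shared byte."""
    assert len(quants) == 64
    groups = [quants[16 * g:16 * g + 16] for g in range(4)]
    packed = []
    for group in groups:
        for k in range(3):  # three groups of 5 per group of 16
            packed.append(pack5(group[5 * k:5 * k + 5]))
    packed.append(pack5([group[15] for group in groups]))  # the four 16th quants
    return bytes(packed)


def unpack_block(packed):
    """Inverse of pack_block."""
    last = unpack(packed[12], 4)
    quants = []
    for g in range(4):
        for k in range(3):
            quants += unpack(packed[3 * g + k], 5)
        quants.append(last[g])
    return quants


bits_per_weight = 8 * 13 / 64  # 104 bits for 64 quants
```

	Here `bits_per_weight` evaluates to 1.625, matching the bpw figure in the table.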
2024-07-16	iq1bn(no lookup): Metal	Iwan Kawrakow
	In summary, compared to lookup, the multiplication based approach is
	* Much better on AVX2
	* Slightly better on CUDA
	* Slightly worse on Metal
	* Much worse on NEON
2024-07-16	iq1bn(no lookup): NEON attempts	Iwan Kawrakow
	We are at TG-128 = 25.7 t/s, which is quite a bit worse than lookup.
2024-07-15	iq1bn(no lookup): NEON	Iwan Kawrakow
	Pretty bad.
2024-07-15	iq1bn(no lookup): CUDA	Iwan Kawrakow
	Not good. We only get ~160 t/s.
2024-07-15	iq1bn(no lookup): somewhat better	Iwan Kawrakow
	We now have for Bitnet-3B:
	| threads |          test |              t/s |
	| ------: | ------------: | ---------------: |
	|      16 |         pp512 |    308.97 ± 1.89 |
	|      16 |         tg128 |     58.80 ± 0.07 |
	|       8 |         tg128 |     49.79 ± 1.23 |
	|       4 |         tg128 |     28.85 ± 0.02 |
	|       2 |         tg128 |     15.39 ± 0.01 |
2024-07-15	iq1bn: attempt without a lookup table	Iwan Kawrakow

2024-06-27	Remove all workflows	Iwan Kawrakow

2024-06-26	imatrix: be able to specify the name of the output tensor	Iwan Kawrakow
	For some models the same tensor is used for token embeddings and output. This tensor tends to be named token_embedding.weight rather than output.weight, which prevents us from collecting imatrix data for this tensor. With this commit we can tell the imatrix tool the name of the output tensor.
2024-06-26	bitnet: fold V scale into rms_norm	Iwan Kawrakow

2024-06-26	RoPE(Neox, Metal): don't use power functions in a loop	Iwan Kawrakow
	Speeds up Bitnet by ~2% on Metal.
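	One common way to avoid a power function inside a RoPE loop is to compute the per-step factor once and keep a running product. The sketch below shows that general technique in Python. It is not the Metal kernel from this commit, and the function names and the default `freq_base` are assumptions.

```python
import math


def rope_angles_pow(pos, n_dims, freq_base=10000.0):
    """Reference version: one power function call per pair of dimensions."""
    return [pos * freq_base ** (-2.0 * i / n_dims) for i in range(n_dims // 2)]


def rope_angles_running(pos, n_dims, freq_base=10000.0):
    """Same angles with a single power call outside the loop."""
    theta_scale = freq_base ** (-2.0 / n_dims)
    theta = float(pos)
    angles = []
    for _ in range(n_dims // 2):
        angles.append(theta)
        theta *= theta_scale  # running product replaces the power call
    return angles


# The two versions agree up to floating-point rounding.
same = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(rope_angles_pow(17, 128), rope_angles_running(17, 128))
)
```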
2024-06-25	Typo	Iwan Kawrakow

2024-06-25	bitnet: remove iq1_bn lookup table storing +/- signs	Iwan Kawrakow
	The AVX2 implementation was the only one left using it, so I decided to see if we can get a performant implementation using the 0,1,2 lookup table. Turns out we can, and it is even slightly faster than the sign based table. We now get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads on the Ryzen-7950X. With only one lookup table left for iq1_bn, I renamed it to iq1bn_grid_u16.
2024-06-25	bitnet: simdify q8_K64 quantization on AVX	Iwan Kawrakow
	Doesn't make a real difference in performance.
2024-06-25	bitnet: NEON improvements for iq1_bn	Iwan Kawrakow
	With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25	bitnet: remove the now unused iq1bn_grid_u16	Iwan Kawrakow

2024-06-25	Bitnet: adapt NEON and Metal to the alternative grid	Iwan Kawrakow

2024-06-25	Bitnet: trying an alternative iq1_bn grid	Iwan Kawrakow
	Faster on CUDA. The scalar version is faster too. The issue with CUDA is that now I see wild performance fluctuations. Running llama-bench I can get 220 t/s for TG-128 one time, and 190 t/s another time, with uncertainties of 1-2 t/s. Same for PP, results are jumping back and forth between ~9500 t/s and ~8900 t/s. So, basically no reliable measurement at this point, but for sure faster than the previous version, which was at around 170-180 t/s.
2024-06-25	bitnet: fix scalar dot product for 1.625 bpw	Iwan Kawrakow
	I had not adjusted after going to 4 q8 scales per row.
2024-06-25	Bitnet: slightly faster 1.625 bpw variant for AVX512VL	Iwan Kawrakow

2024-06-24	Bitnet: tiny bit faster 1.625 bpw variant on Metal	Iwan Kawrakow
	We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-24	Adding add_4, mul_4, div_4 kernels to Metal	Iwan Kawrakow
	This gives ~2% speedup for Bitnet on Metal
2024-06-22	bitnet: qnfs tests	Iwan Kawrakow
	Q8_0 fails because as per design the reference quantization is different from the vecdot quantization.
2024-06-22	bitnet: replace ggml_mul with ggml_scale to apply the scales	Iwan Kawrakow
	Also save one scale operation in the ffn network by adjusting rms_eps. We gain up to 3% in performance by doing this, but it is a bit of a hack (we store the tensor scales in op_params while loading the model).
2024-06-22	iqk_mul_mat: add IQ4_NL also on NEON	Iwan Kawrakow
	PPL seems somewhat higher? For llama-v2-7B we are still ~0.04 higher compared to what we expect after ~30 batches.
2024-06-22	iqk_mul_mat: add IQ4_NL	Iwan Kawrakow
	I never use it, so I had completely forgotten about it.
2024-06-22	bitnet(scale in a separate tensor): CPU tweaks	Iwan Kawrakow
	A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22	bitnet(scale in a separate tensor): CPU tweaks	Iwan Kawrakow
	I had ruined TG performance on AVX2 with the last commit. Was just testing at 8 threads and there we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950. Back to 51 t/s with this commit.
2024-06-22	bitnet(scale in a separate tensor): more CPU improvements	Iwan Kawrakow
	It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance:
	| System    | quant | PP-512        | TG-128       | quant | PP-512        | TG-128       |
	| M2 Max    | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
	| Ryzen7950 | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
	| Ryzen5975 | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
2024-06-22	bitnet(scale in a separate tensor): CPU improvements	Iwan Kawrakow
	Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.
2024-06-22	bitnet(scale in a separate tensor): mul -> scale on the CPU	Iwan Kawrakow

2024-06-22	bitnet(scale in a separate tensor): mul -> scale on CUDA	Iwan Kawrakow
	On CUDA we do not have access to the tensor data until we hit the kernel. That's why this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we have without the extra scale operation. PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.
2024-06-22	bitnet(scale in a separate tensor): mul -> scale on Metal	Iwan Kawrakow
	Do the mul -> scale replacement on the fly in the Metal backend. This recovers the PP performance and cuts the TG performance degradation in half.
2024-06-22	Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"	Iwan Kawrakow
	This reverts commit f83381371b61e0863b55c60e5f5df139126a496d. When using CUDA, the tensor contents have not been loaded yet, so we crash when trying to access the scale when building the graph. There must be a better way.
2024-06-22	bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale	Iwan Kawrakow
	This recovers part of the performance loss. On Metal TG-128 is now 92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22	bitnet(scale in a separate tensor): Metal	Iwan Kawrakow
	iq2_bn TG-128 drops to 84 t/s, while I see in the logs that we had 97 t/s. If true, that's a pretty massive performance penalty for TG. Let me guess: ggml_mul is not exactly the most performant operation on Metal.
2024-06-22	bitnet(scale in a separate tensor): CUDA	Iwan Kawrakow

2024-06-22	bitnet: put the scale in a separate tensor	Iwan Kawrakow
	and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.
2024-06-22	Bitnet(1.75 bpw): higher precision fp8 scale	Iwan Kawrakow
	Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (but the previous version with 4 bits each for the exponent and mantissa was good enough for any practical purposes).
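	The trade-off between exponent and mantissa bits can be illustrated with a generic 8-bit mini-float. Moving one bit from the exponent to the mantissa halves the relative spacing between neighbouring values (1/32 instead of 1/16) at the cost of covering fewer powers of two (8 instead of 16). The decoding formula, the absence of a sign bit, and the `bias` parameter below are assumptions for illustration, not the format used in the actual code.

```python
def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode a hypothetical unsigned mini-float.

    value = (1 + m / 2**man_bits) * 2**(e - bias)
    """
    assert exp_bits + man_bits == 8 and 0 <= byte < 256
    e = byte >> man_bits                  # upper bits: exponent
    m = byte & ((1 << man_bits) - 1)      # lower bits: mantissa
    return (1.0 + m / (1 << man_bits)) * 2.0 ** (e - bias)


def relative_step(man_bits):
    """Spacing of neighbouring values relative to their power of two."""
    return 1.0 / (1 << man_bits)


def exponent_count(exp_bits):
    """Number of distinct powers of two the format can represent."""
    return 1 << exp_bits


# 3 exponent bits + 5 mantissa bits versus 4 + 4
e3m5 = (relative_step(5), exponent_count(3))
e4m4 = (relative_step(4), exponent_count(4))
```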