2024-08-01  Factor out iqk CUDA dot products  (Iwan Kawrakow)
I cannot possibly wait for a 5-minute nvcc compilation each time I touch vecdotq.cuh. Also, cmake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.
2024-08-01  iq5_k: CUDA dot product still not working  (Iwan Kawrakow)
2024-08-01  iq5_k: Metal  (Iwan Kawrakow)
Performance is roughly on par with q5_0.
2024-08-01  iq5_k: NEON  (Iwan Kawrakow)
2024-08-01  iq5_k: AVX512  (Iwan Kawrakow)
2024-08-01  iq5_k: AVX2  (Iwan Kawrakow)
2024-08-01  iq5_k: Basics  (Iwan Kawrakow)
Quantize/dequantize, CUDA dequantize
2024-08-01  iq2_k: Metal. Dot product is wrong  (Iwan Kawrakow)
2024-08-01  iq2_k: NEON  (Iwan Kawrakow)
2024-08-01  iq2_k: slightly faster AVX512  (Iwan Kawrakow)
2024-08-01  iq2_k: simplify AVX512  (Iwan Kawrakow)
2024-08-01  iq2_k: AVX2  (Iwan Kawrakow)
2024-08-01  iq2_k: Basics  (Iwan Kawrakow)
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-07-28  IQ4_K: SOTA 4-bit quantization (#6)  (Kawrakow)
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
  For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
  For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X.
* iq4_k: NEON implementation
  For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
  For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27  Simdify and multi-thread tanh (#4)  (Kawrakow)
It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for softcapping the final output. ggml had tanh set to be computed with a single thread. Combined with tanh(x) being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and simd-ifying the tanh computation, I observe a 33% gain in prompt processing speed (!!!). TG is of course memory bound, but despite this, we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).

Simd-ifying: we have tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1), so we can just use Justine Tunney's SIMD exp implementation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
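For reference, a minimal scalar sketch of the identity the commit relies on (not the actual ggml code: the real change swaps expf for Justine Tunney's SIMD exp and splits the rows across threads; the exp(-2|x|) rearrangement here is my addition to avoid overflow for large |x|):

```c
#include <math.h>

// Sketch only: tanh via the exp identity, written so the hot loop needs
// nothing but a (vectorizable) exp.  tanh(x) = (exp(2x) - 1)/(exp(2x) + 1),
// evaluated through exp(-2|x|) so that large |x| cannot overflow.
static void vec_tanh_f32(int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        const float e = expf(-2.0f*fabsf(x[i]));
        y[i] = copysignf((1.0f - e)/(1.0f + e), x[i]);
    }
}
```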
2024-07-27  Merge mainline llama.cpp (#3)  (Kawrakow)
* Merging mainline - WIP
* Merging mainline - WIP
  AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU - the right way (#2)  (Kawrakow)
OK, I should have checked how it was done for Gemma and done the same for Bitnet. But better late than never.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU (#1)  (Kawrakow)
* bitnet: put token embeddings on the GPU
* Update README with the new CUDA/Metal performance

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-25  iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation  (Iwan Kawrakow)
2024-07-24  Update README.md  (Kawrakow)
2024-07-24  Update README.md  (Kawrakow)
Trying to avoid line breaks in table
2024-07-24  Update README.md  (Kawrakow)
2024-07-24  Add copyright notices  (Iwan Kawrakow)
Only on the files where I have contributed in a significant way, or the files I wrote myself.
2024-07-24  Remove unused file  (Iwan Kawrakow)
2024-07-24  Remove security  (Iwan Kawrakow)
2024-07-24  Correct spelling in README  (Iwan Kawrakow)
2024-07-24  Update README.md  (Kawrakow)
Adding some more details
2024-07-24  Update README.md  (Kawrakow)
Adding MoE and Bitnet performance tables
2024-07-24  Update README.md  (Kawrakow)
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so I removed the test column from the performance tables.
2024-07-24  Update README.md  (Kawrakow)
Added performance comparison tables
2024-07-24  iqk_mul_mat(NEON): special case for n not divisible by 8  (Iwan Kawrakow)
Else fp16 PP performance drops by nearly a factor of 2 compared to what we had before.
2024-07-24  ggml: thread synchronization on Arm  (Iwan Kawrakow)
For x86, slaren was generous enough to add _mm_pause() to the busy spin wait loop in ggml_barrier(), but everything else just busy spins, loading an atomic int on every iteration and thus forcing cache sync between the cores. This results in a massive drop in performance on my M2-Max laptop when using 8 threads.

The closest approximation to _mm_pause() on Arm seems to be __asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads recovers back to expected levels.
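A minimal sketch of the idea (illustrative names, not the actual ggml_barrier code):

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
static inline void spin_pause(void) { _mm_pause(); }                    // x86 pause hint
#elif defined(__aarch64__)
static inline void spin_pause(void) { __asm__ __volatile__("isb\n"); }  // Arm equivalent
#else
static inline void spin_pause(void) { }
#endif

// Busy wait until *flag reaches target; the pause/isb hint keeps the core
// from hammering the shared cache line on every iteration.
static void spin_wait(atomic_int * flag, int target) {
    while (atomic_load_explicit(flag, memory_order_relaxed) != target) {
        spin_pause();
    }
}
```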
2024-07-24  Fix "make it work for row sizes that are multiple of 4 on NEON"  (Iwan Kawrakow)
2024-07-23  Update README.md  (Kawrakow)
2024-07-23  Update README.md  (Kawrakow)
2024-07-19  When tokenizer info is missing in the model, use llama3 by default  (Iwan Kawrakow)
2024-07-18  iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON  (Iwan Kawrakow)
Here the performance gain is more modest compared to AVX2: we get PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B running on M2 Max.
2024-07-18  iqk_mul_mat: attention matrix multiplications  (Iwan Kawrakow)
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications. Before this PR, this meant n_head calls to iqk_mul_mat to perform n_kv_embed x n_token 2D multiplications, each using nth threads. Instead, in this PR, if n_head is a multiple of nth, each thread does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices. This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from 409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B, we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from 139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
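A rough sketch of the scheduling change (names are illustrative, not the actual iqk_mul_mat interface):

```c
// Before: for each of the n_head heads, all nth threads cooperate on one
// n_kv_embed x n_token 2D multiplication (n_head synchronized passes).
// After:  when n_head is a multiple of nth, thread ith processes
// n_head/nth whole heads on its own, with no per-head synchronization.
static void mul_mat_per_head(int n_head, int nth, int ith,
                             void (*mul_mat_2d)(int head)) {
    if (n_head % nth == 0) {
        const int per_thread = n_head / nth;
        for (int h = ith*per_thread; h < (ith + 1)*per_thread; ++h) {
            mul_mat_2d(h);  // full 2D multiply done by this thread alone
        }
        return;
    }
    // otherwise: fall back to the old path (all threads per head), omitted here
}
```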
2024-07-18  iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2  (Iwan Kawrakow)
I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrix multiplication where Q and K have the shape 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats requires that the row size is a multiple of the SIMD vector size (so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975X), and hence this matrix multiplication was getting done with ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in nearly a 20% performance boost for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance increases by nearly 70%!
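A minimal sketch of the tail handling (illustrative, not the actual iqk_mul_mat kernel; assumes AVX2+FMA): the main loop consumes 8 floats per step, and a row length that is only a multiple of 4 is finished with a single __m128 step instead of falling back to ggml.

```c
#include <immintrin.h>

// Dot product for n a multiple of 4 (compile with -mavx2 -mfma).
static float dot_f32_mul4(const float * x, const float * y, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
    }
    // fold the 256-bit accumulator to 128 bits, then add the group-of-4 tail
    __m128 acc4 = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    if (i + 4 <= n) {
        acc4 = _mm_fmadd_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i), acc4);
    }
    // horizontal sum of the 4 lanes
    acc4 = _mm_add_ps(acc4, _mm_movehl_ps(acc4, acc4));
    acc4 = _mm_add_ss(acc4, _mm_movehdup_ps(acc4));
    return _mm_cvtss_f32(acc4);
}
```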
2024-07-17  Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize  (Iwan Kawrakow)
2024-07-17  iq1bn: faster scalar dot product  (Iwan Kawrakow)
At the end of the day, lookup is still better when not using simd. This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X with 16 threads (up from 10.5 t/s).
2024-07-17  iq1bn: fix scalar dot product  (Iwan Kawrakow)
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s) but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17  iq1bn: faster AVX2  (Iwan Kawrakow)
Instead of shuffling quant data into a 128-bit register containing 8-bit ints, and then converting to 16 bit, we directly shuffle into a 256-bit register containing 16-bit ints. TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads, getting 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s). I guess we amortize dequantization cost pretty well, so we don't gain much there.

We get close to 100 GB/s single-threaded float32 throughput:

./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
  iq1_bn
    vec_dot_q
      4096 values (0.02 MB)
      min cycles/32 vals   : 3.87
      avg cycles/32 vals   : 4.40
      float32 throughput   : 98.27 GB/s
      quantized throughput : 4.99 GB/s
2024-07-17  Remove the no longer used iq1bn_grid_u16  (Iwan Kawrakow)
2024-07-17  iq1bn: adjust scalar dot product and some cleanup  (Iwan Kawrakow)
2024-07-17  iq1bn(no lookup): better version  (Iwan Kawrakow)
We have 4 groups of 16 in a block of 64 quants. For each group of 16 we have 3 groups of 5, each using 8 bits. The remaining 16th quants of the 4 groups of 16 are encoded with 8 bits using the same encoding as the groups of 5. The only kernel where we have complications is the CUDA dequantize kernel (because we are dequantizing 8 quants there, and we have different encoding for the 1st and 2nd group of 8 in a group of 16).

This achieves better performance on all tested platforms than any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend | threads |  test |             t/s |
| ---------------- | ---------: | ---------: | ------- | ------: | ----: | --------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA    |       8 | pp512 | 9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA    |       8 | tg128 |   229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2    |      16 | pp512 |   322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2    |      16 | tg128 |    59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2    |       8 | tg128 |    57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2    |       4 | tg128 |    33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2    |       2 | tg128 |    18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal   |       8 | pp512 |   698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal   |       8 | tg128 |    68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON    |       8 | pp512 |   196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON    |       8 | tg128 |    51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON    |       4 | tg128 |    30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON    |       2 | tg128 |    16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as dramatic.
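For the packing density only, a tiny sketch of how 5 ternary values fit into one byte (3^5 = 243 <= 256), which is what gives 4*3 + 1 = 13 bytes per block of 64 quants, i.e. 1.625 bpw. This is an illustration using div/mod decoding; the actual iq1_bn kernels use a multiplication-based decode and a different bit layout.

```c
#include <stdint.h>

// Ternary quants -1, 0, +1 are stored as base-3 digits 0, 1, 2.
static uint8_t pack5(const int8_t q[5]) {          // 5 digits -> 1 byte
    uint8_t b = 0;
    for (int j = 4; j >= 0; --j) b = (uint8_t)(3*b + (q[j] + 1));
    return b;                                      // value in 0..242
}

static void unpack5(uint8_t b, int8_t q[5]) {      // 1 byte -> 5 digits
    for (int j = 0; j < 5; ++j) { q[j] = (int8_t)(b % 3) - 1; b /= 3; }
}
```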
2024-07-16  iq1bn(no lookup): Metal  (Iwan Kawrakow)
In summary, compared to lookup, the multiplication based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
2024-07-16  iq1bn(no lookup): NEON attempts  (Iwan Kawrakow)
We are at TG-128 = 25.7 t/s, which is quite a bit worse than lookup.
2024-07-15  iq1bn(no lookup): NEON  (Iwan Kawrakow)
Pretty bad.
2024-07-15  iq1bn(no lookup): CUDA  (Iwan Kawrakow)
Not good. We only get ~160 t/s.