Age  Commit message  Author
2024-08-09  Fix Makefile  Iwan Kawrakow
I always use cmake, so had forgotten to pay attention to the Makefile.
2024-08-09  Fix Zen4 implementation of iq3_k, iq4_k, iq5_k  Iwan Kawrakow
See comments in f3a823ce729a7db33e7d4375eae7291bbe6196db
2024-08-09  iq6_k: AVX2  Iwan Kawrakow
2024-08-09  iq6_k: Metal  Iwan Kawrakow
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128. Someone has screwed up Q6_K TG performance on Metal? With the continuous "improvements" in ggml I wouldn't be surprised. Need to look into it later.
2024-08-09  iq6_k: NEON  Iwan Kawrakow
Respectable performance, only slightly slower than Q6_K.
2024-08-09  iq6_k: slightly better Zen4 iqk_mul_mat  Iwan Kawrakow
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B. TG-128 is 9.5 t/s. This is better than the last commit, but still kind of slow compared to Q6_K. My last commit message is wrong: iq3_k also needs a fix for overflow.
2024-08-09  iq6_k: Zen4 iqk_mul_mat  Iwan Kawrakow
We need to do 4 shuffles to get the non-uniform values, so this makes it slower than the other iqX_k quants. And then I realized that I was using the standard Zen4 template for all iqX_k quants. The standard template converts the 32-bit integers obtained after _mm512_dpbusds_epi32 back to 16 bits and then multiplies by the 16-bit block scales. But this can overflow for iq4_k, iq5_k, and iq6_k.

I guess I did not notice with iq4_k and iq5_k because the PPL difference to CUDA was relatively small, and I attributed it to Q8_K not being accurate enough for the activations. But for iq6_k the PPL difference was much too big to be attributable to Q8_K inaccuracies, so that's when I realized that I cannot pack the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit iqX_k quants.

For now I have fixed it for iq6_k, but the outcome is that it is significantly slower than Q6_K: I get PP-512 = 125 t/s for LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better approach.
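For concreteness, here is a minimal back-of-the-envelope sketch of why a 16-bit intermediate cannot hold the per-block sums for the wider iqX_k quants. The sub-block size and value ranges are assumptions for illustration, not numbers taken from the actual iqk_mul_mat code:

    #include <stdint.h>
    #include <stdio.h>

    // Assumed for illustration only: a sub-block of 32 weights, quant values of
    // up to ~127 after the non-linear lookup, and Q8_K activations of up to 127.
    int main(void) {
        const int32_t block_size = 32;
        const int32_t max_quant  = 127;
        const int32_t max_q8     = 127;
        int32_t worst_case = block_size * max_quant * max_q8;  // 516128
        printf("worst-case block sum = %d, INT16_MAX = %d\n", worst_case, INT16_MAX);
        // The 32-bit _mm512_dpbusds_epi32 accumulators hold this fine; packing
        // them into 16 bits before applying the block scale does not.
        return 0;
    }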
2024-08-09  iq6_k: CUDA dot product  Iwan Kawrakow
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
2024-08-09  iq6_k: CUDA dequantize  Iwan Kawrakow
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).
2024-08-09  iq6_k: WIP (quantize/dequantize)  Iwan Kawrakow
2024-08-09  iq6_k: WIP (nothing works)  Iwan Kawrakow
2024-08-07  Adding IQ2_TN for use with ternary models (#13)  Kawrakow
* iq2_tn: TriLM-specific 2.0625 bpw quantization (see the sketch after this list). Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B without any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation.
* iq2_tn: AVX512. Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B.
* iq2_tn: AVX512. With this tweak we get to PP-512 = 431 t/s.
* iq2_tn: AVX512. With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads.
* iq2_tn: AVX2. PP-512 = 440 t/s on the Ryzen-5975WX. We should be able to do better.
* iq2_tn: initial NEON version.
* iq2_tn: NEON. For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn and the 3.3B Bitnet.
* iq2_tn: Metal. For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s.
* iq2_tn: CUDA. For TriLM-3.9B running on an RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s.
* iq2_tn: AVX2 PP improvement. We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn. Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn.
* iq2_tn: small NEON improvement. For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
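As an aside, one way to account for the 2.0625 bpw figure in the first bullet is sketched below. The layout is an assumption for illustration (field names and sizes are hypothetical, not the actual block_iq2_tn definition): 256 ternary weights packed at 2 bits each plus a single fp16 scale per super-block.

    #include <stdint.h>

    typedef uint16_t ggml_half;   // fp16 storage type (assumed here)
    #define QK_K 256              // super-block size used by the k-/i-quants

    // Hypothetical layout sketch, not the real struct.
    typedef struct {
        ggml_half d;              // one 16-bit scale per super-block
        uint8_t   qs[QK_K/4];     // 256 weights * 2 bits = 64 bytes
    } block_iq2_tn_sketch;

    // bits per weight = (16 + 8*64) / 256 = 528 / 256 = 2.0625
    _Static_assert(sizeof(block_iq2_tn_sketch) * 8 == 528, "2.0625 bpw for 256 weights");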
2024-08-05  q2_K: allow it to detect ternary nets and quantize accordingly  Iwan Kawrakow
2024-08-05  Update README.md  Kawrakow
There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.
2024-08-05  iq3_k, iq5_k: faster quantization  Iwan Kawrakow
Just use the same trick as iq4_k
2024-08-03  iq4_k: speedup quantization by a factor of ~2  Iwan Kawrakow
2024-08-01  Add copyright notice  Iwan Kawrakow
2024-08-01  iq2/3_k: tiny bit faster Metal dot products  Iwan Kawrakow
2024-08-01  iq3_k: slightly faster Metal dequantize kernel  Iwan Kawrakow
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01  iq3_k: Metal dot product  Iwan Kawrakow
Quite slow: 43 t/s for a 7B model
2024-08-01  iq2_k: Metal dot product finally works  Iwan Kawrakow
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.
2024-08-01  iq3_k: Metal dequantize  Iwan Kawrakow
2024-08-01  iq3_k: NEON  Iwan Kawrakow
2024-08-01  iq3_k: AVX2 iqk_mul_mat  Iwan Kawrakow
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01  iq3_k: AVX512 iqk_mul_mat  Iwan Kawrakow
We get PP-512 = 180 t/s, TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp.
2024-08-01  iq3_k: faster CUDA dot product  Iwan Kawrakow
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01  iq3_k: CUDA dot product  Iwan Kawrakow
Slightly slower than iq3_s - 132 t/s vs 138 t/s for LLaMA-3.1-8B.
2024-08-01  iq3_k: Basics  Iwan Kawrakow
Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01  iq2_k: very slightly better CUDA dot product  Iwan Kawrakow
169.2 t/s vs 167.8 t/s before.
2024-08-01  iq2_k: better CUDA dot product  Iwan Kawrakow
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01  iq2_k: CUDA dot product finally works  Iwan Kawrakow
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs 172 t/s for iq2_xs.
2024-08-01  iq5_k: CUDA dot product finally works  Iwan Kawrakow
2024-08-01  Factor out iqk CUDA dot products  Iwan Kawrakow
I cannot possibly wait for a 5-minute nvcc compilation each time I touch vecdotq.cuh. Also, cmake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.
2024-08-01  iq5_k: CUDA dot product still not working  Iwan Kawrakow
2024-08-01  iq5_k: Metal  Iwan Kawrakow
Performance is roughly on par with q5_0.
2024-08-01  iq5_k: NEON  Iwan Kawrakow
2024-08-01  iq5_k: AVX512  Iwan Kawrakow
2024-08-01  iq5_k: AVX2  Iwan Kawrakow
2024-08-01  iq5_k: Basics  Iwan Kawrakow
Quantize/dequantize, CUDA dequantize
2024-08-01  iq2_k: Metal. Dot product is wrong  Iwan Kawrakow
2024-08-01  iq2_k: NEON  Iwan Kawrakow
2024-08-01  iq2_k: slightly faster AVX512  Iwan Kawrakow
2024-08-01  iq2_k: simplify AVX512  Iwan Kawrakow
2024-08-01  iq2_k: AVX2  Iwan Kawrakow
2024-08-01  iq2_k: Basics  Iwan Kawrakow
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-07-28  IQ4_K: SOTA 4-bit quantization (#6)  Kawrakow
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16 (see the note after this list). In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA.
* iq4_k: AVX512 implementation. For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S.
* iq4_k: AVX2 implementation. For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X.
* iq4_k: NEON implementation. For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation. For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
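For reference, the "x% above fp16" figures are presumably the relative perplexity increase, i.e. PPL(quant)/PPL(fp16) - 1. A small sketch of that arithmetic, with the fp16 PPL back-solved from the numbers quoted above rather than measured independently:

    #include <stdio.h>

    int main(void) {
        // Back-solved: 6.5258 is said to be 1.77% above fp16.
        double ppl_fp16 = 6.5258 / 1.0177;                   // ~6.412
        double ppl_iq4k = 6.5258;
        double err = 100.0 * (ppl_iq4k / ppl_fp16 - 1.0);    // relative PPL increase in %
        printf("iq4_k quantization error = %.2f%%\n", err);  // prints 1.77%
        return 0;
    }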
2024-07-27  Simdify and multi-thread tanh (#4)  Kawrakow
It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for softcapping the final output. ggml had tanh set to be computed with a single thread. Combined with tanh(x) being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and simd-ifying the tanh computation, I observe a 33% gain in prompt processing speed (!!!). TG is of course memory bound, but despite this we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).

Simd-ifying: we have tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1), so we can just use Justine Tunney's SIMD exp implementation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
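A scalar sketch of that identity (the actual change vectorizes the exp; this only shows the element-wise math):

    #include <math.h>

    // tanh(x) = (exp(2x) - 1) / (exp(2x) + 1) = 1 - 2 / (exp(2x) + 1).
    // The second form is used so that large |x| does not evaluate inf/inf.
    static inline float tanh_via_exp(float x) {
        return 1.0f - 2.0f / (expf(2.0f * x) + 1.0f);
    }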
2024-07-27  Merge mainline llama.cpp (#3)  Kawrakow
* Merging mainline - WIP
* Merging mainline - WIP. AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU - the right way (#2)  Kawrakow
OK, I should have checked how it was done for Gemma and done the same for Bitnet. But better late than never.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU (#1)  Kawrakow
* bitnet: put token embeddings on the GPU
* Update README with the new CUDA/Metal performance

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>