Age  Commit message  Author
2024-07-27  Merge mainline llama.cpp (#3)  Kawrakow
* Merging mainline - WIP
* Merging mainline - WIP: AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU - the right way (#2)  Kawrakow
OK, I should have checked how it was done for Gemma and done the same for Bitnet. But better late than never.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26  Offload Bitnet token embeddings to the GPU (#1)  Kawrakow
* bitnet: put token embeddings on the GPU
* Update README with the new CUDA/Metal performance
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-25  iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation  Iwan Kawrakow
2024-07-24  Update README.md  Kawrakow
2024-07-24  Update README.md  Kawrakow
Trying to avoid line breaks in the table.
2024-07-24  Update README.md  Kawrakow
2024-07-24  Add copyright notices  Iwan Kawrakow
Only on the files where I have contributed in a significant way, or the files I wrote myself.
2024-07-24  Remove unused file  Iwan Kawrakow
2024-07-24  Remove security  Iwan Kawrakow
2024-07-24  Correct spelling in README  Iwan Kawrakow
2024-07-24  Update README.md  Kawrakow
Adding some more details
2024-07-24  Update README.md  Kawrakow
Adding MoE and Bitnet performance tables
2024-07-24  Update README.md  Kawrakow
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what happened here, so I removed the test column from the performance tables.
2024-07-24  Update README.md  Kawrakow
Added performance comparison tables
2024-07-24  iqk_mul_mat(NEON): special case for n not divisible by 8  Iwan Kawrakow
Otherwise fp16 PP performance drops by nearly a factor of 2 compared to what we had before.
2024-07-24  ggml: thread synchronization on Arm  Iwan Kawrakow
For x86, slaren was generous enough to add _mm_pause() to the busy spin-wait loop in ggml_barrier(), but everything else just busy-spins, loading an atomic int on every iteration and thus forcing cache synchronization between the cores. This results in a massive drop in performance on my M2-Max laptop when using 8 threads. The closest approximation to _mm_pause() on Arm seems to be __asm__ __volatile__("isb\n"); After adding this to the busy spin loop, performance for 8 threads recovers back to expected levels.
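The idea can be sketched roughly as follows (illustrative C, not the actual ggml_barrier() code; spin_pause and barrier_wait are made-up names): insert a CPU "pause" hint into the spin loop so the waiting core stops hammering the shared cache line.

```c
#include <stdatomic.h>

// CPU "pause" hint so a busy-spinning core does not force constant cache traffic.
static inline void spin_pause(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");   // roughly what _mm_pause() emits on x86
#elif defined(__aarch64__)
    __asm__ __volatile__("isb");     // closest Arm equivalent, per the commit message
#endif
}

// Busy-wait until all n_threads have arrived at the barrier counter.
static void barrier_wait(atomic_int * n_arrived, int n_threads) {
    while (atomic_load_explicit(n_arrived, memory_order_acquire) < n_threads) {
        spin_pause();
    }
}
```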
2024-07-24  Fix "make it work for row sizes that are multiple of 4 on NEON"  Iwan Kawrakow
2024-07-23  Update README.md  Kawrakow
2024-07-23  Update README.md  Kawrakow
2024-07-19  When tokenizer info is missing in the model, use llama3 by default  Iwan Kawrakow
2024-07-18  iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON  Iwan Kawrakow
Here the performance gain is more modest compared to AVX2: we get PP-512 = 200 t/s, up from 190 t/s, for iq1_bn-quantized Bitnet-3B running on the M2 Max.
2024-07-18  iqk_mul_mat: attention matrix multiplications  Iwan Kawrakow
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications. Before this PR, this meant n_head calls to iqk_mul_mat to perform n_kv_embed x n_token 2D multiplications, each using nth threads. Instead, in this PR, if n_head is a multiple of nth, each thread does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices. This improves PP-512 (32 threads) for Bitnet-3B to 433 t/s, up from 409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from 139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
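A rough sketch of this scheduling change (function and parameter names are hypothetical, not the actual iqk_mul_mat interface): when n_head is a multiple of the thread count, each thread takes whole heads instead of all threads cooperating on every head.

```c
// ith: this thread's index, nth: total number of threads.
// mul_mat_2d(h, ith, nth) performs (a slice of) one n_kv_embed x n_token 2D
// multiplication for head h, split across the given thread range.
static void mul_mat_attn(int n_head, int ith, int nth,
                         void (*mul_mat_2d)(int h, int ith, int nth)) {
    if (n_head % nth == 0) {
        // new scheme: each thread processes n_head/nth whole heads single-threadedly
        const int per_thread = n_head / nth;
        for (int h = ith * per_thread; h < (ith + 1) * per_thread; ++h) {
            mul_mat_2d(h, /*ith=*/0, /*nth=*/1);
        }
    } else {
        // old scheme: n_head calls, each split across all nth threads
        for (int h = 0; h < n_head; ++h) {
            mul_mat_2d(h, ith, nth);
        }
    }
}
```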
2024-07-18  iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2  Iwan Kawrakow
I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrix multiplication where Q and K have the shape 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats requires that the row size is a multiple of the SIMD vector size (so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this matrix multiplication was getting done with ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in nearly a 20% performance boost for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance increases by nearly 70%!
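As an illustration of the row-tail handling mentioned above, here is a sketch of an AVX2 float dot product that only requires the row size to be a multiple of 4, finishing the last group of 4 with a __m128 accumulator. This is a simplified stand-in, not the actual iqk_mul_mat kernel.

```c
#include <immintrin.h>

// Dot product of two float rows whose length n is a multiple of 4:
// 8-wide AVX2 chunks for the bulk, one __m128 step for a trailing group of 4.
static float dot_f32_mul4(const float * x, const float * y, int n) {
    __m256 acc8 = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        acc8 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc8);
    }
    // fold the 256-bit accumulator into a 128-bit one
    __m128 acc4 = _mm_add_ps(_mm256_castps256_ps128(acc8), _mm256_extractf128_ps(acc8, 1));
    if (i + 4 <= n) {
        // leftover group of 4 handled with __m128, so n only needs to be a multiple of 4
        acc4 = _mm_fmadd_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i), acc4);
    }
    // horizontal sum of the 4 partial sums
    acc4 = _mm_add_ps(acc4, _mm_movehl_ps(acc4, acc4));
    acc4 = _mm_add_ss(acc4, _mm_movehdup_ps(acc4));
    return _mm_cvtss_f32(acc4);
}
```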
2024-07-17  Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize  Iwan Kawrakow
2024-07-17  iq1bn: faster scalar dot product  Iwan Kawrakow
At the end of the day, lookup is still better when not using SIMD. This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X with 16 threads (up from 10.5 t/s).
2024-07-17  iq1bn: fix scalar dot product  Iwan Kawrakow
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s) but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17  iq1bn: faster AVX2  Iwan Kawrakow
Instead of shuffling quant data into a 128-bit register containing 8-bit ints and then converting to 16 bit, we directly shuffle into a 256-bit register containing 16-bit ints. TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads, getting 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s). I guess we amortize dequantization cost pretty well, so we don't gain much there. We get close to 100 GB/s single-threaded float32 throughput:
./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
  iq1_bn
    vec_dot_q
      4096 values (0.02 MB)
      min cycles/32 vals   : 3.87
      avg cycles/32 vals   : 4.40
      float32 throughput   : 98.27 GB/s
      quantized throughput : 4.99 GB/s
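The shuffle trick can be illustrated roughly like this: a simplified sketch of zero-extending 16 quant bytes into 16-bit lanes with a single in-lane byte shuffle, rather than the exact iq1_bn kernel (the helper name and the assumption that the bytes arrive in a __m128i are mine).

```c
#include <immintrin.h>

// Instead of shuffling within a 128-bit register and then widening with
// _mm256_cvtepu8_epi16(), duplicate the 16 bytes into both 128-bit lanes and let
// one in-lane byte shuffle place them directly into 16-bit slots
// (control bytes with the high bit set produce the zero high halves).
static inline __m256i bytes_to_epi16(__m128i bytes128) {
    const __m256i both  = _mm256_broadcastsi128_si256(bytes128);
    const __m256i shuff = _mm256_setr_epi8(
         0, -128,  1, -128,  2, -128,  3, -128,  4, -128,  5, -128,  6, -128,  7, -128,
         8, -128,  9, -128, 10, -128, 11, -128, 12, -128, 13, -128, 14, -128, 15, -128);
    return _mm256_shuffle_epi8(both, shuff);  // same result as _mm256_cvtepu8_epi16(bytes128)
}
```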
2024-07-17  Remove the no longer used iq1bn_grid_u16  Iwan Kawrakow
2024-07-17  iq1bn: adjust scalar dot product and some cleanup  Iwan Kawrakow
2024-07-17  iq1bn(no lookup): better version  Iwan Kawrakow
We have 4 groups of 16 in a block of 64 quants. For each group of 16 we have 3 groups of 5, each using 8 bits. The remaining 16th quants of the 4 groups of 16 are encoded with 8 bits using the same encoding as the groups of 5. The only kernel where we have complications is the CUDA dequantize kernel (because we are dequantizing 8 quants there, and we have different encoding for the 1st and 2nd group of 8 in a group of 16). This achieves better performance on all tested platforms than any previous 1.625 bpw attempt. We have:
| model | size | params | backend | threads | test | t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | pp512 | 9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | tg128 | 229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | pp512 | 322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | tg128 | 59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 8 | tg128 | 57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 4 | tg128 | 33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 2 | tg128 | 18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | pp512 | 698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | tg128 | 68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | pp512 | 196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | tg128 | 51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 4 | tg128 | 30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 2 | tg128 | 16.89 ± 0.01 |
It is still slower than 2 bpw Bitnet, but the difference now is not as dramatic.
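For reference, a minimal sketch of the kind of base-3 packing this relies on (3^5 = 243 fits in a byte, so 5 ternary quants can share 8 bits). This illustrates the capacity argument only, not the exact iq1_bn bit layout or the multiplication-based decoding used by the kernels.

```c
#include <stdint.h>

// Pack five ternary values (-1, 0, +1) into one byte as a base-3 number.
static uint8_t pack5(const int8_t q[5]) {
    uint8_t v = 0;
    for (int i = 4; i >= 0; --i) {
        v = 3*v + (uint8_t)(q[i] + 1);   // map -1/0/+1 to 0/1/2
    }
    return v;                            // max value: 2*(1+3+9+27+81) = 242
}

// Inverse: recover the five ternary values from the packed byte.
static void unpack5(uint8_t v, int8_t q[5]) {
    for (int i = 0; i < 5; ++i) {
        q[i] = (int8_t)(v % 3) - 1;
        v /= 3;
    }
}
```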
2024-07-16  iq1bn(no lookup): Metal  Iwan Kawrakow
In summary, compared to lookup, the multiplication-based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
2024-07-16  iq1bn(no lookup): NEON attempts  Iwan Kawrakow
We are at TG-128 = 25.7 t/s, which is quite a bit worse than lookup.
2024-07-15  iq1bn(no lookup): NEON  Iwan Kawrakow
Pretty bad.
2024-07-15  iq1bn(no lookup): CUDA  Iwan Kawrakow
Not good. We only get ~160 t/s.
2024-07-15  iq1bn(no lookup): somewhat better  Iwan Kawrakow
We now have for Bitnet-3B:
| threads | test | t/s |
| ------: | ------------: | ---------------: |
| 16 | pp512 | 308.97 ± 1.89 |
| 16 | tg128 | 58.80 ± 0.07 |
| 8 | tg128 | 49.79 ± 1.23 |
| 4 | tg128 | 28.85 ± 0.02 |
| 2 | tg128 | 15.39 ± 0.01 |
2024-07-15  iq1bn: attempt without a lookup table  Iwan Kawrakow
2024-06-27  Remove all workflows  Iwan Kawrakow
2024-06-26  imatrix: be able to specify the name of the output tensor  Iwan Kawrakow
For some models the same tensor is used for token embeddings and output. This tensor tends to be named token_embedding.weight rather than output.weight, which prevents us from collecting imatrix data for it. With this commit we can tell the imatrix tool the name of the output tensor.
2024-06-26  bitnet: fold V scale into rms_norm  Iwan Kawrakow
2024-06-26  RoPE(Neox, Metal): don't use power functions in a loop  Iwan Kawrakow
Speeds up Bitnet by ~2% on Metal.
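A sketch of what "no power functions in a loop" means in practice (illustrative C, not the actual Metal kernel; the function name and signature are mine): compute the per-dimension frequency ratio once and update theta with a single multiply per iteration.

```c
#include <math.h>

// Fill theta[i] = pos * freq_base^(-2*i/n_dims) for the n_dims/2 rotary pairs
// without calling powf() inside the loop.
static void rope_thetas(float * theta, int n_dims, float freq_base, int pos) {
    const float theta_scale = powf(freq_base, -2.0f/n_dims);  // the only powf() call
    float t = (float) pos;
    for (int i = 0; i < n_dims/2; ++i) {
        theta[i] = t;        // equals pos * powf(freq_base, -2.0f*i/n_dims)
        t *= theta_scale;    // one multiply replaces the per-iteration powf()
    }
}
```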
2024-06-25  Typo  Iwan Kawrakow
2024-06-25  bitnet: remove iq1_bn lookup table storing +/- signs  Iwan Kawrakow
The AVX2 implementation was the only one left using it, so I decided to see if we can get a performant implementation using the 0,1,2 lookup table. Turns out we can, and it is even slightly faster than the sign-based table. We now get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads on the Ryzen-7950X. With only one lookup table left for iq1_bn, I renamed it to iq1bn_grid_u16.
2024-06-25  bitnet: simdify q8_K64 quantization on AVX  Iwan Kawrakow
Doesn't make a real difference in performance.
2024-06-25  bitnet: NEON improvements for iq1_bn  Iwan Kawrakow
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25  bitnet: remove the now unused iq1bn_grid_u16  Iwan Kawrakow
2024-06-25  Bitnet: adapt NEON and Metal to the alternative grid  Iwan Kawrakow
2024-06-25  Bitnet: trying an alternative iq1_bn grid  Iwan Kawrakow
Faster on CUDA. The scalar version is faster too. The issue with CUDA is that now I see wild performance fluctuations. Running llama-bench I can get 220 t/s for TG-128 one time and 190 t/s another time, with uncertainties of 1-2 t/s. Same for PP: results jump back and forth between ~9500 t/s and ~8900 t/s. So, basically no reliable measurement at this point, but for sure faster than the previous version, which was at around 170-180 t/s.
2024-06-25  bitnet: fix scalar dot product for 1.625 bpw  Iwan Kawrakow
I had not adjusted it after going to 4 q8 scales per row.
2024-06-25  Bitnet: slightly faster 1.625 bpw variant for AVX512VL  Iwan Kawrakow