2024-06-22  iqk_mul_mat: be independent of llamafile_sgemm (WIP)  [Iwan Kawrakow]
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because it has only been tested on __aarch64__.
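As a loose illustration of what such an interface might look like once the tensor types and strides are passed in directly, here is a sketch of a declaration; the parameter names, ordering, and return type are assumptions, not the actual iqk_mul_mat declaration:

```cpp
// Illustrative sketch only: a mat-mul entry point that receives the ggml tensor
// types and row strides directly, so it no longer depends on llamafile_sgemm to
// supply them. The real signature may differ.
bool iqk_mul_mat(long Nx, long Ny, long ne00,                // problem dimensions
                 int typeA, const void * A, long strideA,    // weights: type + row stride
                 int typeB, const void * B, long strideB,    // activations: type + row stride
                 float * C, long strideC,                     // output rows
                 int ith, int nth);                           // this thread / total threads
```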
2024-06-22  Fix nb4  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: add ability to disable it  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: be able to handle any f16/f32 combination on AVX2  [Iwan Kawrakow]
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22  iqk_mul_mat: turn on AVX512  [Iwan Kawrakow]
It makes no difference on my Ryzen-7950X, but perhaps it will be beneficial for CPUs with real AVX512.
2024-06-22  iqk_mul_mat: slightly better fp16 with 16 vector registers  [Iwan Kawrakow]
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX, up from 138 t/s. We use Nx registers to preload the fp16 weights, so the total number of registers required is Nx * (Ny + 1): 15 in the case of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare register helps. But maybe it is just a matter of how things get loaded into the cache. On the 7950X I did try 3 x 8 and it did not perform as well as 5 x 5.
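Worked out explicitly, the register accounting above comes down to this tiny sketch (names are illustrative, not actual kernel code):

```cpp
// Register budget of an Nx x Ny register-tiled fp16 kernel: Nx vectors hold the
// preloaded fp16 weights and Nx*Ny vectors hold the accumulators.
constexpr int regs_needed(int Nx, int Ny) { return Nx * (Ny + 1); }
static_assert(regs_needed(3, 4) == 15);  // 3 x 4 tiles: 1 spare register out of 16
static_assert(regs_needed(2, 6) == 14);  // 2 x 6 tiles: 2 spare registers out of 16
```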
2024-06-22  iqk_mul_mat: better fp16 for AVX2  [Iwan Kawrakow]
Basically use what I did for Arm. Improves PP performance to 141.7 t/s up from 136 t/s on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling). This is now 10% faster than tinyBLAS. There is a minor improvement also on the Ryzen-5975WX (16 vector registers, so we use 4x3 tiling): we get 138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
2024-06-22  iqk_mul_mat: fp16 for Arm  [Iwan Kawrakow]
~2% slower than tinyBLAS - not sure why.
2024-06-22  iqk_mul_mat: slightly faster FANCY_SIMD dot product  [Iwan Kawrakow]
About 2% faster for q4_K.
2024-06-22  iqk_mul_mat: fix q8_0  [Iwan Kawrakow]
I was happily using _mm256_packs_epi32() to pack the q8_0 x q8_0 dot products back to int16_t, and getting useful results. But theoretically this can overflow, so it is better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine the 4 dot products using int32_t additions. This is (almost) as fast, unlike _mm256_hadd_epi32(), which seems excessively slow on the Ryzen-7950X.
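A rough sketch of the overflow-safe combining step described above (function and variable names are made up, and the real code may arrange the reduction differently):

```cpp
#include <immintrin.h>

// Combine the int32 dot products of four q8_0 x q8_0 blocks using 32-bit
// interleave + add. Unlike _mm256_packs_epi32, nothing is saturated down to
// int16_t, so large block sums cannot overflow.
static inline __m256i combine_block_sums(__m256i s0, __m256i s1, __m256i s2, __m256i s3) {
    __m256i s01 = _mm256_add_epi32(_mm256_unpacklo_epi32(s0, s1),
                                   _mm256_unpackhi_epi32(s0, s1));
    __m256i s23 = _mm256_add_epi32(_mm256_unpacklo_epi32(s2, s3),
                                   _mm256_unpackhi_epi32(s2, s3));
    // each 128-bit lane ends up holding partial sums for blocks 0..3 in order
    return _mm256_add_epi32(_mm256_unpacklo_epi64(s01, s23),
                            _mm256_unpackhi_epi64(s01, s23));
}
```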
2024-06-22  iqk_mul_mat: decouple from llamafile also in cmake  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: make it build with the Makefile  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: use block_q8_1_x4 also for AVX2  [Iwan Kawrakow]
Here the performance gain is more significant. E.g., for q4_1, PP-512 becomes 168 t/s up from 137 t/s. Now the performance gap to q4_0 is so significant that I wonder if I should change to using Q8_1 also for the qX_0 legacy quants.
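The block_q8_1_x4 layout itself is not shown in this log; purely as an illustration, a 4-way bundled activation block could look roughly like the sketch below (the real struct may order or pack its fields differently):

```cpp
#include <stdint.h>

typedef uint16_t ggml_half;   // fp16 storage type, as in ggml
#define QK8_1 32

// Assumed layout, for illustration only: four q8_1 blocks bundled together so
// the kernel can load all four scales (and sums) at once and keep the 4 x 32
// int8 quants contiguous in memory.
typedef struct {
    ggml_half d[4];           // deltas of the four blocks
    ggml_half s[4];           // sums of the four blocks (used for the q*_1 bias term)
    int8_t    qs[4 * QK8_1];  // 4 x 32 quantized values
} block_q8_1_x4;
```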
2024-06-22  iqk_mul_mat: use block_q8_0_x4 also for AVX2  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: delete unused stuff  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: add q8_0  [Iwan Kawrakow]
It was actually ready but not turned on. Having forgotten that, I made a new implementation along the lines of the fp16 implementation (i.e., using tiling). That matched tinyBLAS performance. But the existing implementation that I have now turned on is faster:
PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS
TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)
2024-06-22  iqk_mul_mat: fp16 tweaks  [Iwan Kawrakow]
Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers). This works best for the Ryzen-5975WX.
2024-06-22  iqk_mul_mat: fp16 implementation cleanup  [Iwan Kawrakow]
It turns out that using AVX512 is slower on my Ryzen-7950X CPU.
2024-06-22  iqk_mul_mat: fp16 implementation for AVX2  [Iwan Kawrakow]
This simple implementation beats jart's tinyBLAS by a small margin (143 t/s vs 137 t/s for PP-512; TG is 4.75 t/s, so exactly the same as ggml).
2024-06-22  iqk_mul_mat: multi-thread quantization also for MoE models  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: make it independent of sgemm  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: minor improvements  [Iwan Kawrakow]
Current performance:

| model            |     size | threads |  test |           t/s |
| ---------------- | -------: | ------: | ----: | ------------: |
| llama 7B IQ3_S   | 2.75 GiB |      16 | pp512 | 100.21 ± 0.32 |
| llama 7B IQ3_XXS | 2.41 GiB |      16 | pp512 | 105.25 ± 0.75 |
| llama 7B IQ2_M   | 2.20 GiB |      16 | pp512 | 117.88 ± 0.15 |
| llama 7B IQ2_XS  | 1.89 GiB |      16 | pp512 | 136.38 ± 0.24 |
| llama 7B IQ2_XXS | 1.73 GiB |      16 | pp512 | 128.47 ± 0.39 |

mean: 117.64

| model            |     size | threads |  test |           t/s |
| ---------------- | -------: | ------: | ----: | ------------: |
| llama 7B IQ2_XXS | 1.73 GiB |       8 | tg128 |  23.94 ± 0.04 |
| llama 7B IQ2_XS  | 1.89 GiB |       8 | tg128 |  23.27 ± 0.03 |
| llama 7B IQ2_M   | 2.20 GiB |       8 | tg128 |  18.88 ± 0.03 |
| llama 7B IQ3_XXS | 2.41 GiB |       8 | tg128 |  19.07 ± 0.04 |
| llama 7B IQ3_S   | 2.75 GiB |       8 | tg128 |  15.44 ± 0.05 |

mean: 20.12
2024-06-22  iqk_mul_mat: no more templates in the IQ dequantizers  [Iwan Kawrakow]
Also moved the quant-specific code from the EvenSignHelper into the corresponding dequantizers. These two changes had a tiny performance benefit (much too small compared to what I was expecting/hoping for).
2024-06-22  iqk_mul_mat: remove template on one of the prepare() functions  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: experimenting with zen4  [Iwan Kawrakow]
Nope, we cannot have good performance for iq2_xxs and iq3_xxs at the same time. If I don't force-inline the sign functions, I get better performance for iq2_xxs and bad performance for iq3_xxs. If I force-inline them, it is the other way around. Anyway, this is what we have now on Zen4 for all quants with force-inlined EvenSignHelper methods:

| model            |     size | threads |  test |           t/s |
| ---------------- | -------: | ------: | ----: | ------------: |
| llama 7B IQ3_S   | 2.75 GiB |      16 | pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS | 2.41 GiB |      16 | pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M   | 2.20 GiB |      16 | pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS  | 1.89 GiB |      16 | pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS | 1.73 GiB |      16 | pp512 | 125.53 ± 0.06 |

arithmetic mean: 116.29
geometric mean: 115.70

| model            |     size | threads |  test |           t/s |
| ---------------- | -------: | ------: | ----: | ------------: |
| llama 7B IQ3_S   | 2.75 GiB |       8 | tg128 |  15.69 ± 0.04 |
| llama 7B IQ3_XXS | 2.41 GiB |       8 | tg128 |  18.02 ± 0.04 |
| llama 7B IQ2_M   | 2.20 GiB |       8 | tg128 |  18.94 ± 0.03 |
| llama 7B IQ2_XS  | 1.89 GiB |       8 | tg128 |  23.29 ± 0.02 |
| llama 7B IQ2_XXS | 1.73 GiB |       8 | tg128 |  22.96 ± 0.09 |

arithmetic mean: 19.78
geometric mean: 19.56

Without force-inlining, PP(iq3_xxs) drops to 98 t/s while PP(iq2_xxs) increases to 137 t/s.
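For context, "force inline" here means bypassing the compiler's inlining heuristics with an attribute; a generic macro along these lines does that (the macro name is illustrative, not necessarily the one used in the code):

```cpp
#if defined(__GNUC__) || defined(__clang__)
#define IQK_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define IQK_ALWAYS_INLINE __forceinline
#else
#define IQK_ALWAYS_INLINE inline
#endif
```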
2024-06-22  iqk_mul_mat: experimenting with zen4 (iq2_xxs)  [Iwan Kawrakow]
Observing again the weirdness of a performance drop in one quant because of a change in another quant. After I added FANCY_SIMD implementations for iq3_s, iq2_s and iq2_xs, I'm observing that iq2_xxs PP performance dropped to 130 t/s from 139 t/s. Adding a FANCY_SIMD implementation for applying the signs brings it back to 137 t/s and gives a small boost for TG as well (23.4 vs 23.0 t/s).
2024-06-22  iqk_mul_mat: experimenting with zen4 (iq2_xs)  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m)  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: small improvement for iq3_s  [Iwan Kawrakow]
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @ 4 threads
         14.4 t/s @ 8 threads
         16.3 t/s @ 16 threads
2024-06-22  iqk_mul_mat: better AVX2 implementation for iq2_xxs  [Iwan Kawrakow]
From here on switching to GCC 12. PP-512 is now 139.3 t/s. TG-128 is
13.5 t/s @ 4 threads
23.0 t/s @ 8 threads
25.1 t/s @ 16 threads
2024-06-22  iqk_mul_mat: better AVX2 implementation for iq2_xxs  [Iwan Kawrakow]
2.41X for PP-512 (120.5 t/s). Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s). But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s. Very strange.
2024-06-22  iqk_mul_mat: AVX2 implementation for iq2_xxs  [Iwan Kawrakow]
2.09X for PP-512 (104.7 t/s), worse than mainline for TG. I think it needs more work.
2024-06-22  iqk_mul_mat: AVX2 implementation for iq2_xs  [Iwan Kawrakow]
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK (slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22  iqk_mul_mat: AVX2 implementation for iq2_s  [Iwan Kawrakow]
We get 2.04X for PP-512 (107 t/s). TG again suffers a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22  Separate templates for TG and PP for i-quants on AVX2  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: AVX2 implementation for iq3_xxs  [Iwan Kawrakow]
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use the original implementation in llama.cpp because the template is not able to match the performance of the special-purpose implementation. Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22  iqk_mul_mat: AVX2 implementation for iq3_s  [Iwan Kawrakow]
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use the original implementation in llama.cpp because the template is not able to match the performance of the special-purpose implementation.
2024-06-22  Cleanup - Arm i-quants should be good now  [Iwan Kawrakow]
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22  iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)  [Iwan Kawrakow]
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22  Simplify  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)  [Iwan Kawrakow]
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22  iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)  [Iwan Kawrakow]
We get 2.2X for PP-512 (52 t/s)
2024-06-22  iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)  [Iwan Kawrakow]
We get only 2.07X for PP-512, which brings it up to 31 t/s, so iq2_s remains slow.
2024-06-22  Add Q8_0  [Iwan Kawrakow]
2024-06-22  Cosmetics  [Iwan Kawrakow]
2024-06-22  iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)  [Iwan Kawrakow]
We get a ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22  iqk_mul_mat: faster q3_K TG  [Iwan Kawrakow]
We get 31 t/s up from 26 t/s, but we need to treat PP differently from TG, else we get a ~10% drop in PP performance.
2024-06-22  iqk_mul_mat for llama.cpp  [Iwan Kawrakow]
2024-06-21  JSON Schema to GBNF integration tests (#7790)  [Clint Herron]
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.
* Adding additional examples as documented in #7789. Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.
* Merging improved schema test methods added by @ochafik in #7797
* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.
* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.
* Fixing grammar indentation to be consistent throughout file.
2024-06-21  vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)  [k.h.lai]
* vulkan: detect multiple devices by deviceUUID instead of deviceID
* vulkan: remove unneeded variables
* vulkan: fix id query