author:    Kawrakow <48489457+ikawrakow@users.noreply.github.com>  2024-07-24 19:55:06 +0200
committer: GitHub <noreply@github.com>  2024-07-24 19:55:06 +0200
commit:    28fb349db49d090c9a430076dc454fa8c878c2ec (patch)
tree:      cf477455c355a53552e1429c8b8a79030a2d692d
parent:    eb246cd0ae94a3922f67577af05c822053836480 (diff)

Update README.md

-rw-r--r--  README.md  188
1 file changed, 94 insertions, 94 deletions
@@ -13,7 +13,7 @@ If you are not already familiar with [llama.cpp](https://github.com/ggerganov/ll
 Note that I have published some, but not all, of the code in this repository in a series of [llamafile](https://github.com/Mozilla-Ocho/llamafile) PRs ([394](https://github.com/Mozilla-Ocho/llamafile/pull/394), [405](https://github.com/Mozilla-Ocho/llamafile/pull/405), [428](https://github.com/Mozilla-Ocho/llamafile/pull/428), [435](https://github.com/Mozilla-Ocho/llamafile/pull/435), [453](https://github.com/Mozilla-Ocho/llamafile/pull/453), and [464](https://github.com/Mozilla-Ocho/llamafile/pull/464))
 
-The implementation of matrix multiplications is in a single C++ source file (`iqk_mul_mat.cpp`) with just two interface functions `iqk_mul_mat` (`fp16/fp32` and quantized matrix multiplications) and `iqk_mul_mat_moe` (as `iqk_mul_mat` but meant to be used for the FFN part of a MoE model). Under the hood `iqk_mul_mat_moe` uses the same implementation as `iqk_mul_mat`, with the only difference being where results are stored in memory. Bitnet quantization related stuff is in `iqk-quantize.cpp`.
+The implementation of matrix-matrix and matrix-vector multiplications is in a single C++ source file (`iqk_mul_mat.cpp`) with just two interface functions `iqk_mul_mat` (`fp16/fp32` and quantized matrix multiplications) and `iqk_mul_mat_moe` (as `iqk_mul_mat` but meant to be used for the FFN part of a MoE model). Under the hood `iqk_mul_mat_moe` uses the same implementation as `iqk_mul_mat`, with the only difference being where results are stored in memory. Bitnet quantization related stuff is in `iqk-quantize.cpp`.
 
 ## Why?
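The hunk above says that `iqk_mul_mat_moe` shares its implementation with `iqk_mul_mat` and differs only in where results are stored in memory. A minimal sketch of that pattern follows; all names, signatures, and the row-index mechanism are hypothetical illustrations, not the real `iqk_mul_mat` API (which lives in `iqk_mul_mat.cpp`):

```cpp
// One shared kernel; the only difference between the two entry points
// is where each result row is stored. Names/signatures are hypothetical.
static void mul_mat_impl(int M, int N, int K, const float* A, const float* B,
                         float* C, const int* row_map) {
    for (int i = 0; i < M; ++i) {
        // Dense case: row i of C. MoE-style case: row chosen by an index
        // map, so each expert's result lands at the right output row.
        float* out = C + (row_map ? row_map[i] : i) * (long)N;
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) sum += A[i * K + k] * B[k * N + j];
            out[j] = sum;
        }
    }
}

// Plain matrix multiplication: results stored contiguously.
void my_mul_mat(int M, int N, int K, const float* A, const float* B, float* C) {
    mul_mat_impl(M, N, K, A, B, C, nullptr);
}

// MoE-style variant: same kernel, scattered output rows.
void my_mul_mat_moe(int M, int N, int K, const float* A, const float* B,
                    float* C, const int* row_map) {
    mul_mat_impl(M, N, K, A, B, C, row_map);
}
```

Factoring the output placement out of the kernel is what lets one implementation serve both the dense and the MoE FFN paths.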
@@ -92,99 +92,99 @@ The command line to generate the data was
 ./bin/llama-bench -m $model -p 0 -n 128 -t $num_threads -ngl 0
 ```
-| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat)| Speedup |
-| ------------------------ | ---------: | ---------- | ------: | ---------------: | ---------------: | ------: |
-| 8B F16 | 14.96 GiB | AVX2 | 1 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 |
-| | | | 2 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 |
-| | | | 4 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 |
-| 7B F16 | 12.55 GiB | NEON | 2 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 |
-| | | | 4 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 |
-| | | | 6 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 |
-| 8B Q8_0 | 7.95 GiB | AVX2 | 2 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 |
-| | | | 4 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 |
-| 7B Q8_0 | 6.67 GiB | NEON | 2 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 |
-| | | | 4 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 |
-| | | | 8 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 |
-| 8B Q4_0 | 4.35 GiB | AVX2 | 2 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 |
-| | | | 4 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 |
-| 7B Q4_0 | 3.57 GiB | NEON | 2 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 |
-| | | | 4 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 |
-| | | | 8 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 |
-| 8B Q4_1 | 4.77 GiB | AVX2 | 2 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 |
-| | | | 4 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 |
-| 7B Q4_1 | 3.95 GiB | NEON | 2 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 |
-| | | | 4 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 |
-| | | | 8 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 |
-| 8B Q5_0 | 5.22 GiB | AVX2 | 2 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 |
-| | | | 4 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 |
-| 7B Q5_0 | 4.34 GiB | NEON | 2 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 |
-| | | | 4 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 |
-| | | | 8 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 |
-| 8B Q5_1 | 5.64 GiB | AVX2 | 2 | 4.52 ± 0.00 | 8.86 ± 0.00 | 1.960 |
-| | | | 4 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 |
-| 7B Q5_1 | 4.72 GiB | NEON | 2 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 |
-| | | | 4 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 |
-| | | | 8 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 |
-| 8B Q2_K - Small | 2.78 GiB | AVX2 | 2 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 |
-| | | | 4 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 |
-| 7B Q2_K - Small | 2.16 GiB | NEON | 2 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 |
-| | | | 4 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 |
-| | | | 8 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 |
-| 8B Q3_K - Small | 3.41 GiB | AVX2 | 2 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 |
-| | | | 4 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 |
-| 7B Q3_K - Small | 2.75 GiB | NEON | 2 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 |
-| | | | 4 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 |
-| | | | 8 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 |
-| 8B Q4_K - Small | 4.36 GiB | AVX2 | 2 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 |
-| | | | 4 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 |
-| 7B Q4_K - Small | 3.59 GiB | NEON | 2 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 |
-| | | | 4 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 |
-| | | | 8 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 |
-| 8B Q5_K - Small | 5.21 GiB | AVX2 | 2 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 |
-| | | | 4 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 |
-| 7B Q5_K - Small | 4.33 GiB | NEON | 2 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 |
-| | | | 4 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 |
-| | | | 8 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 |
-| 8B Q6_K | 6.14 GiB | AVX2 | 2 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 |
-| | | | 4 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 |
-| 7B Q6_K | 5.15 GiB | NEON | 2 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 |
-| | | | 4 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 |
-| | | | 8 | 18.52 ± 0.07 | 20.67 ± 0.08 | 1.116 |
-| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | AVX2 | 2 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 |
-| | | | 4 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 |
-| 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | NEON | 2 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 |
-| | | | 4 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 |
-| | | | 8 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 |
-| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | AVX2 | 2 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 |
-| | | | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 |
-| 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | NEON | 2 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 |
-| | | | 4 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 |
-| | | | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 |
-| 8B IQ2_M - 2.7 bpw | 2.74 GiB | AVX2 | 2 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 |
-| | | | 4 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 |
-| 7B IQ2_M - 2.7 bpw | 2.20 GiB | NEON | 2 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 |
-| | | | 4 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 |
-| | | | 8 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 |
-| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | AVX2 | 2 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 |
-| | | | 4 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 |
-| 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | NEON | 2 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 |
-| | | | 4 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 |
-| | | | 8 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 |
-| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | AVX2 | 2 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 |
-| | | | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 |
-| 7B IQ3_S - 3.4375 bpw | 2.75 GiB | NEON | 2 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 |
-| | | | 4 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 |
-| | | | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 |
-| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | AVX2 | 2 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 |
-| | | | 4 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 |
-| 7B IQ4_XS - 4.25 bpw | 3.37 GiB | NEON | 2 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 |
-| | | | 4 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 |
-| | | | 8 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 |
-| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | AVX2 | 2 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 |
-| | | | 4 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 |
-| 7B IQ4_NL - 4.5 bpw | 3.56 GiB | NEON | 2 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 |
-| | | | 4 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 |
-| | | | 8 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 |
+| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat)| Speedup |
+| ---------- | ---------: | ---------- | ------: | ---------------: | ---------------: | ------: |
+| 8B F16 | 14.96 GiB | AVX2 | 1 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 |
+| | | | 2 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 |
+| | | | 4 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 |
+| 7B F16 | 12.55 GiB | NEON | 2 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 |
+| | | | 4 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 |
+| | | | 6 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 |
+| 8B Q8_0 | 7.95 GiB | AVX2 | 2 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 |
+| | | | 4 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 |
+| 7B Q8_0 | 6.67 GiB | NEON | 2 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 |
+| | | | 4 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 |
+| | | | 8 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 |
+| 8B Q4_0 | 4.35 GiB | AVX2 | 2 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 |
+| | | | 4 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 |
+| 7B Q4_0 | 3.57 GiB | NEON | 2 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 |
+| | | | 4 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 |
+| | | | 8 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 |
+| 8B Q4_1 | 4.77 GiB | AVX2 | 2 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 |
+| | | | 4 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 |
+| 7B Q4_1 | 3.95 GiB | NEON | 2 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 |
+| | | | 4 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 |
+| | | | 8 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 |
+| 8B Q5_0 | 5.22 GiB | AVX2 | 2 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 |
+| | | | 4 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 |
+| 7B Q5_0 | 4.34 GiB | NEON | 2 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 |
+| | | | 4 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 |
+| | | | 8 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 |
+| 8B Q5_1 | 5.64 GiB | AVX2 | 2 | 4.52 ± 0.00 | 8.86 ± 0.00 | 1.960 |
+| | | | 4 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 |
+| 7B Q5_1 | 4.72 GiB | NEON | 2 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 |
+| | | | 4 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 |
+| | | | 8 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 |
+| 8B Q2_K_S | 2.78 GiB | AVX2 | 2 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 |
+| | | | 4 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 |
+| 7B Q2_K_S | 2.16 GiB | NEON | 2 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 |
+| | | | 4 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 |
+| | | | 8 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 |
+| 8B Q3_K_S | 3.41 GiB | AVX2 | 2 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 |
+| | | | 4 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 |
+| 7B Q3_K_S | 2.75 GiB | NEON | 2 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 |
+| | | | 4 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 |
+| | | | 8 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 |
+| 8B Q4_K_S | 4.36 GiB | AVX2 | 2 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 |
+| | | | 4 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 |
+| 7B Q4_K_S | 3.59 GiB | NEON | 2 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 |
+| | | | 4 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 |
+| | | | 8 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 |
+| 8B Q5_K_S | 5.21 GiB | AVX2 | 2 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 |
+| | | | 4 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 |
+| 7B Q5_K_S | 4.33 GiB | NEON | 2 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 |
+| | | | 4 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 |
+| | | | 8 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 |
+| 8B Q6_K | 6.14 GiB | AVX2 | 2 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 |
+| | | | 4 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 |
+| 7B Q6_K | 5.15 GiB | NEON | 2 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 |
+| | | | 4 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 |
+| | | | 8 | 18.52 ± 0.07 | 20.67 ± 0.08 | 1.116 |
+| 8B IQ2_XXS | 2.23 GiB | AVX2 | 2 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 |
+| | | | 4 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 |
+| 7B IQ2_XXS | 1.73 GiB | NEON | 2 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 |
+| | | | 4 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 |
+| | | | 8 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 |
+| 8B IQ2_XS | 2.42 GiB | AVX2 | 2 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 |
+| | | | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 |
+| 7B IQ2_XS | 1.89 GiB | NEON | 2 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 |
+| | | | 4 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 |
+| | | | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 |
+| 8B IQ2_M | 2.74 GiB | AVX2 | 2 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 |
+| | | | 4 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 |
+| 7B IQ2_M | 2.20 GiB | NEON | 2 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 |
+| | | | 4 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 |
+| | | | 8 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 |
+| 8B IQ3_XXS | 3.04 GiB | AVX2 | 2 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 |
+| | | | 4 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 |
+| 7B IQ3_XXS | 2.41 GiB | NEON | 2 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 |
+| | | | 4 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 |
+| | | | 8 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 |
+| 8B IQ3_S | 3.42 GiB | AVX2 | 2 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 |
+| | | | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 |
+| 7B IQ3_S | 2.75 GiB | NEON | 2 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 |
+| | | | 4 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 |
+| | | | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 |
+| 8B IQ4_XS | 4.13 GiB | AVX2 | 2 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 |
+| | | | 4 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 |
+| 7B IQ4_XS | 3.37 GiB | NEON | 2 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 |
+| | | | 4 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 |
+| | | | 8 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 |
+| 8B IQ4_NL | 4.35 GiB | AVX2 | 2 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 |
+| | | | 4 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 |
+| 7B IQ4_NL | 3.56 GiB | NEON | 2 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 |
+| | | | 4 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 |
+| | | | 8 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 |
 
 Here gains are generally lower compared to PP due to TG performance being limited by memory bandwidth. Nevertheless, for some quants/architectures/threads the speedup is quite remarkable (e.g., almost a factor of 2 for `Q5_1` on `AVX2` with 2 threads).
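The Speedup column in the table above is simply the ratio of the two throughput measurements, rounded to three decimals. A one-line sketch (the function name is illustrative, not part of any real tool):

```cpp
#include <cmath>

// Speedup = t/s with iqk_mul_mat divided by t/s with mainline llama.cpp,
// rounded to three decimals as displayed in the benchmark table.
double speedup(double tps_llamacpp, double tps_iqk) {
    return std::round(tps_iqk / tps_llamacpp * 1000.0) / 1000.0;
}
```

For example, the `Q5_1`/`AVX2`/2-thread row called out in the closing sentence gives `speedup(4.52, 8.86) = 1.960`, i.e. almost a factor of 2.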