-rw-r--r-- | README.md | 161 |
1 file changed, 139 insertions(+), 22 deletions(-)
@@ -47,28 +47,145 @@ The results in the following tables are obtained with these parameters: Here I set the number of threads to be equal to the number of (performance) cores of the CPU, so 16 threads for the Ryzen-7950X and 8 threads for the M2-Max. The following table summarizes the results. To not make the table too long, I have listed only quantized models containing predominantly one quantization type (i.e., excluded the `QX_K - Medium` quants, which are typically a mix of `QX_K` and `Q(X+1)_K`, as well as `IQ2_S` and `IQ3_XS`). -| Quantization | size | backend | threads | test | t/s (llama.cpp) | t/s (iqk_mul_mat)| Speedup | -| --------------------- | ---------: | ---------- | ------: | ------------: | ---------------: | ---------------: | ------: | -| F16 | 14.96 GiB | AVX2 | 16 | pp512 | 112.37 ± 0.40 | 131.27 ± 0.38 | 1.168 | -| Q8_0 | 7.95 GiB | AVX2 | 16 | pp512 | 118.07 ± 0.53 | 134.00 ± 0.47 | 1.135 | -| Q4_0 | 4.35 GiB | AVX2 | 16 | pp512 | 104.46 ± 0.33 | 130.20 ± 0.29 | 1.246 | -| Q4_1 | 4.77 GiB | AVX2 | 16 | pp512 | 57.83 ± 0.24 | 160.69 ± 0.49 | 2.779 | -| Q5_0 | 5.22 GiB | AVX2 | 16 | pp512 | 53.50 ± 0.35 | 122.62 ± 0.48 | 2.292 | -| Q5_1 | 5.64 GiB | AVX2 | 16 | pp512 | 50.85 ± 0.36 | 147.15 ± 0.47 | 2.894 | -| Q2_K - Small | 2.78 GiB | AVX2 | 16 | pp512 | 110.11 ± 0.28 | 192.47 ± 1.35 | 1.748 | -| Q3_K - Small | 3.41 GiB | AVX2 | 16 | pp512 | 77.42 ± 0.36 | 181.64 ± 0.44 | 2.346 | -| Q4_K - Small | 4.36 GiB | AVX2 | 16 | pp512 | 98.92 ± 0.34 | 185.35 ± 0.39 | 1.874 | -| Q5_K - Small | 5.21 GiB | AVX2 | 16 | pp512 | 69.44 ± 0.31 | 179.62 ± 0.69 | 2.587 | -| Q6_K | 6.14 GiB | AVX2 | 16 | pp512 | 74.89 ± 0.26 | 181.86 ± 0.55 | 2.428 | -| IQ2_XXS - 2.0625 bpw | 2.23 GiB | AVX2 | 16 | pp512 | 42.57 ± 0.16 | 126.63 ± 0.55 | 2.975 | -| IQ2_XS - 2.3125 bpw | 2.42 GiB | AVX2 | 16 | pp512 | 46.45 ± 0.27 | 125.46 ± 0.43 | 2.701 | -| IQ2_M - 2.7 bpw | 2.74 GiB | AVX2 | 16 | pp512 | 40.76 ± 0.18 | 113.07 ± 0.48 | 2.774 | -| IQ3_XXS - 3.0625 bpw | 3.04 GiB 
| AVX2 | 16 | pp512 | 31.95 ± 0.20 | 109.86 ± 0.45 | 3.438 | -| IQ3_S - 3.4375 bpw | 3.42 GiB | AVX2 | 16 | pp512 | 28.04 ± 0.08 | 96.28 ± 0.45 | 3.434 | -| IQ4_XS - 4.25 bpw | 4.13 GiB | AVX2 | 16 | pp512 | 68.98 ± 0.31 | 180.34 ± 0.55 | 2.614 | -| IQ4_NL - 4.5 bpw | 4.35 GiB | AVX2 | 16 | pp512 | 59.94 ± 0.21 | 129.06 ± 0.43 | 2.153 | - -We see that `llama.cpp` achieves respectable performance for `fp16`, `Q8_0`, and `Q4_0`, being only up to 20% slower than this implementation. This is thanks to the use of Justine Tunney's `tinyBLAS`, which is utilized for these quantization types. For all other quants we observe performance gains in the `1.75X - 3.5X` range, which is not a small feat considering that the `ggml` matrix multiplication functions has been rewritten several times since `llama.cpp` was first published. +| Quantization | size | backend | threads | test | t/s (llama.cpp) | t/s (iqk_mul_mat)| Speedup | +| ------------------------ | ---------: | ---------- | ------: | ------------: | ---------------: | ---------------: | ------: | +| 8B F16 | 14.96 GiB | AVX2 | 16 | pp512 | 112.37 ± 0.40 | 131.27 ± 0.38 | 1.168 | +| 7B F16 | 12.55 GiB | NEON | 8 | pp512 | 90.28 ± 1.25 | 95.34 ± 0.15 | 1.056 | +| 8B Q8_0 | 7.95 GiB | AVX2 | 16 | pp512 | 118.07 ± 0.53 | 134.00 ± 0.47 | 1.135 | +| 7B Q8_0 | 6.67 GiB | NEON | 8 | pp512 | 77.25 ± 1.81 | 94.14 ± 1.15 | 1.219 | +| 8B Q4_0 | 4.35 GiB | AVX2 | 16 | pp512 | 104.46 ± 0.33 | 130.20 ± 0.29 | 1.246 | +| 7B Q4_0 | 3.57 GiB | NEON | 8 | pp512 | 65.46 ± 0.79 | 76.22 ± 0.71 | 1.164 | +| 8B Q4_1 | 4.77 GiB | AVX2 | 16 | pp512 | 57.83 ± 0.24 | 160.69 ± 0.49 | 2.779 | +| 7B Q4_1 | 3.95 GiB | NEON | 8 | pp512 | 37.40 ± 0.50 | 65.83 ± 0.98 | 1.760 | +| 8B Q5_0 | 5.22 GiB | AVX2 | 16 | pp512 | 53.50 ± 0.35 | 122.62 ± 0.48 | 2.292 | +| 7B Q5_0 | 4.34 GiB | NEON | 8 | pp512 | 29.31 ± 0.51 | 67.51 ± 1.17 | 2.303 | +| 8B Q5_1 | 5.64 GiB | AVX2 | 16 | pp512 | 50.85 ± 0.36 | 147.15 ± 0.47 | 2.894 | +| 7B Q5_1 | 4.72 GiB | NEON | 8 | 
pp512 | 26.02 ± 0.37 | 58.49 ± 0.85 | 2.248 | +| 8B Q2_K - Small | 2.78 GiB | AVX2 | 16 | pp512 | 110.11 ± 0.28 | 192.47 ± 1.35 | 1.748 | +| 7B Q2_K - Small | 2.16 GiB | NEON | 8 | pp512 | 35.44 ± 0.06 | 77.93 ± 1.64 | 2.199 | +| 8B Q3_K - Small | 3.41 GiB | AVX2 | 16 | pp512 | 77.42 ± 0.36 | 181.64 ± 0.44 | 2.346 | +| 7B Q3_K - Small | 2.75 GiB | NEON | 8 | pp512 | 26.79 ± 0.03 | 59.38 ± 1.08 | 2.216 | +| 8B Q4_K - Small | 4.36 GiB | AVX2 | 16 | pp512 | 98.92 ± 0.34 | 185.35 ± 0.39 | 1.874 | +| 7B Q4_K - Small | 3.59 GiB | NEON | 8 | pp512 | 46.55 ± 0.67 | 76.31 ± 0.38 | 1.639 | +| 8B Q5_K - Small | 5.21 GiB | AVX2 | 16 | pp512 | 69.44 ± 0.31 | 179.62 ± 0.69 | 2.587 | +| 7B Q5_K - Small | 4.33 GiB | NEON | 8 | pp512 | 30.18 ± 0.23 | 65.34 ± 0.79 | 2.165 | +| 8B Q6_K | 6.14 GiB | AVX2 | 16 | pp512 | 74.89 ± 0.26 | 181.86 ± 0.55 | 2.428 | +| 7B Q6_K | 5.15 GiB | NEON | 8 | pp512 | 28.12 ± 1.24 | 60.75 ± 1.15 | 2.160 | +| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | AVX2 | 16 | pp512 | 42.57 ± 0.16 | 126.63 ± 0.55 | 2.975 | +| 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | NEON | 8 | pp512 | 20.87 ± 0.20 | 64.29 ± 1.12 | 3.080 | +| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | AVX2 | 16 | pp512 | 46.45 ± 0.27 | 125.46 ± 0.43 | 2.701 | +| 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | NEON | 8 | pp512 | 22.77 ± 0.21 | 51.15 ± 0.24 | 2.246 | +| 8B IQ2_M - 2.7 bpw | 2.74 GiB | AVX2 | 16 | pp512 | 40.76 ± 0.18 | 113.07 ± 0.48 | 2.774 | +| 7B IQ2_M - 2.7 bpw | 2.20 GiB | NEON | 8 | pp512 | 14.95 ± 0.26 | 44.87 ± 0.50 | 3.001 | +| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | AVX2 | 16 | pp512 | 31.95 ± 0.20 | 109.86 ± 0.45 | 3.438 | +| 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | NEON | 8 | pp512 | 14.40 ± 0.10 | 53.58 ± 0.85 | 3.721 | +| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | AVX2 | 16 | pp512 | 28.04 ± 0.08 | 96.28 ± 0.45 | 3.434 | +| 7B IQ3_S - 3.4375 bpw | 2.75 GiB | NEON | 8 | pp512 | 12.08 ± 0.30 | 49.72 ± 0.06 | 4.116 | +| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | AVX2 | 16 | pp512 | 68.98 ± 0.31 | 180.34 ± 0.55 | 2.614 | +| 7B 
IQ4_XS - 4.25 bpw | 3.37 GiB | NEON | 8 | pp512 | 40.67 ± 1.97 | 75.11 ± 1.97 | 1.847 | +| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | AVX2 | 16 | pp512 | 59.94 ± 0.21 | 129.06 ± 0.43 | 2.153 | +| 7B IQ4_NL - 4.5 bpw | 3.56 GiB | NEON | 8 | pp512 | 34.36 ± 0.81 | 76.02 ± 1.36 | 2.212 | + +We see that `llama.cpp` achieves respectable performance for `fp16`, `Q8_0`, and `Q4_0`, being only up to 25% slower than this implementation. This is thanks to the use of Justine Tunney's `tinyBLAS`, which is utilized for these quantization types. For all other quants we observe performance gains in the `1.75X - 4X` range, which is no small feat considering that the `ggml` matrix multiplication functions have been rewritten several times since `llama.cpp` was first published. Performance gains are larger for i-quants due to the higher quant unpacking cost (see discussion "To tile or not to tile"). + +### Token generation + +On the Ryzen-7950X TG is memory bound. For many quantization types peak performance is achieved at just 4 threads. Hence, only results for up to 4 threads are shown for `AVX2`. The M2-Max has a much more capable memory subsystem, and as a result performance keeps increasing up to 8 threads. Thus, results are given for up to 8 threads for `ARM_NEON`. 
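The claim that TG is memory bound can be sanity-checked with a back-of-the-envelope estimate: generating one token streams the full set of model weights from RAM once, so the tokens-per-second ceiling is memory bandwidth divided by model size. A minimal sketch, where the ~80 GB/s figure is an assumption for illustration (roughly the theoretical peak of dual-channel DDR5-5200 on the Ryzen-7950X), not a measurement:

```python
# Memory-bound ceiling for token generation:
#   t/s <= memory_bandwidth / model_size
# because every generated token reads all model weights from RAM once.

def tg_upper_bound(model_size_gib: float, bandwidth_gb_s: float) -> float:
    """Token-generation ceiling if weight streaming were the only cost."""
    bytes_per_token = model_size_gib * 1024**3  # full model read per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

# 8B Q4_K-Small weighs 4.36 GiB; assume ~80 GB/s (illustrative, not measured).
print(f"ceiling: {tg_upper_bound(4.36, 80.0):.1f} t/s")  # ~17 t/s
```

Under this (assumed) bandwidth, the ~17 t/s ceiling for the 4.36 GiB Q4_K-Small model sits in the same ballpark as the ~13 t/s measured at 4 threads in the table below, and explains why adding threads beyond that point stops helping once bandwidth is saturated.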
+ +| Quantization | size | backend | threads | test | t/s (llama.cpp) | t/s (iqk_mul_mat)| Speedup | +| ------------------------ | ---------: | ---------- | ------: | ------------: | ---------------: | ---------------: | ------: | +| 8B F16 | 14.96 GiB | CPU | 1 | tg128 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 | +| | | | 2 | tg128 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 | +| | | | 4 | tg128 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 | +| 7B F16 | 12.55 GiB | NEON | 2 | tg128 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 | +| | | | 4 | tg128 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 | +| | | | 6 | tg128 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 | +| 8B Q8_0 | 7.95 GiB | CPU | 2 | tg128 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 | +| | | | 4 | tg128 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 | +| 7B Q8_0 | 6.67 GiB | NEON | 2 | tg128 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 | +| | | | 4 | tg128 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 | +| | | | 8 | tg128 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 | +| 8B Q4_0 | 4.35 GiB | CPU | 2 | tg128 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 | +| | | | 4 | tg128 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 | +| 7B Q4_0 | 3.57 GiB | NEON | 2 | tg128 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 | +| | | | 4 | tg128 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 | +| | | | 8 | tg128 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 | +| 8B Q4_1 | 4.77 GiB | CPU | 2 | tg128 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 | +| | | | 4 | tg128 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 | +| 7B Q4_1 | 3.95 GiB | NEON | 2 | tg128 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 | +| | | | 4 | tg128 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 | +| | | | 8 | tg128 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 | +| 8B Q5_0 | 5.22 GiB | CPU | 2 | tg128 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 | +| | | | 4 | tg128 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 | +| 7B Q5_0 | 4.34 GiB | NEON | 2 | tg128 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 | +| | | | 4 | tg128 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 | +| | | | 8 | tg128 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 | +| 8B Q5_1 | 5.64 GiB | CPU | 2 | tg128 | 4.52 
± 0.00 | 8.86 ± 0.00 | 1.960 | +| | | | 4 | tg128 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 | +| 7B Q5_1 | 4.72 GiB | NEON | 2 | tg128 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 | +| | | | 4 | tg128 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 | +| | | | 8 | tg128 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 | +| 8B Q2_K - Small | 2.78 GiB | CPU | 2 | tg128 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 | +| | | | 4 | tg128 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 | +| 7B Q2_K - Small | 2.16 GiB | NEON | 2 | tg128 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 | +| | | | 4 | tg128 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 | +| | | | 8 | tg128 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 | +| 8B Q3_K - Small | 3.41 GiB | CPU | 2 | tg128 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 | +| | | | 4 | tg128 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 | +| 7B Q3_K - Small | 2.75 GiB | NEON | 2 | tg128 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 | +| | | | 4 | tg128 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 | +| | | | 8 | tg128 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 | +| 8B Q4_K - Small | 4.36 GiB | CPU | 2 | tg128 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 | +| | | | 4 | tg128 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 | +| 7B Q4_K - Small | 3.59 GiB | NEON | 2 | tg128 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 | +| | | | 4 | tg128 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 | +| | | | 8 | tg128 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 | +| 8B Q5_K - Small | 5.21 GiB | CPU | 2 | tg128 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 | +| | | | 4 | tg128 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 | +| 7B Q5_K - Small | 4.33 GiB | NEON | 2 | tg128 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 | +| | | | 4 | tg128 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 | +| | | | 8 | tg128 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 | +| 8B Q6_K | 6.14 GiB | CPU | 2 | tg128 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 | +| | | | 4 | tg128 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 | +| 7B Q6_K | 5.15 GiB | NEON | 2 | tg128 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 | +| | | | 4 | tg128 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 | +| | | | 8 | tg128 | 18.52 ± 0.07 
| 20.67 ± 0.08 | 1.116 | +| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | CPU | 2 | tg128 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 | +| | | | 4 | tg128 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 | +| 7B IQ2_XXS - 2.0625 bpw | 1.73 GiB | NEON | 2 | tg128 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 | +| | | | 4 | tg128 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 | +| | | | 8 | tg128 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 | +| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | CPU | 2 | tg128 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 | +| | | | 4 | tg128 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 | +| 7B IQ2_XS - 2.3125 bpw | 1.89 GiB | NEON | 2 | tg128 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 | +| | | | 4 | tg128 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 | +| | | | 8 | tg128 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 | +| 8B IQ2_M - 2.7 bpw | 2.74 GiB | CPU | 2 | tg128 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 | +| | | | 4 | tg128 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 | +| 7B IQ2_M - 2.7 bpw | 2.20 GiB | NEON | 2 | tg128 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 | +| | | | 4 | tg128 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 | +| | | | 8 | tg128 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 | +| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | CPU | 2 | tg128 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 | +| | | | 4 | tg128 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 | +| 7B IQ3_XXS - 3.0625 bpw | 2.41 GiB | NEON | 2 | tg128 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 | +| | | | 4 | tg128 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 | +| | | | 8 | tg128 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 | +| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | CPU | 2 | tg128 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 | +| | | | 4 | tg128 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 | +| 7B IQ3_S - 3.4375 bpw | 2.75 GiB | NEON | 2 | tg128 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 | +| | | | 4 | tg128 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 | +| | | | 8 | tg128 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 | +| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | CPU | 2 | tg128 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 | +| | | | 4 | tg128 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 | +| 7B IQ4_XS - 4.25 bpw 
| 3.37 GiB | NEON | 2 | tg128 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 | +| | | | 4 | tg128 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 | +| | | | 8 | tg128 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 | +| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | CPU | 2 | tg128 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 | +| | | | 4 | tg128 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 | +| 7B IQ4_NL - 4.5 bpw | 3.56 GiB | NEON | 2 | tg128 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 | +| | | | 4 | tg128 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 | +| | | | 8 | tg128 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 | + ## MoE models |