ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-06-24 13:05:01 +0200
committer	GitHub <noreply@github.com>	2025-06-24 13:05:01 +0200
commit	64f6c2dead0768049837ac6562c0c176fabc055e (patch)
tree	238cc3bf6201d7089703fb2f339d827c7c24023c /ggml/src/ggml-blas.cpp
parent	ddda4d9e64fa889389b784f28da6453f14137452 (diff)

Much faster prompt processing for k-quants (ARM_NEON) (#552)

* iq2_xxs 55.8 -> 167.5 t/s. iq2_xxs is at 93.7 t/s
* iq2_xs 46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s 42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs 51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s 46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k 85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K 45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k 47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k 58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
  As I had to add a new implementation for q8_1-quantized activations, TG became slightly faster too (25.1 -> 25.9 t/s).
* q5_k 54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs 71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'ggml/src/ggml-blas.cpp')

0 files changed, 0 insertions, 0 deletions

