ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-06-01 15:24:05 +0300
committer	GitHub <noreply@github.com>	2025-06-01 15:24:05 +0300
commit	3df1a3a44d69490d074f22aa04ca542f2e72996f (patch)
tree	b762a4ee4aa4bc8f1eea02a4782d23578555f414 /ggml/src/ggml-kompute.cpp
parent	35374bc7e8de2b221ed4eabe426e05d8b9a7f99b (diff)

Trellis quants: faster CPU prompt processing (#482)

* Experimenting with dequant + f32 GEMM

  For iq4_kt this results in a massive PP improvement
  from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

  iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
  iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

  iq2_kt: PP512 = 79 t/s from 42 t/s
  iq3_kt: PP512 = 81 t/s from 35 t/s

  Also, found the reason why the f16 implementation for iq4_kt was
  not working: it overflows. It works after multiplying with the row
  scale before doing the multiply-adds.

* Experimenting with dequant + f16 GEMM on NEON

  iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
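The "dequant + GEMM" idea from the message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this commit: `ToyQuantRow`, `dequantize_row` and `dequant_gemm` are hypothetical names, and the toy format (one float scale per row plus int8 codes) is not the actual iq2_kt/iq3_kt/iq4_kt trellis layout. Each weight row is dequantized once into a float buffer and then reused across all columns of the activation matrix, which is presumably where the prompt-processing gain comes from. The row scale is applied during dequantization, i.e. before the multiply-adds, mirroring the ordering the message describes for the f16 path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format: one float scale per row plus int8 codes.
// This is NOT the iq2_kt/iq3_kt/iq4_kt trellis layout.
struct ToyQuantRow {
    float scale;
    std::vector<int8_t> q;
};

// Dequantize one row to f32. The row scale is applied here, before any
// multiply-adds, so the values entering the GEMM are already scaled.
static void dequantize_row(const ToyQuantRow & row, float * out) {
    for (std::size_t p = 0; p < row.q.size(); ++p) {
        out[p] = row.scale * static_cast<float>(row.q[p]);
    }
}

// C[m x n] = W[m x k] * X[k x n], all row-major.
// Each quantized row of W is dequantized once and reused for all n columns.
static std::vector<float> dequant_gemm(const std::vector<ToyQuantRow> & W,
                                       const std::vector<float> & X,
                                       std::size_t k, std::size_t n) {
    assert(X.size() == k * n);
    const std::size_t m = W.size();
    std::vector<float> C(m * n, 0.0f);
    std::vector<float> wrow(k);
    for (std::size_t i = 0; i < m; ++i) {
        assert(W[i].q.size() == k);
        dequantize_row(W[i], wrow.data());
        for (std::size_t p = 0; p < k; ++p) {
            const float w = wrow[p];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += w * X[p * n + j];
            }
        }
    }
    return C;
}
```

In an f16 variant of the same loop, the unscaled codes could push intermediate values past the largest finite half-precision value (65504), so folding the row scale in first keeps the operands smaller. The sketch above stays in f32 and only mirrors that ordering.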

Diffstat (limited to 'ggml/src/ggml-kompute.cpp')

0 files changed, 0 insertions, 0 deletions

