author Kawrakow <iwankawrakow@gmail.com> 2025-06-01 15:24:05 +0300
committer GitHub <noreply@github.com> 2025-06-01 15:24:05 +0300
commit 3df1a3a44d69490d074f22aa04ca542f2e72996f
tree b762a4ee4aa4bc8f1eea02a4782d23578555f414 /ggml/src/ggml-kompute.cpp
parent 35374bc7e8de2b221ed4eabe426e05d8b9a7f99b
Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM

  For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

  iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
  iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

  iq2_kt: PP512 = 79 t/s from 42 t/s
  iq3_kt: PP512 = 81 t/s from 35 t/s

  Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after multiplying with the row scale before doing the multiply-adds.

* Experimenting with dequant + f16 GEMM on NEON

  iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
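
For readers unfamiliar with the "dequant + f32 GEMM" idea named in the commit message, the following is a minimal sketch of the general pattern, not the code from this commit. It assumes a hypothetical quantized row type (`QuantRow`: int8 values with a single per-row float scale) in place of the real iq2_kt/iq3_kt/iq4_kt trellis formats, and a naive triple loop in place of an optimized GEMM kernel. The names `QuantRow`, `dequantize_row` and `dequant_gemm_f32` are made up for illustration. The point it illustrates is that each quantized row is dequantized once and the resulting f32 row is reused for every column of the right-hand matrix, which is presumably why the approach pays off for prompt processing, where many tokens are handled at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a quantized weight row: int8 values plus one
// float scale per row. The real trellis quant formats are different.
struct QuantRow {
    float scale;
    std::vector<int8_t> q;
};

// Dequantize one row to f32: out[i] = scale * q[i].
void dequantize_row(const QuantRow& row, float* out) {
    for (std::size_t i = 0; i < row.q.size(); ++i) {
        out[i] = row.scale * static_cast<float>(row.q[i]);
    }
}

// Computes C = A * B^T, where A has m quantized rows of length k and
// B is stored as n contiguous f32 rows of length k.
// C is returned row-major with shape m x n.
std::vector<float> dequant_gemm_f32(const std::vector<QuantRow>& A,
                                    const std::vector<float>& B,
                                    std::size_t n, std::size_t k) {
    const std::size_t m = A.size();
    std::vector<float> C(m * n, 0.0f);
    std::vector<float> tmp(k);  // scratch buffer for one dequantized row

    for (std::size_t i = 0; i < m; ++i) {
        // Dequantize once per row of A ...
        dequantize_row(A[i], tmp.data());
        // ... then reuse the f32 row against all n rows of B.
        for (std::size_t j = 0; j < n; ++j) {
            const float* b = B.data() + j * k;
            float sum = 0.0f;
            for (std::size_t l = 0; l < k; ++l) {
                sum += tmp[l] * b[l];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

In this sketch the dequantization cost is paid m times while the inner products run m * n times, so the larger n is (the more tokens in the batch), the smaller the relative cost of dequantizing.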
Diffstat (limited to 'ggml/src/ggml-kompute.cpp')
0 files changed, 0 insertions, 0 deletions