author | Kawrakow <iwankawrakow@gmail.com> | 2025-06-01 15:24:05 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-06-01 15:24:05 +0300 |
commit | 3df1a3a44d69490d074f22aa04ca542f2e72996f (patch) | |
tree | b762a4ee4aa4bc8f1eea02a4782d23578555f414 /ggml/src/ggml-impl.h | |
parent | 35374bc7e8de2b221ed4eabe426e05d8b9a7f99b (diff) |
Trellis quants: faster CPU prompt processing (#482)
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds.
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
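
The "dequant + f32 GEMM" idea from the message above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `dequantize_row` is a hypothetical stand-in for the real trellis decoding (which the message does not describe), and the integer codes, per-row scales, and helper names are all assumptions. The point it shows is that each weight row is decoded once and then reused for every token in the prompt batch through a plain float GEMM.

```python
from typing import List

Matrix = List[List[float]]


def dequantize_row(codes: List[int], row_scale: float) -> List[float]:
    """Toy stand-in for trellis decoding (assumption): code * row scale."""
    return [row_scale * c for c in codes]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Plain float GEMM: a is (m x k), b is (n x k); returns a @ b^T (m x n)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def matmul_dequant_then_gemm(q_rows: List[List[int]],
                             scales: List[float],
                             activations: Matrix) -> Matrix:
    # Decode every weight row exactly once, independent of the batch size.
    weights = [dequantize_row(codes, s) for codes, s in zip(q_rows, scales)]
    # All tokens in the batch then share the decoded weights.
    return gemm(activations, weights)


# Hypothetical data: two quantized weight rows and a batch of two tokens.
q_rows = [[1, -2, 3, 0], [0, 4, -1, 2]]
scales = [0.5, 0.25]
activations = [[1.0, 1.0, 1.0, 1.0],
               [2.0, 0.0, -1.0, 1.0]]

result = matmul_dequant_then_gemm(q_rows, scales, activations)
print(result)
```

With a batch of 512 tokens (as in PP512), the decoding cost in this scheme is paid once per weight row rather than once per row per token, which is one plausible reading of why prompt processing benefits.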
Diffstat (limited to 'ggml/src/ggml-impl.h')
0 files changed, 0 insertions, 0 deletions
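
The f16 overflow mentioned in the commit message can be reproduced in a small model. This is a sketch under stated assumptions, not the NEON kernel: the values, the row scale, and the helper names `to_f16` and `dot_f16` are hypothetical, and half precision is emulated with the standard `struct` format `"e"`. It only illustrates why applying the row scale before the multiply-adds can keep intermediate results below the largest finite f16 value (65504).

```python
import math
import struct
from typing import List


def to_f16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses finite values outside the f16 range; model them as inf.
        return math.copysign(math.inf, x)


def dot_f16(weights: List[float], activations: List[float]) -> float:
    """Multiply-add where every intermediate result is rounded to f16."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = to_f16(acc + to_f16(to_f16(w) * to_f16(a)))
    return acc


# Hypothetical numbers: large unscaled dequantized values and a small row scale.
raw = [1024.0] * 128
acts = [1.0] * 128
row_scale = 1.0 / 1024.0

# Scale applied after accumulation: the running sum exceeds 65504 and becomes inf.
late = to_f16(dot_f16(raw, acts) * row_scale)

# Scale applied before the multiply-adds: every intermediate stays in range.
early = dot_f16([w * row_scale for w in raw], acts)

print(late, early)
```

In this model the late-scaled sum is infinite while the early-scaled sum is exact. Whether the real kernel overflowed in the products or in the accumulation is not stated in the message.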