diff options
author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-08-08 11:17:42 +0300 |
---|---|---|
committer | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-08-09 16:00:31 +0200 |
commit | 849476acc79af52998316e421baa9befad3b8eb3 (patch) | |
tree | ed74783bec895d8ef20abc3ee2f9b0bee70366e7 /ggml/include/ggml-cuda.h | |
parent | 050bdfa101be5b78c2dc2286bad915e2eae21645 (diff) |
iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.
And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
them with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice it with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants.
For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
Diffstat (limited to 'ggml/include/ggml-cuda.h')
0 files changed, 0 insertions, 0 deletions