diff options
author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-08-08 11:17:42 +0300 |
---|---|---|
committer | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-08-09 16:00:31 +0200 |
commit | 849476acc79af52998316e421baa9befad3b8eb3 (patch) | |
tree | ed74783bec895d8ef20abc3ee2f9b0bee70366e7 /ggml/include/ggml-cuda.h | |
parent | 050bdfa101be5b78c2dc2286bad915e2eae21645 (diff) |
iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.
And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
them with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice it with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants.
For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
Diffstat (limited to 'ggml/include/ggml-cuda.h')
0 files changed, 0 insertions, 0 deletions