author     Kawrakow <48489457+ikawrakow@users.noreply.github.com>    2024-09-12 19:03:20 +0300
committer  GitHub <noreply@github.com>                               2024-09-12 19:03:20 +0300
commit     5017f8b3f04790de686c04e4ab23443c5ca02345
tree       7d5776ebefbcdb351e4f34d649ab572b8b9bf119 /src
parent     c920195edd80ab24beb9a0fd3e2f4df582e735d0
Quantized Flash Attention for all supported CPU platforms (#51)
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1
* NEON Flash Attention: quantized K*Q for q4_0
I could finally take advantage of the matrix multiplication
templates, and that gives quite a bit of speedup for q4_0:
for Gemma-2b, using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
results in PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 K-cache and Q to fp16 and using fp16 multiplication.
(A scalar sketch of the quantized K*Q idea follows the list below.)
* NEON Flash Attention: quantized K*Q for q4_1
* NEON Flash Attention: quantized K*Q for q8_0
This makes quite a bit of difference:
for Gemma2-2b, PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting to fp16 and using fp16
matrix multiplication.
With PP-512 = 307 t/s, PP-8192 now runs at ~75% of the
PP-512 performance. In contrast, llama.cpp with a Q8_0
cache reaches only 38% of its PP-512.
* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* Tidy up FlashMS
* Delete no longer used stuff
Now that quantized matrix multiplications are used for a
quantized K- and/or V-cache, we no longer need the helper
methods that load entire rows.
* Disallow mixing bf16 with other types for kv caches
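For reference, below is a minimal scalar sketch of the quantized
K*Q idea: Q is quantized on the fly to q8_0 and multiplied against
the q4_0 K-cache with integer dot products, instead of converting
both sides to fp16. The block layouts are simplified (float scales
instead of the fp16 scales used in ggml), the helper names are
made up for the illustration, and the real kernels are the SIMD
mul_mat_qX_0_q8_0 template instantiations mentioned above.

    // Scalar illustration of quantized K*Q for flash attention
    // (simplified: float block scales, no SIMD).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    constexpr int kBlockSize = 32;

    struct BlockQ8_0 {                  // 32 int8 quants + scale
        float  d;
        int8_t qs[kBlockSize];
    };

    struct BlockQ4_0 {                  // 32 4-bit quants (two per byte) + scale
        float   d;
        uint8_t qs[kBlockSize/2];
    };

    // Quantize one row of Q to q8_0 so that K*Q can be done with integer
    // dot products instead of converting everything to fp16.
    static void quantize_row_q8_0(const float* x, BlockQ8_0* y, int n) {
        for (int ib = 0; ib < n/kBlockSize; ++ib, x += kBlockSize) {
            float amax = 0.f;
            for (int j = 0; j < kBlockSize; ++j) amax = std::max(amax, std::fabs(x[j]));
            const float d  = amax/127.f;
            const float id = d ? 1.f/d : 0.f;
            y[ib].d = d;
            for (int j = 0; j < kBlockSize; ++j) y[ib].qs[j] = (int8_t)std::lround(x[j]*id);
        }
    }

    // q4_0 quantization of a K-cache row: the max-magnitude value maps to -8,
    // each byte packs element j in the low nibble and element j+16 in the high one.
    static void quantize_row_q4_0(const float* x, BlockQ4_0* y, int n) {
        for (int ib = 0; ib < n/kBlockSize; ++ib, x += kBlockSize) {
            float amax = 0.f, maxv = 0.f;
            for (int j = 0; j < kBlockSize; ++j) {
                if (std::fabs(x[j]) > amax) { amax = std::fabs(x[j]); maxv = x[j]; }
            }
            const float d  = maxv/-8.f;
            const float id = d ? 1.f/d : 0.f;
            y[ib].d = d;
            for (int j = 0; j < kBlockSize/2; ++j) {
                const int lo = std::min(15, (int)(x[j]*id + 8.5f));
                const int hi = std::min(15, (int)(x[j + kBlockSize/2]*id + 8.5f));
                y[ib].qs[j] = (uint8_t)(lo | (hi << 4));
            }
        }
    }

    // Dot product of a q4_0 K row with a q8_0 Q row. The 4-bit quants carry an
    // offset of 8, which is subtracted before the integer multiply-add.
    static float vec_dot_q4_0_q8_0(const BlockQ4_0* kx, const BlockQ8_0* qy, int n) {
        float sum = 0.f;
        for (int ib = 0; ib < n/kBlockSize; ++ib) {
            int isum = 0;
            for (int j = 0; j < kBlockSize/2; ++j) {
                const int v0 = (kx[ib].qs[j] & 0x0F) - 8;
                const int v1 = (kx[ib].qs[j] >>   4) - 8;
                isum += v0*qy[ib].qs[j] + v1*qy[ib].qs[j + kBlockSize/2];
            }
            sum += kx[ib].d*qy[ib].d*isum;
        }
        return sum;
    }

    int main() {
        const int n = 64;               // e.g. a head size, multiple of the block size
        std::vector<float> q(n), k(n);
        for (int i = 0; i < n; ++i) { q[i] = 0.01f*(i - 30); k[i] = 0.02f*(40 - i); }

        std::vector<BlockQ8_0> q8(n/kBlockSize);
        std::vector<BlockQ4_0> k4(n/kBlockSize);
        quantize_row_q8_0(q.data(), q8.data(), n);
        quantize_row_q4_0(k.data(), k4.data(), n);

        float ref = 0.f;
        for (int i = 0; i < n; ++i) ref += q[i]*k[i];

        std::printf("quantized K*Q = %g  (fp32 reference = %g)\n",
                    vec_dot_q4_0_q8_0(k4.data(), q8.data(), n), ref);
        return 0;
    }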
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>