author | Kawrakow <iwankawrakow@gmail.com> | 2025-01-20 08:57:38 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-01-20 08:57:38 +0200 |
commit | 3c5f87225f0ddd379ab712ddb8ad0013c10167c2 (patch) | |
tree | 7f339e1e1fe99218065a297cbf2632dcce8804a9 /src/llama.cpp | |
parent | 0b74397d596bbcdfba27299393406d2b6330b133 (diff) | |
More Flash Attention improvements (#173)
* FA: slightly faster V*softmax(K*Q) on Zen4
* FA: it is also faster on AVX2 and ARM_NEON
* Deleted forgotten commented-out code
* FA: slightly faster V*softmax(K*Q) also for fp16 K-cache
* FA: slightly faster V*softmax(K*Q) on Zen4
We now get 130.9 t/s for a context of 32k tokens.
* FA: don't store sum scaling factor in SIMD registers
* FA: timing
* FA: faster q8_0 cache via run-time repacking
On Zen4 the q8_0 KV-cache now slightly outperforms BF16.
We get 134 t/s for 32k tokens, which is ~30% better than
the main branch and ~18% better than the last commit.
We simply repack the K-cache to q8_0_r4 before the K*Q
multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication
template (see the sketch after this list).
* FA: Fix AVX2
* FA: fix ARM_NEON
* FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON
* FA: dedicated mat mul for D = 128 also for ARM_NEON
* FA: turn off performance timer
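The run-time repacking mentioned above can be illustrated with a minimal sketch. This is not the kernel from the commit: block_q8_0 follows ggml's standard layout, but block_q8_0_r4, its exact interleaving, and the helper repack_q8_0_to_r4 are assumptions chosen only to show the idea of interleaving 4 K-cache rows so the K*Q matmul template can consume them in one pass.

```cpp
// Illustrative sketch only; the real repacking and matmul template live elsewhere
// in the repository and may interleave elements differently.
#include <cstdint>

typedef uint16_t ggml_half;           // fp16 storage type used by ggml

#define QK8_0 32
struct block_q8_0 {                   // standard ggml q8_0 block: 1 fp16 scale + 32 int8 quants
    ggml_half d;
    int8_t    qs[QK8_0];
};

struct block_q8_0_r4 {                // assumed interleaved block covering 4 source rows
    ggml_half d[4];                   // one scale per row
    int8_t    qs[4 * QK8_0];          // quants grouped 4 at a time, one byte per row
};

// Repack 4 consecutive K-cache rows (each n_blocks q8_0 blocks long) into the
// interleaved r4 form, so the K*Q kernel can process 4 rows per SIMD pass.
static void repack_q8_0_to_r4(const block_q8_0* rows[4], int n_blocks, block_q8_0_r4* out) {
    for (int ib = 0; ib < n_blocks; ++ib) {
        block_q8_0_r4& y = out[ib];
        for (int r = 0; r < 4; ++r) y.d[r] = rows[r][ib].d;
        for (int j = 0; j < QK8_0; ++j)
            for (int r = 0; r < 4; ++r)
                y.qs[4*j + r] = rows[r][ib].qs[j];   // j-th quant of all 4 rows side by side
    }
}
```

Grouping the j-th quant of four rows into adjacent bytes means a single broadcast of Q quants can be multiplied against four K rows at once, which is the usual motivation for row-interleaved (_r4) layouts.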
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'src/llama.cpp')
0 files changed, 0 insertions, 0 deletions