author | Kawrakow <iwankawrakow@gmail.com> | 2025-01-20 08:57:38 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-01-20 08:57:38 +0200 |
commit | 3c5f87225f0ddd379ab712ddb8ad0013c10167c2 (patch) | |
tree | 7f339e1e1fe99218065a297cbf2632dcce8804a9 /src/llama.cpp | |
parent | 0b74397d596bbcdfba27299393406d2b6330b133 (diff) | |
More Flash Attention improvements (#173)
* FA: slightly faster V*softmax(K*Q) on Zen4
* FA: it is also faster on AVX2 and ARM_NEON
* Deleted forgotten commented-out code
* FA: slightly faster V*softmax(K*Q) also for fp16 K-cache
* FA: slightly faster V*softmax(K*Q) on Zen4
We now get 130.9 t/s for a context of 32k tokens.
* FA: don't store sum scaling factor in SIMD registers
* FA: timing
* FA: faster q8_0 cache via run-time repacking
On Zen4 the q8_0 KV-cache now slightly outperforms BF16.
We get 134 t/s for 32k tokens, which is ~30% better than
the main branch and ~18% better than the last commit.
We simply repack the K-cache to q8_0_r4 before the K*Q
multiplication and use the q8_0_r4 x q8_0_x4 matrix multiplication
template (see the sketch after this list).
* FA: Fix AVX2
* FA: fix ARM_NEON
* FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON
* FA: dedicated mat mul for D = 128 also for ARM_NEON
* FA: turn off performance timer
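The run-time repacking mentioned above can be illustrated with a minimal sketch. This is not the kernel from the commit: block_q8_0 follows ggml's standard layout, but block_q8_0_r4, its exact interleaving, and the helper repack_q8_0_to_r4 are assumptions chosen only to show the idea of interleaving 4 K-cache rows so the K*Q matmul template can consume them in one pass.

```cpp
// Illustrative sketch only; the real repacking and matmul template live elsewhere
// in the repository and may interleave elements differently.
#include <cstdint>

typedef uint16_t ggml_half;           // fp16 storage type used by ggml

#define QK8_0 32
struct block_q8_0 {                   // standard ggml q8_0 block: 1 fp16 scale + 32 int8 quants
    ggml_half d;
    int8_t    qs[QK8_0];
};

struct block_q8_0_r4 {                // assumed interleaved block covering 4 source rows
    ggml_half d[4];                   // one scale per row
    int8_t    qs[4 * QK8_0];          // quants grouped 4 at a time, one byte per row
};

// Repack 4 consecutive K-cache rows (each n_blocks q8_0 blocks long) into the
// interleaved r4 form, so the K*Q kernel can process 4 rows per SIMD pass.
static void repack_q8_0_to_r4(const block_q8_0* rows[4], int n_blocks, block_q8_0_r4* out) {
    for (int ib = 0; ib < n_blocks; ++ib) {
        block_q8_0_r4& y = out[ib];
        for (int r = 0; r < 4; ++r) y.d[r] = rows[r][ib].d;
        for (int j = 0; j < QK8_0; ++j)
            for (int r = 0; r < 4; ++r)
                y.qs[4*j + r] = rows[r][ib].qs[j];   // j-th quant of all 4 rows side by side
    }
}
```

Grouping the j-th quant of four rows into adjacent bytes means a single broadcast of Q quants can be multiplied against four K rows at once, which is the usual motivation for row-interleaved (_r4) layouts.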
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'src/llama.cpp')
0 files changed, 0 insertions, 0 deletions