author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-09-11 10:26:49 +0300
committer | GitHub <noreply@github.com> | 2024-09-11 10:26:49 +0300
commit | d98a6753a63d970ebdc01c2b7b4f198644eef81c (patch)
tree | 7ab53c46ca940d3f459d7c8200c2179b8953ce08 /ggml/src/ggml-blas.cpp
parent | 72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6 (diff)
ARM_NEON Flash Attention (#49)
* NEON Flash Attention - first working version
Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations for expf (and tanh when
softcap is enabled), so we need to convert back and forth
to f32 (see the sketch below).
FA is slightly faster than no-FA for the 4B TriLM model,
but slightly slower for Gemma-2b.
* NEON Flash Attention - convert Q to f16 before computing Q*K
* NEON Flash Attention - use fp32 for K*Q operations
Otherwise I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).
* Delete commented out stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-blas.cpp')
0 files changed, 0 insertions, 0 deletions
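
To make the constraint described in the first bullet concrete, here is a minimal sketch of the f16/f32 round trip, assuming AArch64 NEON. This is not the code from this commit; the helper name, its signature, and the softcap value in the usage example are invented for illustration. The scores are kept in f16 for the K*Q and V*softmax(K*Q) work, but expf/tanhf are only available for f32, so each vector has to be widened, transformed with scalar f32 math, and narrowed back:

```c
// Hypothetical sketch of the f16 <-> f32 conversion dance described in the
// commit message. Names and shapes are illustrative only.
#include <arm_neon.h>
#include <math.h>
#include <stdio.h>

// Apply softcap s_cap * tanh(x / s_cap) in place to 8 f16 score values.
static void softcap_tanh_f16(float16_t * x, float s_cap) {
    float16x8_t v = vld1q_f16(x);

    // Widen the 8 f16 lanes to two f32 vectors.
    float32x4_t lo = vcvt_f32_f16(vget_low_f16 (v));
    float32x4_t hi = vcvt_f32_f16(vget_high_f16(v));

    // No fast vectorized f16 tanh, so spill to a scalar f32 buffer.
    float tmp[8];
    vst1q_f32(tmp + 0, lo);
    vst1q_f32(tmp + 4, hi);
    for (int i = 0; i < 8; ++i) tmp[i] = s_cap * tanhf(tmp[i] / s_cap);
    lo = vld1q_f32(tmp + 0);
    hi = vld1q_f32(tmp + 4);

    // Narrow back to f16 and store.
    vst1q_f16(x, vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi)));
}

int main(void) {
    // Hypothetical softcap value, chosen only for the demo.
    float16_t scores[8] = {0.5f, -1.0f, 2.0f, -3.0f, 4.0f, -5.0f, 6.0f, -7.0f};
    softcap_tanh_f16(scores, 30.0f);
    for (int i = 0; i < 8; ++i) printf("%g\n", (float)scores[i]);
    return 0;
}
```

The same widen/transform/narrow pattern applies to the expf calls in the softmax step, and the fix in the third bullet amounts to also performing the K*Q dot products in f32 rather than f16.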