path: root/ggml/src/ggml-backend.c
author    Kawrakow <48489457+ikawrakow@users.noreply.github.com>  2024-09-11 10:26:49 +0300
committer GitHub <noreply@github.com>  2024-09-11 10:26:49 +0300
commit    d98a6753a63d970ebdc01c2b7b4f198644eef81c (patch)
tree      7ab53c46ca940d3f459d7c8200c2179b8953ce08 /ggml/src/ggml-backend.c
parent    72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6 (diff)
ARM_NEON Flash Attention (#49)
* NEON Flash Attention - first working version
  Simply reuse the Zen4/AVX2 implementation, but use f16 for the K*Q multiplication and the V*softmax(K*Q) accumulation. This makes the FlashMS portion somewhat awkward because we do not have fast f16 implementations for expf (and tanh when softcap is enabled), so we need to convert back-and-forth to f32. FA is slightly faster than no-FA for the 4B TriLM model, but slightly slower for Gemma-2b.
* NEON Flash Attention - convert Q to f16 before computing Q*K
* NEON Flash Attention - use fp32 for K*Q operations
  Otherwise I get wrong results for LLaMA-3.1-8B (but it works for Gemma-2b).
* Delete commented out stuff
--------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
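The f16/f32 round trip mentioned for the FlashMS step can be illustrated with a small sketch. This is not the code from the commit; it is a hypothetical helper (softmax_exp_f16 is an invented name) showing how an 8-lane f16 vector of scaled K*Q values would be widened to f32, exponentiated with scalar expf, and narrowed back to f16 for the V*softmax(K*Q) accumulation, assuming an ARMv8 target with NEON FP16 support.

```c
#include <arm_neon.h>
#include <math.h>

// Hypothetical sketch: apply expf to 8 f16 values by widening to f32,
// exponentiating lane-by-lane, and narrowing back to f16. There is no
// fast f16 expf, hence the back-and-forth conversion the commit describes.
static inline float16x8_t softmax_exp_f16(float16x8_t x) {
    // widen the two 4-lane halves of the f16 vector to f32
    float32x4_t lo = vcvt_f32_f16(vget_low_f16 (x));
    float32x4_t hi = vcvt_f32_f16(vget_high_f16(x));

    // exponentiate per lane in f32 using scalar expf
    float tmp[8];
    vst1q_f32(tmp + 0, lo);
    vst1q_f32(tmp + 4, hi);
    for (int i = 0; i < 8; ++i) tmp[i] = expf(tmp[i]);
    lo = vld1q_f32(tmp + 0);
    hi = vld1q_f32(tmp + 4);

    // narrow back to f16 for the subsequent V*softmax(K*Q) accumulation
    return vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi));
}
```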
Diffstat (limited to 'ggml/src/ggml-backend.c')
0 files changed, 0 insertions, 0 deletions