author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-09-11 10:26:49 +0300
committer | GitHub <noreply@github.com> | 2024-09-11 10:26:49 +0300
commit | d98a6753a63d970ebdc01c2b7b4f198644eef81c (patch)
tree | 7ab53c46ca940d3f459d7c8200c2179b8953ce08 /ggml/src/ggml-blas.cpp
parent | 72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6 (diff)
ARM_NEON Flash Attention (#49)
* NEON Flash Attention - first working version
Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations for expf (and tanh when
softcap is enabled), so we need to convert back and forth
to f32 (see the sketch below).
FA is slightly faster than no-FA for the 4B TriLM model,
but slightly slower for Gemma-2b.
* NEON Flash Attention - convert Q to f16 before computing Q*K
* NEON Flash Attention - use fp32 for K*Q operations
Otherwise I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).
* Delete commented out stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-blas.cpp')
0 files changed, 0 insertions, 0 deletions
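
To make the constraint described in the first bullet concrete, here is a minimal sketch of the f16/f32 round trip, assuming AArch64 NEON. This is not the code from this commit; the helper name, its signature, and the softcap value in the usage example are invented for illustration. The scores are kept in f16 for the K*Q and V*softmax(K*Q) work, but expf/tanhf are only available for f32, so each vector has to be widened, transformed with scalar f32 math, and narrowed back:

```c
// Hypothetical sketch of the f16 <-> f32 conversion dance described in the
// commit message. Names and shapes are illustrative only.
#include <arm_neon.h>
#include <math.h>
#include <stdio.h>

// Apply softcap s_cap * tanh(x / s_cap) in place to 8 f16 score values.
static void softcap_tanh_f16(float16_t * x, float s_cap) {
    float16x8_t v = vld1q_f16(x);

    // Widen the 8 f16 lanes to two f32 vectors.
    float32x4_t lo = vcvt_f32_f16(vget_low_f16 (v));
    float32x4_t hi = vcvt_f32_f16(vget_high_f16(v));

    // No fast vectorized f16 tanh, so spill to a scalar f32 buffer.
    float tmp[8];
    vst1q_f32(tmp + 0, lo);
    vst1q_f32(tmp + 4, hi);
    for (int i = 0; i < 8; ++i) tmp[i] = s_cap * tanhf(tmp[i] / s_cap);
    lo = vld1q_f32(tmp + 0);
    hi = vld1q_f32(tmp + 4);

    // Narrow back to f16 and store.
    vst1q_f16(x, vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi)));
}

int main(void) {
    // Hypothetical softcap value, chosen only for the demo.
    float16_t scores[8] = {0.5f, -1.0f, 2.0f, -3.0f, 4.0f, -5.0f, 6.0f, -7.0f};
    softcap_tanh_f16(scores, 30.0f);
    for (int i = 0; i < 8; ++i) printf("%g\n", (float)scores[i]);
    return 0;
}
```

The same widen/transform/narrow pattern applies to the expf calls in the softmax step, and the fix in the third bullet amounts to also performing the K*Q dot products in f32 rather than f16.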