path: root/ggml/src/ggml-backend-impl.h
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-09-10 19:17:04 +0300
committer: GitHub <noreply@github.com> 2024-09-10 19:17:04 +0300
commit: 72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6
tree: c12a902cb72f5120a6960fde25a26b83fe0c6b91 /ggml/src/ggml-backend-impl.h
parent: d17d0c44267bd7d8040626d1006c8377dad4f502
AVX2 Flash Attention (#48)

* First version of AVX2 Flash attention

  I simply took the Zen4 implementation and converted platform-specific stuff to methods of a struct providing data loading/storing, conversions, multiply, add, etc. Most likely not optimal, as the Zen4 strategy has been designed based on having 32 512-bit registers, so basically we can have 4X more data stored in vector registers compared to AVX2 with 16 x 256-bit. It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken via the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

  I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

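
The "methods of a struct" approach described in the commit message can be illustrated with a minimal sketch. This is not the code from the commit: `ScalarOps4` and `dot` are hypothetical names, and the backend uses plain scalar arrays in place of real `__m256` / `__m512` registers so that it compiles anywhere. The point is only that a kernel written once against such a struct can be reused across platforms by swapping the struct. The `static_assert` checks the 4X register-capacity figure quoted above.

```cpp
#include <cstddef>

// Hypothetical backend: one "register" holds 4 floats, emulated with scalars.
// A real AVX2 or Zen4 backend would wrap __m256 / __m512 and the matching intrinsics.
struct ScalarOps4 {
    static constexpr std::size_t width = 4;
    struct Vec { float v[width]; };

    static Vec zero() {
        Vec r;
        for (std::size_t i = 0; i < width; ++i) r.v[i] = 0.0f;
        return r;
    }
    static Vec load(const float* p) {
        Vec r;
        for (std::size_t i = 0; i < width; ++i) r.v[i] = p[i];
        return r;
    }
    static void store(float* p, const Vec& a) {
        for (std::size_t i = 0; i < width; ++i) p[i] = a.v[i];
    }
    static Vec mul(const Vec& a, const Vec& b) {
        Vec r;
        for (std::size_t i = 0; i < width; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
    static Vec add(const Vec& a, const Vec& b) {
        Vec r;
        for (std::size_t i = 0; i < width; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    // Horizontal sum of all lanes.
    static float reduce_add(const Vec& a) {
        float s = 0.0f;
        for (std::size_t i = 0; i < width; ++i) s += a.v[i];
        return s;
    }
};

// Kernel written once against the Ops interface: a dot product, the basic
// building block of the Q*K^T step in attention.
template <typename Ops>
float dot(const float* x, const float* y, std::size_t n) {
    typename Ops::Vec acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::width <= n; i += Ops::width) {
        acc = Ops::add(acc, Ops::mul(Ops::load(x + i), Ops::load(y + i)));
    }
    float sum = Ops::reduce_add(acc);
    for (; i < n; ++i) sum += x[i] * y[i];  // scalar tail for leftover elements
    return sum;
}

// Register-file capacity quoted in the commit message:
// 32 x 512-bit (Zen4 / AVX-512) versus 16 x 256-bit (AVX2).
constexpr int zen4_register_bits = 32 * 512;
constexpr int avx2_register_bits = 16 * 256;
static_assert(zen4_register_bits == 4 * avx2_register_bits,
              "Zen4 holds 4X more data in vector registers than AVX2");
```

Calling `dot<ScalarOps4>(x, y, n)` selects the backend at compile time, so there is no runtime dispatch cost. A tiling strategy tuned for the larger register file may still need different block sizes on AVX2, which is consistent with the author's remark that the port is most likely not optimal.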
Diffstat (limited to 'ggml/src/ggml-backend-impl.h')
0 files changed, 0 insertions, 0 deletions