ik_llama.cpp.git

author	Kawrakow <48489457+ikawrakow@users.noreply.github.com>	2024-09-10 19:17:04 +0300
committer	GitHub <noreply@github.com>	2024-09-10 19:17:04 +0300
commit	72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6 (patch)
tree	c12a902cb72f5120a6960fde25a26b83fe0c6b91 /ggml/src/ggml-backend-impl.h
parent	d17d0c44267bd7d8040626d1006c8377dad4f502 (diff)

AVX2 Flash Attention (#48)

* First version of AVX2 Flash attention

  I simply took the Zen4 implementation and converted the platform-specific
  parts to methods of a struct providing data loading/storing, conversions,
  multiply, add, etc. Most likely not optimal, as the Zen4 strategy was
  designed around having 32 x 512-bit registers, so basically we can keep 4X
  more data in vector registers compared to AVX2 with 16 x 256-bit. It still
  gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken by the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

  I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
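
The approach described in the message (platform-specific operations moved into methods of a struct, with the x86 path guarded against __aarch64__) can be illustrated with a minimal sketch. This is not the code from this commit: the names ScalarOps, Avx2Ops, DefaultOps, mul_add, hsum and dot are hypothetical, and the kernel is a plain dot product rather than the actual Flash Attention loop. The two constants only restate the register arithmetic from the message.

```cpp
#include <cassert>
#include <cstddef>

// Guard the x86 path so ARM builds never see it (hypothetical guard layout).
#if defined(__AVX2__) && !defined(__aarch64__)
#include <immintrin.h>
#endif

// Vector register capacity in bits, as stated in the commit message.
constexpr int kZen4RegisterBits = 32 * 512;  // 32 x 512-bit registers
constexpr int kAvx2RegisterBits = 16 * 256;  // 16 x 256-bit registers
static_assert(kZen4RegisterBits == 4 * kAvx2RegisterBits,
              "Zen4 can hold 4X more data in vector registers");

// Hypothetical portable fallback: one float per "vector".
struct ScalarOps {
    using Vec = float;
    static constexpr std::size_t width = 1;
    static Vec zero() { return 0.0f; }
    static Vec load(const float* p) { return *p; }
    static Vec mul_add(Vec a, Vec b, Vec acc) { return a * b + acc; }
    static float hsum(Vec v) { return v; }
};

#if defined(__AVX2__) && !defined(__aarch64__)
// Hypothetical 256-bit variant: eight floats per vector.
struct Avx2Ops {
    using Vec = __m256;
    static constexpr std::size_t width = 8;
    static Vec zero() { return _mm256_setzero_ps(); }
    static Vec load(const float* p) { return _mm256_loadu_ps(p); }
    // Separate multiply and add, so no FMA compiler flag is required.
    static Vec mul_add(Vec a, Vec b, Vec acc) {
        return _mm256_add_ps(_mm256_mul_ps(a, b), acc);
    }
    // Horizontal sum via a temporary array; simple rather than fast.
    static float hsum(Vec v) {
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        float s = 0.0f;
        for (float x : tmp) s += x;
        return s;
    }
};
using DefaultOps = Avx2Ops;
#else
using DefaultOps = ScalarOps;
#endif

// Platform-independent kernel: all hardware details live in Ops.
template <typename Ops>
float dot(const float* q, const float* k, std::size_t n) {
    assert(n % Ops::width == 0);  // sketch only: no tail handling
    typename Ops::Vec acc = Ops::zero();
    for (std::size_t i = 0; i < n; i += Ops::width) {
        acc = Ops::mul_add(Ops::load(q + i), Ops::load(k + i), acc);
    }
    return Ops::hsum(acc);
}
```

Calling dot<DefaultOps>(q, k, n) picks the AVX2 struct when the compiler targets AVX2 on x86 and falls back to the scalar struct everywhere else, including aarch64. A design like this keeps one kernel body per algorithm, at the cost of a strategy (tile sizes, number of live accumulators) that may suit one register file better than another.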

Diffstat (limited to 'ggml/src/ggml-backend-impl.h')

0 files changed, 0 insertions, 0 deletions

