author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2024-09-10 19:17:04 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-09-10 19:17:04 +0300 |
commit | 72f5dfe12ac2263e47df53daa0f39acd1e2e4fb6 (patch) | |
tree | c12a902cb72f5120a6960fde25a26b83fe0c6b91 /ggml/src/ggml-vulkan.cpp | |
parent | d17d0c44267bd7d8040626d1006c8377dad4f502 (diff) |
AVX2 Flash Attention (#48)
* First version of AVX2 Flash attention
I simply took the Zen4 implementation and converted the
platform-specific parts to methods of a struct that provides
data loading/storing, conversions, multiply, add, etc.
Most likely not optimal, as the Zen4 strategy was
designed around having 32 512-bit registers, so Zen4
can hold 4X more data in vector registers than
AVX2 with its 16 256-bit registers.
It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.
* Fix Zen4 parts broken by the AVX2 change
* Try smaller q_step - no improvement
* Fix ARM_NEON
I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__
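
To illustrate the approach described above, here is a minimal sketch of the idea: the platform-specific operations live in a struct, and the kernel is written once against that struct. This is not the code from this commit. The names `VecOps`, `dot` and `demo_dot` are hypothetical, the kernel is a plain dot product rather than flash attention, and the scalar fallback is an assumption added so that the sketch also builds on targets without AVX2 (such as `__aarch64__`). The `static_assert` only checks the register-capacity arithmetic from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Register-file capacity from the commit message:
// Zen4 (AVX-512): 32 x 512 bits, AVX2: 16 x 256 bits.
static_assert(32 * 512 == 4 * (16 * 256), "Zen4 holds 4X more vector data");

#if defined(__AVX2__) && !defined(__aarch64__)
#include <immintrin.h>
// Hypothetical AVX2 backend: 8 floats per 256-bit register.
struct VecOps {
    using Vec = __m256;
    static constexpr std::size_t width = 8;
    static Vec  zero()                  { return _mm256_setzero_ps(); }
    static Vec  load(const float * p)   { return _mm256_loadu_ps(p); }
    static void store(float * p, Vec v) { _mm256_storeu_ps(p, v); }
    static Vec  mul(Vec a, Vec b)       { return _mm256_mul_ps(a, b); }
    static Vec  add(Vec a, Vec b)       { return _mm256_add_ps(a, b); }
    static float reduce_add(Vec v) {
        float tmp[8];
        store(tmp, v);
        float s = 0.0f;
        for (float x : tmp) s += x;
        return s;
    }
};
#else
// Assumed scalar fallback with the same interface, used when AVX2
// is not available (for example on ARM).
struct VecOps {
    using Vec = float;
    static constexpr std::size_t width = 1;
    static Vec  zero()                  { return 0.0f; }
    static Vec  load(const float * p)   { return *p; }
    static void store(float * p, Vec v) { *p = v; }
    static Vec  mul(Vec a, Vec b)       { return a * b; }
    static Vec  add(Vec a, Vec b)       { return a + b; }
    static float reduce_add(Vec v)      { return v; }
};
#endif

// Kernel written once against the Ops interface.
template <typename Ops>
float dot(const float * q, const float * k, std::size_t n) {
    assert(q != nullptr && k != nullptr);
    typename Ops::Vec acc = Ops::zero();
    std::size_t i = 0;
    // Full vector-width chunks.
    for (; i + Ops::width <= n; i += Ops::width) {
        acc = Ops::add(acc, Ops::mul(Ops::load(q + i), Ops::load(k + i)));
    }
    float sum = Ops::reduce_add(acc);
    // Scalar tail for the leftover elements.
    for (; i < n; ++i) sum += q[i] * k[i];
    return sum;
}

// Usage example: q = 1..10, k = all 2s.
float demo_dot() {
    std::vector<float> q(10), k(10, 2.0f);
    for (std::size_t i = 0; i < q.size(); ++i) q[i] = float(i + 1);
    return dot<VecOps>(q.data(), k.data(), q.size());
}
```

Because the kernel only touches `Ops`, a further backend would just be another struct with the same methods. The preprocessor guard is what keeps the x86 intrinsics out of ARM builds.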
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-vulkan.cpp')
0 files changed, 0 insertions, 0 deletions