CPU FA improvements (#351)

* FA: provide work buffer for K repacking
* Add header to avoid compiler warnings
* WIP
* WIP
* WIP
* WIP
* Slightly better
* WIP (Zen4)
* WIP
* Try to improve for unusual number of heads/number of threads
* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
* Use Sum4q4 for q4_0
* WIP
* WIP
* Much better FA TG with q8_0 KV cache

Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
author: Kawrakow <iwankawrakow@gmail.com> 2025-04-29 07:19:43 +0200
committer: GitHub <noreply@github.com> 2025-04-29 07:19:43 +0200
commit: cda24b58cbef34154651d0083910fed860a506c1 (patch)
tree: 90cd3bd7f772c3b240a6553eca5e50edf95c53da /ggml/src/iqk/iqk_flash_impl.h
parent: baeefb4731fb24cdace168f6dbc74516d470efc0 (diff)
1 files changed, 4 insertions, 0 deletions
diff --git a/ggml/src/iqk/iqk_flash_impl.h b/ggml/src/iqk/iqk_flash_impl.h
index 68802927..6f62e56b 100644
--- a/ggml/src/iqk/iqk_flash_impl.h
+++ b/ggml/src/iqk/iqk_flash_impl.h
@@ -6,6 +6,8 @@
 
 #pragma once
 
+#include <cstdint>
+
 bool iqk_flash_attn_impl(int type_k,             // type of k
                          int type_v,             // type of v
                          int Dk,                 // K head size
@@ -27,3 +29,5 @@ bool iqk_flash_attn_impl(int type_k,             // type of k
                          float       * M,
                          float       * S);
 
+void * iqk_repack_k(int type_k, int nek0, int nek1, int nek2, int nek3, long nbk1, long nbk2, long nbk3,
+        const void * k, void * work, int ith, int nth, int& repacked_type, uint64_t& row_size);
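
The commit message says the K cache is repacked in chunks of k_step rows rather than as a whole tensor, with the result written to a caller-provided work buffer. The sketch below only illustrates that chunking idea and is not the code from this commit: `repack_chunk`, its parameter list, and the use of a plain `memcpy` as a stand-in for the real repacking are all assumptions made for illustration. The names `nbk1`, `nek1`, `ith`, `nth`, and `row_size` are borrowed from the `iqk_repack_k` declaration above, and their meaning here (row stride in bytes, row count, thread index, thread count, bytes per row) is likewise assumed.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Hypothetical helper, NOT the iqk implementation.
// Copies up to k_step rows of a strided K matrix, starting at first_row,
// into the contiguous buffer `work`. The rows of the chunk are split
// across nth threads; thread ith handles its own slice.
// Returns the number of rows this thread copied.
int repack_chunk(const void * k, long nbk1, uint64_t row_size, int nek1,
                 int first_row, int k_step, void * work, int ith, int nth) {
    // The last chunk may hold fewer than k_step rows.
    const int n_rows = std::min(k_step, nek1 - first_row);
    if (n_rows <= 0) return 0;

    // Even split of the chunk's rows across threads (rounded up).
    const int per_thread = (n_rows + nth - 1) / nth;
    const int begin = std::min(n_rows, ith * per_thread);
    const int end   = std::min(n_rows, begin + per_thread);

    const char * src = static_cast<const char *>(k);
    char       * dst = static_cast<char *>(work);
    for (int r = begin; r < end; ++r) {
        // Stand-in for the real repacking: a plain row copy that removes
        // the source stride (nbk1) and packs rows back to back.
        std::memcpy(dst + r * row_size, src + (first_row + r) * nbk1, row_size);
    }
    return end - begin;
}
```

Under these assumptions, a caller would walk the K rows in steps of k_step, call `repack_chunk` from every thread for the current chunk, synchronize, and then run the attention kernel on `work`. The work buffer then only needs room for k_step rows instead of the whole K tensor.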