path: root/ggml/src/ggml-cuda/cp-async.cuh
author     Kawrakow <iwankawrakow@gmail.com>    2025-05-12 07:49:00 +0300
committer  GitHub <noreply@github.com>          2025-05-12 07:49:00 +0300
commit     465569dff8b49a195450a0eb1974fd72a32fcebc (patch)
tree       af7f5b4af3738318a28ad9c9de722231c41c3d63 /ggml/src/ggml-cuda/cp-async.cuh
parent     8669c3db2b98f05775292778dd05f424ee0cd250 (diff)
Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA. Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that layout.
* Rearrange the MLA K cache so it fits the new CUDA FA implementation.
* constexpr and minor changes
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
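To make the layout mismatch described in the commit message concrete, here is a minimal sketch (not taken from the commit). The names k_row, rope_part_*, and nope_dim, as well as the per-token row layout, are illustrative assumptions only.

// Hypothetical illustration of the two per-token K cache layouts mentioned above;
// nope_dim is a placeholder size, not a value from this commit.
__device__ const float * rope_part_mainline(const float * k_row) {
    return k_row;                 // mainline: RoPE portion at the beginning of the row
}
__device__ const float * rope_part_ik(const float * k_row, int nope_dim) {
    return k_row + nope_dim;      // this repo, before the rearrangement: RoPE portion at the end
}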
Diffstat (limited to 'ggml/src/ggml-cuda/cp-async.cuh')
-rw-r--r--  ggml/src/ggml-cuda/cp-async.cuh  10
1 file changed, 10 insertions(+), 0 deletions(-)
diff --git a/ggml/src/ggml-cuda/cp-async.cuh b/ggml/src/ggml-cuda/cp-async.cuh
index ecb65999..a87dc247 100644
--- a/ggml/src/ggml-cuda/cp-async.cuh
+++ b/ggml/src/ggml-cuda/cp-async.cuh
@@ -2,6 +2,16 @@
#include "common.cuh"
+static __device__ __forceinline__ unsigned int ggml_cuda_cvta_generic_to_shared(void * generic_ptr) {
+#ifdef CP_ASYNC_AVAILABLE
+ return __cvta_generic_to_shared(generic_ptr);
+#else
+ GGML_UNUSED(generic_ptr);
+ NO_DEVICE_CODE;
+ return 0;
+#endif // CP_ASYNC_AVAILABLE
+}
+
// Copies data from global to shared memory, cg == cache global.
// Both the src and dst pointers must be aligned to 16 bytes.
// Shared memory uses 32 bit addressing, the pointer is passed as unsigned int.
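A minimal usage sketch of the new wrapper (not part of this commit): it converts a generic shared-memory pointer into the 32 bit address that cp.async expects. The kernel name, tile size, and thread mapping below are hypothetical, and the sketch assumes CP_ASYNC_AVAILABLE is defined for the target architecture (Ampere or newer).

// Hypothetical example: each of 64 threads asynchronously copies 16 bytes
// (4 floats) from global to shared memory via cp.async.cg.
__global__ void example_load_tile(const float * __restrict__ src) {
    __shared__ float tile[256];

    // 32 bit shared-memory address of this thread's 16 byte destination slot.
    const unsigned int dst = ggml_cuda_cvta_generic_to_shared(tile + 4*threadIdx.x);

#ifdef CP_ASYNC_AVAILABLE
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                 : : "r"(dst), "l"(src + 4*threadIdx.x));
    // Wait for all outstanding asynchronous copies issued by this thread.
    asm volatile("cp.async.wait_all;");
#else
    GGML_UNUSED(dst);
#endif // CP_ASYNC_AVAILABLE
    __syncthreads();
}

In the actual kernels the raw PTX is wrapped by the copy helpers documented above in this header; the inline asm here is only used to keep the sketch self-contained.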