author    | Kawrakow <iwankawrakow@gmail.com> | 2025-03-18 07:36:42 +0100
committer | GitHub <noreply@github.com>       | 2025-03-18 07:36:42 +0100
commit    | dcdfad29f7d2b831f1c84751f00bda14cc359a84
tree      | 7576224579bf2c95734a407e29ac16fabc8efc9d /ggml/src/ggml-cuda.cu
parent    | f91b2e38d028c77cc5631295ba0937749e684749
FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260)
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP (prompt processing) performance is ~13% better
for 16k tokens, and the compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b
(see the contiguity sketch after this list).
* FlashMLA-2: avoid conversions to f32 also on CUDA
* Be able to compute for more than 65535 tokens
On CUDA this is just a quick hack that allows us to concatenate tensors
with more than 65535 rows along the zeroth dimension, as needed by
FlashMLA-2 (see the chunked-launch sketch after this list). Also needed
some care in the perplexity tool to avoid int overflows when evaluating
the computed logits (see the 64-bit indexing sketch after this list).
* Reduce memory usage for FlashMLA-2
Oh, and also fix an int overflow in the CUDA concat implementation.
It is funny how the llama.cpp 64-bit police have gone (almost) everywhere
and replaced 32-bit ints with 64-bit ints, needed or not,
but haven't done it where it is actually needed.
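
Below is a minimal sketch of the contiguity constraint behind the wv_b workaround. `ggml_is_quantized` and `ggml_is_contiguous` are existing ggml predicates; `can_use_quantized_gemv` is a hypothetical helper, not the actual dispatch code:

```cpp
#include "ggml.h"

// Sketch only: quantized GEMV kernels expect the weight rows to be packed
// back-to-back in memory. A 3D view into wkv_b inherits wkv_b's strides and
// so is not contiguous; materializing wv_b at load time sidesteps this.
static bool can_use_quantized_gemv(const struct ggml_tensor * w) {
    return ggml_is_quantized(w->type) && ggml_is_contiguous(w);
}
```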
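For reference, a self-contained sketch of the grid-dimension workaround alluded to above; the kernel and helper names are illustrative, not the actual ggml-cuda concat code. CUDA caps gridDim.y and gridDim.z at 65535, so a kernel that maps one tensor row to blockIdx.y must be launched in row chunks once the row count exceeds that limit:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>

// One tensor row per blockIdx.y within the current chunk.
static __global__ void copy_rows_kernel(const float * src, float * dst, const int64_t ncols) {
    const int64_t row = blockIdx.y;
    const int64_t col = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (col < ncols) {
        dst[row*ncols + col] = src[row*ncols + col]; // 64-bit offsets, no int overflow
    }
}

static void copy_rows_chunked(const float * src, float * dst,
                              const int64_t nrows, const int64_t ncols, cudaStream_t stream) {
    constexpr int64_t MAX_GRID_Y = 65535; // hard CUDA limit on gridDim.y
    const dim3 block(256, 1, 1);
    for (int64_t row0 = 0; row0 < nrows; row0 += MAX_GRID_Y) {
        const unsigned chunk = (unsigned) std::min(MAX_GRID_Y, nrows - row0);
        const dim3 grid((unsigned) ((ncols + block.x - 1)/block.x), chunk, 1);
        copy_rows_kernel<<<grid, block, 0, stream>>>(src + row0*ncols, dst + row0*ncols, ncols);
    }
}
```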
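And a sketch of the perplexity-side overflow. The names are hypothetical, but the point carries over: with tens of thousands of tokens and a six-figure vocabulary, `i * n_vocab` overflows a 32-bit int, so the offset into the logits buffer has to be computed in 64 bits:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not the actual perplexity-tool code.
static double sum_log_probs(const float * logits, const int32_t * tokens,
                            const int n_tokens, const int n_vocab) {
    double sum = 0.0;
    for (int i = 0; i + 1 < n_tokens; ++i) {
        const float * row = logits + (int64_t) i * n_vocab; // 64-bit offset is the fix
        float max = row[0];
        for (int j = 1; j < n_vocab; ++j) max = std::max(max, row[j]);
        double denom = 0.0;
        for (int j = 0; j < n_vocab; ++j) denom += std::exp(row[j] - max);
        sum += row[tokens[i + 1]] - max - std::log(denom); // log-softmax of the next token
    }
    return sum;
}
```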
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-cuda.cu')
-rw-r--r-- | ggml/src/ggml-cuda.cu | 2
1 file changed, 1 insertion, 1 deletion
```diff
diff --git a/ggml/src/ggml-cuda.cu b/ggml/src/ggml-cuda.cu
index 1bb869c3..58a44cf7 100644
--- a/ggml/src/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda.cu
@@ -3354,7 +3354,7 @@ GGML_CALL static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, cons
     if (op->op == GGML_OP_MOE_FUSED_UP_GATE && a->type != op->src[1]->type) {
         return false;
     }
-    if (b->type == GGML_TYPE_F16 && a->type != GGML_TYPE_F16) {
+    if (b->type == GGML_TYPE_F16 && a->type != GGML_TYPE_F16 && !ggml_is_quantized(a->type)) {
         return false;
     }
     if (op->op == GGML_OP_MUL_MAT && a->ne[3] != b->ne[3]) {
```
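
Reading the one-line change: previously any op whose right operand (`b`) is F16 was rejected unless the left operand (`a`) was also F16; the new clause additionally lets quantized left operands through. Restated as a predicate (a paraphrase of the diff, not extra code from the commit):

```cpp
// Before: reject whenever b is f16 and a is anything other than f16.
// After:  also accept any quantized type for a.
const bool reject = b->type == GGML_TYPE_F16
                 && a->type != GGML_TYPE_F16
                 && !ggml_is_quantized(a->type);
```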