CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes
author: Johannes Gäßler <johannesg@5d6.de> 2024-06-14 18:41:49 +0200
committer: GitHub <noreply@github.com> 2024-06-14 18:41:49 +0200
commit: 76d66ee0be91e2bec93206e821ee1db8d023cff5 (patch)
tree: 9bf121667539f91b90b54b237e54bdbd9a16161c /ggml-cuda/softmax.cu
parent: 66ef1ceedf983773c8ceb4d925285d41d4e50e2a (diff)
1 files changed, 1 insertions, 0 deletions
diff --git a/ggml-cuda/softmax.cu b/ggml-cuda/softmax.cu
index ce64f2f2..c24abae1 100644
--- a/ggml-cuda/softmax.cu
+++ b/ggml-cuda/softmax.cu
@@ -130,6 +130,7 @@ static void soft_max_f32_cuda(const float * x, const T * mask, float * dst, cons
     const float m0 = powf(2.0f, -(max_bias       ) / n_head_log2);
     const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);
 
+    // FIXME: this limit could be raised by ~2-4x on Ampere or newer
     if (shmem < ggml_cuda_info().devices[ggml_cuda_get_device()].smpb) {
         switch (ncols_x) {
             case 32:
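
The two context lines that compute `m0` and `m1` look like the base values for ALiBi-style per-head slopes. The hunk does not show how `n_head_log2` is derived or how the bases are turned into a slope for each head, so the following is only a minimal host-side C++ sketch under those assumptions: `n_head_log2` is taken to be the largest power of two not exceeding the head count, and heads beyond it are assumed to use odd powers of `m1`, as in the usual ALiBi formulation. The names `alibi_bases`, `alibi_compute_bases`, and `alibi_slope` are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical container for the values computed in the hunk's context lines.
struct alibi_bases {
    uint32_t n_head_log2; // assumed: largest power of two <= n_head
    float    m0;          // base for heads [0, n_head_log2)
    float    m1;          // base for heads [n_head_log2, n_head)
};

// Mirrors the two powf() lines from the diff; the derivation of n_head_log2
// is an assumption, since it is not visible in the hunk.
static alibi_bases alibi_compute_bases(uint32_t n_head, float max_bias) {
    assert(n_head > 0);
    const uint32_t n_head_log2 = 1u << (uint32_t) std::floor(std::log2((float) n_head));
    const float m0 = std::pow(2.0f, -(max_bias       ) / n_head_log2);
    const float m1 = std::pow(2.0f, -(max_bias / 2.0f) / n_head_log2);
    return {n_head_log2, m0, m1};
}

// Assumed per-head slope rule (standard ALiBi): the first n_head_log2 heads
// use successive powers of m0, the remaining heads use odd powers of m1.
static float alibi_slope(const alibi_bases & b, uint32_t h) {
    if (h < b.n_head_log2) {
        return std::pow(b.m0, (float) (h + 1));
    }
    return std::pow(b.m1, (float) (2 * (h - b.n_head_log2) + 1));
}
```

Under these assumptions, 8 heads with `max_bias = 8.0f` give `m0 = 0.5`, so head 0 gets slope 0.5 and head 7 gets 0.5 to the 8th power. With 12 heads, `n_head_log2` stays at 8 and heads 8 to 11 fall back to the `m1` branch.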