ggml : add ggml_soft_max_ext (#4256)

* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
author: Georgi Gerganov <ggerganov@gmail.com> 2023-12-01 10:51:24 +0200
committer: GitHub <noreply@github.com> 2023-12-01 10:51:24 +0200
commit: ef47ec18da469423c276b683dd9b5741cee7023e (patch)
tree: ec3b4780dbe8f629425de499b298e8eadfd1aa4d /ggml-alloc.c
parent: 1d144112c0fbbb4ecc07dbcf4f05a380148bd6de (diff)
1 files changed, 1 insertions, 1 deletions
diff --git a/ggml-alloc.c b/ggml-alloc.c
index cdfe4caf..0d4e12ae 100644
--- a/ggml-alloc.c
+++ b/ggml-alloc.c
@@ -137,7 +137,7 @@ void ggml_tallocr_alloc(ggml_tallocr_t alloc, struct ggml_tensor * tensor) {
 
 #ifdef GGML_ALLOCATOR_DEBUG
     add_allocated_tensor(alloc, tensor);
-    size_t cur_max = (char*)addr - (char*)alloc->data + size;
+    size_t cur_max = (char*)addr - (char*)alloc->base + size;
     if (cur_max > alloc->max_size) {
         printf("max_size = %.2f MB: tensors: ", cur_max / 1024.0 / 1024.0);
         for (int i = 0; i < 1024; i++) {
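
Note on the hunk above: the changed line computes the end offset of the new allocation relative to the start of the allocator's buffer, and the debug branch prints when that offset exceeds the recorded maximum. The hunk only shows the field name changing from `data` to `base`; the sketch below is a simplified illustration of the same offset arithmetic, not the actual ggml code. The struct `toy_allocr` and the function `toy_track_max` are hypothetical names introduced here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the allocator state.
 * This is NOT the real ggml struct; only the idea of a base pointer
 * and a tracked maximum is taken from the diff. */
struct toy_allocr {
    void * base;      /* start of the managed buffer */
    size_t max_size;  /* highest end offset seen so far */
};

/* Compute the end offset of an allocation of `size` bytes at `addr`,
 * measured from the buffer base, and raise the high-water mark if needed.
 * Returns the (possibly updated) high-water mark. */
static size_t toy_track_max(struct toy_allocr * alloc, const void * addr, size_t size) {
    /* Same arithmetic as the patched line: (addr - base) + size. */
    size_t cur_max = (size_t)((const char *)addr - (const char *)alloc->base) + size;
    if (cur_max > alloc->max_size) {
        alloc->max_size = cur_max;
    }
    return alloc->max_size;
}
```

For a buffer starting at `buf`, a 32-byte allocation at `buf + 64` has an end offset of 96, so the tracked maximum becomes 96 and a later, lower allocation leaves it unchanged.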