cuda : fix vmm pool with multi GPU (#4620)

* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
author: slaren <slarengh@gmail.com> 2023-12-26 21:23:59 +0100
committer: GitHub <noreply@github.com> 2023-12-26 21:23:59 +0100
commit: dc68f0054cd279cddddb0cae0c9ef4f9cbaa512a (patch)
tree: 1c437ea7e78a09d3a1fc7786f42fd3ea8615b292 /llama.cpp
parent: de8e496437c59e7d1cc84109e3e49a3478aee25a (diff)
1 files changed, 2 insertions, 1 deletions
diff --git a/llama.cpp b/llama.cpp
index 0b99f1e0..4aa59c4c 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -9519,7 +9519,8 @@ struct llama_context * llama_new_context_with_model(
             ctx->alloc = ggml_allocr_new_from_buffer(ctx->buf_alloc);
 #if defined(GGML_USE_CUBLAS) && !defined(LLAMA_GGML_BACKEND_CUDA_TEST)
             if (model->n_gpu_layers > 0) {
-                ggml_cuda_set_scratch_size(alloc_size);
+                // the CPU buffer adds this padding in case the malloc buffer is not aligned, so we need to do the same for the GPU buffer, since we use the same offsets
+                ggml_cuda_set_scratch_size(alloc_size + 64);
                 LLAMA_LOG_INFO("%s: VRAM scratch buffer: %.2f MiB\n", __func__, alloc_size / 1024.0 / 1024.0);
 
                 // calculate total VRAM usage
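
Note on the padding in the hunk above: a minimal sketch of why a buffer that may start at an unaligned address needs extra bytes. It assumes a 64-byte alignment, inferred only from the +64 in the diff. The helpers align_up, align_padding and padded_size are hypothetical names for illustration and are not part of ggml or llama.cpp.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round addr up to the next multiple of alignment.
// alignment must be a power of two.
static std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (addr + mask) & ~mask;
}

// Hypothetical helper: bytes skipped at the front of a buffer that starts
// at addr so that the first usable byte is aligned.
static std::size_t align_padding(std::uintptr_t addr, std::size_t alignment) {
    return static_cast<std::size_t>(align_up(addr, alignment) - addr);
}

// Hypothetical helper: size to request so that alloc_size usable bytes
// remain after aligning, whatever the start address is.
// The padding is always < alignment, so adding alignment is sufficient.
static std::size_t padded_size(std::size_t alloc_size, std::size_t alignment) {
    return alloc_size + alignment;
}
```

If offsets are computed against a CPU buffer sized this way and then reused for the GPU scratch buffer, the GPU buffer has to be at least as large, which is what the change to ggml_cuda_set_scratch_size accounts for.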