Avoid unnecessarily disabling CUDA graphs (#7302)

As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.
author: agray3 <agray3@users.noreply.github.com> 2024-05-15 14:44:49 +0100
committer: GitHub <noreply@github.com> 2024-05-15 15:44:49 +0200
commit: dc020985b8755dd6aa93a2f002f43c3ede808cce (patch)
tree: a4be81a8ce9f08fbafbc92c3e38ee892192bfe91
parent: 344f9126cc0d15891fde9472fe40b8572628ad7d (diff)
1 files changed, 1 insertions, 1 deletions
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 75a2ad48..04b6e528 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -2558,7 +2558,7 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
         }
 
         // Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
-        if (cuda_graph_update_required) {
+        if (use_cuda_graph && cuda_graph_update_required) {
             cuda_ctx->cuda_graph->number_consecutive_updates++;
         } else {
             cuda_ctx->cuda_graph->number_consecutive_updates = 0;
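
The effect of the one-line change above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the ggml implementation: `TokenStep`, `graphs_get_disabled`, and the `max_consecutive_updates` parameter are hypothetical names introduced for illustration, and the threshold value passed in by the caller is an assumption rather than something shown in this diff. Only the two conditions being compared are taken from the diff itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the per-token state.
// These are not the real ggml structures.
struct TokenStep {
    bool use_cuda_graph;             // false when graphs are skipped, e.g. batch size > 1
    bool cuda_graph_update_required; // graph properties changed since the previous step
};

// Returns true if the consecutive-update counter reaches the (assumed)
// threshold at any point, i.e. CUDA graphs would be disabled from then on.
// apply_fix selects between the old and the new condition from the diff.
bool graphs_get_disabled(const std::vector<TokenStep> & steps,
                         bool apply_fix,
                         int max_consecutive_updates) {
    assert(max_consecutive_updates > 0);
    int number_consecutive_updates = 0;
    for (const TokenStep & s : steps) {
        // Old condition: count every required update.
        // New condition: count only when graphs are actually in use for this token.
        const bool count_update = apply_fix
            ? (s.use_cuda_graph && s.cuda_graph_update_required)
            : s.cuda_graph_update_required;
        if (count_update) {
            number_consecutive_updates++;
        } else {
            number_consecutive_updates = 0;
        }
        if (number_consecutive_updates >= max_consecutive_updates) {
            return true;
        }
    }
    return false;
}

// Builds a long prompt (graphs skipped, update flagged on every step)
// followed by single-token decode steps that need no update.
std::vector<TokenStep> long_prompt_then_decode(int prompt_steps, int decode_steps) {
    std::vector<TokenStep> steps;
    for (int i = 0; i < prompt_steps; ++i) {
        steps.push_back({false, true});
    }
    for (int i = 0; i < decode_steps; ++i) {
        steps.push_back({true, false});
    }
    return steps;
}
```

With an assumed threshold of 4, `graphs_get_disabled(long_prompt_then_decode(8, 3), false, 4)` returns true, because the prompt steps alone push the counter past the threshold. The same call with `apply_fix` set to true returns false, since steps where graphs are not in use reset the counter instead of incrementing it.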