From 280345968dabc00d212d43e31145f5c9961a7604 Mon Sep 17 00:00:00 2001
From: slaren
Date: Tue, 26 Mar 2024 01:16:01 +0100
Subject: cuda : rename build flag to LLAMA_CUDA (#6299)

---
 docs/token_generation_performance_tips.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'docs/token_generation_performance_tips.md')

diff --git a/docs/token_generation_performance_tips.md b/docs/token_generation_performance_tips.md
index d7e863df..3c434314 100644
--- a/docs/token_generation_performance_tips.md
+++ b/docs/token_generation_performance_tips.md
@@ -1,7 +1,7 @@
 # Token generation performance troubleshooting
 
-## Verifying that the model is running on the GPU with cuBLAS
-Make sure you compiled llama with the correct env variables according to [this guide](../README.md#cublas), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
+## Verifying that the model is running on the GPU with CUDA
+Make sure you compiled llama with the correct env variables according to [this guide](../README.md#CUDA), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
 ```shell
 ./main -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
 ```
--
cgit v1.2.3
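For context, a minimal sketch of how the renamed build flag might be used, assuming the `LLAMA_CUDA=1` Makefile option and `-DLLAMA_CUDA=ON` CMake option that this commit's rename refers to (the linked README section remains the authoritative guide):

```shell
# Sketch: enable CUDA offloading with the renamed build flag (assumed option names).
# Makefile build:
make LLAMA_CUDA=1

# Or, with CMake:
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release

# Then request more layers than the model has; llama offloads only as many as exist:
./main -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
```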