Diffstat (limited to 'docs/token_generation_performance_tips.md')
-rw-r--r-- | docs/token_generation_performance_tips.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/token_generation_performance_tips.md b/docs/token_generation_performance_tips.md
index c9acff7d..d7e863df 100644
--- a/docs/token_generation_performance_tips.md
+++ b/docs/token_generation_performance_tips.md
@@ -17,7 +17,7 @@ llama_model_load_internal: [cublas] total VRAM used: 17223 MB
 If you see these lines, then the GPU is being used.
 
 ## Verifying that the CPU is not oversaturated
-llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physicial CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
+llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
 
 # Example of runtime flags effect on inference speed benchmark
 These runs were tested on the following machine:
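The "start with 1 and double" advice in the changed paragraph can be sketched as a small search loop. This is only an illustration, not part of the patch or of llama itself: `candidate_thread_counts`, `pick_thread_count`, and the `measure` callback are hypothetical names, and `measure(n)` is assumed to run a short generation with `-t n` and return tokens per second.

```python
from typing import Callable, List


def candidate_thread_counts(max_threads: int) -> List[int]:
    """Return 1, 2, 4, ... with max_threads as the final entry."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    counts = []
    n = 1
    while n < max_threads:
        counts.append(n)
        n *= 2
    counts.append(max_threads)  # e.g. the number of physical cores
    return counts


def pick_thread_count(measure: Callable[[int], float], max_threads: int) -> int:
    """Double the thread count until throughput stops improving.

    `measure` is a hypothetical callback (an assumption of this sketch)
    that returns tokens per second for a given `-t` value.
    """
    best_n, best_speed = 0, float("-inf")
    for n in candidate_thread_counts(max_threads):
        speed = measure(n)
        if speed <= best_speed:
            break  # bottleneck reached; keep the previous setting
        best_n, best_speed = n, speed
    return best_n
```

The sketch stops at the last doubling step that still improved throughput. The finer "scale the number down" step between that value and the next doubling is left to manual tuning.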