Diffstat (limited to 'examples/sweep-bench/README.md')
-rw-r--r-- | examples/sweep-bench/README.md | 64 |
1 files changed, 64 insertions, 0 deletions
diff --git a/examples/sweep-bench/README.md b/examples/sweep-bench/README.md
new file mode 100644
index 00000000..608fd104
--- /dev/null
+++ b/examples/sweep-bench/README.md
@@ -0,0 +1,64 @@
+# ik_llama.cpp/example/sweep-bench
+
+Benchmark the prompt processing and token generation performance of `ik_llama.cpp`
+by doing a sweep over a whole context size and gathering performance metrics
+in each ubatch-sized window. Only a single token sequence is used.
+
+The benchmark steps are:
+
+for each ubatch-sized window in context:
+    1. generate ubatch/4 tokens (not the whole window to save some time)
+    2. measure generation performance
+    3. remove generated tokens from KV cache
+    4. prepare a ubatch-sized batch of random tokens
+    5. process prepared batch
+    6. measure prompt processing performance
+
+The purpose of the benchmark is to visualize how the performance changes with
+the context size without averaging the metrics values over the whole context.
+
+## Usage
+
+./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf
+
+## Sample results
+
+- `PP` - prompt tokens per ubatch
+- `TG` - generated tokens per ubatch
+- `N_KV` - current KV cache size
+- `T_PP` - prompt processing time (i.e. time to first token)
+- `S_PP` - prompt processing speed (`(B*PP)/T_PP` or `PP/T_PP`)
+- `T_TG` - time to generate all batches
+- `S_TG` - text generation speed (`(B*TG)/T_TG`)
+
+|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+|   512 |    128 |      0 |    1.100 |   465.51 |    2.311 |    55.38 |
+|   512 |    128 |    512 |    1.183 |   432.97 |    1.895 |    67.55 |
+|   512 |    128 |   1024 |    1.305 |   392.38 |    2.071 |    61.81 |
+|   512 |    128 |   1536 |    1.279 |   400.42 |    2.164 |    59.14 |
+|   512 |    128 |   2048 |    1.571 |   325.96 |    2.280 |    56.14 |
+|   512 |    128 |   2560 |    1.431 |   357.87 |    2.418 |    52.94 |
+|   512 |    128 |   3072 |    1.515 |   337.93 |    2.566 |    49.88 |
+|   512 |    128 |   3584 |    1.588 |   322.34 |    2.722 |    47.03 |
+|   512 |    128 |   4096 |    1.675 |   305.70 |    2.864 |    44.69 |
+|   512 |    128 |   4608 |    1.769 |   289.50 |    2.999 |    42.68 |
+|   512 |    128 |   5120 |    1.845 |   277.48 |    3.102 |    41.26 |
+|   512 |    128 |   5632 |    1.893 |   270.46 |    3.219 |    39.76 |
+|   512 |    128 |   6144 |    1.953 |   262.20 |    3.348 |    38.23 |
+|   512 |    128 |   6656 |    2.018 |   253.71 |    3.474 |    36.84 |
+|   512 |    128 |   7168 |    2.078 |   246.34 |    3.589 |    35.66 |
+|   512 |    128 |   7680 |    2.140 |   239.22 |    3.717 |    34.43 |
+|   512 |    128 |   8192 |    2.196 |   233.15 |    3.854 |    33.21 |
+
+### JSONL output
+
+Pass `--output-format jsonl` to output JSONL instead of Markdown, à la
+
+```json lines
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }
+```
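
Note on consuming the JSONL output: the sketch below is not part of the commit. It shows one way the records could be read back and sanity-checked in Python, assuming the field names seen in the sample above (`pp`, `tg`, `t_pp`, `t_tg`, `speed_pp`, `speed_tg`) and assuming `B = 1`, since the README states that only a single token sequence is used. The helper names `parse_sweep_jsonl` and `recompute_speeds` are hypothetical, and the two sample records are copied from the README.

```python
import json

# Two records copied verbatim from the README's JSONL sample.
SAMPLE = """\
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
"""


def parse_sweep_jsonl(text):
    """Parse JSONL text into a list of dicts, one per ubatch-sized window."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def recompute_speeds(rec):
    """Recompute (S_PP, S_TG) in tokens/s as PP/T_PP and TG/T_TG (assumes B = 1)."""
    return rec["pp"] / rec["t_pp"], rec["tg"] / rec["t_tg"]


records = parse_sweep_jsonl(SAMPLE)

for rec in records:
    s_pp, s_tg = recompute_speeds(rec)
    # Print the recomputed speeds next to the reported ones for comparison.
    print(f'n_kv={rec["n_kv"]:>5}  '
          f'S_PP={s_pp:.2f} (reported {rec["speed_pp"]:.2f})  '
          f'S_TG={s_tg:.2f} (reported {rec["speed_tg"]:.2f})')
```

Because the timings in the output are rounded to six decimals, the recomputed speeds should agree with the reported ones only up to a small rounding error.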