summaryrefslogtreecommitdiff
path: root/examples/sweep-bench/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'examples/sweep-bench/README.md')
-rw-r--r--examples/sweep-bench/README.md64
1 files changed, 64 insertions, 0 deletions
diff --git a/examples/sweep-bench/README.md b/examples/sweep-bench/README.md
new file mode 100644
index 00000000..608fd104
--- /dev/null
+++ b/examples/sweep-bench/README.md
@@ -0,0 +1,64 @@
+# ik_llama.cpp/example/sweep-bench
+
+Benchmark the prompt processing and token generation performance of `ik_llama.cpp`
+by doing a sweep over a whole context size and gathering performance metrics
+in each ubatch-sized window. Only a single token sequence is used.
+
+The benchmark steps are:
+
+for each ubatch-sized window in context:
+ 1. generate ubatch/4 tokens (not the whole window to save some time)
+ 2. measure generation performance
+ 3. remove generated tokens from KV cache
+ 4. prepare a ubatch-sized batch of random tokens
+ 4. process prepated batch
+ 5. measure prompt processing performance
+
+The purpose of the benchmark is to visualize how the performance changes with
+the context size without averaging the metrics values over the whole context.
+
+## Usage
+
+./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf
+
+## Sample results
+
+- `PP` - prompt tokens per ubatch
+- `TG` - generated tokens per ubatch
+- `N_KV` - current KV cache size
+- `T_PP` - prompt processing time (i.e. time to first token)
+- `S_PP` - prompt processing speed (`(B*PP)/T_PP` or `PP/T_PP`)
+- `T_TG` - time to generate all batches
+- `S_TG` - text generation speed (`(B*TG)/T_TG`)
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.100 | 465.51 | 2.311 | 55.38 |
+| 512 | 128 | 512 | 1.183 | 432.97 | 1.895 | 67.55 |
+| 512 | 128 | 1024 | 1.305 | 392.38 | 2.071 | 61.81 |
+| 512 | 128 | 1536 | 1.279 | 400.42 | 2.164 | 59.14 |
+| 512 | 128 | 2048 | 1.571 | 325.96 | 2.280 | 56.14 |
+| 512 | 128 | 2560 | 1.431 | 357.87 | 2.418 | 52.94 |
+| 512 | 128 | 3072 | 1.515 | 337.93 | 2.566 | 49.88 |
+| 512 | 128 | 3584 | 1.588 | 322.34 | 2.722 | 47.03 |
+| 512 | 128 | 4096 | 1.675 | 305.70 | 2.864 | 44.69 |
+| 512 | 128 | 4608 | 1.769 | 289.50 | 2.999 | 42.68 |
+| 512 | 128 | 5120 | 1.845 | 277.48 | 3.102 | 41.26 |
+| 512 | 128 | 5632 | 1.893 | 270.46 | 3.219 | 39.76 |
+| 512 | 128 | 6144 | 1.953 | 262.20 | 3.348 | 38.23 |
+| 512 | 128 | 6656 | 2.018 | 253.71 | 3.474 | 36.84 |
+| 512 | 128 | 7168 | 2.078 | 246.34 | 3.589 | 35.66 |
+| 512 | 128 | 7680 | 2.140 | 239.22 | 3.717 | 34.43 |
+| 512 | 128 | 8192 | 2.196 | 233.15 | 3.854 | 33.21 |
+
+### JSONL output
+
+Pass `--output-format jsonl` to output JSONL instead of Markdown, รก la
+
+```json lines
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
+{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }
+```