# ik_llama.cpp/example/sweep-bench

Benchmark the prompt processing and token generation performance of `ik_llama.cpp` by doing a sweep over the whole context size and gathering performance metrics in each ubatch-sized window. Only a single token sequence is used.

The benchmark steps are, for each ubatch-sized window in the context (a sketch of the loop follows the list):

1. generate ubatch/4 tokens (not the whole window, to save some time)
2. measure generation performance
3. remove the generated tokens from the KV cache
4. prepare a ubatch-sized batch of random tokens
5. process the prepared batch
6. measure prompt processing performance

The purpose of the benchmark is to visualize how the performance changes with the context size, without averaging the metric values over the whole context.
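The tool itself is written in C++ against the llama.cpp API; the following is only a minimal Python sketch of the loop and timing structure described above. `decode` and `kv_cache_remove_tail` are hypothetical stand-ins for the real decode and KV-cache calls, not actual API functions.

```python
import time

N_CTX    = 8704  # context size (-c)
N_UBATCH = 512   # ubatch size (-ub)

def decode(n_tokens: int) -> None:
    # Hypothetical stand-in for decoding a batch of n_tokens;
    # the sleep just gives the timers something to measure.
    time.sleep(0.0001 * n_tokens)

def kv_cache_remove_tail(n_tokens: int) -> None:
    # Hypothetical stand-in for dropping the last n_tokens from the KV cache.
    pass

n_kv = 0
while n_kv + N_UBATCH <= N_CTX:
    tg = N_UBATCH // 4  # steps 1-2: generate ubatch/4 tokens, one per decode call
    t0 = time.monotonic()
    for _ in range(tg):
        decode(1)
    t_tg = time.monotonic() - t0

    kv_cache_remove_tail(tg)  # step 3: restore the sweep position

    t0 = time.monotonic()     # steps 4-6: process one ubatch of random tokens
    decode(N_UBATCH)
    t_pp = time.monotonic() - t0

    # n_kv is the cache size before this window, matching N_KV in the results below
    print(f"PP={N_UBATCH} TG={tg} N_KV={n_kv} "
          f"S_PP={N_UBATCH / t_pp:.2f} t/s S_TG={tg / t_tg:.2f} t/s")
    n_kv += N_UBATCH
```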
## Usage

```sh
./llama-sweep-bench -c 8704 -ub 512 -m models/Meta-Llama-3.2-3B-Instruct-Q8_0.gguf
```

## Sample results

- `PP` - prompt tokens per ubatch
- `TG` - generated tokens per ubatch
- `N_KV` - current KV cache size
- `T_PP` - prompt processing time (i.e. time to first token)
- `S_PP` - prompt processing speed (`(B*PP)/T_PP` or `PP/T_PP`)
- `T_TG` - time to generate all batches
- `S_TG` - text generation speed (`(B*TG)/T_TG`)

| PP    | TG     | N_KV   | T_PP s   | S_PP t/s | T_TG s   | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.100 |   465.51 |    2.311 |    55.38 |
|   512 |    128 |    512 |    1.183 |   432.97 |    1.895 |    67.55 |
|   512 |    128 |   1024 |    1.305 |   392.38 |    2.071 |    61.81 |
|   512 |    128 |   1536 |    1.279 |   400.42 |    2.164 |    59.14 |
|   512 |    128 |   2048 |    1.571 |   325.96 |    2.280 |    56.14 |
|   512 |    128 |   2560 |    1.431 |   357.87 |    2.418 |    52.94 |
|   512 |    128 |   3072 |    1.515 |   337.93 |    2.566 |    49.88 |
|   512 |    128 |   3584 |    1.588 |   322.34 |    2.722 |    47.03 |
|   512 |    128 |   4096 |    1.675 |   305.70 |    2.864 |    44.69 |
|   512 |    128 |   4608 |    1.769 |   289.50 |    2.999 |    42.68 |
|   512 |    128 |   5120 |    1.845 |   277.48 |    3.102 |    41.26 |
|   512 |    128 |   5632 |    1.893 |   270.46 |    3.219 |    39.76 |
|   512 |    128 |   6144 |    1.953 |   262.20 |    3.348 |    38.23 |
|   512 |    128 |   6656 |    2.018 |   253.71 |    3.474 |    36.84 |
|   512 |    128 |   7168 |    2.078 |   246.34 |    3.589 |    35.66 |
|   512 |    128 |   7680 |    2.140 |   239.22 |    3.717 |    34.43 |
|   512 |    128 |   8192 |    2.196 |   233.15 |    3.854 |    33.21 |

### JSONL output

Pass `--output-format jsonl` to output JSONL instead of Markdown, à la

```json lines
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 0, "t_pp": 1.093814, "speed_pp": 468.086884, "t_tg": 1.780312, "speed_tg": 71.897514 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 512, "t_pp": 1.169302, "speed_pp": 437.868073, "t_tg": 1.897474, "speed_tg": 67.458099 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1024, "t_pp": 1.183700, "speed_pp": 432.542053, "t_tg": 2.059179, "speed_tg": 62.160694 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 1536, "t_pp": 1.428625, "speed_pp": 358.386566, "t_tg": 2.160639, "speed_tg": 59.241734 }
{"n_kv_max": 8704, "n_batch": 2048, "n_ubatch": 512, "flash_attn": 0, "n_gpu_layers": -1, "n_threads": 32, "n_threads_batch": 32, "pp": 512, "tg": 128, "n_kv": 2048, "t_pp": 1.360647, "speed_pp": 376.291595, "t_tg": 2.274003, "speed_tg": 56.288403 }
```
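Since the point of the sweep is to see how speed changes with context depth, the JSONL form is convenient to plot. Below is a minimal sketch, assuming the JSONL lines were saved to a file; the name `sweep.jsonl` and the shell redirection in the comment are just illustrative, not something the tool does for you.

```python
import json

import matplotlib.pyplot as plt

n_kv, s_pp, s_tg = [], [], []
# e.g. ./llama-sweep-bench ... --output-format jsonl > sweep.jsonl (assumed file name)
with open("sweep.jsonl") as f:
    for line in f:
        if not line.startswith("{"):
            continue  # skip any non-JSONL log lines that may be mixed in
        rec = json.loads(line)
        n_kv.append(rec["n_kv"])
        s_pp.append(rec["speed_pp"])  # same as rec["pp"] / rec["t_pp"]
        s_tg.append(rec["speed_tg"])  # same as rec["tg"] / rec["t_tg"]

fig, ax = plt.subplots()
ax.plot(n_kv, s_pp, marker="o", label="S_PP (t/s)")
ax.plot(n_kv, s_tg, marker="o", label="S_TG (t/s)")
ax.set_xlabel("N_KV (tokens already in the KV cache)")
ax.set_ylabel("speed (t/s)")
ax.set_title("sweep-bench: speed vs. context depth")
ax.legend()
plt.show()
```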