| author | Pierrick Hymbert <pierrick.hymbert@gmail.com> | 2024-03-09 23:41:49 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2024-03-09 23:41:49 +0100 |
| commit | 621e86b331f8b0e71f79fd82a4ae1cd54c3e4396 (patch) | |
| tree | e667aa693df722aafbb5452054de261839d0dac1 | /examples/server/bench/README.md |
| parent | 77d1ac7e00bf049b9f2bba1b5a310a78318c49c4 (diff) | |
server: benchmark: chat/completions scenario and other llm servers comparison (#5941)
* server: bench: Init a bench scenario with K6
See #5827
* server: bench: EOL EOF
* server: bench: PR feedback and improved k6 script configuration
* server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading
server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS
server: bench: increase truncated rate to 80% before failing
* server: bench: fix doc
* server: bench: change gauge custom metrics to trend
* server: bench: change gauge custom metrics to trend
server: bench: add trend custom metrics for total tokens per second average
* server: bench: doc add an option to debug http request
* server: bench: filter dataset too short and too long sequences
* server: bench: allow to filter out conversation in the dataset based on env variable
* server: bench: fix assistant message sent instead of user message
* server: bench: fix assistant message sent instead of user message
* server : add defrag thold parameter
* server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Diffstat (limited to 'examples/server/bench/README.md')
-rw-r--r-- | examples/server/bench/README.md | 88
1 file changed, 88 insertions, 0 deletions
diff --git a/examples/server/bench/README.md b/examples/server/bench/README.md
new file mode 100644
index 00000000..a53ad64d
--- /dev/null
+++ b/examples/server/bench/README.md
@@ -0,0 +1,88 @@

### Server benchmark tools

The benchmark is based on [k6](https://k6.io/).

#### Install k6

Follow the installation instructions at: https://k6.io/docs/get-started/installation/

Example for Ubuntu:
```shell
snap install k6
```

#### Download a dataset

This dataset was originally proposed in the [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

#### Download a model

Example for Phi-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server

The server must answer OAI chat completion requests on `http://localhost:8080/v1`, or on the URL set in the `SERVER_BENCH_URL` environment variable.

Example:
```shell
server --host localhost --port 8080 \
    --model ggml-model-q4_0.gguf \
    --cont-batching \
    --metrics \
    --parallel 8 \
    --batch-size 512 \
    --ctx-size 4096 \
    --log-format text \
    -ngl 33
```

#### Run the benchmark

To run 500 chat completion requests with 8 concurrent users for a maximum of 10 minutes:
```shell
k6 run script.js --duration 10m --iterations 500 --vus 8
```

The benchmark defaults can be overridden with the following environment variables:
- `SERVER_BENCH_URL` server URL prefix for chat completions, default `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS` total number of prompts to select from the dataset, default `480`
- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default `my-model`
- `SERVER_BENCH_MAX_TOKENS` maximum number of tokens to predict, default `512`
- `SERVER_BENCH_DATASET` path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS` maximum number of prompt tokens; longer dataset entries are filtered out, default `1024`
- `SERVER_BENCH_MAX_CONTEXT` maximum context size (prompt + predicted tokens) of a completion request; longer dataset entries are filtered out, default `2048`

Note: the local tokenizer is just a whitespace split, so the real number of tokens will differ.

These can be combined with [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```

To [debug HTTP requests](https://k6.io/docs/using-k6/http-debugging/), use `--http-debug="full"`.

#### Metrics

The following metrics are computed from the `usage` field of the OAI chat completion responses:
- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate` Rate of truncated completions, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. `finish_reason === 'stop'`

The script will fail if too many completions are truncated, see `llamacpp_completions_truncated_rate`.

k6 metrics can be compared against the [server metrics](../README.md), with:

```shell
curl http://localhost:8080/metrics
```
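
The metrics above are standard k6 custom metrics (`Trend`, `Counter`, `Rate`) fed from each chat completion response. The snippet below is a minimal, illustrative sketch of how such a script could declare and update them; the hard-coded placeholder prompt and the request payload shape are assumptions made for illustration, not the actual `script.js` logic.

```js
import http from 'k6/http'
import { check } from 'k6'
import { Counter, Rate, Trend } from 'k6/metrics'

// Custom metrics matching the names listed above.
const llamacpp_tokens_second = new Trend('llamacpp_tokens_second')
const llamacpp_prompt_tokens = new Trend('llamacpp_prompt_tokens')
const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter')
const llamacpp_completion_tokens = new Trend('llamacpp_completion_tokens')
const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter')
const llamacpp_completions_truncated_rate = new Rate('llamacpp_completions_truncated_rate')
const llamacpp_completions_stop_rate = new Rate('llamacpp_completions_stop_rate')

// Same defaults as documented above.
const server_url = __ENV.SERVER_BENCH_URL || 'http://localhost:8080/v1'
const model = __ENV.SERVER_BENCH_MODEL_ALIAS || 'my-model'
const max_tokens = parseInt(__ENV.SERVER_BENCH_MAX_TOKENS || '512')

export default function () {
    // Placeholder prompt; the real benchmark selects prompts from the ShareGPT dataset.
    const payload = JSON.stringify({
        model: model,
        max_tokens: max_tokens,
        messages: [{ role: 'user', content: 'Say hello.' }],
    })
    const res = http.post(`${server_url}/chat/completions`, payload, {
        headers: { 'Content-Type': 'application/json' },
    })
    check(res, { 'completion is successful': r => r.status === 200 })
    if (res.status !== 200) {
        return
    }

    const body = res.json()
    const usage = body.usage

    // res.timings.duration is in milliseconds.
    llamacpp_tokens_second.add(usage.total_tokens / (res.timings.duration / 1000))
    llamacpp_prompt_tokens.add(usage.prompt_tokens)
    llamacpp_prompt_tokens_total_counter.add(usage.prompt_tokens)
    llamacpp_completion_tokens.add(usage.completion_tokens)
    llamacpp_completion_tokens_total_counter.add(usage.completion_tokens)

    const finish_reason = body.choices[0].finish_reason
    llamacpp_completions_truncated_rate.add(finish_reason === 'length')
    llamacpp_completions_stop_rate.add(finish_reason === 'stop')
}
```

Such a sketch runs the same way as the benchmark script, e.g. `k6 run sketch.js --duration 10m --iterations 500 --vus 8`, and reports the custom metrics in the k6 end-of-test summary.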