ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)

* ci: bench: support sse and fix prompt processing time
  server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
author: Pierrick Hymbert <pierrick.hymbert@gmail.com> 2024-04-06 05:40:47 +0200
committer: GitHub <noreply@github.com> 2024-04-06 05:40:47 +0200
commit: 75cd4c77292034ecec587ecb401366f57338f7c0 (patch)
tree: de137718780505410bc75ce219f4bc164961c4fd /examples/server/bench/README.md
parent: a8bd14d55717754a1f48313a846a2b16fa998ad2 (diff)
1 files changed, 37 insertions, 5 deletions
diff --git a/examples/server/bench/README.md b/examples/server/bench/README.md
index a53ad64d..23a3ec97 100644
--- a/examples/server/bench/README.md
+++ b/examples/server/bench/README.md
@@ -2,13 +2,15 @@
 
 Benchmark is using [k6](https://k6.io/).
 
-##### Install k6
+##### Install k6 and sse extension
 
-Follow instruction from: https://k6.io/docs/get-started/installation/
+SSE is not supported by default in k6; you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.
 
-Example for ubuntu:
+Example:
 ```shell
-snap install k6
+go install go.k6.io/xk6/cmd/xk6@latest
+xk6 build master \
+--with github.com/phymbert/xk6-sse
 ```
 
 #### Download a dataset
@@ -46,7 +48,7 @@ server --host localhost --port 8080 \
 
 For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:
 ```shell
-k6 run script.js --duration 10m --iterations 500 --vus 8
+./k6 run script.js --duration 10m --iterations 500 --vus 8
 ```
 
 The benchmark values can be overridden with:
@@ -86,3 +88,33 @@ K6 metrics might be compared against [server metrics](../README.md), with:
 ```shell
 curl http://localhost:8080/metrics
 ```
+
+### Using the CI python script
+The `bench.py` script performs several steps:
+- start the server
+- set the variables that k6 needs
+- run the k6 script
+- extract metrics from Prometheus
+
+It is intended for use in the CI, but you can also run it manually:
+
+```shell
+LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/server python bench.py \
+              --runner-label local \
+              --name local \
+              --branch `git rev-parse --abbrev-ref HEAD` \
+              --commit `git rev-parse HEAD` \
+              --scenario script.js \
+              --duration 5m \
+              --hf-repo ggml-org/models \
+              --hf-file phi-2/ggml-model-q4_0.gguf \
+              --model-path-prefix models \
+              --parallel 4 \
+              -ngl 33 \
+              --batch-size 2048 \
+              --ubatch-size 256 \
+              --ctx-size 4096 \
+              --n-prompts 200 \
+              --max-prompt-tokens 256 \
+              --max-tokens 256
+```
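
Note (not part of the patch): the README compares k6 results against the text returned by `curl http://localhost:8080/metrics`. The sketch below shows one way such Prometheus-format text could be read into a dictionary for comparison. It is a minimal illustration, not code from `bench.py`: the function names `parse_metrics` and `fetch_metrics` and the `example_*` metric names are hypothetical, and the parser assumes simple label values that do not contain `}`.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {metric_name: value}.

    Labels are dropped, so several series sharing one name overwrite
    each other; that is good enough for a quick manual comparison.
    """
    metrics: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Sample with labels: name{label="v"} value [timestamp]
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            # Sample without labels: name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            metrics[name] = float(fields[0])
        except ValueError:
            # Ignore lines whose value is not a number.
            continue
    return metrics


def fetch_metrics(url: str = "http://localhost:8080/metrics") -> dict[str, float]:
    """Fetch and parse the endpoint shown in the README (needs a running server)."""
    with urlopen(url) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical sample input; these metric names are made up for illustration.
SAMPLE = """\
# HELP example_prompt_tokens_seconds Hypothetical metric for illustration.
# TYPE example_prompt_tokens_seconds gauge
example_prompt_tokens_seconds 512.5
example_requests_total{status="ok"} 200
"""

parsed = parse_metrics(SAMPLE)
```

With a server running, calling `fetch_metrics()` would return the live values, which could then be set side by side with the numbers k6 prints at the end of a run.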