From 525213d2f5da1eaf4b922b6b792cb52b2c613368 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sat, 24 Feb 2024 12:28:55 +0100
Subject: server: init functional tests (#5566)

* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov
---
 examples/server/tests/features/server.feature | 69 +++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 examples/server/tests/features/server.feature

diff --git a/examples/server/tests/features/server.feature b/examples/server/tests/features/server.feature
new file mode 100644
index 00000000..fedcfe5a
--- /dev/null
+++ b/examples/server/tests/features/server.feature
@@ -0,0 +1,69 @@
+@llama.cpp
+Feature: llama.cpp server
+
+  Background: Server startup
+    Given a server listening on localhost:8080
+    And   a model file stories260K.gguf
+    And   a model alias tinyllama-2
+    And   42 as server seed
+      # KV Cache corresponds to the total amount of tokens
+      # that can be stored across all independent sequences: #4130
+      # see --ctx-size and #5568
+    And   32 KV cache size
+    And   1 slots
+    And   embeddings extraction
+    And   32 server max tokens to predict
+    Then  the server is starting
+    Then  the server is healthy
+
+  Scenario: Health
+    Then the server is ready
+    And  all slots are idle
+
+  Scenario Outline: Completion
+    Given a prompt <prompt>
+    And   <n_predict> max tokens to predict
+    And   a completion request with no api error
+    Then  <n_predicted> tokens are predicted matching <re_content>
+
+    Examples: Prompts
+      | prompt                           | n_predict | re_content             | n_predicted |
+      | I believe the meaning of life is | 8         | read                   | 8           |
+      | Write a joke about AI            | 64        | (park|friends|scared)+ | 32          |
+
+  Scenario Outline: OAI Compatibility
+    Given a model <model>
+    And   a system prompt <system_prompt>
+    And   a user prompt <user_prompt>
+    And   <max_tokens> max tokens to predict
+    And   streaming is <enable_streaming>
+    Given an OAI compatible chat completions request with no api error
+    Then  <n_predicted> tokens are predicted matching <re_content>
+
+    Examples: Prompts
+      | model        | system_prompt               | user_prompt                          | max_tokens | re_content           | n_predicted | enable_streaming |
+      | llama-2      | Book                        | What is the best book                | 8          | (Mom|what)+          | 8           | disabled         |
+      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks|happy|bird)+ | 32          | enabled          |
+
+  Scenario: Embedding
+    When embeddings are computed for:
+    """
+    What is the capital of Bulgaria ?
+    """
+    Then embeddings are generated
+
+  Scenario: OAI Embeddings compatibility
+    Given a model tinyllama-2
+    When an OAI compatible embeddings computation request for:
+    """
+    What is the capital of Spain ?
+    """
+    Then embeddings are generated
+
+
+  Scenario: Tokenize / Detokenize
+    When tokenizing:
+    """
+    What is the capital of France ?
+    """
+    Then tokens can be detokenize
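
For readers unfamiliar with how such Gherkin scenarios are executed, the sketch below shows minimal, hypothetical behave step definitions for a few of the steps above (server address, health check, and the Completion scenario). This is not the steps.py that ships with this PR; it assumes the behave and requests packages are available and that the llama.cpp server exposes its /health and /completion JSON endpoints with the prompt, n_predict, content, and tokens_predicted fields.

    # Hypothetical, simplified step definitions; the real harness in this PR has its own steps.py.
    import re

    import requests  # assumed available in the test environment
    from behave import given, step, then


    @given("a server listening on {hostname}:{port}")
    def step_server(context, hostname, port):
        # Remember where the server under test is reachable.
        context.base_url = f"http://{hostname}:{port}"


    @then("the server is healthy")
    def step_healthy(context):
        # /health is the llama.cpp server health endpoint.
        response = requests.get(f"{context.base_url}/health")
        assert response.status_code == 200


    @step("a prompt {prompt}")
    def step_prompt(context, prompt):
        context.prompt = prompt


    @step("{n_predict:d} max tokens to predict")
    def step_n_predict(context, n_predict):
        context.n_predict = n_predict


    @step("a completion request with no api error")
    def step_completion(context):
        # Non-streaming completion request against the /completion endpoint.
        response = requests.post(
            f"{context.base_url}/completion",
            json={"prompt": context.prompt, "n_predict": context.n_predict},
        )
        assert response.status_code == 200
        context.completion = response.json()


    @then("{n_predicted:d} tokens are predicted matching {re_content}")
    def step_assert_completion(context, n_predicted, re_content):
        # Check the generated text against the expected regex and bound the
        # token count; the real harness performs stricter checks.
        assert re.search(re_content, context.completion["content"])
        assert context.completion["tokens_predicted"] <= n_predicted

Running "behave examples/server/tests/features/server.feature" against a server started with the Background settings (stories260K.gguf, 32 KV cache size, 1 slot) would then exercise each scenario in turn; the Scenario Outline rows are expanded once per line of the Examples tables.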