From fd72d2d2a5e79d61ccef6af3d15f16e5e5cbc352 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sat, 9 Mar 2024 10:30:04 +0100
Subject: server: tests: add truncated prompt tests, better kv cache size
 (#5933)

* server: tests: add truncated prompt tests, better size

* server, tests : update regex

---------

Co-authored-by: Georgi Gerganov
---
 examples/server/tests/features/server.feature | 41 ++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/examples/server/tests/features/server.feature b/examples/server/tests/features/server.feature
index 878ac136..aa132fa3 100644
--- a/examples/server/tests/features/server.feature
+++ b/examples/server/tests/features/server.feature
@@ -10,11 +10,10 @@ Feature: llama.cpp server
       # KV Cache corresponds to the total amount of tokens
       # that can be stored across all independent sequences: #4130
       # see --ctx-size and #5568
-    And 32 KV cache size
-    And 512 as batch size
-    And 1 slots
-    And embeddings extraction
-    And 32 server max tokens to predict
+    And 256 KV cache size
+    And 32 as batch size
+    And 2 slots
+    And 64 server max tokens to predict
     And prometheus compatible metrics exposed
     Then the server is starting
     Then the server is healthy
@@ -23,18 +22,35 @@ Feature: llama.cpp server
     Then the server is ready
     And all slots are idle
 
+
   Scenario Outline: Completion
     Given a prompt <prompt>
     And <n_predict> max tokens to predict
     And a completion request with no api error
     Then <n_predicted> tokens are predicted matching <re_content>
+    And the completion is <truncated> truncated
+    And <n_prompt> prompt tokens are processed
     And prometheus metrics are exposed
     And metric llamacpp:tokens_predicted is <n_predicted>
 
     Examples: Prompts
-      | prompt                           | n_predict | re_content                       | n_predicted |
-      | I believe the meaning of life is | 8         | (read\|going)+                   | 8           |
-      | Write a joke about AI            | 64        | (park\|friends\|scared\|always)+ | 32          |
+      | prompt                                                                    | n_predict | re_content                    | n_prompt | n_predicted | truncated |
+      | I believe the meaning of life is                                          | 8         | (read\|going)+                | 18       | 8           | not       |
+      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids)+ | 46       | 64          | not       |
+
+  Scenario: Completion prompt truncated
+    Given a prompt:
+    """
+    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
+    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
+    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
+    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
+    """
+    And a completion request with no api error
+    Then 64 tokens are predicted matching fun|Annaks|popcorns
+    And the completion is truncated
+    And 109 prompt tokens are processed
+
 
   Scenario Outline: OAI Compatibility
     Given a model <model>
@@ -44,11 +60,14 @@ Feature: llama.cpp server
     And streaming is <enable_streaming>
     Given an OAI compatible chat completions request with no api error
     Then <n_predicted> tokens are predicted matching <re_content>
+    And <n_prompt> prompt tokens are processed
+    And the completion is <truncated> truncated
 
     Examples: Prompts
-      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_predicted | enable_streaming |
-      | llama-2      | Book                        | What is the best book                | 8          | (Mom\|what)+           | 8           | disabled         |
-      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks\|happy\|bird)+ | 32          | enabled          |
+      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_prompt | n_predicted | enable_streaming | truncated |
+      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+          | 77       | 8           | disabled         | not       |
+      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird)+ | -1       | 64          | enabled          |           |
+
 
   Scenario: Tokenize / Detokenize
     When tokenizing:
-- 
cgit v1.2.3
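
Note on what the new steps exercise: with a 256-token KV cache split across 2 slots, each request effectively gets a 128-token context, so the 109-token Lorem ipsum prompt plus the requested 64 predicted tokens no longer fits and the scenario expects the server to report the completion as truncated. Below is a minimal illustrative sketch, not the project's actual step definitions (those live in the Python steps under examples/server/tests), of the kind of assertion these Gherkin steps describe. It assumes a llama.cpp server listening on localhost:8080 whose /completion response carries the `content`, `truncated` and `timings` fields the steps refer to; the helper name and parameters are hypothetical.

```python
# Illustrative sketch only: mirrors the new Gherkin steps, not the repo's steps.py.
# Assumes a llama.cpp server on localhost:8080 whose /completion JSON response
# includes "content", "truncated" and a "timings" object with token counts.
import re
import requests


def check_completion(prompt, n_predict, re_content=None,
                     expect_truncated=False, expect_n_prompt=None):
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()

    # "the completion is [not] truncated": the prompt plus generated tokens
    # exceeded the per-slot context, so the server had to drop prompt tokens.
    assert body.get("truncated", False) == expect_truncated, body

    # "<n_prompt> prompt tokens are processed" (-1 in the table means "don't check").
    if expect_n_prompt is not None and expect_n_prompt >= 0:
        assert body["timings"]["prompt_n"] == expect_n_prompt, body["timings"]

    # "<n_predicted> tokens are predicted matching <re_content>": the server may
    # stop earlier than n_predict (e.g. because of the 64-token server cap).
    assert body["timings"]["predicted_n"] <= n_predict, body["timings"]
    if re_content is not None:
        assert re.search(re_content, body["content"]), body["content"]


# Mirrors the first Examples row of the Completion outline.
check_completion("I believe the meaning of life is", 8,
                 re_content=r"(read|going)+",
                 expect_truncated=False, expect_n_prompt=18)
```

The same idea applies to the OAI Compatibility outline, except the request goes through the OpenAI-style chat completions endpoint and the expected values come from the extended Examples table (n_prompt, n_predicted, truncated).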