From fd72d2d2a5e79d61ccef6af3d15f16e5e5cbc352 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sat, 9 Mar 2024 10:30:04 +0100
Subject: server: tests: add truncated prompt tests, better kv cache size
 (#5933)

* server: tests: add truncated prompt tests, better size

* server, tests : update regex

---------

Co-authored-by: Georgi Gerganov
---
 examples/server/tests/features/server.feature | 41 ++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/examples/server/tests/features/server.feature b/examples/server/tests/features/server.feature
index 878ac136..aa132fa3 100644
--- a/examples/server/tests/features/server.feature
+++ b/examples/server/tests/features/server.feature
@@ -10,11 +10,10 @@ Feature: llama.cpp server
       # KV Cache corresponds to the total amount of tokens
       # that can be stored across all independent sequences: #4130
       # see --ctx-size and #5568
-    And 32 KV cache size
-    And 512 as batch size
-    And 1 slots
-    And embeddings extraction
-    And 32 server max tokens to predict
+    And 256 KV cache size
+    And 32 as batch size
+    And 2 slots
+    And 64 server max tokens to predict
     And prometheus compatible metrics exposed
     Then the server is starting
     Then the server is healthy
@@ -23,18 +22,35 @@ Feature: llama.cpp server
     Then the server is ready
     And all slots are idle
 
+
   Scenario Outline: Completion
     Given a prompt <prompt>
     And <n_predict> max tokens to predict
     And a completion request with no api error
     Then <n_predicted> tokens are predicted matching <re_content>
+    And the completion is <truncated> truncated
+    And <n_prompt> prompt tokens are processed
     And prometheus metrics are exposed
     And metric llamacpp:tokens_predicted is <n_predicted>
 
     Examples: Prompts
-      | prompt                           | n_predict | re_content                       | n_predicted |
-      | I believe the meaning of life is | 8         | (read\|going)+                   | 8           |
-      | Write a joke about AI            | 64        | (park\|friends\|scared\|always)+ | 32          |
+      | prompt                                                                    | n_predict | re_content                    | n_prompt | n_predicted | truncated |
+      | I believe the meaning of life is                                          | 8         | (read\|going)+                | 18       | 8           | not       |
+      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids)+ | 46       | 64          | not       |
+
+  Scenario: Completion prompt truncated
+    Given a prompt:
+    """
+    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
+    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
+    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
+    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
+    """
+    And a completion request with no api error
+    Then 64 tokens are predicted matching fun|Annaks|popcorns
+    And the completion is truncated
+    And 109 prompt tokens are processed
+
 
   Scenario Outline: OAI Compatibility
     Given a model <model>
@@ -44,11 +60,14 @@ Feature: llama.cpp server
     And streaming is <enable_streaming>
     Given an OAI compatible chat completions request with no api error
     Then <n_predicted> tokens are predicted matching <re_content>
+    And <n_prompt> prompt tokens are processed
+    And the completion is <truncated> truncated
 
     Examples: Prompts
-      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_predicted | enable_streaming |
-      | llama-2      | Book                        | What is the best book                | 8          | (Mom\|what)+           | 8           | disabled         |
-      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks\|happy\|bird)+ | 32          | enabled          |
+      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_prompt | n_predicted | enable_streaming | truncated |
+      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+          | 77       | 8           | disabled         | not       |
+      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird)+ | -1       | 64          | enabled          |           |
+
 
   Scenario: Tokenize / Detokenize
     When tokenizing:
-- 
cgit v1.2.3
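
Note on what the new steps exercise: with a 256-token KV cache split across 2 slots, each request effectively gets a 128-token context, so the 109-token Lorem ipsum prompt plus the requested 64 predicted tokens no longer fits and the scenario expects the server to report the completion as truncated. Below is a minimal illustrative sketch, not the project's actual step definitions (those live in the Python steps under examples/server/tests), of the kind of assertion these Gherkin steps describe. It assumes a llama.cpp server listening on localhost:8080 whose /completion response carries the `content`, `truncated` and `timings` fields the steps refer to; the helper name and parameters are hypothetical.

```python
# Illustrative sketch only: mirrors the new Gherkin steps, not the repo's steps.py.
# Assumes a llama.cpp server on localhost:8080 whose /completion JSON response
# includes "content", "truncated" and a "timings" object with token counts.
import re
import requests


def check_completion(prompt, n_predict, re_content=None,
                     expect_truncated=False, expect_n_prompt=None):
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()

    # "the completion is [not] truncated": the prompt plus generated tokens
    # exceeded the per-slot context, so the server had to drop prompt tokens.
    assert body.get("truncated", False) == expect_truncated, body

    # "<n_prompt> prompt tokens are processed" (-1 in the table means "don't check").
    if expect_n_prompt is not None and expect_n_prompt >= 0:
        assert body["timings"]["prompt_n"] == expect_n_prompt, body["timings"]

    # "<n_predicted> tokens are predicted matching <re_content>": the server may
    # stop earlier than n_predict (e.g. because of the 64-token server cap).
    assert body["timings"]["predicted_n"] <= n_predict, body["timings"]
    if re_content is not None:
        assert re.search(re_content, body["content"]), body["content"]


# Mirrors the first Examples row of the Completion outline.
check_completion("I believe the meaning of life is", 8,
                 re_content=r"(read|going)+",
                 expect_truncated=False, expect_n_prompt=18)
```

The same idea applies to the OAI Compatibility outline, except the request goes through the OpenAI-style chat completions endpoint and the expected values come from the extended Examples table (n_prompt, n_predicted, truncated).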