Diffstat (limited to 'examples/server/tests/features/results.feature')
-rw-r--r-- | examples/server/tests/features/results.feature | 52 |
1 files changed, 44 insertions, 8 deletions
diff --git a/examples/server/tests/features/results.feature b/examples/server/tests/features/results.feature
index aa0b8d0c..5deb278c 100644
--- a/examples/server/tests/features/results.feature
+++ b/examples/server/tests/features/results.feature
@@ -70,12 +70,48 @@ Feature: Results
     Then all predictions are equal
     Examples:
       | n_parallel | temp |
-      | 1 | 0.0 |
-      | 2 | 0.0 |
-      | 4 | 0.0 |
-      | 1 | 1.0 |
-      # FIXME: These tests fail on master. The problem seems to be the unified KV cache.
+      | 1 | 0.0 |
+      | 2 | 0.0 |
+      | 4 | 0.0 |
+      | 1 | 1.0 |
+      # FIXME: These tests fail on master.
+      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
       # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
-      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574 .
-      # | 2 | 1.0 |
-      # | 4 | 1.0 |
+      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
+      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
+      # | 2 | 1.0 |
+      # | 4 | 1.0 |
+
+  Scenario Outline: consistent token probs with same seed and prompt
+    Given <n_slots> slots
+    And <n_kv> KV cache size
+    And 1.0 temperature
+    And <n_predict> max tokens to predict
+    Then the server is starting
+    Then the server is healthy
+
+    Given 1 prompts "The meaning of life is" with seed 42
+    And concurrent completion requests
+    # Then the server is busy # Not all slots will be utilized.
+    Then the server is idle
+    And all slots are idle
+
+    Given <n_parallel> prompts "The meaning of life is" with seed 42
+    And concurrent completion requests
+    # Then the server is busy # Not all slots will be utilized.
+    Then the server is idle
+    And all slots are idle
+
+    Then all token probabilities are equal
+    Examples:
+      | n_slots | n_kv | n_predict | n_parallel |
+      | 4 | 1024 | 1 | 1 |
+      | 4 | 1024 | 1 | 4 |
+      # FIXME: These tests fail on master.
+      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
+      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
+      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
+      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
+      # | 4 | 1024 | 100 | 1 |
+      # This test still fails even the above patches; the first token probabilities are already different.
+      # | 4 | 1024 | 100 | 4 |
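
The new scenario ends with the step "all token probabilities are equal". The commit shown here does not include the step implementation, so the following is only a minimal sketch of what such a comparison could look like. The function name `all_token_probs_equal`, the `(token, probability)` data shape, and the sample values are all assumptions made for illustration; they are not taken from the llama.cpp test suite.

```python
from math import isclose


def all_token_probs_equal(completions, rel_tol=0.0, abs_tol=0.0):
    """Return True if every completion matches the first one token by token.

    `completions` is a list of completions; each completion is a list of
    (token, probability) pairs. This shape is hypothetical, chosen only
    for the sketch. With both tolerances at 0.0 the comparison is exact.
    """
    if not completions:
        return True
    reference = completions[0]
    for other in completions[1:]:
        # Different lengths mean the generations already diverged.
        if len(other) != len(reference):
            return False
        for (ref_tok, ref_p), (tok, p) in zip(reference, other):
            if ref_tok != tok:
                return False
            if not isclose(ref_p, p, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


# Made-up sample data, not real server output.
a = [("42", 0.61), (".", 0.17)]
b = [("42", 0.61), (".", 0.17)]
c = [("42", 0.60), (".", 0.17)]

print(all_token_probs_equal([a, b]))                # True
print(all_token_probs_equal([a, c]))                # False
print(all_token_probs_equal([a, c], abs_tol=0.02))  # True
```

An exact comparison by default matches the intent of the scenario, which is to show that the same seed and prompt yield identical probabilities whether one request or several run concurrently. The optional tolerances are there only to show how small floating-point differences could be separated from larger divergences.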