author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-08-12 15:14:32 +0200
committer: GitHub <noreply@github.com> 2024-08-12 15:14:32 +0200
commit: 8f43e551038af2547b5c01d0e9edd641c0e4bd29
tree: 07a4373620a9381d0b5c7189a475990a6feb48a5 /examples/server
parent: f5d1af61d79fb53ccfbac2e665e43208c07b083d
Merge mainline - Aug 12 2024 (#17)

* Merge mainline
* Fix after merge
* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
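
The README hunk in this commit documents two new endpoints, GET and POST `/lora-adapters`, where the POST body is a list of `{"id", "scale"}` objects. As a minimal sketch of how a client might drive the POST endpoint (the helper names `build_lora_payload` and `set_lora_adapters` and the `base_url` argument are illustrative assumptions, not part of the commit):

```python
import json
import urllib.request


def build_lora_payload(scales: dict[int, float]) -> list[dict]:
    """Build the JSON body for POST /lora-adapters from {adapter_id: scale}."""
    # Sort by adapter id so the payload is deterministic.
    return [{"id": i, "scale": float(s)} for i, s in sorted(scales.items())]


def set_lora_adapters(base_url: str, scales: dict[int, float]) -> dict:
    """POST the payload to a running server and return the decoded JSON reply.

    base_url is an assumption, e.g. "http://localhost:8080" as in the
    lora.feature test added by this commit.
    """
    body = json.dumps(build_lora_payload(scales)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/lora-adapters",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Going by the `handle_lora_adapters_apply` handler in the server.cpp hunk, every adapter's scale is reset to 0 before the posted values are applied, so any adapter left out of the payload ends up disabled. An `id` outside the loaded range makes the handler throw `invalid adapter id`.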
Diffstat (limited to 'examples/server')
-rw-r--r--  examples/server/README.md                      130
-rw-r--r--  examples/server/server.cpp                      75
-rw-r--r--  examples/server/tests/features/lora.feature     36
-rw-r--r--  examples/server/tests/features/steps/steps.py   21
-rw-r--r--  examples/server/tests/requirements.txt           1
-rw-r--r--  examples/server/utils.hpp                       18
6 files changed, 195 insertions, 86 deletions
diff --git a/examples/server/README.md b/examples/server/README.md
index 33a2b95c..e17595fe 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -207,47 +207,12 @@ model:
-hff, --hf-file FILE Hugging Face model file (default: unused)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment variable)
-retrieval:
-
- --context-file FNAME file to load context from (repeat to specify multiple files)
- --chunk-size N minimum length of embedded text chunks (default: 64)
- --chunk-separator STRING
- separator between chunks (default: '
- ')
-
-passkey:
-
- --junk N number of times to repeat the junk text (default: 250)
- --pos N position of the passkey in the junk text (default: -1)
-
-imatrix:
-
- -o, --output FNAME output file (default: 'imatrix.dat')
- --output-frequency N output the imatrix every N iterations (default: 10)
- --save-frequency N save an imatrix copy every N iterations (default: 0)
- --process-output collect data for the output tensor (default: false)
- --no-ppl do not compute perplexity (default: true)
- --chunk N start processing the input from chunk N (default: 0)
-
-bench:
-
- -pps is the prompt shared across parallel sequences (default: false)
- -npp n0,n1,... number of prompt tokens
- -ntg n0,n1,... number of text generation tokens
- -npl n0,n1,... number of parallel prompts
-
-embedding:
-
- --embd-normalize normalisation for embendings (default: 2) (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
- --embd-output-format empty = default, "array" = [[],[]...], "json" = openai style, "json+" = same "json" + cosine similarity matrix
- --embd-separator separator of embendings (default \n) for example "<#sep#>"
-
server:
--host HOST ip address to listen (default: 127.0.0.1)
--port PORT port to listen (default: 8080)
--path PATH path to serve static files from (default: )
- --embedding(s) enable embedding endpoint (default: disabled)
+ --embedding(s) restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)
--api-key KEY API key to use for authentication (default: none)
--api-key-file FNAME path to file containing API keys (default: none)
--ssl-key-file FNAME path to file a PEM-encoded SSL private key
@@ -267,7 +232,8 @@ server:
https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
-sps, --slot-prompt-similarity SIMILARITY
how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)
-
+ --lora-init-without-apply
+ load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled)
logging:
@@ -279,15 +245,6 @@ logging:
--log-file FNAME Specify a log filename (without extension)
--log-new Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"
--log-append Don't truncate the old log file.
-
-cvector:
-
- -o, --output FNAME output file (default: 'control_vector.gguf')
- --positive-file FNAME positive prompts file, one prompt per line (default: 'examples/cvector-generator/positive.txt')
- --negative-file FNAME negative prompts file, one prompt per line (default: 'examples/cvector-generator/negative.txt')
- --pca-batch N batch size used for PCA. Larger batch runs faster, but uses more memory (default: 100)
- --pca-iter N number of iterations used for PCA (default: 1000)
- --method {pca,mean} dimensionality reduction method to be used (default: pca)
```
@@ -411,7 +368,8 @@ node index.js
## API Endpoints
-- **GET** `/health`: Returns the current state of the server:
+### GET `/health`: Returns the current state of the server
+
- 503 -> `{"status": "loading model"}` if the model is still being loaded.
- 500 -> `{"status": "error"}` if the model failed to load.
- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
@@ -420,7 +378,7 @@ node index.js
If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
-- **POST** `/completion`: Given a `prompt`, it returns the predicted completion.
+### POST `/completion`: Given a `prompt`, it returns the predicted completion.
*Options:*
@@ -498,7 +456,7 @@ node index.js
`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
-### Result JSON
+**Response format**
- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.
@@ -537,7 +495,7 @@ Notice that each `probs` is an array of length `n_probs`.
- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
-- **POST** `/tokenize`: Tokenize a given text.
+### POST `/tokenize`: Tokenize a given text
*Options:*
@@ -545,13 +503,15 @@ Notice that each `probs` is an array of length `n_probs`.
`add_special`: Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
-- **POST** `/detokenize`: Convert tokens to text.
+### POST `/detokenize`: Convert tokens to text
*Options:*
`tokens`: Set the tokens to detokenize.
-- **POST** `/embedding`: Generate embedding of a given text just as [the embedding example](../embedding) does.
+### POST `/embedding`: Generate embedding of a given text
+
+The same as [the embedding example](../embedding) does.
*Options:*
@@ -559,7 +519,9 @@ Notice that each `probs` is an array of length `n_probs`.
`image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.
-- **POST** `/infill`: For code infilling. Takes a prefix and a suffix and returns the predicted completion as stream.
+### POST `/infill`: For code infilling.
+
+Takes a prefix and a suffix and returns the predicted completion as stream.
*Options:*
@@ -571,7 +533,7 @@ Notice that each `probs` is an array of length `n_probs`.
- **GET** `/props`: Return current server settings.
-### Result JSON
+**Response format**
```json
{
@@ -589,7 +551,9 @@ Notice that each `probs` is an array of length `n_probs`.
- `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
- `chat_template` - the model's original Jinja2 prompt template
-- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
+### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
+
+Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
*Options:*
@@ -641,7 +605,7 @@ Notice that each `probs` is an array of length `n_probs`.
}'
```
-- **POST** `/v1/embeddings`: OpenAI-compatible embeddings API.
+### POST `/v1/embeddings`: OpenAI-compatible embeddings API
*Options:*
@@ -675,9 +639,9 @@ Notice that each `probs` is an array of length `n_probs`.
}'
```
-- **GET** `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
+### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
-### Result JSON
+**Response format**
```json
[
@@ -738,7 +702,7 @@ Notice that each `probs` is an array of length `n_probs`.
]
```
-- **GET** `/metrics`: [Prometheus](https://prometheus.io/) compatible metrics exporter endpoint if `--metrics` is enabled:
+### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
Available metrics:
- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
@@ -750,13 +714,13 @@ Available metrics:
- `llamacpp:requests_processing`: Number of requests processing.
- `llamacpp:requests_deferred`: Number of requests deferred.
-- **POST** `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
+### POST `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
*Options:*
`filename`: Name of the file to save the slot's prompt cache. The file will be saved in the directory specified by the `--slot-save-path` server parameter.
-### Result JSON
+**Response format**
```json
{
@@ -770,13 +734,13 @@ Available metrics:
}
```
-- **POST** `/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
+### POST `/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
*Options:*
`filename`: Name of the file to restore the slot's prompt cache from. The file should be located in the directory specified by the `--slot-save-path` server parameter.
-### Result JSON
+**Response format**
```json
{
@@ -790,9 +754,9 @@ Available metrics:
}
```
-- **POST** `/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
+### POST `/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
-### Result JSON
+**Response format**
```json
{
@@ -801,6 +765,42 @@ Available metrics:
}
```
+### GET `/lora-adapters`: Get list of all LoRA adapters
+
+If an adapter is disabled, the scale will be set to 0.
+
+**Response format**
+
+```json
+[
+ {
+ "id": 0,
+ "path": "my_adapter_1.gguf",
+ "scale": 0.0
+ },
+ {
+ "id": 1,
+ "path": "my_adapter_2.gguf",
+ "scale": 0.0
+ }
+]
+```
+
+### POST `/lora-adapters`: Set list of LoRA adapters
+
+To disable an adapter, either remove it from the list below, or set scale to 0.
+
+**Request format**
+
+To know the `id` of the adapter, use GET `/lora-adapters`
+
+```json
+[
+ {"id": 0, "scale": 0.2},
+ {"id": 1, "scale": 0.8}
+]
+```
+
## More examples
### Change system prompt on runtime
diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 7813a295..360f571e 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -78,6 +78,7 @@ enum server_task_type {
SERVER_TASK_TYPE_SLOT_SAVE,
SERVER_TASK_TYPE_SLOT_RESTORE,
SERVER_TASK_TYPE_SLOT_ERASE,
+ SERVER_TASK_TYPE_SET_LORA,
};
struct server_task {
@@ -622,6 +623,7 @@ struct server_response {
struct server_context {
llama_model * model = nullptr;
llama_context * ctx = nullptr;
+ std::vector<llama_lora_adapter_container> lora_adapters;
gpt_params params;
@@ -677,7 +679,11 @@ struct server_context {
// dedicate one sequence to the system prompt
params.n_parallel += 1;
- std::tie(model, ctx) = llama_init_from_gpt_params(params);
+ llama_init_result llama_init = llama_init_from_gpt_params(params);
+
+ model = llama_init.model;
+ ctx = llama_init.context;
+ lora_adapters = llama_init.lora_adapters;
params.n_parallel -= 1; // but be sneaky about it
if (model == nullptr) {
LOG_ERROR("unable to load model", {{"model", params.model}});
@@ -900,7 +906,7 @@ struct server_context {
slot.params.stream = json_value(data, "stream", false);
slot.params.cache_prompt = json_value(data, "cache_prompt", false);
- slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);
+ slot.params.n_predict = json_value(data, "n_predict", json_value(data, "max_tokens", default_params.n_predict));
slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
slot.sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
@@ -969,6 +975,8 @@ struct server_context {
(prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_string()) ||
(prompt->is_array() && !prompt->empty() && prompt->at(0).is_number_integer())) {
slot.prompt = *prompt;
+ } else if (prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_array()) {
+ slot.prompt = prompt->at(0);
} else {
send_error(task, "\"prompt\" must be a string or an array of integers", ERROR_TYPE_INVALID_REQUEST);
return false;
@@ -1847,6 +1855,14 @@ struct server_context {
};
queue_results.send(result);
} break;
+ case SERVER_TASK_TYPE_SET_LORA:
+ {
+ llama_lora_adapters_apply(ctx, lora_adapters);
+ server_task_result result;
+ result.id = task.id;
+ result.data = json{{ "success", true }};
+ queue_results.send(result);
+ } break;
}
}
@@ -3325,6 +3341,55 @@ int main(int argc, char ** argv) {
return res.set_content(root.dump(), "application/json; charset=utf-8");
};
+ const auto handle_lora_adapters_list = [&](const httplib::Request & req, httplib::Response & res) {
+ res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
+ json result = json::array();
+ for (size_t i = 0; i < ctx_server.lora_adapters.size(); ++i) {
+ auto & la = ctx_server.lora_adapters[i];
+ result.push_back({
+ {"id", i},
+ {"path", la.path},
+ {"scale", la.scale},
+ });
+ }
+ res.set_content(result.dump(), "application/json");
+ res.status = 200; // HTTP OK
+ };
+
+ const auto handle_lora_adapters_apply = [&](const httplib::Request & req, httplib::Response & res) {
+ res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
+
+ const std::vector<json> body = json::parse(req.body);
+ int max_idx = ctx_server.lora_adapters.size();
+
+ // clear existing value
+ for (auto & la : ctx_server.lora_adapters) {
+ la.scale = 0.0f;
+ }
+
+ // set value
+ for (auto entry : body) {
+ int id = entry.at("id");
+ float scale = entry.at("scale");
+ if (0 <= id && id < max_idx) {
+ ctx_server.lora_adapters[id].scale = scale;
+ } else {
+ throw std::runtime_error("invalid adapter id");
+ }
+ }
+
+ server_task task;
+ task.type = SERVER_TASK_TYPE_SET_LORA;
+ const int id_task = ctx_server.queue_tasks.post(task);
+ ctx_server.queue_results.add_waiting_task_id(id_task);
+
+ server_task_result result = ctx_server.queue_results.recv(id_task);
+ ctx_server.queue_results.remove_waiting_task_id(id_task);
+
+ res.set_content(result.data.dump(), "application/json");
+ res.status = 200; // HTTP OK
+ };
+
auto handle_static_file = [](unsigned char * content, size_t len, const char * mime_type) {
return [content, len, mime_type](const httplib::Request &, httplib::Response & res) {
res.set_content(reinterpret_cast<const char*>(content), len, mime_type);
@@ -3363,7 +3428,6 @@ int main(int argc, char ** argv) {
// register API routes
svr->Get ("/health", handle_health);
- svr->Get ("/slots", handle_slots);
svr->Get ("/metrics", handle_metrics);
svr->Get ("/props", handle_props);
svr->Get ("/v1/models", handle_models);
@@ -3378,6 +3442,11 @@ int main(int argc, char ** argv) {
svr->Post("/v1/embeddings", handle_embeddings);
svr->Post("/tokenize", handle_tokenize);
svr->Post("/detokenize", handle_detokenize);
+ // LoRA adapters hotswap
+ svr->Get ("/lora-adapters", handle_lora_adapters_list);
+ svr->Post("/lora-adapters", handle_lora_adapters_apply);
+ // Save & load slots
+ svr->Get ("/slots", handle_slots);
if (!params.slot_save_path.empty()) {
// only enable slot endpoints if slot_save_path is set
svr->Post("/slots/:id_slot", handle_slots_action);
diff --git a/examples/server/tests/features/lora.feature b/examples/server/tests/features/lora.feature
new file mode 100644
index 00000000..7b85988a
--- /dev/null
+++ b/examples/server/tests/features/lora.feature
@@ -0,0 +1,36 @@
+@llama.cpp
+@lora
+Feature: llama.cpp server
+
+ Background: Server startup
+ Given a server listening on localhost:8080
+ And a model url https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/stories15M_MOE-F16.gguf
+ And a model file stories15M_MOE-F16.gguf
+ And a model alias stories15M_MOE
+ And a lora adapter file from https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/moe_shakespeare15M.gguf
+ And 42 as server seed
+ And 1024 as batch size
+ And 1024 as ubatch size
+ And 2048 KV cache size
+ And 64 max tokens to predict
+ And 0.0 temperature
+ Then the server is starting
+ Then the server is healthy
+
+ Scenario: Completion LoRA disabled
+ Given switch off lora adapter 0
+ Given a prompt:
+ """
+ Look in thy glass
+ """
+ And a completion request with no api error
+ Then 64 tokens are predicted matching little|girl|three|years|old
+
+ Scenario: Completion LoRA enabled
+ Given switch on lora adapter 0
+ Given a prompt:
+ """
+ Look in thy glass
+ """
+ And a completion request with no api error
+ Then 64 tokens are predicted matching eye|love|glass|sun
diff --git a/examples/server/tests/features/steps/steps.py b/examples/server/tests/features/steps/steps.py
index df0814cc..6705a34f 100644
--- a/examples/server/tests/features/steps/steps.py
+++ b/examples/server/tests/features/steps/steps.py
@@ -7,6 +7,7 @@ import subprocess
import sys
import threading
import time
+import requests
from collections.abc import Sequence
from contextlib import closing
from re import RegexFlag
@@ -70,6 +71,7 @@ def step_server_config(context, server_fqdn: str, server_port: str):
context.user_api_key = None
context.response_format = None
context.temperature = None
+ context.lora_file = None
context.tasks_result = []
context.concurrent_tasks = []
@@ -82,6 +84,12 @@ def step_download_hf_model(context, hf_file: str, hf_repo: str):
context.model_hf_file = hf_file
context.model_file = os.path.basename(hf_file)
+@step('a lora adapter file from {lora_file_url}')
+def step_download_lora_file(context, lora_file_url: str):
+ file_name = lora_file_url.split('/').pop()
+ context.lora_file = f'../../../{file_name}'
+ with open(context.lora_file, 'wb') as f:
+ f.write(requests.get(lora_file_url).content)
@step('a model file {model_file}')
def step_model_file(context, model_file: str):
@@ -849,6 +857,17 @@ async def step_erase_slot(context, slot_id):
context.response = response
+@step('switch {on_or_off} lora adapter {lora_id:d}')
+@async_run_until_complete
+async def toggle_lora_adapter(context, on_or_off: str, lora_id: int):
+ async with aiohttp.ClientSession() as session:
+ async with session.post(f'{context.base_url}/lora-adapters',
+ json=[{'id': lora_id, 'scale': 1 if on_or_off == 'on' else 0}],
+ headers={"Content-Type": "application/json"}) as response:
+ context.response = response
+ print([{'id': lora_id, 'scale': 1 if on_or_off == 'on' else 0}])
+
+
@step('the server responds with status code {status_code:d}')
def step_server_responds_with_status_code(context, status_code):
assert context.response.status == status_code
@@ -1326,6 +1345,8 @@ def start_server_background(context):
server_args.extend(['--grp-attn-w', context.n_ga_w])
if context.debug:
server_args.append('--verbose')
+ if context.lora_file:
+ server_args.extend(['--lora', context.lora_file])
if 'SERVER_LOG_FORMAT_JSON' not in os.environ:
server_args.extend(['--log-format', "text"])
diff --git a/examples/server/tests/requirements.txt b/examples/server/tests/requirements.txt
index 2c741ea1..f2d7e5c5 100644
--- a/examples/server/tests/requirements.txt
+++ b/examples/server/tests/requirements.txt
@@ -4,3 +4,4 @@ huggingface_hub~=0.20.3
numpy~=1.26.4
openai~=1.30.3
prometheus-client~=0.20.0
+requests~=2.32.3
diff --git a/examples/server/utils.hpp b/examples/server/utils.hpp
index db6b3b74..e6a1f069 100644
--- a/examples/server/utils.hpp
+++ b/examples/server/utils.hpp
@@ -355,24 +355,6 @@ static json oaicompat_completion_params_parse(
llama_params["__oaicompat"] = true;
- // Map OpenAI parameters to llama.cpp parameters
- //
- // For parameters that are defined by the OpenAI documentation (e.g.
- // temperature), we explicitly specify OpenAI's intended default; we
- // need to do that because sometimes OpenAI disagrees with llama.cpp
- //
- // https://platform.openai.com/docs/api-reference/chat/create
- llama_sampling_params default_sparams;
- llama_params["model"] = json_value(body, "model", std::string("unknown"));
- llama_params["frequency_penalty"] = json_value(body, "frequency_penalty", 0.0);
- llama_params["logit_bias"] = json_value(body, "logit_bias", json::object());
- llama_params["n_predict"] = json_value(body, "max_tokens", -1);
- llama_params["presence_penalty"] = json_value(body, "presence_penalty", 0.0);
- llama_params["seed"] = json_value(body, "seed", LLAMA_DEFAULT_SEED);
- llama_params["stream"] = json_value(body, "stream", false);
- llama_params["temperature"] = json_value(body, "temperature", 1.0);
- llama_params["top_p"] = json_value(body, "top_p", 1.0);
-
// Apply chat template to the list of messages
llama_params["prompt"] = format_chat(model, chat_template, body.at("messages"));