summaryrefslogtreecommitdiff
path: root/examples/server
AgeCommit message (Collapse)Author
2024-01-13server : fix crash with multimodal models without BOS token (#4904)makomk
2024-01-12llama : ggml-backend integration (#4766)slaren
* llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-11server : fix infill when prompt is empty (#4833)Georgi Gerganov
2024-01-11server : implement credentialed CORS (#4514)Laura
* Implement credentialed CORS according to MDN * Fix syntax error * Move validate_api_key up so it is defined before its first usage
2024-01-11server : support for multiple api keys (#4864)Michael Coppola
* server: added support for multiple api keys, added loading api keys from file * minor: fix whitespace * added file error handling to --api-key-file, changed code to better reflect current style * server: update README.md for --api-key-file --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11server : add `LOG_INFO` when model is successfully loaded (#4881)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too * used LOG_INFO after successful model loading
2024-01-11server : fix typo in model name (#4876)Isaac McFadyen
2024-01-11server : update readme to document the new `/health` endpoint (#4866)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too
2024-01-11server : fix build + rename enums (#4870)Georgi Gerganov
2024-01-10server : add a `/health` endpoint (#4860)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line
2024-01-09server : update readme about token probs (#4777)Behnam M
* updated server readme to reflect the gg/server-token-probs-4088 commit added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`. * simplified the `completion_probabilities` JSON schema It's now easier to understand what the structure of `completion_probabilities` looks like. * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09server : add api-key flag to documentation (#4832)Zsapi
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-07server : fix n_predict check (#4798)Georgi Gerganov
2024-01-04server : send token probs for "stream == false" (#4714)Georgi Gerganov
2024-01-04server : fix options in README.md (#4765)Michael Coppola
* fix examples/server/README.md * minor : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03server : throw an error when `slot unavailable` (#4741)Justin Parker
2024-01-02server : add token counts to html footer (#4738)Phil H
* server: add token counts to stats * server: generate hpp --------- Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02editorconfig : fix whitespace and indentation #4710Georgi Gerganov
2024-01-02server : add --override-kv parameter (#4710)minarchist
* Changes to server to allow metadata override * documentation * flake.nix: expose full scope in legacyPackages * flake.nix: rocm not yet supported on aarch64, so hide the output * flake.nix: expose checks * workflows: nix-ci: init; build flake outputs * workflows: nix-ci: add a job for eval * workflows: weekly `nix flake update` * workflows: nix-flakestry: drop tag filters ...and add a job for flakehub.com * workflows: nix-ci: add a qemu job for jetsons * flake.nix: suggest the binary caches * flake.lock: update to a commit recently cached by nixpkgs-cuda-ci --------- Co-authored-by: John <john@jLap.lan> Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2023-12-30clip : refactor + bug fixes (#4696)Georgi Gerganov
* clip : refactor + bug fixes ggml-ci * server : add log message
2023-12-29cmake : fix ld warning duplicate libraries libllama.a (#4671)Cuong Trinh Manh
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.
2023-12-29server : replace sleep with condition variables (#4673)Justine Tunney
The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b
2023-12-29server : fix OpenAI server sampling w.r.t. penalty. (#4675)SakuraUmi
2023-12-29server : allow to generate multimodal embeddings (#4681)Karthik Sethuraman
2023-12-28Fix OpenAI server sampling w.r.t. temp and seed (#4668)Justine Tunney
The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29
2023-12-23server : allow to specify custom prompt for penalty calculation (#3727)Alexey Parfenov
2023-12-17server : disable llm logs if SERVER_VERBOSE is off (#3792)olexiyb
2023-12-17server : fix grammar being ignored (#4494)AdithyanI
Fix bug in identifying the grammar.
2023-12-17server : fix possible ambiguity in content type charset (#4501)Alexey Parfenov
2023-12-17server : allow requests larger than 8K (#4500)mzcu
2023-12-15server : add optional API Key Authentication example (#4441)ShadovvBeast
* Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-13server : fix handling of characters that span multiple tokens when streaming ↵shibe2
(#4446)
2023-12-12server : tweak default sampling parameters (#4367)kalomaze
* Set a more typical Top P setting as the default * Update temp max
2023-12-12english : use `typos` to fix comments and logs (#4354)Richard Kiss
2023-12-12server : fix local model name in server (#4420)Vladimir Zorin
2023-12-10Update README.md (#4388)Yueh-Po Peng
Fix small typo.
2023-12-07llama : per-layer KV cache + quantum K cache (#4309)Georgi Gerganov
* per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>
2023-12-06server : recognize cache_prompt parameter in OAI API (#4347)Georgi Gerganov
2023-12-03server : fix OpenAI API `stop` field to be optional (#4299)Ed Lee
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84ae3bcbf0d617b7ee6a5413bcbd58af)
2023-12-03py : add grammar to oai like api (#4294)Rickard Edén
2023-12-01llama : support optional tensors (#4283)Georgi Gerganov
2023-12-01server : add --log-disable to disable logging to file (#4260)Ziad Ben Hadj-Alouane
* * add --log-disable to disable logging to file in the server example * * typo fix
2023-12-01server : add single-client multi-prompt support (#4232)Ziad Ben Hadj-Alouane
* * add multiprompt support * * cleanup * * more cleanup * * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests * * remove all references to mutex_multitasks * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * * change to set --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-11-30py : fix oai proxy (#3972)rhjdvsgsgks
* fix oai proxy fix generation not stoped while bot stop talking in chat mode fix possible `slot_id` not exist response for cors (and pre flight) * oai proxy: workaround for some client (such as Chatbox) * use stop as separator to replace hardcoded `\n`
2023-11-25server : OAI API compatibility (#4198)Georgi Gerganov
* Add openai-compatible POST /v1/chat/completions API endpoint to server example * fix code style * Update server README.md * Improve server README.md * Fix server.cpp code style according to review * server : some style changes * server : indentation * server : enable special tokens during tokenization by default * server : minor code style * server : change random string generator * straightforward /v1/models endpoint --------- Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com> Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
2023-11-23Fix incorrect format strings and uninitialized variables. (#4133)Haohui Mai
* Fix incorrect format strings and uninitialized variables. * Address comments * Add the missing include statement
2023-11-19server : relay error messages (#4131)SoftwareRenderer
2023-11-16Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)Kerfuffle
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-10server : fix crash when prompt exceeds context size (#3996)Alexey Parfenov
2023-11-10server : allow continue edit on completion mode (#3950)Jhen-Jie Hong
* server : allow continue edit on completion mode * server : handle abort case in runCompletion * server : style improvement