path: root/examples/server/README.md
Age  Commit message  Author
2025-06-09  Docs update (#509)  [saood06]
* use npm as deps manager and vite as bundler
* update XTC docs
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-06-03  Adding top-n-sigma sampler (#489)  [Kawrakow]
* Adding top-n-sigma sampler
* Fix typos in XTC PR
* Update README.md for main and server
* More README
* More README
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
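A minimal sketch of requesting the new sampler through the server's completion endpoint, assuming a server already running at http://localhost:8080 and assuming the request field is named top_n_sigma (the field name is not stated in this log):

import json
import urllib.request

# Hypothetical request: the "top_n_sigma" field name is an assumption.
payload = {
    "prompt": "The capital of France is",
    "n_predict": 32,
    "top_n_sigma": 1.0,  # keep only tokens whose logits lie within n sigma of the max
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])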
2025-05-28  set cache_prompt default to true (#465)  [saood06]
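A minimal sketch of a /completion request, assuming a server at http://localhost:8080; after this change cache_prompt defaults to true, so the explicit field below only makes the behavior visible:

import json
import urllib.request

payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 64,
    "cache_prompt": True,  # default since this change (#465); reuse cached prompt prefix across requests
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])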
2024-08-12  Merge mainline - Aug 12 2024 (#17)  [Kawrakow]
* Merge mainline
* Fix after merge
* Remove CI check
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27  Merge mainline llama.cpp (#3)  [Kawrakow]
* Merging mainline - WIP
* Merging mainline - WIP. AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-06-13  `build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)  [Olivier Chafik]
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server gitignore llama-server
* server: simplify nix package
* main: update refs -> llama fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc". This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.
* add hot topic notice to README.md
* Update README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-07  server: update cache_prompt documentation [no ci] (#7745)  [Johannes Gäßler]
2024-05-19  server: add test for token probs (#7347)  [Johannes Gäßler]
2024-05-18  server: correct --threads documentation [no ci] (#7362)  [Johannes Gäßler]
2024-05-17  [Server] Added --verbose option to README [no ci] (#7335)  [Leon Knauer]
2024-05-14  docs: Fix typo and update description for --embeddings flag (#7026)  [Ryuei]
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-08  server : add_special option for tokenize endpoint (#7059)  [Johan]
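A minimal sketch of calling the tokenize endpoint with the new option, assuming a server at http://localhost:8080 and the /tokenize path used by the server example:

import json
import urllib.request

# add_special controls whether special tokens (e.g. BOS) are prepended to the result.
payload = {"content": "Hello, world!", "add_special": True}
req = urllib.request.Request(
    "http://localhost:8080/tokenize",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["tokens"])  # list of token ids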
2024-05-07  server: fix incorrectly reported token probabilities (#7125)  [Johannes Gäßler]
* server: normalize token probabilities
* fix temperature == 0.0f
2024-05-07  server : update readme with undocumented options (#7013)  [Kyle Mistele]
2024-04-29  build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964)  [Olivier Chafik]
* readme: cmake . -B build && cmake --build build
* build: fix typo
* build: drop implicit . from cmake config command
* build: remove another superfluous .
* build: update MinGW cmake commands
* Update README-sycl.md
* build: reinstate --config Release as not the default w/ some generators + document how to build Debug
* build: revert more --config Release
* build: nit / remove -H from cmake example
* build: reword debug instructions around single/multi config split
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-04-12  JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)  [Olivier Chafik]
* json: rename python schema converter to make import easier
* server: skip null json_schema / grammar fields
* json: deps management for primitive rules (+ allow null values)
* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`
* grammars: add troubleshooting section to readme
* json: cap length of numbers to 15 digits before/after decimal point (avoids infinite gen, e.g. "one third" -> `0.333333333333...`)
* json: unify all repetition code (w/ or w/o sep)
* json: support string minLength/maxLength
* server+json: update server/README w/ result_format
* nits
* json: fix type error w/ python 3.8
* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)
* json: simplify DOT `{"type": "string", "pattern": "^.$"}`
* json: remove recursion in opt_repetitions (avoids Python stack overflow)
* json: rm dead code
* json: rm useless assert & ggml.h import
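A minimal sketch of constraining generation with a schema via the json_schema field of /completion (the field name is taken from the commit above), assuming a server at http://localhost:8080; the schema itself is only an illustrative example:

import json
import urllib.request

# Constrain the output to a JSON object matching this schema.
payload = {
    "prompt": "Extract the name and age from: 'Alice is 30 years old.'\n",
    "n_predict": 128,
    "json_schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    },
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])  # generated text constrained to the schema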
2024-04-08  llama : save and restore kv cache for single seq id (#6341)  [Jan Boon]
* llama : save and restore kv cache for single seq id
* remove trailing whitespace
* respond error in case there's no space in the kv cache
* add kv seq save restore to test case
* add --slot-save-path arg to enable save restore and restrict save location
* Returning 0 for some cases, instead of asserting.
* cleanup error cases
* rename sequence state functions
* rename state get set functions
* add previous function names back in with DEPRECATED notice
* update doc
* adjust endpoints to preferred style
* fix restoring zero cell count
* handle seq rm return value
* unused param
* keep in the size check
* fix return types
* add server test case for slot save restore
* cleanup
* add cake
* cleanup style
* add special
* removing a whole sequence never fails
* move sequence state file functionality from server to llama to match session api and add version tags
* catch exceptions on save as well
* error log messages
* check types for stricter restore
* update server doc
* readme : update API changes date
* strict filename validation
* move include, reject bom as well
* also reject empty filename
* reject whitespace and trailing dot
Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
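A minimal sketch of saving and restoring a slot's KV cache over HTTP, assuming the server was started with --slot-save-path, listens on http://localhost:8080, and exposes /slots/<id>?action=save|restore; the endpoint shape follows the server docs and should be treated as an assumption, not something stated in this log:

import json
import urllib.request

BASE = "http://localhost:8080"

def slot_action(slot_id, action, filename):
    # POST /slots/<id>?action=save or ?action=restore with a filename
    # relative to the directory given via --slot-save-path.
    req = urllib.request.Request(
        f"{BASE}/slots/{slot_id}?action={action}",
        data=json.dumps({"filename": filename}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(slot_action(0, "save", "slot0.bin"))     # persist slot 0's cache to disk
print(slot_action(0, "restore", "slot0.bin"))  # load it back later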
2024-04-04  server : remove obsolete --memory-f32 option  [Georgi Gerganov]
2024-04-03  A few small fixes to server's README docs (#6428)  [Fattire]
* Typo fix to server's README.md: fix minor typo ("tonen") in server README.
* server readme grammar/style fixes: quickly went through this file to look for inconsistencies in presentation of defaults and flag options, and looked for typos and grammar issues. Not perfect, but hopefully improved.
* Update README.md: remove an extra space before newline.
2024-03-26  cuda : rename build flag to LLAMA_CUDA (#6299)  [slaren]
2024-03-25  Server: clean up OAI params parsing function (#6284)  [Xuan Son Nguyen]
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
2024-03-23  common: llama_load_model_from_url split support (#6192)  [Pierrick Hymbert]
* llama: llama_split_prefix fix strncpy does not include string termination
  common: llama_load_model_from_url:
  - fix header name case sensitive
  - support downloading additional split in parallel
  - hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
* spacing
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-23  server: docs: `--threads` and `--threads-batch`, `--ubatch-size`, `--log-disable` (#6254)  [Pierrick Hymbert]
2024-03-21  server : update readme doc from `slot_id` to `id_slot` (#6213)  [Jan Boon]
2024-03-17  common: llama_load_model_from_url using --model-url (#6098)  [Pierrick Hymbert]
* common: llama_load_model_from_url with libcurl dependency
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11  Update server docker image URLs (#5997)  [Jakub N]
2024-03-11  Server: format error to json (#5961)  [Xuan Son Nguyen]
* server: format error to json
* server: do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
2024-03-09  server : clarify some items in the readme (#5957)  [Georgi Gerganov]
* server : clarify some items in the readme
* server : fix typo
2024-03-09  Server: reorganize some http logic (#5939)  [Xuan Son Nguyen]
* refactor static file handler
* use set_pre_routing_handler for validate_api_key
* merge embedding handlers
* correct http verb for endpoints
* fix embedding response
* fix test case CORS Options
* fix code style
2024-03-09  server : add SSL support (#5926)  [Gabe Goodhart]
* add cmake build toggle to enable ssl support in server
* add flags for ssl key/cert files and use SSLServer if set. All SSL setup is hidden behind CPPHTTPLIB_OPENSSL_SUPPORT, in the same way that the base httplib hides the SSL support
* Update readme for SSL support in server
* Add LLAMA_SERVER_SSL variable setup to top-level Makefile
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-03-07  server : refactor (#5882)  [Georgi Gerganov]
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can timeout
* server: refactor: clean up http code (#5912)
* server : avoid n_available var
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests
* server : add comments
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
2024-03-03  server : init http requests thread pool with --parallel if set (#5836)  [Pierrick Hymbert]
2024-03-01  server : remove api_like_OAI.py proxy script (#5808)  [Georgi Gerganov]
2024-03-01  server: allow to override threads server pool with --threads-http (#5794)  [Pierrick Hymbert]
2024-02-25  server: docs - refresh and tease a little bit more the http server (#5718)  [Pierrick Hymbert]
* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
* Update examples/server/README.md
* Update examples/server/README.md
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25  server: logs - unified format and --log-format option (#5700)  [Pierrick Hymbert]
* server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs ensure json value does not raise error
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25  server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)  [Pierrick Hymbert]
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue: when 2 tasks are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
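A minimal sketch of scraping the new endpoint, assuming a server at http://localhost:8080 started with metrics enabled (in current server builds that is the --metrics flag; treat the flag name as an assumption, it is not stated in this log):

import urllib.request

# /metrics returns plain-text Prometheus exposition format, not JSON.
with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    for line in resp.read().decode("utf-8").splitlines():
        if not line.startswith("#"):  # skip HELP/TYPE comment lines
            print(line)               # e.g. a counter such as total prompt tokens processed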
2024-02-24  server: init functional tests (#5566)  [Pierrick Hymbert]
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-22  server : clarify some params in the docs (#5640)  [Alexey Parfenov]
2024-02-22  Add docs for llama_chat_apply_template (#5645)  [Xuan Son Nguyen]
* add docs for llama_chat_apply_template
* fix typo
2024-02-21  server: health: fix race condition on slots data using tasks queue (#5634)  [Pierrick Hymbert]
* server: health: fix race condition on slots data using tasks queue
* server: health: include_slots only if slots_endpoint
* fix compile warning: task.target_id not initialized
2024-02-20  server : health endpoint configurable failure on no slot (#5594)  [Pierrick Hymbert]
2024-02-18  common, server : surface min_keep as its own parameter (#5567)  [Robey Holderith]
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
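A minimal sketch of passing the newly surfaced parameter in a /completion request, assuming a server at http://localhost:8080:

import json
import urllib.request

payload = {
    "prompt": "Once upon a time",
    "n_predict": 32,
    "min_keep": 5,  # samplers must keep at least this many candidate tokens
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])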
2024-02-18  server : slots monitoring endpoint (#5550)  [Pierrick Hymbert]
2024-02-18  server : enhanced health endpoint (#5548)  [Pierrick Hymbert]
* server: enrich health endpoint with available slots, return 503 if no slots are available
* server: document new status "no slot available" in the README.md
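A minimal sketch of polling the health endpoint and reacting to the 503 case described above, assuming a server at http://localhost:8080:

import json
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080/health") as resp:
        print("server is up:", json.load(resp))          # health payload with slot info
except urllib.error.HTTPError as err:
    if err.code == 503:
        print("server is up but has no slot available")  # the 503 case added above
    else:
        raise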
2024-02-18  server : --n-predict option document and cap to max value (#5549)  [Pierrick Hymbert]
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix print usage LF in new --n-predict option
2024-02-16  server : add "samplers" param to control the samplers order (#5494)  [Alexey Parfenov]
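A minimal sketch of overriding the sampler order per request, assuming a server at http://localhost:8080; the sampler names below are common llama.cpp sampler names and, as far as this log is concerned, an assumption:

import json
import urllib.request

payload = {
    "prompt": "Write a haiku about autumn:",
    "n_predict": 48,
    # Apply only these samplers, in exactly this order.
    "samplers": ["top_k", "top_p", "min_p", "temperature"],
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])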
2024-02-16  ggml : add numa options (#5377)  [bmwl]
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode not being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remove enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h: align enum values
* Update ggml.c: remove whitespace
* Update ggml.c: align parameters
* Update examples/server/server.cpp: remove whitespace and align brace
* Update common/common.cpp: remove whitespace and align brace
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c: simplified return for platforms without NUMA support
* removed redundant else from cli argument processing of --numa
* whitespace
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-11  server : allow to specify tokens as strings in logit_bias (#5003)  [Alexey Parfenov]
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
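A minimal sketch of the feature above, biasing by token id and by string in the same request, assuming a server at http://localhost:8080 and the [token, bias] pair format used by the server's logit_bias option; the concrete token id and strings are illustrative only:

import json
import urllib.request

payload = {
    "prompt": "My favorite programming language is",
    "n_predict": 16,
    "logit_bias": [
        [" Python", 2.0],  # bias a token given as a string (what this change adds)
        [4242, -1.0],      # bias a token given by numeric id (hypothetical id)
        ["Perl", False],   # false bans the token(s) outright
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])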
2024-02-07  server : update `/props` with "total_slots" value (#5373)  [Justin Parker]
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
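A minimal sketch of reading the new value from /props, assuming a server at http://localhost:8080:

import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/props") as resp:
    props = json.load(resp)

# total_slots reports how many parallel slots the server was started with (--parallel / -np).
print("total_slots:", props["total_slots"])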