Age | Commit message (Collapse) | Author |
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server
gitignore llama-server
* server: simplify nix package
* main: update refs -> llama
fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc"
This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.
* add hot topic notice to README.md
* Update README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
---------
Co-authored-by: HanClinto <hanclinto@gmail.com>
|
|
|
|
|
|
|
|
|
|
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
|
|
|
|
* server: normalize token probabilities
* fix temperature == 0.0f
|
|
|
|
...`) (#6964)
* readme: cmake . -B build && cmake --build build
* build: fix typo
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* build: drop implicit . from cmake config command
* build: remove another superfluous .
* build: update MinGW cmake commands
* Update README-sycl.md
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* build: reinstate --config Release as not the default w/ some generators + document how to build Debug
* build: revert more --config Release
* build: nit / remove -H from cmake example
* build: reword debug instructions around single/multi config split
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
|
|
strings, cap number length (#6555)
* json: rename python schema converter to make import easier
* server: skip null json_schema / grammar fields
* json: deps management for primitive rules (+ allow null values)
* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`
* grammars: add troubleshooting section to readme
* json: cap length of numbers to 15 digits before/after decimal point
(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)
* json: unify all repetition code (w/ or w/o sep)
* json: support string minLength/maxLength
* server+json: update server/README w/ result_format
* nits
* json: fix type error w/ python 3.8
* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)
* json: simplify DOT `{"type": "string", "pattern": "^.$"}`
* json: remove recursion in opt_repetitions (avoids Python stack overflow)
* json: rm dead code
* json: rm useless assert & ggml.h import
|
|
* llama : save and restore kv cache for single seq id
* remove trailing whitespace
* respond error in case there's no space in the kv cache
* add kv seq save restore to test case
* add --slot-save-path arg to enable save restore and restrict save location
* Returning 0 for some cases, instead of asserting.
* cleanup error cases
* rename sequence state functions
* rename state get set functions
* add previous function names back in with DEPRECATED notice
* update doc
* adjust endpoints to preferred style
* fix restoring zero cell count
* handle seq rm return value
* unused param
* keep in the size check
* fix return types
* add server test case for slot save restore
* cleanup
* add cake
* cleanup style
* add special
* removing a whole sequence never fails
* move sequence state file functionality from server to llama to match session api and add version tags
* catch exceptions on save as well
* error log messages
* check types for stricter restore
* update server doc
* readme : update API changes date
* strict filename validation
* move include, reject bom as well
* also reject empty filename
* reject whitespace and trailing dot
---------
Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* Typo fix to server's README.md
Fix minor typo ("tonen") in server README.
* server readme grammar/style fixes.
Quickly went through this file to look for inconsistencies in
presentation of defaults, flag options, and looked for typos
and grammar issues.
Not perfect, but hopefully improved.
* Update README.md
Remove an extra space before newline.
|
|
|
|
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
|
|
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
- fix header name case sensitive
- support downloading additional split in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
(#6254)
|
|
|
|
* common: llama_load_model_from_url with libcurl dependency
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* server: format error to json
* server: do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
|
|
* server : clarify some items in the readme
* server : fix typo
|
|
* refactor static file handler
* use set_pre_routing_handler for validate_api_key
* merge embedding handlers
* correct http verb for endpoints
* fix embedding response
* fix test case CORS Options
* fix code style
|
|
* add cmake build toggle to enable ssl support in server
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add flags for ssl key/cert files and use SSLServer if set
All SSL setup is hidden behind CPPHTTPLIB_OPENSSL_SUPPORT in the same
way that the base httlib hides the SSL support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Update readme for SSL support in server
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Add LLAMA_SERVER_SSL variable setup to top-level Makefile
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
|
|
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can timout
* server: refactor: clean up http code (#5912)
* server : avoid n_available var
ggml-ci
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests
ggml-ci
* server : add comments
---------
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
|
|
|
|
|
|
|
|
* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs ensure value json value does not raised error
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
|
|
* server: tests: init scenarios
- health and slots endpoints
- completion endpoint
- OAI compatible chat completion requests w/ and without streaming
- completion multi users scenario
- multi users scenario on OAI compatible endpoint with streaming
- multi users with total number of tokens to predict exceeds the KV Cache size
- server wrong usage scenario, like in Infinite loop of "context shift" #3969
- slots shifting
- continuous batching
- embeddings endpoint
- multi users embedding endpoint: Segmentation fault #5655
- OpenAI-compatible embeddings API
- tokenize endpoint
- CORS and api key scenario
* server: CI GitHub workflow
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* add docs for llama_chat_apply_template
* fix typo
|
|
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning task.target_id not initialized.
|
|
|
|
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
|
|
|
|
* server: enrich health endpoint with available slots, return 503 if not slots are available
* server: document new status no slot available in the README.md
|
|
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix print usage LF in new --n-predict option
|
|
|
|
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remote enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
align paremeters
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|
|
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
|
|
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
|
|
|
|
|