summaryrefslogtreecommitdiff
path: root/examples/server
AgeCommit message (Collapse)Author
2024-01-30server : improve README (#5209)Wu Jian Ping
2024-01-29server : embeddings compatibility for OpenAI (#5190)Wu Jian Ping
2024-01-28ggml : add unified SYCL backend for Intel GPUs (#2690)Abhilash Majumder
* first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28docker : add server-first container images (#5157)Kyle Mistele
* feat: add Dockerfiles for each platform that user ./server instead of ./main * feat: update .github/workflows/docker.yml to build server-first docker containers * doc: add information about running the server with Docker to README.md * doc: add information about running with docker to the server README * doc: update n-gpu-layers to show correct GPU usage * fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
2024-01-27sync : ggmlGeorgi Gerganov
2024-01-27Remove unused data and add fixes (#5154)Michael Klimenko
* Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation
2024-01-27server : add self-extend support (#5104)Maximilian Winter
* Ported self extension to server example * Update server.cpp * Fixed prompt caching without self extend * Update server.cpp * Added description to server readme. * Update server.cpp * Update server.cpp * Update server.cpp * Update server.cpp * Update README.md * Changed descriptions * server : formatting * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update server.cpp * Update server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-26server : refactored the task processing logic (#5065)Xuan Son Nguyen
* server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"
2024-01-18server : defer tasks when "slot unavailable" (#5018)Xuan Son Nguyen
* server: defer task when no slot is available * remove unnecessary log --------- Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-13server : fix prompt caching with system prompt (#4914)Georgi Gerganov
2024-01-13server : fix deadlock that occurs in multi-prompt scenarios (#4905)Ziad Ben Hadj-Alouane
* * fix deadlock * * dont ruint all whitespace
2024-01-13server : fix crash with multimodal models without BOS token (#4904)makomk
2024-01-12llama : ggml-backend integration (#4766)slaren
* llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-11server : fix infill when prompt is empty (#4833)Georgi Gerganov
2024-01-11server : implement credentialed CORS (#4514)Laura
* Implement credentialed CORS according to MDN * Fix syntax error * Move validate_api_key up so it is defined before its first usage
2024-01-11server : support for multiple api keys (#4864)Michael Coppola
* server: added support for multiple api keys, added loading api keys from file * minor: fix whitespace * added file error handling to --api-key-file, changed code to better reflect current style * server: update README.md for --api-key-file --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11server : add `LOG_INFO` when model is successfully loaded (#4881)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too * used LOG_INFO after successful model loading
2024-01-11server : fix typo in model name (#4876)Isaac McFadyen
2024-01-11server : update readme to document the new `/health` endpoint (#4866)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too
2024-01-11server : fix build + rename enums (#4870)Georgi Gerganov
2024-01-10server : add a `/health` endpoint (#4860)Behnam M
* added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line
2024-01-09server : update readme about token probs (#4777)Behnam M
* updated server readme to reflect the gg/server-token-probs-4088 commit added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`. * simplified the `completion_probabilities` JSON schema It's now easier to understand what the structure of `completion_probabilities` looks like. * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09server : add api-key flag to documentation (#4832)Zsapi
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-07server : fix n_predict check (#4798)Georgi Gerganov
2024-01-04server : send token probs for "stream == false" (#4714)Georgi Gerganov
2024-01-04server : fix options in README.md (#4765)Michael Coppola
* fix examples/server/README.md * minor : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03server : throw an error when `slot unavailable` (#4741)Justin Parker
2024-01-02server : add token counts to html footer (#4738)Phil H
* server: add token counts to stats * server: generate hpp --------- Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02editorconfig : fix whitespace and indentation #4710Georgi Gerganov
2024-01-02server : add --override-kv parameter (#4710)minarchist
* Changes to server to allow metadata override * documentation * flake.nix: expose full scope in legacyPackages * flake.nix: rocm not yet supported on aarch64, so hide the output * flake.nix: expose checks * workflows: nix-ci: init; build flake outputs * workflows: nix-ci: add a job for eval * workflows: weekly `nix flake update` * workflows: nix-flakestry: drop tag filters ...and add a job for flakehub.com * workflows: nix-ci: add a qemu job for jetsons * flake.nix: suggest the binary caches * flake.lock: update to a commit recently cached by nixpkgs-cuda-ci --------- Co-authored-by: John <john@jLap.lan> Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2023-12-30clip : refactor + bug fixes (#4696)Georgi Gerganov
* clip : refactor + bug fixes ggml-ci * server : add log message
2023-12-29cmake : fix ld warning duplicate libraries libllama.a (#4671)Cuong Trinh Manh
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.
2023-12-29server : replace sleep with condition variables (#4673)Justine Tunney
The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b
2023-12-29server : fix OpenAI server sampling w.r.t. penalty. (#4675)SakuraUmi
2023-12-29server : allow to generate multimodal embeddings (#4681)Karthik Sethuraman
2023-12-28Fix OpenAI server sampling w.r.t. temp and seed (#4668)Justine Tunney
The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29
2023-12-23server : allow to specify custom prompt for penalty calculation (#3727)Alexey Parfenov
2023-12-17server : disable llm logs if SERVER_VERBOSE is off (#3792)olexiyb
2023-12-17server : fix grammar being ignored (#4494)AdithyanI
Fix bug in identifying the grammar.
2023-12-17server : fix possible ambiguity in content type charset (#4501)Alexey Parfenov
2023-12-17server : allow requests larger than 8K (#4500)mzcu
2023-12-15server : add optional API Key Authentication example (#4441)ShadovvBeast
* Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-13server : fix handling of characters that span multiple tokens when streaming ↵shibe2
(#4446)
2023-12-12server : tweak default sampling parameters (#4367)kalomaze
* Set a more typical Top P setting as the default * Update temp max
2023-12-12english : use `typos` to fix comments and logs (#4354)Richard Kiss
2023-12-12server : fix local model name in server (#4420)Vladimir Zorin
2023-12-10Update README.md (#4388)Yueh-Po Peng
Fix small typo.
2023-12-07llama : per-layer KV cache + quantum K cache (#4309)Georgi Gerganov
* per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>
2023-12-06server : recognize cache_prompt parameter in OAI API (#4347)Georgi Gerganov
2023-12-03server : fix OpenAI API `stop` field to be optional (#4299)Ed Lee
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84ae3bcbf0d617b7ee6a5413bcbd58af)