Age  Commit message  Author
2024-04-08  llama : save and restore kv cache for single seq id (#6341)  [Jan Boon]
  * llama : save and restore kv cache for single seq id
  * remove trailing whitespace
  * respond error in case there's no space in the kv cache
  * add kv seq save restore to test case
  * add --slot-save-path arg to enable save restore and restrict save location
  * Returning 0 for some cases, instead of asserting.
  * cleanup error cases
  * rename sequence state functions
  * rename state get set functions
  * add previous function names back in with DEPRECATED notice
  * update doc
  * adjust endpoints to preferred style
  * fix restoring zero cell count
  * handle seq rm return value
  * unused param
  * keep in the size check
  * fix return types
  * add server test case for slot save restore
  * cleanup
  * add cake
  * cleanup style
  * add special
  * removing a whole sequence never fails
  * move sequence state file functionality from server to llama to match session api and add version tags
  * catch exceptions on save as well
  * error log messages
  * check types for stricter restore
  * update server doc
  * readme : update API changes date
  * strict filename validation
  * move include, reject bom as well
  * also reject empty filename
  * reject whitespace and trailing dot
  Co-authored-by: Martin Evans <martindevans@gmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
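The behavior this commit describes, snapshotting the cells of one sequence id and returning 0 instead of asserting when there is no room on restore, can be sketched with a toy cache. Everything below (names, the dict-of-lists cache, the capacity constant) is purely illustrative and is not the llama.cpp API:

```python
CACHE_CAPACITY = 8  # toy cell budget, an assumption for this sketch

def save_seq(cache: dict, seq_id: int) -> list:
    """Snapshot the cells currently held by one sequence id."""
    return list(cache.get(seq_id, []))

def restore_seq(cache: dict, seq_id: int, state: list) -> int:
    """Restore a snapshot; return 0 when the cache lacks space,
    mirroring the 'return 0 instead of asserting' behavior."""
    used = sum(len(cells) for cells in cache.values())
    free = CACHE_CAPACITY - used + len(cache.get(seq_id, []))
    if len(state) > free:
        return 0  # no space in the kv cache
    cache[seq_id] = list(state)
    return len(state)

cache = {1: ["k0", "k1"]}
snap = save_seq(cache, 1)
cache[1] = []                       # slot was cleared in the meantime
print(restore_seq(cache, 1, snap))  # -> 2
```

The real implementation serializes opaque state buffers through llama.h and tags them with version information, as the commit notes.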
2024-04-08  remove row=1 cond (#6532)  [Abhilash Majumder]
2024-04-08  Adding KodiBot to UI list (#6535)  [Firat]
  KodiBot is a free and open-source AI chat app released under the GNU General Public License.
2024-04-07  Change Windows AMD example to release build to make inference much faster. (#6525)  [Mark Fairbairn]
2024-04-07  flake.lock: Update (#6517)  [Georgi Gerganov]
  Flake lock file updates:
  • Updated input 'flake-parts':
      'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
    → 'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d' (2024-04-01)
  • Updated input 'flake-parts/nixpkgs-lib':
      'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
    → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib' (2024-03-29)
  • Updated input 'nixpkgs':
      'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
    → 'github:NixOS/nixpkgs/fd281bd6b7d3e32ddfa399853946f782553163b5' (2024-04-03)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-04-07  Add GritLM as a supported model. (#6513)  [DAN™]
2024-04-07  sync : ggml  [Georgi Gerganov]
2024-04-07  ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020)  [Slava Primenko]
  The `cudaHostRegisterReadOnly` parameter was only introduced in CUDA 11.1. See this issue for more details: https://github.com/ggerganov/whisper.cpp/issues/2007
2024-04-07  scripts : sync ggml-cuda folder  [Georgi Gerganov]
2024-04-07  Run make to build the project (#6457)  [limitedAtonement]
2024-04-07  support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (#6521)  [Neo Zhang Jianyu]
2024-04-06  sync : ggml  [Georgi Gerganov]
2024-04-06  backend : fix typo in scheduler documentation (ggml/781)  [Daniel Bevenius]
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-06  Tests: Added integration tests for GBNF parser (#6472)  [Clint Herron]
  * Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements.
  * Fixing whitespace errors and cleaning error message alert to be clearer.
  * Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API.
  * Comment cleanup.
  * Reorganizing tests for readability.
  * Cleaning up debug message to make a bit more sense.
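The table-driven accept/reject pattern such integration tests use to pin matcher behavior can be sketched as follows. Python regexes stand in for real GBNF grammars here, and every grammar string and case below is hypothetical:

```python
import re

# Accept/reject cases pinning matcher behavior; regexes stand in
# for GBNF grammars, and all cases are made up for illustration.
CASES = [
    (r"[0-9]+", "12345", True),    # digits accepted
    (r"[0-9]+", "12a45", False),   # stray letter rejected
    (r'"[^"]*"', '"hi"', True),    # quoted string accepted
    (r'"[^"]*"', '"hi', False),    # unterminated string rejected
]

def run_cases():
    """Return the list of cases whose outcome differs from expectation."""
    failures = []
    for pattern, text, expect in CASES:
        got = re.fullmatch(pattern, text) is not None
        if got != expect:
            failures.append((pattern, text, expect, got))
    return failures

print(run_cases())  # -> []
```

Pinning both positive and negative cases this way lets performance work proceed without silently changing what the parser accepts.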
2024-04-06  ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)  [Pierrick Hymbert]
  * ci: bench: support sse and fix prompt processing time; server: add tokens usage in stream mode
  * ci: bench: README.md EOL
  * ci: bench: remove total pp and tg as it is not accurate
  * ci: bench: fix case when there is no token generated
  * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
  * ci: bench: fix finish reason rate
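The 95th-percentile summary adopted for pp/tg above can be computed with a simple nearest-rank sketch; the latency samples below are made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value with at
    least p percent of the samples at or below it."""
    xs = sorted(samples)
    k = math.ceil(p / 100 * len(xs))
    return xs[max(k, 1) - 1]

latencies_ms = [12, 15, 11, 90, 14, 13, 16, 12, 15, 400]  # made-up samples
print(percentile(latencies_ms, 95))  # -> 400
```

Unlike a total or a mean, a high percentile is robust to runs with no generated tokens and matches the histogram-style metrics the server exports.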
2024-04-05  gguf.py : add licence and version to gguf writer (#6504)  [Brian]
2024-04-05  readme : update UI list (#6503)  [Hoang Nguyen]
  * Add MindMac to UI list
  * Update proprietary description
  Co-authored-by: slaren <slarengh@gmail.com>
2024-04-05  bench : make n_batch and n_ubatch configurable in Batched bench (#6500)  [Ting Sun]
  * bench: make n_batch and n_ubatch configurable
  * bench: update doc for batched bench
2024-04-05  [SYCL] Fixed minor bug when enabling FP16 for non intel targets (#6464)  [Ouadie EL FAROUKI]
  * moved INTEL_MKL guard from gemm_impl to gemm (wrapper)
  * Update ggml-sycl.cpp
  Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-04  readme : add Dot to UI list (#6487)  [alexpinel]
2024-04-04  readme : fix typo (#6481)  [Jun Jie]
2024-04-04  server: add cURL support to server Dockerfiles (#6474)  [Ed Lepedus]
  * server: add cURL support to `full.Dockerfile`
  * server: add cURL support to `full-cuda.Dockerfile` and `server-cuda.Dockerfile`
  * server: add cURL support to `full-rocm.Dockerfile` and `server-rocm.Dockerfile`
  * server: add cURL support to `server-intel.Dockerfile`
  * server: add cURL support to `server-vulkan.Dockerfile`
  * fix typo in `server-vulkan.Dockerfile`
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-04  ci: exempt master branch workflows from getting cancelled (#6486)  [Minsoo Cheong]
  * ci: exempt master branch workflows from getting cancelled
  * apply to bench.yml
2024-04-04  build CI: Name artifacts (#6482)  [Ewout ter Hoeven]
  Name the artifacts in the build CI so that they are uploaded under separate names instead of all being put into the same `artifact` ZIP. It might be possible to further simplify the packing step (in future PRs).
2024-04-04  server: allow penalizing repetition of newlines on server webpage (#6431)  [Shakhar Dasgupta]
2024-04-04  ci: bench fix concurrency for workflow trigger dispatch with sha1 (#6478)  [Pierrick Hymbert]
2024-04-04  Correct README link (#6458)  [limitedAtonement]
  README is called README.md.
2024-04-04  ci: bench: add more ftype, fix triggers and bot comment (#6466)  [Pierrick Hymbert]
  * ci: bench: change trigger path to not spawn on each PR
  * ci: bench: add more file type for phi-2: q8_0 and f16
    - do not show the comment by default
  * ci: bench: add seed parameter in k6 script
  * ci: bench: artefact name perf job
  * Add iteration in the commit status, reduce again the autocomment
  * ci: bench: add per slot metric in the commit status
  * Fix trailing spaces
2024-04-04  common: remove duplicate check for curl (#6471)  [Daniel Bevenius]
  This commit removes one of the two identical checks for curl being NULL in llama_load_model_from_url.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-04  examples : add GBNF validator program (#5948)  [Clint Herron]
  * Revising GBNF validator program to be much simpler.
  * Changing from streams to using cstdio
  * Adding final newline character.
2024-04-04  server : remove obsolete --memory-f32 option  [Georgi Gerganov]
2024-04-04  server : add option to disable KV offload (#6468)  [Xiao-Yong Jin]
2024-04-04  convert : fix for lint error complaining of bare except (#6470)  [Clint Herron]
2024-04-03  A few small fixes to server's README docs (#6428)  [Fattire]
  * Typo fix to server's README.md: fix minor typo ("tonen") in server README.
  * Server readme grammar/style fixes: went through this file to look for inconsistencies in the presentation of defaults and flag options, and for typos and grammar issues. Not perfect, but hopefully improved.
  * Update README.md: remove an extra space before newline.
2024-04-03  server : handle exception on wrong type in request (#6452)  [JH23X]
  Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
2024-04-03  llama : add SEA-LION support (#6448)  [bryanSwk]
  * initial commit for sealion support
  * add sealion support
  * minor fix
  * q/k ln and pos_embd only if required
  * Apply suggestions from code review
  * minor : clear whitespaces
  Co-authored-by: bryan <bryansiow@aisingapore.org>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03  ci : update checkout, setup-python and upload-artifact to latest (#6456)  [Ewout ter Hoeven]
  * CI: Update actions/checkout to v4
  * CI: Update actions/setup-python to v5
  * CI: Update actions/upload-artifact to v4
2024-04-03  server: add cURL support to `server.Dockerfile` (#6461)  [Ed Lepedus]
2024-04-03  readme : add feature-rich rust bindings (#6465)  [Francisco Melo]
2024-04-03  security : create policy (#6354)  [Joyce]
  * Create SECURITY.md
  * Fix: link on SECURITY.md
  * minor fixes
  Signed-off-by: Joyce <joycebrum@google.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03  Missing tokenizer.model error during gguf conversion (#6443)  [Abhishek Gopinath K]
  Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-04-03  Add OpenChat, Alpaca, Vicuna chat templates (#6397)  [kaizau]
  * Add openchat chat template
  * Add chat template test for openchat
  * Add chat template for vicuna
  * Add chat template for orca-vicuna
  * Add EOS for vicuna templates
  * Combine vicuna chat templates
  * Add tests for openchat and vicuna chat templates
  * Add chat template for alpaca
  * Add separate template name for vicuna-orca
  * Remove alpaca, match deepseek with jinja output
  * Regenerate chat template test with add_generation_prompt
  * Separate deepseek bos from system message
  * Match openchat template with jinja output
  * Remove BOS token from templates, unprefix openchat
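The Vicuna-style layout these templates follow can be sketched in a few lines. The exact system-prompt handling, spacing, and EOS placement below are assumptions for illustration; the authoritative formatting lives in the chat-template code and its jinja references:

```python
def vicuna_format(messages):
    """Render a chat in an assumed Vicuna-style layout: an optional
    system line, then USER:/ASSISTANT: turns, EOS after each reply,
    and a trailing generation prompt."""
    out = []
    for m in messages:
        if m["role"] == "system":
            out.append(m["content"])
        elif m["role"] == "user":
            out.append("USER: " + m["content"])
        else:
            out.append("ASSISTANT: " + m["content"] + "</s>")
    out.append("ASSISTANT:")  # generation prompt for the next reply
    return "\n".join(out)

chat = [
    {"role": "system", "content": "A chat between a user and an assistant."},
    {"role": "user", "content": "Hello!"},
]
print(vicuna_format(chat))
```

Regenerating test fixtures against the reference jinja output, as the bullets above describe, is what keeps a hand-written formatter like this honest.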
2024-04-03  readme : update hot topics  [Georgi Gerganov]
2024-04-03  ggml : mul_mat_id use the same tensor for all the experts (#6387)  [slaren]
  * ggml : update mul_mat_id to use the same tensor for all the experts
  * update cuda
  * minor
  * update metal
  * update test-backend-ops
  * fix cuda
  * Update ggml-metal.m
  * update convert.py
  * update convert-hf-to-gguf.py
  * update convert.py for mixtral hf models
  * Update convert-hf-to-gguf.py
  * cuda : support non-pow-2 number of experts
  * allow quantize to work for split and merged experts models in the same way
  * cleanup + disable mmap automatically with split tensors models
  * update imatrix
  * test-backend-ops : test qwen argsort
  * update grok model loading
  * llama : add merged experts tensors to the grok tensor map
  * minor
  * gguf : bump version
  * fix quantizing of merged experts
  * convert-hf-to-gguf.py : update grok (untested)
  * make linter happy
  * cuda/argsort : use shared memory instead of pool memory
  * convert : fix grok tensor names
  * metal : add support for non-pow-2 argsort
  * llama : more loader cleanup, better error checking
  * cuda : fix warning
  * llama : still use mmap for loading old models, but copy the data to a host buffer
  * add review note
  * llama : remove ffn tensor counting + add sanity check ggml-ci
  * convert : fix handling of n_experts == None ggml-ci
  * imatrix : fix ncall counters
  * llama : produce error if imatrix size does not match
  * quantize : terminate on errors + trace logs ggml-ci
  * metal : pad shared memory to 16 bytes
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
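The core mul_mat_id idea in that commit, one merged tensor holding every expert with the expert chosen per row by an id, can be sketched in plain Python. The shapes and data below are toy values, and the real ggml op works on quantized tensors across backends:

```python
def mat_vec(w, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def mul_mat_id(experts, ids, xs):
    """Route each input vector through the expert selected by its id.
    'experts' is one stacked [n_expert][rows][cols] structure,
    mirroring the single merged experts tensor."""
    return [mat_vec(experts[i], x) for i, x in zip(ids, xs)]

# Two toy 2x2 experts stacked into one structure.
experts = [
    [[1, 0], [0, 1]],   # expert 0: identity
    [[2, 0], [0, 2]],   # expert 1: doubles the input
]
ids = [1, 0]            # per-row expert selection
xs = [[3, 4], [5, 6]]
print(mul_mat_id(experts, ids, xs))  # -> [[6, 8], [5, 6]]
```

Keeping all experts in one tensor is what lets quantization and loading treat split and merged expert models uniformly, as the bullets above note.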
2024-04-03  [SYCL] Disable iqx on windows as WA (#6435)  [Meng, Hengyu]
  * disable iqx on windows as WA
  * array instead of global_memory
2024-04-01  flake.lock: Update (#6402)  [Georgi Gerganov]
  Flake lock file updates:
  • Updated input 'nixpkgs':
      'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
    → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-04-01  compare-llama-bench.py: fix long hexsha args (#6424)  [Johannes Gäßler]
2024-04-01  ci: server: verify deps are coherent with the commit (#6409)  [Pierrick Hymbert]
  * ci: server: verify deps are coherent with the commit
  * ci: server: change the ref to build as now it's a pull event target
2024-03-31  readme : update hot topics  [Georgi Gerganov]
2024-03-30  ci: bench: fix Resource not accessible by integration on PR event (#6393)  [Pierrick Hymbert]