summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-06-10server : improve "prompt" handling (#7847)Georgi Gerganov
2024-06-10CUDA: use tensor cores for MMQ (#7676)Johannes Gäßler
* CUDA: int8 tensor cores for MMQ (legacy quants) * fix out-of-bounds writes * __builtin_assume -> GGML_CUDA_ASSUME * fix writeback returning too early
2024-06-10use the correct SYCL context for host USM allocations (#7777)Ben Ashbaugh
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-09flake.lock: Update (#7838)Georgi Gerganov
Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29) → 'github:NixOS/nixpkgs/051f920625ab5aabe37c920346e3e69d7d34400e?narHash=sha256-4q0s6m0GUcN7q%2BY2DqD27iLvbcd1G50T2lv08kKxkSI%3D' (2024-06-07) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-09imatrix : handle partial entries (#7833)Georgi Gerganov
2024-06-10docs: Added initial PR template with directions for doc only changes and ↵Nicolás Pérez
squash merges [no ci] (#7700) This commit adds pull_request_template.md and CONTRIBUTING.md . It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci] and how to format PR title and descriptions. Co-authored-by: Brian <mofosyne@gmail.com> Co-authored-by: compilade <git@compilade.net>
2024-06-09server: do not remove whitespace at the start of a completion chunk (#7830)mgroeber9110
2024-06-09CUDA: revise q8_1 data layout for mul_mat_q (#7824)Johannes Gäßler
2024-06-09convert-hf : set the model name based on cli arg, if present (#7693)sasha0552
`--model-name` argument was added a while ago but did not do anything. This commit fixes this issue and enables this feature.
2024-06-09convert-hf : match model part name prefix and suffix (#7687)compilade
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part we simply used the same logic as the part count to get the part names. But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present. This commit matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models. But according to report by @teleprint-me there is still some persistent problem, but shall do in the meantime.
2024-06-09gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)compilade
Main changes of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. In addition use_temp_file is now opt-in instead of opt-out defaulting to False. Also GGUFWriter now does not require output file name until when actually writing to it. And GGUFWriter doesn't really need to eagerly prepare the data layout of the metadata
2024-06-09Revert "[SYCL] Update rpc-server.cpp to include SYCL backend (#7682)" (#7808)slaren
This reverts commit 9422c5e34bbd302493b77a8f6d546154a1f4fe82.
2024-06-08url: save -mu downloads to new cache location (#7826)Olivier Chafik
* url: save -mu download to new cache location * url: fs_get_cache_file_path util * url: tweak sig of fs_get_cache_file
2024-06-08server : smart slot selection using Longest Common Prefix (#7728)sasha0552
* server : Smart selection of available slot using Longest Common Substring * add usage * remove trailing whitespaces * Use Longest Common Prefix (LCP) instead of LCS * Rename argument
2024-06-07vulkan : reuse parent extra for views (#7806)slaren
* vulkan : reuse parent extra for views * Fix validation error when multiple compute contexts are used in a graph --------- Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07gguf-split : change binary multi-byte units to decimal (#7803)Christian Zhou-Zheng
2024-06-07cmake : fix BUILD_SHARED_LIBS=ON build (#7784)intelmatt
common depends on pthreads in Linux
2024-06-07server: update cache_prompt documentation [no ci] (#7745)Johannes Gäßler
2024-06-07server : do not get prompt in infill mode (#7286)woodx
* avoid to get prompt in infill mode and embedding mode * remove embedding mode * refactor format --------- Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07[SYCL] fix softmax r2r result wrong issue (#7811)pengxin99
2024-06-07check for nans in imatrix and quantize (#7807)slaren
* imatrix : detect nan/inf values * quantize : check imatrix for nan/inf values
2024-06-06server : fix --threads-http arg (#7801)Georgi Gerganov
2024-06-06imatrix : migrate to gpt_params (#7771)Georgi Gerganov
* imatrix : migrate to gpt_params ggml-ci * imatrix : add --save-frequency cli arg * common : fix --no-ppl
2024-06-06Added support for . (any character) token in grammar engine. (#6467)Clint Herron
* Added support for . (any characer) token in grammar engine. * Add integration tests for any-character symbol.
2024-06-06README minor fixes (#7798) [no ci]Mattheus Chediak
derievatives --> derivatives
2024-06-06grammars: x{min,max} repetition operator (#6640)Olivier Chafik
* grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates * grammars: handle `x{n}` and fix `x{n,n}` * grammars: document new repetition operators * grammars: uniform use of int for min & max * grammars: refactor parser test * grammar: parsing tests w/ natural pretty print of updated expectations * grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all) * grammars: improve test pretty print again * grammars: pretty print rules and chars * grammars: fix copy rule skipping * grammars: disallow `a{,}` (not allowed in regexps) * Update common/grammar-parser.cpp Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: fix copy rule skipping (again) & display of expectations * grammars: more test cases * grammars: update reps parsing to bring ? / * / + closer to before * json: use new GBNF repetitions{m,n} syntax * grammars: update performance gotchas w/ repetition advice * Update examples/json_schema_to_grammar.py Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: comment on rule repetitions * grammars: ensure unambiguous number alternatives * grammar: nit typo switched error msgs * grammar: nit numbering in comment * json: update numeric rule to be unambiguous * Apply suggestions from code review Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * json: fix integral-part * grammar: add repetition tests --------- Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-06llama : add jina v2 base code (#7596)Joan Fontanals
* feat: add changes to handle jina v2 base code * fix: do not complicate things * fix: fix the usage of the code model * fix: fix comments * fix: fix linting issues * fix: remove ollama patches * style : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06docker : build only main and server in their images (#7782)slaren
* add openmp lib to dockerfiles * build only main and server in their docker images
2024-06-06docker : add openmp lib (#7780)slaren
2024-06-06Fix encoding in python scripts (#7733)Galunid
2024-06-05CUDA: refactor mmq, dmmv, mmvq (#7716)Johannes Gäßler
* CUDA: refactor mmq, dmmv, mmvq * fix out-of-bounds write * struct for qk, qr, qi * fix cmake build * mmq_type_traits
2024-06-05ggml : refactor rope norm/neox (#7634)Georgi Gerganov
* ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci
2024-06-05readme : remove -ins (#7759)arch-btw
-ins and --instruct were moved in https://github.com/ggerganov/llama.cpp/pull/7675 I have adjusted the README accordingly. There was no trace of --chatml in the README.
2024-06-05Fix per token atrributes bits (#7749)jaime-m-p
2024-06-04Allow number of nodes in CUDA graph to change (#7738)agray3
Previously the code would have failed to cope in the case that the number of nodes changes in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.
2024-06-04common : refactor cli arg parsing (#7675)Georgi Gerganov
* common : gpt_params_parse do not print usage * common : rework usage print (wip) * common : valign * common : rework print_usage * infill : remove cfg support * common : reorder args * server : deduplicate parameters ggml-ci * common : add missing header ggml-ci * common : remote --random-prompt usages ggml-ci * examples : migrate to gpt_params ggml-ci * batched-bench : migrate to gpt_params * retrieval : migrate to gpt_params * common : change defaults for escape and n_ctx * common : remove chatml and instruct params ggml-ci * common : passkey use gpt_params
2024-06-04ggml : remove OpenCL (#7735)Georgi Gerganov
ggml-ci
2024-06-04llama : remove beam search (#7736)Georgi Gerganov
2024-06-04readme : remove obsolete Zig instructions (#7471)Georgi Gerganov
2024-06-04llama-bench : allow using a different printer for stderr with -oe (#7722)slaren
compare-commits.sh : hide stdout, use -oe to print markdown
2024-06-04Improve hipBLAS support in CMake (#7696)Daniele
* Improve hipBLAS support in CMake This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK. * Set ROCM_PATH correctly
2024-06-04refine .gitignore (#7688)zhouwg
This adds tags and android ndk into the git ignore list
2024-06-04Per token attributes (#7685)jaime-m-p
* Add per token attributes enum * Using phi-3 for testing 'rstrip' * Using jina-v2 for testing 'lstrip' * Brute force test for 'lstrip' and 'rstrip' * Implement 'rstrip' and 'lstrip' * Update phi-3 GGUF file (obsolete since 917dc8c) * Replace llama_token_type with llama_token_attribs
2024-06-04ggml : prevent builds with -ffinite-math-only (#7726)Georgi Gerganov
This enforces a check that -fno-finite-math-only was set and that the operating compiling mode is not in finite maths mode. This is because during rewriting of silu and softmax for cpu #7154 there emerged an issue where the result that was observed when >1 slot was nondeterministic as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only which was theorised to be due to SiLU, instead of flushing small values to 0, returns NaN or some other garbage. @jart proposed a fix that @ggerganov then implemented in this fix ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
2024-06-03llama : offload to RPC in addition to other backends (#7640)Radoslav Gerganov
* llama : offload to RPC in addition to other backends * - fix copy_tensor being called on the src buffer instead of the dst buffer - always initialize views in the view_src buffer - add RPC backend to Makefile build - add endpoint to all RPC object names * add rpc-server to Makefile * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>
2024-06-03ggml : use OpenMP as a thread pool (#7606)Masaya, Kato
* ggml: Added OpenMP for multi-threads processing * ggml : Limit the number of threads used to avoid deadlock * update shared state n_threads in parallel region * clear numa affinity for main thread even with openmp * enable openmp by default * fix msvc build * disable openmp on macos * ci : disable openmp with thread sanitizer * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-03make: fix debug options not being applied to NVCC (#7714)Johannes Gäßler
2024-06-03Vulkan Mixture of Experts (MoE) support (#7628)0cc4m
* Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU
2024-06-03cmake : add pkg-config spec file for llama.cpp (#7702)Andy Tai
2024-06-03llama : MiniCPM support tied embeddings (#7664)zhangkaihuo
* support lm_head * remove the code block --------- Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>