Age | Commit message | Author
2024-06-14 | llama : more checks before assuming FIM tokens (#7644) | Sigbjørn Skjæret
* More checks before assuming FIM tokens for Llama arch
* extensive token check
2024-06-14 | convert : add Poro-34B-chat tokenizer support (#7713) | Elaine
* support for Poro chat pre-tokenizer
* add support for Poro pre-tokenizer
* Update convert-hf-to-gguf-update.py
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Change Poro-34B-chat to poro-chat
* Change Poro-34B-chat to poro-chat
* Update convert-hf-to-gguf-update.py
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13 | rpc : fix ggml_backend_rpc_supports_buft() (#7918) | Radoslav Gerganov
2024-06-13 | readme : Remove outdated instructions from README.md (#7914) [no ci] | Galunid
2024-06-13 | move BLAS to a separate backend (#6210) | slaren
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
  This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13 | `build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) | Olivier Chafik
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server gitignore llama-server
* server: simplify nix package
* main: update refs -> llama fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc"
  This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.
* add hot topic notice to README.md
* Update README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
---------
Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-12 | CUDA: fix broken oob check for FA vec f32 kernel (#7904) | Johannes Gäßler
2024-06-12 | tests : add non-cont unary tests (#7857) | Georgi Gerganov
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
  ggml-ci
2024-06-12 | ggml : improve ggml_is_contiguous logic (#7856) | Georgi Gerganov
* ggml : improve ggml_is_contiguous logic
  ggml-ci
* ggml : support more contiguous cases
  ggml-ci
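For context: a ggml tensor carries per-dimension element counts `ne` and byte strides `nb`, innermost dimension first. A minimal sketch of a contiguity check under those conventions follows. It is a general illustration, not ggml's `ggml_is_contiguous` code; the function name is hypothetical, and it assumes one element per `type_size` bytes (ggml's quantized block types are ignored). Singleton dimensions are skipped because their stride never affects addressing; that is one kind of case a naive stride-by-stride comparison would wrongly reject, though the commit message does not say which cases it added.

```python
from typing import Sequence


def is_contiguous(ne: Sequence[int], nb: Sequence[int], type_size: int) -> bool:
    """Hypothetical check: True if a tensor with element counts `ne` and byte
    strides `nb` (innermost dimension first, ggml-style) is densely packed.

    Assumes one element occupies `type_size` bytes (no quantized blocks).
    """
    expected = type_size  # stride a densely packed tensor would have here
    for n, stride in zip(ne, nb):
        if n == 1:
            # A size-1 dimension is never stepped over, so its stride is irrelevant.
            continue
        if stride != expected:
            return False
        expected *= n
    return True
```

For example, `is_contiguous([4, 3, 1, 1], [4, 16, 48, 48], 4)` is true, while a transposed view with `ne = [3, 4]` and `nb = [16, 4]` is not.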
2024-06-12 | server : restore numeric prompts (#7883) | Georgi Gerganov
2024-06-12 | update intel docker oneapi-basekit to 2024.1.1-devel-ubuntu22.04 (#7894) | Meng, Hengyu
In addition, this reverts a workaround we needed for the upstream issue with expired Intel GPG package keys in 2024.0.1-devel-ubuntu22.04.
2024-06-12 | Fix a typo and add Fedora 40 packages to install for Vulkan (#7794) [no ci] | Patrice Ferlet
Fix "appropiate" to "appropriate" and add Fedora 40 packages to install to compile with Vulkan support
2024-06-11 | vulkan: select only one device for single gpu with multiple drivers (#7582) | k.h.lai
2024-06-11 | Update Vulkan RoPE implementation (#7818) | 0cc4m
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
  Minor fixes
* Fix segfault when running out of VRAM
  Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-12 | fix broken link in pr template (#7880) [no ci] | Deven Mistry
* fix broken link in pr template
* Update pull_request_template.md [no ci]
---------
Co-authored-by: Brian <mofosyne@gmail.com>
2024-06-11 | github: move PR template to .github/ root (#7868) | Brian
2024-06-11 | llama-bench: more compact markdown tables (#7879) | Johannes Gäßler
2024-06-11 | tests : check the Python version (#7872) | Georgi Gerganov
ggml-ci
2024-06-11 | CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (#7860) | Johannes Gäßler
2024-06-11 | fix CUDA CI by using a windows-2019 image (#7861) | slaren
* try to fix CUDA ci with --allow-unsupported-compiler
* trigger when build.yml changes
* another test
* try exllama/bdashore3 method
* install vs build tools before cuda toolkit
* try win-2019
2024-06-11 | json: refine constraint for whitespace to avoid runaways yet allow pretty print (#7866) | Olivier Chafik
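The JSON grammars themselves are written in GBNF. As a rough analogue only, the Python regular expression below expresses a bounded whitespace rule: an empty string, a single space, or a newline followed by at most 20 spaces or tabs. The specific alternatives and the limit of 20 are assumptions for illustration, not copied from llama.cpp's grammars, and `is_allowed_ws` is a hypothetical helper.

```python
import re

# Hypothetical bounded whitespace rule (an analogue of a grammar constraint):
# nothing, one space, or a newline plus at most 20 spaces/tabs of indentation.
WS = re.compile(r"(?: |\n[ \t]{0,20})?")


def is_allowed_ws(s: str) -> bool:
    """Return True if the whole string `s` is permitted by the WS rule."""
    return WS.fullmatch(s) is not None
```

The capped repetition still admits indented, pretty-printed output, but a sampler constrained this way cannot keep emitting whitespace indefinitely.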
2024-06-11 | `json`: document schema conversion in GBNF readme, align manual grammar examples & converters (#7841) | Olivier Chafik
* json: fix char pattern in grammar converters
* json: prevent number precision & whitespace runaways in example grammars
* json: add doc to grammar readme
2024-06-10 | cmake : fix CMake requirement for CUDA (#7821) | Jared Van Bortel
2024-06-10 | ci : try win-2019 on server windows test (#7854) | slaren
2024-06-10 | examples : remove --instruct remnants (#7846) | Georgi Gerganov
2024-06-10 | server : improve "prompt" handling (#7847) | Georgi Gerganov
2024-06-10 | CUDA: use tensor cores for MMQ (#7676) | Johannes Gäßler
* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early
2024-06-10 | use the correct SYCL context for host USM allocations (#7777) | Ben Ashbaugh
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-09 | flake.lock: Update (#7838) | Georgi Gerganov
Flake lock file updates:
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)
  → 'github:NixOS/nixpkgs/051f920625ab5aabe37c920346e3e69d7d34400e?narHash=sha256-4q0s6m0GUcN7q%2BY2DqD27iLvbcd1G50T2lv08kKxkSI%3D' (2024-06-07)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-09 | imatrix : handle partial entries (#7833) | Georgi Gerganov
2024-06-10 | docs: Added initial PR template with directions for doc only changes and squash merges [no ci] (#7700) | Nicolás Pérez
This commit adds pull_request_template.md and CONTRIBUTING.md. It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci] and how to format PR title and descriptions.
Co-authored-by: Brian <mofosyne@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-06-09 | server: do not remove whitespace at the start of a completion chunk (#7830) | mgroeber9110
2024-06-09 | CUDA: revise q8_1 data layout for mul_mat_q (#7824) | Johannes Gäßler
2024-06-09 | convert-hf : set the model name based on cli arg, if present (#7693) | sasha0552
`--model-name` argument was added a while ago but did not do anything. This commit fixes this issue and enables this feature.
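A minimal sketch of how a `--model-name` override can take effect, written with `argparse`. The positional `model` argument, the helper names, and the fallback to the model directory's name are all assumptions for illustration; this is not the converter's code, which may derive its default name differently.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical CLI with an optional --model-name override."""
    parser = argparse.ArgumentParser()
    parser.add_argument("model", type=Path, help="directory containing the HF model")
    parser.add_argument("--model-name", type=str, default=None,
                        help="name to store in the GGUF metadata")
    return parser.parse_args(argv)


def resolve_model_name(args: argparse.Namespace) -> str:
    # Use the explicit argument when given; otherwise (assumed fallback)
    # take the name of the model directory.
    return args.model_name if args.model_name else args.model.name
```

The bug described above corresponds to parsing `--model-name` but never consulting `args.model_name` when writing the metadata.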
2024-06-09 | convert-hf : match model part name prefix and suffix (#7687) | compilade
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part, we simply used the same logic as the part count to get the part names. But this doesn't always work correctly, for example when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present. By matching both the prefix and the suffix of the model part names, this commit should fix this problem without breaking any previously-supported upstream models. According to a report by @teleprint-me, some persistent problem remains, but this fix should do in the meantime.
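The matching rule described above can be sketched as a pure function over file names. `match_model_part_names` is a hypothetical helper, and the prefix `model` and suffix `.safetensors` are example arguments; with a real directory the names would come from something like `os.listdir`.

```python
from typing import Iterable, List


def match_model_part_names(names: Iterable[str], prefix: str, suffix: str) -> List[str]:
    """Hypothetical helper: return, sorted, the names that start with `prefix`
    AND end with `suffix`."""
    return sorted(n for n in names if n.startswith(prefix) and n.endswith(suffix))
```

With `prefix="model"` and `suffix=".safetensors"`, a stray `consolidated.safetensors` is excluded because it lacks the prefix, and `model.safetensors.index.json` is excluded because it lacks the suffix, while a single `model.safetensors` and sharded `model-0000N-of-0000M.safetensors` files both still match.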
2024-06-09 | gguf-py : decouple adding metadata from writing in GGUFWriter (#7827) | compilade
The main change of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. In addition, use_temp_file is now opt-in instead of opt-out, defaulting to False. GGUFWriter also no longer requires the output file name until it actually writes to it, and it does not need to eagerly prepare the data layout of the metadata.
2024-06-09 | Revert "[SYCL] Update rpc-server.cpp to include SYCL backend (#7682)" (#7808) | slaren
This reverts commit 9422c5e34bbd302493b77a8f6d546154a1f4fe82.
2024-06-08 | url: save -mu downloads to new cache location (#7826) | Olivier Chafik
* url: save -mu download to new cache location
* url: fs_get_cache_file_path util
* url: tweak sig of fs_get_cache_file
2024-06-08 | server : smart slot selection using Longest Common Prefix (#7728) | sasha0552
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
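A minimal sketch of Longest Common Prefix slot selection, assuming each slot is represented as a hypothetical `(is_available, cached_tokens)` pair and prompts are lists of token ids. Among available slots it picks the one whose cached tokens share the longest prefix with the new prompt; ties go to the lowest index. Any thresholds or fallbacks the real server applies are omitted.

```python
from typing import Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that `a` and `b` have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_slot(prompt: Sequence[int],
                slots: Sequence[Tuple[bool, Sequence[int]]]) -> Optional[int]:
    """Return the index of the available slot with the longest common prefix,
    or None if no slot is available (hypothetical slot representation)."""
    best_idx, best_len = None, -1
    for i, (available, cached) in enumerate(slots):
        if not available:
            continue
        lcp = common_prefix_len(prompt, cached)
        if lcp > best_len:  # strict '>' keeps the lowest index on ties
            best_idx, best_len = i, lcp
    return best_idx
```

A prefix measure fits this job better than a substring measure: in a causal model only the leading tokens of a cached sequence can be reused for a new prompt, so a shared substring in the middle saves no work.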
2024-06-07 | vulkan : reuse parent extra for views (#7806) | slaren
* vulkan : reuse parent extra for views
* Fix validation error when multiple compute contexts are used in a graph
---------
Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07 | gguf-split : change binary multi-byte units to decimal (#7803) | Christian Zhou-Zheng
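The difference between binary and decimal multi-byte units is easy to get wrong, so the arithmetic is sketched below. With decimal (SI) units `500M` is 500,000,000 bytes, while with binary (IEC) units it is 500 × 1,048,576 = 524,288,000 bytes. `parse_size` and the `K`/`M`/`G` suffix set are hypothetical; the commit title does not say which suffixes gguf-split accepts or whether the change affects parsing, printing, or both.

```python
from typing import Dict

# Decimal (SI) vs binary (IEC) multipliers for single-letter size suffixes.
DECIMAL: Dict[str, int] = {"K": 1000, "M": 1000**2, "G": 1000**3}
BINARY: Dict[str, int] = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str, units: Dict[str, int]) -> int:
    """Hypothetical parser: turn '500M' into a byte count using `units`.
    A string without a known suffix is read as a plain byte count."""
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)
```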
2024-06-07 | cmake : fix BUILD_SHARED_LIBS=ON build (#7784) | intelmatt
common depends on pthreads in Linux
2024-06-07 | server: update cache_prompt documentation [no ci] (#7745) | Johannes Gäßler
2024-06-07 | server : do not get prompt in infill mode (#7286) | woodx
* avoid getting prompt in infill mode and embedding mode
* remove embedding mode
* refactor format
---------
Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 | [SYCL] fix softmax r2r result wrong issue (#7811) | pengxin99
2024-06-07 | check for nans in imatrix and quantize (#7807) | slaren
* imatrix : detect nan/inf values
* quantize : check imatrix for nan/inf values
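A minimal sketch of the kind of check described above, assuming an importance matrix is held as a mapping from tensor name to a list of floats. `find_non_finite` and `check_imatrix` are hypothetical helpers, not llama.cpp functions; the real tools are C++ and operate on their own data structures.

```python
import math
from typing import Dict, Iterable, List


def find_non_finite(values: Iterable[float]) -> List[int]:
    """Return the indices of NaN, +inf, or -inf entries."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def check_imatrix(entries: Dict[str, List[float]]) -> None:
    """Raise ValueError naming the first tensor that has non-finite values."""
    for name, values in entries.items():
        bad = find_non_finite(values)
        if bad:
            raise ValueError(
                f"{name}: {len(bad)} non-finite value(s), first at index {bad[0]}"
            )
```

Failing early with the tensor name is more useful than letting a NaN propagate into quantized weights, where it would only surface later as broken output.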
2024-06-06 | server : fix --threads-http arg (#7801) | Georgi Gerganov
2024-06-06 | imatrix : migrate to gpt_params (#7771) | Georgi Gerganov
* imatrix : migrate to gpt_params
  ggml-ci
* imatrix : add --save-frequency cli arg
* common : fix --no-ppl
2024-06-06 | Added support for . (any character) token in grammar engine. (#6467) | Clint Herron
* Added support for . (any character) token in grammar engine.
* Add integration tests for any-character symbol.
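A minimal sketch of what a `.` token means in a pattern: it matches exactly one character of any kind, while every other pattern character must match literally. This toy matcher (`matches` is a hypothetical name) supports nothing else, with no alternation, repetition, or escaping, and is not llama.cpp's grammar engine. It treats every character, including a newline, as a match for `.`; that detail is an assumption of the sketch, not confirmed behavior of the engine.

```python
def matches(pattern: str, text: str) -> bool:
    """Toy matcher: '.' matches any single character, all other pattern
    characters must match literally, and lengths must be equal."""
    if len(pattern) != len(text):
        return False
    return all(p == "." or p == c for p, c in zip(pattern, text))
```

Because Python strings are sequences of code points, `.` here consumes one code point, so `matches("a.c", "a☃c")` is true even though the snowman is several bytes in UTF-8.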
2024-06-06 | README minor fixes (#7798) [no ci] | Mattheus Chediak
derievatives --> derivatives