Age  Commit message  Author
2024-05-11  llama : add Jina Embeddings architecture (#6826)  Joan Fontanals
* feat: first things to do * feat: create tensors for Jina architecture * fix: use other tensors * feat: embedding gets results * fix: fix usage of ALIBI * fix: clean prints * fix: do some cleanup unused vars * fix: revert changes to Makefile and CMakeLists * fix: revert some changes * fix: fix small detail * fix: fix convert formatting * fix: fix linting and editor * feat: set proper vocab settings * fix: JinaBertForMaskedLM registration * feat: support q_normalization and k_normalization in Jina arch * feat: handle gpt2 tokenizer with Jina architecture * feat: example comments in embedding * feat: rename Jina Bert to Jina Bert V2 * fix: add some changes as per review * feat: proper KQ_pos for Jina embeddings * feat: add capacity to load models ES and DE for Spanish * llama : fix pre-tokenizers * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * minor : clean-up * embedding : add warning about missing SEP --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-11  ggml : full ALiBi support (#7192)  Georgi Gerganov
* ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * ggml : fix assert message * vulkan : add dev notes * ggml : require mask when using ALiBi ggml-ci * convert : fix convert for refact models
2024-05-10  llama-bench : add pp+tg test type (#7199)  slaren
2024-05-10  metal : fix flash attention kernel requirements (#7169)  Georgi Gerganov
* metal : fix flash attention kernel requirements ggml-ci * metal : fix ggml_metal_supports_op ggml-ci
2024-05-10  convert : print "ignore_merges" field  Georgi Gerganov
2024-05-10  llama : use n_vocab to differentiate between mistral 7B and llama3 8B (#7200)  slaren
2024-05-10  Fix memory bug in grammar parser (#7194)  Justine Tunney
The llama.cpp grammar parser had a bug where forgetting to add a closing quotation mark to strings would cause parsing to crash. Anyone running a server on a public endpoint is advised to upgrade. To reproduce this bug: `./llamafile -m foo.gguf -p bar --grammar 'root::="'` Credit for discovering and reporting this issue goes to Eclypsium Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
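The failure mode described here, a string literal with no closing quotation mark, can be illustrated with a minimal sketch. This is not the llama.cpp parser: `parse_string_literal` is a hypothetical helper, and escape sequences are deliberately ignored. The only point it makes is that the end-of-input check has to come before each character read.

```python
from typing import Tuple


def parse_string_literal(src: str, pos: int) -> Tuple[str, int]:
    """Parse a double-quoted literal starting at src[pos].

    Returns the literal's contents and the offset just past the closing quote.
    Hypothetical helper for illustration only; escapes are not handled.
    """
    if pos >= len(src) or src[pos] != '"':
        raise ValueError(f"expected opening quote at offset {pos}")
    pos += 1
    chars = []
    while True:
        # Check for end of input before reading: an unterminated literal
        # must become a parse error instead of a read past the buffer.
        if pos >= len(src):
            raise ValueError("unterminated string literal")
        ch = src[pos]
        if ch == '"':
            return "".join(chars), pos + 1
        chars.append(ch)
        pos += 1
```

In C or C++ the missing check would read past the end of the buffer; Python would raise an `IndexError` instead, so the explicit check mainly serves to show where the guard belongs.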
2024-05-10  Main+: optionally allow special tokens from user in interactive mode (#7097)  HanishKVC
@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.
2024-05-10  llava : fix moondream support (#7163)  Andrei
* Revert "Revert "llava : add support for moondream vision language model (#6899)"" This reverts commit 9da243b36ac0b9d609adfaaa4c8f1cc8c592f737. * Fix num_positions and embeddings initialization
2024-05-10  Minor arithmetic improvement to mmvq wrapper kernel (#7172)  Ouadie EL FAROUKI
2024-05-10  eval-callback : fix conversion to float (#7184)  slaren
2024-05-09  Vulkan Bugfixes and Improvements (#7084)  0cc4m
* Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation * Further work towards MoE, disabled for now * Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code * Add softmax with f16 mask and pos buffer support * Disable mul_mat_id shaders for now * Fix flake8 * Fix validation errors caused by empty buffers on larger batch sizes
2024-05-09  readme : add scheduled server workflow status badge  Georgi Gerganov
2024-05-09  readme : add app (#6371)  l3utterfly
* added Layla to supported UIs * Update README.md
2024-05-09  llama3 custom regex split (#6965)  jaime-m-p
* merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * llama3 custom regex split * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * Using char32_t for codepoints * lint : fix * already exists unicode_tolower() * Typing * Restore BOM * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh 
* Fix merge * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci * Move unused variable value * GPT2 custom regex split * Add alternative regex for custom split llama3 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Style * Add bruteforce random tests for token encoding * wip: fixing unicode codepoint ranges * Fix merge * Unicode tables: separator, lowercase, uppercase and whitespace * llama3 custom regex split: fix \s * Restore BOM * Style * wip: generate NDF table * Ignore special tokens for testing * Clean gen-unicode-data.py * Refactor random tokenizer test * lint : fix * tests : add fail test for llama-bpe --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: jaime-m-p <>
2024-05-09  CUDA: generalize FP16 fattn vec kernel (#7061)  Johannes Gäßler
* CUDA: generalize FP16 fattn vec kernel * disable unsupported head sizes for AMD in test * try AMD fix * fix batch size 2-8 * partially revert changes
2024-05-09  Add warning if token is invalid (#7173)  Galunid
2024-05-09  llama : update llama_timings.n_p_eval setting (#7160)  Daniel Bevenius
This commit changes the value assigned to llama_timings.n_p_eval when ctx->n_p_eval is 0 to be 0 instead of 1, which is the current value. The motivation for this change is that if session caching is enabled, for example using the `--prompt-cache main-session.txt` command line argument for the main example, and if the same prompt is used, then on subsequent runs the prompt tokens will not actually be passed to llama_decode, and n_p_eval will not be updated by llama_synchronize. But the value of n_p_eval will be set to 1 by llama_get_timings because ctx->n_p_eval will be 0. This could be interpreted as 1 token having been evaluated for the prompt, which could be misleading for applications using this value. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
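The effect on a fully cached prompt can be sketched as follows. This assumes the reported value is produced by clamping the context counter to a lower bound; the commit message only states the resulting values, so the clamp itself and the name `reported_n_p_eval` are illustrative assumptions, not the llama.cpp code.

```python
def reported_n_p_eval(ctx_n_p_eval: int, lower_bound: int) -> int:
    """Value exposed to callers for the number of evaluated prompt tokens.

    lower_bound models the floor applied when reporting (assumed mechanism):
    1 for the old behaviour, 0 for the behaviour after this commit.
    """
    return max(lower_bound, ctx_n_p_eval)


# Prompt served entirely from the session cache: nothing was decoded.
cached_prompt_tokens = 0
old_report = reported_n_p_eval(cached_prompt_tokens, lower_bound=1)  # reports one token
new_report = reported_n_p_eval(cached_prompt_tokens, lower_bound=0)  # reports zero tokens
```

A caller that divides by this count (for example, to compute time per token) would then need to handle zero explicitly.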
2024-05-09  gguf-py : add special token modification capability (#7166)  Sigbjørn Skjæret
* Add special token modification capability

To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"
```
(yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled))
or
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
```
etc...

NB: The tokens have to exist already; trying to add non-existent token names/IDs will be ignored (with a warning), while non-existent values will fail (with an error).

* improve help text * flake-- * fix multiple tokens warning * make script executable * switch to namedtuple, no need to dataclass * typing++ * add progress bar * fail on invalid token id
2024-05-09  opencl : alignment size converted from bits to bytes (#7090)  Albert Jin
* opencl alignment size should be converted from bits to bytes Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN > Alignment requirement (in bits) for sub-buffer offsets. * Update ggml-opencl.cpp for readability using division instead of shift Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
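The unit conversion at issue can be sketched in a few lines. The function names are hypothetical and this is not the code in ggml-opencl.cpp; it only shows the arithmetic implied by the specification text quoted above, where `CL_DEVICE_MEM_BASE_ADDR_ALIGN` is reported in bits while buffer offsets are counted in bytes.

```python
def alignment_bits_to_bytes(align_bits: int) -> int:
    """Convert an alignment reported in bits into bytes.

    Division is used instead of a shift, for readability.
    """
    if align_bits <= 0 or align_bits % 8 != 0:
        raise ValueError("alignment in bits must be a positive multiple of 8")
    return align_bits // 8


def align_up(offset_bytes: int, alignment_bytes: int) -> int:
    """Round a byte offset up to the next multiple of the alignment."""
    return (offset_bytes + alignment_bytes - 1) // alignment_bytes * alignment_bytes
```

For a device reporting 1024 bits, sub-buffer offsets need to be multiples of 128 bytes; treating 1024 as a byte count would over-align every offset by a factor of eight.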
2024-05-09  TypoFix (#7162)  Ahmet Zeer
2024-05-08  cmake : fix typo (#7151)  Jared Van Bortel
2024-05-08  convert-hf : save memory with lazy evaluation (#7075)  compilade
* convert-hf : begin refactoring write_tensor * convert : upgrade to sentencepiece v0.2.0 * convert-hf : remove unused n_dims in extra_*_tensors * convert-hf : simplify MoE weights stacking * convert-hf : flake8 linter doesn't like semicolons * convert-hf : allow unusual model part names For example, loading `model-00001-of-00001.safetensors` now works. * convert-hf : fix stacking MoE expert tensors `torch.stack` and `torch.cat` don't do the same thing. * convert-hf : fix Mamba conversion Tested to work even with a SentencePiece-based tokenizer. * convert : use a string for the SentencePiece tokenizer path * convert-hf : display tensor shape * convert-hf : convert norms to f32 by default * convert-hf : sort model part names `os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors". * convert-hf : use an ABC for Model again It seems Protocol can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered Model subclasses. * convert-hf : use a plain class for Model, and forbid direct instantiation There are no abstract methods used anyway, so using ABC isn't really necessary. 
* convert-hf : more consistent formatting of cmdline args * convert-hf : align the message logged for converted tensors * convert-hf : fix Refact conversion * convert-hf : save memory with lazy evaluation * convert-hf : flake8 doesn't like lowercase L as a variable name * convert-hf : remove einops requirement for InternLM2 * convert-hf : faster model parts loading Instead of pre-loading them all into a dict, iterate on the tensors in the model parts progressively as needed in Model.write_tensors Conversion for some architectures relies on checking for the presence of specific tensor names, so for multi-part models, the weight map is read from the relevant json file to quickly get these names up-front. * convert-hf : minor changes for consistency * gguf-py : add tqdm as a dependency It's small, and used for a progress bar in GGUFWriter.write_tensors_to_file
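Two of the ideas in this commit, sorting model part names and deferring tensor work until it is needed, can be sketched as below. This is a simplified illustration and not the convert-hf implementation: `sorted_part_names` and `LazyTensor` are hypothetical names, and plain Python lists stand in for tensors.

```python
from typing import Any, Callable, List, Optional


def sorted_part_names(names: List[str]) -> List[str]:
    """Return the model part file names in a deterministic order.

    Directory listings come back in arbitrary order; zero-padded part
    numbers sort correctly as plain strings.
    """
    return sorted(n for n in names if n.endswith(".safetensors"))


class LazyTensor:
    """Records a loader and pending operations; runs them only on demand."""

    def __init__(self, load: Callable[[], Any],
                 ops: Optional[List[Callable[[Any], Any]]] = None):
        self._load = load
        self._ops = list(ops or [])

    def map(self, fn: Callable[[Any], Any]) -> "LazyTensor":
        # No data is touched here; the operation is only queued.
        return LazyTensor(self._load, self._ops + [fn])

    def materialize(self) -> Any:
        # Load the data and apply the queued operations in order.
        value = self._load()
        for op in self._ops:
            value = op(value)
        return value
```

Because each value is produced only when `materialize()` is called, a writer can process one tensor at a time instead of holding every converted tensor in memory at once.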
2024-05-08  Introduction of CUDA Graphs to LLama.cpp (#6766)  agray3
* DRAFT: Introduction of CUDA Graphs to LLama.cpp * Fix issues raised in comments * Tidied to now only use CUDA runtime (not mixed with driver calls) * disable for multi-gpu and batch size > 1 * Disable CUDA graphs for old GPU arch and with env var * added missing CUDA_CHECKs * Addressed comments * further addressed comments * limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake * Added more comprehensive graph node checking * With mechanism to fall back if graph capture fails * Revert "With mechanism to fall back if graph capture fails" This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143. * Fall back if graph capture fails and address other comments * - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS - rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS - updated Makefile build to enable CUDA graphs - removed graph capture failure checking in ggml_cuda_error using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string if this is necessary to workaround some issues with graph capture with e.g.
cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context - fixed several resource leaks - fixed issue with zero node graphs - changed fixed size arrays to vectors - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed - removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX - code style fixes - things to look into - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes * fix build without cuda graphs * remove outdated comment * replace minimum cc value with a constant --------- Co-authored-by: slaren <slarengh@gmail.com>
2024-05-08  JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143)  Johannes Gäßler
2024-05-08  Revert "llava : add support for moondream vision language model (#6899)"  Georgi Gerganov
This reverts commit 46e12c4692a37bdd31a0432fc5153d7d22bc7f72.
2024-05-08  server : add themes + favicon (#6848)  JohnnyB
* Added themes support with two sample themes and a favicon. * Newline * Newline * Newline * Trailing whitespace * Increased opacity for contrast * Increase opacity. Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY * Opacity action trigger. Trying to re-trigger the cancelled action. * One more opacity adjustment This Actions pipeline is failing for random issues. * Delete examples/server/themes/buttons_top/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Delete examples/server/themes/wild/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Replaced underscore.
2024-05-08  metal : use `vm_allocate` instead of `posix_memalign` on macOS (#7078)  Gilad S
* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron processes * fix: typo * fix: use `vm_allocate` instead of `posix_memalign` * fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL` * fix: use `vm_allocate` only on macOS
2024-05-08  main : add --conversation / -cnv flag (#7108)  Dawid Potocki
2024-05-08  sgemm : AVX Q4_0 and Q8_0 (#6891)  Eve
* basic avx implementation * style * combine denibble with load * reduce 256 to 128 (and back!) conversions * sse load * Update sgemm.cpp * oops oops
2024-05-08  server : add_special option for tokenize endpoint (#7059)  Johan
2024-05-08  convert.py : --vocab-only generates false but valid params (#7027)  20kdc
An example of how this might be used in the style of baby-llama will be attached with this PR.
2024-05-08  llama : add BPE pre-tokenization for Qwen2 (#7114)  Ren Xuancheng
* Add BPE pre-tokenization for Qwen2. * minor : fixes --------- Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-08  clean up json_value & server_log (#7142)  Xuan Son Nguyen
2024-05-08  convert : add BPE pre-tokenization for DBRX (#7132)  DAN™
* Add BPE pre-tokenization for DBRX. * Add vocab GGUFs. * Remove test. * Remove GGUFs.
2024-05-08  py : also print the normalizers  Georgi Gerganov
2024-05-08  compare-llama-bench.py: add missing basicConfig (#7138)  Brian
* compare-llama-bench.py: add missing basicConfig * compare-llama-bench.py: Add line break between error message and print_help() * Add regular print() markdown table
2024-05-08  ggml : introduce bfloat16 support (#6412)  Justine Tunney
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format.

       ┌sign
       │
       │   ┌exponent
       │   │
       │   │      ┌mantissa
       │   │      │
       │┌──┴───┐┌─┴───┐
     0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left.

       ┌sign
       │
       │   ┌exponent
       │   │
       │   │      ┌mantissa
       │   │      │
       │┌──┴───┐┌─┴───────────────────┐
     0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16, which in practice ends up being 99.71% of Mistral 7b v0.2's weights; however, there is currently no way other than fp32 to get the others.

       ┌sign
       │
       │  ┌exponent
       │  │
       │  │    ┌mantissa
       │  │    │
       │┌─┴─┐┌─┴──────┐
     0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16.

* Remove GGML code that's not needed * Minimize the GGML API surface area for BF16 * Remove bf16 luts * Make the GGML header look nicer * Fix documentation * Apply ggerganov's fixes for test-backend-ops * Add BF16 code for new ggml_validate_row_data() function
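The 16-bit shift described in this commit message can be tried out directly. The sketch below uses Python's `struct` module to reinterpret bit patterns; the function names are illustrative and are not GGML's API. The reverse direction shown here simply truncates the low 16 bits, which is an assumption made for brevity; a production conversion would normally round to nearest even and treat NaN specially.

```python
import struct


def bf16_to_fp32(bits: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by shifting it left 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp32_to_bf16_truncate(x: float) -> int:
    """Narrow a float32 value to a bfloat16 bit pattern by dropping the low 16 bits.

    Truncation only (assumption for this sketch); no rounding, no NaN handling.
    """
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    return u >> 16
```

Widening is exact because the sign and all eight exponent bits carry over unchanged and the extra mantissa bits are filled with zeros.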
2024-05-08  metal : fix unused warning  Georgi Gerganov
2024-05-08  Further tidy on Android instructions README.md (#7077)  Jeximo
* Further tidy on Android instructions README.md Fixed some logic when following readme direction * Clean up redundant information A new user arriving will see simple directions on llama.cpp homepage * corrected punctuation Period after cmake, colon after termux * re-word for clarity method seems to be more correct, instead of alternative in this context * Organized required packages per build type building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux. * README.md corrected title * fix trailing whitespace
2024-05-08  Fixed save_imatrix to match old behaviour for MoE (#7099)  jukofyork
* Fixed save_imatrix to match old behaviour for MoE This fix is simple and clear, but unnecessarily doubles the memory overhead.. * Fixed missing idx variable * Unconditionally increment ncall Co-authored-by: slaren <slarengh@gmail.com> * Fixed 2 bugs in save_imatrix() - Fixed segfault bug because the counts vector needed to be created. - Fixed pre-existing bug didn't actually add to the counts for "--combine" option. * ncall needs summing too * Trailing whitespace --------- Co-authored-by: slaren <slarengh@gmail.com>
2024-05-07  server: fix incorrectly reported token probabilities (#7125)  Johannes Gäßler
* server: normalize token probabilities * fix temperature == 0.0f
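The commit message does not describe the server code itself, so the following is only a sketch of what normalizing token probabilities might involve. It assumes a softmax over candidate logits and treats a temperature of zero as greedy selection; `token_probs` is a hypothetical function, not part of the server.

```python
import math
from typing import List


def token_probs(logits: List[float], temperature: float) -> List[float]:
    """Turn candidate logits into probabilities that sum to 1.

    Assumption: temperature <= 0 means greedy decoding, so the best
    candidate gets probability 1 and no division by zero takes place.
    """
    if not logits:
        return []
    if temperature <= 0.0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((value - peak) / temperature) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Handling the zero-temperature case separately avoids dividing the logits by zero while still returning a valid distribution.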
2024-05-07  Fix OLMo HF to GGUF conversion (#6910)  nopperl
2024-05-07  server : update readme with undocumented options (#7013)  Kyle Mistele
2024-05-07  readme : update hot topics  Georgi Gerganov
2024-05-07  main : update log text (EOS to EOG) (#7104)  RhinoDevel
* Update log text (EOS to EOG) The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT. * Improve log msg. further by using "an" instead of "some". As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
2024-05-07  docs: fix typos (#7124)  omahs
* fix typo * fix typos * fix typo * fix typos * fix typo * fix typos
2024-05-07  ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)  Georgi Gerganov
* ci : add GG_BUILD_EXTRA_TESTS_0 env ggml-ci * Update run.sh ggml-ci
2024-05-06  Add an option to build without CUDA VMM (#7067)  William Tambellini
Add an option to build ggml cuda without CUDA VMM resolves https://github.com/ggerganov/llama.cpp/issues/6889 https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-06  flake.lock: Update (#7079)  Georgi Gerganov
Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01) → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29) → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25) → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>