Age  Commit message  Author
2024-06-19  codecov : remove (#8004)  [Georgi Gerganov]
2024-06-19  [SYCL] refactor (#6408)  [Meng, Hengyu]
* separate lower precision GEMM from the main files
* fix hardcoded workgroup size
2024-06-18  tokenizer : BPE fixes (#7530)  [jaime-m-p]
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18  Only use FIM middle token if it exists (#7648)  [Sigbjørn Skjæret]
* Only use FIM middle if it exists
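For illustration only, a minimal sketch of that existence check, assuming the llama.h special-token getters of this period (llama_token_prefix/middle/suffix) return -1 when the vocab does not define the token; this is not the PR's actual code:

```cpp
// Minimal sketch (not the PR's code): only emit FIM special tokens when the
// model's vocab actually defines them; -1 is assumed to mean "not present".
#include "llama.h"
#include <vector>

static void append_infill_tokens(const llama_model * model,
                                 const std::vector<llama_token> & prefix,
                                 const std::vector<llama_token> & suffix,
                                 std::vector<llama_token> & out) {
    const llama_token tok_pre = llama_token_prefix(model);
    const llama_token tok_suf = llama_token_suffix(model);
    const llama_token tok_mid = llama_token_middle(model);

    if (tok_pre >= 0) out.push_back(tok_pre);
    out.insert(out.end(), prefix.begin(), prefix.end());
    if (tok_suf >= 0) out.push_back(tok_suf);
    out.insert(out.end(), suffix.begin(), suffix.end());
    if (tok_mid >= 0) out.push_back(tok_mid); // skip when the vocab has no FIM middle token
}
```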
2024-06-18  Fix no gcc pragma on Windows (#7751)  [jojorne]
2024-06-18  Allow compiling with CUDA without CUDA runtime installed (#7989)  [Ulrich Drepper]
On hosts that are not prepared/dedicated to execute CUDA code, it is still possible to compile llama.cpp with CUDA support by installing just the development packages. What is missing are the runtime libraries such as /usr/lib64/libcuda.so*, so the link step currently fails. The development environment is prepared for such situations: stub libraries for all the CUDA libraries are available in the $(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of the library search path changes nothing for environments that already work, but enables compiling llama.cpp even when the runtime libraries are not available.
2024-06-18  chore: clean useless beam search param (#7985)  [Frank Mai]
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-18  readme : update UI list (#7943)  [Abheek Gulati]
2024-06-18  ggml : sync  [Georgi Gerganov]
2024-06-18  whisper : use ggml_backend_sched (whisper/2239)  [Georgi Gerganov]
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-17  update: support Qwen2-57B-A14B (#7835)  [Ștefan-Gabriel Muscalu]
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B
* fix: QWEN2MOE support for expert_feed_forward_length; previously, expert ff was taken from n_ff (intermediate size), but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH; n_ff_exp and n_ff_shexp are now properly calculated
* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
2024-06-17  Make updates to type cast based on compiler instead of OS (#7851)  [Srihari-mcw]
2024-06-17  llama : disable FA if KV head sizes do not match (#7982)  [Georgi Gerganov]
2024-06-17  Add Nix and Flox install instructions (#7899)  [Bryan Honof]
2024-06-17  sched : offload_op also requires supports_op (#7977)  [slaren]
2024-06-17  fix: divide-by-zero exception in mamba (#7932)  [Frank Mai]
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-17  Implement non-mapped async IO for CUDA on Windows (#7896)  [Markus Tavenrath]
* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x, while it should be the same (or slightly faster) on any other drive.
* Free resources except for backend.
* Change assertions to exceptions in llama_file, find correct CUDA backend to create CUDA resources and respect the use_mmap flag again for CUDA.
* Apply suggestions from code review
* Fix editorconfig and unused variable
* Fix issues with Windows build
Co-authored-by: slaren <slarengh@gmail.com>
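As background, a generic sketch of non-mapped asynchronous reads on Windows via overlapped I/O, the mechanism this change builds on; a simplified illustration, not the PR's implementation (file name and chunk size are arbitrary):

```cpp
// Generic sketch of overlapped (async, non-mmap) reads on Windows.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE f = CreateFileA("model.gguf", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (f == INVALID_HANDLE_VALUE) { std::fprintf(stderr, "open failed\n"); return 1; }

    std::vector<char> buf(1 << 20);             // 1 MiB chunk
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
    ov.Offset = 0;                              // read from the start of the file

    // Issue the read; it may complete asynchronously (ERROR_IO_PENDING).
    if (!ReadFile(f, buf.data(), (DWORD) buf.size(), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        std::fprintf(stderr, "read failed\n"); return 1;
    }

    // ... other work (e.g. uploading the previous chunk to the GPU) can overlap here ...

    DWORD n_read = 0;
    GetOverlappedResult(f, &ov, &n_read, TRUE); // wait for completion
    std::printf("read %lu bytes\n", (unsigned long) n_read);

    CloseHandle(ov.hEvent);
    CloseHandle(f);
    return 0;
}
```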
2024-06-17  rpc : fix load/store misaligned addresses (#7948)  [Georgi Gerganov]
2024-06-17  gguf-dump.py: add --markdown dump output (#7853)  [Brian]
* gguf-dump.py: add --markdown dump output
* gguf-dump.py: add ToC
* gguf-dump.py: use standard tensor name lookup; also add tensor ID field
* gguf-dump.py: add tensor overview count
* gguf-dump.py: fix array preview
* gguf-dump.py: markdownTableWithAlignmentSupport() added
* add type hints and spacing
* gguf-dump.py: prettify dimension
* gguf-dump.py: right-align element count
* gguf-dump.py: element count autosizing
* apply suggestions from code review
Co-authored-by: compilade <git@compilade.net>
2024-06-17  [SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (#7946)  [Neo Zhang]
* Update README-sycl.md
2024-06-17  Add support for sqrt on CUDA (#7953)  [Calvin Laurenson]
* cuda sqrt support
* enable cuda in pca
* fix comments in pca
* add test
* add sqrt to ggml_backend_cuda_supports_op
* fix test
* new line
* Use F32 sqrtf instead of F64 sqrt
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
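For context, a small CPU-side sketch of the op in question using ggml's public API (exact usage here is an assumption and heavily simplified); the point of the PR is that the same GGML_SQRT op is now also accepted and executed by the CUDA backend:

```cpp
// Tiny sketch: elementwise sqrt as a ggml graph op, computed on the CPU here.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = { 16 * 1024 * 1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_set_f32(x, 9.0f);                        // fill with 9.0

    struct ggml_tensor * y = ggml_sqrt(ctx, x);   // elementwise sqrt (F32 path)

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    std::printf("sqrt(9) = %f\n", ggml_get_f32_1d(y, 0)); // expect 3.0
    ggml_free(ctx);
    return 0;
}
```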
2024-06-16  cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)  [Georgi Gerganov]
* cuda : fix bounds check for src0 rows in MMVQ kernel
* Update ggml-cuda/mmvq.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-16  ggml : fix and optimize ppc64le (ggml/849)  [Hong Bo PENG]
* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16  ggml : remove duplicate include of ggml-common.h (ggml/853)  [Daniel Bevenius]
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16  flake.lock: Update (#7951)  [Georgi Gerganov]
2024-06-16  unicode : avoid char32_t (#7957)  [Georgi Gerganov]
ggml-ci
2024-06-16  readme : update UI list [no ci] (#7958)  [hopkins385]
2024-06-16  ggml : fix handling of zero blocks in IQ quants (#7955)  [Georgi Gerganov]
ggml-ci
2024-06-16  github : update PR template  [Georgi Gerganov]
2024-06-16  Vulkan Shader Refactor, Memory Debugging Option (#7947)  [0cc4m]
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
* Improve debug log code
* Add memory debug output option
* Fix flake8
* Fix unnecessary high llama-3 VRAM use
2024-06-15  Add `cvector-generator` example (#7514)  [Xuan Son Nguyen]
* add control-vector-generator
* calc diff
* add comments
* proof-of-concept stdlib implementation: implements PCA and file writing using mostly standard libraries; the output is recognized as a functional control vector, but outputs gibberish
* param parsing, refactor, comments: added basic command-line parameters for outfile and one each positive/negative prompt; refactored some messy code in PCA computation and GGUF exporting; left a bunch of comments regarding further work needed
* example template completions: implements an example template set built from the positive/negative prompts, like the control vector Python implementation
* add multi prompts, multi-thread for PCA
* fix mem error
* add debugs
* fix matrix transpose multiplication (you have got to be kidding me)
* preliminary template/multiprompt support: model is running out of context and that ought to be fixed (segfaulting), but other than that it looks goodish
* fix zero output & param parsing, functional templating: fixed a bug where the output file had no tensor data/was all zero; fixed a bug where single-hyphen flags were not being correctly parsed; implements creation of templated prompts from input (still need to adapt based on model)
* fix square_diff matmul index range and CRLF->LF line endings: fixed a logic error where square_diff would not multiply all rows; fixed a formatting error where the provided completions.txt had CRLF line endings
* add command-line args for num threads, num completions file lines, always reload model; refactored a few things and did what the commit message says on the tin
* code aestheticization
* fix compiler warnings
* in-series multithreading for prompt embedding? added commented-out code to attempt to start implementing multithreading for embedding in main
* remove unnecessary multithreading
* interim fix memory leak
* translated everything but PCA (I think)
* tentatively translate the rest
* fix ggml errors and make new ones (at least it compiles and runs)
* fix cb_eval
* temporary commit while I move dev environments: it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent
* update debug statements
* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped
* update comments
* (wip) refactor
* clean up PCA ggml implementation
* fix shape of v_diff_original
* add n_batch for pca
* working version
* remember to copy back the last_eigenvector
* fix n_completions
* bring back n_completions
* default n_pca_batch to 20
* fix macos build
* add to makefile all targets
* use ggml_format_name
* add readme
* fix .editorconfig
* use ggml_backend_tensor_copy
* attempt to fix compile problem on mac
* fix compile warn
* reuse allocr
* move param parser to common
* better error handling
* clean up a bit
* add print_usage
* shorten help msg
* beautify help msg
* escape prompt by default
* change compile target to llama-cvector-generator
* typo
* disable GPU for PCA
* code style
Co-authored-by: Christian Zhou-Zheng <christianzhouzheng@gmail.com>
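To illustrate the core idea (the example itself implements this with ggml), a standalone toy sketch of the PCA step: power iteration on A^T A, where A holds the positive-minus-negative hidden-state differences, recovers the dominant direction that is saved as the control vector. This is an independent illustration, not the example's actual implementation:

```cpp
// Power iteration on A^T A to find the principal component of the diff rows.
#include <cmath>
#include <cstdio>
#include <vector>

static std::vector<float> top_direction(const std::vector<std::vector<float>> & A, int iters = 100) {
    const size_t n_rows = A.size(), n_dim = A[0].size();
    std::vector<float> v(n_dim, 1.0f);
    for (int it = 0; it < iters; ++it) {
        // w = A^T (A v), then normalize
        std::vector<float> Av(n_rows, 0.0f), w(n_dim, 0.0f);
        for (size_t r = 0; r < n_rows; ++r)
            for (size_t c = 0; c < n_dim; ++c) Av[r] += A[r][c] * v[c];
        for (size_t r = 0; r < n_rows; ++r)
            for (size_t c = 0; c < n_dim; ++c) w[c] += A[r][c] * Av[r];
        float norm = 0.0f;
        for (float x : w) norm += x * x;
        norm = std::sqrt(norm);
        for (size_t c = 0; c < n_dim; ++c) v[c] = w[c] / norm;
    }
    return v; // unit vector along the principal component of the diffs
}

int main() {
    // two toy "diff" rows; the real example uses per-layer hidden-state diffs
    std::vector<std::vector<float>> A = {{2.0f, 0.1f, 0.0f}, {1.9f, -0.1f, 0.05f}};
    std::vector<float> v = top_direction(A);
    std::printf("direction: %.3f %.3f %.3f\n", v[0], v[1], v[2]);
}
```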
2024-06-15  [SYCL] remove global variables (#7710)  [Meng, Hengyu]
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
2024-06-14  ci : fix macos x86 build (#7940)  [olexiyb]
To keep the behavior of the old `macos-latest` runner, use `macos-12` explicitly. Potentially fixes: https://github.com/ggerganov/llama.cpp/issues/6975
2024-06-14  CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)  [Johannes Gäßler]
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* fix data race
* revert q2_K precision-related changes
2024-06-14  metal : utilize max shared memory for mul_mat_id (#7935)  [Georgi Gerganov]
2024-06-14  llama-bench : fix RPC indication (#7936)  [Radoslav Gerganov]
Show "<backend_name>+RPC" when RPC offloading is used
2024-06-14  llama : more checks before assuming FIM tokens (#7644)  [Sigbjørn Skjæret]
* More checks before assuming FIM tokens for Llama arch
* extensive token check
2024-06-14  convert : add Poro-34B-chat tokenizer support (#7713)  [Elaine]
* support for Poro chat pre-tokenizer
* add support for Poro pre-tokenizer
* Update convert-hf-to-gguf-update.py
* Change Poro-34B-chat to poro-chat
* Update convert-hf-to-gguf-update.py
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13  rpc : fix ggml_backend_rpc_supports_buft() (#7918)  [Radoslav Gerganov]
2024-06-13  readme : Remove outdated instructions from README.md (#7914) [no ci]  [Galunid]
2024-06-13  move BLAS to a separate backend (#6210)  [slaren]
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when the GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type. This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13  `build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)  [Olivier Chafik]
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server; gitignore llama-server
* server: simplify nix package
* main: update refs -> llama; fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc" (this reverts commit e474ef1df481fd8936cd7d098e3065d7de378930)
* add hot topic notice to README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-12  CUDA: fix broken oob check for FA vec f32 kernel (#7904)  [Johannes Gäßler]
2024-06-12  tests : add non-cont unary tests (#7857)  [Georgi Gerganov]
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
2024-06-12  ggml : improve ggml_is_contiguous logic (#7856)  [Georgi Gerganov]
* ggml : improve ggml_is_contiguous logic
* ggml : support more contiguous cases
ggml-ci
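For reference, the contiguity notion involved here, sketched with a simplified stand-in struct rather than ggml's real ggml_tensor (which also accounts for blocked quantized types): a tensor is contiguous when each stride equals the previous stride times the previous extent, i.e. the data is packed row-major with no padding.

```cpp
// Toy contiguity check over ne (extents) and nb (byte strides), ggml-style.
#include <cstdint>
#include <cstdio>

struct toy_tensor {
    int64_t ne[4];    // number of elements per dimension
    size_t  nb[4];    // stride in bytes per dimension
    size_t  type_size;
};

static bool toy_is_contiguous(const toy_tensor & t) {
    if (t.nb[0] != t.type_size) return false;  // elements packed in dim 0
    for (int i = 1; i < 4; ++i) {
        if (t.nb[i] != t.nb[i - 1] * (size_t) t.ne[i - 1]) return false;
    }
    return true;
}

int main() {
    // a packed 4x3x3 f32 tensor: nb = {4, 16, 48, 144}
    toy_tensor a = {{4, 3, 3, 1}, {4, 16, 48, 144}, 4};
    // a view taking every other row: dim-1 stride doubled, so not contiguous
    toy_tensor b = {{4, 2, 1, 1}, {4, 32, 64, 64}, 4};
    std::printf("a contiguous: %d, b contiguous: %d\n", toy_is_contiguous(a), toy_is_contiguous(b));
}
```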
2024-06-12  server : restore numeric prompts (#7883)  [Georgi Gerganov]
2024-06-12  update intel docker oneapi-basekit to 2024.1.1-devel-ubuntu22.04 (#7894)  [Meng, Hengyu]
This also reverts the workaround for the upstream issue with expired Intel GPG package keys in 2024.0.1-devel-ubuntu22.04.
2024-06-12  Fix a typo and add Fedora 40 packages to install for Vulkan (#7794) [no ci]  [Patrice Ferlet]
Fix "appropiate" to "appropriate" and add the Fedora 40 packages needed to compile with Vulkan support.
2024-06-11  vulkan: select only one device for single gpu with multiple drivers (#7582)  [k.h.lai]
2024-06-11  Update Vulkan RoPE implementation (#7818)  [0cc4m]
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
* Minor fixes
* Fix segfault when running out of VRAM
Co-authored-by: slaren <slarengh@gmail.com>
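The alloc_buffer change follows a common pattern: report out-of-memory by returning nullptr so the caller can back off (e.g. retry with a smaller batch), rather than throwing. A generic sketch with hypothetical names (my_backend_*), not the Vulkan backend's actual code:

```cpp
// Generic "return nullptr instead of throwing" pattern for a buffer allocator.
#include <cstdio>
#include <new>

struct my_backend_buffer { void * data; size_t size; };

static my_backend_buffer * my_backend_alloc_buffer(size_t size) {
    void * data = ::operator new(size, std::nothrow);  // stand-in for a device allocation
    if (data == nullptr) {
        return nullptr;                                // caller treats this as out-of-memory
    }
    return new my_backend_buffer{data, size};
}

int main() {
    if (my_backend_buffer * buf = my_backend_alloc_buffer(1 << 20)) {
        std::printf("allocated %zu bytes\n", buf->size);
        ::operator delete(buf->data);
        delete buf;
    } else {
        std::printf("allocation failed, fall back to a smaller request\n");
    }
}
```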