2024-06-22  iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)  (Iwan Kawrakow)
    We get 2.2X for PP-512 (52 t/s).
2024-06-22  iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)  (Iwan Kawrakow)
    We get only 2.07X for PP-512, reaching 31 t/s, so iq2_s remains slow.
2024-06-22  Add Q8_0  (Iwan Kawrakow)
2024-06-22  Cosmetics  (Iwan Kawrakow)
2024-06-22  iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)  (Iwan Kawrakow)
    We get ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22  iqk_mul_mat: faster q3_K TG  (Iwan Kawrakow)
    We get 31 t/s, up from 26 t/s, but we need to treat PP differently from TG, else we get a ~10% drop in PP performance.
2024-06-22  iqk_mul_mat for llama.cpp  (Iwan Kawrakow)
2024-06-21  JSON Schema to GBNF integration tests (#7790)  (Clint Herron)
    * Add a bare-bones end-to-end integration test for JSON validation against auto-generated JSON-schema grammars.
    * Add additional examples as documented in #7789. Also add the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
    * Uncomment formerly commented tests so that they fail for others attempting to reproduce the bugs.
    * Merge improved schema test methods added by @ochafik in #7797.
    * Add a #define to temporarily remove failing tests so that this PR can pass CI while still being useful for other PRs that want to leverage the framework.
    * Fix nits from ochafik: remove escape slashes, add additional failing cases, fix some other strings.
    * Fix grammar indentation to be consistent throughout the file.
2024-06-21  vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)  (k.h.lai)
    * vulkan: detect multiple devices by deviceUUID instead of deviceID
    * vulkan: remove unneeded variables
    * vulkan: fix id query
2024-06-21  ggml : AVX IQ quants (#7845)  (Eve)
    * initial iq4_xs
    * fix ci
    * iq4_nl
    * iq1_m
    * iq1_s
    * iq2_xxs
    * iq3_xxs
    * iq2_s
    * iq2_xs
    * iq3_s before sllv
    * iq3_s
    * iq3_s small fix
    * iq3_s sllv can be safely replaced with sse multiply
2024-06-21  llama : optimize long word tokenization with WPM (#8034)  (Georgi Gerganov)
    ggml-ci
2024-06-21  llama : allow pooled embeddings on any model (#7477)  (Douglas Hanley)
    * create append_pooling operation; allow specifying attention_type; add last-token pooling; update examples
    * find result_norm/result_embd tensors properly; update output allocation logic
    * only use embd output for pooling_type NONE
    * get rid of old causal_attn accessor
    * take out attention_type; add in llama_set_embeddings
    * bypass logits when doing non-NONE pooling
2024-06-21  swiftui : enable stream updating (#7754)  (Shuichi Tsutsumi)
2024-06-20  requirements : Bump torch and numpy for python3.12 (#8041)  (Hamdoud Hakem)
2024-06-20  convert-hf : Fix the encoding in convert-hf-to-gguf-update.py (#8040)  (Hamdoud Hakem)
2024-06-20  common: fix warning (#8036)  (Johannes Gäßler)
    * common: fix warning
    * Update common/common.cpp
    Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20  [SYCL] Fix windows build and inference (#8003)  (luoyu-intel)
    * add sycl preset
    * fix debug link error; fix windows crash
    * update README
2024-06-20  CUDA: stream-k decomposition for MMQ (#8018)  (Johannes Gäßler)
    * CUDA: stream-k decomposition for MMQ
    * fix undefined memory reads for small matrices
2024-06-20  metal : fix `ggml_metal_supports_op` for BF16 (#8021)  (Michael de Gans)
    Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether the first few source types are BF16 and returns false if that's the case.
2024-06-20  server : fix smart slot selection (#8020)  (sasha0552)
2024-06-19  un-ignore `build-info.cmake` and `build-info.sh` (#7996)  (Michael de Gans)
    * un-ignore `build-info.cmake` and `build-info.sh`
      I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent for the purpose of publishing, even if they are committed, and the build fails in such cases.
    * un-ignore `build-info.cpp.in`, for the same reason as the previous two files
    * Reorganize `.gitignore`
    * Add exceptions for files mentioned by @slaren (I did leave .clang-tidy since it was explicitly ignored before)
    * Add comments for organization
    * Sort some lines for pretty
    * Test with `make` and `cmake` builds to ensure no build artifacts might be committed
    * Remove `.clang-tidy` from `.gitignore`, per comment by @ggerganov
    * Remove `IDEWorkspaceChecks.plist` from the root-level `.gitignore`
2024-06-19  ggml : synchronize threads using barriers (#7993)  (slaren)
2024-06-19  codecov : remove (#8004)  (Georgi Gerganov)
2024-06-19  [SYCL] refactor (#6408)  (Meng, Hengyu)
    * separate lower-precision GEMM from the main files
    * fix workgroup size hardcode
2024-06-18  tokenizer : BPE fixes (#7530)  (jaime-m-p)
    * Random test: add_bos_token, add_eos_token
    * Random test: add BPE models for testing
    * Custom regex split fails with codepoint 0
    * Fix falcon punctuation regex
    * Refactor llm_tokenizer_bpe: move code to constructor
    * Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
    * Move tokenizer flags to vocab structure
    * Default values for special_add_bos/eos
    * Build vocab.special_tokens_cache using vocab token types
    * Generalize 'jina-v2' per-token attributes
    * Fix unicode whitespaces (deepseek-coder, deepseek-llm)
    * Skip missing byte tokens (falcon)
    * Better unicode data generation
    * Replace char32_t with uint32_t
2024-06-18  Only use FIM middle token if it exists (#7648)  (Sigbjørn Skjæret)
    * Only use FIM middle if it exists
2024-06-18  Fix no gcc pragma on Windows (#7751)  (jojorne)
2024-06-18  Allow compiling with CUDA without CUDA runtime installed (#7989)  (Ulrich Drepper)
    On hosts that are not prepared/dedicated to execute CUDA code, it is still possible to compile llama.cpp with CUDA support by just installing the development packages. What is missing are the runtime libraries like /usr/lib64/libcuda.so*, so the link step currently fails. The development environment is prepared for such situations: stub libraries for all the CUDA libraries are available in the $(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of the search path changes nothing for environments that already work, but enables compiling llama.cpp even when the runtime libraries are not available.
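As a hedged sketch of the idea above, the link flags might look like the following; `CUDA_PATH`, the variable name, and the library list are illustrative assumptions, not the project's actual Makefile:

```shell
# Append the CUDA stub-library directory *after* the normal lib dir, so the
# stubs are only found when the real runtime (e.g. libcuda.so) is absent.
CUDA_PATH="${CUDA_PATH:-/usr/local/cuda}"
LDFLAGS="-L${CUDA_PATH}/lib64 -L${CUDA_PATH}/lib64/stubs -lcuda -lcublas"
echo "${LDFLAGS}"
```

Because `-L` directories are searched in order, machines with the runtime installed resolve the real libraries first and see no behavior change; a binary linked against the stubs still needs the real runtime present to run.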
2024-06-18  chore: clean useless beam search param (#7985)  (Frank Mai)
    Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-18  readme : update UI list (#7943)  (Abheek Gulati)
2024-06-18  ggml : sync  (Georgi Gerganov)
2024-06-18  whisper : use ggml_backend_sched (whisper/2239)  (Georgi Gerganov)
    * whisper : use ggml_backend_sched (wip)
    * use sched in whisper_allocr
    * whisper : single backend in whisper_context
    * whisper : remove whisper_state->backends_used
    * whisper : remove whisper_context->backend
    * whisper : reset scheduler after init
    * whisper : fix external encoder (e.g. CoreML)
    * whisper : cleanup
    * whisper : handle null GPU buffer types + fix sycl
    Co-authored-by: slaren <slarengh@gmail.com>
2024-06-17  update: support Qwen2-57B-A14B (#7835)  (Ștefan-Gabriel Muscalu)
    * update: convert-hf-to-gguf.py to support Qwen2-57B-A14B
    * fix: QWEN2MOE support for expert_feed_forward_length
      Previously, expert ff was taken from n_ff (intermediate size), but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH; n_ff_exp and n_ff_shexp are now properly calculated.
    * update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
2024-06-17  Make updates to type cast based on compiler instead of OS (#7851)  (Srihari-mcw)
2024-06-17  llama : disable FA if KV head sizes do not match (#7982)  (Georgi Gerganov)
2024-06-17  Add Nix and Flox install instructions (#7899)  (Bryan Honof)
2024-06-17  sched : offload_op also requires supports_op (#7977)  (slaren)
2024-06-17  fix: divide-by-0 exception in mamba (#7932)  (Frank Mai)
    Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-17  Implement non-mapped async IO for CUDA on Windows (#7896)  (Markus Tavenrath)
    * Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x, while it should be the same (or slightly faster) on any other drive.
    * Free resources except for backend.
    * Change assertions to exceptions in llama_file, find the correct CUDA backend to create CUDA resources, and respect the use_mmap flag again for CUDA.
    * Apply suggestions from code review
    * Fix editorconfig and unused variable
    * Fix issues with Windows build
    Co-authored-by: slaren <slarengh@gmail.com>
2024-06-17  rpc : fix load/store misaligned addresses (#7948)  (Georgi Gerganov)
2024-06-17  gguf-dump.py: add --markdown dump output (#7853)  (Brian)
    * gguf-dump.py: add --markdown dump output
    * gguf-dump.py: add toc
    * gguf-dump.py: use standard tensor name lookup; also add tensor ID field
    * gguf-dump.py: add tensor overview count
    * gguf-dump.py: fix array preview
    * gguf-dump.py: add markdownTableWithAlignmentSupport()
    * Add type hints and spacing
    * gguf-dump.py: prettify dimension
    * gguf-dump.py: right-align element count
    * gguf-dump.py: element count autosizing
    * Apply suggestions from code review
    Co-authored-by: compilade <git@compilade.net>
2024-06-17  [SYCL] Update README-sycl.md for chapters "Recommended release" and "News" (#7946)  (Neo Zhang)
    * Update README-sycl.md
2024-06-17  Add support for sqrt on CUDA (#7953)  (Calvin Laurenson)
    * cuda sqrt support
    * enable cuda in pca
    * fix comments in pca
    * add test
    * add sqrt to ggml_backend_cuda_supports_op
    * fix test
    * new line
    * Use F32 sqrtf instead of F64 sqrt
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-16  cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)  (Georgi Gerganov)
    * cuda : fix bounds check for src0 rows in MMVQ kernel
    * Update ggml-cuda/mmvq.cu
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-16  ggml : fix and optimize ppc64le (ggml/849)  (Hong Bo PENG)
    * fix compile issues introduced by loongarch_asx
    * restore quant changes to merge
    * fix compile issues introduced by loongarch_asx
    * further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16  ggml : remove duplicate include of ggml-common.h (ggml/853)  (Daniel Bevenius)
    Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16  flake.lock: Update (#7951)  (Georgi Gerganov)
2024-06-16  unicode : avoid char32_t (#7957)  (Georgi Gerganov)
    ggml-ci
2024-06-16  readme : update UI list [no ci] (#7958)  (hopkins385)
2024-06-16  ggml : fix handling of zero blocks in IQ quants (#7955)  (Georgi Gerganov)
    ggml-ci