Age  Commit message  Author
2024-02-11  Add support for BERT embedding models (#5423)  (Douglas Hanley)
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
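A minimal usage sketch for the embedding output (the host, port, and exact /embedding payload shape are assumptions, not part of the commit): with a converted BERT GGUF loaded in the llama.cpp server, embeddings can be fetched over HTTP and compared.
    # Sketch: fetch embeddings for two sentences from an assumed local server
    # and compare them with cosine similarity.
    import math
    import requests

    def embed(text: str) -> list[float]:
        # Assumed request/response shape: {"content": ...} -> {"embedding": [...]}
        r = requests.post("http://localhost:8080/embedding", json={"content": text})
        r.raise_for_status()
        return r.json()["embedding"]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    print(cosine(embed("llamas eat grass"), embed("alpacas eat grass")))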
2024-02-11  flake.lock: Update  (github-actions[bot])
Flake lock file updates:
• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
2024-02-11  vulkan: only use M-sized matmul on Apple GPUs (#5412)  (Sergio López)
* vulkan: refactor guess_matmul_pipeline for vendor
  Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals.
* vulkan: only use M-sized matmul on Apple GPUs
  L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor.
Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-02-11  common : use enums for sampler types (#5418)  (Alexey Parfenov)
* common: use enums for sampler types
* Apply suggestions from code review
* minor : spaces
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11  server : allow to specify tokens as strings in logit_bias (#5003)  (Alexey Parfenov)
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
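An illustrative request sketch (the payload shape is an assumption based on the description above; host and port are placeholders): a logit_bias entry may now name a token by string, which the server tokenizes, instead of by numeric id.
    import requests

    payload = {
        "prompt": "The best programming language is",
        "n_predict": 32,
        # Each entry is [token, bias]; with this change the first element may be
        # a string to be tokenized by the server rather than a numeric token id.
        "logit_bias": [["Python", 5.0], ["Java", -100.0]],
    }
    r = requests.post("http://localhost:8080/completion", json=payload)
    print(r.json()["content"])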
2024-02-11  main : ctrl+C print timing in non-interactive mode (#3873)  (Georgi Gerganov)
2024-02-11  common : fix compile warning  (Georgi Gerganov)
2024-02-11  ggml : fix compile warnings (unused vars) (#4966)  (Georgi Gerganov)
2024-02-11  ggml : add mmla kernels for quantized GEMM (#4966)  (snadampal)
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
  armv8.2-a and above support MMLA instructions, which have higher throughput than DOT.
  Each kernel is enabled when the platform defines "__ARM_FEATURE_MATMUL_INT8".
  On AWS Graviton3 processors these kernels give up to a 1.5x improvement in prompt
  evaluation throughput compared to the default sdot kernels.
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
2024-02-11  lookup: add print for drafting performance (#5450)  (Johannes Gäßler)
2024-02-11  server : add llama2 chat template (#5425)  (Xuan Son Nguyen)
* server: add mistral chat template
* server: fix typo
* server: rename template mistral to llama2
* server: format_llama2: remove BOS
* server: validate "--chat-template" argument
* server: clean up using_chatml variable
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
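A usage sketch for the new template path (the model file name, port, and endpoint are assumptions; the server is assumed to have been started with the "--chat-template" argument validated above):
    import requests

    # Server assumed started e.g. as:
    #   ./server -m llama-2-7b-chat.Q4_K_M.gguf --chat-template llama2
    r = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Say hello in one sentence."},
            ],
        },
    )
    print(r.json()["choices"][0]["message"]["content"])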
2024-02-10  metal : use autoreleasepool to avoid memory leaks (#5437)  (Ian Bull)
There appears to be a known memory leak when using `MTLCommandBuffer`. It is
suggested in [1, 2] to use `@autoreleasepool`.
[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931
This change set wraps `ggml_metal_graph_compute` in an `@autoreleasepool` block.
This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436
2024-02-10  scripts : update sync scripts with new backends  (Georgi Gerganov)
2024-02-10  sync : ggml  (Georgi Gerganov)
2024-02-10  ggml : add abort_callback for cpu backend (ggml/725)  (Michael Podvitskiy)
* a way to use abort_callback with the cpu backend
* whisper update
2024-02-09  vulkan: Set limit for task concurrency (#5427)  (Neuman Vong)
A common default for the maximum number of open files is 256, which can lead to
`asyncio.gather(*tasks)` failing with "Too many open files":
    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files
This change sets a reasonable concurrency limit for tasks (and therefore open files),
without significant impact on run time.
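One generic way to cap task concurrency, shown here as a sketch of the technique rather than the exact change made in this commit, is to gate each task on an asyncio.Semaphore so only a bounded number of files or subprocesses are open at once:
    import asyncio

    CONCURRENCY = 64  # illustrative limit, comfortably below a 256 open-file cap

    async def compile_shader(name: str, sem: asyncio.Semaphore) -> None:
        async with sem:
            # open files / spawn glslc here; at most CONCURRENCY tasks do so at a time
            await asyncio.sleep(0.01)  # placeholder for the real work
            print(f"compiled {name}")

    async def main() -> None:
        sem = asyncio.Semaphore(CONCURRENCY)
        await asyncio.gather(*(compile_shader(f"shader_{i}", sem) for i in range(500)))

    asyncio.run(main())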
2024-02-09  llava : add requirements.txt and update README.md (#5428)  (Daniel Bevenius)
* llava: add requirements.txt and update README.md
  This commit adds a `requirements.txt` file to the `examples/llava` directory,
  listing the Python packages required to run the scripts there. The motivation
  is to make it easier for users to run the scripts in `examples/llava` without
  running into missing-package issues.
* llava: fix typo in llava-surgery.py output
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-09  server : fix prompt caching for repeated prompts (#5420)  (Riley Stewart)
2024-02-09  llama : do not cap thread count when MoE on CPU (#5419)  (Paul Tsochantaris)
* Not capping thread count when MoE inference is running on CPU
* Whitespace
2024-02-09  readme : add JavaScript/Wasm repo (#5415)  (Marko Tasic)
2024-02-09  ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)  (Michael Podvitskiy)
2024-02-09  Fix Vulkan crash on APUs with very little device memory (#5424)  (0cc4m)
* Fix Vulkan crash on APUs with very little device memory
* Fix debug output function names
2024-02-08  CUDA: more warps for mmvq on NVIDIA (#5394)  (Johannes Gäßler)
2024-02-08  llama : do not print "offloading layers" message in CPU-only builds (#5416)  (slaren)
2024-02-08  Fix f16_sycl cpy call from Arc (#5411)  (Abhilash Majumder)
* fix f16_sycl cpy call
* rm old logic
* add fp16 build CI
* use macro
* format fix
2024-02-08  llava : add missing .py, and fix paths in README.md (#5414)  (Daniel Bevenius)
This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-08  fix trailing whitespace (#5407)  (Johannes Gäßler)
2024-02-08  llama : fix MiniCPM (#5392)  (runfuture)
* fix bug for norm_rms_eps missing
* to align with the same order as convert.py for model write
* fix: undo HF models permute tensor
* update for flake8 lint
2024-02-08  llava: fix typo/formatting in README.md (#5405)  (Daniel Bevenius)
This commit fixes a typo in the README.md file for the llava example which was
causing the formatting to look a little off:
    Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-08  sampling: fix top_k <= 0 (#5388)  (Johannes Gäßler)
* sampling: fix top_k <= 0
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
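For illustration only (this is not the llama.cpp implementation), the convention involved here is that top_k <= 0 means the filter is disabled and the full candidate list is kept:
    def apply_top_k(logits: list[float], top_k: int) -> list[float]:
        # top_k <= 0 is treated as "disabled": keep every candidate.
        k = len(logits) if top_k <= 0 else min(top_k, len(logits))
        cutoff = sorted(logits, reverse=True)[k - 1]
        # Mask everything below the k-th largest logit (ties at the cutoff survive).
        return [x if x >= cutoff else float("-inf") for x in logits]

    print(apply_top_k([1.0, 3.0, 2.0, 0.5], top_k=2))  # keeps the two largest
    print(apply_top_k([1.0, 3.0, 2.0, 0.5], top_k=0))  # keeps everything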
2024-02-08  tests : .gitignore obj files  (Georgi Gerganov)
2024-02-07  CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)  (Michael Podvitskiy)
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-07  fix typo in readme (#5399)  (Ebey Abraham)
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
2024-02-07  Add Ava in the list of llama.cpp UIs (#4362)  (Kamil Tomšík)
2024-02-07  CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)  (Johannes Gäßler)
2024-02-07  [SYCL] update install make by w64devkit (#5297)  (Neo Zhang Jianyu)
2024-02-07  llava-cli : always tokenize special tokens (#5382)  (Xiao-Yong Jin)
* llava-cli: tokenize special tokens in prompt
* llava-cli: use the escape CLI argument, remove incomplete separate escaping process
2024-02-07  Basic Vulkan Multi-GPU implementation (#5321)  (0cc4m)
* Initial Vulkan multi-gpu implementation
  Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only do device info print in the beginning and initialize one backend for cpu assist
  Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
Co-authored-by: slaren <slarengh@gmail.com>
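As a rough sketch of what proportional tensor-split layer assignment means (the function name, ratios, and rounding here are illustrative, not the backend's actual logic), layers can be divided across devices in proportion to the requested split:
    def split_layers(n_layers: int, tensor_split: list[float]) -> list[int]:
        # Give each device a share of the layers proportional to its split fraction.
        total = sum(tensor_split)
        raw = [n_layers * s / total for s in tensor_split]
        counts = [int(x) for x in raw]
        # Hand leftover layers to the devices with the largest fractional remainders.
        for i in sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True):
            if sum(counts) == n_layers:
                break
            counts[i] += 1
        return counts

    print(split_layers(33, [0.5, 0.25, 0.25]))  # -> [17, 8, 8]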
2024-02-07  readme : modernize (#5379)  (Eve)
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
2024-02-07  readme : update ui list (#5354)  (Ben Williams)
2024-02-07  llama : add MiniCPM support (#5346)  (runfuture)
* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-gguf.py
* try to make tokenizer work
* fix bug for quantize minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm
2024-02-07  server : update `/props` with "total_slots" value (#5373)  (Justin Parker)
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
2024-02-06  convert : fix TypeError on GPT-2 vocab.json (#5288)  (Sang-Kil Park)
2024-02-06  server : remove model.json endpoint (#5371)  (Alexey Parfenov)
2024-02-06  CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)  (Johannes Gäßler)
2024-02-06  Update README.md (#5366)  (Kawrakow)
Add some links to quantization related PRs
2024-02-06  Slight quantization improvement for Q4_K and Q5_K (#5361)  (Kawrakow)
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-06  readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)  (BarfingLemurs)
2024-02-06  CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)  (Johannes Gäßler)
2024-02-06  server : include total "num_slots" in props endpoint (#5349)  (Justin Parker)