* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
→ 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
|
|
* vulkan: refactor guess_matmul_pipeline for vendor
Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.
Signed-off-by: Sergio Lopez <slp@redhat.com>
* vulkan: only use M-sized matmul on Apple GPUs
L-sized and S-sized matmuls are broken on Apple GPUs, so force the
M-sized pipeline for this vendor.
Signed-off-by: Sergio Lopez <slp@redhat.com>
---------
Signed-off-by: Sergio Lopez <slp@redhat.com>
|
|
* common: use enums for sampler types
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* minor : spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
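The throughput claim in the smmla commits above can be illustrated with a small emulation of the two instruction shapes. This is only a sketch of the arithmetic, not the ggml kernels: the function names `sdot_lane` and `smmla` are hypothetical, and the layout assumed here (two rows of eight int8 values per operand, with the second operand holding the columns of B) follows the Arm description of SMMLA. One 128-bit SDOT performs 16 multiply-accumulates (4 lanes of 4), while one SMMLA performs 32 (a 2x8 by 8x2 product).

```python
def sdot_lane(a4, b4, acc=0):
    """Emulate one 32-bit lane of SDOT: acc += dot of four int8 pairs."""
    return acc + sum(x * y for x, y in zip(a4, b4))


def smmla(acc, a, b):
    """Emulate SMMLA on plain Python ints.

    acc: 2x2 accumulator (list of two lists of two ints)
    a:   two rows of eight int8 values (the 2x8 matrix A)
    b:   two rows of eight int8 values, each row being one COLUMN of the
         8x2 matrix B (assumed register layout)
    Returns a new 2x2 matrix with acc[i][j] + dot(a[i], b[j]).
    """
    return [
        [acc[i][j] + sum(x * y for x, y in zip(a[i], b[j])) for j in range(2)]
        for i in range(2)
    ]
```

For example, `smmla([[0, 0], [0, 0]], a, b)` produces four dot products of length eight in one step, where SDOT would need eight lane operations to cover the same work.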
|
|
* server: add mistral chat template
* server: fix typo
* server: rename template mistral to llama2
* server: format_llama2: remove BOS
* server: validate "--chat-template" argument
* server: clean up using_chatml variable
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
|
There appears to be a known memory leak when using the
`MTLCommandBuffer`. Using `@autoreleasepool` is suggested in
[1,2].
[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931
This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.
This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436
|
|
* a way to use abort_callback with the cpu backend
* whisper update
|
|
A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with Too many open files.
$ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
ggml_vulkan: Generating and compiling shaders to SPIR-V
Traceback (most recent call last):
  File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
    asyncio.run(main())
  File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
    await asyncio.gather(*tasks)
[...snip...]
OSError: [Errno 24] Too many open files
This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.
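A concurrency limit of this kind is commonly implemented with `asyncio.Semaphore`. The sketch below shows the pattern only and is not the code from the commit: `compile_one`, `compile_all`, `run_demo`, and the limit of 64 are hypothetical, and `asyncio.sleep(0)` stands in for launching the shader compiler.

```python
import asyncio

MAX_CONCURRENT = 64  # hypothetical limit, not taken from the commit


async def compile_one(name, sem, active):
    # At most `limit` coroutines are inside this block at any time,
    # which bounds the number of simultaneously open files/processes.
    async with sem:
        active["now"] += 1
        active["peak"] = max(active["peak"], active["now"])
        await asyncio.sleep(0)  # stand-in for running the compiler
        active["now"] -= 1
        return name + ".spv"


async def compile_all(names, limit=MAX_CONCURRENT):
    sem = asyncio.Semaphore(limit)
    active = {"now": 0, "peak": 0}
    results = await asyncio.gather(
        *(compile_one(n, sem, active) for n in names)
    )
    return results, active["peak"]


def run_demo(count=500, limit=MAX_CONCURRENT):
    names = [f"shader_{i}" for i in range(count)]
    return asyncio.run(compile_all(names, limit))
```

All tasks are still handed to `asyncio.gather` at once, so result ordering is unchanged; only the number of tasks doing work concurrently is capped.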
|
|
* llava: add requirements.txt and update README.md
This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.
The motivation for this is to make it easier for users to run the scripts in
`examples/llava`. It should spare users from running into missing-package
errors when the packages are not installed on their system.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llava: fix typo in llava-surgery.py output
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Not capping thread count when MoE inference is running on CPU
* Whitespace
|
|
* Fix Vulkan crash on APUs with very little device memory
* Fix debug output function names
|
|
* fix f16_sycl cpy call
* rm old logic
* add fp16 build CI
* use macro
* format fix
|
|
This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* fix bug for norm_rms_eps missing
* to align with the same order as convert.py for model write
* fix: undo HF models permute tensor
* update for flake8 lint
|
|
This commit fixes a typo in the README.md file for the llava example
which is causing the formatting to look a little off:
Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* sampling: fix top_k <= 0
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
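The commit message does not say how non-positive values are handled. One common convention, assumed here, is to treat `top_k <= 0` as "top-k disabled" and keep every candidate. The helper below is a hypothetical sketch of that convention, not the llama.cpp implementation; `top_k` and `min_keep` are illustrative names.

```python
def top_k(candidates, k, min_keep=1):
    """Keep the k highest-scoring candidates.

    candidates: list of (token_id, logit) pairs
    k:          number to keep; k <= 0 is ASSUMED to mean "keep all"
    min_keep:   lower bound on how many candidates survive
    """
    if k <= 0:
        k = len(candidates)          # non-positive k disables the filter
    k = max(k, min_keep)             # never keep fewer than min_keep
    k = min(k, len(candidates))      # never ask for more than exist
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:k]
```

Clamping before slicing avoids the failure mode where a zero or negative `k` yields an empty or wrongly sized candidate list.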
|
|
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|
|
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
|
|
* llava-cli: tokenize special tokens in prompt
* llava-cli: use the escape CLI argument, remove incomplete separate escaping process
|
|
* Initial Vulkan multi-gpu implementation
Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only do device info print in the beginning and initialize one backend for cpu assist
Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
|
|
* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-gguf.py
* try to make tokenizer work
* fix bug for quantize minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm
|
|
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
|
|
Add some links to quantization related PRs
|
|
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|