ik_llama.cpp.git

Age	Commit message	Author
2024-02-11	Add support for BERT embedding models (#5423)	Douglas Hanley
	* BERT model graph construction (build_bert)
	* WordPiece tokenizer (llm_tokenize_wpm)
	* Add flag for non-causal attention models
	* Allow for models that only output embeddings
	* Support conversion of BERT models to GGUF
	* Based on prior work by @xyzhang626 and @skeskinen
	---------
	Co-authored-by: Jared Van Bortel <jared@nomic.ai>
	Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
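For readers unfamiliar with WordPiece, the following is a minimal sketch of the textbook greedy longest-match-first scheme used by BERT-style vocabularies. It is an illustration only and is not claimed to mirror how `llm_tokenize_wpm` is written; the function name `wordpiece`, the `##` continuation prefix, and the `[UNK]` fallback are assumptions taken from the common BERT convention, not from this commit.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split one word into sub-word pieces by greedy longest-match-first.

    Hypothetical helper: continuation pieces carry a "##" prefix, as in
    the usual BERT vocabulary files.
    """
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


if __name__ == "__main__":
    toy_vocab = {"un", "##aff", "##able"}
    print(wordpiece("unaffable", toy_vocab))
```

The toy vocabulary above is made up for the example; a real model ships its own vocabulary in the converted GGUF file.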
2024-02-11	flake.lock: Update	github-actions[bot]
	Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31) → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
2024-02-11	vulkan: only use M-sized matmul on Apple GPUs (#5412)	Sergio López
	* vulkan: refactor guess_matmul_pipeline for vendor
	  Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals.
	  Signed-off-by: Sergio Lopez <slp@redhat.com>
	* vulkan: only use M-sized matmul on Apple GPUs
	  L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor.
	  Signed-off-by: Sergio Lopez <slp@redhat.com>
	---------
	Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-02-11	common : use enums for sampler types (#5418)	Alexey Parfenov
	* common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11	server : allow to specify tokens as strings in logit_bias (#5003)	Alexey Parfenov
	* server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11	main : ctrl+C print timing in non-interactive mode (#3873)	Georgi Gerganov

2024-02-11	common : fix compile warning	Georgi Gerganov

2024-02-11	ggml : fix compile warnings (unused vars) (#4966)	Georgi Gerganov

2024-02-11	ggml : add mmla kernels for quantized GEMM (#4966)	snadampal
	* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
	  armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8"
	  On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel.
	* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
	  armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8"
	  On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel.
	* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
	  armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8"
	  On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel.
	* ggml: update unit tests for the new vec_dot interface
	* llama.cpp: add MATMUL_INT8 capability to system_info
2024-02-11	lookup: add print for drafting performance (#5450)	Johannes Gäßler

2024-02-11	server : add llama2 chat template (#5425)	Xuan Son Nguyen
	* server: add mistral chat template
	* server: fix typo
	* server: rename template mistral to llama2
	* server: format_llama2: remove BOS
	* server: validate "--chat-template" argument
	* server: clean up using_chatml variable
	  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
	---------
	Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-10	metal : use autoreleasepool to avoid memory leaks (#5437)	Ian Bull
	There appears to be a known memory leak when using the `MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in [1,2]
	[1] https://developer.apple.com/forums/thread/662721
	[2] https://forums.developer.apple.com/forums/thread/120931
	This change-set wraps the `ggml_metal_graph_compute` in a `@autoreleasepool`.
	This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436
2024-02-10	scripts : update sync scripts with new backends	Georgi Gerganov

2024-02-10	sync : ggml	Georgi Gerganov

2024-02-10	ggml : add abort_callback for cpu backend (ggml/725)	Michael Podvitskiy
	* a way to use abort_callback with the cpu backend * whisper update
2024-02-09	vulkan: Set limit for task concurrency (#5427)	Neuman Vong
	A common default for the maximum number of open files is 256, which can lead to `asyncio.gather(tasks)` failing with Too many open files.
	$ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
	ggml_vulkan: Generating and compiling shaders to SPIR-V
	Traceback (most recent call last):
	  File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
	    asyncio.run(main())
	  File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
	    return loop.run_until_complete(main)
	  File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
	    return future.result()
	  File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
	    await asyncio.gather(tasks)
	[...snip...]
	OSError: [Errno 24] Too many open files
	This change sets a reasonable concurrency limit for tasks (and therefore open files), without significant impact on run time.
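One common way to cap the number of concurrently running asyncio tasks is a semaphore. The sketch below shows that general technique only; it is not claimed to be the exact change made to `ggml_vk_generate_shaders.py`. The names `compile_shader` and `compile_all` and the limit of 64 are hypothetical and chosen purely for illustration.

```python
import asyncio

MAX_CONCURRENCY = 64  # hypothetical limit, not taken from the commit


async def compile_shader(name: str, limiter: asyncio.Semaphore) -> str:
    """Stand-in for one shader compilation job (hypothetical helper)."""
    async with limiter:
        # At most MAX_CONCURRENCY jobs are inside this block at once, so
        # the number of simultaneously open files stays bounded.
        await asyncio.sleep(0)  # placeholder for the real subprocess call
        return f"{name}.spv"


async def compile_all(names: list[str]) -> list[str]:
    limiter = asyncio.Semaphore(MAX_CONCURRENCY)
    jobs = [compile_shader(n, limiter) for n in names]
    # gather takes the awaitables as separate arguments, hence the star.
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    outputs = asyncio.run(compile_all([f"shader_{i}" for i in range(1000)]))
    print(len(outputs))
```

All jobs are still scheduled up front; the semaphore only delays the point at which each one acquires its resources, which is why a limit of this kind usually has little effect on total run time.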
2024-02-09	llava : add requirements.txt and update README.md (#5428)	Daniel Bevenius
	* llava: add requirements.txt and update README.md
	  This commit adds a `requirements.txt` file to the `examples/llava` directory. This file contains the required Python packages to run the scripts in the `examples/llava` directory.
	  The motivation for this is to make it easier for users to run the scripts in `examples/llava`. It spares users from running into missing-package errors when the packages are not installed on their system.
	  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
	* llava: fix typo in llava-surgery.py output
	  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
	---------
	Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-09	server : fix prompt caching for repeated prompts (#5420)	Riley Stewart

2024-02-09	llama : do not cap thread count when MoE on CPU (#5419)	Paul Tsochantaris
	* Not capping thread count when MoE inference is running on CPU * Whitespace
2024-02-09	readme : add JavaScript/Wasm repo (#5415)	Marko Tasic

2024-02-09	ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)	Michael Podvitskiy

2024-02-09	Fix Vulkan crash on APUs with very little device memory (#5424)	0cc4m
	* Fix Vulkan crash on APUs with very little device memory * Fix debug output function names
2024-02-08	CUDA: more warps for mmvq on NVIDIA (#5394)	Johannes Gäßler

2024-02-08	llama : do not print "offloading layers" message in CPU-only builds (#5416)	slaren

2024-02-08	Fix f16_sycl cpy call from Arc (#5411)	Abhilash Majumder
	* fix f16_sycl cpy call * rm old logic * add fp16 build CI * use macro * format fix
2024-02-08	llava : add missing .py, and fix paths in README.md (#5414)	Daniel Bevenius
	This commit adds the missing .py extension to the convert-image-encoder-to-gguf script. It also fixes the paths for the `model` and `mmproj` options in the example llava-cli command. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-08	fix trailing whitespace (#5407)	Johannes Gäßler

2024-02-08	llama : fix MiniCPM (#5392)	runfuture
	* fix bug for norm_rms_eps missing * to align with the same order as convert.py for model write * fix: undo HF models permute tensor * update for flake8 lint
2024-02-08	llava: fix typo/formatting in README.md (#5405)	Daniel Bevenius
	This commit fixes a typo in the README.md file for the llava example which is causing the formatting to look a little off: Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-08	sampling: fix top_k <= 0 (#5388)	Johannes Gäßler
	* sampling: fix top_k <= 0 * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
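As an illustration of the edge case named in the commit title, here is a minimal sketch of top-k filtering in which a non-positive `k` is treated as "keep every candidate". That interpretation is an assumption made for this example, and the function `top_k_filter` is hypothetical; it is not the sampler code from `llama.cpp`.

```python
def top_k_filter(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Return (token_id, logit) pairs for the k highest logits, highest first."""
    # Assumption for this sketch: k <= 0 disables the filter entirely.
    if k <= 0:
        k = len(logits)
    # Never ask for more candidates than exist.
    k = min(k, len(logits))
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.0]
    print(top_k_filter(scores, 2))
    print(top_k_filter(scores, 0))
```

Without the guard, a zero or negative `k` would slice the ranked list to nothing or to an unintended subset, leaving the sampler with too few candidates.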
2024-02-08	tests : .gitignore obj files	Georgi Gerganov

2024-02-07	CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)	Michael Podvitskiy
	Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-07	fix typo in readme (#5399)	Ebey Abraham
	Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
2024-02-07	Add Ava in the list of llama.cpp UIs (#4362)	Kamil Tomšík

2024-02-07	CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)	Johannes Gäßler

2024-02-07	[SYCL] update install make by w64devkit (#5297)	Neo Zhang Jianyu

2024-02-07	llava-cli : always tokenize special tokens (#5382)	Xiao-Yong Jin
	* llava-cli: tokenize special tokens in prompt * llava-cli: use the escape CLI argument, remove incomplete separate escaping process
2024-02-07	Basic Vulkan Multi-GPU implementation (#5321)	0cc4m
	* Initial Vulkan multi-gpu implementation
	  Move most global variables into backend context
	* Add names to backend device functions
	* Add further missing cleanup code
	* Reduce code duplication in tensor split layer assignment
	* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
	* Only do device info print in the beginning and initialize one backend for cpu assist
	  Add missing cleanup code
	* Rework backend memory management to make sure devices and buffers get properly allocated and freed
	* Rename cpu assist free function
	---------
	Co-authored-by: slaren <slarengh@gmail.com>
2024-02-07	readme : modernize (#5379)	Eve
	* first cleanup, update everything to Llama 2 and remove outdated content * Delete SHA256SUMS * make build instructions generic * recommend Q4_K_M quantization method * Update README.md
2024-02-07	readme : update ui list (#5354)	Ben Williams

2024-02-07	llama : add MiniCPM support (#5346)	runfuture
	* support minicpm arch. * fix tab/space typo. * convert minicpm model via convert-hf-gguf.py * try to make tokenizer work * fix bug for quantize minicpm * fix for flake8 lint * remove convert-minicpm.py * fix for editorconfig * correct minicpm model type (size) * constants expanded for minicpm * Minor change of the constant names for minicpm
2024-02-07	server : update `/props` with "total_slots" value (#5373)	Justin Parker
	* include total "num_slots" in default_generation_settings_for_props * cleanup total_slots return value in /props endpoint * update /props endpoint docs with total_slots * remove num_slots from default_generation_settings_for_props * update /props endpoint section
2024-02-06	convert : fix TypeError on GPT-2 vocab.json (#5288)	Sang-Kil Park

2024-02-06	server : remove model.json endpoint (#5371)	Alexey Parfenov

2024-02-06	CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)	Johannes Gäßler

2024-02-06	Update README.md (#5366)	Kawrakow
	Add some links to quantization related PRs
2024-02-06	Slight quantization improvement for Q4_K and Q5_K (#5361)	Kawrakow
	* Q4_K: slightly better quantization * Q5_K: slightly better quantization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-06	readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)	BarfingLemurs

2024-02-06	CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)	Johannes Gäßler

2024-02-06	server : include total "num_slots" in props endpoint (#5349)	Justin Parker