ik_llama.cpp.git

Age	Commit message	Author
2024-02-21	IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)	Kawrakow
	* iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scalar dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * iq4_nl: Fix after merging with master * iq4_nl: another fix after merging with master * Use IQ4_NL instead of Q4_K when using k-quants is not possible * Fix typo that makes several tests fail * It was the ggml_vdotq thing missed inside the brackets --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-20	server : support llava 1.6 (#5553)	CJ Pais
	* server: init working 1.6 * move clip_image to header * remove commented code * remove c++ style from header * remove todo * expose llava_image_embed_make_with_clip_img * fix zig build
2024-02-20	make : fix debug build with CUDA (#5616)	slaren

2024-02-20	llava : add explicit instructions for llava-1.6 (#5611)	Daniel Bevenius
	This commit contains a suggestion for the README.md in the llava example. The suggestion adds explicit instructions for how to convert a llava-1.6 model and run it using llava-cli. The motivation for this is that having explicit instructions similar to the 1.5 instructions will make it easier for users to try this out. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20	Server: use llama_chat_apply_template (#5593)	Xuan Son Nguyen
	* server: use llama_chat_apply_template * server: remove trailing space * server: fix format_chat * server: fix help message Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: fix formatted_chat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20	readme : update UI list (#5605)	Dane Madsen
	* Add maid to ui list * Specify licence
2024-02-20	metal : add build system support for embedded metal library (#5604)	Haoxiang Fei
	* add build support for embedded metal library * Update Makefile --------- Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20	server : health endpoint configurable failure on no slot (#5594)	Pierrick Hymbert

2024-02-20	Update ggml_sycl_op_mul_mat_vec_q (#5502)	AidanBeltonS
	* Update ggml_sycl_op_mul_mat_vec_q * Apply suggestions from code review Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * revert suggestion on macro * fix bug * Add quant type GGML_TYPE_IQ1_S to unsupported * fix format --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-19	nix: now that we can do so, allow MacOS to build Vulkan binaries	Mathijs de Bruin
	Author: Philip Taron <philip.taron@gmail.com> Date: Tue Feb 13 20:28:02 2024 +0000
2024-02-19	Enable Vulkan MacOS CI	0cc4m

2024-02-19	Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init()	0cc4m

2024-02-19	Add check for VK_KHR_portability_enumeration for MoltenVK support	0cc4m

2024-02-19	Add preprocessor checks for Apple devices.	Mathijs de Bruin
	Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-19	Resolve ErrorIncompatibleDriver with Vulkan on MacOS.	Mathijs de Bruin
	Refs: - https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f - https://github.com/SaschaWillems/Vulkan/issues/954 - https://github.com/haasn/libplacebo/issues/128 - https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-19	Allow for Vulkan build with Accelerate.	Mathijs de Bruin
	Closes #5304
2024-02-19	cuda : ignore peer access already enabled errors (#5597)	slaren
	* cuda : ignore peer access already enabled errors * fix hip
2024-02-19	make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598)	Jared Van Bortel

2024-02-19	examples : support minItems/maxItems in JSON grammar converter (#5039)	nopperl
	* support minLength and maxLength in JSON schema grammar converter * Update examples/json-schema-to-grammar.py --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19	llava : remove extra cont (#5587)	Georgi Gerganov

2024-02-19	llava : replace ggml_cpy with ggml_cont	slaren

2024-02-19	sync : ggml	Georgi Gerganov
	ggml-ci
2024-02-19	ggml-alloc : apply ggml/731	Georgi Gerganov

2024-02-19	metal : option to embed MSL source into compiled binary (whisper/1842)	Didzis Gosko
	* ggml : embed Metal library source (ggml-metal.metal) into binary enable by setting WHISPER_EMBED_METAL_LIBRARY * rename the build option * rename the preprocessor directive * generate Metal library embedding assembly on-fly during build process
2024-02-19	ci : enable -Werror for CUDA builds (#5579)	Georgi Gerganov
	* cmake : pass -Werror through -Xcompiler ggml-ci * make, cmake : enable CUDA errors on warnings ggml-ci
2024-02-19	make : fix CUDA build (#5580)	Georgi Gerganov

2024-02-19	readme : fix typo in README-sycl.md (#5353)	valiray

2024-02-19	cmake : remove obsolete sycl compile flags (#5581)	Abhilash Majumder
	* rm unwanted sycl compile options * fix bug * fix bug * format fix
2024-02-19	minor : fix trailing whitespace (#5538)	Georgi Gerganov

2024-02-19	llava : avoid changing the original BakLLaVA model (#5577)	Daniel Bevenius
	This is a follow-up to commit fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava : update surgery script to not remove tensors"), but this time the change is to the BakLLaVA-specific part of the surgery script. I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works as expected using the instructions in README.md. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-19	baby-llama : allocate graphs in ggml_context (#5573)	NawafAlansari
	* Fixed the baby-llama issue (see issue #4830) * minor : fix whitespaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19	llama : add llama_chat_apply_template() (#5538)	Xuan Son Nguyen
	* llama: add llama_chat_apply_template * test-chat-template: remove redundant vector * chat_template: do not use std::string for buffer * add clarification for llama_chat_apply_template * llama_chat_apply_template: add zephyr template * llama_chat_apply_template: correct docs * llama_chat_apply_template: use term "chat" everywhere * llama_chat_apply_template: change variable name to "tmpl"
2024-02-19	cuda, metal : fix nans in soft_max (#5574)	slaren
	* cuda : fix nans in soft_max * metal : fix nans in soft_max --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19	readme : update (#5572)	Mirko185
	Added 1.5-bit on README.md
2024-02-19	ggml : android and old glibc NUMA incompatibility bugfixes (#5557)	bmwl
	* #ifdef out some NUMA code blocks for Android due to lack of support * added in some __ANDROID__ ifdef gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper * Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc * harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways --------- Co-authored-by: root <root@nenya.lothlorien.ca>
2024-02-18	build : pass all warning flags to nvcc via -Xcompiler (#5570)	Jared Van Bortel
	* build : pass all warning flags to nvcc via -Xcompiler * make : fix apparent mis-merge from #3952 * make : fix incorrect GF_CC_VER for CUDA host compiler
2024-02-18	ggml : restore vec dot stride arg names (#5453)	Georgi Gerganov

2024-02-18	ci : fix wikitext url + compile warnings (#5569)	Georgi Gerganov
	ggml-ci
2024-02-18	metal : fix unused warnings (#0)	Georgi Gerganov

2024-02-18	common, server : surface min_keep as its own parameter (#5567)	Robey Holderith
	* Feature - surface min_keep as its own parameter * Updated README with min_keep param
2024-02-18	server : slots monitoring endpoint (#5550)	Pierrick Hymbert

2024-02-18	sampling : do not set min_keep to n_probs (#5564)	Georgi Gerganov

2024-02-18	cmake : fix GGML_USE_SYCL typo (#5555)	Georgi Gerganov

2024-02-18	server : enhanced health endpoint (#5548)	Pierrick Hymbert
	* server: enrich health endpoint with available slots, return 503 if no slots are available * server: document new status no slot available in the README.md
2024-02-18	server : --n-predict option document and cap to max value (#5549)	Pierrick Hymbert
	* server: document --n-predict * server: ensure client request cannot override n_predict if set * server: fix print usage LF in new --n-predict option
2024-02-18	server : graceful server shutdown (#5244)	Daniel Hiltgen
	This updates the server queue to support graceful shutdown of the server on signals.
2024-02-18	common : fix ub (#5530)	Georgi Gerganov

2024-02-18	ggml, common, examples, tests : fixed type arguments in printf (#5528)	Herman Semenov

2024-02-18	llava : update surgery script to not remove tensors (#5536)	Daniel Bevenius
	This commit updates the surgery script to not remove the tensors from the model file. For this to work the `--skip-unknown` flag is added as an argument to the convert.py script in README.md. The motivation for this change is that the surgery script currently removes the projector tensors from the model file. If the model was checked out from a repository, the model file will have been modified and will have to be checked out again to reset this effect. If this can be avoided I think it would be preferable. I did not perform this change for BakLLaVA models as I am not sure how that part works.
2024-02-18	1.5 bit quantization (#5453)	Kawrakow
	* iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>