Age | Commit message | Author |
|
* Initial Vulkan multi-gpu implementation
Move most global variables into the backend context (a minimal sketch of the idea follows this entry)
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* Generalize LLAMA_SPLIT_LAYER for all backends; do not expose device count and memory in llama.h
* Only print device info at startup and initialize a single backend for CPU assist
Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <slarengh@gmail.com>
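A minimal sketch of the "globals into a backend context" idea; the type and field names below are hypothetical illustrations, not the actual ggml-vulkan structures:

```cpp
// Hypothetical illustration: per-device state owned by a context struct
// instead of file-scope globals, so several GPUs can be initialized at once.
#include <vector>

struct vk_device_context {
    int    device_index  = -1;      // which physical device this context drives
    void * device_handle = nullptr; // stand-in for vk::Device, queues, pipelines, buffers, ...
};

struct vk_backend_context {
    std::vector<vk_device_context> devices; // one entry per selected GPU
};
```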
|
|
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
|
|
|
|
* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-to-gguf.py
* try to make tokenizer work
* fix quantization bug for minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm
|
|
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
|
|
|
|
|
|
|
|
Add some links to quantization-related PRs
|
|
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
|
|
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
|
|
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when handling multi-prompt requests
server : removed an unnecessary assert
|
|
* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg
|
|
* make: Use ccache for faster compilation
|
|
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Make it possible to use ggml-quants.h from C++ code (see the sketch after this entry)
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
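A sketch of the kind of header guard this entry describes, assuming the usual pattern for making a C header consumable from C++ (the actual ggml-quants.h layout may differ):

```cpp
// Only define the static_assert fallback for C translation units;
// C++ has static_assert as a keyword and must not see the macro.
#ifndef __cplusplus
#ifndef static_assert
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L)
#define static_assert(cond, msg) _Static_assert(cond, msg)
#else
#define static_assert(cond, msg) struct global_scope_noop_trick
#endif
#endif
#endif

// Give the C declarations C linkage when the header is included from C++.
#ifdef __cplusplus
extern "C" {
#endif

// ... C function declarations from the header ...

#ifdef __cplusplus
}
#endif
```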
|
|
* Avoid duplicating function calls when using MIN/MAX macros.
Since these macros substitute "a" and "b" textually, one of the arguments ends up being evaluated twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases the arguments are function calls, and those calls simply get made twice.
By explicitly evaluating the expression into a local variable first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
The code behaves exactly the same. (A short sketch of the pattern follows this entry.)
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
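A short sketch of the pattern; the macro is the usual definition and `expensive()` is a placeholder, not a real ggml function:

```cpp
#define MAX(a, b) ((a) > (b) ? (a) : (b))

float expensive(float x); // placeholder for a real function call

float bad(float x) {
    // expands to ((expensive(x)) > (1.0f) ? (expensive(x)) : (1.0f)),
    // so expensive() may be called twice
    return MAX(expensive(x), 1.0f);
}

float good(float x) {
    const float v = expensive(x); // evaluate once into a local
    return MAX(v, 1.0f);
}
```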
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* py : fix internlm2-hf convert to gguf
* ggml-ci
|
|
We get slightly better PPL, and we cut quantization time nearly in half.
The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale that we
got that way. (A rough sketch of the two-pass idea follows this entry.)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
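A very rough sketch of that two-pass idea; the helper functions are hypothetical stand-ins, not the actual ggml quantization routines:

```cpp
#include <cfloat>

float best_scale_unconstrained(const float * x, int n);          // hypothetical: pass 1, no lattice constraint
float lattice_error_for_scale (const float * x, int n, float d); // hypothetical: error with points snapped to E8

float quantize_block_two_pass(const float * x, int n) {
    // pass 1: find a good block scale without forcing points onto the lattice
    const float d0 = best_scale_unconstrained(x, n);

    // pass 2: search only a narrow window around that scale, now with the E8 constraint
    float best_d = d0, best_err = FLT_MAX;
    for (int is = -4; is <= 4; ++is) {
        const float d   = d0 * (1.0f + 0.02f * is);
        const float err = lattice_error_for_scale(x, n, d);
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```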
|
|
|
|
* added dynamic temp params in main
* added help text
|
|
|
|
* Update server-llm.sh
Add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
|
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
→ 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
→ 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
→ 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
|
|
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
option() is specifically for booleans.
Fixes #5158
|
|
|
|
|
|
|
|
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
|
|
|
|
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
|
|
|
|
|
|
|
|
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
|
|
|
|
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
|
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
|
|
* get max alloc size from the device properties (see the sketch after this entry)
* fix macro typo
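For reference, a standalone sketch of the standard SYCL 2020 query for that limit (the surrounding ggml-sycl plumbing is omitted):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q{sycl::default_selector_v};
    const auto dev = q.get_device();

    // maximum size of a single allocation supported by this device, in bytes
    const auto max_alloc = dev.get_info<sycl::info::device::max_mem_alloc_size>();
    std::printf("max single allocation: %llu bytes\n", (unsigned long long) max_alloc);
    return 0;
}
```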
|
|
* update the guide for make installation, memory, and the GGUF model link; remove the TODO for the Windows build
* add Visual Studio install requirement
* update the GPU device check
* update llama-bench help text
* fix grammar issues
|
|
llama_batch_init allocates memory for a fixed number of tokens.
However, llama_batch_free only frees memory for the number of
tokens that were added to the batch.
This change-set uses a null-terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also renames the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated for the batch, not the number of tokens in the batch.
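A simplified sketch of that scheme (not the full llama.cpp implementation; only the seq_id handling is shown):

```cpp
#include <cstdint>
#include <cstdlib>

typedef int32_t llama_seq_id;

struct batch_sketch {
    llama_seq_id ** seq_id; // token, pos, logits arrays omitted for brevity
};

static batch_sketch batch_init_sketch(int32_t n_tokens_alloc, int32_t n_seq_max) {
    batch_sketch b{};
    b.seq_id = (llama_seq_id **) malloc(sizeof(llama_seq_id *) * (n_tokens_alloc + 1));
    for (int32_t i = 0; i < n_tokens_alloc; ++i) {
        b.seq_id[i] = (llama_seq_id *) malloc(sizeof(llama_seq_id) * n_seq_max);
    }
    b.seq_id[n_tokens_alloc] = nullptr; // sentinel marks the end of the allocation
    return b;
}

static void batch_free_sketch(batch_sketch & b) {
    if (b.seq_id) {
        // walk until the nullptr sentinel, so every allocated entry is freed,
        // not just the ones that were populated
        for (int32_t i = 0; b.seq_id[i] != nullptr; ++i) {
            free(b.seq_id[i]);
        }
        free(b.seq_id);
    }
}
```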
|
|
* add --no-mmap, show sycl backend
* fix conflict
* fix code formatting, change the printout for --no-mmap
* rename no_mmap to mmap; only print mmap when it is not the default value
* update guide for mmap
* move the option's position to reduce model reloads
|
|
* Replace tanh to avoid NaN in the gelu shader on the AMD proprietary driver (an equivalent exp-based form is sketched after this entry)
* Fix another Vulkan CPY buffer size bug
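The entry refers to the Vulkan GLSL shader; as an illustration, here is the mathematically equivalent rewrite in C++, using the identity 0.5 * (1 + tanh(t)) == 1 / (1 + exp(-2t)) (the actual shader code may differ):

```cpp
#include <cmath>

// GELU tanh approximation as usually written; tanh misbehaves on some drivers
float gelu_tanh(float x) {
    const float c = 0.797884561f; // sqrt(2/pi)
    const float t = c * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(t));
}

// Same function with the tanh replaced by an exp-based sigmoid form
float gelu_exp(float x) {
    const float c = 0.797884561f;
    const float t = c * (x + 0.044715f * x * x * x);
    return x / (1.0f + std::exp(-2.0f * t));
}
```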
|