2024-02-03  refactor : switch to emplace_back to avoid extra object (#5291)  [Michael Klimenko]
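    For reference, a minimal C++ sketch of the difference (illustrative, not code from the commit): push_back with a temporary still materializes and then moves the object, while emplace_back constructs the element in place.

        #include <string>
        #include <vector>

        int main() {
            std::vector<std::string> names;

            // push_back: constructs a temporary std::string, then moves it in
            names.push_back(std::string("llama"));

            // emplace_back: forwards the argument and constructs the element
            // in place inside the vector, skipping the temporary object
            names.emplace_back("llama");
        }
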
2024-02-03  YaRN : store rope scaling type as int32_t in memory (#5285)  [Jared Van Bortel]
    * YaRN : store rope scaling type as int32_t in memory
    * llama : store mapped names as const char *
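    One plausible motivation, sketched below with illustrative names (not the actual llama.cpp definitions): the width of a plain C++ enum is implementation-defined, so a fixed-width field keeps the struct's in-memory layout stable.

        #include <cstdint>

        enum rope_scaling_type { ROPE_SCALING_NONE, ROPE_SCALING_LINEAR, ROPE_SCALING_YARN };

        struct hparams {
            // a fixed-width int32_t (rather than the enum type itself) keeps
            // the struct layout identical across compilers, so it is safe to
            // read and write the field as raw memory
            int32_t rope_scaling = ROPE_SCALING_NONE;
        };
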
2024-02-03  readme : add tenere to the ui tools list (#5284)  [BADR]
2024-02-03  Fix im2col with fp32 (#5286)  [AidanBeltonS]
2024-02-02  perplexity : fix KL divergence calculations on Windows (#5273)  [kalomaze]
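    For reference, the quantity the perplexity tool compares here is KL(P || Q) = sum_i p_i * (log p_i - log q_i); a minimal C++ sketch of the formula (not the PR's code):

        #include <cmath>
        #include <cstddef>
        #include <vector>

        // KL(P || Q) = sum_i p_i * (log(p_i) - log(q_i)), with the usual
        // convention that terms with p_i == 0 contribute nothing
        double kl_divergence(const std::vector<double> & p, const std::vector<double> & q) {
            double kl = 0.0;
            for (size_t i = 0; i < p.size(); ++i) {
                if (p[i] > 0.0) {
                    kl += p[i] * (std::log(p[i]) - std::log(q[i]));
                }
            }
            return kl;
        }
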
2024-02-02  scripts : parse wtype in server-llm.sh (#5167)  [Georgi Gerganov]
    * scripts : parse wtype in server-llm.sh
    * scripts : fix check for wfile
2024-02-02  py : add check for '.attn.masked_bias' layers to GPT2 model (#5281)  [Mirror Azure]
2024-02-02  Tidy ggml-sycl (#5261)  [AidanBeltonS]
    * Tidy some code in ggml-sycl
    * Remove blank space
    * Remove std::printf comments
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-02  docker : add build for SYCL, Vulkan + update readme (#5228)  [Xuan Son Nguyen]
    * add vulkan dockerfile
    * intel dockerfile: compile sycl by default
    * fix vulkan dockerfile
    * add docs for vulkan
    * docs: sycl build in docker
    * docs: remove trailing spaces
    * docs: sycl: add docker section
    * docs: clarify install vulkan SDK outside docker
    * sycl: use intel/oneapi-basekit docker image
    * docs: correct TOC
    * docs: correct docker image for Intel oneMKL
2024-02-02  [SYCL] get MAX_MEM_ALLOC from device property (#5270)  [Meng, Hengyu]
    * get max alloc size from device prop
    * fix macro typo
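    A standalone sketch of the SYCL 2020 device query the title refers to (the surrounding program is illustrative):

        #include <sycl/sycl.hpp>
        #include <cstdio>

        int main() {
            sycl::queue q{sycl::default_selector_v};
            // query the device instead of hard-coding a MAX_MEM_ALLOC constant;
            // the largest single allocation differs per GPU
            const auto max_alloc =
                q.get_device().get_info<sycl::info::device::max_mem_alloc_size>();
            std::printf("max single allocation: %llu bytes\n",
                        (unsigned long long) max_alloc);
        }
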
2024-02-02  [SYCL] update guide of SYCL backend (#5254)  [Neo Zhang Jianyu]
    * update guide for make installation, memory, gguf model link; remove todo for windows build
    * add vs install requirement
    * update for gpu device check
    * update help of llama-bench
    * fix grammar issues
2024-02-02  llama : fix memory leak in llama_batch_free (#5252)  [Ian Bull]
    llama_batch_init allocates memory for a fixed number of tokens, but
    llama_batch_free only freed memory for the number of tokens that had
    been added to the batch. This change-set uses a null-terminated array
    for the batch seq_id and frees all elements until the nullptr is
    reached. It also renames the first parameter from `n_tokens` to
    `n_tokens_alloc` to make clear that this value is the number of tokens
    allocated for the batch, not the number of tokens in it.
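    A simplified sketch of the sentinel pattern described above (types and names are illustrative, not the actual llama.cpp definitions):

        #include <cstddef>
        #include <cstdint>
        #include <cstdlib>

        struct batch {
            int32_t ** seq_id; // allocated with one extra slot set to nullptr
        };

        void batch_free(batch & b) {
            if (b.seq_id) {
                // walk to the sentinel instead of trusting a token count, so
                // every allocated element is freed even when fewer tokens
                // were added than were allocated
                for (size_t i = 0; b.seq_id[i] != nullptr; ++i) {
                    free(b.seq_id[i]);
                }
                free(b.seq_id);
            }
        }
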
2024-02-01  add --no-mmap in llama-bench (#5257)  [Neo Zhang Jianyu]
    * add --no-mmap, show sycl backend
    * fix conflict
    * fix code format, change print for --no-mmap
    * rename no_mmap to mmap; show mmap in the printer when not the default value
    * update guide for mmap
    * move option position to reduce model reloads
2024-02-01  Vulkan Phi Fix for AMD Proprietary Drivers (#5260)  [0cc4m]
    * Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
    * Fix another Vulkan CPY buffer size bug
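    A C++ sketch of the numerical idea behind such a replacement (the real fix lives in the Vulkan shader; the constants follow the common tanh-based GELU approximation). Rewriting tanh(y) as 1 - 2/(exp(2y) + 1) saturates cleanly to 1 when exp() overflows, instead of risking inf/inf = NaN:

        #include <cmath>

        static float gelu_approx(float x) {
            // y = sqrt(2/pi) * (x + 0.044715 * x^3)
            const float y = 0.797884560802865f * (x + 0.044715f * x * x * x);
            // tanh(y) = 1 - 2 / (exp(2y) + 1): overflow in exp() saturates
            // the result to 1 rather than producing a NaN
            const float t = 1.0f - 2.0f / (std::exp(2.0f * y) + 1.0f);
            return 0.5f * x * (1.0f + t);
        }
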
2024-02-01  cuda : fix LLAMA_CUDA_F16 (#5262)  [slaren]
2024-02-01  make : generate .a library for static linking (#5205)  [Ali Nehzat]
2024-02-01  llama : support InternLM2 (#5184)  [Guoteng]
    * support InternLM2 inference
    * add add_space_prefix KV pair
2024-01-31  Fix broken Vulkan CMake (properly) (#5230)  [Eve]
    * build vulkan as object
    * vulkan ci
2024-01-31  llama : reorder build_orion() to the correct place (#5118)  [Georgi Gerganov]
2024-01-31  llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)  [Georgi Gerganov]
    * llama : remove LLAMA_MAX_DEVICES from llama.h
    * Update llama.cpp
    * server : remove LLAMA_MAX_DEVICES
    * llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
    * train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
    * readme : add deprecation notice
    * readme : change deprecation notice to "remove" and fix url
    * llama : remove gpu includes from llama.h
    Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31  metal : add im2col F32 dst support (#5132)  [Georgi Gerganov]
2024-01-31  llava : add MobileVLM support (#5132)  [JidongZhang-THU]
    * New features:
      1. Sum_Rows: fix cuda kernel overflow; fix block shape error when nrows is too big
      2. Im2Col: support batch in cuda; support f32 to f32 both in cpu and cuda
      3. DepthWiseConv: supported via Im2Col and MulMat (see the im2col sketch after this entry)
      4. Pool_2d: support avg pooling in cuda
      5. HardSigmoid: implemented in cuda
      6. HardSwish: implemented in cuda
    * fix tabs instead of spaces
    * code clean
    * CUDA POOL2D
    * add POOL2D test case in test-backend-ops.cpp
    * code clean
    * fix pool2d_kernel nits
    * fix bug in pool2d kernel
    * fix avg pooling, count_include_pad nits
    * test-backend-ops : add more pool_2d tests
    * cuda : fix warnings and formatting
    * ggml : check types in release builds too in pool_2d
    * test-backend-ops : remove f16 pool_2d tests
    * cuda : more style fixes
    * add assert in ggml_cuda_op_pool2d
    * pool2d float padding fallback
    * test-backend-ops : add dst_type to im2col
    Co-authored-by: slaren <slarengh@gmail.com>
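    The DepthWiseConv-via-Im2Col approach in item 3 rests on the standard im2col transform: each input patch is flattened into one row, turning the convolution into a single matrix multiplication against the flattened kernel. A minimal single-channel, stride-1, no-padding C++ sketch (illustrative, not the ggml kernel):

        #include <cstddef>
        #include <vector>

        std::vector<float> im2col(const std::vector<float> & src, int w, int h, int kw, int kh) {
            const int ow = w - kw + 1; // output width  (stride 1, no padding)
            const int oh = h - kh + 1; // output height
            std::vector<float> dst((size_t) ow * oh * kw * kh);
            for (int oy = 0; oy < oh; ++oy)
            for (int ox = 0; ox < ow; ++ox)
            for (int ky = 0; ky < kh; ++ky)
            for (int kx = 0; kx < kw; ++kx) {
                // row (oy*ow + ox) of dst holds the kh*kw patch under that output pixel
                dst[((size_t) (oy * ow + ox) * kh + ky) * kw + kx] =
                    src[(size_t) (oy + ky) * w + (ox + kx)];
            }
            return dst;
        }
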
2024-01-31  format license text, restore Apache license per legal suggestion (#5233)  [Neo Zhang Jianyu]
2024-01-31  ggml : limit n_threads to the max n_tasks (#5238)  [slaren]
2024-01-31  Vulkan Fixes (#5223)  [0cc4m]
    * Fix Vulkan F16 models
    * Fix Vulkan context shift crash
    * Add Vulkan to common.cpp dump_non_result_info_yaml function
    * Fix bug in Vulkan CPY op
    * Fix small matrix multiplication errors on AMD GPUs on Windows or with amdvlk
    Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
2024-01-30  Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231)  [Yiming Cui]
2024-01-31  support SYCL backend windows build (#5208)  [Neo Zhang Jianyu]
    * support SYCL backend windows build
    * add windows build in CI
    * correct install oneMKL
    * fix install issue
    * fix ci
    * fix install cmd (several iterations)
    * fix win build (several iterations)
    * restore other CI part
    * restore as base
    * fix missing trailing newline, add -j
    * fix grammar issue
    * allow manual trigger, fix format issues (several iterations)
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-01-30  kompute : llama-bench support and ggml_cpu_has_kompute() (#5226)  [Jared Van Bortel]
2024-01-30  Revert "server : change deps.sh xxd files to string literals (#5221)"  [Georgi Gerganov]
    This reverts commit 4003be0e5feef320f3707786f22722b73cff9356.
2024-01-30  server : fix context shift (#5195)  [Georgi Gerganov]
    * server : fix context shift + simplify self-extend
    * server : take system_tokens into account
    * server : more n_past fixes
    * server : revert n_past_se changes
2024-01-30  server : change deps.sh xxd files to string literals (#5221)  [JohnnyB]
    * Changed ugly xxd output to literals: HPP files are much more readable as multiline string literals than as hex arrays.
    * Dashes in literal variable names: replace . and - with _ when mapping file names to variable names.
    * Comment on removing xxd.
    * Replaced the unreadable generated headers with string-literal versions produced by the new deps.sh.
2024-01-30  ggml : fix IQ3_XXS on Metal (#5219)  [Kawrakow]
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30  sync : ggml (#0)  [Georgi Gerganov]
2024-01-30  gguf : fix comparison (ggml/715)  [Georgi Gerganov]
2024-01-30  `ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)  [John Balis]
    * added cuda float16->float32 upcasting to ggml_cuda_cpy
    * added ability to copy 4d tensors with the cuda backend
    * added tests for float16->float32 upcast and 4d tensor cuda copies
    * added 4d copy test for float32->float16 copy
    * applied patch suggested by @iamlemec
    * simplify cpy tests
    Co-authored-by: slaren <slarengh@gmail.com>
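    In real CUDA code the upcast would use __half2float inside the copy kernel; as a reference, a scalar C++ sketch of what that conversion does bit-for-bit:

        #include <cstdint>
        #include <cstring>

        static float fp16_to_fp32(uint16_t h) {
            const uint32_t sign = (uint32_t) (h & 0x8000) << 16;
            uint32_t exp  = (h >> 10) & 0x1F;
            uint32_t mant = h & 0x3FF;
            uint32_t bits;
            if (exp == 0) {
                if (mant == 0) {
                    bits = sign;                              // signed zero
                } else {                                      // subnormal: renormalize
                    int e = -1;
                    do { mant <<= 1; e++; } while (!(mant & 0x400));
                    bits = sign | ((uint32_t) (127 - 15 - e) << 23) | ((mant & 0x3FF) << 13);
                }
            } else if (exp == 0x1F) {
                bits = sign | 0x7F800000u | (mant << 13);     // inf / NaN
            } else {
                bits = sign | ((exp + (127 - 15)) << 23) | (mant << 13); // normal
            }
            float f;
            std::memcpy(&f, &bits, sizeof f);
            return f;
        }
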
2024-01-30  gguf : add input validation, prevent integer overflows (ggml/709)  [Georgi Gerganov]
    * gguf : add input validation, prevent integer overflows
    * gguf : fix switch default case
    * gguf : sanitize info->n_dims and info->type
    * gguf : assert GGUF_TYPE_SIZE access
    * ggml : assert mallocs are successful
    * gguf : prevent integer overflow
    * gguf : sanitize tensor info
    * gguf : stricter limit on the number of items
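    A sketch of the overflow-checked allocation pattern such validation typically uses (the function name is illustrative, not the gguf API): check n * size before multiplying, so a hostile header cannot wrap the product around SIZE_MAX and trigger an undersized allocation.

        #include <cstddef>
        #include <cstdint>
        #include <cstdlib>

        static void * checked_alloc(size_t n, size_t size) {
            if (size != 0 && n > SIZE_MAX / size) {
                return nullptr; // multiplication would overflow: treat the input as malformed
            }
            return malloc(n * size); // caller must still check for nullptr
        }
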
2024-01-30  ci : fix yolo URLs + fix metal capture (ggml/712)  [Georgi Gerganov]
2024-01-30  metal : add debug capture backend function (ggml/694)  [Jack Mousseau]
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-30  Faster AVX2 dot product for IQ2_XS (#5187)  [Kawrakow]
    * iq2xs: faster AVX2 dot product
    * iq2xs: small AVX2 improvement
    * speed up computing sign bits in the AVX2 iq2_xs dot product
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
    Co-authored-by: Peter Reid <peter@peterreid.net>
2024-01-30  SOTA 3-bit quants (#5196)  [Kawrakow]
    * iq3_xxs: quantize/dequantize. RMSE seems a bit high-ish, at about half-way between q2_K and q3_K, so need to check more.
    * iq3_xxs: CUDA dequantize works
    * iq2_xxs: tuning quantization
    * iq3_xxs: starting to look better. PPL on wiki.test.raw: LLaMA-v1-7B 6.4218, LLaMA-v2-7B 6.3560, Mistral-7B 6.0717. This is better than Q3_K_XS, with a 5% reduction in quantized model size.
    * iq3_xxs: CUDA dot product. We have PP-512: 5891 t/s, TG-128: 143.9 t/s.
    * iq3_xxs: scalar and AVX2 dot products
    * iq3_xxs: ARM_NEON and Metal. Metal performance is decent, ARM_NEON is pathetic.
    * iq3_xxs: slightly better grid points
    * faster iq3_xxs and iq2_xs dot products on CUDA
    * iq3_xxs: add some quant mix
    * iq3_xxs: fix failing quantization test. Dot product still fails. Is this real?
    * iq3_xxs: hopefully fix ROCm
    * iq3_xxs: failing tests. This time the dot product accuracy test did find an actual bug in the AVX2 implementation.
    * add IQ3_XXS to test-backend-ops
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30  Vulkan Windows APU Memory Handling (#5199)  [0cc4m]
    * Add basic UMA memory handling; improve memory OOM behavior; fix tests
    * Fix UMA handling
    * Also fix UMA handling for prealloc buffers
    * Remove unnecessary warning message
    * Remove outdated comment
2024-01-30  quantize : fix typo (#5211)  [Vladimir Malyutin]
    Fix typo in the quantize help text.
2024-01-30  main : allow empty --prompt-cache file (#5176)  [divinity76]
    * allow empty --prompt-cache file: this lets users create the file with std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file APIs. Switched from the C fopen API to C++ streams because, as far as I know, C has no portable way to get a file size above LONG_MAX (std::ftell() returns long); std::ifstream serves as the fallback for C++ < 17, since the project currently targets C++11. file_exists() and file_size() can be removed once we upgrade to C++17.
    * formatting (requested in code review)
    * remove the C++17 path, keep file_is_empty
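    A C++11-compatible sketch consistent with the file_exists()/file_size() helpers mentioned above (the exact bodies here are illustrative):

        #include <cstddef>
        #include <fstream>
        #include <string>

        static bool file_exists(const std::string & path) {
            std::ifstream f(path.c_str());
            return f.good();
        }

        // open at the end in binary mode and read the position: unlike
        // std::ftell(), std::ifstream::tellg() is not capped at LONG_MAX
        // on common platforms, since std::streamoff is 64-bit
        static size_t file_size(const std::string & path) {
            std::ifstream f(path.c_str(), std::ios::binary | std::ios::ate);
            return f.good() ? (size_t) f.tellg() : 0;
        }
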
2024-01-30  readme : minor (#5204)  [Romain Neutron]
    Minor tweaks to the code formatting in the README file.
2024-01-30  readme : update hot topics  [Georgi Gerganov]
2024-01-30  server : improve README (#5209)  [Wu Jian Ping]
2024-01-29  ggml alloc: Fix for null dereference on alloc failure (#5200)  [Paul Tsochantaris]
    * Fix a null pointer dereference when a Metal GGML buffer fails to be allocated
    * Free the allocated buffers rather than the pointer in ggml-alloc.c
    * Fixed the fix of the fix
2024-01-29  kompute : fix fallback to CPU (#5201)  [Jared Van Bortel]
2024-01-29  Nomic Vulkan backend (#4456)  [Jared Van Bortel]
    Signed-off-by: Jared Van Bortel <jared@nomic.ai>
    Co-authored-by: niansa <anton-sa@web.de>
    Co-authored-by: Adam Treat <treat.adam@gmail.com>
    Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
    Co-authored-by: ToKiNoBug <tokinobug@163.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    Co-authored-by: slaren <slarengh@gmail.com>
2024-01-29  fix typo "RLIMIT_MLOCK" (#5175)  [divinity76]