2024-03-01  build(nix): Introduce flake.formatter for `nix fmt` (#5687)  (Tushar)

* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style

2024-03-01  convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)  (nold)

2024-03-01  llama : add StarCoder2 support (#5795)  (Sourab Mangrulkar)

* Add support for starcoder2
* handle rope type
* skip rope freq and rotary embeddings from being serialized
* resolve comments
* Update llama.cpp
* remove redundant changes
* handle `rope-theta`
* llama : change starcoder2 rope type
* address comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-03-01  server : remove api_like_OAI.py proxy script (#5808)  (Georgi Gerganov)

2024-03-01  ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)  (ddpasa)

2024-03-01  gemma : fix bfloat16 -> float16 conversion issue (#5810)  (kunal-vaishnavi)
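
Note: for context on the conversion above — bfloat16 is simply the upper 16 bits of an IEEE-754 float32, so a correct widening step looks like the sketch below (a generic illustration, not the code from #5810):

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 stores the top 16 bits of a float32, so widening
// is just a 16-bit left shift of the raw bits
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
// converting to float16 afterwards is then an ordinary f32 -> f16 rounding step
```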

2024-03-01  common : fix flag `--logits-all` to `--all-logits` (#5805)  (Miwa / Ensan)

2024-03-01  llama : cleanup unused mmq flags (#5772)  (Pierrick Hymbert)

* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q
* remove mul_mat_q in compare llama bench and usage
* update llama-bench

Co-authored-by: slaren <slarengh@gmail.com>

2024-03-01  unicode : switch to multimap based nfd_map (#5799)  (Douglas Hanley)

* switch to multimap based nfd_map due to compile-time issues
* simplify multimap keys
* don't construct a new locale every time
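
Note: a multimap keyed by codepoint makes one-to-many NFD decompositions easy to look up with `equal_range`; a minimal sketch of the idea (the table entry here is illustrative, not taken from #5799):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// hypothetical miniature table: U+00E9 ('é') decomposes to 'e' + U+0301
static const std::multimap<uint32_t, uint32_t> nfd_map = {
    { 0x00E9, 0x0065 },
    { 0x00E9, 0x0301 },
};

std::vector<uint32_t> nfd_decompose(uint32_t cp) {
    auto range = nfd_map.equal_range(cp);
    if (range.first == range.second) {
        return { cp };  // no decomposition entry: codepoint maps to itself
    }
    std::vector<uint32_t> out;
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```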

2024-03-01  server : allow overriding the HTTP server thread pool with --threads-http (#5794)  (Pierrick Hymbert)

2024-03-01  ci : add Ubuntu 22 Vulkan CI run (#5789)  (Eve)

2024-03-01  server : fix newlines in help (#5785)  (Georgi Gerganov)

2024-03-01  [SYCL] Use batched mul_mat pathway (#5591)  (AidanBeltonS)

* Use batched mul_mat pathway
* rm extra line
* Explicitly state scaled data type

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

2024-02-29  Server: normalize naming (#5779)  (Xuan Son Nguyen)

* server: normalize naming
* fix spacing

2024-02-29  llama : constified `llama_set_state_data`'s `src` (#5774)  (Marcus Dunn)
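
Note: after this change the state-loading entry point takes a read-only buffer; the signature (as I read the change — worth double-checking against llama.h) is roughly:

```cpp
// src is now const: loading state no longer implies the buffer may be mutated
size_t llama_set_state_data(struct llama_context * ctx, const uint8_t * src);
```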

2024-02-28  ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)  (Georgi Gerganov)

ggml-ci

2024-02-28  make portability_enumeration_ext apple only (#5757)  (Eve)

2024-02-28  llama : remove deprecated API (#5770)  (Georgi Gerganov)

ggml-ci

2024-02-28  awq-py : remove (#5768)  (Georgi Gerganov)

2024-02-28  sync : ggml  (Georgi Gerganov)

2024-02-28  add google magika inference example (ggml/748)  (slaren)

* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup

ggml-ci

2024-02-28  Introduce backend GUIDs (ggml/743)  (UEXTM.com)

* Introduce backend GUIDs
  Initial proposed implementation of backend GUIDs (discussed in https://github.com/ggerganov/ggml/pull/741).
  Hardcoded CPU backend GUID (for now).
  Change ggml_backend_is_cpu logic to use GUID.
* Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid`, which are not desired for future expansion
* Add spaces to match style
* Fix brace style to match
* Add void to () in function signature
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends

ggml-ci

Co-authored-by: slaren <slarengh@gmail.com>

2024-02-28  server : hit Ctrl+C twice to exit (#5734)  (Xuan Son Nguyen)

* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
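
Note: the double-SIGINT pattern named in the bullets can be sketched with std::atomic_flag like this (a simplified standalone illustration, not the server's exact code):

```cpp
#include <atomic>
#include <csignal>
#include <cstdio>
#include <cstdlib>

static std::atomic_flag g_interrupted = ATOMIC_FLAG_INIT;

static void sigint_handler(int /*sig*/) {
    // test_and_set returns the previous value: false on the first Ctrl+C,
    // true on the second, so the second press exits
    if (g_interrupted.test_and_set()) {
        std::_Exit(130);
    }
    // fprintf in a handler is not strictly async-signal-safe; fine for a sketch
    fprintf(stderr, "\nreceived SIGINT, press Ctrl+C again to terminate\n");
}

int main() {
    std::signal(SIGINT, sigint_handler);
    for (;;) { /* serve requests */ }
}
```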

2024-02-28  llama : fix non-quantization of expert gating tensors (#5754)  (compilade)

This reverts a single line from #5475.

2024-02-28  llama : improve BERT tokenization (#5740)  (Douglas Hanley)

* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

2024-02-28  readme : add link to LLaVA 1.6 models (#5758)  (Daniel Bevenius)

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2024-02-28  server : add "/chat/completions" alias for "/v1/..." (#5722)  (Jorge A)

* Add "/chat/completions" as alias for "/v1/chat/completions"
* merge to upstream master
* minor : fix trailing whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-02-28  ggml : make i-quants work with super-blocks of 64 (CPU, Metal) (#5760)  (Kawrakow)

* WIP: make i-quants work for QK_K = 64
* iq2_xs: attempt to fix AVX dot product for QK_K = 64
  Tests pass, but I get gibberish.
* QK_K = 64 tests pass on ARM_NEON and Metal
  Sadly, that does not mean it actually works.
* Make CUDA compile with QK_K = 64
  Tests don't pass, plus we get misaligned access.
* Q2_K: fixed bug in imatrix quantization for QK_K = 64
* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-27  Attempt to fix android build (#5752)  (Kawrakow)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-27  IQ4_XS: a 4.25 bpw quantization (#5747)  (Kawrakow)

* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
  As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
  PPL vs size is good, but CPU performance suffers: on M2 Max, TG-128 drops to 21.7 t/s from 28.8 t/s, and on a Ryzen 7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
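
Note: the 4.25 bpw figure is consistent with the layout in the bullets, assuming one fp16 super-block scale on top of the 6-bit block scales (the fp16 super-scale is my assumption, not stated in the log):

```cpp
// 256 weights at 4 bits, 8 blocks of 32 with 6-bit scales, one fp16 super-scale
constexpr int bits_per_super_block = 256 * 4 + 8 * 6 + 16;        // = 1088 bits
static_assert(bits_per_super_block / 256.0 == 4.25, "4.25 bits per weight");
```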

2024-02-27  cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)  (Engininja2)

2024-02-27  ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)  (Engininja2)

2024-02-27  llama : fix defrag bugs + add parameter (#5735)  (Georgi Gerganov)

* llama : fix defrag bugs + enable by default
* llama : add defrag_thold parameter
* llama : cont
* llama : disable log message
* llama : fix graph size check during defrag

ggml-ci
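
Note: the new parameter is exposed through the context params; enabling threshold-based defragmentation would look roughly like this (field name taken from the commit, the 0.1 value is an arbitrary example):

```cpp
#include "llama.h"

llama_context_params make_params() {
    llama_context_params cparams = llama_context_default_params();
    // defragment the KV cache when the fraction of "holes" crosses the
    // threshold; a negative value disables automatic defragmentation
    cparams.defrag_thold = 0.1f;
    return cparams;
}
```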

2024-02-27  Makefile: use variables for cublas (#5689)  (le.chang)

* make: use arch variable for cublas
* fix UNAME_M
* check opt first

Co-authored-by: lindeer <le.chang118@gmail.com>

2024-02-26  fix server hangs on empty prompt (#5733)  (Xuan Son Nguyen)

2024-02-26  Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)  (Kawrakow)

* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-02-26  CUDA: fix DEBUG_CUDA_MALLOC (#5729)  (Johannes Gäßler)

2024-02-26  readme : update ui list (#5731)  (Artem)

* Add LLMFarm (ui for iOS) to list

2024-02-26  [SYCL] Add support for soft_max ALiBi (#5639)  (AidanBeltonS)

* Add support for bias
* Update pre-processor
* rm commented code
* fix format
* fix CI

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

2024-02-26  unicode : reuse iterator (#5726)  (Georgi Gerganov)

2024-02-26  server: CI fix trailing space (#5728)  (Pierrick Hymbert)

2024-02-26  server: CI tests reduce build matrix (#5725)  (Pierrick Hymbert)

2024-02-26  llama : fix Gemma rope type (#5691)  (Georgi Gerganov)

2024-02-25  flake.lock: Update  (github-actions[bot])

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)

2024-02-25  server: tests - slow inference causes timeout on the CI (#5715)  (Pierrick Hymbert)

* server: tests - longer inference timeout for CI

2024-02-25  server: docs - refresh and tease a little bit more the http server (#5718)  (Pierrick Hymbert)

* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
* Update examples/server/README.md
* Update examples/server/README.md
* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-02-25  llama : refactor k-shift implementation + KV defragmentation (#5691)  (Georgi Gerganov)

* llama : refactor k-shift implementation
* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
* llama : cont k-shift refactoring + normalize type names
* minor : fix MPI builds
* llama : reuse n_rot from the build context
* llama : revert enum name changes from this PR
* llama : update llama_rope_type
* llama : add comment about rope values
* llama : fix build
* passkey : apply kv cache updates explicitly
* llama : change name to llama_kv_cache_update()
* llama : add llama_kv_cache_seq_pos_max()
* passkey : fix llama_kv_cache_seq_pos_max() usage
* llama : some llama_kv_cell simplifications
* llama : add llama_kv_cache_compress (EXPERIMENTAL)
* llama : add alternative KV cache merging (EXPERIMENTAL)
* llama : add llama_kv_cache_defrag
* llama : comments
* llama : remove llama_kv_cache_compress
  Will add in a separate PR.
* llama : defragment via non-overlapping moves
* llama : ggml_graph based defrag implementation
* llama : switch the loop order in build_defrag
* llama : add comments

ggml-ci
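
Note: with the rename to llama_kv_cache_seq_add, a typical context-shift step (drop n_discard tokens after the first n_keep, then slide the remaining positions back) reads like this sketch; n_keep, n_discard, and n_past are hypothetical variables:

```cpp
#include "llama.h"

void shift_context(llama_context * ctx, int n_keep, int n_discard, int n_past) {
    // remove a span of cached tokens from sequence 0 ...
    llama_kv_cache_seq_rm (ctx, 0, n_keep,             n_keep + n_discard);
    // ... then shift the positions after it so the sequence stays contiguous
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);
}
```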

2024-02-25  server : fix crash when system prompt is bigger than batch size (#5714)  (compilade)

The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache
  The tokens right after the matching part would otherwise skip a pos value.
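
Note: decoding in batches means feeding at most n_batch tokens per llama_decode call; a hedged sketch of the loop (simplified to a single sequence, using the llama_batch_get_one signature from this era of the API):

```cpp
#include <algorithm>
#include <vector>
#include "llama.h"

// feed `tokens` to the model no more than n_batch at a time
int decode_in_batches(llama_context * ctx, std::vector<llama_token> & tokens, int n_batch) {
    int n_past = 0;
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = std::min((int) (tokens.size() - i), n_batch);
        if (llama_decode(ctx, llama_batch_get_one(tokens.data() + i, n_eval, n_past, 0))) {
            return 1;  // decode failed
        }
        n_past += n_eval;  // advance positions by however many tokens were evaluated
    }
    return 0;
}
```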

2024-02-25  ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)  (Radosław Gryta)

* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility
  vqtbl1q_u8 is not part of the arm v7 neon library
* [android-example] Remove abi filter after arm v7a fix
* [github-workflows] Do not skip Android armeabi-v7a build
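
Note: vqtbl1q_u8 is an AArch64-only table-lookup intrinsic (out-of-range indices yield zero), so a 32-bit ARMv7 build needs a substitute; a portable sketch with the same semantics, not necessarily the exact ggml implementation:

```cpp
#include <arm_neon.h>

// same semantics as AArch64 vqtbl1q_u8: per-lane table lookup into t,
// with out-of-range indices (>= 16) producing zero
static inline uint8x16_t my_vqtbl1q_u8(uint8x16_t t, uint8x16_t idx) {
    uint8_t tab[16], ix[16], out[16];
    vst1q_u8(tab, t);    // spill table and indices to scalar arrays
    vst1q_u8(ix,  idx);
    for (int i = 0; i < 16; ++i) {
        out[i] = ix[i] < 16 ? tab[ix[i]] : 0;
    }
    return vld1q_u8(out);
}
```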

2024-02-25  make : fix nvcc version is empty (#5713)  (kwin1412)