path: root/examples
Age  Commit message  Author
2024-05-19  cmake : update android comments (#7341)  [Georgi Gerganov]
2024-05-18  android : use "ci-android" branch for CI (#7341)  [Georgi Gerganov]
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
  ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
2024-05-18  server: correct --threads documentation [no ci] (#7362)  [Johannes Gäßler]
2024-05-18  perplexity : ndot progress and show stats with < 100 tasks (#7348)  [strawberrymelonpanda]
Fix a floating point error in the ndot printing, and allow end stats to be shown for lower task counts when running multiple-choice tasks.
2024-05-17  rpc : set SO_REUSEADDR for the server socket (#7320)  [Radoslav Gerganov]
ref: #7293
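SO_REUSEADDR lets a restarted RPC server bind the same listening address immediately instead of failing while the old socket lingers in TIME_WAIT. A minimal C++ sketch of the general POSIX pattern (illustrative only; the port number and variable names are not taken from the rpc-server code):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        // Allow the address to be reused right away so a restarted server
        // does not fail bind() while the old socket sits in TIME_WAIT.
        int yes = 1;
        if (setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0) {
            perror("setsockopt");
        }

        sockaddr_in addr = {};
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(50052);  // illustrative port

        if (bind(srv, (sockaddr *) &addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
        if (listen(srv, 1) < 0) { perror("listen"); return 1; }

        close(srv);
        return 0;
    }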
2024-05-17  server : add support for the RPC backend (#7305)  [Radoslav Gerganov]
ref: #7292
2024-05-17  [Server] Added --verbose option to README [no ci] (#7335)  [Leon Knauer]
2024-05-16  Revert "server bench: fix bench not waiting for model load (#7284)" (#7334)  [Pierrick Hymbert]
This reverts commit 583fd6b000ec9ad1b465b5c98524f4a0ae388077.
2024-05-16  rpc : get available mem for the CPU backend  [Radoslav Gerganov]
This can be overridden with the -m command line option.
ref: #7293
2024-05-16  rpc : add command line arg for specifying backend memory  [Radoslav Gerganov]
ref: #7293
2024-05-16  doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)  [Vaibhav Srivastav]
* chore: add references to the quantisation space.
* fix grammer lol.
* Update README.md
  Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Update README.md
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-15  ggml : tag ggml_tensor::backend as deprecated (#7290)  [slaren]
2024-05-15  embedding : free the batch after execution (#7297)  [dm4]
2024-05-15  server bench: fix bench not waiting for model load (#7284)  [Johannes Gäßler]
2024-05-14  server: free sampling contexts on exit (#7264)  [Steve Grubb]
* server: free sampling contexts on exit
  This cleans up last leak found by the address sanitizer.
* fix whitespace
* fix whitespace
2024-05-14  Revert "move ndk code to a new library (#6951)" (#7282)  [Brian]
This reverts commit efc8f767c8c8c749a245dd96ad4e2f37c164b54c.
2024-05-14  ggml : add RPC backend (#6829)  [Radoslav Gerganov]
* ggml : add RPC backend
  The RPC backend proxies all operations to a remote server which runs a
  regular backend (CPU, CUDA, Metal, etc).
* set TCP_NODELAY
* add CI workflows
* Address review comments
* fix warning
* implement llama_max_devices() for RPC
* Address review comments
* Address review comments
* wrap sockfd into a struct
* implement get_alignment and get_max_size
* add get_device_memory
* fix warning
* win32 support
* add README
* readme : trim trailing whitespace
* Address review comments
* win32 fix
* Address review comments
* fix compile warnings on macos
2024-05-14  move ndk code to a new library (#6951)  [Elton Kola]
2024-05-14  docs: Fix typo and update description for --embeddings flag (#7026)  [Ryuei]
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14  llava-cli: fix base64 prompt (#7248)  [k.h.lai]
2024-05-13  perplexity: add BF16 vs. FP16 results (#7150)  [Johannes Gäßler]
2024-05-13  change default temperature of OAI compat API from 0 to 1 (#7226)  [Benjamin Findley]
* change default temperature of OAI compat API from 0 to 1
* make tests explicitly send temperature to OAI API
2024-05-11  fix system prompt handling (#7153)  [Xuan Son Nguyen]
2024-05-11  server : free llama_batch on exit (#7212)  [Steve Grubb]
* [server] Cleanup a memory leak on exit
  There are a couple memory leaks on exit of the server. This hides others.
  After cleaning this up, you can see leaks on slots. But that is another
  patch to be sent after this.
* make tab into spaces
2024-05-11  server: fix reported top tokens for temperature 0 (#7203)  [Johannes Gäßler]
2024-05-11  llama : add Jina Embeddings architecture (#6826)  [Joan Fontanals]
* feat: first things to do
* feat: create tensors for Jina architecture
* fix: use other tensors
* feat: embedding gets results
* fix: fix usage of ALIBI
* fix: clean prints
* fix: do some cleanup unused vars
* fix: revert changes to Makefile and CMakeLists
* fix: revert some changes
* fix: fix small detail
* fix: fix convert formatting
* fix: fix linting and editor
* feat: set proper vocab settings
* fix: JinaBertForMaskedLM registration
* feat: support q_normalization and k_normalization in Jina arch
* feat: handle gpt2 tokenizer with Jina architecture
* feat: example comments in embedding
* feat: rename Jina Bert to Jina Bert V2
* fix: add some changes as per review
* feat: proper KQ_pos for Jina embeddings
* feat: add capacity to load models ES and DE for Spanish
* llama : fix pre-tokenizers
* ggml : full ALiBi support
* ggml : update ggml_soft_max_ext() CUDA, SYCL
* ggml : ggml_flash_attn_ext() support ALiBi (CPU)
* ggml : ggml_flash_attn_ext() support ALiBi (Metal)
* ggml : fix warning
* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)
  ggml-ci
* minor : clean-up
* embedding : add warning about missing SEP
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-10  llama-bench : add pp+tg test type (#7199)  [slaren]
2024-05-10  Fix memory bug in grammar parser (#7194)  [Justine Tunney]
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug:
    ./llamafile -m foo.gguf -p bar --grammar 'root::="'
Credit for discovering and reporting this issue goes to Eclypsium Security
Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
2024-05-10  Main+: optionally allow special tokens from user in interactive mode (#7097)  [HanishKVC]
@hanishkvc added a new `--interactive-specials` flag which allows inserting special tokens from the user side into the embedding stream.
2024-05-10  llava : fix moondream support (#7163)  [Andrei]
* Revert "Revert "llava : add support for moondream vision language model (#6899)""
  This reverts commit 9da243b36ac0b9d609adfaaa4c8f1cc8c592f737.
* Fix num_positions and embeddings initialization
2024-05-10  eval-callback : fix conversion to float (#7184)  [slaren]
2024-05-09  TypoFix (#7162)  [Ahmet Zeer]
2024-05-08  convert-hf : save memory with lazy evaluation (#7075)  [compilade]
* convert-hf : begin refactoring write_tensor
* convert : upgrade to sentencepiece v0.2.0
* convert-hf : remove unused n_dims in extra_*_tensors
* convert-hf : simplify MoE weights stacking
* convert-hf : flake8 linter doesn't like semicolons
* convert-hf : allow unusual model part names
  For example, loading `model-00001-of-00001.safetensors` now works.
* convert-hf : fix stacking MoE expert tensors
  `torch.stack` and `torch.cat` don't do the same thing.
* convert-hf : fix Mamba conversion
  Tested to work even with a SentencePiece-based tokenizer.
* convert : use a string for the SentencePiece tokenizer path
* convert-hf : display tensor shape
* convert-hf : convert norms to f32 by default
* convert-hf : sort model part names
  `os.listdir` is said to list files in arbitrary order. Sorting the file names
  should let "model-00009-of-00042.safetensors" be loaded before
  "model-00010-of-00042.safetensors".
* convert-hf : use an ABC for Model again
  It seems Protocol can't be used as a statically type-checked ABC, because its
  subclasses also can't be instantiated. (why did it seem to work?) At least
  there's still a way to throw an error when forgetting to define the
  `model_arch` property of any registered Model subclasses.
* convert-hf : use a plain class for Model, and forbid direct instantiation
  There are no abstract methods used anyway, so using ABC isn't really necessary.
* convert-hf : more consistent formatting of cmdline args
* convert-hf : align the message logged for converted tensors
* convert-hf : fix Refact conversion
* convert-hf : save memory with lazy evaluation
* convert-hf : flake8 doesn't like lowercase L as a variable name
* convert-hf : remove einops requirement for InternLM2
* convert-hf : faster model parts loading
  Instead of pre-loading them all into a dict, iterate on the tensors in the
  model parts progressively as needed in Model.write_tensors. Conversion for
  some architectures relies on checking for the presence of specific tensor
  names, so for multi-part models, the weight map is read from the relevant
  json file to quickly get these names up-front.
* convert-hf : minor changes for consistency
* gguf-py : add tqdm as a dependency
  It's small, and used for a progress bar in GGUFWriter.write_tensors_to_file
2024-05-08  JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143)  [Johannes Gäßler]
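Context for the [key] -> .at(key) change: with the nlohmann JSON library the server's JSON handling is built on, operator[] on a missing key silently inserts a null value, whereas .at() throws an exception that can be surfaced as a proper error. A minimal sketch of the difference (the field names are illustrative, not taken from the server code):

    #include <nlohmann/json.hpp>
    #include <cstdio>

    int main() {
        nlohmann::json body = {{"prompt", "hello"}};

        // .at() throws json::out_of_range for a missing key, which the caller
        // can turn into an explicit error response.
        try {
            auto n = body.at("n_predict");
            (void) n;
        } catch (const nlohmann::json::out_of_range & e) {
            std::printf("missing field: %s\n", e.what());
        }

        // operator[] on a missing key silently inserts a null value instead,
        // so a malformed request can slip through unnoticed.
        auto & v = body["n_predict"];
        std::printf("inserted type: %s\n", v.type_name());  // prints "null"
        return 0;
    }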
2024-05-08  Revert "llava : add support for moondream vision language model (#6899)"  [Georgi Gerganov]
This reverts commit 46e12c4692a37bdd31a0432fc5153d7d22bc7f72.
2024-05-08  server : add themes + favicon (#6848)  [JohnnyB]
* Added themes support with two sample themes and a favicon.
* Newline
* Newline
* Newline
* Trailing whitespace
* Increased opacity for contrast
* Increase opacity.
  Check actions cancelled for some other priority job and I can't seem to
  manually re-run them, so MOAR OPACITY
* Opacity action trigger. Trying to re-trigger the cancelled action.
* One more opacity adjustment
  This Actions pipeline is failing for random issues.
* Delete examples/server/themes/buttons_top/completion.js
  This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/index.js
  This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/completion.js
  This will be served from the static string built-in to server.
* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs
  This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/index.js
  This will be served from the static string built-in to server.
* Delete examples/server/themes/wild/json-schema-to-grammar.mjs
  This will be served from the static string built-in to server.
* Replaced underscore.
2024-05-08  main : add --conversation / -cnv flag (#7108)  [Dawid Potocki]
2024-05-08  server : add_special option for tokenize endpoint (#7059)  [Johan]
2024-05-08  clean up json_value & server_log (#7142)  [Xuan Son Nguyen]
2024-05-08  ggml : introduce bfloat16 support (#6412)  [Justine Tunney]
* Introduce bfloat16 support
  Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their
  canonical floating point format.
  (brain16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits)
  This encoding has the same number of exponent bits as float32. That makes
  conversion relatively straightforward, even in the absence of hardware
  support. For example, converting brain16 to binary32 means simply shifting
  16 bits to the left.
  (IEEE binary32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits)
  The issue is that converting bf16 to fp16 can result in information loss.
  Only 13% of bf16 numbers can be precisely represented in fp16, which in
  practice ends up being 99.71% of Mistral 7b v0.2's weights; however, there
  is currently no way other than fp32 to get the others.
  (IEEE binary16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits)
  This change fixes that, by adding a bf16 data type to GGML. Support for CPU
  inference has been implemented along with optimizations for the AVX2, AVX512,
  and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around
  -0.0024 to -0.0046 compared to using fp16.
* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
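To make the conversion described above concrete, here is a minimal C++ sketch (not the ggml implementation itself): widening brain16 to binary32 just places the 16 bits in the upper half of a 32-bit word, since the sign and exponent fields line up; the truncating round-trip back simply drops the low 16 bits.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Widen a bfloat16 bit pattern to float: shift the 16 bits into the upper
    // half of an IEEE binary32 word; sign and exponent keep their positions.
    static float bf16_to_f32(uint16_t h) {
        uint32_t bits = (uint32_t) h << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    // Truncating conversion back to bfloat16: keep the top 16 bits.
    // (Real implementations typically add round-to-nearest-even and NaN handling.)
    static uint16_t f32_to_bf16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        return (uint16_t) (bits >> 16);
    }

    int main() {
        uint16_t h = f32_to_bf16(3.140625f);  // exactly representable in bf16
        std::printf("%f\n", bf16_to_f32(h));  // prints 3.140625
        return 0;
    }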
2024-05-08  Fixed save_imatrix to match old behaviour for MoE (#7099)  [jukofyork]
* Fixed save_imatrix to match old behaviour for MoE
  This fix is simple and clear, but unnecessarily doubles the memory overhead.
* Fixed missing idx variable
* Unconditionally increment ncall
  Co-authored-by: slaren <slarengh@gmail.com>
* Fixed 2 bugs in save_imatrix()
  - Fixed segfault bug because the counts vector needed to be created.
  - Fixed a pre-existing bug that didn't actually add to the counts for the
    "--combine" option.
* ncall needs summing too
* Trailing whitespace
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-05-07  server: fix incorrectly reported token probabilities (#7125)  [Johannes Gäßler]
* server: normalize token probabilities
* fix temperature == 0.0f
2024-05-07  server : update readme with undocumented options (#7013)  [Kyle Mistele]
2024-05-07  main : update log text (EOS to EOG) (#7104)  [RhinoDevel]
* Update log text (EOS to EOG)
  The log text "found EOS" is no longer always correct, here, because there is
  now an is-EOG check that also returns true for EOT.
* Improve log msg. further by using "an" instead of "some".
  As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
2024-05-07  docs: fix typos (#7124)  [omahs]
* fix typo
* fix typos
* fix typo
* fix typos
* fix typo
* fix typos
2024-05-05  Adding support for the --numa argument for llama-bench. (#7080)  [kunnis]
2024-05-04  gguf-split: add --no-tensor-first-split (#7072)  [Xuan Son Nguyen]
2024-05-04  If first token generated from the server is the stop word the server will crash (#7038)  [maor-ps]
This will reproduce the issue in llama13b:
    {
      'prompt': 'Q: hello world \nA: ',
      'stop': ['\n'],
      'temperature': 0.0,
      'n_predict': 10,
      'cache_prompt': True,
      'n_probs': 10
    }
2024-05-01  main : fix off by one error for context shift (#6921)  [l3utterfly]
2024-05-01  Server: add tests for batch size, different seeds (#6950)  [Johannes Gäßler]