ik_llama.cpp.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2024-01-27	Remove unused data and add fixes (#5154)	Michael Klimenko
	* Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation
2024-01-27	server : add self-extend support (#5104)	Maximilian Winter
	* Ported self extension to server example * Update server.cpp * Fixed prompt caching without self extend * Update server.cpp * Added description to server readme. * Update server.cpp * Update server.cpp * Update server.cpp * Update server.cpp * Update README.md * Changed descriptions * server : formatting * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update server.cpp * Update server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-26	server : refactored the task processing logic (#5065)	Xuan Son Nguyen
	* server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"
2024-01-25	examples : make pydantic scripts pass mypy and support py3.8 (#5099)	Jared Van Bortel

2024-01-25	android : use release cmake build type by default (#5123)	Valentin Konovalov

2024-01-23	Additional KL-divergence statistics (#5081)	Kawrakow
	* perplexity: add top-token probability * perplexity: add additional KL-divergence statistics * perplexity: a better organized KL-divergence statistics output --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-23	minor : clean-up some warnings and style (#5094)	Georgi Gerganov
	* minor : clean-up some warnings and style ggml-ci * ggml : add comment
2024-01-23	llama.vim : added api key support (#5090)	Michael Coppola
	Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-22	KL-divergence (#5076)	Kawrakow
	* kl-divergence: be able to save all logits to a file * Add ability to compute KL-divergence --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22	llava : MobileVLM support (#4954)	XiaotaoChen
	* MobileVLM native implementation * delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake * move android script to example/llava directory * Fix the editor config checks --------- Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-22	imatrix : keep intermediate imatrix results (#5077)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22	finetune : print sample-start/include-sample-start (#5072)	Daniel Bevenius
	This commit adds `--sample-start` and `--include-sample-start` to the output from the main function in finetune.cpp. The motivation for this is that even though these are set explicitly by the user via the command line, if one forgets to set them then it is useful to have their values printed out. Otherwise it is possible to go through the whole training process before realizing that the values are not what one expected. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-22	llama : add Q3_K_XS (#5060)	Kawrakow
	* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S * Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-21	Add ability to evauate multiple choice tasks (#5047)	Kawrakow
	* TruthfulQA: 1st attempt, does not look like it is working The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right. * TruthfulQA: works but the result is bad I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess, the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with random chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leader board results for these two models are 42.2% and 68.3%, respectively. * TruthfulQA: fix random sample * TruthfulQA: prepare tasks in parallel for large test datasets * Rename truthful_qa to multiple_choice * Make MSVC happy I had forgotten that MSVC does not make constexpr's available inside a lambda. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-21	Slightly faster imatrix (#5050)	Kawrakow
	* imatrix: speedup by avoiding unnecessary allocations and copies * imatrix: add --no-ppl option to skip PPL calculations altogether --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-20	perplexity : fix MSVC build after #5020 (#5043)	Jared Van Bortel
	* perplexity : fix MSVC build after #5020 * try a differerent fix
2024-01-19	finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)	Uzo Nweke
	* Fix issue with alloc causing max_compute_size to be calculated * remove ggml_allocr_free as suggested in issue #4791
2024-01-19	imatrix : add README.md	Georgi Gerganov

2024-01-19	winogrande: evaluate log-probs in parallel (#5036)	Kawrakow
	This is a relatively minor performance tweak resulting in ~10% speedup on my system. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19	perplexity: avoid unnecessary alloocations and logit copies (#5035)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19	perplexity : faster Winogrande via batching (#5024)	Georgi Gerganov
	* perplexity : faster Winogrande via batching ggml-ci * perplexity : remove unused function * perplexity : only tokenize selected tasks for Winogrande
2024-01-18	server : defer tasks when "slot unavailable" (#5018)	Xuan Son Nguyen
	* server: defer task when no slot is available * remove unnecessary log --------- Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18	imatrix : fix assert for src0 non-cont check	Georgi Gerganov

2024-01-18	perplexity : fix winogrande N tasks option	Georgi Gerganov

2024-01-18	HellaSwag: speed up by parallelizing log-prob evaluation (#5020)	Kawrakow
	For Mistral-7B and fp16, time on my system goes down from 536 seconds to 423 seconds for the full evaluation dataset (10042 tasks). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18	perplexity : faster HellaSwag via batching (#5017)	Georgi Gerganov
	* perplexity : faster HellaSwag ggml-ci * perplexity : clean-up ggml-ci * perplexity : no need for decode_helper ggml-ci * perplexity : add comments * perplexity : option to specify max batched tasks via `n_parallel` * perplexity : remove HellaSwag restruction for n_batch
2024-01-18	Add Winogrande evaluation (#5015)	Kawrakow
	* winogrande: simple implementation It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leader board. 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so no way the difference is due to statistics. * winogrande: somewhat better Score for Mistrali7-B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before. * winogrande: improving Mistral-7B score is now 73.56. Still not quite 78.4 but getting there. We are also getting a lower score on HellaSwag compared to HF leader board, so I'm not expecting we will get up to 78.4 anyway. It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence. It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last. * winogrande: add dataset instructions --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17	imatrix : offload to GPU support (#4957)	Georgi Gerganov
	* backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * imatrix : offload to GPU support * imatrix : fix ggml_mul_mat_id hanlding ggml-ci * ci : add imatrix test ggml-ci * ci : rearrange output ggml-ci
2024-01-16	finetune : add training data file to log message (#4979)	Daniel Bevenius
	This commit adds the name of the training data file to the log message printed when the training data is tokenized. The motivation for this change is that it can be useful to show which file is being tokenized when running the finetune example. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16	examples : add complete parallel function calling example (#4974)	Maximilian Winter

2024-01-16	perplexity : fix kv cache handling for hellaswag (#4981)	Georgi Gerganov
	ggml-ci
2024-01-16	android : introduce starter project example (#4926)	Neuman Vong
	* Introduce starter project for Android Based on examples/llama.swiftui. * Add github workflow * Set NDK version * Only build arm64-v8a in CI * Sync bench code * Rename CI prop to skip-armeabi-v7a * Remove unused tests
2024-01-16	examples : fix and improv docs for the grammar generator (#4909)	Maximilian Winter
	* Create pydantic-models-to-grammar.py * Added some comments for usage * Refactored Grammar Generator Added example and usage instruction. * Update pydantic_models_to_grammar.py * Update pydantic-models-to-grammar-examples.py * Renamed module and imported it. * Update pydantic-models-to-grammar.py * Renamed file and fixed grammar generator issue. * Fixed some issues and bugs of the grammar generator. Imporved Documentation * Update pydantic_models_to_grammar.py
2024-01-16	finetune : use LLAMA_FILE_MAGIC_GGLA (#4961)	Daniel Bevenius
	This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-16	speculative : threading options (#4959)	stduhpf
	* speculative: expose draft threading * fix usage format * accept -td and -tbd args * speculative: revert default behavior when -td is unspecified * fix trailing whitespace
2024-01-14	Add ability to use importance matrix for all k-quants (#4930)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14	2-bit quantizations (#4897)	Kawrakow
	* imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-13	metal : remove old API (#4919)	Georgi Gerganov
	ggml-ci
2024-01-13	server : fix prompt caching with system prompt (#4914)	Georgi Gerganov

2024-01-13	llama : minimize size used for state save/load (#4820)	David Friehs
	* examples : save-load-state: save only required state * llama : only reserve n_vocab * n_batch at most for logits llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch. * llama : always reserve n_vocab * n_batch for logits llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to n_vocab * n_batch. * llama : only save and restore used logits for batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages. * llama : use ostringstream and istringstream for save and load * llama : serialize rng into minimum amount of space required * llama : break session version due to serialization changes
2024-01-13	main : add parameter --no-display-prompt (#4541)	Yann Follet
	* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens * remove empty line --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-13	server : fix deadlock that occurs in multi-prompt scenarios (#4905)	Ziad Ben Hadj-Alouane
	* * fix deadlock * * dont ruint all whitespace
2024-01-13	server : fix crash with multimodal models without BOS token (#4904)	makomk

2024-01-12	examples : add pydantic models to GBNF grammar generator (#4883)	Maximilian Winter
	* Create pydantic-models-to-grammar.py * Added some comments for usage * Refactored Grammar Generator Added example and usage instruction. * Update pydantic_models_to_grammar.py * Update pydantic-models-to-grammar-examples.py * Renamed module and imported it. * Update pydantic-models-to-grammar.py * Renamed file and fixed grammar generator issue.
2024-01-12	llama : ggml-backend integration (#4766)	slaren
	* llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12	export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)	Daniel Bevenius
	This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12	llama.swiftui : update models layout (#4826)	Zay
	* Updated Models Layout - Added a models drawer - Added downloading directly from Hugging Face - Load custom models from local folder - Delete models by swiping left * trimmed trailing white space * Updated Models Layout
2024-01-12	Importance Matrix calculation (#4861)	Kawrakow
	* imatrix: 1st version * imatrix: WIP * Cleanup * Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11	server : fix infill when prompt is empty (#4833)	Georgi Gerganov

2024-01-11	main : better name for variable n_print (#4874)	Georgi Gerganov