ik_llama.cpp.git

Age	Commit message	Author
2024-01-13	metal : remove old API (#4919)	Georgi Gerganov
	ggml-ci
2024-01-13	server : fix prompt caching with system prompt (#4914)	Georgi Gerganov

2024-01-13	llama : minimize size used for state save/load (#4820)	David Friehs
	* examples : save-load-state: save only required state
	* llama : only reserve n_vocab * n_batch at most for logits
	  llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch.
	* llama : always reserve n_vocab * n_batch for logits
	  llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to n_vocab * n_batch.
	* llama : only save and restore used logits
	  for batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages.
	* llama : use ostringstream and istringstream for save and load
	* llama : serialize rng into minimum amount of space required
	* llama : break session version due to serialization changes
2024-01-13	main : add parameter --no-display-prompt (#4541)	Yann Follet
	* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens
	* remove empty line
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-13	server : fix deadlock that occurs in multi-prompt scenarios (#4905)	Ziad Ben Hadj-Alouane
	* fix deadlock
	* don't ruin all whitespace
2024-01-13	server : fix crash with multimodal models without BOS token (#4904)	makomk

2024-01-12	examples : add pydantic models to GBNF grammar generator (#4883)	Maximilian Winter
	* Create pydantic-models-to-grammar.py
	* Added some comments for usage
	* Refactored Grammar Generator
	  Added example and usage instruction.
	* Update pydantic_models_to_grammar.py
	* Update pydantic-models-to-grammar-examples.py
	* Renamed module and imported it.
	* Update pydantic-models-to-grammar.py
	* Renamed file and fixed grammar generator issue.
2024-01-12	llama : ggml-backend integration (#4766)	slaren
	* llama : ggml-backend integration
	* ggml-backend : add names to buffers
	* fix unmap after loading
	* batched-bench : add tensor_split param
	* llama : check for null tensor_split
	* ggml-backend : increase GGML_MAX_BACKENDS
	* improve graph splitting, partial fix for --no-kv-offload
	* cuda : add ggml-backend split buffer support
	* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
	* ggml : fix null backend dereference (#4807)
	* ggml : fix null backend dereference
	* ggml : also check ggml_backend_is_cpu
	* test-backend-ops : check buffer allocation failures
	* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
	* ggml : fix mul_mat_id work size
	* llama : rewrite session kv load/set without graphs
	* minor
	* llama : only initialize used backends, free backends on context free
	* llama : abort ctx if cuda backend init fails
	* llama : rewrite lora with ggml-backend and compute on CPU
	  ggml-ci
	* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
	* opencl : add ggml-backend buffer type
	* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
	* llama : on Metal, by default offload the full model
	  ggml-ci
	* metal : page align the data ptr (#4854)
	* Apply suggestions from code review
	  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
	* cuda : fix split buffer free
	* address review comments
	* llama-bench : add split-mode parameter
	* fix whitespace
	* opencl : fix double initialization
	* server : add --split-mode parameter
	* use async copy and compute to improve multi-gpu performance
	  ggml-ci
	* use async memcpys to copy the graph outputs to the CPU
	* fix opencl
	* use a host buffer for the cpu compute buffer for faster copies to the gpu
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12	export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)	Daniel Bevenius
	This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h.
	Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12	llama.swiftui : update models layout (#4826)	Zay
	* Updated Models Layout
	  - Added a models drawer
	  - Added downloading directly from Hugging Face
	  - Load custom models from local folder
	  - Delete models by swiping left
	* trimmed trailing white space
	* Updated Models Layout
2024-01-12	Importance Matrix calculation (#4861)	Kawrakow
	* imatrix: 1st version
	* imatrix: WIP
	* Cleanup
	* Update examples/imatrix/imatrix.cpp
	  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11	server : fix infill when prompt is empty (#4833)	Georgi Gerganov

2024-01-11	main : better name for variable n_print (#4874)	Georgi Gerganov

2024-01-11	main : disable token count by default (#4874)	Georgi Gerganov

2024-01-11	llama : restore intended k-quants mixes for MoE models (#4872)	Kawrakow
	* Restore intended k-quants quantization mixes for MoE models
	* Update Q2_K_S values in the quantize tool
	  Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR.
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11	server : implement credentialed CORS (#4514)	Laura
	* Implement credentialed CORS according to MDN
	* Fix syntax error
	* Move validate_api_key up so it is defined before its first usage
2024-01-11	server : support for multiple api keys (#4864)	Michael Coppola
	* server: added support for multiple api keys, added loading api keys from file
	* minor: fix whitespace
	* added file error handling to --api-key-file, changed code to better reflect current style
	* server: update README.md for --api-key-file
	---------
	Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11	server : add `LOG_INFO` when model is successfully loaded (#4881)	Behnam M
	* added /health endpoint to the server
	* added comments on the additional /health endpoint
	* Better handling of server state
	  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
	* initialized server_state
	* fixed a typo
	* starting http server before initializing the model
	* Update server.cpp
	* Update server.cpp
	* fixes
	* fixes
	* fixes
	* made ServerState atomic and turned two-line spaces into one-line
	* updated `server` readme to document the `/health` endpoint too
	* used LOG_INFO after successful model loading
2024-01-11	main : print total token count and tokens consumed so far (#4874)	pudepiedj
	* Token count changes
	* Add show token count
	* Updating before PR
	* Two requested changes
	* Move param def posn
2024-01-11	server : fix typo in model name (#4876)	Isaac McFadyen

2024-01-11	server : update readme to document the new `/health` endpoint (#4866)	Behnam M
	* added /health endpoint to the server
	* added comments on the additional /health endpoint
	* Better handling of server state
	  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
	* initialized server_state
	* fixed a typo
	* starting http server before initializing the model
	* Update server.cpp
	* Update server.cpp
	* fixes
	* fixes
	* fixes
	* made ServerState atomic and turned two-line spaces into one-line
	* updated `server` readme to document the `/health` endpoint too
2024-01-11	server : fix build + rename enums (#4870)	Georgi Gerganov

2024-01-10	server : add a `/health` endpoint (#4860)	Behnam M
	* added /health endpoint to the server
	* added comments on the additional /health endpoint
	* Better handling of server state
	  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
	* initialized server_state
	* fixed a typo
	* starting http server before initializing the model
	* Update server.cpp
	* Update server.cpp
	* fixes
	* fixes
	* fixes
	* made ServerState atomic and turned two-line spaces into one-line
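The state handling described in this commit body can be sketched as follows. This is a minimal Python illustration, not the server's C++ code: the three state names come from the commit message, while `health_response`, `load_model`, the HTTP status codes, and the JSON bodies are hypothetical choices made only for this example.

```python
from enum import Enum


class ServerState(Enum):
    # State names taken from the commit message.
    LOADING_MODEL = 1
    READY = 2
    ERROR = 3


def health_response(state):
    """Map a server state to (http_status, body).

    The status codes and body strings are assumptions for illustration.
    """
    if state is ServerState.READY:
        return 200, {"status": "ok"}
    if state is ServerState.LOADING_MODEL:
        return 503, {"status": "loading model"}
    return 500, {"status": "error"}


def load_model(loader):
    """Run a hypothetical loader callable and return the resulting state.

    The caller is expected to report LOADING_MODEL while this runs; the
    state becomes READY on success and ERROR on failure.
    """
    try:
        ok = loader()
    except Exception:
        return ServerState.ERROR
    return ServerState.READY if ok else ServerState.ERROR


if __name__ == "__main__":
    state = ServerState.LOADING_MODEL
    print(health_response(state))
    state = load_model(lambda: True)
    print(health_response(state))
```

Starting the HTTP listener before the model load, as the commit body notes, is what lets a client observe the loading state at all; otherwise the endpoint would only ever be reachable once the server is ready or has failed.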
2024-01-10	clip : support more quantization types (#4846)	John
	Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants. This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations. I ran a few tests; it doesn't break anything I could notice, and a Q6_K ViT works almost as well as Q8_0 but at 3 times the inference speed.
2024-01-09	llava-cli : don't crash if --image flag is invalid (#4835)	Justine Tunney
	This change fixes an issue where supplying `--image missing-file` would result in a segfault due to a null pointer being dereferenced. This can result in distracting info being printed if robust crash analysis tools are being used.
2024-01-09	server : update readme about token probs (#4777)	Behnam M
	* updated server readme to reflect the gg/server-token-probs-4088 commit
	  added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.
	* simplified the `completion_probabilities` JSON schema
	  It's now easier to understand what the structure of `completion_probabilities` looks like.
	* minor : fix trailing whitespace
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09	server : add api-key flag to documentation (#4832)	Zsapi
	Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-08	llama.swiftui : update readme	Georgi Gerganov

2024-01-08	main : add self-extend support (#4815)	Georgi Gerganov
	* examples : add passkey test
	* passkey : better prints
	* passkey : select pass key pos from CLI
	* passkey : simplify n_past logic
	* llama : "self-extend"-like context extension
	* passkey : add comment
	* main : add Self-Extend support
	* llama : add comment about llama_kv_cache_seq_div
2024-01-08	examples : add passkey test (#3856)	Georgi Gerganov
	* examples : add passkey test
	* passkey : better prints
	* passkey : select pass key pos from CLI
	* passkey : simplify n_past logic
	* make : add passkey target
	* passkey : add "self-extend"-like context extension (#4810)
	* llama : "self-extend"-like context extension
	* passkey : add comment
	* passkey : add readme
2024-01-07	llama-bench : add no-kv-offload parameter (#4812)	slaren

2024-01-07	llama.swiftui : use llama.cpp as SPM package (#4804)	Alex Azarov

2024-01-07	llama.swiftui : add visionOS target (#4805)	Alex Azarov

2024-01-07	server : fix n_predict check (#4798)	Georgi Gerganov

2024-01-06	llama.swiftui : use correct pointer for llama_token_eos (#4797)	Daniel Illescas Romero

2024-01-06	examples : improve base-translate.sh script (#4783)	Georgi Gerganov

2024-01-05	metal : switch back to default.metallib (ggml/681)	Georgi Gerganov
	ggml-ci
2024-01-05	examples : add few-shot translation example (#4783)	Georgi Gerganov

2024-01-04	finetune : remove unused includes (#4756)	Daniel Bevenius
	This commit removes unused includes from finetune.cpp.
	Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-04	server : send token probs for "stream == false" (#4714)	Georgi Gerganov

2024-01-04	llama.swiftui : support loading custom model from file picker (#4767)	singularity
	* swiftui: support load model from file picker
	* swiftui: remove trailing whitespace
2024-01-04	server : fix options in README.md (#4765)	Michael Coppola
	* fix examples/server/README.md
	* minor : fix whitespace
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04	llama.swiftui : fix build of ggml.metallib (#4754)	singularity
	* metal: fix metal backend init failure in swiftui
	* metal: build ggml.metallib instead of copy src
	* llama.swift : remove debug flags from metallib build
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03	server : throw an error when `slot unavailable` (#4741)	Justin Parker

2024-01-02	server : add token counts to html footer (#4738)	Phil H
	* server: add token counts to stats
	* server: generate hpp
	---------
	Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02	editorconfig : fix whitespace and indentation #4710	Georgi Gerganov

2024-01-02	server : add --override-kv parameter (#4710)	minarchist
	* Changes to server to allow metadata override
	* documentation
	* flake.nix: expose full scope in legacyPackages
	* flake.nix: rocm not yet supported on aarch64, so hide the output
	* flake.nix: expose checks
	* workflows: nix-ci: init; build flake outputs
	* workflows: nix-ci: add a job for eval
	* workflows: weekly `nix flake update`
	* workflows: nix-flakestry: drop tag filters
	  ...and add a job for flakehub.com
	* workflows: nix-ci: add a qemu job for jetsons
	* flake.nix: suggest the binary caches
	* flake.lock: update to a commit recently cached by nixpkgs-cuda-ci
	---------
	Co-authored-by: John <john@jLap.lan>
	Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2024-01-02	finetune: fix typo in README.md (#4733)	Daniel Bevenius
	Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-30	clip : refactor + bug fixes (#4696)	Georgi Gerganov
	* clip : refactor + bug fixes
	  ggml-ci
	* server : add log message
2023-12-29	clip : use ggml_backend_buffer_is_host (#4205)	Georgi Gerganov