Age  Commit message  [Author]
2024-01-14  2-bit quantizations (#4897)  [Kawrakow]
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
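The guard above refuses very low-bit quantization when no importance matrix is supplied, since those formats rely on it to decide which weights keep precision. A minimal sketch of such a check, with invented type and function names (not the actual quantize code):

```cpp
// Illustrative only: refuse low-bit quantization without an importance matrix.
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

enum class QuantType { Q2_K, Q2_K_S, IQ2_XS, Q4_K /* ... */ };

// Hypothetical container: per-tensor importance values.
using ImportanceMatrix = std::unordered_map<std::string, std::vector<float>>;

void check_imatrix_required(QuantType type, const ImportanceMatrix * imatrix) {
    // Hypothetical set of types that need an imatrix; the real list lives in the quantize tool.
    const bool needs_imatrix = type == QuantType::Q2_K_S || type == QuantType::IQ2_XS;
    if (needs_imatrix && (imatrix == nullptr || imatrix->empty())) {
        throw std::runtime_error(
            "this quantization type requires an importance matrix (--imatrix)");
    }
}
```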
2024-01-14  Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906)  [Kawrakow]
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14  sync : ggml  [Georgi Gerganov]
2024-01-13  ggml: cache sin/cos for RoPE (#4908)  [Johannes Gäßler]
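The RoPE rotation angles depend only on the token position and the dimension pair, so they can be precomputed once and reused instead of calling sinf/cosf per element. A hedged sketch of that caching idea; the names and layout are assumptions, not the actual ggml kernel:

```cpp
// Precompute sin/cos tables for the standard RoPE frequency schedule.
#include <cmath>
#include <vector>

struct RopeCache {
    std::vector<float> sin_v, cos_v;  // [n_pos * n_dims/2], row-major by position
    int half_dims = 0;
};

RopeCache build_rope_cache(int n_pos, int n_dims, float freq_base = 10000.0f) {
    RopeCache c;
    c.half_dims = n_dims / 2;
    c.sin_v.resize((size_t) n_pos * c.half_dims);
    c.cos_v.resize((size_t) n_pos * c.half_dims);
    for (int p = 0; p < n_pos; ++p) {
        for (int i = 0; i < c.half_dims; ++i) {
            // theta_i = p * base^(-2i / n_dims)
            const float theta = p * std::pow(freq_base, -2.0f * i / n_dims);
            c.sin_v[(size_t) p * c.half_dims + i] = std::sin(theta);
            c.cos_v[(size_t) p * c.half_dims + i] = std::cos(theta);
        }
    }
    return c;
}
```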
2024-01-13  metal : remove old API (#4919)  [Georgi Gerganov]
ggml-ci
2024-01-13  server : fix prompt caching with system prompt (#4914)  [Georgi Gerganov]
2024-01-13  llama : fix detokenization of non-special added-tokens (#4916)  [Georgi Gerganov]
Co-authored-by: goerch <jhr.walter@t-online.de>
2024-01-13  metal : disable log for loaded kernels (#4794)  [Georgi Gerganov]
2024-01-13  llama : minimize size used for state save/load (#4820)  [David Friehs]
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits
  llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits
  llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to n_vocab * n_batch.
* llama : only save and restore used logits
  For batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
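A rough sketch of the rule described above, using a toy context type: only the logits actually produced are written out, and the restore side never grows past n_vocab * n_batch. Illustrative only, not the llama.cpp session code:

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical context holding at most n_vocab * n_batch logits.
struct DummyCtx {
    int32_t n_vocab = 32000;
    int32_t n_batch = 512;
    std::vector<float> logits;  // only the logits actually produced so far
};

// Write just the used logits, prefixed by their count.
std::string save_logits(const DummyCtx & ctx) {
    std::ostringstream os(std::ios::out | std::ios::binary);
    const uint64_t n = ctx.logits.size();
    os.write((const char *) &n, sizeof(n));
    os.write((const char *) ctx.logits.data(), n * sizeof(float));
    return os.str();
}

// Restore into a buffer capped at n_vocab * n_batch, mirroring llama_decode's limit.
void load_logits(DummyCtx & ctx, const std::string & blob) {
    std::istringstream is(blob, std::ios::in | std::ios::binary);
    uint64_t n = 0;
    is.read((char *) &n, sizeof(n));
    const uint64_t cap = (uint64_t) ctx.n_vocab * ctx.n_batch;
    ctx.logits.resize((size_t) std::min(n, cap));
    is.read((char *) ctx.logits.data(), ctx.logits.size() * sizeof(float));
}
```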
2024-01-13  workflows: unbreak nix-build-aarch64, and split it out (#4915)  [Someone]
The fix should be just the `sudo apt-get update`
2024-01-13  main : add parameter --no-display-prompt (#4541)  [Yann Follet]
* add the parameter --no-display-prompt; combined with --log-disable it will display only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-13  gguf : fix potential infinite for-loop (#4600)  [texmex76]
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-13  metal : refactor kernel loading code (#4794)  [Georgi Gerganov]
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls
  ggml-ci
* metal : fix Metal3 family check
  ggml-ci
* metal : check for simdgroup matrix mul. feature
  ggml-ci
2024-01-13  compare-llama-bench: tweak output format (#4910)  [Johannes Gäßler]
2024-01-13  server : fix deadlock that occurs in multi-prompt scenarios (#4905)  [Ziad Ben Hadj-Alouane]
* fix deadlock
* don't ruin all whitespace
2024-01-13  server : fix crash with multimodal models without BOS token (#4904)  [makomk]
2024-01-13  convert : update phi-2 to latest HF repo (#4903)  [Georgi Gerganov]
* convert : update phi-2 to latest HF repo
  ggml-ci
* py : try to fix flake stuff
2024-01-12  sync : ggml  [Georgi Gerganov]
2024-01-12  ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)  [Georgi Gerganov]
* ggml : fix 32-bit ARM compat
* ggml : fix fix
* ggml : fix fix fix
2024-01-12  backend_sched : fix assignments  [slaren]
ggml-ci
2024-01-12  examples : add pydantic models to GBNF grammar generator (#4883)  [Maximilian Winter]
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
  Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
2024-01-12  CUDA: faster q8_0 -> f16 dequantization (#4895)  [Johannes Gäßler]
2024-01-12  llama : ggml-backend integration (#4766)  [slaren]
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
  ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
  ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
  ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
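The split_mode bullet above introduces none/layer/row placement across devices. A hedged sketch of how a layer split could map whole layers to devices in proportion to tensor_split fractions; the enum and helper are assumptions for illustration, not the actual llama.cpp API:

```cpp
// Illustrative layer-to-device assignment under a "layer" split mode.
#include <cstddef>
#include <vector>

enum class SplitMode { NONE, LAYER, ROW };  // mirrors the none/layer/row options above

int device_for_layer(int il, int n_layers, const std::vector<float> & tensor_split) {
    // Layer il goes to the first device whose cumulative share of the split
    // covers (il + 1) / n_layers of the model.
    float total = 0.0f;
    for (float f : tensor_split) total += f;
    if (total <= 0.0f) return 0;

    const float target = (float) (il + 1) / n_layers;
    float acc = 0.0f;
    for (size_t d = 0; d < tensor_split.size(); ++d) {
        acc += tensor_split[d] / total;
        if (target <= acc) return (int) d;
    }
    return (int) tensor_split.size() - 1;
}
```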
2024-01-12  llama : remove redundant assert for StableLM (#4901)  [Georgi Gerganov]
2024-01-12  export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)  [Daniel Bevenius]
This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
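A hedged sketch of the pattern described in that commit: compare the file magic against the named constant instead of a bare numeric literal. The reader function is illustrative, and the fallback #define reflects the value llama.h is assumed to carry:

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>

#ifndef LLAMA_FILE_MAGIC_GGLA
#define LLAMA_FILE_MAGIC_GGLA 0x67676c61u  // 'ggla' (assumed, normally comes from llama.h)
#endif

// Read the leading magic of a LoRA file and verify it against the named constant.
void check_lora_magic(FILE * f) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1) {
        throw std::runtime_error("failed to read file magic");
    }
    if (magic != LLAMA_FILE_MAGIC_GGLA) {
        throw std::runtime_error("bad LoRA file magic: expected 'ggla'");
    }
}
```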
2024-01-12  llama.swiftui : update models layout (#4826)  [Zay]
* Updated Models Layout
  - Added a models drawer
  - Added downloading directly from Hugging Face
  - Load custom models from local folder
  - Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout
2024-01-12  gitignore : imatrix  [Georgi Gerganov]
2024-01-12  CUDA: fix softmax compile for old CUDA versions (#4862)  [Johannes Gäßler]
2024-01-12  llama : fix typo "imp_embd" -> "inp_embd"  [Georgi Gerganov]
2024-01-12  common : streamline the formatting of help (#4890)  [howlger]
* common : streamline the formatting of help
  - Separate alternative parameters by a comma
  - Do not indent `--version` differently
* Update common/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12  py : fix lint (#4889)  [Georgi Gerganov]
2024-01-12  llama : fix llm_build_k_shift to use correct n_rot (#4889)  [Georgi Gerganov]
* llama : fix llm_build_k_shift to use correct n_rot
  ggml-ci
* llama : always use hparams.n_rot for ggml_rope_custom
  ggml-ci
* convert : fix persimmon conversion to write correct n_rot
2024-01-12  Importance Matrix calculation (#4861)  [Kawrakow]
* imatrix: 1st version
* imatrix: WIP
* Cleanup
* Update examples/imatrix/imatrix.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
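As far as these commit messages indicate, the importance matrix amounts to accumulating activation statistics per matmul input column over a calibration run. A minimal sketch of that idea, with invented names (not the actual imatrix.cpp):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct ImatrixStats {
    std::vector<float> sums;  // running sum of x_j^2 per input column j
    int n_rows_seen = 0;      // how many activation rows were accumulated
};

using ImatrixCollector = std::unordered_map<std::string, ImatrixStats>;

// Accumulate one batch of activations (n_rows x n_cols, row-major) feeding
// the matmul identified by tensor_name.
void imatrix_accumulate(ImatrixCollector & collector, const std::string & tensor_name,
                        const float * x, int n_rows, int n_cols) {
    ImatrixStats & st = collector[tensor_name];
    if (st.sums.empty()) st.sums.assign(n_cols, 0.0f);
    for (int r = 0; r < n_rows; ++r) {
        for (int j = 0; j < n_cols; ++j) {
            const float v = x[(size_t) r * n_cols + j];
            st.sums[j] += v * v;
        }
    }
    st.n_rows_seen += n_rows;
}
```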
2024-01-11  server : fix infill when prompt is empty (#4833)  [Georgi Gerganov]
2024-01-11  main : better name for variable n_print (#4874)  [Georgi Gerganov]
2024-01-11  main : disable token count by default (#4874)  [Georgi Gerganov]
2024-01-11  swift : track ggml release branch (#4867)  [Georgi Gerganov]
2024-01-11  llama : restore intended k-quants mixes for MoE models (#4872)  [Kawrakow]
* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool
  Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11  ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)  [Kawrakow]
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
  We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dot product
  21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11  swift : pin ggml commit + remove ggml.h from spm-headers (#4878)  [Georgi Gerganov]
ggml-ci
2024-01-11  server : implement credentialed CORS (#4514)  [Laura]
* Implement credentialed CORS according to MDN
* Fix syntax error
* Move validate_api_key up so it is defined before its first usage
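Per MDN, credentialed CORS responses must echo the exact request Origin (the wildcard "*" is rejected by browsers when credentials are sent) and set Access-Control-Allow-Credentials. A sketch of that header rule in a cpp-httplib-style helper; the wiring around it is an assumption, not the server's actual code:

```cpp
#include <string>
#include "httplib.h"

// Apply credentialed-CORS headers: echo the Origin, never "*".
static void set_cors_headers(const httplib::Request & req, httplib::Response & res) {
    const std::string origin = req.get_header_value("Origin");
    if (!origin.empty()) {
        res.set_header("Access-Control-Allow-Origin", origin);
        res.set_header("Access-Control-Allow-Credentials", "true");
        res.set_header("Vary", "Origin");  // responses differ per Origin, so caches must vary
    }
}
```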
2024-01-11  server : support for multiple api keys (#4864)  [Michael Coppola]
* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better reflect current style
* server: update README.md for --api-key-file
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
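A hedged sketch of the --api-key-file behavior described above: load one key per line and validate a Bearer token against the set. Function names are illustrative, not the actual server.cpp implementation:

```cpp
#include <fstream>
#include <set>
#include <stdexcept>
#include <string>

// Read API keys from a file, one key per line, skipping empty lines.
std::set<std::string> load_api_keys(const std::string & path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("failed to open api key file: " + path);
    }
    std::set<std::string> keys;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) keys.insert(line);
    }
    return keys;
}

// Accept a request if its "Authorization: Bearer <key>" header matches any configured key.
bool validate_api_key(const std::set<std::string> & keys, const std::string & auth_header) {
    if (keys.empty()) return true;  // no keys configured: endpoint is open
    const std::string prefix = "Bearer ";
    if (auth_header.rfind(prefix, 0) != 0) return false;
    return keys.count(auth_header.substr(prefix.size())) > 0;
}
```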
2024-01-11  server : add `LOG_INFO` when model is successfully loaded (#4881)  [Behnam M]
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint now provides more granular messages according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading
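A minimal sketch of the state handling described above: an atomic server state that a /health handler reports. The response bodies and status codes here are assumptions for illustration, not the exact server.cpp behavior:

```cpp
#include <atomic>
#include <string>

enum class ServerState { LOADING_MODEL, READY, ERROR };

std::atomic<ServerState> server_state{ServerState::LOADING_MODEL};

// Produce a /health response body and HTTP status for the current state.
std::string health_response(ServerState state, int & status_out) {
    switch (state) {
        case ServerState::LOADING_MODEL:
            status_out = 503;
            return "{\"status\": \"loading model\"}";
        case ServerState::ERROR:
            status_out = 500;
            return "{\"status\": \"error\", \"error\": \"model failed to load\"}";
        case ServerState::READY:
        default:
            status_out = 200;
            return "{\"status\": \"ok\"}";
    }
}
```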
2024-01-11  ci: nix-flake-update: new token with pr permissions (#4879)  [Someone]
* ci: nix-flake-update: new token with pr permissions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11  main : print total token count and tokens consumed so far (#4874)  [pudepiedj]
* Token count changes
* Add show token count
* Updating before PR
* Two requested changes
* Move param def posn
2024-01-11  server : fix typo in model name (#4876)  [Isaac McFadyen]
2024-01-11  metal : put encoder debug group behind a define (#4873)  [Paul Tsochantaris]
2024-01-11  sync : ggml  [Georgi Gerganov]
2024-01-11  metal : fix deprecation warning (ggml/690)  [Georgi Gerganov]
2024-01-11  ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)  [Timothy Cronin]