Age | Commit message | Author
2024-01-16 | finetune : use LLAMA_FILE_MAGIC_GGLA (#4961) | Daniel Bevenius
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
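For reference, a minimal sketch of the kind of magic check this constant enables (not the actual finetune.cpp code; it assumes LLAMA_FILE_MAGIC_GGLA is the GGLA magic exported by llama.h and omits error handling):

    #include <cstdint>
    #include <cstdio>
    #include "llama.h"   // provides LLAMA_FILE_MAGIC_GGLA

    // Returns true if the file at `path` starts with the GGLA (LoRA) magic.
    static bool has_ggla_magic(const char * path) {
        FILE * f = std::fopen(path, "rb");
        if (!f) {
            return false;
        }
        uint32_t magic = 0;
        const bool ok = std::fread(&magic, sizeof(magic), 1, f) == 1;
        std::fclose(f);
        return ok && magic == LLAMA_FILE_MAGIC_GGLA;
    }

Using the shared constant instead of a locally defined LLAMA_FILE_MAGIC_LORA keeps the magic value in one place (llama.h) for every tool that reads or writes LoRA checkpoints.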
2024-01-16 | speculative : threading options (#4959) | stduhpf
* speculative: expose draft threading
* fix usage format
* accept -td and -tbd args
* speculative: revert default behavior when -td is unspecified
* fix trailing whitespace
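Judging by the flag names, -td should control the number of threads used when evaluating the draft model (mirroring -t) and -tbd the batch threads for it (mirroring -tb); an illustrative invocation, with placeholder paths and counts, would be something like ./speculative -m target.gguf -md draft.gguf -t 8 -td 4.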
2024-01-15 | pass cpu-architecture arguments only to host code (C;C++) (#4943) | ngc92
2024-01-15 | llama : apply classifier-free guidance to logits directly (#4951) | David Friehs
2024-01-15 | awq-py : fix typo in awq-py/README.md (#4947) | Victor Z. Peng
2024-01-15 | cuda : fix dequantize kernel names (#4938) | Georgi Gerganov
2024-01-15 | llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15 | CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 | llama : fix missing quotes (#4937) | David Pflug
2024-01-14 | Add ability to use importance matrix for all k-quants (#4930) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 | llama : check LLAMA_TRACE env for extra logging (#4929) | Georgi Gerganov
* llama : minor fix indent
* llama : check LLAMA_TRACE env for extra logging ggml-ci
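Since the extra logging is keyed off an environment variable, enabling it should only require prefixing the run, e.g. LLAMA_TRACE=1 ./main -m model.gguf -p "..." (model path and prompt are placeholders).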
2024-01-14 | scripts : sync-ggml-am.sh option to skip commits | Georgi Gerganov
2024-01-14 | llama : use LLAMA_LOG_ macros for logging | Georgi Gerganov
2024-01-14 | Fix ffn_down quantization mix for MoE models (#4927) | Kawrakow
* Fix ffn_down quantization mix for MoE models: in #4872 I did not consider the part where every third tensor is quantized with more bits. For MoE this leads to tensors of the same layer being quantized with a different number of bits, which the inference implementation does not account for (it assumes all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
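As a rough illustration of the constraint being restored (a sketch only, not the actual quantization logic in llama.cpp): the decision to use more bits has to be keyed on the layer, never on the individual expert, so that all experts of a MoE layer end up with the same tensor type.

    #include "ggml.h"

    // Illustrative sketch: choose the ffn_down quantization type per layer.
    // The hypothetical "every third tensor gets more bits" schedule depends only on
    // i_layer, so every expert in that layer receives the same ggml_type.
    static ggml_type pick_ffn_down_type(int i_layer, ggml_type base_type, ggml_type bumped_type) {
        const bool use_more_bits = (i_layer % 3) == 2;   // hypothetical schedule
        return use_more_bits ? bumped_type : base_type;
    }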
2024-01-14 | metal : correctly set SIMD support flags on iOS (#4923) | Alex Azarov
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad
* log a little bit more info on iOS
2024-01-14 | llama : support WinXP build with MinGW 8.1.0 (#3419) | Karthik Kumar Viswanathan
2024-01-14 | 2-bit quantizations (#4897) | Kawrakow
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
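In practical terms, the low-bit k-quants now expect an importance matrix at quantization time; a run might look roughly like ./quantize --imatrix imatrix.dat model-f16.gguf model-Q2_K.gguf Q2_K, with placeholder file names and the exact flag spelling to be checked against the tool's help output.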
2024-01-14 | Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 | sync : ggml | Georgi Gerganov
2024-01-13 | ggml: cache sin/cos for RoPE (#4908) | Johannes Gäßler
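As a generic illustration of the caching idea (a sketch, not the actual ggml kernel): the RoPE angles depend only on the position and the dimension pair, so their sin/cos values can be computed once and reused across heads and layers.

    #include <cmath>
    #include <vector>

    // Illustrative sketch: precompute RoPE sin/cos for positions [0, n_pos) and head
    // dimension n_dims, using theta(p, i) = p * base^(-2i/n_dims).
    static void build_rope_cache(int n_pos, int n_dims, float freq_base,
                                 std::vector<float> & sin_cache, std::vector<float> & cos_cache) {
        const int half = n_dims / 2;
        sin_cache.resize((size_t) n_pos * half);
        cos_cache.resize((size_t) n_pos * half);
        for (int p = 0; p < n_pos; ++p) {
            for (int i = 0; i < half; ++i) {
                const float theta = p * std::pow(freq_base, -2.0f * i / n_dims);
                sin_cache[(size_t) p * half + i] = std::sin(theta);
                cos_cache[(size_t) p * half + i] = std::cos(theta);
            }
        }
    }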
2024-01-13 | metal : remove old API (#4919) | Georgi Gerganov
ggml-ci
2024-01-13 | server : fix prompt caching with system prompt (#4914) | Georgi Gerganov
2024-01-13 | llama : fix detokenization of non-special added-tokens (#4916) | Georgi Gerganov
Co-authored-by: goerch <jhr.walter@t-online.de>
2024-01-13 | metal : disable log for loaded kernels (#4794) | Georgi Gerganov
2024-01-13 | llama : minimize size used for state save/load (#4820) | David Friehs
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits. llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits. llama_context de-serialization breaks if the contexts have differing capacity for logits, and llama_decode will at most resize to n_vocab * n_batch.
* llama : only save and restore used logits. For batch sizes of 512 this reduces the saved state in the best case by around 62 MB, which can add up when saving on each message to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into the minimum amount of space required
* llama : break session version due to serialization changes
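For reference, a minimal sketch of the save path these changes slim down, using the public state API from llama.h (error handling omitted, file path is a placeholder):

    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include "llama.h"

    // Illustrative sketch: serialize the context state (rng, KV cache, logits, ...) to a file.
    static void save_state_to_file(llama_context * ctx, const char * path) {
        const size_t max_size = llama_get_state_size(ctx);              // upper bound for the state
        std::vector<uint8_t> buf(max_size);
        const size_t written  = llama_copy_state_data(ctx, buf.data()); // actual bytes used
        FILE * f = std::fopen(path, "wb");
        std::fwrite(buf.data(), 1, written, f);
        std::fclose(f);
    }

The point of the commit is that `written` (and the size needed on load via llama_set_state_data) gets much smaller once only the logits that were actually produced, and a compactly serialized rng, are included.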
2024-01-13 | workflows: unbreak nix-build-aarch64, and split it out (#4915) | Someone
The fix should be just the `sudo apt-get update`
2024-01-13 | main : add parameter --no-display-prompt (#4541) | Yann Follet
* add the parameter --no-display-prompt; combined with --log-disable it will display only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
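Taken together, a run that prints nothing but the generated tokens would look roughly like ./main -m model.gguf -p "..." --log-disable --no-display-prompt, with the model path and prompt as placeholders.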
2024-01-13 | gguf : fix potential infinite for-loop (#4600) | texmex76
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-13 | metal : refactor kernel loading code (#4794) | Georgi Gerganov
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls ggml-ci
* metal : fix Metal3 family check ggml-ci
* metal : check for simdgroup matrix mul. feature ggml-ci
2024-01-13 | compare-llama-bench: tweak output format (#4910) | Johannes Gäßler
2024-01-13 | server : fix deadlock that occurs in multi-prompt scenarios (#4905) | Ziad Ben Hadj-Alouane
* fix deadlock
* don't ruin all whitespace
2024-01-13 | server : fix crash with multimodal models without BOS token (#4904) | makomk
2024-01-13 | convert : update phi-2 to latest HF repo (#4903) | Georgi Gerganov
* convert : update phi-2 to latest HF repo ggml-ci
* py : try to fix flake stuff
2024-01-12 | sync : ggml | Georgi Gerganov
2024-01-12 | ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758) | Georgi Gerganov
* ggml : fix 32-bit ARM compat
* ggml : fix fix
* ggml : fix fix fix
2024-01-12 | backend_sched : fix assignments | slaren
ggml-ci
2024-01-12 | examples : add pydantic models to GBNF grammar generator (#4883) | Maximilian Winter
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator. Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
2024-01-12 | CUDA: faster q8_0 -> f16 dequantization (#4895) | Johannes Gäßler
2024-01-12 | llama : ggml-backend integration (#4766) | slaren
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
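Of the user-facing pieces listed above, the split mode is the one most likely to need a command-line change: an illustrative multi-GPU invocation would be something like ./main -m model.gguf -ngl 99 -sm row (or -sm none / -sm layer), with the model path and layer count as placeholders.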
2024-01-12 | llama : remove redundant assert for StableLM (#4901) | Georgi Gerganov
2024-01-12 | export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894) | Daniel Bevenius
This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 | llama.swiftui : update models layout (#4826) | Zay
* Updated Models Layout
  - Added a models drawer
  - Added downloading directly from Hugging Face
  - Load custom models from local folder
  - Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout
2024-01-12 | gitignore : imatrix | Georgi Gerganov
2024-01-12 | CUDA: fix softmax compile for old CUDA versions (#4862) | Johannes Gäßler
2024-01-12 | llama : fix typo "imp_embd" -> "inp_embd" | Georgi Gerganov
2024-01-12 | common : streamline the formatting of help (#4890) | howlger
* common : streamline the formatting of help
  - Separate alternative parameters by a comma
  - Do not indent `--version` differently
* Update common/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 | py : fix lint (#4889) | Georgi Gerganov
2024-01-12 | llama : fix llm_build_k_shift to use correct n_rot (#4889) | Georgi Gerganov
* llama : fix llm_build_k_shift to use correct n_rot ggml-ci
* llama : always use hparams.n_rot for ggml_rope_custom ggml-ci
* convert : fix persimmon conversion to write correct n_rot
2024-01-12 | Importance Matrix calculation (#4861) | Kawrakow
* imatrix: 1st version
* imatrix: WIP
* Cleanup
* Update examples/imatrix/imatrix.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
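The new example derives the importance matrix from activations collected over a calibration text; a typical run might look roughly like ./imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat (placeholder file names, exact options per the example's help), and the resulting file is what the quantization code consumes as an importance matrix.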
2024-01-11 | server : fix infill when prompt is empty (#4833) | Georgi Gerganov