summaryrefslogtreecommitdiff
path: root/llama.cpp
AgeCommit message (Collapse)Author
2023-12-14ggml : remove n_dims from ggml_tensor (#4469)slaren
ggml-ci
2023-12-14ggml : add ggml_row_size() (fixes llama out of space) (#4461)LostRuins
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values * do not cast to size_t, instead just use doubles * ggml : add ggml_row_size(), deprecate ggml_type_sizef() * ggml : fix row size compute to avoid overflows * tests : fix sizey -> sizez --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-13llama : add Mixtral support (#4406)slaren
* convert : support Mixtral as LLAMA arch * convert : fix n_ff typo * llama : model loading * ggml : sync latest ggml_mul_mat_id * llama : update graph to support MoE * llama : fix cur -> cur_expert * llama : first working version * llama : fix expert weighting in the FFN * ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) * ggml : add n_as argument to ggml_mul_mat_id * ggml : fix ggml_get_rows to take into account ne02 / ne11 * metal : add more general support for ggml_get_rows + tests * llama : add basic support for offloading moe with CUDA * metal : add/mul/div use general kernel when src1 not cont * metal : reduce the kernel launches for ggml_mul_mat_id * ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D * ggml : update get_rows f16 and q * cuda : support non-contiguous src1 in get_rows * llama : offload missing ffn_moe_silu * metal : fix ggml_get_rows to work with non-cont src1 * metal : add indirect mat-vec kernels for all quantization types * llama : do not quantize expert gating tensors * llama : add n_expert and n_expert_used to hparams + change quants * test-backend-ops : add moe test * cuda : fix get_rows when ncols is odd * convert : determine n_ctx correctly * metal : fix ggml_mul_mat_id for F32 * test-backend-ops : make experts more evenly probable (test_moe) * test-backend-ops : cleanup, add moe test for batches * test-backend-ops : add cpy from f32 -> all types test * test-backend-ops : fix dequantize block offset * llama : fix hard-coded number of experts * test-backend-ops : simplify and disable slow tests to avoid CI timeout * test-backend-ops : disable MOE test with thread sanitizer * cuda : fix mul_mat_id with multi gpu * convert : use 1e6 rope_freq_base for mixtral * convert : fix style * convert : support safetensors format * gguf-py : bump version * metal : add cpy f16 -> f32 kernel * metal : fix binary ops for ne10 % 4 != 0 * test-backend-ops : add one more sum_rows test * ggml : do not use BLAS with ggml_mul_mat_id * convert-hf : support for mixtral-instruct (#4428) * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct * convert : use sentencepiece tokenizer for Mixtral-instruct * convert : make flake8 happy * metal : fix soft_max kernels ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92 * metal : limit kernels to not use more than the allowed threads --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-12english : use `typos` to fix comments and logs (#4354)Richard Kiss
2023-12-09grammar : revert the replacement of llama_token_to_piece with id_to_token ↵Xiang (Kevin) Li
(#4396)
2023-12-07llama : per-layer KV cache + quantum K cache (#4309)Georgi Gerganov
* per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>
2023-12-05grammar : pre-computed pieces + reserve mem + less string copies (#4330)Marcus Dunn
* reserve space for codepoints * improvement for the appended 0 * used precomputed token text for grammar sample * reserve canidates_decoded * reserve canidates_grammar * remove candidates_decoded * Revert "remove candidates_decoded" This reverts commit 3773328080e6a139ee83198329a13cf4ff61d707. * changed decode_utf8 to take src by ref
2023-12-05llama : allow overriding GGUF metadata when loading model (#4092)Kerfuffle
* feat: Allow overriding GGUF metadata when loading model * Fix the one time GCC is stricter than clang about something * Step1 * Refactor... basically everything! * Nuke obsolete GetArrayLen struct * simplify std::string specialization * Various cleanups Add informational output when overrides are applied Warn user when an override with the wrong type is specified * Fix broken logic for parsing bool KV overrides Fix issue where overrides didn't apply when key missing in GGUF metadata Resolve merge changes * llama : rearrange model params * Update new GET_KEY call Add note that metadata KV overrides aren't reflected in initial metadata KV info dump --------- Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-03llama : pad KV cache size (#4280)Georgi Gerganov
* llama : pad KV cache size to 32 * metal : try to improve batched decoding
2023-12-01llama : avoid using "optional" keyword (#4283)Georgi Gerganov
2023-12-01llama : support optional tensors (#4283)Georgi Gerganov
2023-12-01llama : support attention bias on LLaMA architecture (#4283)CausalLM
* Support attention_bias on LLaMA architecture QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608). * check existence of qkvo bias while loading llama models Tested on LLaMA2, CUDA and CPU. * Update llama.cpp
2023-12-01llama : add Qwen support (#4281)Shijie
* enable qwen to llama.cpp * llama : do not GPU split bias tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-01llama : fix integer overflow during quantization (#4284)Georgi Gerganov
happens with multi-threaded quantization of Qwen-72B ggml-ci
2023-12-01ggml : add ggml_soft_max_ext (#4256)Georgi Gerganov
* metal : implement soft_max_ext * cuda : implement soft_max_ext * ggml : implement soft_max_ext (CPU) * batched-bench : print threads ggml-ci * metal : simplify soft_max encoding ggml-ci * cuda : use 512 threads for soft_max instead of 32 * ggml : update soft max cpu * cuda : do warp-based block reduce * cuda : increase max block size to 1024 * cuda : fix warp reduction initialization of shared mem * metal : warp-based reduction for soft max kernel * metal : warp-based reduce for rms_norm * metal : simplify soft max kernel ggml-ci * alloc : fix build with debug
2023-12-01build : fix build info generation and cleanup Makefile (#3920)Jared Van Bortel
* cmake : fix joining of REAL_GIT_DIR * fix includes with help from include-what-you-use * make : remove unneeded deps and add test-rope target * fix C includes in C++ source files * Revert "fix includes with help from include-what-you-use" This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.
2023-11-30llama : fix alignment of general.name in print meta (#4254)Daniel Bevenius
* llama: fix alignment of general.name in print meta This commit fixes the alignment of the general.name field in the llm_load_print_meta function. Currently the output looks like this: ```console llm_load_print_meta: model ftype = mostly Q4_0 llm_load_print_meta: model params = 13.02 B llm_load_print_meta: model size = 6.86 GiB (4.53 BPW) llm_load_print_meta: general.name = LLaMA v2 ``` And with this commit it looks like this: ```console llm_load_print_meta: model ftype = mostly Q4_0 llm_load_print_meta: model params = 13.02 B llm_load_print_meta: model size = 6.86 GiB (4.53 BPW) llm_load_print_meta: general.name = LLaMA v2 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llama: fix alignment of special tokens Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-11-30llama : fix typical sampling (#4261)tarcey
Typical sampling was broken because after copying new_candidates into canditates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false. Test: Generating with temp=0.0001 (approx. argmax) should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
2023-11-28ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full ↵Georgi Gerganov
offload checks in llama.cpp (#4240) * ggml : use blas even if src0 is not F32 * llama : use n_threads_batch only when n_tokens >= 32 ggml-ci * llama : revert n_threads_batch logic ggml-ci
2023-11-25llama : grammar `reserve` space in `decode_utf8` (#4210)Marcus Dunn
* reserve space for codepoints * improvement for the appended 0
2023-11-24llama : set metal log callback correctly (#4204)slaren
2023-11-24ggml-cuda : support stablelm rope (#4156)slaren
* ggml-cuda : support stablelm rope * remove unused freq_base kernel parameter * add n_dims parameter to llm_build_k_shift, default to n_rot via overload * llama : fix llm_build_k_shift args --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-23llama : KV cache view API + better KV cache management (#4170)Georgi Gerganov
* llama : keep track of used KV cells + better KV cache management * llama : zero KV cache used upon clear ggml-ci * llama : allow exporting a view of the KV cache (#4180) * Allow exporting a view of the KV cache * Allow dumping the sequences per cell in common * Track max contiguous cells value and position as well * Fix max contiguous empty cells index calculation Make dump functions deal with lengths or sequences counts > 10 better * Fix off by one error in dump_kv_cache_view * Add doc comments for KV cache view functions Eliminate cell sequence struct; use llama_seq_id directly Minor cleanups * common : add -dkvc arg for enabling kv cache dumps --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-21stablelm : simplify + speedup generation (#4153)Galunid
2023-11-19gguf-py : export chat templates (#4125)slaren
* gguf-py : export chat templates * llama.cpp : escape new lines in gguf kv info prints * gguf-py : bump version * gguf-py : check chat_template type * gguf-py : initialize chat_template
2023-11-17llama : increase max nodes (#4115)slaren
2023-11-17llama : add functions to get the model's metadata (#4013)slaren
* llama : add functions to get the model's metadata * format -> std::to_string * better documentation
2023-11-17llama : fix data units (#4101)Georgi Gerganov
* llama : fix data units ggml-ci * Revert "llama : fix data units" This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4. * llama : disambiguate data units ggml-ci
2023-11-16Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)Kerfuffle
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-15llama : restore prefix space in llama tokenizer (#4081)Jared Van Bortel
2023-11-14stablelm : StableLM support (#3586)Galunid
* Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers
2023-11-13sync : ggml (backend v2) (#3912)Georgi Gerganov
* sync : ggml (backend v2) (wip) * sync : migrate examples and llama.cpp to dynamic graphs (wip) * sync : update tests + fix max op params to 64 ggml-ci * sync : ggml-cuda ggml-ci * llama : fix save/load state context size ggml-ci * sync : try to fix build on tvOS * sync : pass custom graph sizes in training examples * sync : update graph copies to new ggml API * sync : update sync-ggml.sh with new files * scripts : fix header in sync script * train : fix context size calculations * llama : increase inference graph size up to 4096 nodes * train : allocate grads for backward graphs * train : allocate grads for gb_tmp
2023-11-13Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)Kerfuffle
* Add ReLU and SQR CUDA ops to fix Persimmon offloading * Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-10Unbreak persimmon after #3837 (#4010)Galunid
2023-11-07cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)Meng Zhang
* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build * doc: add comments to ggml_cublas_loaded() * fix defined(...)
2023-11-05llama : mark LLM_ARCH_STARCODER as full offload supported (#3945)Meng Zhang
as done in https://github.com/ggerganov/llama.cpp/pull/3827
2023-11-03llama : change yarn_ext_factor placeholder to -1 (#3922)cebtenzzre
2023-11-02llm : prevent from 1-D tensors being GPU split (#3697)Georgi Gerganov
2023-11-01llama : fix llama_context_default_params after #2268 (#3893)cebtenzzre
2023-11-01llama : implement YaRN RoPE scaling (#2268)cebtenzzre
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01llm : fix llm_build_kqv taking unused tensor (benign, #3837)Georgi Gerganov
2023-11-01llm : fix falcon norm after refactoring (#3837)Georgi Gerganov
2023-11-01llm : add llm_build_context (#3881)Georgi Gerganov
* llm : add llm_build_context * llm : deduce norm eps based on type + explict max_alibi_bias, clamp_kqv * llm : restore the non-graph llm_build_ functional API ggml-ci * llm : cleanup + comments
2023-11-01finetune : add -ngl parameter (#3762)Andrew Godfrey
* Add '-ngl' support to finetune.cpp * Add fprintf in ggml_cuda_op_add When I tried CUDA offloading during finetuning following the readme, I got an assert here. This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora * Add 'finetune.sh', which currently fails when using GPU "error: operator (): Finetuning on tensors with type 'f16' is not yet supported" * tweak finetune.sh * Suppress some warnings in ggml.c * Add f16 implementation to ggml_compute_forward_add_f16_f32 * Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs * finetune.sh: Edit comments * Add "add_f16_f32_f32_cuda" * Tweak an error message * finetune.sh: Add an optional LLAMA_MODEL_DIR variable * finetune.sh: Add an optional LLAMA_TRAINING_DIR variable * train : minor * tabs to spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01llama : refactor graph build code (#3837)Georgi Gerganov
* llama : factor out ggml-alloc from graph graph build functions ggml-ci * metal : disable kernel load log * llama : factor out tensor offloading outside the build call (wip) ggml-ci * llama : offload rest of the models ggml-ci * llama : update offload log messages to print node index * llama : comments * llama : support offloading result_norm + comments * llama : factor graph input into a function * llama : do tensor offload only with CUDA * llama : fix res_norm offloading * llama : try to optimize offloading code * llama : fix non-CUDA build * llama : try to fix build * llama : move refact in correct place + optimize graph input * llama : refactor tensor offloading as callback * llama : add layer index to all tensor names * llama : add functional header * llama : comment ggml-ci * llama : remove obsolete map for layer counting * llama : add llm_build helper functions (#3848) * llama : add llm_build_norm helper function ggml-ci * llama : add llm_build_ffn helper function (#3849) ggml-ci * llama : add llm_build_k_shift helper ggml-ci * llama : fix offloading after recent changes * llama : add llm_build_kv_store helper ggml-ci * llama : remove obsolete offload names * llama : fix llm_build_k_shift to use n_head_kv instead of n_head * llama : simplify falcon Q, K, V computation * llama : remove obsolete comments in build graphs * llama : add llm_build_kqv helper ggml-ci * llama : minor * llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading * llama : fix input allocation logic * llama : update offload functions for KQ tensors * llama : normalize tensor names ggml-ci * llama : enable warning about not offloaded tensors * llama : remove extra ; + deduplicate gate_b logic * llama : add llm_build_inp_embd helper
2023-10-31samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)kalomaze
* Introduce the new Min-P sampler by @kalomaze The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token. * Min-P enabled and set to 0.05 default --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-30ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861)Georgi Gerganov
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h ggml-ci * tests : fix ARM build * ggml : explicitly initialize deprecated type traits * ggml : add math.h to ggml-impl.h * ggml : remove duplicate static assert macros * ggml : prefix lookup tables with ggml_ ggml-ci * ggml-impl : move extern "C" to start of file
2023-10-29Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)Kerfuffle
* Extend llama_kv_cache_seq_rm to allow matichng any sequence * Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear Use llama_kv_cache_clear for cache clearing Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29llama : fix kv shift bug (#3835)Georgi Gerganov
ggml-ci
2023-10-29ggml : quantization refactoring (#3833)Georgi Gerganov
* ggml : factor all quantization code in ggml-quants ggml-ci * ggml-quants : fix Zig and Swift builds + quantize tool ggml-ci * quantize : --pure option for disabling k-quant mixtures --------- Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>