Age | Commit message | Author |
|
ggml-ci
|
|
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
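The row-size change above can be illustrated with a minimal sketch. This is not the ggml implementation: the function name `row_size` and the example layout (32 elements packed into an 18-byte block) are assumptions chosen for illustration. The point is that the byte count of a row is computed with integer arithmetic only, so no fractional "bytes per element" value is ever cast back to a size.

```python
def row_size(type_size: int, block_size: int, n_elements: int) -> int:
    """Bytes needed for one row of n_elements values, using integers only.

    type_size  : bytes occupied by one block (hypothetical parameter name)
    block_size : number of elements stored in one block
    """
    if n_elements % block_size != 0:
        raise ValueError("row length must be a multiple of the block size")
    # Multiply first, then divide: the result is exact because the row
    # length is a whole number of blocks.
    return type_size * n_elements // block_size


# Illustrative layout: 32 elements packed into an 18-byte block.
print(row_size(18, 32, 4096))  # 2304
```

A float-based variant would first compute `type_size / block_size` and then multiply and cast, which is where rounding can creep in; the integer form avoids that step entirely.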
|
|
* convert : support Mixtral as LLAMA arch
* convert : fix n_ff typo
* llama : model loading
* ggml : sync latest ggml_mul_mat_id
* llama : update graph to support MoE
* llama : fix cur -> cur_expert
* llama : first working version
* llama : fix expert weighting in the FFN
* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
* ggml : add n_as argument to ggml_mul_mat_id
* ggml : fix ggml_get_rows to take into account ne02 / ne11
* metal : add more general support for ggml_get_rows + tests
* llama : add basic support for offloading moe with CUDA
* metal : add/mul/div use general kernel when src1 not cont
* metal : reduce the kernel launches for ggml_mul_mat_id
* ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D
* ggml : update get_rows f16 and q
* cuda : support non-contiguous src1 in get_rows
* llama : offload missing ffn_moe_silu
* metal : fix ggml_get_rows to work with non-cont src1
* metal : add indirect mat-vec kernels for all quantization types
* llama : do not quantize expert gating tensors
* llama : add n_expert and n_expert_used to hparams + change quants
* test-backend-ops : add moe test
* cuda : fix get_rows when ncols is odd
* convert : determine n_ctx correctly
* metal : fix ggml_mul_mat_id for F32
* test-backend-ops : make experts more evenly probable (test_moe)
* test-backend-ops : cleanup, add moe test for batches
* test-backend-ops : add cpy from f32 -> all types test
* test-backend-ops : fix dequantize block offset
* llama : fix hard-coded number of experts
* test-backend-ops : simplify and disable slow tests to avoid CI timeout
* test-backend-ops : disable MOE test with thread sanitizer
* cuda : fix mul_mat_id with multi gpu
* convert : use 1e6 rope_freq_base for mixtral
* convert : fix style
* convert : support safetensors format
* gguf-py : bump version
* metal : add cpy f16 -> f32 kernel
* metal : fix binary ops for ne10 % 4 != 0
* test-backend-ops : add one more sum_rows test
* ggml : do not use BLAS with ggml_mul_mat_id
* convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
* convert : use sentencepiece tokenizer for Mixtral-instruct
* convert : make flake8 happy
* metal : fix soft_max kernels
ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92
* metal : limit kernels to not use more than the allowed threads
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
|
|
|
|
(#4396)
|
|
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve candidates_decoded
* reserve candidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
This reverts commit 3773328080e6a139ee83198329a13cf4ff61d707.
* changed decode_utf8 to take src by ref
|
|
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
Add informational output when overrides are applied
Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama : pad KV cache size to 32
* metal : try to improve batched decoding
|
|
|
|
|
|
* Support attention_bias on LLaMA architecture
QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608).
* check existence of qkvo bias while loading llama models
Tested on LLaMA2, CUDA and CPU.
* Update llama.cpp
|
|
* enable qwen to llama.cpp
* llama : do not GPU split bias tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
happens with multi-threaded quantization of Qwen-72B
ggml-ci
|
|
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
|
|
* cmake : fix joining of REAL_GIT_DIR
* fix includes with help from include-what-you-use
* make : remove unneeded deps and add test-rope target
* fix C includes in C++ source files
* Revert "fix includes with help from include-what-you-use"
This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.
|
|
* llama: fix alignment of general.name in print meta
This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.
Currently the output looks like this:
```console
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype  = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size   = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: fix alignment of special tokens
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
Typical sampling was broken because after copying new_candidates into candidates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false.
Test: Generating with temp=0.0001 (approx. argmax) should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
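The bug described above can be sketched in a few lines. This is a simplified model, not the llama.cpp code: the `Candidates` class, the `is_sorted` field, and the function names are hypothetical. The filter ranks tokens by how far their surprisal is from the entropy of the distribution, so the surviving list is ordered by typicality rather than by probability, and the flag has to be cleared.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidates:
    probs: list              # list of (token_id, probability) pairs
    is_sorted: bool = False  # True only if probs is in descending probability order


def sort_by_prob(c: Candidates) -> None:
    """Sort by descending probability, skipping the work if the flag is set."""
    if not c.is_sorted:
        c.probs.sort(key=lambda tp: tp[1], reverse=True)
        c.is_sorted = True


def typical_filter(c: Candidates, p: float) -> None:
    """Keep the most 'typical' tokens until their cumulative mass exceeds p."""
    entropy = -sum(q * math.log(q) for _, q in c.probs)
    # Rank by distance between each token's surprisal and the entropy.
    ranked = sorted(c.probs, key=lambda tp: abs(-math.log(tp[1]) - entropy))
    kept, cumulative = [], 0.0
    for tok, q in ranked:
        kept.append((tok, q))
        cumulative += q
        if cumulative > p:
            break
    c.probs = kept
    c.is_sorted = False  # the fix: the new order is by typicality, not probability


cands = Candidates([(0, 0.6), (1, 0.3), (2, 0.1)], is_sorted=True)
typical_filter(cands, 0.95)
sort_by_prob(cands)      # re-sorts because the flag was cleared
print(cands.probs[0][0])
```

Without the reset, `sort_by_prob` would return early and any later step that assumes descending probability order would read the wrong token first.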
|
|
offload checks in llama.cpp (#4240)
* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32
ggml-ci
* llama : revert n_threads_batch logic
ggml-ci
|
|
* reserve space for codepoints
* improvement for the appended 0
|
|
|
|
* ggml-cuda : support stablelm rope
* remove unused freq_base kernel parameter
* add n_dims parameter to llm_build_k_shift, default to n_rot via overload
* llama : fix llm_build_k_shift args
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear
ggml-ci
* llama : allow exporting a view of the KV cache (#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation
Make dump functions deal with lengths or sequence counts > 10 better
* Fix off by one error in dump_kv_cache_view
* Add doc comments for KV cache view functions
Eliminate cell sequence struct; use llama_seq_id directly
Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
|
|
|
|
* gguf-py : export chat templates
* llama.cpp : escape new lines in gguf kv info prints
* gguf-py : bump version
* gguf-py : check chat_template type
* gguf-py : initialize chat_template
|
|
|
|
* llama : add functions to get the model's metadata
* format -> std::to_string
* better documentation
|
|
* llama : fix data units
ggml-ci
* Revert "llama : fix data units"
This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.
* llama : disambiguate data units
ggml-ci
|
|
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
|
|
|
|
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
|
|
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
|
|
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
|
|
|
|
* prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
|
|
as done in https://github.com/ggerganov/llama.cpp/pull/3827
|
|
|
|
|
|
|
|
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
|
|
|
|
|
|
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
|
|
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
|
|
* llama : factor out ggml-alloc from graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
|
|
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
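The Min-P rule described above can be sketched as follows. This is an illustration rather than the sampler's actual code: the function name `min_p_filter`, the dictionary representation, and the renormalization step are assumptions. A token is kept only if its probability is at least *p* times the probability of the most likely token.

```python
def min_p_filter(probs: dict, p: float = 0.05) -> dict:
    """Drop tokens whose probability is below p * (highest probability).

    probs maps a token to its probability. The surviving probabilities are
    renormalized to sum to 1 (an assumption made for this sketch).
    """
    threshold = p * max(probs.values())
    kept = {tok: q for tok, q in probs.items() if q >= threshold}
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}


dist = {"a": 0.50, "b": 0.30, "c": 0.19, "d": 0.01}
# Threshold is 0.05 * 0.50 = 0.025, so only "d" falls below it.
print(sorted(min_p_filter(dist)))  # ['a', 'b', 'c']
```

Because the cutoff scales with the top probability, a peaked distribution prunes aggressively while a flat one keeps many candidates, which is the quality/variety balance the commit message refers to.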
|
|
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
|
|
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
|
|
ggml-ci
|
|
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
|