Age | Commit message | Author |
|
* llama : add functions to get the model's metadata
* format -> std::to_string
* better documentation
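A minimal sketch of how the new getters might be used, assuming the llama_model_meta_* functions this commit introduces (buffer sizes are illustrative):

    #include "llama.h"
    #include <cstdio>

    // enumerate a model's GGUF metadata as key/value strings
    void dump_metadata(const struct llama_model * model) {
        char key[256];
        char val[256];
        const int32_t n = llama_model_meta_count(model);
        for (int32_t i = 0; i < n; i++) {
            // both getters return the string length, or a negative value on failure
            if (llama_model_meta_key_by_index    (model, i, key, sizeof(key)) >= 0 &&
                llama_model_meta_val_str_by_index(model, i, val, sizeof(val)) >= 0) {
                printf("%s = %s\n", key, val);
            }
        }
    }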
|
|
* llama : fix data units
ggml-ci
* Revert "llama : fix data units"
This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.
* llama : disambiguate data units
ggml-ci
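For reference, the ambiguity being resolved: 1 MB = 1000^2 = 1,000,000 bytes, while 1 MiB = 1024^2 = 1,048,576 bytes. Sizes computed with powers of two are labeled with the binary units (MiB/GiB) so the printed numbers match how they are calculated.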
|
|
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
|
|
|
|
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
|
|
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
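A minimal sketch of the dynamic-graph change, assuming the ggml_new_graph_custom API from the backend v2 sync: the node capacity of a graph, and whether gradients are allocated for it, is now chosen at creation time.

    #include "ggml.h"

    struct ggml_cgraph * make_graph(struct ggml_context * ctx, bool training) {
        // inference graphs may now hold up to 4096 nodes; training graphs
        // additionally allocate storage for gradients
        return ggml_new_graph_custom(ctx, 4096, /*grads =*/ training);
    }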
|
|
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
|
|
|
|
* prototyping support for running on the CPU in a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
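A minimal sketch of the idea, assuming ggml_cublas_loaded() as documented in this change: a cuBLAS-enabled binary checks at runtime whether CUDA actually initialized, and otherwise behaves like a CPU-only build.

    #include "ggml.h"
    #if defined(GGML_USE_CUBLAS)
    #include "ggml-cuda.h"
    #endif

    // decide how many layers to offload; fall back to pure CPU when the
    // cuBLAS build finds no usable CUDA device at runtime
    static int effective_gpu_layers(int requested) {
    #if defined(GGML_USE_CUBLAS)
        if (ggml_cublas_loaded()) {
            return requested;
        }
    #endif
        return 0;
    }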
|
|
as done in https://github.com/ggerganov/llama.cpp/pull/3827
|
|
|
|
|
|
|
|
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
|
|
|
|
|
|
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
|
|
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning by following the readme, I hit an assert here.
This probably isn't an important case, because inference later warns that you should use f16 or f32 instead when using LoRA
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
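A minimal sketch of the mixed-precision add these commits implement (the function name is illustrative; ggml's public fp16 conversion helpers are assumed): convert to f32, add, convert back.

    #include "ggml.h"

    // dst (f16) = a (f16) + b (f32), element by element
    static void add_f16_f32_f16(ggml_fp16_t * dst, const ggml_fp16_t * a, const float * b, int n) {
        for (int i = 0; i < n; i++) {
            dst[i] = ggml_fp32_to_fp16(ggml_fp16_to_fp32(a[i]) + b[i]);
        }
    }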
|
|
* llama : factor out ggml-alloc from the graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact to the correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
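A rough sketch of what a helper like llm_build_norm factors out of every graph builder (the real helper also threads hparams, an optional bias, and the offload callback; the names below are illustrative):

    #include "ggml.h"

    // the RMS-norm + scale pattern repeated across all model build functions
    static struct ggml_tensor * build_norm_rms(
            struct ggml_context * ctx,
            struct ggml_tensor  * cur,
            struct ggml_tensor  * weight,
            float                 eps) {
        cur = ggml_rms_norm(ctx, cur, eps);
        return ggml_mul(ctx, cur, weight); // elementwise scale by the norm weight
    }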
|
|
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
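A minimal sketch of the Min-P rule described above (llama.cpp exposes this as a sampler; the filtering loop here is only illustrative): a token survives if its probability is at least p times that of the most likely token.

    #include <algorithm>
    #include <vector>

    struct candidate { int id; float p; };

    std::vector<candidate> min_p_filter(const std::vector<candidate> & cands, float p /* e.g. 0.05f */) {
        float max_p = 0.0f;
        for (const auto & c : cands) max_p = std::max(max_p, c.p);

        std::vector<candidate> kept;
        for (const auto & c : cands) {
            if (c.p >= p * max_p) kept.push_back(c); // keep tokens above the relative floor
        }
        return kept;
    }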
|
|
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
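For context, a sketch of the lookup-table technique behind the now ggml_-prefixed tables (the table name here is illustrative): precompute all 65536 fp16 -> fp32 conversions once, then convert by indexing.

    #include "ggml.h"
    #include <cstdint>

    static float table_f32_f16[1 << 16];

    void init_table(void) {
        for (uint32_t i = 0; i < (1 << 16); i++) {
            table_f32_f16[i] = ggml_fp16_to_fp32((ggml_fp16_t) i);
        }
    }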
|
|
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
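A minimal sketch of the reworked calls, assuming the signatures of this period; per this change, a negative seq_id in llama_kv_cache_seq_rm matches any sequence (and a negative p1 means "to the end"):

    #include "llama.h"

    void reset_cache(struct llama_context * ctx) {
        llama_kv_cache_clear(ctx); // drop all cells
    }

    void drop_tail(struct llama_context * ctx, llama_pos p0) {
        // remove positions [p0, end) for every sequence
        llama_kv_cache_seq_rm(ctx, -1, p0, -1);
    }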
|
|
ggml-ci
|
|
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
|
|
(#3747)
* Allow k-quant quantization to fall back when the tensor size is incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
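A rough sketch of the fallback described above (the exact fallback table differs; QK_K = 256 is the k-quant super-block size): tensors whose row size is not divisible by QK_K cannot use k-quants and drop to a compatible type, with a warning.

    #include "ggml.h"
    #include <cstdint>
    #include <cstdio>

    enum ggml_type pick_quant_type(enum ggml_type wanted, int64_t ne0) {
        const int64_t qk_k = 256; // k-quant super-block size
        if (ne0 % qk_k != 0) {
            fprintf(stderr, "warning: row size %lld not divisible by %lld, falling back from k-quants\n",
                    (long long) ne0, (long long) qk_k);
            return GGML_TYPE_Q8_0; // illustrative fallback choice
        }
        return wanted;
    }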
|
|
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
|
|
|
|
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
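A rough sketch of the dispatch these commits set up (the threshold and flag names are illustrative): quantized mat-muls use the custom MMQ kernels for small batches, and dequantize + cuBLAS GEMM, optionally with tensor cores, for large ones.

    // decide between the quantized MMQ kernels and the cuBLAS GEMM path
    static bool use_mmq(long long n_batch, bool force_mmq, bool have_tensor_cores) {
        if (force_mmq) {          // GGML_CUDA_FORCE_MMQ-style override
            return true;
        }
        if (!have_tensor_cores) { // without tensor cores cuBLAS gains little
            return true;
        }
        return n_batch < 32;      // illustrative small-batch cutoff
    }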
|
|
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
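A minimal sketch with the model-taking variants, assuming the signatures of this period: token queries now need only a llama_model, not a context.

    #include "llama.h"
    #include <cstdio>

    void print_bos(const struct llama_model * model) {
        const llama_token bos = llama_token_bos(model);
        printf("bos token: %d '%s' (score %.2f)\n",
               bos,
               llama_token_get_text (model, bos),
               llama_token_get_score(model, bos));
    }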
|
|
* Add test for MPT tokenization
* Revert code motion
* Remove unnecessary restriction in test case
* Clarify logic in conversion
|
|
* Add validation for special token ids to llama.cpp
Small optimization for llama_byte_to_token SPM mode
* Fix BPE newline check, only I could break something so simple
* Killll meeeeee
* Account for GGUF_GET_KEY only setting the output value when the key exists
* Minor code cleanups.
* Fix convert.py error msg when added tokens are out of range
* Make gguf SpecialVocab vocab size-aware
Update conversion scripts accordingly
* Avoid a string copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* sampling : refactor init to use llama_sampling_params
* llama : combine repetition, frequency and presence penalties in 1 call
* examples : remove embd-input and gptneox-wip
* sampling : rename penalty params + reduce size of "prev" vector
* sampling : add llama_sampling_print helper
* sampling : hide prev behind API and apply #3661
ggml-ci
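A minimal sketch of the combined call, assuming the merged signature (the parameter values are illustrative): repetition, frequency and presence penalties now go through one function.

    #include "llama.h"
    #include <vector>

    void apply_penalties(struct llama_context * ctx,
                         llama_token_data_array * candidates,
                         const std::vector<llama_token> & last_tokens) {
        llama_sample_repetition_penalties(
                ctx, candidates,
                last_tokens.data(), last_tokens.size(),
                /*penalty_repeat  =*/ 1.1f,
                /*penalty_freq    =*/ 0.0f,
                /*penalty_present =*/ 0.0f);
    }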
|
|
* Minor fixes and fixed memleak
* Use const auto references in range-based loops (C++17)
|
|
* sampling : one sequence per sampling context
ggml-ci
* speculative : add tree-based sampling support
ggml-ci
* speculative : reuse the n_parallel CLI param
* speculative : refactor sampling
* examples : fix build after sampling refactoring
ggml-ci
* batched : fix n_seq_id
* sampling : fix malloc
ggml-ci
* swift : fix build
ggml-ci
* swift : try to fix build
ggml-ci
* prompts : add assistant.txt
* common : add llama_batch_add() and llama_batch_clear() helpers
* speculative : minor refactor
ggml-ci
* minor : comments + rename
ggml-ci
* speculative : fix off-by-one for n_drafted
* speculative : fix the n_drafted fix + p constants
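A minimal sketch using the new common helpers, assuming their signatures in common.h: clear the batch, then add tokens one by one with position, sequence ids, and whether logits are wanted.

    #include "common.h"
    #include <vector>

    void fill_batch(llama_batch & batch, const std::vector<llama_token> & toks) {
        llama_batch_clear(batch);
        for (size_t i = 0; i < toks.size(); i++) {
            // request logits only for the last token
            llama_batch_add(batch, toks[i], (llama_pos) i, { 0 }, i == toks.size() - 1);
        }
    }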
|
|
|
|
|
|
* Rewrite special token handling from #1931
* shorten param name, add st verification by type
* use offsets instead of copy by substr
* formatting, remove copying iterator on delete
* llama : normalize code-style
* swift fix
* print pfx/sfx if verb, main: split pfx input sfx
* dont add space when using special tokens
* minor : comment + spacing
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
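A minimal sketch of the behavior change, assuming the llama_tokenize signature of this period: with the special flag set, special tokens in the input text are parsed instead of being treated as plain text.

    #include "llama.h"
    #include <string>
    #include <vector>

    std::vector<llama_token> tokenize(const struct llama_model * model, const std::string & text) {
        std::vector<llama_token> toks(text.size() + 8); // generous upper bound
        const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                     toks.data(), (int) toks.size(),
                                     /*add_bos =*/ true, /*special =*/ true);
        toks.resize(n > 0 ? n : 0);
        return toks;
    }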
|
|
|
|
This commit removes `n_threads` from the `llama_decode_internal`
function's doc comment, as that parameter no longer exists.
It looks like this parameter was removed in
Commit 16bc66d9479edd5ee12ec734973554d4493c5dfa ("llama.cpp : split
llama_context_params into model and context params").
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Fixing minor bugs in bpe_gpt2_preprocess
* Don't add bos token in test
|
|
* feat: Support bloom models
* fix(bloom): fix model size
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)
* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
* mpt : protect against "clip_qkv": null in mpt-7b
* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models
* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
* mpt : standardized all tensor names to follow GGUF spec
* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
* mpt : fixed comment s/gptneox/mpt/
* mpt : remove tabs, trailing whitespace
* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
* comment out n_past instead of marking it unused
* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
* mpt : remove unused tokenizer_json in convert script
* ggml : remove obsolete n_past assert in ggml_alibi
* llama : print clamp_kqv and max_alibi_bias hparams
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
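A minimal sketch of the clamp_kqv handling mentioned above: when the hparam is present, the Q/K/V activations are clamped to [-clamp_kqv, clamp_kqv] via ggml_clamp; when it is left out of the metadata, no clamping is applied.

    #include "ggml.h"

    struct ggml_tensor * maybe_clamp(struct ggml_context * ctx, struct ggml_tensor * cur, float clamp_kqv) {
        if (clamp_kqv > 0.0f) {
            cur = ggml_clamp(ctx, cur, -clamp_kqv, clamp_kqv);
        }
        return cur;
    }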
|
|
* refact : fix convert script + zero out KV cache to avoid nans
* ggml : silu(-inf) should never happen
* metal : assert various kernel requirements
|
|
* sync : ggml (ggml-backend)
ggml-ci
* zig : add ggml-backend to the build
|
|
|
|
|
|
* Produces garbage output
* wip: correct tensors up to RoPE
* correct tensors thru RoPE
* Correct outputs through masked & softmax'd KQ
* fp32 works
* Rename adept->persimmon
* Produces correct outputs
* clean up convert scripts
* remove printing logic from ggml.c
* remove prints from llama.cpp & fix merge
* trivial cleanups
* Add offload funcs
* update conversion script to directly take adept artifacts rather than a .safetensors file
* Fix norm eps bug
* Support sqr and concat on metal, persimmon-8b-q4 runs correctly
* Small changes from review
* Formatting changes
* Minor changes to conversion script
* Remove old script
* Fix editorconfig formatting
* Fix build
* add overlooked offload code ggml-ci
|
|
Fix: `sentencepiece` tokenizers with added tokens failed with an incorrect assertion
|
|
* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure the kv cache head points to a valid slot in llama_decode_internal
* Add some comments to prevent dumb people (like me) from getting confused.
|
|
* Enable external file and add datestamp
* Add name of external file at end
* Upload ToK2024
* Delete ToK2024.txt
* Experiments with jeopardy
* Move ParallelQuestions to /prompts and rename
* Interim commit
* Interim commit
* Final revision
* Remove trailing whitespace
* remove cmake_all.sh
* Remove cmake_all.sh
* Changed .gitignore
* Improved reporting and new question files.
* Corrected typo
* More LLM questions
* Update LLM-questions.txt
* Yet more LLM-questions
* Remove jeopardy results file
* Reinstate original jeopardy.sh
* Update examples/parallel/parallel.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|