Age | Commit message | Author |
|
* llama : add functions to get the model's metadata
* format -> std::to_string
* better documentation
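A minimal sketch of how the new getters might be used, assuming the llama_model_meta_* functions this commit introduces (buffer sizes are illustrative):

    #include "llama.h"
    #include <cstdio>

    // enumerate a model's GGUF metadata as key/value strings
    void dump_metadata(const struct llama_model * model) {
        char key[256];
        char val[256];
        const int32_t n = llama_model_meta_count(model);
        for (int32_t i = 0; i < n; i++) {
            // both getters return the string length, or a negative value on failure
            if (llama_model_meta_key_by_index    (model, i, key, sizeof(key)) >= 0 &&
                llama_model_meta_val_str_by_index(model, i, val, sizeof(val)) >= 0) {
                printf("%s = %s\n", key, val);
            }
        }
    }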
|
|
* llama : fix data units
ggml-ci
* Revert "llama : fix data units"
This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.
* llama : disambiguate data units
ggml-ci
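For reference, the ambiguity being resolved: 1 MB = 1000^2 = 1,000,000 bytes, while 1 MiB = 1024^2 = 1,048,576 bytes. Sizes computed with powers of two are labeled with the binary units (MiB/GiB) so the printed numbers match how they are calculated.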
|
|
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
|
|
|
|
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
|
|
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
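A minimal sketch of the dynamic-graph change, assuming the ggml_new_graph_custom API from the backend v2 sync: the node capacity of a graph, and whether gradients are allocated for it, is now chosen at creation time.

    #include "ggml.h"

    struct ggml_cgraph * make_graph(struct ggml_context * ctx, bool training) {
        // inference graphs may now hold up to 4096 nodes; training graphs
        // additionally allocate storage for gradients
        return ggml_new_graph_custom(ctx, 4096, /*grads =*/ training);
    }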
|
|
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
|
|
|
|
* prototyping support for running on the CPU in a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
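A minimal sketch of the idea, assuming ggml_cublas_loaded() as documented in this change: a cuBLAS-enabled binary checks at runtime whether CUDA actually initialized, and otherwise behaves like a CPU-only build.

    #include "ggml.h"
    #if defined(GGML_USE_CUBLAS)
    #include "ggml-cuda.h"
    #endif

    // decide how many layers to offload; fall back to pure CPU when the
    // cuBLAS build finds no usable CUDA device at runtime
    static int effective_gpu_layers(int requested) {
    #if defined(GGML_USE_CUBLAS)
        if (ggml_cublas_loaded()) {
            return requested;
        }
    #endif
        return 0;
    }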
|
|
as done in https://github.com/ggerganov/llama.cpp/pull/3827
|
|
|
|
|
|
|
|
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
|
|
|
|
|
|
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
|
|
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning by following the readme, I hit an assert here.
This probably isn't an important case, because inference later warns that you should use f16 or f32 instead when using LoRA
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
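A minimal sketch of the mixed-precision add these commits implement (the function name is illustrative; ggml's public fp16 conversion helpers are assumed): convert to f32, add, convert back.

    #include "ggml.h"

    // dst (f16) = a (f16) + b (f32), element by element
    static void add_f16_f32_f16(ggml_fp16_t * dst, const ggml_fp16_t * a, const float * b, int n) {
        for (int i = 0; i < n; i++) {
            dst[i] = ggml_fp32_to_fp16(ggml_fp16_to_fp32(a[i]) + b[i]);
        }
    }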
|
|
* llama : factor out ggml-alloc from the graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact to the correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
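A rough sketch of what a helper like llm_build_norm factors out of every graph builder (the real helper also threads hparams, an optional bias, and the offload callback; the names below are illustrative):

    #include "ggml.h"

    // the RMS-norm + scale pattern repeated across all model build functions
    static struct ggml_tensor * build_norm_rms(
            struct ggml_context * ctx,
            struct ggml_tensor  * cur,
            struct ggml_tensor  * weight,
            float                 eps) {
        cur = ggml_rms_norm(ctx, cur, eps);
        return ggml_mul(ctx, cur, weight); // elementwise scale by the norm weight
    }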
|
|
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
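A minimal sketch of the Min-P rule described above (llama.cpp exposes this as a sampler; the filtering loop here is only illustrative): a token survives if its probability is at least p times that of the most likely token.

    #include <algorithm>
    #include <vector>

    struct candidate { int id; float p; };

    std::vector<candidate> min_p_filter(const std::vector<candidate> & cands, float p /* e.g. 0.05f */) {
        float max_p = 0.0f;
        for (const auto & c : cands) max_p = std::max(max_p, c.p);

        std::vector<candidate> kept;
        for (const auto & c : cands) {
            if (c.p >= p * max_p) kept.push_back(c); // keep tokens above the relative floor
        }
        return kept;
    }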
|
|
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
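For context, a sketch of the lookup-table technique behind the now ggml_-prefixed tables (the table name here is illustrative): precompute all 65536 fp16 -> fp32 conversions once, then convert by indexing.

    #include "ggml.h"
    #include <cstdint>

    static float table_f32_f16[1 << 16];

    void init_table(void) {
        for (uint32_t i = 0; i < (1 << 16); i++) {
            table_f32_f16[i] = ggml_fp16_to_fp32((ggml_fp16_t) i);
        }
    }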
|
|
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
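A minimal sketch of the reworked calls, assuming the signatures of this period; per this change, a negative seq_id in llama_kv_cache_seq_rm matches any sequence (and a negative p1 means "to the end"):

    #include "llama.h"

    void reset_cache(struct llama_context * ctx) {
        llama_kv_cache_clear(ctx); // drop all cells
    }

    void drop_tail(struct llama_context * ctx, llama_pos p0) {
        // remove positions [p0, end) for every sequence
        llama_kv_cache_seq_rm(ctx, -1, p0, -1);
    }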
|
|
ggml-ci
|
|
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
|
|
(#3747)
* Allow k-quant quantization to fall back when the tensor size is incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
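A rough sketch of the fallback described above (the exact fallback table differs; QK_K = 256 is the k-quant super-block size): tensors whose row size is not divisible by QK_K cannot use k-quants and drop to a compatible type, with a warning.

    #include "ggml.h"
    #include <cstdint>
    #include <cstdio>

    enum ggml_type pick_quant_type(enum ggml_type wanted, int64_t ne0) {
        const int64_t qk_k = 256; // k-quant super-block size
        if (ne0 % qk_k != 0) {
            fprintf(stderr, "warning: row size %lld not divisible by %lld, falling back from k-quants\n",
                    (long long) ne0, (long long) qk_k);
            return GGML_TYPE_Q8_0; // illustrative fallback choice
        }
        return wanted;
    }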
|
|
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
|
|
|
|
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
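A rough sketch of the dispatch these commits set up (the threshold and flag names are illustrative): quantized mat-muls use the custom MMQ kernels for small batches, and dequantize + cuBLAS GEMM, optionally with tensor cores, for large ones.

    // decide between the quantized MMQ kernels and the cuBLAS GEMM path
    static bool use_mmq(long long n_batch, bool force_mmq, bool have_tensor_cores) {
        if (force_mmq) {          // GGML_CUDA_FORCE_MMQ-style override
            return true;
        }
        if (!have_tensor_cores) { // without tensor cores cuBLAS gains little
            return true;
        }
        return n_batch < 32;      // illustrative small-batch cutoff
    }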
|
|
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
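A minimal sketch with the model-taking variants, assuming the signatures of this period: token queries now need only a llama_model, not a context.

    #include "llama.h"
    #include <cstdio>

    void print_bos(const struct llama_model * model) {
        const llama_token bos = llama_token_bos(model);
        printf("bos token: %d '%s' (score %.2f)\n",
               bos,
               llama_token_get_text (model, bos),
               llama_token_get_score(model, bos));
    }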
|
|
* Add test for MPT tokenization
* Revert code motion
* Remove unnecessary restriction in test case
* Clarify logic in conversion
|
|
* Add validation for special token ids to llama.cpp
Small optimization for llama_byte_to_token SPM mode
* Fix BPE newline check, only I could break something so simple
* Killll meeeeee
* Account for GGUF_GET_KEY only setting the output value when the key exists
* Minor code cleanups.
* Fix convert.py error msg when added tokens are out of range
* Make gguf SpecialVocab vocab size-aware
Update conversion scripts accordingly
* Avoid a string copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* sampling : refactor init to use llama_sampling_params
* llama : combine repetition, frequency and presence penalties in 1 call
* examples : remove embd-input and gptneox-wip
* sampling : rename penalty params + reduce size of "prev" vector
* sampling : add llama_sampling_print helper
* sampling : hide prev behind API and apply #3661
ggml-ci
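A minimal sketch of the combined call, assuming the merged signature (the parameter values are illustrative): repetition, frequency and presence penalties now go through one function.

    #include "llama.h"
    #include <vector>

    void apply_penalties(struct llama_context * ctx,
                         llama_token_data_array * candidates,
                         const std::vector<llama_token> & last_tokens) {
        llama_sample_repetition_penalties(
                ctx, candidates,
                last_tokens.data(), last_tokens.size(),
                /*penalty_repeat  =*/ 1.1f,
                /*penalty_freq    =*/ 0.0f,
                /*penalty_present =*/ 0.0f);
    }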
|
|
* Minor fixes and fixed memleak
* Use const auto references in range-based loops (C++17)
|
|
* sampling : one sequence per sampling context
ggml-ci
* speculative : add tree-based sampling support
ggml-ci
* speculative : reuse the n_parallel CLI param
* speculative : refactor sampling
* examples : fix build after sampling refactoring
ggml-ci
* batched : fix n_seq_id
* sampling : fix malloc
ggml-ci
* swift : fix build
ggml-ci
* swift : try to fix build
ggml-ci
* prompts : add assistant.txt
* common : add llama_batch_add() and llama_batch_clear() helpers
* speculative : minor refactor
ggml-ci
* minor : comments + rename
ggml-ci
* speculative : fix off-by-one for n_drafted
* speculative : fix the n_drafted fix + p constants
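A minimal sketch using the new common helpers, assuming their signatures in common.h: clear the batch, then add tokens one by one with position, sequence ids, and whether logits are wanted.

    #include "common.h"
    #include <vector>

    void fill_batch(llama_batch & batch, const std::vector<llama_token> & toks) {
        llama_batch_clear(batch);
        for (size_t i = 0; i < toks.size(); i++) {
            // request logits only for the last token
            llama_batch_add(batch, toks[i], (llama_pos) i, { 0 }, i == toks.size() - 1);
        }
    }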
|
|
|
|
|
|
* Rewrite special token handling from #1931
* shorten param name, add st verification by type
* use offsets instead of copy by substr
* formatting, remove copying iterator on delete
* llama : normalize code-style
* swift fix
* print pfx/sfx if verb, main: split pfx input sfx
* dont add space when using special tokens
* minor : comment + spacing
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
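A minimal sketch of the behavior change, assuming the llama_tokenize signature of this period: with the special flag set, special tokens in the input text are parsed instead of being treated as plain text.

    #include "llama.h"
    #include <string>
    #include <vector>

    std::vector<llama_token> tokenize(const struct llama_model * model, const std::string & text) {
        std::vector<llama_token> toks(text.size() + 8); // generous upper bound
        const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                     toks.data(), (int) toks.size(),
                                     /*add_bos =*/ true, /*special =*/ true);
        toks.resize(n > 0 ? n : 0);
        return toks;
    }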
|
|
|
|
This commit removes `n_threads` from the `llama_decode_internal`
function's doc comment, as that parameter no longer exists.
It looks like this parameter was removed in
Commit 16bc66d9479edd5ee12ec734973554d4493c5dfa ("llama.cpp : split
llama_context_params into model and context params").
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Fixing minor bugs in bpe_gpt2_preprocess
* Don't add bos token in test
|
|
* feat: Support bloom models
* fix(bloom): fix model size
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)
* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
* mpt : protect against "clip_qkv": null in mpt-7b
* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models
* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
* mpt : standardized all tensor names to follow GGUF spec
* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
* mpt : fixed comment s/gptneox/mpt/
* mpt : remove tabs, trailing whitespace
* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
* comment out n_past instead of marking it unused
* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
* mpt : remove unused tokenizer_json in convert script
* ggml : remove obsolete n_past assert in ggml_alibi
* llama : print clamp_kqv and max_alibi_bias hparams
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
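A minimal sketch of the clamp_kqv handling mentioned above: when the hparam is present, the Q/K/V activations are clamped to [-clamp_kqv, clamp_kqv] via ggml_clamp; when it is left out of the metadata, no clamping is applied.

    #include "ggml.h"

    struct ggml_tensor * maybe_clamp(struct ggml_context * ctx, struct ggml_tensor * cur, float clamp_kqv) {
        if (clamp_kqv > 0.0f) {
            cur = ggml_clamp(ctx, cur, -clamp_kqv, clamp_kqv);
        }
        return cur;
    }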
|
|
* refact : fix convert script + zero out KV cache to avoid nans
* ggml : silu(-inf) should never happen
* metal : assert various kernel requirements
|
|
* sync : ggml (ggml-backend)
ggml-ci
* zig : add ggml-backend to the build
|
|
|
|
|
|
* Produces garbage output
* wip: correct tensors up to RoPE
* correct tensors thru RoPE
* Correct outputs through masked & softmax'd KQ
* fp32 works
* Rename adept->persimmon
* Produces correct outputs
* clean up convert scripts
* remove printing logic from ggml.c
* remove prints from llama.cpp & fix merge
* trivial cleanups
* Add offload funcs
* update conversion script to directly take adept artifacts rather than a .safetensors file
* Fix norm eps bug
* Support sqr and concat on metal, persimmon-8b-q4 runs correctly
* Small changes from review
* Formatting changes
* Minor changes to conversion script
* Remove old script
* Fix editorconfig formatting
* Fix build
* add overlooked offload code ggml-ci
|
|
Fix: `sentencepiece` tokenizers with added tokens failed with an incorrect assertion
|
|
* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure the kv cache head points to a valid slot in llama_decode_internal
* Add some comments to prevent dumb people (like me) from getting confused.
|
|
* Enable external file and add datestamp
* Add name of external file at end
* Upload ToK2024
* Delete ToK2024.txt
* Experiments with jeopardy
* Move ParallelQuestions to /prompts and rename
* Interim commit
* Interim commit
* Final revision
* Remove trailing whitespace
* remove cmake_all.sh
* Remove cmake_all.sh
* Changed .gitignore
* Improved reporting and new question files.
* Corrected typo
* More LLM questions
* Update LLM-questions.txt
* Yet more LLM-questions
* Remove jeopardy results file
* Reinstate original jeopardy.sh
* Update examples/parallel/parallel.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|