ik_llama.cpp.git

Age	Commit message	Author
2025-06-03	convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483)	Nexes the Elder
	* Direct conversion from fp16 to Q6_0 * forgotten comma * More precise infos
2025-05-24	Legacy quants conversion schemes in convert_hf_to_gguf.py (#449)	Nexes the Elder
	* Legacy quants conversion schemes in convert_hf_to_gguf.py This, notably in order to make smaller conversions to generate an iMatrix file. `Q4_0`,`Q4_1` are here using embeddings, output, attn_k and attn_v in q5_0. `Q5_0`,`Q5_1` are here using embeddings, output, attn_k and attn_v in q8_0. Adapted from the following llama.cpp mainline PR : https://github.com/ggml-org/llama.cpp/pull/9022 Original author @chentyjpm Also, 2 forgotten mentions of FTYPE IQ3_KL in llama.cpp file. * forgotten IQ5_KS case mention
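The per-tensor rule described in the commit above (#449) can be sketched as follows. This is a minimal illustration, not the code from convert_hf_to_gguf.py: the names `BUMPED_TYPE`, `_BUMPED_TENSORS` and `pick_tensor_type` are hypothetical, and matching on the GGUF-style tensor names `token_embd.weight`, `output.weight`, `blk.N.attn_k.weight` and `blk.N.attn_v.weight` is an assumption about how the four tensor classes would be recognized.

```python
import re

# Higher-precision type used for the sensitive tensors, per the commit message:
# Q4_0/Q4_1 keep them in q5_0, Q5_0/Q5_1 keep them in q8_0.
BUMPED_TYPE = {"Q4_0": "Q5_0", "Q4_1": "Q5_0", "Q5_0": "Q8_0", "Q5_1": "Q8_0"}

# Assumed name pattern for embeddings, output, attn_k and attn_v tensors.
_BUMPED_TENSORS = re.compile(r"^(token_embd|output|blk\.\d+\.attn_[kv])\.weight$")


def pick_tensor_type(tensor_name: str, ftype: str) -> str:
    """Return the quantization type to use for one tensor under a given file type."""
    if ftype in BUMPED_TYPE and _BUMPED_TENSORS.match(tensor_name):
        return BUMPED_TYPE[ftype]
    # Every other tensor (and every other file type) keeps the requested type.
    return ftype


if __name__ == "__main__":
    for name in ("token_embd.weight", "blk.0.attn_k.weight", "blk.0.attn_q.weight"):
        print(name, "->", pick_tensor_type(name, "Q4_0"))
```

Keeping the mapping in a single table makes it easy to see which file types are affected and to compare the sketch against the actual script.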
2025-05-09	Fix missing rope_freqs with convert_hf_to_gguf (#402)	saood06
	* lora : fix llama conversion script with ROPE_FREQS * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09	Support for Llama-3-Nemotron models (#377)	saood06
	* conflict resolution * Changes to make work and add longrope support * Changes to n_attention_wv rule * Untested support of 253B * DeciLMCausalModel now reads rope_theta from config.json properly * Remove errant Granite mentions * Better n_attention_vw rule * Update vocab.py --------- Co-authored-by: Yee Man Chan <ymchan@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29	Apply Qwen3 PR from llama.cpp (#355)	Ben Harris

2025-04-22	Add support for bitnet2b_2501 model (#337)	saood06
	* add support for bitnet2b_2501 model * Fixes * Support both model names --------- Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-02-09	Add optional MLA (#188)	Kawrakow
	* Deepseek MLA Optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make MLA optional * Remove some unnecessary copies in the MLA attention * Deepseek MLA Optimizations V2 (#195) * Avoid allocating MHA KV cache when MLA is turned on * Added missing gguf-py file * Added final optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make sure we do have wk_b and wv_b before enabling MLA --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Use type_k and type_v to set the types of the MLA caches They were hard-coded at f16. On my Ryzen-7950X with native bf16 support I get a fairly significant PP performance boost with bf16 KV-cache: PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache. * Better gemm strategy when nth > nhead It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads (with or without MLA). Before this commit, when nth > nhead, heads were processed sequentially with all nth threads participating in each matrix multiplication. Now we find the gcd of nhead and nth and split threads into nth/gcd groups, each group processing nhead/gcd heads. --------- Co-authored-by: Saood Karim <saood05@gmail.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
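The thread-splitting idea from the "Better gemm strategy when nth > nhead" item above can be sketched as below. This is only an illustration under a stated assumption: it reads the commit message as forming gcd(nth, nhead) groups of nth/gcd threads each, with every group handling nhead/gcd heads, since that reading covers every head exactly once. The function name `split_heads_across_threads` is hypothetical and the actual implementation lives in the C/C++ code, not in Python.

```python
from math import gcd


def split_heads_across_threads(nth: int, nhead: int):
    """Partition threads and heads into matching groups.

    Assumed reading of the commit message: g = gcd(nth, nhead) groups,
    each with nth // g threads working together on nhead // g heads.
    """
    g = gcd(nth, nhead)
    threads_per_group = nth // g
    heads_per_group = nhead // g
    plan = []
    for group in range(g):
        threads = list(range(group * threads_per_group, (group + 1) * threads_per_group))
        heads = list(range(group * heads_per_group, (group + 1) * heads_per_group))
        plan.append((threads, heads))
    return plan


if __name__ == "__main__":
    # 32 threads and 16 heads: 16 groups, 2 threads per group, 1 head per group.
    for threads, heads in split_heads_across_threads(32, 16)[:3]:
        print("threads", threads, "-> heads", heads)
```

Compared with running every head through all nth threads in sequence, each matrix multiplication here is shared by fewer threads, and several heads proceed in parallel.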
2025-01-23	Deepseek V3 support added (#176)	saood06
	Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-12	Merge mainline - Aug 12 2024 (#17)	Kawrakow
	* Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27	Merge mainline llama.cpp (#3)	Kawrakow
	* Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>