diff options
author | Kawrakow <iwankawrakow@gmail.com> | 2025-05-04 11:49:29 +0300
---|---|---
committer | GitHub <noreply@github.com> | 2025-05-04 11:49:29 +0300
commit | 7cb6a76cd0ae54909cdbffa95f163c077827dfc5 (patch) |
tree | 4923e50820e635ac9ad96b22cdc1f546753f3589 |
parent | ce2b0292e18cd8dd87776797fa455e7fc4cfeed9 (diff) |
Update README.md
-rw-r--r-- | README.md | 5
1 file changed, 5 insertions(+), 0 deletions(-)
@@ -6,8 +6,13 @@
 This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp) with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.
 
+>[!IMPORTANT]
+>The new GGUFs for DeepSeek-V3/R1/Lite do not work in this repository. This is due to the backwards-incompatible change in mainline `llama.cpp` that [added MLA support](https://github.com/ggml-org/llama.cpp/pull/12801)
+>2.5 months after MLA was available here, and worked with the original DeepSeek GGUFs. Please use the original GGUF or, if you don't have one, convert the HF safetensors using the Python conversion script in this repository.
+
 ## Latest News
+* May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR #370
 * April 29 2025: Qwen3 support added
 * April 26 2025: GLM-4 support added
 * April 26 2025: Command-A support added
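
For reference, a minimal sketch of the conversion step the note above refers to, assuming the fork ships a mainline-style `convert_hf_to_gguf.py` with the usual `--outfile`/`--outtype` options; the model directory and output filename here are placeholders, not paths from this commit:

```bash
# Hypothetical paths and names; assumes the fork's Python conversion script
# follows the mainline convert_hf_to_gguf.py interface.
python convert_hf_to_gguf.py /path/to/DeepSeek-V3-hf \
    --outfile deepseek-v3-bf16.gguf \
    --outtype bf16
```

The resulting GGUF keeps the original DeepSeek attention tensors, which is what this repository's MLA implementation expects, as opposed to GGUFs produced after mainline's MLA change.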