diff options
author | Kawrakow <iwankawrakow@gmail.com> | 2025-05-04 11:49:29 +0300
---|---|---
committer | GitHub <noreply@github.com> | 2025-05-04 11:49:29 +0300
commit | 7cb6a76cd0ae54909cdbffa95f163c077827dfc5 (patch) |
tree | 4923e50820e635ac9ad96b22cdc1f546753f3589 |
parent | ce2b0292e18cd8dd87776797fa455e7fc4cfeed9 (diff) |
Update README.md
-rw-r--r-- | README.md | 5
1 file changed, 5 insertions(+), 0 deletions(-)
@@ -6,8 +6,13 @@
 This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp) with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.
 
+>[!IMPORTANT]
+>The new GGUFs for DeepSeek-V3/R1/Lite do not work in this repository. This is due to the backwards-incompatible change in mainline `llama.cpp` that [added MLA support](https://github.com/ggml-org/llama.cpp/pull/12801)
+>2.5 months after MLA was available here, and worked with the original DeepSeek GGUFs. Please use the original GGUF or, if you don't have one, convert the HF safetensors using the Python conversion script in this repository.
+
 ## Latest News
+* May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR #370
 * April 29 2025: Qwen3 support added
 * April 26 2025: GLM-4 support added
 * April 26 2025: Command-A support added
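
For reference, a minimal sketch of the conversion step the note above refers to, assuming the fork ships a mainline-style `convert_hf_to_gguf.py` with the usual `--outfile`/`--outtype` options; the model directory and output filename here are placeholders, not paths from this commit:

```bash
# Hypothetical paths and names; assumes the fork's Python conversion script
# follows the mainline convert_hf_to_gguf.py interface.
python convert_hf_to_gguf.py /path/to/DeepSeek-V3-hf \
    --outfile deepseek-v3-bf16.gguf \
    --outtype bf16
```

The resulting GGUF keeps the original DeepSeek attention tensors, which is what this repository's MLA implementation expects, as opposed to GGUFs produced after mainline's MLA change.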