diff options
author | Kawrakow <iwankawrakow@gmail.com> | 2025-02-09 19:48:44 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-02-09 19:48:44 +0200 |
commit | c12f73ba6153d162f36434cb48e36dd3649b7701 (patch) | |
tree | 3594697f680959c7130a3f109cd9e4abbc8d6e7d /common/common.h | |
parent | cae2b81155fdad75b7beab3a835c438120412969 (diff) |
Add optional MLA (#188)
* Deepseek MLA Optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make MLA optional
* Remove some unnecessary copies in the MLA attention
* Deepseek MLA Optimizations V2 (#195)
* Avoid allocating MHA KV cache when MLA is turned on
* Added missing gguf-py file
* Added final optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make sure we do have wk_b and wv_b before enabling MLA
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use type_k and type_v to set the types of the MLA caches
They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.
* Better gemm strategy when nth > nhead
It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we ind the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.
---------
Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'common/common.h')
-rw-r--r-- | common/common.h | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/common/common.h b/common/common.h index 73d7d650..fc1ae619 100644 --- a/common/common.h +++ b/common/common.h @@ -174,6 +174,7 @@ struct gpt_params { bool simple_io = false; // improves compatibility with subprocesses and limited consoles bool cont_batching = true; // insert new sequences for decoding on-the-fly bool flash_attn = false; // flash attention + bool mla_attn = false; // MLA bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix bool ignore_eos = false; // ignore generated EOS tokens |