author | Kawrakow <iwankawrakow@gmail.com> | 2025-05-07 17:38:22 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-05-07 17:38:22 +0300 |
commit | 30536ee369c829c7161b0170de550936b4548a6b (patch) | |
tree | ec2dd4eebc9815fffffb0a29551cacdea836e004 /include/llama.h | |
parent | 17c6fc6b7303915e0fd74bee53c4de8d21746d52 (diff) |
FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3
* Much better
The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for token generation (TG). Going
to 1 warp in these cases gives a significant boost in TG
performance (tested with DeepSeek-Lite).
* Sadly, the previous commit was wrong
* Finalizing
* Also add these
* Minor
* Minor tweak
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'include/llama.h')
0 files changed, 0 insertions, 0 deletions