path: root/include/llama.h
author    Kawrakow <iwankawrakow@gmail.com>    2025-05-07 17:38:22 +0300
committer GitHub <noreply@github.com>    2025-05-07 17:38:22 +0300
commit 30536ee369c829c7161b0170de550936b4548a6b (patch)
tree   ec2dd4eebc9815fffffb0a29551cacdea836e004 /include/llama.h
parent 17c6fc6b7303915e0fd74bee53c4de8d21746d52 (diff)
FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3

* Much better

  The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite).

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
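For readers skimming the log: the warp-count fix described above boils down to choosing the CUDA block shape per batch size. Below is a minimal sketch of that choice, assuming a helper name, a 4-warp default, and an n_tokens == 1 test for token generation; none of these names or values are taken from the actual ik_llama.cpp sources.

    // Hypothetical sketch, not the actual ik_llama.cpp launch code.
    #include <cuda_runtime.h>

    constexpr int WARP_SIZE = 32;

    // Token generation (TG) processes a single token per step; launching the
    // 3D matmul (wk_b * kv_cache, MoE) with 4 warps then leaves most of the
    // block idle, so a single warp runs faster.
    static int num_warps_3d_mul_mat(int n_tokens) {
        return n_tokens == 1 ? 1 : 4;
    }

    // At the (omitted) launch site:
    //   const dim3 block(WARP_SIZE, num_warps_3d_mul_mat(n_tokens), 1);
    //   mul_mat_3d_kernel<<<grid, block>>>(...);

The point is only the batch-size dependence: a block shaped for batched prompt processing wastes three of its four warps when there is a single token to produce.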
Diffstat (limited to 'include/llama.h')
0 files changed, 0 insertions, 0 deletions