ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-05-07 17:38:22 +0300
committer	GitHub <noreply@github.com>	2025-05-07 17:38:22 +0300
commit	30536ee369c829c7161b0170de550936b4548a6b (patch)
tree	ec2dd4eebc9815fffffb0a29551cacdea836e004 /include/llama.h
parent	17c6fc6b7303915e0fd74bee53c4de8d21746d52 (diff)

FlashMLA-3 for DeepSeek models on CUDA (#386)

* CUDA WIP: support for FlashMLA-3

* Much better

  The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite)

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'include/llama.h')

0 files changed, 0 insertions, 0 deletions

