path: root/examples/sweep-bench
author Kawrakow <iwankawrakow@gmail.com> 2025-05-12 07:49:00 +0300
committer GitHub <noreply@github.com> 2025-05-12 07:49:00 +0300
commit 465569dff8b49a195450a0eb1974fd72a32fcebc
tree af7f5b4af3738318a28ad9c9de722231c41c3d63 /examples/sweep-bench
parent 8669c3db2b98f05775292778dd05f424ee0cd250
Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA

  Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that.

* Rearrange MLA K cache so it fits new CUDA FA implementation

* constexpr and minor changes

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
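
The layout mismatch described in the commit message can be illustrated with a minimal sketch. This is not the code from this commit: the function name `rope_last_to_rope_first`, the parameter `rope_dim`, and the toy row contents are all hypothetical, and a real K cache holds numeric tensors on the GPU rather than Python lists. The sketch only shows the idea of moving the RoPE portion of each cached row from the end to the beginning, which is the order the commit message says the FA kernel assumes.

```python
def rope_last_to_rope_first(row, rope_dim):
    """Return a copy of one K-cache row with its trailing RoPE entries moved to the front.

    Hypothetical helper for illustration only.
    row      -- list laid out as [non-RoPE part | RoPE part]
    rope_dim -- number of trailing entries that belong to the RoPE part
    """
    if not 0 <= rope_dim <= len(row):
        raise ValueError("rope_dim must be between 0 and len(row)")
    split = len(row) - rope_dim
    # New layout: [RoPE part | non-RoPE part]
    return row[split:] + row[:split]


# Toy row: four non-RoPE entries followed by two RoPE entries (sizes are made up).
row = ["c0", "c1", "c2", "c3", "r0", "r1"]
rearranged = rope_last_to_rope_first(row, rope_dim=2)
print(rearranged)
```

A kernel that reads the first `rope_dim` entries of each row as the RoPE portion would read the wrong data from the original layout, which is consistent with the "Does not work" note above.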
Diffstat (limited to 'examples/sweep-bench')
0 files changed, 0 insertions, 0 deletions