author    | Kawrakow <iwankawrakow@gmail.com> | 2025-02-27 08:42:18 +0200
committer | GitHub <noreply@github.com> | 2025-02-27 08:42:18 +0200
commit    | 51029edfdf286df76f9268fc87b9514291b2fe42 (patch)
tree      | e3a960cfc8e2453224cde22dd6490c40aca27c43 /examples/llama.swiftui/llama.cpp.swift
parent    | 94b659a2f106e017e5eeb6f492dc9f290e136833 (diff)
Faster MLA on CUDA (#234)
* Slight MLA TG performance improvement on CUDA
The low MLA performance on CUDA is due to
the wk_b * q_nope operation.
It turns into n_head matrix multiplications,
each with its own quantization and GEMV step.
The associated overhead is just too much for TG,
where each GEMV is very fast (a 512 x 128 GEMV is
2*512*128 = 131 KFLOP for DeepSeek-Lite,
4X that for DeepSeekV3/R1).
In addition, each q_nope row was being copied
before quantization; I have now eliminated that copy.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then run
one kernel that does the GEMV for all heads instead of
n_head sequential GEMVs (see the sketch after the change list below).
* Slightly better
* CUDA: Quantize non-contiguous tensors
* Much better MLA
It is a total hack, but it works.
* Cleanup
Remove duplicated GEMVs.
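For illustration, here is a minimal CUDA sketch of the batching idea. The kernel
and variable names are hypothetical (not the actual ik_llama.cpp kernels), and it
uses plain float GEMVs rather than the quantized kernels used in the real code;
it only shows how n_head sequential launches collapse into a single launch.

```cuda
// Minimal sketch only (hypothetical kernel, not the actual ik_llama.cpp code).
#include <cuda_runtime.h>

// Baseline pattern that causes the overhead: one launch per head, e.g.
//   for (int h = 0; h < n_head; ++h) {
//       quantize_row<<<...>>>(q_nope + h*cols, ...);  // per-head quantization
//       gemv<<<...>>>(wk_b + h*rows*cols, ...);       // per-head GEMV
//   }

// Batched replacement: grid.y indexes the head, so a single launch
// computes y[h] = W[h] * x[h] for every head h at once.
__global__ void gemv_all_heads(const float* W,  // [n_head][rows][cols] weights
                               const float* x,  // [n_head][cols] inputs
                               float*       y,  // [n_head][rows] outputs
                               int rows, int cols) {
    const int head = blockIdx.y;
    const int row  = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    const float* Wh = W + (size_t)head * rows * cols;  // this head's matrix
    const float* xh = x + (size_t)head * cols;         // this head's vector
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) {
        sum += Wh[(size_t)row * cols + c] * xh[c];
    }
    y[(size_t)head * rows + row] = sum;
}

// One launch covers all heads (grid.y = n_head), amortizing launch overhead:
//   dim3 block(256);
//   dim3 grid((rows + block.x - 1) / block.x, n_head);
//   gemv_all_heads<<<grid, block>>>(W, x, y, rows, cols);
```

The same single-launch idea applies to the quantization step, which is why the
quantization kernel had to learn to handle non-contiguous tensors: the per-head
q_nope rows are not adjacent in memory once the per-row copy is removed.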
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples/llama.swiftui/llama.cpp.swift')
0 files changed, 0 insertions, 0 deletions