author    | Kawrakow <iwankawrakow@gmail.com> | 2025-02-27 08:42:18 +0200
committer | GitHub <noreply@github.com> | 2025-02-27 08:42:18 +0200
commit    | 51029edfdf286df76f9268fc87b9514291b2fe42 (patch)
tree      | e3a960cfc8e2453224cde22dd6490c40aca27c43 /examples/llama.swiftui/llama.cpp.swift
parent    | 94b659a2f106e017e5eeb6f492dc9f290e136833 (diff)
Faster MLA on CUDA (#234)
* Slight MLA TG performance improvement on CUDA
The low MLA performance on CUDA is due to
the wk_b * q_nope operation.
It turns into n_head matrix multiplications,
each with its own quantization and GEMV step.
The associated overhead is just too much for TG,
where each GEMV is very fast (a 512 x 128 GEMV is
2*512*128 = 131 KFLOP for DeepSeek-Lite,
4X that for DeepSeekV3/R1).
In addition, each q_nope row was being copied
before quantization; I have now eliminated that copy.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then run
one kernel that does the GEMV for all heads instead of
n_head sequential GEMVs (see the sketch after the change list below).
* Slightly better
* CUDA: Quantize non-contiguous tensors
* Much better MLA
It is a total hack, but it works.
* Cleanup
Remove duplicated GEMVs.
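For illustration, here is a minimal CUDA sketch of the batching idea. The kernel
and variable names are hypothetical (not the actual ik_llama.cpp kernels), and it
uses plain float GEMVs rather than the quantized kernels used in the real code;
it only shows how n_head sequential launches collapse into a single launch.

```cuda
// Minimal sketch only (hypothetical kernel, not the actual ik_llama.cpp code).
#include <cuda_runtime.h>

// Baseline pattern that causes the overhead: one launch per head, e.g.
//   for (int h = 0; h < n_head; ++h) {
//       quantize_row<<<...>>>(q_nope + h*cols, ...);  // per-head quantization
//       gemv<<<...>>>(wk_b + h*rows*cols, ...);       // per-head GEMV
//   }

// Batched replacement: grid.y indexes the head, so a single launch
// computes y[h] = W[h] * x[h] for every head h at once.
__global__ void gemv_all_heads(const float* W,  // [n_head][rows][cols] weights
                               const float* x,  // [n_head][cols] inputs
                               float*       y,  // [n_head][rows] outputs
                               int rows, int cols) {
    const int head = blockIdx.y;
    const int row  = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    const float* Wh = W + (size_t)head * rows * cols;  // this head's matrix
    const float* xh = x + (size_t)head * cols;         // this head's vector
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) {
        sum += Wh[(size_t)row * cols + c] * xh[c];
    }
    y[(size_t)head * rows + row] = sum;
}

// One launch covers all heads (grid.y = n_head), amortizing launch overhead:
//   dim3 block(256);
//   dim3 grid((rows + block.x - 1) / block.x, n_head);
//   gemv_all_heads<<<grid, block>>>(W, x, y, rows, cols);
```

The same single-launch idea applies to the quantization step, which is why the
quantization kernel had to learn to handle non-contiguous tensors: the per-head
q_nope rows are not adjacent in memory once the per-row copy is removed.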
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples/llama.swiftui/llama.cpp.swift')
0 files changed, 0 insertions, 0 deletions