author | Kawrakow <iwankawrakow@gmail.com> | 2025-02-13 11:50:20 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-02-13 11:50:20 +0200 |
commit | 05242ff17d3685321ea0ea12021f77609219f2a6 | |
tree | 9bcbe0b2e7785b195c4678f86903a1e0c69830c0 /ggml/src/ggml-kompute.cpp | |
parent | 1bbb543478bbc0997c3f86581c4f95338a5eb5c3 | |
Faster MLA prompt processing (#205)
* Do not allocate / report caches that are not used
Only one of the two is in use at a time: either the standard KV cache or the MLA cache, not both.
* Rename X_pe to X_rope
Much easier to follow, at least for my brain, when we have
X_rope : rotary position encoding (RoPE)
X_nope : no position encoding
instead of X_pe and X_nope, where I was wondering wtf is 'pe'
and 'nope'.
* WIP
* WIP
* WIP
* WIP
* Warn user when disabling MLA
* MLA: compile time option to not use transposed KV cache
Cuts the KV cache size nearly in half at the expense of slower
token generation (TG) performance for long contexts (it becomes
similar to running without MLA).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'ggml/src/ggml-kompute.cpp')
0 files changed, 0 insertions, 0 deletions