ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-02-13 11:50:20 +0200
committer	GitHub <noreply@github.com>	2025-02-13 11:50:20 +0200
commit	05242ff17d3685321ea0ea12021f77609219f2a6 (patch)
tree	9bcbe0b2e7785b195c4678f86903a1e0c69830c0 /examples/convert_legacy_llama.py
parent	1bbb543478bbc0997c3f86581c4f95338a5eb5c3 (diff)

Faster MLA prompt processing (#205)

* Do not allocate / report caches that are not used

  It is either the standard KV cache or MLA cache, not both.

* Rename X_pe to X_rope

  Much easier to follow, at least for my brain, when we have
    X_rope : rotational position encoding
    X_nope : no position encoding
  instead of X_pe and X_nope, where I was wondering wtf is 'pe' and 'nope'.

* WIP

* WIP

* WIP

* WIP

* Warn user when disabling MLA

* MLA: compile time option to not use transposed KV cache

  Cuts KV cache size in nearly half at the expense of slower TG performance
  for long contexts (it becomes similar to no-MLA).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'examples/convert_legacy_llama.py')

0 files changed, 0 insertions, 0 deletions

