ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-03-13 12:07:43 +0200
committer	GitHub <noreply@github.com>	2025-03-13 12:07:43 +0200
commit	305fabfc3b694d603fdb05d671dd59e2d4c7d58e (patch)
tree	645b23c154fa8af405f55138f38d264e05faa2ce /examples/perplexity
parent	3f23ed68f17583a8ee63afd0c214f5b39226226c (diff)

FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)

* FlashMLA-2: eliminate intermediate f32 tensors

  This works on the CPU. Prompt processing (PP) performance is ~13% better for 16k tokens, and the compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

  I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'examples/perplexity')

0 files changed, 0 insertions, 0 deletions

