author     Kawrakow <iwankawrakow@gmail.com>    2025-03-17 09:31:56 +0100
committer  GitHub <noreply@github.com>          2025-03-17 09:31:56 +0100
commit     f91b2e38d028c77cc5631295ba0937749e684749 (patch)
tree       0dff35b12df8aaab2aef4e3485d642a43cc69267 /examples/perplexity/README.md
parent     305fabfc3b694d603fdb05d671dd59e2d4c7d58e (diff)
Prepare wk_b tensors of DeepSeek models on the fly (#259)
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens,
and the compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
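The steps above can be illustrated with a minimal numpy sketch of how per-head wk_b and wv_b tensors might be carved out of a combined wkv_b projection at load time. All dimension names and sizes here are hypothetical placeholders, not taken from the actual ggml/DeepSeek code; the point is only the reshape/split/transpose pattern, and why the V slice gets a contiguous copy (mirroring the non-contiguous-GEMV issue noted above).

```python
import numpy as np

# Hypothetical dimensions, for illustration only (not real model config).
n_head, qk_nope_head_dim, v_head_dim, kv_lora_rank = 4, 8, 6, 16

# Combined projection: per head, a K part and a V part are stacked
# along the row axis.
wkv_b = np.random.rand(n_head * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# View as (n_head, qk_nope + v, rank) and split into K and V slices.
w = wkv_b.reshape(n_head, qk_nope_head_dim + v_head_dim, kv_lora_rank)
wk = w[:, :qk_nope_head_dim, :]   # (n_head, qk_nope, rank)
wv = w[:, qk_nope_head_dim:, :]   # (n_head, v, rank) -- a strided view

# wk_b is the transposed K slice, so the latent KV cache can be
# multiplied per head without materializing the full K tensor.
wk_b = np.ascontiguousarray(wk.transpose(0, 2, 1))  # (n_head, rank, qk_nope)

# The V slice of wkv_b is not contiguous; making an explicit copy at
# load time sidesteps matmul paths that require a contiguous operand.
wv_b = np.ascontiguousarray(wv)                     # (n_head, v, rank)
```

In this sketch `np.ascontiguousarray` plays the role of the explicit wv_b copy created during model loading: the 3D slice of wkv_b is a strided view, and copying it once up front trades a little memory for a layout the matmul kernels can consume directly.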
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples/perplexity/README.md')
0 files changed, 0 insertions, 0 deletions