author     Kawrakow <iwankawrakow@gmail.com>  2025-03-17 09:31:56 +0100
committer  GitHub <noreply@github.com>        2025-03-17 09:31:56 +0100
commit     f91b2e38d028c77cc5631295ba0937749e684749 (patch)
tree       0dff35b12df8aaab2aef4e3485d642a43cc69267 /examples/perplexity
parent     305fabfc3b694d603fdb05d671dd59e2d4c7d58e (diff)
Prepare wk_b tensors of DeepSeek models on the fly (#259)
* FlashMLA-2: eliminate intermediate f32 tensors

  This works on the CPU. PP performance is ~13% better for 16k tokens and the compute buffer is quite a bit smaller.

* FlashMLA-2: enable the fast path only on the CPU for now

  I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

* Prepare wk_b when loading DeepSeek models (if wk_b is missing)

* Add some comments

* Fix the case where wkv_b is quantized with k- or i-quants

* Fix CUDA

  There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
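The load-time split described above can be illustrated with a minimal sketch. The C++ below assumes plain f32 weights and hypothetical dimension names (n_head, kv_lora_rank, qk_nope_dim, v_dim); the actual commit operates on ggml tensors and also handles k-/i-quantized wkv_b, but the layout arithmetic is the same: the per-head k block is transposed into wk_b, and the per-head v block is copied into a contiguous wv_b so the quantized CUDA GEMV never sees a non-contiguous matrix.

```cpp
// Minimal sketch (plain C++, f32 only; dimension names are illustrative).
// wkv_b is a row-major [n_head * (qk_nope_dim + v_dim), kv_lora_rank] matrix:
// for each head, the first qk_nope_dim rows are the k projection and the
// next v_dim rows are the v projection.
#include <cstddef>
#include <vector>

struct SplitResult {
    std::vector<float> wk_b; // per head: [kv_lora_rank, qk_nope_dim] (k part, transposed)
    std::vector<float> wv_b; // per head: [v_dim, kv_lora_rank]       (v part, contiguous copy)
};

SplitResult split_wkv_b(const std::vector<float> & wkv_b,
                        int n_head, int qk_nope_dim, int v_dim, int kv_lora_rank) {
    const int rows_per_head = qk_nope_dim + v_dim;
    SplitResult r;
    r.wk_b.resize((size_t)n_head * kv_lora_rank * qk_nope_dim);
    r.wv_b.resize((size_t)n_head * v_dim * kv_lora_rank);
    for (int h = 0; h < n_head; ++h) {
        // k block of this head: transpose it, so the "absorbed" matmul against
        // the query operates on a contiguous matrix.
        for (int i = 0; i < qk_nope_dim; ++i) {
            const size_t src_row = ((size_t)h * rows_per_head + i) * kv_lora_rank;
            for (int j = 0; j < kv_lora_rank; ++j) {
                r.wk_b[((size_t)h * kv_lora_rank + j) * qk_nope_dim + i] = wkv_b[src_row + j];
            }
        }
        // v block of this head: copy the rows as-is into a dedicated tensor,
        // replacing the non-contiguous 3D view of wkv_b used previously.
        for (int i = 0; i < v_dim; ++i) {
            const size_t src_row = ((size_t)h * rows_per_head + qk_nope_dim + i) * kv_lora_rank;
            for (int j = 0; j < kv_lora_rank; ++j) {
                r.wv_b[((size_t)h * v_dim + i) * kv_lora_rank + j] = wkv_b[src_row + j];
            }
        }
    }
    return r;
}
```

For quantized models the same split cannot be done by pointer arithmetic alone, which is why the commit handles the k-/i-quant case separately (dequantize, rearrange, requantize rather than viewing the raw quantized blocks).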
Diffstat (limited to 'examples/perplexity')
0 files changed, 0 insertions, 0 deletions