author     Kawrakow <iwankawrakow@gmail.com>    2025-03-17 09:31:56 +0100
committer  GitHub <noreply@github.com>          2025-03-17 09:31:56 +0100
commit     f91b2e38d028c77cc5631295ba0937749e684749 (patch)
tree       0dff35b12df8aaab2aef4e3485d642a43cc69267 /examples/perplexity/README.md
parent     305fabfc3b694d603fdb05d671dd59e2d4c7d58e (diff)
Prepare wk_b tensors of DeepSeek models on the fly (#259)
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens,
and the compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
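The steps above can be illustrated with a minimal numpy sketch of how per-head wk_b and wv_b tensors might be carved out of a combined wkv_b projection at load time. All dimension names and sizes here are hypothetical placeholders, not taken from the actual ggml/DeepSeek code; the point is only the reshape/split/transpose pattern, and why the V slice gets a contiguous copy (mirroring the non-contiguous-GEMV issue noted above).

```python
import numpy as np

# Hypothetical dimensions, for illustration only (not real model config).
n_head, qk_nope_head_dim, v_head_dim, kv_lora_rank = 4, 8, 6, 16

# Combined projection: per head, a K part and a V part are stacked
# along the row axis.
wkv_b = np.random.rand(n_head * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# View as (n_head, qk_nope + v, rank) and split into K and V slices.
w = wkv_b.reshape(n_head, qk_nope_head_dim + v_head_dim, kv_lora_rank)
wk = w[:, :qk_nope_head_dim, :]   # (n_head, qk_nope, rank)
wv = w[:, qk_nope_head_dim:, :]   # (n_head, v, rank) -- a strided view

# wk_b is the transposed K slice, so the latent KV cache can be
# multiplied per head without materializing the full K tensor.
wk_b = np.ascontiguousarray(wk.transpose(0, 2, 1))  # (n_head, rank, qk_nope)

# The V slice of wkv_b is not contiguous; making an explicit copy at
# load time sidesteps matmul paths that require a contiguous operand.
wv_b = np.ascontiguousarray(wv)                     # (n_head, v, rank)
```

In this sketch `np.ascontiguousarray` plays the role of the explicit wv_b copy created during model loading: the 3D slice of wkv_b is a strided view, and copying it once up front trades a little memory for a layout the matmul kernels can consume directly.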
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples/perplexity/README.md')
0 files changed, 0 insertions, 0 deletions