|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Much faster and, it looks like, better iq1_m quantization
* Cleanup
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* llama4: WIP
* llama4: this seems to be working
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Well, there was also the initial MLA PR, which was derived from @fairydreaming's work
|
|
Forgot to add @Nexesenex
|
|
* Use links for ggml/llama.cpp authors
* This file is not HTML
* More
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) list is not the same as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS) list.
This PR corrects my mistake.
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Metal: WIP to update Metal FA implementation
Dk=192, Dv=128 works, but not Dk=576, Dv=512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix GCC compilation errors on ARM
* One more
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* MoE improvements on Metal
This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with a ubatch size of 128. But then
performance degrades. Why?
* Some cleanup
* Much better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_k: slightly better quantization
Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).
* Small improvement for type-1 quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Make sure no interleaved quants are being used for token embeddings
also with `--pure` and/or `--custom-q`.
* Simplify
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
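For context, the interleaved ("_r4"/"_r8") types repack several rows together, which does not suit the per-token row lookups done on the token embedding tensor. Below is a minimal sketch of the kind of guard this commit describes; the names, enum values, and lookup are my own illustration, not the actual ik_llama.cpp code.
```cpp
// Illustrative sketch only: even when --pure or --custom-q requests an
// interleaved type, map it back to its plain counterpart for the token
// embedding tensor, which is read row by row per token.
#include <string>
#include <unordered_map>

enum qtype { Q4_0, Q4_0_R8, Q8_0, Q8_0_R8 };   // toy subset of the real type enum

static qtype resolve_type(const std::string & tensor_name, qtype requested) {
    // Hypothetical map from row-interleaved types to their row-major equivalents.
    static const std::unordered_map<int, qtype> deinterleave = {
        { Q4_0_R8, Q4_0 },
        { Q8_0_R8, Q8_0 },
    };
    if (tensor_name == "token_embd.weight") {
        auto it = deinterleave.find(requested);
        if (it != deinterleave.end()) return it->second;  // never interleave embeddings
    }
    return requested;
}

int main() {
    return resolve_type("token_embd.weight", Q8_0_R8) == Q8_0 ? 0 : 1;
}
```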
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Better make_qx_quants
Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.
* Same for iq4_nl, iq4_xs
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
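The commit does not show make_qx_quants itself; the general shape of such a routine is a weighted least-squares fit of a per-row scale. The sketch below only illustrates that idea, with a function name, candidate range, and grid size that are assumptions rather than the actual implementation.
```cpp
// Hedged sketch: try a few candidate roundings derived from the row maximum and,
// for each, keep the closed-form scale d = sum(w*x*q)/sum(w*q*q) that minimizes
// the weighted error sum(w*(x - d*q)^2). Not the actual ik_llama.cpp code.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

static float fit_scale(int n, const float * x, const float * w, int nmax, std::vector<int8_t> & q) {
    float max = 0, amax = 0;
    for (int i = 0; i < n; ++i) {
        const float ax = std::fabs(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }
    q.assign(n, 0);
    if (amax == 0) return 0.f;                 // all-zero row: nothing to do
    std::vector<int8_t> cand(n);
    float best_d = 0, best_err = INFINITY;
    for (int is = -4; is <= 4; ++is) {         // a few candidate inverse scales
        const float id = (nmax - 0.1f*is) / max;
        float sumqx = 0, sumq2 = 0;
        for (int i = 0; i < n; ++i) {
            int l = (int)std::lround(id * x[i]);
            l = std::max(-nmax, std::min(nmax - 1, l));
            cand[i] = (int8_t)l;
            sumqx += w[i]*x[i]*l;
            sumq2 += w[i]*l*l;
        }
        if (sumq2 <= 0) continue;
        const float d = sumqx / sumq2;         // best scale for this rounding
        float err = 0;
        for (int i = 0; i < n; ++i) {
            const float diff = x[i] - d*cand[i];
            err += w[i]*diff*diff;
        }
        if (err < best_err) { best_err = err; best_d = d; q = cand; }
    }
    return best_d;
}

int main() {
    const float x[8] = {0.1f, -0.7f, 0.3f, 1.2f, -0.05f, 0.9f, -1.1f, 0.4f};
    const float w[8] = {1, 1, 1, 1, 1, 1, 1, 1};   // e.g. imatrix importances
    std::vector<int8_t> q;
    const float d = fit_scale(8, x, w, 8, q);      // nmax = 8 ~ a 4-bit-like grid
    std::printf("scale = %g, first quant = %d\n", d, q[0]);
}
```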
|
|
with --pure (#294)
* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* Add check if selected type is possible with --pure
I often want to quantize with --pure to see quantization performance
without quantization mixes. But for models where there are tensors
with row sizes that are not a multiple of 256, this results in a crash
for k- and i-quants. Hence, let's add a check if the quant selected
via --pure is applicable, and change it if not (see the sketch after this entry).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
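A minimal sketch of the applicability check described above, using a toy type table and names of my own (the real quantize tool has many more types and a different interface):
```cpp
// Illustrative only: k- and i-quants use 256-wide super-blocks, so a tensor
// whose row size is not a multiple of 256 gets a 32-wide legacy type instead
// of crashing.
#include <cstdint>
#include <cstdio>

struct quant_info { const char * name; int block_size; };

static const quant_info Q4_K = {"q4_K", 256};   // stand-in for any k-/i-quant
static const quant_info Q8_0 = {"q8_0",  32};   // fallback with a 32-wide block

static quant_info pick_pure_type(quant_info requested, int64_t row_size) {
    if (row_size % requested.block_size != 0) {
        std::printf("row size %lld is not a multiple of %d: using %s instead of %s\n",
                    (long long)row_size, requested.block_size, Q8_0.name, requested.name);
        return Q8_0;   // requested quant cannot cover this row
    }
    return requested;
}

int main() {
    pick_pure_type(Q4_K, 7168);   // fine: 7168 is a multiple of 256
    pick_pure_type(Q4_K, 2880);   // not a multiple of 256 -> falls back to q8_0
}
```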
|
|
* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* llama-bench: enable having a different number of threads for tg and pp
* Add -tgb to usage
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Update sweep bench (deprecating .jsonl support)
* Fix README.md
|
|
* Make fused MoE reproducible
As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.
* Slightly better
* Also do it for non-fused mul_mat_id
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
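The commit does not say how reproducibility is achieved. A common way to make a threaded reduction bit-identical from run to run is to give each thread a fixed index range and its own accumulator, then combine the partial sums in a fixed order; the sketch below illustrates only that general idea, not the ik_llama.cpp implementation.
```cpp
// Illustrative only: the summation order is fully determined by the data layout
// and the (fixed) combine loop, not by thread scheduling.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

static float deterministic_sum(const std::vector<float> & x, int n_threads) {
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    const size_t chunk = (x.size() + n_threads - 1) / n_threads;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const size_t beg = t*chunk, end = std::min(x.size(), beg + chunk);
            double acc = 0.0;
            for (size_t i = beg; i < end; ++i) acc += x[i];   // fixed per-thread range
            partial[t] = acc;
        });
    }
    for (auto & w : workers) w.join();
    double sum = 0.0;
    for (int t = 0; t < n_threads; ++t) sum += partial[t];    // fixed combine order
    return (float)sum;
}

int main() {
    std::vector<float> x(1000, 0.001f);
    return deterministic_sum(x, 4) > 0 ? 0 : 1;
}
```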
|
|
* Improve DeepSeek batched processing speed
* Revert the commented out section in iqk_mul_mat.cpp
It does have some benefit at long contexts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix it for nth > rk2
* Handle rk2%nth_k != 0
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding ability to use THP on Linux
* Use the actual page size used for mmap also in munmap
* Add -thp to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
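A rough illustration of the mmap/munmap point above (my sketch, not the actual loader code): when the mapping length is rounded up to the transparent-huge-page size, the same rounded length should be used when unmapping, otherwise the tail of the mapping is leaked.
```cpp
// Assumes Linux and a 2 MiB transparent-huge-page size.
#include <cstdio>
#include <sys/mman.h>

int main() {
    const size_t huge_page = 2u*1024*1024;                   // assumed THP size
    const size_t want      = 100u*1024*1024;                 // requested buffer size
    const size_t len       = (want + huge_page - 1) & ~(huge_page - 1);  // round up

    void * p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Ask the kernel to back this range with transparent huge pages.
    if (madvise(p, len, MADV_HUGEPAGE) != 0) std::perror("madvise");

    // ... use the buffer ...

    // Unmap with the same rounded-up length that was mapped.
    if (munmap(p, len) != 0) { std::perror("munmap"); return 1; }
    return 0;
}
```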
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP Gemma3: not working
* gemma3: build_gemma3 seems to be working now
* Revert changes to convert_hf_to_gguf.py
It wasn't working, so I guess it is better to leave the
conversion up to upstream.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
This results in GGGGGGGGGGGGG when generating with
mla = 3, fa = 0.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
* FlashMLA-3: the best of both worlds - CPU only
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present and so didn't need to be computed,
and hence didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
as it is not supported.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
* FlashMLA-2: avoid conversions to f32 also on CUDA
* Be able to compute for more than 65535 tokens
On CUDA, just a quick hack that allows us to concatenate tensors
with more than 65535 rows along the zeroth dimension, as needed by
FlashMLA-2. Also needed some care in the perplexity tool to
avoid int overflows when evaluating the computed logits (see the sketch after this entry).
* Reduce memory usage for FlashMLA-2
Oh, also fix int overflow in the CUDA concat implementation.
It is funny how the llama.cpp 64-bit police have gone (almost) everywhere
and replaced 32-bit ints with 64-bit ints, needed or not,
but haven't done it where it is actually needed.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
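The int-overflow point in the perplexity tool is easy to illustrate. In the minimal, self-contained example below the vocabulary size and variable names are assumptions, not taken from the actual code: with a large vocabulary, position * n_vocab exceeds the 32-bit range well before 65535 evaluated positions, so the flat index into the logits buffer has to be computed in 64 bits.
```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int n_vocab = 129280;            // DeepSeek-sized vocabulary (assumed)
    const int i       = 70000;             // evaluated position past 65535

    const int64_t idx       = (int64_t)i * n_vocab;   // correct 64-bit flat index
    const int32_t truncated = (int32_t)idx;           // what a 32-bit index would see

    std::printf("64-bit index: %lld, truncated to 32 bits: %d\n",
                (long long)idx, truncated);
    return 0;
}
```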
|
|
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache
* WIP
* FlashMLA(CUDA) - allow q8_0 for KV cache
This works, and PP is not bad, but TG is still quite a bit slower.
* FlashMLA(CUDA) - allow q8_0 for KV cache
This is better. ~9% slower than f16 cache for short contexts,
nearly on par at 16k tokens.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* This gives us ~20% TG speedup for DeepSeek on CUDA
* Slightly better
* Also do it for plain (not fused) mul_mat_id
* Guard against numerical precision issues for MLA on CUDA
* imatrix: wv_b <-> wkv_b
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|