Age | Commit message | Author |
|
|
|
* update webui
* add token/s in webui
* add webui files
* fix webui first message disappearing in some browsers
* add missing html files
---------
Co-authored-by: firecoperana <firecoperana>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Remove kv_l, kvt_l and just use k_l and v_l
* Hopefully take care of missing V cache (MLA)
* Fix save and restore when there is no V cache
* Fix double print
* Update write_kv_cache_data and read_kv_cache_data to be MLA aware (see the sketch after this commit message)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
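The MLA-aware save path boils down to not assuming a K/V pair per layer. A minimal sketch of the idea, with an invented layer layout and signature (the real write_kv_cache_data works on ggml tensors, not byte vectors):

```cpp
// Hedged sketch, not the actual ik_llama.cpp code: with MLA there is no
// separate V cache, so saving the KV state has to skip the missing V data.
#include <cstdint>
#include <ostream>
#include <vector>

struct LayerCache {
    std::vector<uint8_t> k;   // K (or compressed KV) data, always present
    std::vector<uint8_t> v;   // empty when MLA is used: no V cache exists
};

void write_kv_cache_data(std::ostream & out, const std::vector<LayerCache> & layers) {
    for (const auto & l : layers) {
        const uint64_t nk = l.k.size(), nv = l.v.size();
        out.write(reinterpret_cast<const char *>(&nk), sizeof nk);
        if (nk > 0) out.write(reinterpret_cast<const char *>(l.k.data()), (std::streamsize)nk);
        out.write(reinterpret_cast<const char *>(&nv), sizeof nv);   // nv == 0 for MLA
        if (nv > 0) out.write(reinterpret_cast<const char *>(l.v.data()), (std::streamsize)nv);
    }
}
```

The read side would rely on the stored sizes, so a cache saved without V restores correctly (the "no V cache" fix above).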
|
|
* iq1_m_r4: CUDA dequantize
* iq1_m_r4: CUDA dequantize
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* MMQ for iq4_ks_r4
* MMQ for iq5_ks_r4
* Add forgotten file
* Another forgotten file
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Also do the dequantize approach for mul_mat_id
* Also do the dequantize approach for iqk_moe_fused_up_gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s_r4: CUDA dequantize
* iq1_s_r4: CUDA GEMV
* iq1_s_r4: MMQ on CUDA
Requires Turing or better (will fall back to dequantize+cuBLAS on older cards; see the dispatch sketch after this commit message).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
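To make the fallback explicit, here is a minimal sketch of the dispatch this implies. The helper and enum names are invented; only the Turing threshold (compute capability 7.5, i.e. 750 in llama.cpp's integer convention) is taken as given:

```cpp
// Hedged sketch of the MMQ-vs-cuBLAS fallback, not the actual CUDA backend code.
enum class MulMatPath { MMQ, DEQUANT_CUBLAS };

constexpr int CC_TURING = 750;   // compute capability 7.5

MulMatPath choose_mul_mat_path(int compute_capability, bool mmq_kernel_available) {
    if (mmq_kernel_available && compute_capability >= CC_TURING) {
        return MulMatPath::MMQ;              // quantized int8 GEMM path
    }
    return MulMatPath::DEQUANT_CUBLAS;       // dequantize, then f16/f32 cuBLAS GEMM
}
```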
|
|
* Adding the top-n-sigma sampler (see the sketch after this commit message)
* Fix typos in XTC PR
* Update README.md for main and server
* More README
* More README
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
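As I understand the top-n-sigma rule, it keeps only tokens whose logit lies within n standard deviations of the maximum logit before sampling. A minimal sketch under that assumption (the function name and exact statistics are mine, not the ik_llama.cpp implementation):

```cpp
// Hedged sketch of top-n-sigma filtering: drop every token whose logit is
// more than n standard deviations below the best logit.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<int> top_n_sigma_filter(const std::vector<float> & logits, float n) {
    // assumes a non-empty logit vector
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    float mean = 0.0f;
    for (float l : logits) mean += l;
    mean /= (float)logits.size();

    float var = 0.0f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / (float)logits.size());

    std::vector<int> kept;
    for (int i = 0; i < (int)logits.size(); ++i) {
        if (logits[i] >= max_logit - n * sigma) kept.push_back(i);
    }
    return kept;   // sample (softmax + multinomial, etc.) over these indices
}
```

The effect is a confidence-adaptive cutoff: a peaked distribution keeps few tokens, a flat one keeps many.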
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Direct conversion from fp16 to Q6_0
* forgotten comma
* More precise info
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds (see the sketch after this commit message).
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
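The overflow fix above is purely about the order of operations in the f16 path: fold the (small) row scale into one operand before the multiply-adds, so neither the partial products nor the running sum leaves the f16 range (max ≈ 65504). A toy illustration of the principle, not the actual NEON kernel (`_Float16` is a GCC/Clang extension used only for this sketch):

```cpp
// Hedged sketch of the iq4_kt fix: scale first, then multiply-add in f16.
_Float16 dot_f16_scaled(const _Float16 * x, const _Float16 * y, _Float16 row_scale, int n) {
    _Float16 acc = 0;
    for (int i = 0; i < n; ++i) {
        // (row_scale * x[i]) stays small, so the f16 accumulation does not
        // overflow the way a sum of unscaled x[i]*y[i] products did.
        acc += (row_scale * x[i]) * y[i];
    }
    return acc;
}
```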
|
|
* iq2_kt: Metal dequantize
* iq2_kt: Metal GEMV
Performance is actually quite decent: 52 t/s on my M2-Max for LLaMA-3.1-8B
* iq3_kt: Metal dequantize
* iq3_kt: Metal GEMV
Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LLaMA-3.1-8B.
Flipping signs is a costly affair.
* iq4_kt: Metal dequantize - getting NaNs
* iq4_kt: Metal GEMV - also not working
* iq4_kt: Metal still not working
* Disable iq4_kt on Metal for now
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* Remove kv_l, kvt_l and just use k_l and v_l
* Hopefully take care of missing V cache (MLA)
* Replace MLA-specific KV cache with the standard KV cache V2 (#473)
* Fix save and restore when there is no V cache
* Fix double print
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: saood06 <saood05@gmail.com>
|
|
* iq2_kt: NEON implementation
* iq3_kt: NEON implementation
* iq4_kt: not working NEON implementation
* iq4_kt: NEON implementation
Have to use f32 arithmetic, else I get gibberish?
Correspondingly, ridiculously slow.
* Cleanup
* iq4_kt: slightly faster TG on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* CUDA: iq4_ks_r4 GEMV and GEMM
* CUDA: iq5_ks_r4 GEMV and GEMM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* CUDA: iq4_k_r4 dequantize
* CUDA: iq4_k_r4 GEMV
~10% slower than iq4_k.
* CUDA: slightly faster iq4_k_r4 GEMV
* CUDA: slightly faster iq4_k_r4 GEMV
We are now within 3% of iq4_k
* CUDA: iq5_k_r4 dequantize
* CUDA: iq5_k_r4 GEMV
~3% slower than iq5_k.
* CUDA: iq3_k_r4 dequantize
* CUDA: iq3_k_r4 GEMV
* CUDA: slightly faster iq3_k_r4 GEMV
* CUDA: iq2_k_r4 GEMV
* CUDA: faster iq2_k_r4 GEMV
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Legacy quants conversion schemes in convert_hf_to_gguf.py
This is notably in order to make smaller conversions for generating an imatrix file.
`Q4_0` and `Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0` and `Q5_1` here use q8_0 for the embeddings, output, attn_k and attn_v tensors.
Adapted from the llama.cpp mainline PR https://github.com/ggml-org/llama.cpp/pull/9022
(original author @chentyjpm).
Also fixes 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.
* forgotten IQ5_KS case mention
|
|
* Somewhat faster iq3_kt (AVX2)
* Cleanup
* Slightly faster iq4_kt
* Slightly faster iq4_kt
PP is now almost 50% better than the original, TG is ~20% better
* Cleanup
* Very slightly faster iq4_kt TG
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix MSVC compilation
* MSVC cannot capture constexpr in lambdas
* Arghhh
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP
* WIP
* WIP
* Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8, the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (see the sketch after this commit message).
* iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
* iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).
My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
* iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
* Minor
* Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B, 4096) = 6.3913
Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B, 4096) = 6.3825
Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8-bit scales for blocks of 32 (the bpw bookkeeping is spelled out after this commit message).
This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17 t/s eval)
* flatten 3inst iters + avx2 (0.3 t/s eval)
* iq3_kt (0.3 t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22 t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slighty faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
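Two details in the long commit above are easy to lose in the bullet list, so they are spelled out here. First, the weighted error and the bits-per-weight bookkeeping; the weights and bit counts are taken from the commit text, while the normalization of the weighted rmse is my assumption:

```latex
% Weighted error (the iq2_xxs convention, later also applied to iq2_kt):
\[
\mathrm{rmse}_w = \sqrt{\frac{\sum_i w_i\,(x_i - \hat x_i)^2}{\sum_i w_i}},
\qquad w_i = \frac{\sigma^2}{4} + x_i^2 .
\]
% Bits-per-weight bookkeeping quoted in the commit:
\[
\text{iq2\_kt: } \tfrac{16}{8} = 2\ \text{bpw for the codes}
\;\Rightarrow\; 2.0625 + 0.0625 = 2.125\ \text{bpw with block scales},
\qquad
\text{iq4\_kt: } \tfrac{15}{4} + \tfrac{8}{32} = 3.75 + 0.25 = 4.0\ \text{bpw}.
\]
```

Second, the clustered nearest-code search that makes the brute-force quantization tractable, with cluster members stored contiguously as in the later speed-up commits. A hedged sketch of the idea (all names invented; plain squared distance stands in for the weighted error above):

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hedged sketch, not the ik_llama.cpp quantizer: instead of brute-forcing the
// whole codebook for every group of 8 weights, pick the nearest cluster first
// and scan only its members, which sit back-to-back in memory so the inner
// loop is a sequential, SIMD-friendly pass.
struct ClusteredCodebook {
    static constexpr int dim = 8;             // one code covers a group of 8 weights
    std::vector<float>    centroids;          // n_clusters * dim
    std::vector<float>    points;             // cluster members, contiguous per cluster
    std::vector<uint32_t> codes;              // code id of each member, same order
    std::vector<std::pair<int,int>> ranges;   // [begin, end) into points/codes per cluster

    uint32_t find_nearest(const float * x) const {
        // 1. nearest cluster centroid
        int best_c = 0;
        float best_d = std::numeric_limits<float>::max();
        for (int c = 0; c < (int)ranges.size(); ++c) {
            float d = 0;
            for (int j = 0; j < dim; ++j) {
                const float t = x[j] - centroids[c*dim + j];
                d += t*t;
            }
            if (d < best_d) { best_d = d; best_c = c; }
        }
        // 2. exhaustive search, but only inside that cluster
        const auto [begin, end] = ranges[best_c];
        uint32_t best_code = 0;
        best_d = std::numeric_limits<float>::max();
        for (int i = begin; i < end; ++i) {
            float d = 0;
            const float * p = &points[(size_t)i*dim];
            for (int j = 0; j < dim; ++j) {
                const float t = x[j] - p[j];
                d += t*t;
            }
            if (d < best_d) { best_d = d; best_code = codes[i]; }
        }
        return best_code;
    }
};
```

Scoring the candidates with the weighted error instead of plain L2 does not change the structure; the clustering only narrows the candidate set.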
|
|
gguf-split : improve --split and --merge logic (#9619)
* make sure the --split and --merge params are not specified at the same time (see the sketch after this commit message)
* update gguf-split params parse logic
* Update examples/gguf-split/gguf-split.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
---------
gguf-split : add basic checks (#9499)
* gguf-split : do not overwrite existing files when merging
* gguf-split : error when too many arguments are passed
Authored-by: slaren <slarengh@gmail.com>
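A minimal sketch of the two checks described above, not the actual examples/gguf-split code (the function name and message wording are invented):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hedged sketch of the gguf-split argument checks: --split and --merge are
// mutually exclusive, and merging must not overwrite an existing output file.
bool validate_gguf_split_args(bool do_split, bool do_merge, const std::string & output_path) {
    if (do_split && do_merge) {
        std::fprintf(stderr, "error: --split and --merge cannot be specified at the same time\n");
        return false;
    }
    if (do_merge && std::ifstream(output_path).good()) {
        std::fprintf(stderr, "error: %s already exists, refusing to overwrite\n", output_path.c_str());
        return false;
    }
    return true;
}
```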
|
|
* Streamline a bit the quant strategies
No change over the existing patterns, except for the bump for attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA).
The rest is applying the existing patterns to the new IQ_K quants.
Also, a Q8_0 for attn_q had slipped into the 8-expert MoE rule; I removed it because that tensor is much bigger than attn_k or attn_v.
* remove <=8 experts condition.
|
|
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512 (the per-family layout is sketched after this commit message)
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
Whether it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
It was broken before the refactoring (the shifts were not correctly
applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
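The compile-time win comes from the layout rather than from any single kernel change: each quant family's GEMM code moves into its own translation unit behind a small registration hook. A hypothetical illustration of that shape (all names invented; trivial stub kernels so the sketch is self-contained):

```cpp
#include <cstddef>

// Shared, family-agnostic kernel table (what a common header would expose).
struct MulMatKernels {
    void (*gemm)(int m, int n, int k, const void * A, const void * B, float * C) = nullptr;
};

// In the refactored layout each hook below would live in its own .cpp
// (k-quants, i-quants, iqk quants, legacy, 1-bit, repacked, floats, ...),
// so no single file has to instantiate every template.
static void gemm_stub(int, int, int, const void *, const void *, float *) {}

bool iqk_set_kernels_kquants(int type, MulMatKernels & out) {
    (void)type; out.gemm = gemm_stub;   // real code: pick the kernel for this k-quant type
    return true;
}
bool iqk_set_kernels_iquants(int type, MulMatKernels & out) {
    (void)type; (void)out;
    return false;                       // real code: true only for i-quant types
}

// The dispatcher sees only these small entry points, never the template bodies.
bool iqk_select_kernels(int type, MulMatKernels & out) {
    return iqk_set_kernels_kquants(type, out) || iqk_set_kernels_iquants(type, out);
}
```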
|
|
* Add __syncthreads() to the new FA kernel
* Clearing padding
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4: faster PP for iq4_ks and iq5_ks
* Zen4: faster PP for iq2_ks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks_r4: basics
* iq5_ks_r4: Zen4 works
* iq5_ks_r4: AVX2 works
* iq5_ks_r4: NEON
* Fix iq5_ks on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix IQ4_K on AVX2
* Fix IQ4_KS on AVX2
* Fix IQ5_K on AVX2
* Fix IQ6_K on AVX2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* MMQ for iq2_k
* This works
* MMQ for iq3_k
* MMQ for iq2_ks
* Fix iq2_ks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* MMQ for iq4_k: WIP (not working)
* MMQ for iq4_k: working now
* MMQ for iq5_k
* Cleanup
* MMQ for iq5_k: slightly faster
* MMQ for iq6_k
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fixing SER bugs
* Cleanup
* This seems to fix it.
* This seems to work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fixing SER bugs
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Better CPU FA performance for DeepSeek-Lite
* It must be like this
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Enable MLA-3 in crippled GGUFs: WIP
* Enable MLA-3 in crippled GGUFs: seems to work
* Add newly created tensors to model.tensors_by_name
Else they don't get run-time repacked.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|