Age | Commit message | Author |
|
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LLaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw less for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding forgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_m GEMM on NEON
* Set repacking threshold
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experiments for 2.6875 bpw quants
At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL
Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE (a sketch of this scale search
is at the end of this entry).
* iq2_kl: MMQ
Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ
We get TG-128(L3-8B) = 162 t/s,
which is not quite as good as it should be, since q2_K at
(almost) the same bpw is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV
Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON
The compiler started crashing!!!
* iq2_kl: NEON
Had to work around a compiler crash when using vzip2q_u8
by using vqtbl2q_u8 instead.
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 476 t/s up from 466 t/s.
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 492 t/s up from 476 t/s.
* Add iq2_kl to constants.py
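A minimal sketch of the block-scale search from the "small improvement in PPL"
item above: besides the initial candidate, the two neighbouring integer scales
are also tried and the one with the smallest squared error is kept. The clamped
grid below is a stand-in quantizer, not the actual iq2_kl codebook.
```cpp
#include <algorithm>
#include <cmath>
#include <initializer_list>

static float block_sq_error(const float* x, int n, float scale) {
    float err = 0.f;
    for (int j = 0; j < n; ++j) {
        // stand-in quantizer: round x/scale to the nearest point of a small clamped grid
        float r  = scale != 0.f ? std::clamp(x[j] / scale, -2.0f, 1.0f) : 0.f;
        float dx = x[j] - scale * std::lround(r);
        err += dx * dx;
    }
    return err;
}

// try the candidate integer block scale s0 and its two neighbours, keep the best
int refine_block_scale(const float* x, int n, float d, int s0) {
    int   best_s   = s0;
    float best_err = block_sq_error(x, n, d * s0);
    for (int s : {s0 - 1, s0 + 1}) {
        float err = block_sq_error(x, n, d * s);
        if (err < best_err) { best_err = err; best_s = s; }
    }
    return best_s;
}
```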
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_ks: basics
* iq3_ks: CUDA dequantize
* iq3_ks: CUDA mmvq
* iq3_ks: mmq
* iq3_ks: faster mmq
* iq3_ks: Zen4
* iq3_ks: AVX2 convert to q8_k_r8
This gives us PP-512 = 360 t/s.
* iq3_ks: AVX2 GEMM/GEMV
* iq3_ks: NEON GEMM/GEMV
* iq3_ks: NEON convert to q8_k_r8
This gives us PP-512 = 164 t/s.
* iq3_ks: Metal dequantize
* iq3_ks: Metal gemv - pathetic performance
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s
66.3 t/s -> 168.8 t/s.
* iq1_m
19 t/s -> 163 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_xxs
55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.
* iq2_xs
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).
* q5_k
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Faster GEMM for iq2_ks, iq4_ks
* iq5_ks
63.8 t/s -> 166 t/s. iq5_ks_r4 is at 107.4 t/s.
But: iq5_ks_r4 TG performance is quite a bit better:
21.7 t/s vs 17.7 t/s for iq5_ks.
* iq6_k
44 t/s -> 164.3 t/s. There is no iq6_k_r4
* iq5_k
46 t/s -> 167 t/s. iq5_k_r4 is at 99.5 t/s.
* iq4_k
46.4 -> 167.2 t/s. iq4_k_r4 is at 115 t/s.
* iq3_k
47.3 t/s -> 166.5 t/s. iq3_k_r4 is at 96.5 t/s.
* iq2_k
47.4 t/s -> 167 t/s. iq2_k_r4 is at 113.3 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
* iq2_kt and iq3_kt work with new int trellis
Much slower than the fp16-based trellis. I guess Apple doesn't
have int8_t SIMD on the M2-Max GPU.
* q4_0
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
* q5_0
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
* q6_0
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
* q8_0
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
* iq4_nl
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
* q4_1
74.4 -> 115.4 t/s. There is no repacked variant
* q5_1
64.2 t/s -> 114.9 t/s. There is no repacked variant.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* This seems slightly faster for IQ2_KT, IQ3_KT TG
* This looks better for iq4_kt TG
* WIP
* Cleanup
* With fancy simd also set func16
* Enable next_128() also on AVX2
Despite having just 16 vector registers it is still faster.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adapt iq3_kt to new trellis on NEON
* iq3_kt is now working on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Closes #538
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Removes errant ";" in front of 0xCBAC1FED in non-x86 code
```
error: expected primary-expression before ';' token
constexpr static uint32_t ka = ;0xCBAC1FED;
^
error: expected unqualified-id before numeric constant
constexpr static uint32_t ka = ;0xCBAC1FED;
^
```
|
|
|
|
* New iq4_kt trellis
The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126
(a sketch of this generator is at the end of this entry).
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par with, or even slightly lower than, the original QTIP trellis.
* Something is not working with the AVX2 dot product
* New iq4_kt: CUDA MMVQ
* New iq4_kt: CUDA MMQ
* For now have only iq4_kt use the new trellis
* Fix iq2_kt that got broken along the way
* New iq4_kt: AVX2 dot product finally works
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.
* New iq4_kt: fix vanilla AVX2
* New iq4_kt: NEON implementation
We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
* New iq4_kt: slightly faster NEON
* New iq4_kt: slightly faster NEON
* New iq4_kt: faster NEON
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
* Minor
* New iq4_kt trellis: not working Metal implementation
* Remove the extra 4 bytes of row meta data that is no longer used
* Cleanup
* Adding forgotten file
* Switching iq2_kt to new trellis - CUDA MMQ
* New iq2_kt: CUDA GEMV
* New iq2_kt: AVX2 dequantize
* New iq2_kt: AVX2 GEMM/GEMV
* Adding forgotten file
* New iq2_kt: NEON GEMM/GEMV
* New iq2_kt: slightly faster NEON GEMM
* New iq2_kt: Metal - very slow.
It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.
* Add missing break
* Trying @louiehelm's multiplier
* CPU
* iq3_kt: use integer trellis + CUDA dequantize and MMVQ
* iq3_kt: MMQ
* iq3_kt: AVX2 GEMM
* iq3_kt: AVX2 GEMV
* The trellis quants now need super-blocks of 256, so we need a check
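A sketch of the integer trellis generator from the first item of this entry,
as described there: the four masked bytes of ka*idx + kb are summed and shifted
into int8_t range. The multiplier 0xCBAC1FED also appears in the semicolon-fix
entry further up this log; kb is a placeholder, not a value taken from the code.
```cpp
#include <cstdint>

// Sketch of sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126 from the first item above.
static inline int8_t trellis_value(uint32_t idx) {
    constexpr uint32_t ka = 0xCBAC1FED;          // multiplier seen elsewhere in this log
    constexpr uint32_t kb = 0;                   // placeholder, not taken from the code
    uint32_t x = (ka * idx + kb) & 0x3f3f3f3f;   // keep the low 6 bits of each of the 4 bytes
    int sum = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + ((x >> 24) & 0xff);
    return (int8_t)(sum - 126);                  // 4 values in [0, 63] summed, centered -> [-126, 126]
}
```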
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Repack q4_0 and q8_0 to q8_0_R8
q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.
* Change q8_2_x4 to store int16_t sums
With that q4_0 now works (a sketch of the idea is at the end of this entry).
I need to check all quants that use q8_2_x4!
* q5_0 and use a dequantizing template
* q6_0
129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.
* iq4_nl
137 t/s -> 293 t/s. iq4_nl is at 251 t/s.
* q4_1: 135 t/s -> 262 t/s
* q5_1: 125 t/s -> 253 t/s
* iq4_xs
178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.
* q2_K
202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.
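A hypothetical layout illustrating the int16_t-sums change above; the field
names and packing are assumptions, not the actual q8_2_x4 struct. The point is
that the per-block sum of 32 int8 quants fits exactly in an int16_t, so it no
longer has to round-trip through f16.
```cpp
#include <cstdint>

// Hypothetical sketch only, not the real block_q8_2_x4 layout.
struct block_q8_2_x4_sketch {
    uint16_t d[4];       // per-block scale bits (f16/bf16)
    int16_t  s[4];       // per-block sum of the 32 int8 quants: |sum| <= 32*127 = 4064, exact as int16_t
    int8_t   qs[4 * 32]; // the quantized activations
};
static_assert(sizeof(block_q8_2_x4_sketch) == 4*2 + 4*2 + 4*32, "no padding expected");
```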
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_ks
203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.
* iq4_k
175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.
PPL is actually lower!
* iq5_ks
180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.
PPL is actually lower - 7.4160 vs 7.4494 for LLaMA-3.1-8B-Instruct
* iq5_k - accuracy loss is too big
* iq5_k - there was a bug with the shifts
...and that's why PPL was so high. It is also high on main.
This fixes it.
* iq6_k
148 t/s -> 350 t/s. There is no iq6_k_r4
PPL is actually lower because we have a bug in the existing
implementation!
* iq3_k
169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.
* iq2_k
190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.
* iq2_ks
200 t/s -> 367 t/s. There is no iq2_ks_r4.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q6_K dequantizing GEMM
* Much easier: just use different vec_dot types!
* WIP
* Finally q6_K x q8_2_x4 dot product works
* Very slightly better
* We don't need the changes in ggml.c
* Fix AVX2
* iq2_xs
* Fix AVX2
* iq2_s
* q3_K
* Fix q8_k_r8 on Zen4
* q3_K: repack to q8_k_r8 instead of q8_0_r8
With that we hit 360 t/s for LLaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM (see the check at the end of this entry).
* q3_K: don't scale when all quants in a block are <= 127 when repacking
* iq2_s: repack to q8_k_r8 instead of q8_0_r8
* iq2_xs: repack to q8_k_r8
* WIP
* iq2_xs: repack to q8_k_r8
* iq3_xxs: repack to q8_k_r8
* iq3_s: use q8_k_r8
* iq1_s: repack to q8_k_r8
* iq1_m: repack to q8_k_r8
* iq1_m: slightly faster
* Slightly faster
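A quick check of the ~7% figure quoted in the q8_k_r8 repack item above (at
fixed work, the throughput ratio inverts to a time ratio):

$$\frac{t_{\text{q3\_K via repack}}}{t_{\text{pure q8\_k\_r8 GEMM}}} = \frac{386\ \text{t/s}}{360\ \text{t/s}} \approx 1.07,$$

i.e. repacking adds roughly 7% on top of the q8_k_r8 GEMM time for a batch of 512.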
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q4_K: dequantize to q8_1_r8 for batch >= 32
We get 268 t/s, up from 186 t/s (a sketch of this dispatch is at the end of this entry).
* q4_K: GEMM with q8_2_X4
* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8
* Remove the scales, they are not needed
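A minimal sketch of the dispatch described in the first item of this entry; the
threshold and the q8_1_r8 target come from the commit text, while the function
names are placeholders.
```cpp
// Placeholder kernels standing in for the actual ik_llama.cpp entry points.
static void repack_q4_K_to_q8_1_r8() { /* dequantize + repack the weight rows once */ }
static void gemm_q8_1_r8(int n_batch) { (void)n_batch; /* fast GEMM over the repacked rows */ }
static void gemv_q4_K(int n_batch)    { (void)n_batch; /* direct q4_K dot products */ }

void mul_mat_q4_K(int n_batch) {
    constexpr int kBatchThreshold = 32;    // "for batch >= 32" in the item above
    if (n_batch >= kBatchThreshold) {
        repack_q4_K_to_q8_1_r8();          // one-time cost, amortized over the whole batch
        gemm_q8_1_r8(n_batch);
    } else {
        gemv_q4_K(n_batch);                // repacking would not pay off for small batches
    }
}
```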
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
TG is slightly faster too - 24.4 vs 23.1 t/s on the
Ryzen-5975WX
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Much faster iq2_xxs GEMM
PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.
* iq2_xxs: q8_2_x4 GEMM
* iq2_xxs: use template for q8_2_x4 GEMM
* Fix AVX2
* Cleanup
* NEON is not working yet, so still use Q8_K GEMM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* cmake: force MSVC compiler charset to utf-8
* build: apply MSVC /bigobj option to c/cpp files only
* Update CMakeLists.txt
* Fix Compile error (C2668)
* revert hsum_float_8x8
|
|
* Also do the dequantize approach for mul_mat_id
* Also do the dequantize approach for iqk_moe_fused_up_gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds (a numeric illustration is at the end of this entry).
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
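A numeric illustration of the overflow mentioned in the f16 NEON item above;
the magnitudes are made up, only the ordering argument (apply the small row
scale before the multiply-adds) comes from the commit.
```cpp
#include <cstdio>

int main() {
    const float fp16_max = 65504.0f;               // largest finite fp16 value
    // hypothetical partial sum of one block: 32 products of magnitude 126 * 127
    float raw_sum    = 32.0f * 126.0f * 127.0f;    // 512064 -> +inf if accumulated in fp16
    float row_scale  = 0.01f;                      // a typical small per-row scale (assumed)
    float scaled_sum = 32.0f * (126.0f * row_scale) * 127.0f;  // ~5120.6 -> fits in fp16
    std::printf("raw = %.0f (fp16 max %.0f), scaled first = %.1f\n", raw_sum, fp16_max, scaled_sum);
    return 0;
}
```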
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_kt: NEON implementation
* iq3_kt: NEON implementation
* iq4_kt: not working NEON implementation
* iq4_kt: NEON implementation
Have to use f32 arithmetic, else I get gibberish?
Correspondingly ridiculously slow.
* Cleanup
* iq4_kt: slightly faster TG on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Somewhat faster iq3_kt (AVX2)
* Cleanup
* Slightly faster iq4_kt
* Slightly faster iq4_kt
PP is now almost 50% better than original, TG is ~20% better
* Cleanup
* Very slightly faster iq4_kt TG
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix MSVC compilation
* MSVC cannot capture constexpr in lambdas
* Arghhh
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP
* WIP
* WIP
* Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (a sketch of this is at the end of this entry).
* iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
* iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).
My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
* iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
* Minor
* Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B, 4096) = 6.3913
Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B, 4096) = 6.3825
Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8 bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17t/s eval)
* flatten 3inst iters + avx2 (0.3t/s eval)
* iq3_kt (0.3t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slighty faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
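A sketch of the clustered nearest-point search described in the "try larger
blocks" item above, also reflecting the later note about copying each cluster's
points to contiguous memory: a query first picks the nearest centroid, then
searches only that cluster's points. How the clusters are built and the group
size of 8 are assumptions here.
```cpp
#include <algorithm>
#include <array>
#include <cfloat>
#include <cstddef>
#include <vector>

constexpr int kGroup = 8;                      // group-of-8 weights per code, as in the items above
using Point = std::array<float, kGroup>;

static float dist2(const Point& a, const Point& b) {
    float d = 0.f;
    for (int i = 0; i < kGroup; ++i) { float t = a[i] - b[i]; d += t * t; }
    return d;
}

struct ClusteredCodebook {
    std::vector<Point>              centroids;  // one per cluster
    std::vector<std::vector<Point>> members;    // each cluster's points, stored contiguously

    // find the best codebook point for x by searching only the nearest cluster
    float nearest(const Point& x) const {
        std::size_t best_c = 0; float best_cd = FLT_MAX;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            float d = dist2(x, centroids[c]);
            if (d < best_cd) { best_cd = d; best_c = c; }
        }
        float best = FLT_MAX;
        for (const Point& p : members[best_c]) best = std::min(best, dist2(x, p));
        return best;                            // squared distance of the best match found
    }
};
```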
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
Whether it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
It was broken before the refactoring (the shifts were not correctly
applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4: faster PP for iq4_ks and iq5_ks
* Zen4: faster PP for iq2_ks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks_r4: basics
* iq5_ks_r4: Zen4 works
* iq5_ks_r4: AVX2 works
* iq5_ks_r4: NEON
* Fix iq5_ks on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix IQ4_K on AVX2
* Fix IQ4_KS on AVX2
* Fix IQ5_K on AVX2
* Fix IQ6_K on AVX2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fixing SER bugs
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Better CPU FA performance for DeepSeek-Lite
* It must be like this
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Another attempt to fix #367
* Yet another
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|