|
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LLaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding forgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experiments for 2.6875 bpw quants
At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL
Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.
* iq2_kl: MMQ
Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ
We get PP-128(L3-8B) = 162 t/s,
which means this is not quite as good as it should be, as
q2_K at (almost) the same bpw is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV
Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON
The compiler started crashing!!!
* iq2_kl: NEON
Had to work around a compiler crash when using vzip2q_u8 by
using vqtbl2q_u8 instead.
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 476 t/s up from 466 t/s.
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 492 t/s up from 476 t/s.
* Add iq2_kl to constants.py
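For reference, a minimal sketch of the block-scale refinement from the
"iq2_kl: small improvement in PPL" bullet above. The function shape, the
names, and the weighted-error criterion are illustrative assumptions, not
the actual quantization code:
```cpp
// Illustrative sketch only: given an initial integer block scale ls0, also try
// its two neighbours and keep the one with the smallest weighted squared error.
// x: original weights, q: chosen codebook values, w: importance weights,
// d: super-block scale. All names here are assumptions.
#include <cfloat>

static int best_block_scale(const float * x, const float * q, const float * w,
                            int n, float d, int ls0) {
    int   best_ls  = ls0;
    float best_err = FLT_MAX;
    for (int ls = ls0 - 1; ls <= ls0 + 1; ++ls) {
        float err = 0;
        for (int i = 0; i < n; ++i) {
            const float diff = x[i] - d*ls*q[i];
            err += w[i]*diff*diff;
        }
        if (err < best_err) { best_err = err; best_ls = ls; }
    }
    return best_ls;
}
```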
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_ks: basics
* iq3_ks: CUDA dequantize
* iq3_ks: CUDA mmvq
* iq3_ks: mmq
* iq3_ks: faster mmq
* iq3_ks: Zen4
* iq3_ks: AVX2 convert to q8_k_r8
This gives us PP-512 = 360 t/s.
* iq3_ks: AVX2 GEMM/GEMV
* iq3_ks: NEON GEMM/GEMV
* iq3_ks: NEON convert to q8_k_r8
This gives us PP-512 = 164 t/s.
* iq3_ks: Metal dequantize
* iq3_ks: Metal gemv - pathetic performance
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* New iq4_kt trellis
The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126
(see the sketch at the end of this message).
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par with, or even slightly lower than, the original QTIP trellis.
* Something is not working with the AVX2 dot product
* New iq4_kt: CUDA MMVQ
* New iq4_kt: CUDA MMQ
* For now have only iq4_kt use the new trellis
* Fix iq2_kt that got broken along the way
* New iq4_kt: AVX2 dot product finally works
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.
* New iq4_kt: fix vanilla AVX2
* New iq4_kt: NEON implementation
We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
* New iq4_kt: slightly faster NEON
* New iq4_kt: slightly faster NEON
* New iq4_kt: faster NEON
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
* Minor
* New iq4_kt trellis: not working Metal implementation
* Remove the extra 4 bytes of row meta data that is no longer used
* Cleanup
* Adding forgotten file
* Switching iq2_kt to new trellis - CUDA MMQ
* New iq2_kt: CUDA GEMV
* New iq2_kt: AVX2 dequantize
* New iq2_kt: AVX2 GEMM/GEMV
* Adding forgotten file
* New iq2_kt: NEON GEMM/GEMV
* New iq2_kt: slightly faster NEON GEMM
* New iq2_kt: Metal - very slow.
It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.
* Add missing break
* Trying @louiehelm's multiplier
* CPU
* iq3_kt: use integer trellis + CUDA dequantize and MMVQ
* iq3_kt: MMQ
* iq3_kt: AVX2 GEMM
* iq3_kt: AVX2 GEMV
* The trellis quants now need super-blocks of 256, so we need a check
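A minimal sketch of the integer trellis generator quoted in the first
bullet; ka and kb are whatever multiplier/offset the implementation uses
and are treated as opaque parameters here:
```cpp
// Sketch of the integer trellis described above: mask each byte of
// ka*idx + kb to 6 bits, sum the four bytes, and center the result around zero.
#include <cstdint>

static inline int8_t trellis_value(uint32_t idx, uint32_t ka, uint32_t kb) {
    const uint32_t x = (ka*idx + kb) & 0x3f3f3f3fu;          // 4 bytes, each in 0..63
    const int sum = (x & 0xff) + ((x >> 8) & 0xff)
                  + ((x >> 16) & 0xff) + ((x >> 24) & 0xff); // 0..252
    return (int8_t)(sum - 126);                              // roughly [-126, 126]
}
```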
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Repack q4_0 and q8_0 to q8_0_R8
q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.
* Change q8_2_x4 to store int16_t sums
With that q4_0 now works.
I need to check all quants that use q8_2_x4!
* q5_0 and use a dequantizing template
* q6_0
129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.
* iq4_nl
137 t/s -> 293 t/s. iq4_nl is at 251 t/s.
* q4_1: 135 t/s -> 262 t/s
* q5_1: 125 t/s -> 253 t/s
* iq4_xs
178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.
* q2_K
202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.
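To illustrate the q8_2_x4 fix above, a hedged sketch of the idea; the real
ggml block layout may differ:
```cpp
// Illustration only - not the real ggml layout. Keeping the block sum as an
// exact int16_t avoids the f32 <-> f16 round trips that hurt q4_0 above:
// 32 quants in [-127,127] sum to at most 4064, which fits int16_t exactly.
#include <cstdint>

struct block_q8_sketch {
    uint16_t d;       // block scale (f16/bf16 bits)
    int16_t  sum;     // exact integer sum of the 32 quants
    int8_t   qs[32];  // quantized activations
};
```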
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q6_K dequantizing GEMM
* Much easier: just use different vec_dot types!
* WIP
* Finally q6_K x q8_2_x4 dot product works
* Very slightly better
* We don't need the changes in ggml.c
* Fix AVX2
* iq2_xs
* Fix AVX2
* iq2_s
* q3_K
* Fix q8_k_r8 on Zen4
* q3_K: repack to q8_k_r8 instead of q8_0_r8
With that we hit 360 t/s for LLaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.
* q3_K: don't scale when all quants in a block are <= 127 when repacking
* iq2_s: repack to q8_k_r8 instead of q8_0_r8
* iq2_xs: repack to q8_k_r8
* WIP
* iq2_xs: repack to q8_k_r8
* iq3_xxs: repack to q8_k_r8
* iq3_s: use q8_k_r8
* iq1_s: repack to q8_k_r8
* iq1_m: repack to q8_k_r8
* iq1_m: slightly faster
* Slightly faster
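The 360 vs 386 t/s numbers above imply 386/360 - 1 ≈ 7% extra time, which
is where the ~7% repacking cost comes from. For the "don't scale when all
quants are <= 127" bullet, a small illustrative check (names are mine, not
the actual repack code):
```cpp
// If the integer values of a block already fit into int8_t, they can be copied
// into the q8_k_r8 block verbatim and the factor folded into the block scale.
#include <cstdlib>

static bool fits_int8(const int * v, int n) {
    int amax = 0;
    for (int i = 0; i < n; ++i) {
        const int a = std::abs(v[i]);
        if (a > amax) amax = a;
    }
    return amax <= 127;
}
```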
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* cmake: force MSVC compiler charset to utf-8
* build: apply MSVC /bigobj option to c/cpp files only
* Update CMakeLists.txt
* Fix Compile error (C2668)
* revert hsum_float_8x8
|
|
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying by the row scale
before doing the multiply-adds.
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP
* WIP
* WIP
* Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By SIMDifying the search for the
best code with AVX2, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so it does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8, the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points into clusters, find the nearest cluster,
and only search within that cluster (see the sketch at the
end of this message).
* iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
* iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).
My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
* iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
* Minor
* Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
Nearly 60% improvement in quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
* iq2_kt: speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B, 4096) = 6.3913
Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B, 4096) = 6.3825
Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8 bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17t/s eval)
* flatten 3inst iters + avx2 (0.3t/s eval)
* iq3_kt (0.3t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slighty faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
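A minimal sketch of the clustered search from the "WIP - try larger blocks"
bullet, with the contiguous per-cluster point layout from the later
quantization speed-up commits; cluster construction, sizes, and all names
are assumptions:
```cpp
#include <cfloat>
#include <cstdint>
#include <vector>

struct TrellisClusters {
    int ncluster, npoints_per_cluster;
    std::vector<float>    centroids;  // ncluster x 8
    std::vector<float>    points;     // ncluster x npoints_per_cluster x 8, contiguous
    std::vector<uint32_t> index;      // trellis index of each stored point
};

// Find the trellis index whose 8 generated values are closest to x[0..7].
static uint32_t find_best_index(const TrellisClusters & c, const float * x) {
    int best_c = 0; float best_d = FLT_MAX;
    for (int j = 0; j < c.ncluster; ++j) {             // 1) nearest centroid
        const float * cj = c.centroids.data() + 8*j;
        float d = 0;
        for (int k = 0; k < 8; ++k) { float t = x[k] - cj[k]; d += t*t; }
        if (d < best_d) { best_d = d; best_c = j; }
    }
    best_d = FLT_MAX; uint32_t best_i = 0;
    const int n0 = best_c*c.npoints_per_cluster;       // 2) scan only that cluster
    for (int i = 0; i < c.npoints_per_cluster; ++i) {
        const float * p = c.points.data() + 8*(n0 + i);
        float d = 0;
        for (int k = 0; k < 8; ++k) { float t = x[k] - p[k]; d += t*t; }
        if (d < best_d) { best_d = d; best_i = c.index[n0 + i]; }
    }
    return best_i;
}
```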
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks_r4: basics
* iq5_ks_r4: Zen4 works
* iq5_ks_r4: AVX2 works
* iq5_ks_r4: NEON
* Fix iq5_ks on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Another attempt to fix #367
* Yet another
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_k: slightly better quantization
Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).
* Small improvement for type-1 quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and the compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller compute buffer size
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-time repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
removed most of it when implementing the quants with per row scale,
but it was still lurking in ggml_copy. Not sure if these were the last
remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1d K cache assumption. Create and manage
the K-cache as a 2D tensor so we can have per row meta data as needed
by q8_KV.
Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s).
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
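A hedged illustration of the row-size point in the K-cache bullet above:
with per-row meta data the byte size of a row is no longer
n/block_size*type_size. The exact q8_KV row layout below is an assumption:
```cpp
// Assumed layout for illustration: per-row f32 scale + i32 row sum, then n int8 quants.
#include <cstddef>
#include <cstdint>

static size_t q8_kv_row_size_sketch(int64_t n) {
    return sizeof(float) + sizeof(int32_t) + (size_t)n*sizeof(int8_t);
}
```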
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4)
* iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4)
* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon)
* iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon)
* Simdify q8_K128 quantization also on Neon
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving
* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving
* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_m_r4: basics (quantize/dequantize)
* iq1_m_r4: Zen4 gemm
* iq1_m_r4: neon gemm
* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4
With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.
* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s_r4: basics - quantize/dequantize
* iq1_s_r4: gemm/gemv works on AVX2/Zen4
* Don't forget to make sure we have a multiple of 4 rows per thread
* iq1_s_r4: this is better
* iq1_s_r4: fix Zen4 after AVX2 changes
* iq1_s_r4: NEON gemm/gemv
* iq1_s_r4: more bits for shared experts
With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.
On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.
* Forgotten counter increment
* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv
* Compiler warnings
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).
* Try interleaving 8 iq4_xs rows
It is also faster on AVX2.
This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.
* Cleanup
* 8-rows interleaved q8_0 (AVX2)
* 8-rows interleaved q8_0 (Zen4)
* 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.
* 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.
* FA: repack Q8_0 to Q8_0_R8
* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)
* FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.
* q4_0_r8 (AVX2)
* q4_0_r8 (NEON)
Tiny bit faster PP (~128 vs ~126 t/s), same TG.
* q4_0_r8 (Zen4)
Somehow only marginally faster?
268 t/s vs 261 t/s
* q4_0_r8 (Zen4) - slightly better
282 t/s for a pure q4_0 L3-8B quantization.
* Apply platform specific modifications when repacking
E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0.
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
bool argument indicating if the repacking is done online and,
if so, apply modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply modifications to models that
have already been repacked while loading the model. Caveat:
just like rtr, this needs to have mmap disabled (else one would
need to move the data to a not mmap-ed buffer, so much more
complicated).
* Apply platform specific modifications when repacking
On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned thus avoiding these operations in matrix
multiplications. With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using q8_0 KV-cache.
* Process up to 16 columns per kernel call for q8_k_r8
This brings PP-512 up to 389 t/s.
* Be able to load Deepseek-v2-Lite
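A small sketch of the NEON modification from "Apply platform specific
modifications when repacking". Whether the actual code XORs in place like
this is an assumption, but the bit trick itself is standard: flipping bit 3
of each nibble turns the stored unsigned 0..15 values into a signed (q - 8)
encoding, so the GEMM no longer has to subtract 8:
```cpp
// Pre-apply the q4_0 offset while repacking (NEON path), as described above.
// Each byte holds two 4-bit quants; XOR-ing with 0x88 flips bit 3 of both nibbles.
#include <cstddef>
#include <cstdint>

static void preapply_q4_0_offset(uint8_t * qs, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; ++i) qs[i] ^= 0x88;
}
```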
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
(#174)
This massively improves performance. As this is opt-in, we do not worry
about possible precision loss in the f16 -> bf16 conversion.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Slightly faster FA for bf16 KV cache
~2-3% sort of thing. Sadly, when we go beyond 8k tokens, the
advantage kind of goes away.
* Slightly faster FA for Q8_0 KV cache
* FA: allow bf16 for V-cache with any supported K-cache
E.g., -ctk q8_0 -ctv bf16 is slightly faster than
-ctk q8_0 -ctv q8_0 on Zen4 for not too long context lengths
(say, <= 4096).
* FA: much better bf16 kv-cache speed for large contexts
We now hit 122 t/s for LLaMA-3.1-8B (quantized as iq4_xs and
run-time-repacked) with a context of 32768. IIRC, the previous
best for such large context was ~90 t/s.
Non-negligible improvement at 16384 and 8192 as well:
173.4 and 214 t/s.
* FA: slightly better quantized kv-cache speed for large contexts
E.g., for q8_0 and context of 32768, we are now at 113 t/s
for LLaMA-3.1-8B.
Also simplified the quantized K*Q multiplication.
* Fix q8_0 KV cache when not using FA - WIP (AVX2)
1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use
those to quantize activations for quants that use Q8_0 or Q8_1
as their vec_dot type.
2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1
3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type
4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than
GGML_TYPE_Q8_0_X4 as the K and V types
5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4
in iqk_mul_mat
Also added an optimization in ggml_compute_forward_mul_mat when
ne12*ne13 > 1 (K*Q and V*softmax(K*Q)) to process
ne12*ne13/GCD(ne12*ne13, nthread) heads per thread group, using
nthread/GCD(ne12*ne13, nthread) threads per head. This results in
a non-negligible performance gain for large contexts (see the
sketch at the end of this message).
Question: why is it not allowed to use quantized V-cache when
not using FA?
* Fix q8_0 KV cache when not using FA - NEON
* Fix AVX2
Again the issue with _mm256_maddubs_epi16 overflowing that I
keep forgetting.
* FA: don't use large Q steps on AVX2 for fp16 K-cache
* On Zen4 it is also better to not use large Q steps for fp16 K-cache
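The sketch referenced above, showing how I read the GCD-based head/thread
split (struct and names are mine, not the actual ggml code):
```cpp
// With g = gcd(n_heads, nthread): g independent groups of nthread/g threads,
// each group walking n_heads/g heads sequentially, so all threads stay busy.
#include <numeric>

struct HeadSplit {
    int n_groups;         // independent thread groups working in parallel
    int threads_per_head; // threads cooperating on a single head
    int heads_per_group;  // heads each group processes one after another
};

static HeadSplit split_heads(int ne12, int ne13, int nthread) {
    const int n_heads = ne12*ne13;
    const int g = std::gcd(n_heads, nthread);
    return { g, nthread/g, n_heads/g };
}
```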
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Add Falcon3 pre-tokenizer (same as llama3)
* q8_k16: use integer arithmetic to sum row values
The existing implementation that just sums up the f32 quantizations
works fine for the original BitNet models and also for the TriLM
ternary models. But for Falcon3 I see a significant difference between
the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum
up the values in a row, then the CPU-GPU PPL difference becomes much
smaller, and we get a lower PPL than Microsoft BitNet, which claims
to be "lossless".
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_s_r4: WIP
* iq3_s_r4: Zen4
* iq3_s_r4: slightly better Zen4
* iq3_s_r4: AVX2
* iq3_s_r4: NEON
* iq3_s_r4: rearrange quants
* iq3_s_r4: rearranged quants - AVX2
* iq3_s_r4: rearranged quants - NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Closes #160
* MSVC fixes
* One more
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_s_r4: Zen4
* Minor
* iq2_s_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_xs_r4: Zen4
* iq2_xs_r4: AVX2
* iq2_xs_r4: slightly better matrix x vector on AVX2
* iq2_xs_r4: NEON - not much better than iq2_xs
* iq2_xs_r4: slightly better NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_xxs_r4: Zen4
Disappointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread
* Minor
* iq2_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_xxs_r4: 1st shot on Zen4
PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s
* iq3_xxs_r4: WIP
* iq3_xxs_r4: 1st shot at AVX2
Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.
* iq3_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_ks_r4: Zen4
* iq4_ks_r4: AVX2
* iq4_ks_r4: WIP
* iq4_ks_r4: slightly better Zen4
* iq4_ks_r4: slightly better Zen4
* iq4_ks_r4: NEON
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq5_k_r4: Zen4
Much slower than the others.
* iq5_k_r5: WIP
* Minor
* iq5_k_r4: fix AVX2 nrc_y = 1 case
* iq5_k_r4: better Zen4
But TG is still slower than iq5_k
* iq5_k_r4: slightly better AVX2
* iq5_k_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Be able to repack tensors at run time
* Repack: also add bf16 as repackable type
* Repack: make sure number of rows is a multiple of the packing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq2_k_r4: Zen4
* iq2_k_r4: NEON
* iq2_k_r4: better matrix x vector multiplication on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq3_k_r4 WIP
* iq3_k_r4: Zen4
* iq3_k_r4: AVX2
* iq3_k_r4: NEON
* iq3_k_r4: faster matrix x vector multiplication on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Not working bf16_r4
* Adding bf16_r8
Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess this is still sub-optimal.
* bf16_rx: Very slightly faster by interleaving 16 rows
258 t/s -> 263 t/s
* Rename bf16_r4 to bf16_r16
We are interleaving 16 rows now.
* Cleanup unused stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q8_k_r8: fastest matrix multiplication known to human kind
We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!
* q8_k_r8: AVX2
I was worried that we don't have enough vector registrers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.
* q8_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.
* q8_k_r4: go to signed ints
Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
stay within the signed int range and use _mm256_sign_epi8.
Not yet tested on the AVX2 machine, but expect a major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorq_u8()
needed to convert from unsigned to signed seems to be extremely
slow on the M2-Max.
* We only lose ~0.5% in performance on Zen4 (there the exclusive
or that we now use to convert from signed to unsigned seems to be
much faster than on M2-Max)
* Shutup useless compiler warnings
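For reference, the standard AVX2 idiom the "go to signed ints" bullet
alludes to, written as a hedged sketch (this is the common llama.cpp-style
trick, not necessarily the exact code here):
```cpp
// _mm256_maddubs_epi16 multiplies unsigned x signed bytes; to keep signed int8
// inputs in range we move the sign of x onto y with _mm256_sign_epi8 first.
#include <immintrin.h>

static inline __m256i mul_sum_i8_pairs_sketch(__m256i x, __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x); // |x|, safe as the unsigned operand
    const __m256i sy = _mm256_sign_epi8(y, x); // y with the sign of x applied
    return _mm256_maddubs_epi16(ax, sy);       // pairwise u8*i8 products summed to i16
}
```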
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_k_r4: WIP
* iq4_k_r4: Zen4 and hopefully AVX2
On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6% performance penalty.
* iq4_k_r4: AVX2
PP-512 = 227.60 t/s. The shifts are really costly.
* iq4_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q2_k_r4: Zen4
PP-512(LLaMA-3.1-8B) = 256 t/s
* q3_k_r4: AVX2
* q2_k_r4: AVX2
We get PP-512(LLaMA-3.1-8B) = 287 t/s.
Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow
forgot to push upstream.
* q2_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 106.2 t/s.
TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.
* Make sure rows per thread are a multiple of 4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q3_k_r4: Zen4 works, but not as good as it should be
238 t/s, so slightly slower than q6_k_r4.
* q3_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 106.9 t/s.
This is 1.93X faster than q3_K_S!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q5_k_r4: WIP
* q5_k_r4: Zen4 and AVX2
We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4.
Q5_K_S has PP-512 = 190 t/s.
* q5_k_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 96.1 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q6_k_r4
* q6_k_r4: 1st functional AVX2 version
* q6_k_r4: AVX2 and simple Zen4
"Simple" as in processing 4 instead of 8 rows at once.
On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs
195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s
vs 5.38 t/s for Q6_K.
* q6_k_r4: 1st NEON version
PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K.
TG-128 is slightly lower than q6_K for a low number of threads,
and becomes very slightly better at 8 threads.
* q6_k_r4: slightly faster NEON
PP-512(LLaMA-3.1-8B) = 83.25 t/s
* q6_k_r4: slightly faster Zen4
238.3 t/s -> 243.2 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|