Age | Commit message | Author |
|
* POC: per row scale
This is a POC showing how to work around opinionated ggml in
order to have scales per row rather than per block.
Only implemented for Zen4 and only for iq2_tn.
* POC per row scale: iq2_tn on NEON
* POC per row scale: iq2_tn on Metal
* Per row scale Metal templates
* iq1_tn: shrink to 1.625 bpw (NEON and Metal)
* POC per row scale: CUDA
* POC per row scale: add CUDA TODOs
There are two places left in ggml-cuda.cu where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes (a sketch of the distinction follows this
commit). This does not affect simple usage, but it will lead to
issues when tensors are split between GPUs.
* Per row scales - CUDA
The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.
* Update IQ1_TN and IQ2_TN bpw shown to user
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
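A minimal sketch of the assumption flagged in the TODO note above. The names and
the per-row metadata layout are illustrative assumptions, not the actual ggml API:

    #include <cstddef>

    // the assumption still present in two places in ggml-cuda.cu:
    // row size = number of blocks times the block type size
    static size_t row_size_blocks_only(size_t n_per_row, size_t block_size, size_t type_size) {
        return n_per_row / block_size * type_size;
    }

    // assumed layout for a per-row-scale type: [ row scale ][ packed blocks ... ]
    // (scale type assumed to be a float here just for illustration)
    static size_t row_size_with_row_scale(size_t n_per_row, size_t block_size, size_t type_size) {
        return sizeof(float) + n_per_row / block_size * type_size;
    }

Any code that splits a tensor by rows must use the second form for such types,
which is why the remaining spots matter once tensors are split between GPUs.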
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* Fix C++ compilation warnings caused by ggml-common.h
* Disable c99-extensions warning
I get tons of those on macOS due to the arm_neon.h header.
* Disable c99-extensions warning only for APPLE
* Fix warnings in iqk_quantize.cpp
Also add GGML_ABORT when implementation is missing.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* BF16 support on Metal
* Faster BF16 Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
It looks like the ARMv8 ISA has bf16 support, but my M2 Max
does not have it, so I resort to bf16 -> f32 conversion and
do the computations in f32. This is 2x slower than f16, but 8x
better than what I otherwise get if I try to run a bf16 model
on the M2 (NEON and Metal).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
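A minimal sketch (not the repository's actual code) of the bf16 -> f32 path on
NEON without the bf16 extension: a bf16 value is just the high 16 bits of an f32,
so the conversion is a widen-and-shift, and the math then runs in f32:

    #include <arm_neon.h>

    static inline float32x4_t bf16_to_f32_lo(uint16x8_t v) {
        return vreinterpretq_f32_u32(vshll_n_u16(vget_low_u16(v), 16));
    }
    static inline float32x4_t bf16_to_f32_hi(uint16x8_t v) {
        return vreinterpretq_f32_u32(vshll_n_u16(vget_high_u16(v), 16));
    }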
|
|
|
|
* Adding bf16 support to CUDA - matrix multiplications
* Adding bf16 support to CUDA - cleanup
* Adapt to latest master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Some tweaks for i-quants
Improve Gemma2 PPL while reducing size
* Some tweaks for iq2_k and iq3_k
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1
* NEON Flash Attention: quantized K*Q for q4_0
I could finally take advantage of the matrix multiplication
templates. We get quite a bit of speedup that way for q4_0:
for Gemma-2b, using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
gives PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 K-cache and Q to fp16 and using fp16 multiplication
(a scalar illustration of the idea follows this commit).
* NEON Flash Attention: quantized K*Q for q4_1
* NEON Flash Attention: quantized K*Q for q8_0
This makes quite a bit of difference:
For Gemma2-2b PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting things to fp16 and using fp16
matrix multiplication.
We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the
performance of PP-512. In contrast, llama.cpp with Q8_0
cache is 38% of PP-512.
* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* Tidy up FlashMS
* Delete no longer used stuff
With the usage of quantized matrix multiplications for
quantized k- and/or v-cache, we no longer need the
helper methods loading entire rows.
* Disallow mixing bf16 with other types for kv caches
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
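A scalar illustration of the idea behind quantized K*Q, using q8_0 for both
operands for simplicity (for q4_0 the K blocks hold 4-bit nibbles plus a scale,
but the structure of the computation is the same). The block below mirrors
q8_0's 32 weights per block, with the scale kept as a plain float for clarity:

    #include <cstdint>

    struct block_q8_0_sketch { float d; int8_t qs[32]; };  // simplified; real q8_0 stores d as f16

    // dot product of two quantized rows: integer MACs within a block and one float
    // multiply per block, instead of converting both operands to f16 and doing an
    // f16 matrix multiplication
    static float dot_q8_0(const block_q8_0_sketch * x, const block_q8_0_sketch * y, int nblocks) {
        float sum = 0.f;
        for (int ib = 0; ib < nblocks; ++ib) {
            int isum = 0;
            for (int j = 0; j < 32; ++j) isum += x[ib].qs[j] * y[ib].qs[j];
            sum += x[ib].d * y[ib].d * (float)isum;
        }
        return sum;
    }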
|
|
* AVX2 Flash Attention: add ability to use Q8_0 for kv-cache
* AVX2 Flash Attention: add ability to use Q4_0 for kv-cache
* AVX2 Flash Attention: add ability to use Q4_1 for kv-cache
* Fix Zen4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* NEON Flash Attention - first working version
Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations of expf (and tanh when
softcap is enabled), so we need to convert back and forth
to f32 (see the sketch after this commit).
FA is slightly faster than no-FA for the 4B TriLM model,
but slightly slower for Gemma-2b.
* NEON Flash Attention - convert Q to f16 before computing Q*K
* NEON Flash Attention - use fp32 for K*Q operations
Else I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).
* Delete commented out stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
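A minimal sketch of that back-and-forth, with assumed names; the scalar exp per
lane is only there to show the conversions, the actual code uses a vectorized
f32 expf:

    #include <arm_neon.h>
    #include <cmath>

    static float16x4_t exp_f16(float16x4_t x) {
        float32x4_t xf = vcvt_f32_f16(x);       // widen f16 -> f32: no fast f16 expf available
        float v[4];
        vst1q_f32(v, xf);
        for (int i = 0; i < 4; ++i) v[i] = std::exp(v[i]);
        return vcvt_f16_f32(vld1q_f32(v));      // narrow back to f16 for the accumulators
    }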
|
|
* First version of AVX2 Flash attention
I simply took the Zen4 implementation and converted the
platform-specific parts into methods of a struct providing
data loading/storing, conversions, multiply, add, etc.
(a rough illustration follows this commit).
Most likely not optimal, as the Zen4 strategy was designed
around having 32 512-bit registers, so basically
we can keep 4x more data in vector registers compared
to AVX2 with its 16 x 256-bit registers.
It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.
* Fix Zen4 parts broken by the AVX2 change
* Try smaller q_step - no improvement
* Fix ARM_NEON
I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
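A rough illustration (invented names, not the actual interface) of wrapping the
platform-specific pieces in a struct, so the Zen4 and AVX2 paths can share the
same flash-attention skeleton and differ only in these primitives:

    #include <cstdint>
    #include <immintrin.h>

    struct F16TraitsAVX2 {
        using vec_t = __m256;
        // load 8 f16 values and widen to f32 (F16C)
        static vec_t load(const uint16_t * p)         { return _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)p)); }
        static void  store(float * p, vec_t v)        { _mm256_storeu_ps(p, v); }
        static vec_t mul(vec_t a, vec_t b)            { return _mm256_mul_ps(a, b); }
        static vec_t fmadd(vec_t a, vec_t b, vec_t c) { return _mm256_fmadd_ps(a, b, c); }
    };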
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_tn: Metal implementation
Requires changing the get_rows and matrix multiplication kernels
to use a dequantizer type rather than a dequantization function.
But once this is done, we can simply reuse the iq1_bn implementation.
This change will also allow adding other quantization types that
have metadata (such as a row scale) stored at the beginning of
a row (or changing existing quantization types to row-wise scales);
a sketch of the pattern follows this commit.
* Some cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
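A hypothetical illustration of the "dequantizer type" pattern: the type can carry
row-level metadata (here a per-row scale read once at the start of the row), which
a plain per-block dequantization function cannot. Names and the decode are made up:

    #include <cstdint>

    struct DequantizerSketch {
        const int8_t * qs    = nullptr;
        float          scale = 1.f;
        void init(const void * row) {
            scale = *(const float *) row;                       // row-level metadata
            qs    = (const int8_t *)((const float *) row + 1);  // packed weights follow
        }
        float dequantize(int i) const { return scale * qs[i]; } // placeholder decode
    };

    template <typename Dequantizer>
    static void mul_vec_row(const void * q_row, const float * x, int n, float * result) {
        Dequantizer d; d.init(q_row);
        float sum = 0.f;
        for (int i = 0; i < n; ++i) sum += d.dequantize(i) * x[i];
        *result = sum;
    }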
|
|
* iq1_tn: adding CUDA dequantize
* iq1_tn: adding CUDA dot product
* Delete commented out stuff
* Delete forgotten TODO
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models
* iq1_tn: NEON
* iq1_tn: faster NEON
* iq2_bn: improve performance on NEON
We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!
* iq1_tn: improve AVX2
PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq1_tn: improve Zen4
PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq2_bn: improve on Zen4
We now get PP-512 = 614 t/s up from 542 t/s
* iq2_bn: improve AVX2 implementation
We now get PP-512 = 753 t/s up from 680 t/s.
* Remove unnecessary barrier in ggml_compute_forward_mul_mat
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fused rms_norm: works on the CPU
* Fused rms_norm WIP
* Fused rms_norm WIP
* Fused rms_norm WIP
* Fused rms_norm WIP
* Fused rms_norm WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP: adding BF16 support to iqk_mul_mat
* Minor
* Improve TG speed (when not memory bound)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4 Flash Attention: WIP bf16
* Zen4 Flash Attention: bf16 seems to be working
* Zen4 Flash Attention: improving bf16
* Zen4 Flash Attention: improving bf16
It is better (slightly faster) to first convert Q to bf16
before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, so at most 4 kB
for the head sizes we support, which means we can just allocate
it on the stack instead of reserving and passing a work buffer
in ggml (see the quick check after this commit).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
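A quick check of the arithmetic above; the q_step value is an assumption used
only to show that the buffer stays small:

    #include <cstdint>
    #include <cstdio>

    int main() {
        const size_t D      = 256;  // largest supported head size
        const size_t q_step = 8;    // assumed number of Q rows processed per step
        std::printf("%zu bytes\n", D * q_step * sizeof(uint16_t));  // 4096: fits on the stack
    }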
|
|
* WIP: trying to improve legacy quants
* WIP: trying to improve legacy quants
With this commit PP-512 for LLaMA-3.1-8B goes from
72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s
for q4_1, so a 20+% improvement for both.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4 Flash Attention: WIP generalize to other types
Now loading of data from K and V is done via a template parameter,
so this should make it easy to generalize to types other than
F16 for the K and V cache.
* Zen4 Flash Attention: it works for q4_0 and q8_0
* Zen4 Flash Attention: small q8_0 performance improvement
* Zen4 Flash Attention: add q4_1
* Delete unused stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4 flash attention: moving useful parts from the kq_fused_softmax branch
* Add flash attention with soft-cap and fix D = 256 case
* Flash attention refinements
* Update FlashAttn comment
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Ref #29
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* soft_cap_max: initial CPU version of fused softcap + soft_max
With this vanilla CPU implementation I'm already getting a ~3% speedup
for Gemma-2-9b and a prompt of 8192 tokens (a sketch of the fused
operation follows this commit).
* soft_cap_max: WIP - something is wrong with CUDA
* soft_cap_max: looks good on CPU and CUDA
* Add softcap to flash attention
Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).
On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change, one cannot use FA, and one gets 2300 t/s
(after fusing softcap and softmax), or 2000 t/s without the
fused softcap+softmax.
In comparison, mainline llama.cpp has PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler also added softcap to FA),
and PP-16384 = 3097 t/s after that PR.
* soft_cap_max: Metal
* Flash attention with softcap: Metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
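An illustrative CPU-style sketch of the fused softcap + softmax over one row.
The names and the explicit cap parameter are assumptions, not the actual kernel;
the point is that the cap is folded into the same pass as the max/exp/sum loop:

    #include <algorithm>
    #include <cmath>

    static void soft_cap_max_row(const float * src, float * dst, int n, float cap) {
        float maxv = -INFINITY;
        for (int i = 0; i < n; ++i) {
            dst[i] = cap * std::tanh(src[i] / cap);  // softcap applied in the same pass
            maxv   = std::max(maxv, dst[i]);
        }
        float sum = 0.f;
        for (int i = 0; i < n; ++i) { dst[i] = std::exp(dst[i] - maxv); sum += dst[i]; }
        const float inv = 1.f/sum;
        for (int i = 0; i < n; ++i) dst[i] *= inv;
    }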
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Softcap: WIP
Fuses scale + tanh + scale as used for softcapping in some
models (see the sketch after this commit).
Just CPU for now. ~1.4% for PP-512 on Gemma2-9b, no effect on TG.
Somewhat surprisingly, the improvement does not increase as I
go to longer contexts. Gemma2 applies the softcap to K*Q, which
grows quadratically with context length, so I would have thought
the benefit from fusing scale, tanh, scale would increase.
But no, no luck.
* softcap: CUDA
* softcap: CUDA
~1% speedup for Gemma2-9b
* softcap: Metal and NEON
About 1% speedup.
* Simdified gelu
Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2.
It looks like the gelu operation is memory bound on my CPUs
after SIMD-ifying it. By not using the 128 kB gelu lookup table
we gain a small advantage.
On the M2-Max the lookup table is slightly faster than the SIMD
version, so I left the lookup table for ARM_NEON.
* softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512)
Not that I have encountered this in practice, but just to be sure.
This does it for AVX512 and AVX2, still need a guard for ARM_NEON.
* llama-bench: add ability to turn off warmup runs
So we don't need to wait forever on, e.g., benchmarks involving
long contexts.
* softcap, tanh: avoid NaNs for large arguments (NEON)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
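A scalar sketch of the fused softcap (scale -> tanh -> scale collapsed into one
pass), including the kind of clamp described above that keeps an exp-based SIMD
tanh from producing NaNs for large arguments. The scale values are whatever the
graph supplies; Gemma-2-style softcapping uses s1 = 1/cap and s2 = cap:

    #include <algorithm>
    #include <cmath>

    static void softcap_fused(float * x, int n, float s1, float s2) {
        for (int i = 0; i < n; ++i) {
            // guard: tanh is already +-1 to float precision at |v| = 40
            const float v = std::clamp(s1*x[i], -40.f, 40.f);
            x[i] = s2 * std::tanh(v);
        }
    }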
|
|
This improves size vs quality balance for Gemma-2 models.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
It has been there for a while, but I forgot to add it here.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
This allows for a better comparison between different models
or different tensors of the same model where the magnitude of
the model weights may differ.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of
half of attn_v with iq4_k
* Quantizing attn_output with iq3_k results in a larger PPL decrease
compared to what one expects from the added bpw.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops, i.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models, where computations are
long compared to the time taken for thread synchronization.
But for small models, skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s
(see the sketch after this commit).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
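A sketch of the idea; the integration point in the graph-compute loop is
simplified here, and ggml.h is assumed to be on the include path:

    #include "ggml.h"

    static bool op_is_noop(enum ggml_op op) {
        switch (op) {
            case GGML_OP_NONE:
            case GGML_OP_RESHAPE:
            case GGML_OP_VIEW:
            case GGML_OP_PERMUTE:
            case GGML_OP_TRANSPOSE:
                return true;   // no data is moved or computed for these ops
            default:
                return false;
        }
    }
    // ... and in the compute loop, the thread barrier is only taken when the
    // op did real work:
    //     if (!op_is_noop(node->op)) { /* synchronize threads */ }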
|
|
|
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
I always use cmake, so I had forgotten to pay attention to the
Makefile.
|
|
See comments in f3a823ce729a7db33e7d4375eae7291bbe6196db
|
|
|
|
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Has someone screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
I need to look into it later.
|
|
Respectable performance, only slightly slower than Q6_K.
|
|
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than the last commit,
but still kind of slow compared to Q6_K.
My last commit message is wrong: iq3_k also needs a fix
for overflow.
|
|
We need to do 4 shuffles to get the non-uniform values, which
makes it slower than the other iqX_k quants.
And then I realized that I was using the standard Zen4 template
for all iqX_k quants. The standard template converts the 32-bit
integers obtained from _mm512_dpbusds_epi32 back to 16 bits and
then multiplies with the 16-bit block scales. But this can
overflow for iq4_k, iq5_k, and iq6_k. I guess I did not notice
with iq4_k and iq5_k because the PPL difference to CUDA was
relatively small, and I attributed it to Q8_K not being accurate
enough for the activations. But for iq6_k the PPL difference was
much too big to be attributable to Q8_K inaccuracies, and that's
when I realized that I cannot pack the _mm512_dpbusds_epi32
result into 16 bits for the 4-, 5-, and 6-bit iqX_k quants
(a rough worst-case bound is sketched below).
For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
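A rough worst-case bound for why the 16-bit repacking overflows; the block size
and value ranges here are illustrative assumptions, not the exact quant layout:

    #include <cstdio>

    int main() {
        const long block = 16;   // assumed weights covered by one block scale
        const long q_max = 63;   // magnitude of a 6-bit quant (iq6_k-like)
        const long a_max = 127;  // magnitude of an int8 activation
        std::printf("worst-case block sum = %ld, int16 max = 32767\n", block*q_max*a_max);
        // 16 * 63 * 127 = 128016 does not fit into int16, so for the wider iqX_k
        // quants the _mm512_dpbusds_epi32 result has to stay in 32 bits.
    }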
|
|
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
|
|
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
|