Age | Commit message (Collapse) | Author |
|
* Adapting iq2_bn to work without separate scale tensors
Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).
* Adapting iq1_bn to work without separate scale tensors
* Adapting iq2_bn: CUDA dequantize
* Adapting iq2_bn: CUDA works
* Adapting iq1_bn: CUDA works
* Adapting iq1_bn, iq2_bn: NEON
* Adapting iq1_bn, iq2_bn: Metal
Dequantize works, but there is still something wrong
with the dot products.
* WIP
Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.
* Remove iq1_tn and iq2_tn - Part 1
Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.
* Remove iq1_tn and iq2_tn - Part 2
* Bitnet: use the standard llm_build_kv to build self attention
My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).
* Revert "Avoid rebuild of GGML graph for each token (#98)"
This reverts commit f2d315b46f7aacc7df4b86bd8acba387b30e11ca.
As far as I can tell, the commit breaks Metal TG.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
To complement the token_embd.weight and output.weight :
attn_v.weight
attn_k.weight.
attn_q_weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
|
|
* iq4_kss: WIP
* iq4_kss: CUDA dequantize works
So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.
* iq4_kss: slightly better quantization
* iq4_kss: another small quantization improvement
* iq4_kss: CUDA works
TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.
* iq4_kss: new bit arrangement - CUDA and Zen4 work
Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.
* iq4_kss: ARM_NEON. Predictably very slow
* iq4_kss: Metal
PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.
* iq4_kss: somewhat faster Metal dot product
45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0
* iq4_kss: AVX2
Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.
* iq4_kss: very slightly faster Metal dot product
48.7 t/s -> 49.3 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experimenting
* iq2k: Try make_qx_quants for the scale
Slightly better for LLaMA-3.1, Gemma-2, slightly worse for
Qwen2.5
* iq2k with make_qx_quants: adjust scale
* iq2ks: basics
* iq2_ks: CUDA works
* iq2_ks: WIP
* iq2_ks: WIP
* iq2_ks: Zen4
* iq2_ks: AVX2
* iq2_ks: scalar dot product
* iq2_ks: ARM_NEON
* iq2_ks: Metal
* iq2_ks: faster Metal
LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 = 45.32 ± 0.03 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_k_xxs: basics
* WIP + adding iq3_kl quantization mix
* iq4_xxs: this looks very viable compared to iq4_xs
At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.
* iq4_xxs: CUDA dot product
We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.
* iq4_xxs: scalar CPU dot product
Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.
* iq4_xxs: Zen4
I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).
* Fix iq4_xs (Zen4)
* iq4_xxs: AVX2
* iq4_xxs: ARM_NEON
* iq4_xxs: Metal
* iq4_xxs: slightly faster TG on Metal
* iq4_xxs: rename to iq4_ks
After all, tt is a smaller variant of iq4_k.
* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q6_0 - basics + AVX2/Zen4 working
* Adding q6_0: CUDA dequantize works, but not mmvq
* Adding q6_0: CUDA mmvq works
* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache
* Add q6_0 to CPU flash attention
Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?
* q6_0: slightly better kv-cache result
Better than q8_0+q4_0, but not as good as q8_0+iq4_nl
* q6_0: works on ARM_NEON
* q6_0: dequantize works on Metal, but not vector dot product
* q6_0: it now works on Metal
Outperforms q5_0 by a significant margin. E.g.
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 44.02 ± 0.08 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 40.13 ± 0.12 |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 500.55 ± 0.32 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 448.02 ± 0.27 |
* q6_0: can now be used for kv-cache on Metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* POC: per row scale
This is a POC how to work around opinionated ggml to
have scales per row rather than per block.
Only implemened for Zen4 and only for iq2_tn.
* POC per row scale: iq2_tn on NEON
* POC per row scale: iq2_tn on Metal
* Per row scale Metal templates
* iq1_tn: shrink to 1.625 bpw (NEON and Metal)
* POC per row scale: CUDA
* POC per row scale: add CUDA TODOs
There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.
* Per row scales - CUDA
The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.
* Update IQ1_TN and IQ2_TN bpw shown to user
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models
* iq1_tn: NEON
* iq1_tn: faster NEON
* iq2_bn: improve performance on NEON
We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!
* iq1_tn: improve AVX2
PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq1_tn: improve Zen4
PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.
* iq2_bn: improve on Zen4
We now get PP-512 = 614 t/s up from 542 t/s
* iq2_bn: improve AVX2 implementation
We now get PP-512 = 753 t/s up from 680 t/s.
* Remove unnecessary barrier in ggml_compute_forward_mul_mat
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* iq2_tn: TriLM specific 2.0625 bpw quantization
Quantize/dequantize/scale dot product.
I get 46 t/s for the TriLM-3.9B with any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.
* iq2_tn: AVX512
Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.
* iq2_tn: AVX512
With this tweak we get to PP-512 = 431 t/s.
* iq2_tn: AVX512
With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.
* iq2_tn: AVX2
PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.
* iq2_tn: initial NEON version
* iq2_tn: NEON
For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn ant 3.3B Bitnet.
* iq2_tn: Metal
For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.
* iq2_tn: CUDA
For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.
* iq2_tn: AVX2 PP improvement
We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn.
Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.
* iq2_tn: small NEON improvement
For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
|
|
Quantize/dequantize, CUDA dequantize
|
|
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
|
|
* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.
* iq4_k: NEON implementation
For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
|
|
The scalar dot product already chieves 37 t/s for TG!
|
|
|
|
* common : normalize naming style
ggml-ci
* common : match declaration / definition order
* zig : try to fix build
|
|
|
|
* Introduce bfloat16 support
Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌──┴───┐┌─┴───┐
0b0000000000000000 brain16
This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌──┴───┐┌─┴───────────────────┐
0b00000000000000000000000000000000 IEEE binary32
The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others
┌sign
│
│ ┌exponent
│ │
│ │ ┌mantissa
│ │ │
│┌─┴─┐┌─┴──────┐
0b0000000000000000 IEEE binary16
This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16
* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
|
|
* imatrix: save the dataset file used in the output file
* llama: support kv overrides type string string
* common: factorize KV Overrides parsing between common and server
* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656
* llama: remove kv override str_value initialization as it does not compile on some toolchain
* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`
* quantize: add imatrix filename in KV
* llama: add llama_model_kv_override_free
* common: add llama_model_kv_override_free
common: free kv override if used after model loading
* llama: finally move the string KV override value to the stack
* llama : minor
* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.
Co-authored-by: slaren <slarengh@gmail.com>
* kv override: ensure string termination
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* Implement '--keep-split' to quantize model into several shards
* Add test script
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Split model correctly even if tensor id is out-of-order
* Update llama_model_quantize_params
* Fix preci failures
---------
Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* ggml : update mul_mat_id to use the same tensor for all the experts
* update cuda
* minor
* update metal
* update test-backend-ops
* fix cuda
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update convert.py
* update convert-hf-to-gguf.py
* update convert.py for mixtral hf models
* Update convert-hf-to-gguf.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* cuda : support non-pow-2 number of experts
* allow quantize to work for split and merged experts models in the same way
* cleanup + disable mmap automatically with split tensors models
* update imatrix
* test-backend-ops : test qwen argsort
* update grok model loading
* llama : add merged experts tensors to the grok tensor map
* minor
* gguf : bump version
* fix quantizing of merged experts
* convert-hf-to-gguf.py : update grok (untested)
* make linter happy
* cuda/argsort : use shared memory instead of pool memory
* convert : fix grok tensor names
* metal : add support for non-pow-2 argsort
* llama : more loader cleanup, better error checking
* cuda : fix warning
* llama : still use mmap for loading old models, but copy the data to a host buffer
* add review note
* llama : remove ffn tensor counting + add sanity check
ggml-ci
* convert : fix handling of n_experts == None
ggml-ci
* imatrix : fix ncall counters
* llama : produce error if imatrix size does not match
* quantize : terminate on errors + trace logs
ggml-ci
* metal : pad shared memory to 16 bytes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
* iq1_m: go to 3-bit scales
There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iiq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing block scales redo the super-block scale fit.
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* quantize: be able to override metadata by key
* minor : spacing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iiq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output
At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics
Dequantize works, but not dot product
* iq1_s: Metal works, but quite slow
As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remote enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
align paremeters
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|
|
|
|
* iq3_xxs: quantize/dequantize
RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717
This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
* iq3_xxs: CUDA dot product
We have
PP-512: 5891 t/s
TG-128: 143.9 t/s
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
Metal performance is decent, ARM_NEON is pathetic
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
This time the dot product accuracy did find an actual bug
in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Fix misprint in quantize help
|
|
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K
Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool
Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
|
|
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
|
|
|
|
|
|
|
|
|
|
Co-authored-by: xaedes <xaedes@gmail.com>
|
|
* Allow quantize tool to only copy tensors to allow repackaging models.
* Slightly better logic when requantizing.
* Change help message to go to `stdout`.
|