summaryrefslogtreecommitdiff
path: root/examples/quantize
AgeCommit message (Collapse)Author
2024-12-08Rename iq4_nl_x4 to iq4_nl_r4 (#126)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)Kawrakow
* Adding iq2_bn_r4 This Zen4-only implementation achieves PP-512 = 826 t/s (!!!) for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn. * Make sure rows per thread are a multiple of the number of interleaved rows With this I can run iq2_bn_r4 with 32 threads and this increases PP-512 to 872 t/s. * iq2_bn_r4: 1st shot at NEON PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s for Bitnet-1.58b-3B). TG-128 is ~5% slower. * iq2_bn_r4: NEON PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn for 1 thread, but saturates to about the same 93 t/s at 8 threads. * iq2_bn_r4: Experimenting on NEON The matrix x vvector multiplication is erratic. iq2_bn_r4 is faster at 1, 2, and 4 threads, but saturates to a lower t/s at 8 threads compared to iq2_bn. iq2_bn actually manages 99 t/s at 8 threads and not 93 as I wrore in the last commit. iq2_bn_r4 performance has huge fluctuations at 4 and 8 threads. * Some cleanup * iq2_bn_r4: AVX2 As expected, PP is slightly slower as we just don;t have enough vector registers (690 vs 710 t/s). TG is slightly faster (18.2 vs 16.7 t/s at 1 thread). * iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn. * iq2_bn_r4: simdify q8_K16 quantization (AVX2) PP-512 becomes 834 t/s and TG-128 now saturates to the same performance as iq2_bn for 4 threads. * iq2_bn_r4: simdify q8_K16 quantization (NEON) PP-512 is now 304.7 t/s, and TG-128 @ 8 threads very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s) * iq2_bn_r4: fix AVX2 after breaking it two commits ago * iq2_bn_r4: better AVX2 As we don't have enough vector registers on AVX2, it is better to do two passes per row needing only half of the accumulator registers that way. With this, we now beat iq2_bn PP also on AVX2 by a small margin. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04IQ4_XS_R4 (#123)Kawrakow
* Adding iq4_xs_r4 This is a 1st working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4. * iq4_xs_r4: WIP * iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4 * iq4_xs_r4: NEON We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs! * DRY --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03Q6_0_R4 (#122)Kawrakow
* Adding q6_0_r4 We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X. * q6_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max. In terms of ops, q6_0_r4 is identical to q5_0_r4 except for loading the high bits being vld1q_u8_x2 instead of vld1q_u8. It is strange that this can make a 5% difference in performance, especially considering that this is amortized (re-used) over 8 columns in the right matrix. Or am I running out of vector registers? * Fix AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03Q5_0_R4 (#121)Kawrakow
* Adding q5_0_r4 We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X. We even get TG-128 improvement to 11.7 t/s from 11.1 t/s. * q5_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max, up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp is no longer funny: they get 26.5 t/s for Q5_0. For TG, we are nor able to fully saturate memory bandwidth and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets 20.6 t/s for Q5_0. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03Q8_0_R4 (#120)Kawrakow
* Adding q8_0_r4 We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared to 175.6 t/s for Q8_0. * q8_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max. * q8_0_r4: Zen4 matrix-vector specialization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02Q4_0_R4 (#119)Kawrakow
* Adding iq4_0_r4 - q4_0 repacked We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU, so ~5-6% faster than iq4_nl_x4. * q4_0_r4: NEON Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02IQ4_NL_X4 (#118)Kawrakow
* Adding iq4_nl_x4 Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s on the Ryzen-7950X! This is faster than any other quant and ~40% faster than iq4_nl. * iq4_nl_x4: getting amazing This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s! * iq4_nl_x4: AVX2 Here we gain only 25% compared to iq4_nl * iq4_nl_x4: NEON On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from 82.4 t/s for iq4_nl. * iq4_nl_x4: minor NEON improvement and cleanup This gets us to 110.3 t/s. In comparison, IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s. * iq4_nl_x4: NEON specialization for matrix x vector --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25Bitnet changes (#106)Kawrakow
* Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I thnk it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absoolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Botnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit f2d315b46f7aacc7df4b86bd8acba387b30e11ca. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-18CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)Nexes the Elder
To complement the token_embd.weight and output.weight : attn_v.weight attn_k.weight. attn_q_weight attn_output.weight attn_qkv.weight ffn_gate ffn_down ffn_up
2024-10-16Adding IQ4_KSS: 4.0 bpw quants (#89)Kawrakow
* iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13IQ2_KS: 2.1875 bpw non-linear quantization (#85)Kawrakow
* Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09New SOTA quantization: 4.25 bpw IQ4_KS (#83)Kawrakow
* iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, tt is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02Adding Q6_0 (#77)Kawrakow
* Adding q6_0 - basics + AVX2/Zen4 working * Adding q6_0: CUDA dequantize works, but not mmvq * Adding q6_0: CUDA mmvq works * Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache * Add q6_0 to CPU flash attention Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache gives about the same PPL as q8_0 K-cache and q4_0 V-cache, while needing the exact same RAM. I.e., what was the point? * q6_0: slightly better kv-cache result Better than q8_0+q4_0, but not as good as q8_0+iq4_nl * q6_0: works on ARM_NEON * q6_0: dequantize works on Metal, but not vector dot product * q6_0: it now works on Metal Outperforms q5_0 by a significant margin. E.g. | model | size | params | backend | ngl | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: | | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 44.02 ± 0.08 | | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 40.13 ± 0.12 | | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 500.55 ± 0.32 | | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 448.02 ± 0.27 | * q6_0: can now be used for kv-cache on Metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27Adding ability to have meta data per tensor row (#61)Kawrakow
* POC: per row scale This is a POC how to work around opinionated ggml to have scales per row rather than per block. Only implemened for Zen4 and only for iq2_tn. * POC per row scale: iq2_tn on NEON * POC per row scale: iq2_tn on Metal * Per row scale Metal templates * iq1_tn: shrink to 1.625 bpw (NEON and Metal) * POC per row scale: CUDA * POC per row scale: add CUDA TODOs There are two places in ggml-cuda.cu left where it is assumed that type_size * n_per_row / block_size is the way to compute and handle row sizes. This does not affect simple usage, but will lead to issues when tensors are split between GPUs. * Per row scales - CUDA The only place left where there are unnecessary assumptions being made is in the Flash Attention code. As we are not using any quants that use per row scales for quantized KV cache, it should be OK for now. * Update IQ1_TN and IQ2_TN bpw shown to user --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)Kawrakow
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models * iq1_tn: NEON * iq1_tn: faster NEON * iq2_bn: improve performance on NEON We now get TG-128 = 100 t/s for Bitnet-3B-1.58b! * iq1_tn: improve AVX2 PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq1_tn: improve Zen4 PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq2_bn: improve on Zen4 We now get PP-512 = 614 t/s up from 542 t/s * iq2_bn: improve AVX2 implementation We now get PP-512 = 753 t/s up from 680 t/s. * Remove unnecessary barrier in ggml_compute_forward_mul_mat --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12Merge mainline - Aug 12 2024 (#17)Kawrakow
* Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-09iq6_k: WIP (quantize/dequantize)Iwan Kawrakow
2024-08-07Adding IQ2_TN for use with ternary models (#13)Kawrakow
* iq2_tn: TriLM specific 2.0625 bpw quantization Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B with any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation. * iq2_tn: AVX512 Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B. * iq2_tn: AVX512 With this tweak we get to PP-512 = 431 t/s. * iq2_tn: AVX512 With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads. * iq2_tn: AVX2 PP512 = 440 t/s on the Ryzen-5975WX. We should be able to do better. * iq2_tn: initial NEON version * iq2_tn: NEON For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn ant 3.3B Bitnet. * iq2_tn: Metal For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s. * iq2_tn: CUDA For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s. * iq2_tn: AVX2 PP improvement We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn. Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn. * iq2_tn: small NEON improvement For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-05q2_K: allow it to detect ternary nets and quantize accordinglyIwan Kawrakow
2024-08-01iq3_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01iq5_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA dequantize
2024-08-01iq2_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
2024-07-28IQ4_K: SOTA 4-bit quantization (#6)Kawrakow
* iq4_k: basics * quantize/dequantize works * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16. * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind * iqk_mul_mat is not implemented. * iq4_k: TG now works on CUDA * iq4_k: AVX512 implementation For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S. * iq4_k: AVX2 implementation For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X. * iq4_k: NEON implementation For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower. * iq4_k: Metal implementation For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S. * iq4_k: scalar dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27Merge mainline llama.cpp (#3)Kawrakow
* Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-06-24Bitnet: tiny bity faster 1.625 bpw variant on MetalIwan Kawrakow
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-22bitnet: add 2 bpw quantizationIwan Kawrakow
The scalar dot product already chieves 37 t/s for TG!
2024-06-22bitnet: CUDA, scalar, AVX2Iwan Kawrakow
2024-06-13`build`: rename main → llama-cli, server → llama-server, llava-cli → ↵Olivier Chafik
llama-llava-cli, etc... (#7809) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-04common : refactor cli arg parsing (#7675)Georgi Gerganov
* common : gpt_params_parse do not print usage * common : rework usage print (wip) * common : valign * common : rework print_usage * infill : remove cfg support * common : reorder args * server : deduplicate parameters ggml-ci * common : add missing header ggml-ci * common : remote --random-prompt usages ggml-ci * examples : migrate to gpt_params ggml-ci * batched-bench : migrate to gpt_params * retrieval : migrate to gpt_params * common : change defaults for escape and n_ctx * common : remove chatml and instruct params ggml-ci * common : passkey use gpt_params
2024-05-22common : normalize naming style (#7462)Georgi Gerganov
* common : normalize naming style ggml-ci * common : match declaration / definition order * zig : try to fix build
2024-05-20tests : fix --keep_split -> --keep-split (#7374)Georgi Gerganov
2024-05-19quantize : fix --keep-split check (#7374)Fred Douglas
2024-05-16doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)Vaibhav Srivastav
* chore: add references to the quantisation space. * fix grammer lol. * Update README.md Co-authored-by: Julien Chaumond <julien@huggingface.co> * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Julien Chaumond <julien@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-08ggml : introduce bfloat16 support (#6412)Justine Tunney
* Introduce bfloat16 support Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 brain16 This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───────────────────┐ 0b00000000000000000000000000000000 IEEE binary32 The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16 which in practice ends up being 99.71% of Mistral 7b v0.2's weights however there is currently no way other than fp32 to get the others ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌─┴─┐┌─┴──────┐ 0b0000000000000000 IEEE binary16 This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16 * Remove GGML code that's not needed * Minimize the GGML API surface area for BF16 * Remove bf16 luts * Make the GGML header look nicer * Fix documentation * Apply ggerganov's fixes for test-backend-ops * Add BF16 code for new ggml_validate_row_data() function
2024-04-26quantize: add imatrix and dataset metadata in GGUF (#6658)Pierrick Hymbert
* imatrix: save the dataset file used in the output file * llama: support kv overrides type string string * common: factorize KV Overrides parsing between common and server * quantize: add imatrix n entries and dataset KV metadata quantize: factorize KV Overrides parsing between common #6656 * llama: remove kv override str_value initialization as it does not compile on some toolchain * quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count` * quantize: add imatrix filename in KV * llama: add llama_model_kv_override_free * common: add llama_model_kv_override_free common: free kv override if used after model loading * llama: finally move the string KV override value to the stack * llama : minor * no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators. Co-authored-by: slaren <slarengh@gmail.com> * kv override: ensure string termination --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>
2024-04-25tests : minor bash stuff (#6902)Georgi Gerganov
* tests : minor bash stuff ggml-ci * llama : fix build ggml-ci * tests : fix CUR_DIR -> ROOT_DIR ggml-ci * tests : fix fname ggml-ci
2024-04-25quantize : add '--keep-split' to quantize model into shards (#6688)jiez
* Implement '--keep-split' to quantize model into several shards * Add test script * Update examples/quantize/quantize.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Split model correctly even if tensor id is out-of-order * Update llama_model_quantize_params * Fix preci failures --------- Co-authored-by: z5269887 <z5269887@unsw.edu.au> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-12chore: Fix markdown warnings (#6625)Rene Leonhardt
2024-04-03ggml : mul_mat_id use the same tensor for all the experts (#6387)slaren
* ggml : update mul_mat_id to use the same tensor for all the experts * update cuda * minor * update metal * update test-backend-ops * fix cuda * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update convert.py * update convert-hf-to-gguf.py * update convert.py for mixtral hf models * Update convert-hf-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * cuda : support non-pow-2 number of experts * allow quantize to work for split and merged experts models in the same way * cleanup + disable mmap automatically with split tensors models * update imatrix * test-backend-ops : test qwen argsort * update grok model loading * llama : add merged experts tensors to the grok tensor map * minor * gguf : bump version * fix quantizing of merged experts * convert-hf-to-gguf.py : update grok (untested) * make linter happy * cuda/argsort : use shared memory instead of pool memory * convert : fix grok tensor names * metal : add support for non-pow-2 argsort * llama : more loader cleanup, better error checking * cuda : fix warning * llama : still use mmap for loading old models, but copy the data to a host buffer * add review note * llama : remove ffn tensor counting + add sanity check ggml-ci * convert : fix handling of n_experts == None ggml-ci * imatrix : fix ncall counters * llama : produce error if imatrix size does not match * quantize : terminate on errors + trace logs ggml-ci * metal : pad shared memory to 16 bytes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-26IQ1_M: 1.75 bpw quantization (#6302)Kawrakow
* iq1_m: basics * iq1_m: basics-2 * iq1_m: CUDA dequantize works Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B. * iq1_m: separate shifts for each group of 8 in a block We get PPL(LLaMA-v2-7B ) = 9.2810 PPL(LLaMA-v2-13B) = 6.8105 Not bad, but slightly higher than sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)) which is the expected outcome given that IQ1_M is halfway between IQ1_S and IQ2_XXS in terms of bpw. From this, we would expect PPL = 9.14 for LLaMA-v2-7B PPL = 6.63 for LLaMA-v2-13B * iq1_m: go to 3-bit scales There is slight increase in PPL, but the 0.0625 bpw reduction in size is totally worth it. We now have PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw * iq1_m: scalar dot product * iq1_m: AVX2 dot product * iq1_m: very slightly faster AVX2 dot product * iq1_m: ARM_NEON dot product Works, but very slow (10.5 t/s) * iq1_m: Metal - dequantize works, dot product does not * iq1_m: Metal now works About the same performance as iq1_s. * iq1_m: minor * iq1_m: checking pure iq1_m quantization It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight with Q4_K. * iiq1_m: slightly faster ARM_NEON dot product 10.5 t/s -> 11.65 t/s * iq1_m: faster ARM_NEON dot product 11.65 t/s -> 14.9 t/s * iq1_m: another minor ARM_NEON dot product improvement 14.9 -> 15.0 t/s * iq1_m: small PPL improvement via super-block scale adjustment After quantizing block scales redo the super-block scale fit. PPL(LLaMA-v2-7B ) = 9.3346 PPL(LLaMA-v2-13B) = 6.8419 PPL(LLaMA-v2-70B) = 4.8294 PPL(Mistral-7B ) = 8.1624 * iq1_m: adapt to CUDA refactoring * iq1_m: remove unused variable We have progressed to warnings being errors. * iq1_m: add to backend-ops tests * iq1_m: fix Windows ARM * iq1_m: use common definition of iq1m_scale_t * cuda: assert -> NO_DEVICE_CODE * iq1_M: PR comments --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-26quantize : be able to override metadata by key (#6321)Kawrakow
* quantize: be able to override metadata by key * minor : spacing --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-22quantize: options for output and token embedding tensors qtype (#6239)Kawrakow
* quantize: be able to specify the output tensor type * quantize: be able to specify the token embedding tensor type --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27IQ4_XS: a 4.25 bpw quantization (#5747)Kawrakow
* Try IQ4_NL with blocks of 64 - does not look good * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 * iq4_xs: CUDA works - 133.2 t/s * iq4_xs: AVX2 dot product * iq4_xs: ARM_NEON dot product * iq4_nl: Metal implementation As usual, Metal / Apple Silicon don't like my quants. * iq3_xs: minor fix * iq4_xs: shrink by using IQ3_S for attn_k and attn_q * iq4_xs: revert using IQ3_S for attn_k and attn_v PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS. * Fix CI * iq4_xs: Added forgotten check for 256 divisibility --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-26Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization ↵Kawrakow
range (#5721) * Adding IQ2_S and IQ2_M as a single cumulative commit * Update examples/quantize/quantize.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-24IQ3_S: a much better alternative to Q3_K (#5676)Kawrakow
* iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * Resurrecting iq3_xs After all the experimentation, nothing was better than this. * Minor PPL improvement via a block scale fudge factor * Minor improvement via 3 neighbours * iq3_xs: working scalar and AVX2 dot products * iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s) * iq3_xs: working Metal implementation * Adding IQ3_M - IQ3_XS mix with mostly Q4_K * iiq3_xs: a 3.4375 bpw variant * iq3_xs: make CUDA work for new version * iq3_xs: make scalar and AVX2 work for new version * iq3_s: make ARM_NEON work with new version * iq3_xs: make new version work on metal Performance is very similar to Q3_K_S * iq3_xs: tiny Metal speed improvement * iq3_xs: tiny Metal speed improvement * Fix stupid warning * Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS * iq3_xs: rename to iq3_s * iq3_s: make tests pass * Move Q3_K_XS mix to 3.25 bpw * Attempt to fix failing tests * Another attempt to fix the Windows builds * Attempt to fix ROCm * ROCm again * iq3_s: partial fix for QK_K = 64 * iq3_s: make it work on metal for QK_K = 64 Pleasent surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards. * Will this fix ROCm? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-21IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)Kawrakow
* iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * iq4_nl: Fix after merging with master * iq4_nl: another fix after merging with master * Use IQ4_NL instead of Q4_K when using k-quants is not possible * Fix typo that makes several tests fail * It was the ggml_vdotq thing missed inside the brackets --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-181.5 bit quantization (#5453)Kawrakow
* iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-16ggml : add numa options (#5377)bmwl
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-03refactor : switch to emplace_back to avoid extra object (#5291)Michael Klimenko