summaryrefslogtreecommitdiff
path: root/examples
AgeCommit message (Collapse)Author
2024-12-03Q8_0_R4 (#120)Kawrakow
* Adding q8_0_r4 We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared to 175.6 t/s for Q8_0. * q8_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max. * q8_0_r4: Zen4 matrix-vector specialization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02Q4_0_R4 (#119)Kawrakow
* Adding iq4_0_r4 - q4_0 repacked We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU, so ~5-6% faster than iq4_nl_x4. * q4_0_r4: NEON Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02IQ4_NL_X4 (#118)Kawrakow
* Adding iq4_nl_x4 Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s on the Ryzen-7950X! This is faster than any other quant and ~40% faster than iq4_nl. * iq4_nl_x4: getting amazing This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s! * iq4_nl_x4: AVX2 Here we gain only 25% compared to iq4_nl * iq4_nl_x4: NEON On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from 82.4 t/s for iq4_nl. * iq4_nl_x4: minor NEON improvement and cleanup This gets us to 110.3 t/s. In comparison, IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s. * iq4_nl_x4: NEON specialization for matrix x vector --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25Bitnet changes (#106)Kawrakow
* Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I thnk it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absoolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Botnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit f2d315b46f7aacc7df4b86bd8acba387b30e11ca. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-18CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)Nexes the Elder
To complement the token_embd.weight and output.weight : attn_v.weight attn_k.weight. attn_q_weight attn_output.weight attn_qkv.weight ffn_gate ffn_down ffn_up
2024-10-16Adding IQ4_KSS: 4.0 bpw quants (#89)Kawrakow
* iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13IQ2_KS: 2.1875 bpw non-linear quantization (#85)Kawrakow
* Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-10Better model info (#84)Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09New SOTA quantization: 4.25 bpw IQ4_KS (#83)Kawrakow
* iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, tt is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02Adding Q6_0 (#77)Kawrakow
* Adding q6_0 - basics + AVX2/Zen4 working * Adding q6_0: CUDA dequantize works, but not mmvq * Adding q6_0: CUDA mmvq works * Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache * Add q6_0 to CPU flash attention Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache gives about the same PPL as q8_0 K-cache and q4_0 V-cache, while needing the exact same RAM. I.e., what was the point? * q6_0: slightly better kv-cache result Better than q8_0+q4_0, but not as good as q8_0+iq4_nl * q6_0: works on ARM_NEON * q6_0: dequantize works on Metal, but not vector dot product * q6_0: it now works on Metal Outperforms q5_0 by a significant margin. E.g. | model | size | params | backend | ngl | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: | | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 44.02 ± 0.08 | | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 40.13 ± 0.12 | | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 500.55 ± 0.32 | | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 448.02 ± 0.27 | * q6_0: can now be used for kv-cache on Metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27Adding ability to have meta data per tensor row (#61)Kawrakow
* POC: per row scale This is a POC how to work around opinionated ggml to have scales per row rather than per block. Only implemened for Zen4 and only for iq2_tn. * POC per row scale: iq2_tn on NEON * POC per row scale: iq2_tn on Metal * Per row scale Metal templates * iq1_tn: shrink to 1.625 bpw (NEON and Metal) * POC per row scale: CUDA * POC per row scale: add CUDA TODOs There are two places in ggml-cuda.cu left where it is assumed that type_size * n_per_row / block_size is the way to compute and handle row sizes. This does not affect simple usage, but will lead to issues when tensors are split between GPUs. * Per row scales - CUDA The only place left where there are unnecessary assumptions being made is in the Flash Attention code. As we are not using any quants that use per row scales for quantized KV cache, it should be OK for now. * Update IQ1_TN and IQ2_TN bpw shown to user --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)Kawrakow
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models * iq1_tn: NEON * iq1_tn: faster NEON * iq2_bn: improve performance on NEON We now get TG-128 = 100 t/s for Bitnet-3B-1.58b! * iq1_tn: improve AVX2 PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq1_tn: improve Zen4 PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq2_bn: improve on Zen4 We now get PP-512 = 614 t/s up from 542 t/s * iq2_bn: improve AVX2 implementation We now get PP-512 = 753 t/s up from 680 t/s. * Remove unnecessary barrier in ggml_compute_forward_mul_mat --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05Zen4 Flash Attention - bf16 support (#38)Kawrakow
* Zen4 Flash Attnetion: WIP bf16 * Zen4 Flash Attnetion: bf16 seems to be working * Zen4 Flash Attnetion: improving bf16 * Zen4 Flash Attnetion: improving bf16 It is better (slightly faster) to first convert Q to bf16 before processing each block of q_step rows. This requires D*q_step*sizeof(bf16) bytes, so at most 4 kb for the head sizes we support, so we can just allocate on the stack instead of reserving and passing a work buffer in ggml. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-20Fused soft cap and SIMD-ified GeLU (#9)Kawrakow
* Softcap: WIP Fuses scale + tanh + scale as used for softcaping in some models. Just CPU for now. ~1.4% for PP-512 on Gemma2-9b, no effect on TG. Somewhat surprisingly the improvement does not increase as I go to longer contexts. Gemma2 does softcap on K*Q, which grows quadratically with context length, so I would have thought the benefit from fusing scale, tanh, scale would increase. But no, no luck. * softcap: CUDA * softcap: CUDA ~1% speedup for Gemma2-9b * softcap: Metal and NEON About 1% speedup. * Simdified gelu Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2. It looks like the gelu operation is memory bound on my CPU's after SIMD-ifying it. By not using the 128 kb gelu lookup table we gain a small advantage. On the M2-Max the lookup table is slightly faster than the SIMD version, so left the lookup table for ARM_NEON. * softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512) Not that I have encountered this in practice, but just to be sure. This does it for AVX512 and AVX2, still need a guard for ARM_NEON. * llama-bench: add ability to turn off warmup runs So we don't need to wait forever on, e.g., benchmarks involving long contexts. * softcap, tanh: avoid NaNs for large arguments (NEON) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19quantize_stats: print rmse and max error as fraction of <x> (#21)Kawrakow
This allows for a better comparison between different models or different tensors of the same model where the magnitude of the model weights may differ. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12Merge mainline - Aug 12 2024 (#17)Kawrakow
* Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-09iq6_k: WIP (quantize/dequantize)Iwan Kawrakow
2024-08-07Adding IQ2_TN for use with ternary models (#13)Kawrakow
* iq2_tn: TriLM specific 2.0625 bpw quantization Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B with any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation. * iq2_tn: AVX512 Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B. * iq2_tn: AVX512 With this tweak we get to PP-512 = 431 t/s. * iq2_tn: AVX512 With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads. * iq2_tn: AVX2 PP512 = 440 t/s on the Ryzen-5975WX. We should be able to do better. * iq2_tn: initial NEON version * iq2_tn: NEON For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn ant 3.3B Bitnet. * iq2_tn: Metal For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s. * iq2_tn: CUDA For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s. * iq2_tn: AVX2 PP improvement We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn. Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn. * iq2_tn: small NEON improvement For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-05q2_K: allow it to detect ternary nets and quantize accordinglyIwan Kawrakow
2024-08-01iq3_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01iq5_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA dequantize
2024-08-01iq2_k: BasicsIwan Kawrakow
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
2024-07-28IQ4_K: SOTA 4-bit quantization (#6)Kawrakow
* iq4_k: basics * quantize/dequantize works * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16. * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind * iqk_mul_mat is not implemented. * iq4_k: TG now works on CUDA * iq4_k: AVX512 implementation For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S. * iq4_k: AVX2 implementation For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X. * iq4_k: NEON implementation For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower. * iq4_k: Metal implementation For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S. * iq4_k: scalar dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27Merge mainline llama.cpp (#3)Kawrakow
* Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-24Add copyright noticesIwan Kawrakow
Only on the files where I have contributed in a significant way, or the files I wrote myself.
2024-06-26imatrix: be able to specify the name of the output tensorIwan Kawrakow
For some models the same tensor is used for token embeddings and output. This tensor tends to be named token_embedding.weight rather than output.weight, which prevernts us from collecting imatrix data for this tensor. With this commit we can tell the name of the output tensor to the imatrix tool.
2024-06-24Bitnet: tiny bity faster 1.625 bpw variant on MetalIwan Kawrakow
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-22bitnet: add 2 bpw quantizationIwan Kawrakow
The scalar dot product already chieves 37 t/s for TG!
2024-06-22bitnet: CUDA, scalar, AVX2Iwan Kawrakow
2024-06-21llama : allow pooled embeddings on any model (#7477)Douglas Hanley
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples * find result_norm/result_embd tensors properly; update output allocation logic * only use embd output for pooling_type NONE * get rid of old causal_attn accessor * take out attention_type; add in llama_set_embeddings * bypass logits when doing non-NONE pooling
2024-06-21swiftui : enable stream updating (#7754)Shuichi Tsutsumi
2024-06-20[SYCL] Fix windows build and inference (#8003)luoyu-intel
* add sycl preset * fix debug link error. fix windows crash * update README
2024-06-20server : fix smart slot selection (#8020)sasha0552
2024-06-18Only use FIM middle token if it exists (#7648)Sigbjørn Skjæret
* Only use FIM middle if it exists * Only use FIM middle if it exists
2024-06-17Add support for sqrt on CUDA (#7953)Calvin Laurenson
* cuda sqrt support * enable cuda in pca * fix comments in pca * add test * add sqrt to ggml_backend_cuda_supports_op * fix test * new line * Use F32 sqrtf instead of F64 sqrt Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-15Add `cvector-generator` example (#7514)Xuan Son Nguyen
* add control-vector-generator * calc diff * add comments * proof-of-concept stdlib implementation Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish. * param parsing, refactor, comments Added basic command-line parameters for outfile and one each positive/negative prompt. Refactored some messy code in PCA computation and GGUF exporting. Left a bunch of comments regarding further work needed. * example template completions Implements an example template set built from the positive/negative prompts like the control vector Python implementation. * add multi prompts, multi-thread for PCA * fix mem error * add debugs * fix matrix transpose multiplication you have got to be kidding me * preliminary template/multiprompt support model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish * fix zero output & param parsing, functional templating fixed a bug where the output file had no tensor data/was all zero fixed a bug where single hyphen flags were not being correctly parsed implements creation of templated prompts from input (still need to adapt based on model) * fix square_diff matmul index range and CRLF->LF line endings fixed a logic error where square_diff would not multiply all rows fixed a formatting error where the provided completions.txt had CRLF line endings * add command-line args for num threads, num completions file lines, always reload model refactored a few things and did what the commit message says on the tin * code aestheticization * fix compiler warnings * in-series multithreading for prompt embedding? added commented-out code to attempt to start implementing mutlithreading for embedding in main * remove unnecessary multithreading * interim fix memory leak * translated everything but PCA (I think) * tentatively translate the rest * fix ggml errors and make new ones at least it compiles and runs * fix cb_eval * temporary commit while I move dev environments it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent * update debug statements * pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped * update comments * (wip) refactor * clean up PCA ggml implementation * fix shape of v_diff_original * add n_batch for pca * working version * remember to copy back the last_eigenvector * fix n_completions * bring back n_completions * default n_pca_batch to 20 * fix macos build * add to makefile all targets * use ggml_format_name * add readme * fix .editorconfig * use ggml_backend_tensor_copy * attemp to fix compile problem on mac * fix compile warn * reuse allocr * move param parser to common * better error handling * clean up a bit * add print_usage * shorten help msg * beautify help msg * escape prompt by default * change compile target to llama-cvector-generator * typo * disable GPU for PCA * code style --------- Co-authored-by: Christian Zhou-Zheng <christianzhouzheng@gmail.com>
2024-06-14llama-bench : fix RPC indication (#7936)Radoslav Gerganov
Show "<backend_name>+RPC" when RPC offloading is used
2024-06-13move BLAS to a separate backend (#6210)slaren
* move BLAS to a separate backend * rename GGML_USE_OPENBLAS to GGML_USE_BLAS * alloc : reuse same buffer when the same buffer type if used multiple times * set number of threads automatically for openblas and blis * sched : print assignments when GGML_SCHED_DEBUG env variable is set * sched : allow ops with weights on an incompatible buffer type This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13`build`: rename main → llama-cli, server → llama-server, llava-cli → ↵Olivier Chafik
llama-llava-cli, etc... (#7809) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-12server : restore numeric prompts (#7883)Georgi Gerganov
2024-06-11llama-bench: more compact markdown tables (#7879)Johannes Gäßler
2024-06-11json: refine constraint for whitespace to avoid runaways yet allow pretty ↵Olivier Chafik
print (#7866)
2024-06-11`json`: document schema conversion in GBNF readme, align manual grammar ↵Olivier Chafik
examples & converters (#7841) * json: fix char pattern in grammar converters * json: prevent number precision & whitespace runaways in example grammars * json: add doc to grammar readme
2024-06-10examples : remove --instruct remnants (#7846)Georgi Gerganov
2024-06-10server : improve "prompt" handling (#7847)Georgi Gerganov
2024-06-09imatrix : handle partial entries (#7833)Georgi Gerganov
2024-06-09server: do not remove whitespace at the start of a completion chunk (#7830)mgroeber9110
2024-06-09Revert "[SYCL] Update rpc-server.cpp to include SYCL backend (#7682)" (#7808)slaren
This reverts commit 9422c5e34bbd302493b77a8f6d546154a1f4fe82.
2024-06-08server : smart slot selection using Longest Common Prefix (#7728)sasha0552
* server : Smart selection of available slot using Longest Common Substring * add usage * remove trailing whitespaces * Use Longest Common Prefix (LCP) instead of LCS * Rename argument
2024-06-07gguf-split : change binary multi-byte units to decimal (#7803)Christian Zhou-Zheng