path: root/llama.cpp
Age         Commit message                                                        Author
2024-01-24  llama : pre-allocate input tensors in a separate buffer (#5100)  [slaren]
2024-01-23  minor : clean-up some warnings and style (#5094)  [Georgi Gerganov]
    * minor : clean-up some warnings and style ggml-ci
    * ggml : add comment
2024-01-22  llama : fix not enough space in buffer with Qwen (#5086)  [slaren]
2024-01-22  llama : support StableLM 2 1.6B (#5052)  [compilade]
    * llama : support StableLM 2 1.6B
    * convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]
    * convert : refactor Qwen's set_vocab to use it for StableLM 2 too
    * nix : add tiktoken to llama-python-extra
    * convert : use presence of tokenizer.json to determine StableLM tokenizer loader
      It's a less arbitrary heuristic than the vocab size.
2024-01-22  llama : add Q3_K_XS (#5060)  [Kawrakow]
    * Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
    * Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K
      Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below
      the perplexity of the former Q2_K, with an 800 MB smaller quantized model size.
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
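A minimal sketch of the layer-mixing rule described above; the enum and function name are hypothetical and only illustrate the "first 1/8 of ffn_down gets Q4_K" idea, not the actual llama.cpp quantization code.

```cpp
// Hypothetical illustration: give the first 1/8 of the ffn_down layers a
// higher-bit type, as the commit message describes. Names are made up.
enum ffn_down_qtype { FFN_DOWN_Q3_K, FFN_DOWN_Q4_K };

static ffn_down_qtype choose_ffn_down_type(int i_layer, int n_layer) {
    // the first n_layer/8 ffn_down tensors get the larger Q4_K type
    if (i_layer < n_layer / 8) {
        return FFN_DOWN_Q4_K;
    }
    return FFN_DOWN_Q3_K;
}
```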
2024-01-22  llama : add more qwen2 models (#5071)  [Shijie]
2024-01-20  llama : run all KQV ops on the CPU with no KV offload (#5049)  [slaren]
    ggml-ci
2024-01-19  llama : support upcoming Qwen2 (#5037)  [Shijie]
2024-01-19  llama : add CodeShell support (#5016)  [chiranko]
    * llama: add codeshell support
    * llama.cpp: fix codeshell with NeoX rope
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19  llama : fix falcon arch for tied output embeddings (#4978)  [John]
    * falcon arch fix for tied output embeddings
    * Update llama.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update llama.cpp
    * Update llama.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update llama.cpp
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-18  llama : fix mlock with no-mmap with Metal (#5025)  [slaren]
2024-01-17  ggml : add IQ2 to test-backend-ops + refactoring (#4990)  [Georgi Gerganov]
    * ggml : add IQ2 to test-backend-ops + refactoring ggml-ci
    * cuda : update supports_op for IQ2 ggml-ci
    * ci : enable LLAMA_CUBLAS=1 for CUDA nodes ggml-ci
    * cuda : fix out-of-bounds access in `mul_mat_vec_q` ggml-ci
    * tests : avoid creating RNGs for each Q tensor ggml-ci
    * tests : avoid creating RNGs for each tensor ggml-ci
2024-01-17  backend : add eval callback (#4935)  [Georgi Gerganov]
    * backend : add eval callback ggml-ci
    * backend : group nodes in a single compute when the user doesn't need them
    * backend : clean-up the implementation ggml-ci
    * simple : do not perform tensor data copy if not needed
    * simple : fix
    * simple : no need for ggml_is_contiguous + fix bool parse
    * llama : fix callback placement in llama_context_params
    * backend : avoid double-ask callback calls
    * simple : restore examples, imatrix will serve as a demo
2024-01-17  llama : use Q4_K for attn_v for Q2_K_S when n_gqa >= 4 (#4996)  [Kawrakow]
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-16  ggml : importance matrix support for legacy quants (#4969)  [Kawrakow]
    * imatrix: adding support for legacy quants
    * imatrix: guard Q4_0/Q5_0 against ffn_down craziness
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-15  llama : apply classifier-free guidance to logits directly (#4951)  [David Friehs]
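As a hedged illustration of what "apply classifier-free guidance to logits directly" can look like: the guided logits blend the conditional and unconditional distributions. The exact formula and names used by llama.cpp may differ.

```cpp
#include <vector>

// Illustrative sketch of logit-level classifier-free guidance: push the
// conditional logits away from the unconditional ones by a guidance scale.
// The exact formula and variable names in llama.cpp may differ.
static void apply_cfg(std::vector<float> & logits_cond,
                      const std::vector<float> & logits_uncond,
                      float scale) {
    for (size_t i = 0; i < logits_cond.size(); ++i) {
        logits_cond[i] = logits_uncond[i] + scale * (logits_cond[i] - logits_uncond[i]);
    }
}
```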
2024-01-15  llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950)  [Kawrakow]
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
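A hypothetical guard illustrating the constraint behind this check: the IQ2 types pack weights in 256-element super-blocks, so a tensor row that is not a multiple of 256 cannot use them. The function below is a sketch, not the actual llama.cpp code.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch of the divisibility check; names and error handling are illustrative.
static void check_iq2_row_size(int64_t n_per_row) {
    if (n_per_row % 256 != 0) {
        throw std::runtime_error("row size " + std::to_string(n_per_row) +
                                 " is not divisible by 256; IQ2_XS/IQ2_XXS cannot be used");
    }
}
```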
2024-01-14  llama : fix missing quotes (#4937)  [David Pflug]
2024-01-14  llama : check LLAMA_TRACE env for extra logging (#4929)  [Georgi Gerganov]
    * llama : minor fix indent
    * llama : check LLAMA_TRACE env for extra logging ggml-ci
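A minimal sketch of the pattern this commit adds, assuming only that the LLAMA_TRACE environment variable gates extra logging; the real check and log macros in llama.cpp are more involved.

```cpp
#include <cstdlib>

// Sketch: enable extra (trace-level) logging when LLAMA_TRACE is set in the
// environment. The actual llama.cpp logging macros differ in detail.
static bool trace_enabled() {
    return std::getenv("LLAMA_TRACE") != nullptr;
}
```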
2024-01-14  llama : use LLAMA_LOG_ macros for logging  [Georgi Gerganov]
2024-01-14  Fix ffn_down quantization mix for MoE models (#4927)  [Kawrakow]
    * Fix ffn_down quantization mix for MoE models
      In #4872 I did not consider the part where every third tensor is quantized with more
      bits. For MoE this leads to tensors of the same layer being quantized with different
      numbers of bits, which the inference implementation does not account for (it assumes
      all experts use the same quantization).
    * Fix the fix
    * Review suggestion
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
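A toy sketch of the constraint the fix addresses: for MoE models the "every third tensor gets more bits" rule has to key on the layer index so all experts of a layer receive the same type. The names and the modulo rule below are illustrative assumptions, not the real selection logic.

```cpp
// Hypothetical illustration: decide the ffn_down type per layer rather than
// per tensor, so every expert in a MoE layer gets the same quantization type.
enum moe_qtype { MOE_MORE_BITS, MOE_FEWER_BITS };

static moe_qtype ffn_down_type_for_layer(int i_layer) {
    // keyed on the layer index (shared by all experts of that layer),
    // not on a running tensor counter that differs per expert
    return (i_layer % 3 == 0) ? MOE_MORE_BITS : MOE_FEWER_BITS;
}
```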
2024-01-14  llama : support WinXP build with MinGW 8.1.0 (#3419)  [Karthik Kumar Viswanathan]
2024-01-14  2-bit quantizations (#4897)  [Kawrakow]
    * imatrix: load
    * imatrix: WIP
    * imatrix: Add Q2_K quantization
    * imatrix: also guard against Q2_K_S quantization without importance matrix
    * imatrix: guard even more against low-bit quantization misuse
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14  Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906)  [Kawrakow]
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-13  metal : remove old API (#4919)  [Georgi Gerganov]
    ggml-ci
2024-01-13  llama : fix detokenization of non-special added-tokens (#4916)  [Georgi Gerganov]
    Co-authored-by: goerch <jhr.walter@t-online.de>
2024-01-13  llama : minimize size used for state save/load (#4820)  [David Friehs]
    * examples : save-load-state: save only required state
    * llama : only reserve n_vocab * n_batch at most for logits
      llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is
      expected to be bigger than n_batch.
    * llama : always reserve n_vocab * n_batch for logits
      llama_context de-serialization breaks if the contexts have differing capacity for
      logits, and llama_decode will at maximum resize to n_vocab * n_batch.
    * llama : only save and restore used logits
      For batch sizes of 512 this reduces the saved state in the best case by around 62 MB,
      which can be a lot if planning to save on each message to allow regenerating messages.
    * llama : use ostringstream and istringstream for save and load
    * llama : serialize rng into minimum amount of space required
    * llama : break session version due to serialization changes
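A back-of-the-envelope sketch of the sizing argument above; the vocab, batch, and context sizes are assumed values for illustration only.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Assumed, illustrative sizes (roughly LLaMA-scale vocab, common defaults)
    const std::size_t n_vocab = 32000;
    const std::size_t n_batch = 512;
    const std::size_t n_ctx   = 4096;

    const double mb = 1024.0 * 1024.0;
    // logits are float32, one row of n_vocab values per token
    std::printf("reserve for n_ctx  : %.1f MB\n", n_vocab * n_ctx   * sizeof(float) / mb);
    std::printf("reserve for n_batch: %.1f MB\n", n_vocab * n_batch * sizeof(float) / mb);
    return 0;
}
```

With these assumed numbers, reserving logits per n_batch instead of per n_ctx is the difference between roughly 62 MB and 500 MB in the saved state.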
2024-01-13  convert : update phi-2 to latest HF repo (#4903)  [Georgi Gerganov]
    * convert : update phi-2 to latest HF repo ggml-ci
    * py : try to fix flake stuff
2024-01-12  llama : ggml-backend integration (#4766)  [slaren]
    * llama : ggml-backend integration
    * ggml-backend : add names to buffers
    * fix unmap after loading
    * batched-bench : add tensor_split param
    * llama : check for null tensor_split
    * ggml-backend : increase GGML_MAX_BACKENDS
    * improve graph splitting, partial fix for --no-kv-offload
    * cuda : add ggml-backend split buffer support
    * cuda : do not create buffer types for devices that don't exist
      (fixes usage without CUDA devices available)
    * ggml : fix null backend dereference (#4807)
    * ggml : fix null backend dereference
    * ggml : also check ggml_backend_is_cpu
    * test-backend-ops : check buffer allocation failures
    * llama : add cparam (split_mode) and command line argument (--split-mode, -sm)
      to configure the split mode (none, layer or row)
    * ggml : fix mul_mat_id work size
    * llama : rewrite session kv load/set without graphs
    * minor
    * llama : only initialize used backends, free backends on context free
    * llama : abort ctx if cuda backend init fails
    * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci
    * llama : only map to a backend buffer the region of the file mapping containing
      the tensors used in the buffer
    * opencl : add ggml-backend buffer type
    * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
    * llama : on Metal, by default offload the full model ggml-ci
    * metal : page align the data ptr (#4854)
    * Apply suggestions from code review
      Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
    * cuda : fix split buffer free
    * address review comments
    * llama-bench : add split-mode parameter
    * fix whitespace
    * opencl : fix double initialization
    * server : add --split-mode parameter
    * use async copy and compute to improve multi-gpu performance ggml-ci
    * use async memcpys to copy the graph outputs to the CPU
    * fix opencl
    * use a host buffer for the cpu compute buffer for faster copies to the gpu
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
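The three split modes named in this entry (none, layer, row) can be modelled roughly as below; this is a hedged sketch of the concept behind --split-mode, not the actual enum, default, or option parsing in llama.cpp.

```cpp
#include <cstring>

// Rough illustration of the split modes mentioned above; names and the
// assumed default are not taken from the real llama.cpp sources.
enum split_mode { SPLIT_NONE, SPLIT_LAYER, SPLIT_ROW };

static split_mode parse_split_mode(const char * arg) {
    if (std::strcmp(arg, "none")  == 0) return SPLIT_NONE;  // single device, no splitting
    if (std::strcmp(arg, "layer") == 0) return SPLIT_LAYER; // whole layers per device
    if (std::strcmp(arg, "row")   == 0) return SPLIT_ROW;   // split individual tensors by rows
    return SPLIT_LAYER; // assumed default
}
```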
2024-01-12  llama : remove redundant assert for StableLM (#4901)  [Georgi Gerganov]
2024-01-12  llama : fix typo "imp_embd" -> "inp_embd"  [Georgi Gerganov]
2024-01-12  llama : fix llm_build_k_shift to use correct n_rot (#4889)  [Georgi Gerganov]
    * llama : fix llm_build_k_shift to use correct n_rot ggml-ci
    * llama : always use hparams.n_rot for ggml_rope_custom ggml-ci
    * convert : fix persimmon conversion to write correct n_rot
2024-01-11  llama : restore intended k-quants mixes for MoE models (#4872)  [Kawrakow]
    * Restore intended k-quants quantization mixes for MoE models
    * Update Q2_K_S values in the quantize tool
      Still using LLaMA-v1 PPL values in the quant description today does not make much
      sense. But let's leave this update for another PR.
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11  ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)  [Kawrakow]
    * iq2_xs: basics
    * iq2_xs: this should have been in the basics
    * iq2_xs: CUDA and scalar CPU works
    * iq2_xs: WIP Metal
    * iq2_xs: Metal now works
    * iq2_xs: working, but dog slow, ARM_NEON dot product
    * iq2_xs: better ARM_NEON dot product
      We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU.
    * iq2_xs: AVX2 dot product - 19.5 t/s
    * iq2_xs: faster AVX2 dot product
      21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version.
    * iq2_xs: had forgotten to delete iq2-data.h
    * Add llama enum for IQ2_XS
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11  main : print total token count and tokens consumed so far (#4874)  [pudepiedj]
    * Token count changes
    * Add show token count
    * Updating before PR
    * Two requested changes
    * Move param def posn
2024-01-10  llama : add additional suffixes for model params (#4834)  [Brian]
    * llm_load_print_meta: Add additional suffixes for model params
    * Update llama.cpp model param log
      Remove unneeded comments and convert from > to >=.
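A rough sketch of what such a suffix selection can look like; the thresholds, format string, and function name are assumptions for illustration, not the actual llm_load_print_meta code.

```cpp
#include <cstdio>

// Hypothetical illustration: print a parameter count with a scale suffix.
static void print_model_params(double n_params) {
    if (n_params >= 1e12) {
        std::printf("model params = %.2f T\n", n_params * 1e-12);
    } else if (n_params >= 1e9) {
        std::printf("model params = %.2f B\n", n_params * 1e-9);
    } else if (n_params >= 1e6) {
        std::printf("model params = %.2f M\n", n_params * 1e-6);
    } else {
        std::printf("model params = %.2f K\n", n_params * 1e-3);
    }
}
```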
2024-01-10  llama : recognize 1B phi models (#4847)  [Austin]
    This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with
    different Phi model variants without impacting existing Phi-2 model functionality.
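Illustrative only: the commit keys the model-size label on the layer count, roughly as sketched below. The enum values and the non-24-layer cases are assumptions, not the real llama.cpp mapping.

```cpp
// Hypothetical sketch of mapping a Phi model's layer count to a size label.
enum phi_model_type { PHI_MODEL_1B, PHI_MODEL_LARGER, PHI_MODEL_UNKNOWN };

static phi_model_type phi_type_from_layers(int n_layer) {
    switch (n_layer) {
        case 24: return PHI_MODEL_1B;     // 1B-class Phi variants, per the commit
        case 32: return PHI_MODEL_LARGER; // assumed value for Phi-2
        default: return PHI_MODEL_UNKNOWN;
    }
}
```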
2024-01-08  SOTA 2-bit quants (#4773)  [Kawrakow]
    * iq2_xxs: basics
    * iq2_xxs: scalar and AVX2 dot products
      Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX
      implementation becomes very awkward. The alternative would have been to use Q8_0
      instead. Perhaps I'll change later, for now this is what we have.
    * iq2_xxs: ARM_NEON dot product
      Somehow strangely slow (112 ms/token).
    * iq2_xxs: WIP Metal
      Dequantize works, something is still wrong with the dot product.
    * iq2_xxs: Metal dot product now works
      We have PP-512 = 475 t/s, TG-128 = 47.3 t/s. Not the greatest performance, but not
      complete garbage either.
    * iq2_xxs: slightly faster dot product
      TG-128 is now 48.4 t/s
    * iq2_xxs: slightly faster dot product
      TG-128 is now 50.9 t/s
    * iq2_xxs: even faster Metal dot product
      TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared
      memory has a bigger impact than the grid values being in shared memory.
    * iq2_xxs: dequantize CUDA kernel - fix conflict with master
    * iq2_xxs: quantized CUDA dot product (MMVQ)
      We get TG-128 = 153.1 t/s
    * iq2_xxs: slightly faster CUDA dot product
      TG-128 is now at 155.1 t/s.
    * iq2_xxs: add to llama ftype enum
    * iq2_xxs: fix MoE on Metal
    * Fix missing MMQ ops when on hipBLAS
      I had put the ggml_supports_mmq call at the wrong place.
    * Fix bug in quantize_row_iq2_xxs
      The 0.25f factor was missing. Great detective work by @ggerganov!
    * Fixing tests
    * PR suggestion
    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-08  examples : add passkey test (#3856)  [Georgi Gerganov]
    * examples : add passkey test
    * passkey : better prints
    * passkey : select pass key pos from CLI
    * passkey : simplify n_past logic
    * make : add passkey target
    * passkey : add "self-extend"-like context extension (#4810)
    * llama : "self-extend"-like context extension
    * passkey : add comment
    * passkey : add readme
2024-01-07  llama : remove unused vars (#4796)  [Georgi Gerganov]
2024-01-07  llama : remove redundant GQA check (#4796)  [Georgi Gerganov]
2024-01-07  llama : print tensor meta for debugging  [Georgi Gerganov]
2024-01-02  llama : llama_model_desc print number of experts  [Georgi Gerganov]
2024-01-02  llama : replace all API facing `int`'s with `int32_t` (#4577)  [Marcus Dunn]
    * replaced all API facing `int`'s with `int32_t`
    * formatting and missed `int` in `llama_token_to_piece`
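For illustration, this is the kind of change the commit describes: fixed-width integer types in API-facing declarations instead of plain int. The struct and function below are made-up examples, not real llama.h declarations.

```cpp
#include <cstdint>

// Hypothetical example of the pattern: an API-facing getter returns int32_t,
// so the width is identical on every platform and language binding.
struct example_model {
    int32_t n_vocab;
};

// was: int example_n_vocab(const example_model * model);
static int32_t example_n_vocab(const example_model * model) {
    return model->n_vocab;
}
```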
2024-01-02  llama : differentiate the KV dims in the attention (#4657)  [postmasters]
    * Add n_key_dim and n_value_dim
      Some models use values that are not derived from `n_embd`. Also remove `n_embd_head`
      and `n_embd_gqa` because it is not clear which "head" is referred to (key or value).
      Fix issue #4648.
    * Fix `llm_build_kqv` to use `n_value_gqa`
    * Rebase
    * Rename variables
    * Fix llm_build_kqv to be more generic wrt n_embd_head_k
    * Update default values for n_embd_head_k and n_embd_head_v
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Fix llm_load_tensors: the asserts were not backcompat
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
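As a hedged sketch of why separate key/value head dimensions matter: once the key and value head sizes are tracked independently, the per-token K and V cache rows can differ. The numbers below are assumptions for illustration only.

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative sizes: a model whose value heads are narrower than its key heads.
    const int n_head_kv     = 8;
    const int n_embd_head_k = 128;
    const int n_embd_head_v = 96;

    // Per-token K and V rows, no longer both derived from n_embd.
    const int n_embd_k_gqa = n_embd_head_k * n_head_kv;
    const int n_embd_v_gqa = n_embd_head_v * n_head_kv;

    std::printf("K row: %d values, V row: %d values per token\n", n_embd_k_gqa, n_embd_v_gqa);
    return 0;
}
```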
2023-12-30  ggml : add ggml_cpu_has_avx_vnni() (#4589)  [automaticcat]
    * feat: add avx_vnni based on intel documents
    * ggml: add avx vnni based on intel document
    * llama: add avx vnni information display
    * docs: add more details about using oneMKL and oneAPI for intel processors
    * docs: add more details about using oneMKL and oneAPI for intel processors
    * docs: add more details about using oneMKL and oneAPI for intel processors
    * docs: add more details about using oneMKL and oneAPI for intel processors
    * docs: add more details about using oneMKL and oneAPI for intel processors
    * Update ggml.c
      Fix indentation update
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
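A minimal sketch of the compile-time feature check a function like ggml_cpu_has_avx_vnni() typically performs, assuming the compiler's __AVXVNNI__ predefine; the real ggml implementation may differ.

```cpp
// Sketch of a compile-time CPU-feature query in the style of ggml_cpu_has_*();
// relies on the __AVXVNNI__ macro set when building with AVX-VNNI enabled.
static int cpu_has_avx_vnni() {
#if defined(__AVXVNNI__)
    return 1;
#else
    return 0;
#endif
}
```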
2023-12-28  gpt2 : Add gpt2 architecture integration (#4555)  [manikbhandari]
2023-12-27  llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)  [Nam D. Tran]
    * update: awq support llama-7b model
    * update: change order
    * update: benchmark results for llama2-7b
    * update: mistral 7b v1 benchmark
    * update: support 4 models
    * fix: Readme
    * update: ready for PR
    * update: readme
    * fix: readme
    * update: change order import
    * black
    * format code
    * update: work for both mpt and awqmpt
    * update: readme
    * Rename to llm_build_ffn_mpt_awq
    * Formatted other files
    * Fixed params count
    * fix: remove code
    * update: more detail for mpt
    * fix: readme
    * fix: readme
    * update: change folder architecture
    * fix: common.cpp
    * fix: readme
    * fix: remove ggml_repeat
    * update: cicd
    * update: cicd
    * update: remove use_awq arg
    * update: readme
    * llama : adapt plamo to new ffn ggml-ci
    ---------
    Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
    Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-26  cuda : fix vmm pool with multi GPU (#4620)  [slaren]
    * cuda : fix vmm pool with multi GPU
    * hip
    * use recommended granularity instead of minimum
    * better error checking
    * fix mixtral
    * use cudaMemcpy3DPeerAsync
    * use cuda_pool_alloc in ggml_cuda_op_mul_mat
    * consolidate error checking in ggml_cuda_set_device
    * remove unnecessary inlines ggml-ci
    * style fixes
    * only use vmm for the main device
    * fix scratch buffer size, re-enable vmm pool for all devices
    * remove unnecessary check id != g_main_device
2023-12-24  llama : add PLaMo model (#3557)  [Shintarou Okada]
    * add plamo mock
    * add tensor loading
    * plamo convert
    * update norm
    * able to compile
    * fix norm_rms_eps hparam
    * runnable
    * use inp_pos
    * seems ok
    * update kqv code
    * remove develop code
    * update README
    * shuffle attn_q.weight and attn_output.weight for broadcasting
    * remove plamo_llm_build_kqv and use llm_build_kqv
    * fix style
    * update
    * llama : remove obsolete KQ_scale
    * plamo : fix tensor names for correct GPU offload
    ---------
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>