Age | Commit message | Author
2023-12-15ggml : group mul_mat_id rows by matrix (cpu only) (#4480)slaren
* ggml : group mul_mat_id rows by matrix (cpu only)
* remove mmid parameters from mm forward
* store row groups in wdata and calculate only once in GGML_TASK_INIT
  ggml-ci
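As a rough illustration of the row-grouping idea named in this commit (not the ggml implementation), the sketch below buckets row indices by the matrix each row selects, so that each matrix can be processed once against its whole group of rows. The function name `group_rows_by_matrix` and the use of `std::vector` are assumptions made for the example; only `n_as` (the number of matrices) echoes a name that appears elsewhere in this log.

```cpp
#include <cassert>
#include <vector>

// Bucket row indices by the matrix id each row selects.
// ids[r] is the matrix chosen for row r, with 0 <= ids[r] < n_as.
// Returns groups[m] = row indices that use matrix m, in original order.
std::vector<std::vector<int>> group_rows_by_matrix(const std::vector<int> & ids, int n_as) {
    std::vector<std::vector<int>> groups(n_as);
    for (int r = 0; r < (int) ids.size(); ++r) {
        assert(ids[r] >= 0 && ids[r] < n_as);
        groups[ids[r]].push_back(r);
    }
    return groups;
}
```

With `ids = {2, 0, 2, 1, 0}` and `n_as = 3`, rows 1 and 4 land in group 0, row 3 in group 1, and rows 0 and 2 in group 2. Computing the groups once and reusing them matches the intent of the "calculate only once" item, though where the real code stores them is not shown here.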
2023-12-14ggml : use ggml_row_size where possible (#4472)slaren
* ggml : use ggml_row_size where possible
  ggml-ci
* ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14ggml : remove n_dims from ggml_tensor (#4469)slaren
ggml-ci
2023-12-14py : add protobuf dependency (#4466)wonjun Jang
2023-12-14ggml : add ggml_row_size() (fixes llama out of space) (#4461)LostRuins
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
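The commit message attributes the out-of-space error to floating-point imprecision in size calculations. A minimal sketch of the integer-only alternative is shown below, assuming a type that packs `blck_size` elements into `type_size` bytes. The helper `row_size` and its parameters are hypothetical and are not the signature of `ggml_row_size()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for one row of ne elements when the type packs
// blck_size elements into type_size bytes. Integer arithmetic only,
// so no rounding is involved.
std::size_t row_size(std::size_t type_size, std::int64_t blck_size, std::int64_t ne) {
    assert(blck_size > 0 && ne >= 0);
    assert(ne % blck_size == 0); // rows are assumed to hold whole blocks
    return type_size * (std::size_t) (ne / blck_size);
}
```

For example, with an illustrative layout of 18 bytes per 32 elements, `row_size(18, 32, 4096)` is 128 blocks of 18 bytes, or 2304 bytes. Dividing before multiplying also keeps the intermediate value small.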
2023-12-14ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453)Georgi Gerganov
2023-12-14convert : support loading vocab from fast tokenizer config (#3633)wonjun Jang
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variable/functions, add types to class variable and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14readme : update supported model list (#4457)BarfingLemurs
2023-12-13server : fix handling of characters that span multiple tokens when streaming (#4446)shibe2
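When text is streamed token by token, a multi-byte UTF-8 character can be split across tokens, so a chunk may end in the middle of a character. One way to handle this, sketched below under the assumption that the bytes are otherwise valid UTF-8, is to emit only the complete prefix and hold the trailing partial sequence back until more bytes arrive. The helper `utf8_complete_prefix` is hypothetical and is not taken from the server code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return how many leading bytes of s form complete UTF-8 sequences.
// A trailing partial sequence is excluded so it can be held back until
// the next token arrives. Assumes the bytes are otherwise valid UTF-8.
std::size_t utf8_complete_prefix(const std::string & s) {
    const std::size_t n = s.size();
    // The last lead byte is at most 4 bytes from the end.
    for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = (unsigned char) s[n - back];
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte, keep looking for the lead byte
        }
        std::size_t need = 1;                  // single-byte (ASCII)
        if      ((c & 0xE0) == 0xC0) need = 2; // 110xxxxx
        else if ((c & 0xF0) == 0xE0) need = 3; // 1110xxxx
        else if ((c & 0xF8) == 0xF0) need = 4; // 11110xxx
        // Hold the sequence back if not all of its bytes have arrived.
        const std::size_t keep = back < need ? n - back : n;
        assert(keep <= n);
        return keep;
    }
    return n;
}
```

A caller would send `s.substr(0, utf8_complete_prefix(s))` to the client and keep the remainder in a buffer that is prepended to the next token's text.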
2023-12-13sync : ggml (SD ops, tests, kernels) (#4444)Georgi Gerganov
* sync : ggml (SD ops, tests, kernels)
  ggml-ci
* cuda : restore im2col
  ggml-ci
* metal : fix accuracy of dequantization kernels
  ggml-ci
* cuda : restore correct im2col
  ggml-ci
* metal : try to fix moe test by reducing expert size
  ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
  ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13build : detect host compiler and cuda compiler separately (#4414)Jared Van Bortel
2023-12-13common : add `--version` option to show build info in CLI (#4433)Siwen Yu
2023-12-13readme : update hot topicsGeorgi Gerganov
2023-12-13llama : add Mixtral support (#4406)slaren
* convert : support Mixtral as LLAMA arch
* convert : fix n_ff typo
* llama : model loading
* ggml : sync latest ggml_mul_mat_id
* llama : update graph to support MoE
* llama : fix cur -> cur_expert
* llama : first working version
* llama : fix expert weighting in the FFN
* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
* ggml : add n_as argument to ggml_mul_mat_id
* ggml : fix ggml_get_rows to take into account ne02 / ne11
* metal : add more general support for ggml_get_rows + tests
* llama : add basic support for offloading moe with CUDA
* metal : add/mul/div use general kernel when src1 not cont
* metal : reduce the kernel launches for ggml_mul_mat_id
* ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D
* ggml : update get_rows f16 and q
* cuda : support non-contiguous src1 in get_rows
* llama : offload missing ffn_moe_silu
* metal : fix ggml_get_rows to work with non-cont src1
* metal : add indirect mat-vec kernels for all quantization types
* llama : do not quantize expert gating tensors
* llama : add n_expert and n_expert_used to hparams + change quants
* test-backend-ops : add moe test
* cuda : fix get_rows when ncols is odd
* convert : determine n_ctx correctly
* metal : fix ggml_mul_mat_id for F32
* test-backend-ops : make experts more evenly probable (test_moe)
* test-backend-ops : cleanup, add moe test for batches
* test-backend-ops : add cpy from f32 -> all types test
* test-backend-ops : fix dequantize block offset
* llama : fix hard-coded number of experts
* test-backend-ops : simplify and disable slow tests to avoid CI timeout
* test-backend-ops : disable MOE test with thread sanitizer
* cuda : fix mul_mat_id with multi gpu
* convert : use 1e6 rope_freq_base for mixtral
* convert : fix style
* convert : support safetensors format
* gguf-py : bump version
* metal : add cpy f16 -> f32 kernel
* metal : fix binary ops for ne10 % 4 != 0
* test-backend-ops : add one more sum_rows test
* ggml : do not use BLAS with ggml_mul_mat_id
* convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
* convert : use sentencepiece tokenizer for Mixtral-instruct
* convert : make flake8 happy
* metal : fix soft_max kernels
  ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92
* metal : limit kernels to not use more than the allowed threads
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
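The items on expert weighting, `n_expert` and `n_expert_used` refer to mixture-of-experts routing, where each token is sent to a few experts and their outputs are combined by weight. A minimal sketch of one common formulation follows: softmax over all expert logits, keep the top `n_used`, then renormalize the kept weights so they sum to 1. The function `route_top_k` is hypothetical, and whether it matches the graph built in llama.cpp step for step is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Choose the n_used highest-scoring experts for one token and return
// (expert id, weight) pairs. Weights are softmax probabilities over all
// experts, renormalized so that the selected ones sum to 1.
std::vector<std::pair<int, float>> route_top_k(const std::vector<float> & logits, int n_used) {
    const int n_expert = (int) logits.size();
    assert(n_used > 0 && n_used <= n_expert);

    // Softmax over all experts, subtracting the max for numerical stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_expert);
    float sum = 0.0f;
    for (int e = 0; e < n_expert; ++e) {
        probs[e] = std::exp(logits[e] - max_logit);
        sum += probs[e];
    }
    for (float & p : probs) p /= sum;

    // Order expert ids by descending probability; only the first n_used matter.
    std::vector<int> order(n_expert);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + n_used, order.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    float sum_sel = 0.0f;
    for (int k = 0; k < n_used; ++k) sum_sel += probs[order[k]];

    std::vector<std::pair<int, float>> out;
    for (int k = 0; k < n_used; ++k) {
        out.emplace_back(order[k], probs[order[k]] / sum_sel);
    }
    return out;
}
```

The token's output would then be the weighted sum of the selected experts' feed-forward outputs, using the returned weights.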
2023-12-12server : tweak default sampling parameters (#4367)kalomaze
* Set a more typical Top P setting as the default
* Update temp max
2023-12-12english : use `typos` to fix comments and logs (#4354)Richard Kiss
2023-12-12build : target Windows 8 for standard mingw-w64 (#4405)Jared Van Bortel
* build : target Windows 8 for standard mingw-w64
* make : fix missing console.o deps
  This was causing a link error with `make all` on Windows.
2023-12-12llama : document logits_all deprecation (#4418)crasm
llama_context_params.logits_all is a parameter for controlling llama_eval. This documents that logits_all should not be used with llama_decode and llama_batch.
2023-12-12server : fix local model name in server (#4420)Vladimir Zorin
2023-12-12ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424)Taikono-Himazin
2023-12-10Update README.md (#4388)Yueh-Po Peng
Fix small typo.
2023-12-09grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396)Xiang (Kevin) Li
2023-12-07sync : ggml (new ops, tests, backend, etc.) (#4359)Georgi Gerganov
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
  ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
  ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
  This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
  ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
  ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07llama : per-layer KV cache + quantum K cache (#4309)Georgi Gerganov
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
  ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
  ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
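The F32 -> Q8_0 copy kernels listed in this commit convert float values into a blockwise 8-bit format for the quantized ("quantum") K cache. The sketch below shows the general shape of such a conversion: each block stores one scale and 32 signed 8-bit values. The block length of 32, the `float` scale, and the names `BlockQ8`, `quantize_q8` and `dequantize_at` are assumptions chosen for illustration and do not reproduce the ggml kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int BLOCK = 32; // assumed block length for this sketch

struct BlockQ8 {
    float       d;          // per-block scale (kept as float for simplicity)
    std::int8_t qs[BLOCK];  // quantized values in [-127, 127]
};

// Quantize x (length must be a multiple of BLOCK) into 8-bit blocks.
std::vector<BlockQ8> quantize_q8(const std::vector<float> & x) {
    assert(x.size() % BLOCK == 0);
    std::vector<BlockQ8> out(x.size() / BLOCK);
    for (std::size_t b = 0; b < out.size(); ++b) {
        // The largest magnitude in the block maps to +/-127.
        float amax = 0.0f;
        for (int i = 0; i < BLOCK; ++i) {
            amax = std::max(amax, std::fabs(x[b * BLOCK + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < BLOCK; ++i) {
            out[b].qs[i] = (std::int8_t) std::lround(x[b * BLOCK + i] * id);
        }
    }
    return out;
}

// Recover an approximate float value for element i.
float dequantize_at(const std::vector<BlockQ8> & q, std::size_t i) {
    return q[i / BLOCK].d * (float) q[i / BLOCK].qs[i % BLOCK];
}
```

The round trip is lossy: each recovered value differs from the original by at most about half the block scale, which is the trade made for the smaller cache.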
2023-12-07train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)Hongyu Ouyang
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;
    ... (many lines in between)
    if (alloc) {
        ggml_allocr_free(alloc);
    }

Which is correct, but it's easy to lose context after many lines in between.
On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below, we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
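The pattern described in this commit message can be made safe by resetting the pointer whenever it is freed, so that a later `if (alloc)` guard sees a null pointer. The sketch below shows that conventional remedy with a stand-in type. `Allocr`, `allocr_new` and `allocr_free_and_reset` are hypothetical names, and this is not necessarily how the commit itself resolved the bug.

```cpp
#include <cassert>
#include <cstdlib>

static int g_free_calls = 0;   // counts frees so the sketch can be checked

struct Allocr { int dummy; };  // hypothetical stand-in for ggml_allocr

Allocr * allocr_new() {
    return (Allocr *) std::malloc(sizeof(Allocr));
}

// Free the allocator and reset the caller's pointer, so that a later
// guarded cleanup does nothing instead of freeing a second time.
void allocr_free_and_reset(Allocr *& alloc) {
    if (alloc) {
        std::free(alloc);
        ++g_free_calls;
        alloc = nullptr;
    }
}

int run_training_sketch() {
    Allocr * alloc = nullptr;

    alloc = allocr_new();
    allocr_free_and_reset(alloc);  // eager free

    alloc = allocr_new();
    allocr_free_and_reset(alloc);  // eager free again

    // ... many lines later: the guarded cleanup is now a no-op
    assert(alloc == nullptr);
    if (alloc) {
        allocr_free_and_reset(alloc);
    }
    return g_free_calls;
}
```

Two allocations lead to exactly two frees. Without the reset, the final guard would pass a dangling pointer to the free function a third time.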
2023-12-06server : recognize cache_prompt parameter in OAI API (#4347)Georgi Gerganov
2023-12-06common : fix compile warningGeorgi Gerganov
2023-12-06speculative : support `--color` (#4343)stduhpf
* speculative: add some colors
* minor : add braces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05grammar : pre-computed pieces + reserve mem + less string copies (#4330)Marcus Dunn
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve candidates_decoded
* reserve candidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
  This reverts commit 3773328080e6a139ee83198329a13cf4ff61d707.
* changed decode_utf8 to take src by ref
2023-12-05llama : allow overriding GGUF metadata when loading model (#4092)Kerfuffle
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
  Add informational output when overrides are applied
  Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
  Fix issue where overrides didn't apply when key missing in GGUF metadata
  Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
  Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05sampling : custom samplers order (#4285)MaggotHATE
* Samplers sequence order w parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
2023-12-05swift : revert compiler checks for swift package (#4332)kchro3
2023-12-04simple : update error message for KV cache check (#4324)Daniel Bevenius
This commit updates the error message that is printed when the KV cache is not big enough to hold all the prompt and generated tokens. Specifically it removes the reference to n_parallel and replaces it with n_len.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04swift : fix concatenation method to avoid invalid UTF8 stringfication (#4325)Miwa / Ensan
2023-12-04swift : fix prompt tokenization logic (#4321)Miwa / Ensan
2023-12-04grammar-parser : fix typo (#4318)Ikko Eltociear Ashimine
preceeding -> preceding
2023-12-03ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308)Georgi Gerganov
* ggml : fix soft max out-of-bounds access
  ggml-ci
* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()
  ggml-ci
2023-12-03ggml : fix soft max out-of-bounds access (#4307)Georgi Gerganov
ggml-ci
2023-12-03server : fix OpenAI API `stop` field to be optional (#4299)Ed Lee
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84ae3bcbf0d617b7ee6a5413bcbd58af)
2023-12-03py : add grammar to oai like api (#4294)Rickard Edén
2023-12-03llama : pad KV cache size (#4280)Georgi Gerganov
* llama : pad KV cache size to 32
* metal : try to improve batched decoding
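Padding a size to 32 means rounding it up to the next multiple of 32. A minimal sketch of that rounding is shown below. The helper `pad_to_multiple` is hypothetical, and exactly which quantity llama.cpp pads is not stated in this log entry.

```cpp
#include <cassert>
#include <cstdint>

// Round n up to the next multiple of pad (pad must be positive).
std::uint32_t pad_to_multiple(std::uint32_t n, std::uint32_t pad) {
    assert(pad > 0);
    return (n + pad - 1) / pad * pad;
}
```

For instance, `pad_to_multiple(33, 32)` gives 64, while a value that is already a multiple of 32 is returned unchanged. Sizes rounded this way let kernels work on full, fixed-width chunks.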
2023-12-01llama : avoid using "optional" keyword (#4283)Georgi Gerganov
2023-12-01llama : support optional tensors (#4283)Georgi Gerganov
2023-12-01swift : fix token_to_piece implementation (#4278)Miwa / Ensan
* Fix token_to_piece implementation in Swift
* Fix errors
2023-12-01build : enable libstdc++ assertions for debug builds (#4275)Jared Van Bortel
2023-12-01llama : support attention bias on LLaMA architecture (#4283)CausalLM
* Support attention_bias on LLaMA architecture
  QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608).
* check existence of qkvo bias while loading llama models
  Tested on LLaMA2, CUDA and CPU.
* Update llama.cpp
2023-12-01llama : add Qwen support (#4281)Shijie
* enable qwen to llama.cpp
* llama : do not GPU split bias tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-01llama : fix integer overflow during quantization (#4284)Georgi Gerganov
happens with multi-threaded quantization of Qwen-72B
ggml-ci
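Integer overflows of this kind typically arise when two 32-bit counts are multiplied and the product exceeds the 32-bit range, which becomes likely with very large tensors. The sketch below shows the usual remedy of widening one operand before the multiplication. The function `row_offset` and its parameters are hypothetical, since the log entry does not say which expression overflowed.

```cpp
#include <cassert>
#include <cstdint>

// Element offset of a given row in a flat buffer with n_per_row
// elements per row. Casting before the multiply makes the product
// 64-bit, so it cannot wrap for any pair of non-negative int inputs.
std::int64_t row_offset(int row, int n_per_row) {
    assert(row >= 0 && n_per_row >= 0);
    return (std::int64_t) row * n_per_row;
}
```

With `row = 100000` and `n_per_row = 100000` the result is 10,000,000,000, which does not fit in a 32-bit `int`. Writing `row * n_per_row` without the cast would compute the product in `int` first.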
2023-12-01py : add requirements file for convert-hf-to-gguf.py (#4277)Daniel Bevenius
This commit adds a requirements file for the convert-hf-to-gguf.py script, and also adds the torch and transformers packages to it.
The motivation for this is that currently running convert-hf-to-gguf.py will produce the following error:
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
Collecting numpy==1.24.4
Collecting sentencepiece==0.1.98
Collecting gguf>=0.1.0
Installing collected packages: sentencepiece, numpy, gguf
Successfully installed gguf-0.5.1 numpy-1.24.4 sentencepiece-0.1.98
(venv) $ python convert-hf-to-gguf.py --help
Traceback (most recent call last):
  File "llama.cpp/convert-hf-to-gguf.py", line 16, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```
With this commit, and using requirements-hf-to-gguf.txt instead of requirements.txt, the script can be run and shows the help output.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-01ggml : add ggml_soft_max_ext (#4256)Georgi Gerganov
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
  ggml-ci
* metal : simplify soft_max encoding
  ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
  ggml-ci
* alloc : fix build with debug
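An extended softmax of this kind is generally understood as a fused form of scaling, masking and softmax, that is, softmax(scale * x + mask) computed in one pass. The sketch below shows that computation for a single row on the CPU. The function `soft_max_ext_row` and its signature are hypothetical, and it assumes that at least one element of the row is not masked to negative infinity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compute softmax(scale * x + mask) for one row. An empty mask means
// "no mask". Assumes at least one element is not masked to -infinity.
std::vector<float> soft_max_ext_row(const std::vector<float> & x,
                                    const std::vector<float> & mask,
                                    float scale) {
    assert(!x.empty());
    assert(mask.empty() || mask.size() == x.size());

    std::vector<float> y(x.size());
    float max_val = -INFINITY;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = scale * x[i] + (mask.empty() ? 0.0f : mask[i]);
        max_val = std::max(max_val, y[i]);
    }

    // Subtract the row maximum before exponentiating for stability.
    float sum = 0.0f;
    for (float & v : y) {
        v = std::exp(v - max_val);
        sum += v;
    }
    for (float & v : y) v /= sum;
    return y;
}
```

Masked positions (mask value of negative infinity) end up with probability 0, which is how attention masking is usually expressed. Fusing the three steps avoids materializing the scaled and masked intermediates as separate tensors.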