path: root/examples
Date        Commit message  (Author)
2024-02-21  llava : add --skip-unknown to 1.6 convert.py (#5632)  (Daniel Bevenius)
This commit adds the `--skip-unknown` option to the convert.py script and removes the saving of the updated checkpoints, to avoid modifying files that may be checked out. The motivation for this change is that the same was done for 1.5 in commit fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava : update surgery script to not remove tensors"), and it makes the examples more consistent.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-21  IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)  (Kawrakow)
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
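The idea behind a non-linear 4-bit format is that each block of 32 weights gets one scale, and each weight is snapped to the nearest entry of a small codebook that is denser near zero, rather than to a uniform 4-bit grid. A minimal Python sketch of that scheme follows; the codebook values and helper names are illustrative assumptions, not the actual kvalues_iq4nl table or ggml code.

```python
import numpy as np

# Hypothetical 16-entry non-linear codebook (NOT the real kvalues_iq4nl table):
# entries are denser near zero, where most weights live.
KVALUES = np.array([-127, -100, -79, -61, -46, -33, -21, -10,
                       0,   10,   21,  33,  46,  61,  79, 100], dtype=np.float32)

def quantize_block_nl(block: np.ndarray, kvalues: np.ndarray = KVALUES):
    """Quantize one block of 32 floats to 4-bit indices into a non-linear codebook."""
    assert block.size == 32
    # One scale per block so the largest-magnitude weight lands near the codebook edge.
    amax = np.abs(block).max()
    scale = amax / np.abs(kvalues).max() if amax > 0 else 1.0
    # For each weight, pick the nearest codebook entry (a 4-bit index).
    idx = np.abs(block[:, None] / scale - kvalues[None, :]).argmin(axis=1).astype(np.uint8)
    return scale, idx

def dequantize_block_nl(scale: float, idx: np.ndarray, kvalues: np.ndarray = KVALUES):
    return scale * kvalues[idx]

if __name__ == "__main__":
    w = np.random.randn(32).astype(np.float32)
    s, q = quantize_block_nl(w)
    print("max abs error:", np.abs(w - dequantize_block_nl(s, q)).max())
```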
2024-02-20  server : support llava 1.6 (#5553)  (CJ Pais)
* server: init working 1.6
* move clip_image to header
* remove commented code
* remove c++ style from header
* remove todo
* expose llava_image_embed_make_with_clip_img
* fix zig build
2024-02-20  llava : add explicit instructions for llava-1.6 (#5611)  (Daniel Bevenius)
This commit contains a suggestion for the README.md in the llava example. The suggestion adds explicit instructions for how to convert a llava-1.6 model and run it using llava-cli. The motivation for this is that having explicit instructions, similar to the 1.5 instructions, will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20  Server: use llama_chat_apply_template (#5593)  (Xuan Son Nguyen)
* server: use llama_chat_apply_template
* server: remove trailing space
* server: fix format_chat
* server: fix help message
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: fix formatted_chat
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20  server : health endpoint configurable failure on no slot (#5594)  (Pierrick Hymbert)
2024-02-19  examples : support minItems/maxItems in JSON grammar converter (#5039)  (nopperl)
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
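For context, minItems and maxItems constrain the allowed length of an array in a JSON schema, and the converter in examples/json-schema-to-grammar.py turns such a schema into a GBNF grammar that sampling can enforce. A small example of a schema that exercises these keywords; the field names are made up for illustration:

```python
import json

# Hypothetical schema: an object with a "tags" array of 1 to 4 strings.
# minItems/maxItems are the keywords this change teaches the converter to honor.
schema = {
    "type": "object",
    "properties": {
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,
            "maxItems": 4,
        }
    },
    "required": ["tags"],
}

# The schema would typically be written to a file and fed to
# examples/json-schema-to-grammar.py to produce a GBNF grammar.
print(json.dumps(schema, indent=2))
```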
2024-02-19  llava : remove extra cont (#5587)  (Georgi Gerganov)
2024-02-19  llava : replace ggml_cpy with ggml_cont  (slaren)
2024-02-19  llava : avoid changing the original BakLLaVA model (#5577)  (Daniel Bevenius)
This is a follow-up of commit fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava : update surgery script to not remove tensors"), but this time the change is to the BakLLaVA-specific part of the surgery script. I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-19  baby-llama : allocate graphs in ggml_context (#5573)  (NawafAlansari)
* Fixed the baby-llama issue (see issue #4830)
* minor : fix whitespaces
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-18  ci : fix wikitext url + compile warnings (#5569)  (Georgi Gerganov)
ggml-ci
2024-02-18  common, server : surface min_keep as its own parameter (#5567)  (Robey Holderith)
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
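min_keep tells the samplers (top-k, top-p, etc.) to always keep at least that many candidate tokens, and surfacing it as its own parameter lets clients set it per request. A hedged sketch of passing it to the server's /completion endpoint; the address, prompt, and other fields are assumptions:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed local server address
    json={
        "prompt": "The quick brown fox",
        "n_predict": 32,
        "top_p": 0.9,
        "min_keep": 5,  # never let the samplers prune below 5 candidate tokens
    },
)
print(resp.json()["content"])
```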
2024-02-18  server : slots monitoring endpoint (#5550)  (Pierrick Hymbert)
2024-02-18  server : enhanced health endpoint (#5548)  (Pierrick Hymbert)
* server: enrich health endpoint with available slots, return 503 if no slots are available
* server: document new status "no slot available" in the README.md
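Taken together, these endpoints let a monitor or load balancer probe the server: the health endpoint reports slot availability (and can return 503 when no slot is free), while the slots endpoint exposes per-slot state. A hedged Python sketch of such a probe; the URL and the exact response fields are assumptions based on the descriptions above:

```python
import requests

BASE = "http://127.0.0.1:8080"  # assumed local server address

# /health returns 200 when the server can accept work; per the change above it
# can report available slots and return 503 when none are free.
health = requests.get(f"{BASE}/health")
print("health:", health.status_code, health.json())

# The slots monitoring endpoint exposes the state of each processing slot.
slots = requests.get(f"{BASE}/slots")
for slot in slots.json():
    print("slot", slot.get("id"), "state:", slot.get("state"))
```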
2024-02-18  server : --n-predict option document and cap to max value (#5549)  (Pierrick Hymbert)
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix print usage LF in new --n-predict option
2024-02-18  server : graceful server shutdown (#5244)  (Daniel Hiltgen)
This updates the server queue to support graceful shutdown of the server on signals.
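The usual pattern for signal-driven graceful shutdown is that the signal handler only requests shutdown, and the serving loop finishes in-flight work before exiting. The server itself is C++; the sketch below illustrates the same pattern in Python with made-up names, not the server's actual queue code.

```python
import queue
import signal
import threading
import time

tasks: "queue.Queue[str]" = queue.Queue()
shutting_down = threading.Event()

def handle_signal(signum, frame):
    # The handler only requests shutdown; no work is done here.
    shutting_down.set()

signal.signal(signal.SIGINT, handle_signal)
signal.signal(signal.SIGTERM, handle_signal)

def worker():
    # Keep serving until shutdown is requested, then drain what remains.
    while not shutting_down.is_set() or not tasks.empty():
        try:
            task = tasks.get(timeout=0.2)
        except queue.Empty:
            continue
        print("processing", task)

t = threading.Thread(target=worker)
t.start()
tasks.put("request-1")
time.sleep(0.5)
signal.raise_signal(signal.SIGTERM)  # simulate Ctrl-C / kill for the demo
t.join()
print("shut down cleanly")
```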
2024-02-18  ggml, common, examples, tests : fixed type arguments in printf (#5528)  (Herman Semenov)
2024-02-18  llava : update surgery script to not remove tensors (#5536)  (Daniel Bevenius)
This commit updates the surgery script to not remove the tensors from the model file. For this to work, the `--skip-unknown` flag is added as an argument to the convert.py script in README.md. The motivation for this change is that the surgery script currently removes the projector tensors from the model file. If the model was checked out from a repository, the model file will have been modified and will have to be checked out again to undo this. If this can be avoided I think it would be preferable. I did not make this change for BakLLaVA models as I am not sure how that part works.
2024-02-18  1.5 bit quantization (#5453)  (Kawrakow)
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output
  At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics
  Dequantize works, but not dot product
* iq1_s: Metal works, but quite slow
  As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-17  ci : add an option to fail on compile warning (#3952)  (Ananta Bastola)
* feat(ci): add an option to fail on compile warning
* Update CMakeLists.txt
* minor : fix compile warnings
  ggml-ci
* ggml : fix unreachable code warnings
  ggml-ci
* ci : disable fatal warnings for windows, ios and tvos
* ggml : fix strncpy warning
* ci : disable fatal warnings for MPI build
* ci : add fatal warnings to ggml-ci
  ggml-ci
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-16  llava : removed excess free(NULL) operation (#5531)  (Herman Semenov)
2024-02-16  server : add "samplers" param to control the samplers order (#5494)  (Alexey Parfenov)
2024-02-16  server : fix system prompt cli (#5516)  (Rőczey Barnabás)
2024-02-16  ggml : add numa options (#5377)  (bmwl)
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode not being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remove enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
  Align enum values
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
  Remove whitespace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
  align parameters
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp
  remove whitespace and align brace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
  Remove whitespace and align brace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
  simplified return for platforms without NUMA support
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* removed redundant else from cli argument processing of --numa
* whitespace
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-16  llava : fix clip-model-is-vision flag in README.md (#5509)  (Daniel Bevenius)
* llava: fix clip-model-is-vision flag in README.md
This commit fixes the flag `--clip_model_is_vision` in README.md, which does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
--clip-model-is-vision
    The clip model is a pure vision model (ShareGPT4V vision extract for example)
```
* llava: update link to vit config in README.md
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-15  clip : fix wrong loop condition  (Georgi Gerganov)
2024-02-15  llava : fix memory management bug (#5491)  (Elbios)
* Fix memory management in llava and server code
  Fixes this error:
    llama_new_context_with_model: graph splits (measure): 3
    Available slots:
    -> Slot 0 - max context: 6000
    {"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
    all slots are idle and system prompt is empty, clear the KV cache
    slot 0 - loaded image
    slot 0 is processing [task id: 0]
    slot 0 : kv cache rm - [0, end)
    slot 0 - encoding image [id: 1]
    munmap_chunk(): invalid pointer
    Aborted
* Make it cleaner by checking size in batch free wrapper
2024-02-15  llava : hotfix for llava-1.6 image number (#5495)  (John)
Co-authored-by: John <cmt-nct@users.noreply.github.com>
2024-02-14  llava : update README.md (#5489)  (John)
* Update README.md
* Update README.md
* Update examples/llava/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-14  llava : support v1.6 (#5267)  (John)
* Create llava-survery-v2.py
* Update convert-image-encoder-to-gguf.py
* Update convert-image-encoder-to-gguf.py
* Rename llava-survery-v2.py to llava-surgery-v2.py
* Update convert-image-encoder-to-gguf.py
  will now search for projector
* Update convert-image-encoder-to-gguf.py
  whoops
* Update llava-surgery-v2.py
* Clip: Bugfix for normalization (it did not load the 3 std and mean values)
  Clip: bicubic resize function
  Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
  Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
  Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
  Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 are supported
  llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
  convert-image-encoder: fixed image-grid flattening
* whitespace corrections
* ws
* Tensors are now properly permuted. Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.
* ws
* added verbose_prompt support into cli
  added stopwords for llava-1.6 into cli
* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed
* ws
* convert : skip unknown tensors (need for LLaVA)
* llava : update readme
* llava : fix compile warnings
* llava : style
* convert : add --skip-unknown CLI arg
* server : remove clip structs
* bugfix for non llava-1.6
  It should now work with llava-1.5 as well
* clip : minor code rearrange
* llava : update readme a bit
Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-13  gguf : add python reader example (#5216)  (John)
* Update CMakeLists.txt
* Create reader.py
* Update reader.py
* Update reader.py
  another whitespace :|
* Update reader.py
* lintlintlint
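The reader example gives a pure-Python way to inspect a GGUF file's metadata and tensors. A hedged sketch along the same lines using the gguf Python package; the GGUFReader attribute names here are recalled from memory and may differ from the reader.py example itself:

```python
from gguf import GGUFReader  # pip install gguf (assumed package/API)

reader = GGUFReader("model.gguf")  # path to any GGUF model file

# List the metadata keys stored in the file.
for name in reader.fields:
    print("field:", name)

# List the tensors with their shapes.
for tensor in reader.tensors:
    print("tensor:", tensor.name, tuple(tensor.shape))
```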
2024-02-13  finetune : rename feed-forward tensors (w1/w2/w3) (#4839)  (Daniel Bevenius)
* finetune: rename feed-forward tensors (w1/w2/w3)
  This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be in line with the names used in the llama_layer struct in llama.cpp.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* train-text-from-scratch: rename ff tensors
  This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be in line with the names used in the llama_layer struct in llama.cpp.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
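The rename maps the old generic names onto their roles in the gated feed-forward block. A tiny sketch of that mapping, useful when migrating scripts that still use the old names; the helper below is illustrative, not part of the repo:

```python
# Old finetune/train-text-from-scratch names -> new role-based names,
# matching the renaming described in the commit above.
FFN_TENSOR_RENAMES = {
    "w1": "ffn_gate",  # gating projection
    "w2": "ffn_down",  # down projection back to the model dimension
    "w3": "ffn_up",    # up projection into the feed-forward dimension
}

def migrate_name(name: str) -> str:
    """Rewrite a tensor name like 'layers.0.w1' to 'layers.0.ffn_gate' (illustrative)."""
    parts = name.split(".")
    parts[-1] = FFN_TENSOR_RENAMES.get(parts[-1], parts[-1])
    return ".".join(parts)

print(migrate_name("layers.0.w1"))  # -> layers.0.ffn_gate
```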
2024-02-13  llama : support batched embeddings (#5466)  (Douglas Hanley)
* batched embedding: pool outputs by sequence id. updated embedding example
* bring back non-causal attention
* embd : minor improvements
* llama : minor
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
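Pooling outputs by sequence id means that when several sequences are packed into one batch, the per-token embeddings are reduced separately for each sequence, typically by averaging. A small numpy sketch of mean pooling by sequence id, as an illustration of the idea rather than the library's internals:

```python
import numpy as np

# Per-token embeddings for a batch that packs two sequences (ids 0 and 1).
token_embeddings = np.random.randn(5, 4).astype(np.float32)  # 5 tokens, dim 4
seq_ids = np.array([0, 0, 0, 1, 1])                          # sequence id of each token

def mean_pool_by_seq(embeddings: np.ndarray, seq_ids: np.ndarray):
    """Average token embeddings separately for each sequence id."""
    return {int(s): embeddings[seq_ids == s].mean(axis=0) for s in np.unique(seq_ids)}

pooled = mean_pool_by_seq(token_embeddings, seq_ids)
print({k: v.shape for k, v in pooled.items()})  # one fixed-size embedding per sequence
```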
2024-02-12  llava : remove prog parameter from ArgumentParser (#5457)  (Daniel Bevenius)
* llava: remove prog parameter from ArgumentParser
This commit removes the `prog` parameter from `ArgumentParser` so that it uses the default value, which is the name of the script. The motivation for this change is that currently the usage output looks like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* ci: add W503 to flake8 ignore list
This commit adds W503 to the ignore list for flake8. This is done to avoid the following error:
W503 line break before binary operator
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-12  sync : ggml (#5452)  (Georgi Gerganov)
* ggml-alloc : v3 (ggml/727)
* ggml-alloc v3
  ggml-ci
* fix ci
  ggml-ci
* whisper : check for backend buffer allocation failures
* whisper : avoid leaks when initialization fails
* cleanup
  ggml-ci
* style fixes
  ggml-ci
* sync : ggml
* update llama.cpp, clip.cpp, export-lora.cpp
* update finetune.cpp, train-text-from-scratch.cpp
  ggml-ci
* ggml-backend : reduce alignment to 32 to match gguf and fix mmap
Co-authored-by: slaren <slarengh@gmail.com>
2024-02-11  Add support for BERT embedding models (#5423)  (Douglas Hanley)
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11  server : allow to specify tokens as strings in logit_bias (#5003)  (Alexey Parfenov)
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
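With this change a logit_bias entry can name a token by its text instead of its numeric id, and the server tokenizes the string before applying the bias. A hedged sketch of a request using both forms; the address, prompt, and token id are assumptions:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed local server address
    json={
        "prompt": "Write one short sentence.",
        "n_predict": 32,
        "logit_bias": [
            [15043, -1.0],      # classic form: numeric token id and bias (id is illustrative)
            ["Hello", -100.0],  # new form: the server tokenizes "Hello" and biases it
        ],
    },
)
print(resp.json()["content"])
```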
2024-02-11  main : ctrl+C print timing in non-interactive mode (#3873)  (Georgi Gerganov)
2024-02-11  lookup: add print for drafting performance (#5450)  (Johannes Gäßler)
2024-02-11  server : add llama2 chat template (#5425)  (Xuan Son Nguyen)
* server: add mistral chat template
* server: fix typo
* server: rename template mistral to llama2
* server: format_llama2: remove BOS
* server: validate "--chat-template" argument
* server: clean up using_chatml variable
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
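For reference, the llama2 chat format wraps each user turn in [INST] ... [/INST] and embeds the system prompt in a <<SYS>> block inside the first turn. A rough single-turn sketch of that format; this is a simplified illustration, not the server's format_llama2 implementation:

```python
def llama2_prompt(system: str, user: str) -> str:
    """Build a single-turn prompt in the llama2 chat style (simplified illustration)."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

print(llama2_prompt("You are a helpful assistant.", "What is the capital of France?"))
```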
2024-02-09  llava : add requirements.txt and update README.md (#5428)  (Daniel Bevenius)
* llava: add requirements.txt and update README.md
  This commit adds a `requirements.txt` file to the `examples/llava` directory. This file contains the Python packages required to run the scripts in the `examples/llava` directory. The motivation for this is to make it easier for users to run the scripts in `examples/llava` and to avoid missing-package issues when the packages are not installed on their system.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llava: fix typo in llava-surgery.py output
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-09  server : fix prompt caching for repeated prompts (#5420)  (Riley Stewart)
2024-02-08  llava : add missing .py, and fix paths in README.md (#5414)  (Daniel Bevenius)
This commit adds the missing .py extension to the convert-image-encoder-to-gguf script. It also fixes the paths for the `model` and `mmproj` options in the example llava-cli command.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-08  llava: fix typo/formatting in README.md (#5405)  (Daniel Bevenius)
This commit fixes a typo in the README.md file for the llava example which is causing the formatting to look a little off:
  Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-07  llava-cli : always tokenize special tokens (#5382)  (Xiao-Yong Jin)
* llava-cli: tokenize special tokens in prompt
* llava-cli: use the escape CLI argument, remove incomplete separate escaping process
2024-02-07  server : update `/props` with "total_slots" value (#5373)  (Justin Parker)
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
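The /props endpoint reports server-level properties, and after this change it includes the total number of slots. A hedged sketch of querying it; the address and exact response fields are assumptions based on the description above:

```python
import requests

props = requests.get("http://127.0.0.1:8080/props").json()  # assumed local server address
# With this change, the response includes the total number of processing slots.
print("total_slots:", props.get("total_slots"))
```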
2024-02-06  server : remove model.json endpoint (#5371)  (Alexey Parfenov)
2024-02-06  server : include total "num_slots" in props endpoint (#5349)  (Justin Parker)
2024-02-06  server : add `dynatemp_range` and `dynatemp_exponent` (#5352)  (Michael Coppola)
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
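Dynamic temperature lets the effective sampling temperature vary around the base temperature within a configurable range, shaped by an exponent, instead of staying fixed. A hedged sketch of passing the two new parameters in a completion request; the address and other fields are assumptions:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed local server address
    json={
        "prompt": "Write a haiku about autumn.",
        "n_predict": 48,
        "temperature": 0.8,        # base temperature
        "dynatemp_range": 0.4,     # temperature may vary by roughly +/- 0.4 around the base
        "dynatemp_exponent": 1.0,  # shapes how strongly the temperature adapts
    },
)
print(resp.json()["content"])
```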