Age  Commit message  Author
2023-08-28  llama.h : add missing struct keyword for C compat in callback type (#2847)  [igarnier]
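In plain C, a callback typedef whose parameter is a pointer to a struct must spell out the struct keyword unless a typedef is in scope, whereas C++ also accepts the bare type name, which is how this kind of omission slips through. A minimal sketch of the pattern, using hypothetical names rather than the actual llama.h callback:

    /* Hypothetical names for illustration only; the real callback type in llama.h differs. */
    struct llama_example_state;  /* forward declaration, no typedef */

    /* Dropping the `struct` keyword below would still compile as C++,
       but a C compiler rejects the bare name `llama_example_state`. */
    typedef void (*llama_example_callback)(struct llama_example_state * state, void * user_data);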
2023-08-28  metal : fix memory leak (#2762)  [Georgi Gerganov]
* metal : fix memory leak
* metal : fix encoders memory leak
* metal : clean up more memory resources
* metal : fix more leaks
* metal : reuse dispatch queue + autoreleasepool
* metal : reuse array for command buffers and encoders
* ggml : assert for odd number of blocks on ARM (15M tinyllama is an example)
2023-08-28  quantize : make output filename optional again (#2823)  [Cebtenzzre]
* quantize : make output filename optional again
* quantize : fix path parsing on Windows (suggested by @slaren)
2023-08-28  devops : added systemd units and set versioning to use date (#2835)  [JohnnyB]
* Corrections and systemd units
* Missing dependency: clblast
2023-08-27  gguf : fix strings to not be null-terminated (#2839)  [Georgi Gerganov]
* gguf : fix strings to not be null-terminated (ggml-ci)
* gguf : fix gguf_add_tensor name
2023-08-27  llama : fix MPI threads (close #2827)  [Georgi Gerganov]
2023-08-27  examples : update llama2.c converter to read vocab and write models in GGUF format (#2751)  [Olivier Chafik]
* llama2.c: direct gguf output (WIP)
* Simplify vector building logic
* llama2.c gguf conversion: fix token types in converter
* llama2.c: support copying vocab from a llama gguf model file
* llama2.c: update default path for vocab model + readme
* llama2.c: use defines for gguf keys
* llama2.c: escape whitespaces w/ U+2581 in vocab converter the llama.cpp way
* llama2.c converter: cleanups + take n_ff from config
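For reference, escaping whitespace "the llama.cpp way" means replacing each literal space in a vocab piece with U+2581, the lower-one-eighth block character that SentencePiece-style vocabs use as a whitespace marker. The converter itself is Python; the sketch below shows the same substitution in C++ for illustration:

    #include <string>

    // Replace every ' ' with U+2581 (UTF-8 bytes 0xE2 0x96 0x81), the marker
    // used to represent whitespace inside a vocab piece.
    static std::string escape_whitespace(const std::string & text) {
        std::string out;
        for (const char c : text) {
            if (c == ' ') {
                out += "\xE2\x96\x81";
            } else {
                out += c;
            }
        }
        return out;
    }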
2023-08-27  llama : speedup tokenization (#2831)  [Kawrakow]
* Speedup tokenization: on current master it takes ~3.2 seconds to tokenize Wikitext; with this change it becomes ~525 ms.
* Fix: it was missing the piece after the last found occurrence.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27  falcon : fix CUDA inference by making K and Q contiguous (#2830)  [Georgi Gerganov]
* falcon : fix CUDA inference by making K and Q contiguous (ggml-ci)
* cuda : add assert to guard from non-cont ropes
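The usual ggml idiom for this is to insert a ggml_cont node so that a non-contiguous view is copied into contiguous memory before an op such as RoPE that asserts contiguity. A hedged sketch of the pattern, not the exact falcon graph code:

    #include "ggml.h"

    // Sketch only: return a contiguous version of `t`, copying via ggml_cont
    // when `t` is a non-contiguous view (e.g. a permuted K or Q tensor).
    static struct ggml_tensor * ensure_contiguous(struct ggml_context * ctx, struct ggml_tensor * t) {
        return ggml_is_contiguous(t) ? t : ggml_cont(ctx, t);
    }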
2023-08-27  readme : fix headings  [Georgi Gerganov]
2023-08-27  scripts : helper convert script  [Georgi Gerganov]
2023-08-27  k_quants tuning for Falcon-7b (#2816)  [Kawrakow]
* Make ggml-cuda.cu build with QK_K = 64: using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces a meaningful result.
* k_quants tuning for Falcon-7b
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27  readme : update hot topics  [Georgi Gerganov]
2023-08-27  gguf : add 64-bit support (GGUF v2) (#2821)  [Georgi Gerganov]
* gguf : bump version to 2
* gguf : add support for 64-bit (no backwards comp yet)
* gguf : v1 backwards comp
* gguf.py : bump GGUF version
* gguf.py : uint64_t on all lengths, sizes and counts, enums still uint32_t
* gguf.py : string lengths uint32_t
* gguf : update all counts to 64-bit
* gguf.py : string len uint64_t and n_dims uint32_t
* gguf : fix typo
* llama.cpp : print gguf version
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
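Roughly, "update all counts to 64-bit" means the GGUF v2 on-disk layout widens the tensor count, key-value count, and string lengths from 32 to 64 bits, while n_dims and enum values stay 32-bit. An illustrative C sketch of the header fields under that assumption (the GGUF spec is the authoritative reference):

    #include <stdint.h>

    struct gguf_header_v2_sketch {
        uint32_t magic;      // 'GGUF'
        uint32_t version;    // 2
        uint64_t n_tensors;  // uint32_t in v1
        uint64_t n_kv;       // uint32_t in v1
    };

    struct gguf_string_v2_sketch {
        uint64_t length;     // uint32_t in v1
        // followed by `length` bytes of UTF-8 data, not null-terminated
    };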
2023-08-27  llama : more tokenizer fixes (#2810)  [Georgi Gerganov]
* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files (ggml-ci)
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode) (ggml-ci)
* common : temporary separate llama_detokenize calls for SPM and BPE
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27  ggml : detect SSSE3 (#2825)  [Przemysław Pawełczyk]
* ggml : add ggml_cpu_has_ssse3
* llama : show SSSE3 in system info
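The new query follows the existing ggml_cpu_has_* convention of returning a non-zero int when the feature is available in the current build. A minimal usage sketch:

    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        // Non-zero when this ggml build has SSSE3 support.
        printf("SSSE3 = %d\n", ggml_cpu_has_ssse3());
        return 0;
    }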
2023-08-27  ci : add LoRA test to CI (#2650)  [slaren]
* ci : add lora test (ggml-ci)
* move lora summary to the top, add lora logs (ggml-ci)
* ci : decrease CPU ppl runs to 2 to avoid 20 min timeout (ggml-ci)
* add 7b lora test, use 1 thread for CUDA generation tests (ggml-ci)
* add test with q8_0 (cpu only) (ggml-ci)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-27  server : add `/detokenize` endpoint (#2802)  [Bruce MacDonald]
* Add a /detokenize endpoint to the example server
* remove trailing white-space
2023-08-26  convert.py : advanced option (#2753)  [Kerfuffle]
* Allow convert.py to convert to q8_0; fix issue with bounded_parallel_map and greedy consuming iterator; display elapsed time during conversion
* Add --concurrency option; minor improvements to help text; clean up bounded_parallel_map function a bit
* Massive speed improvement thanks to Cebtenzzre
* Refactor types
2023-08-26  llama : use Unicode Escape Sequence to replace encoded characters (#2814)  [Tim Miller]
The use of special characters within source files can break compilation on computers with different region and language settings. Using Unicode escape sequences should allow the code to be compiled on all setups without needing to change the computer's settings or switch regions.
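Concretely, the idea is to keep the source file ASCII-only by writing non-ASCII characters as \uXXXX escapes inside string literals. An illustrative C++ example (not the exact literal changed in this commit), assuming a UTF-8 execution character set:

    #include <string>

    // Both strings hold the same UTF-8 bytes; the escaped form keeps the
    // source file pure ASCII, so it is unaffected by locale/codepage settings.
    const std::string piece_literal = "▁";        // raw non-ASCII character in the source
    const std::string piece_escaped = "\u2581";   // Unicode escape sequence for U+2581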
2023-08-26  flake.nix : add rocm support and cleanup (#2808)  [Tungsten842]
2023-08-26  llama : move #includes out of _GNU_SOURCE conditional (#2817)  [Cebtenzzre]
2023-08-26  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (#1528)  [Dr. Tom Murphy VII Ph.D]
* Fix bug in main.cpp where penalize_nl=false has no effect: it modifies the underlying logits array, but at this point we are already working on the candidates copy (see the sketch after this entry).
* Suppress redefinition warning for NOMINMAX on mingw. In my installation, this macro is already defined by /usr/lib/gcc/x86_64-w64-mingw32/11/include/c++/x86_64-w64-mingw32/bits/os_defines.h:45.
* main : fix indentation
* main : pass ctx to llama_token_nl()
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
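The fix therefore has to restore the newline token's logit in the candidates array that sampling actually reads, not in the raw logits. A hedged sketch of that pattern using the llama.h token-data types (not the literal main.cpp code):

    #include <cstddef>
    #include "llama.h"

    // When penalize_nl == false, undo the repetition penalty for the newline
    // token by writing to the candidates copy used for sampling; the raw
    // logits array is no longer consulted at this point.
    static void restore_nl_logit(llama_token_data_array * candidates, llama_token nl_token, float nl_logit) {
        for (size_t i = 0; i < candidates->size; ++i) {
            if (candidates->data[i].id == nl_token) {
                candidates->data[i].logit = nl_logit;
                break;
            }
        }
    }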
2023-08-26  llama : use std::abs in llama_sample_tail_free (#2800)  [Cebtenzzre]
Plain 'abs' casts the input to int.
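That is, the C function abs() is declared for int, so a floating-point argument is implicitly converted and its fractional part is lost, whereas std::abs from <cmath> has float and double overloads. A small illustration:

    #include <cmath>    // float/double overloads of std::abs
    #include <cstdio>
    #include <cstdlib>  // int abs(int)

    int main() {
        float x = -0.25f;
        // Mimics what int abs(int) computes once the float has been
        // converted to int: the value collapses to 0.
        int   truncated = abs(static_cast<int>(x));
        // The floating-point overload preserves the magnitude: 0.25.
        float correct   = std::abs(x);
        printf("%d %f\n", truncated, correct);
        return 0;
    }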
2023-08-26  k-quants : remove unnecessary tensor shape restrictions (#2811)  [Georgi Gerganov]
2023-08-26  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (#2807)  [Kawrakow]
* Better perplexity for 2- and 3-bit quantization for the 70B model
* PR comment
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-26  Fix HellaSwag (#2805)  [Kawrakow]
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-26  flake : build llama.cpp on Intel with nix (#2795)  [Volodymyr Vitvitskyi]
Problem: `nix build` fails with missing `Accelerate.h`.
Changes:
- Fix build of the llama.cpp with nix for Intel: add the same SDK frameworks as for ARM
- Add `quantize` app to the output of nix flake
- Extend nix devShell with llama-python so we can use convertScript
Testing the steps with nix:
1. `nix build`
   Get the model, and then:
2. `nix develop` and then `python convert.py models/llama-2-7b.ggmlv3.q4_0.bin`
3. `nix run llama.cpp#quantize -- open_llama_7b/ggml-model-f16.gguf ./models/ggml-model-q4_0.bin 2`
4. `nix run llama.cpp#llama -- -m models/ggml-model-q4_0.bin -p "What is nix?" -n 400 --temp 0.8 -e -t 8`
Co-authored-by: Volodymyr Vitvitskyi <volodymyrvitvitskyi@SamsungPro.local>
2023-08-26  Handle null rope scaling value (#2793)  [Nigel Bosch]
2023-08-26  Fix spm whitespaces (#2806)  [klosax]
* llama.cpp : fix spm whitespace escaping + clean up
* main.cpp : spm - add whitespace in front of prompt
* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
2023-08-26  examples : skip unnecessary external lib in server README.md how-to (#2804)  [lon]
2023-08-25  llama : fix struct decl (#2790)  [Marcus Dunn]
2023-08-25  Faster perplexity computation (#2786)  [Kawrakow]
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-25  llama : add llama_beam_search() (#2267)  [Matt Pulver]
* Add llama_beam_search().
* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().
* Add space around * pointers and & references.
* Add spaces around comparison and assignment operators.
* Prefer west const.
* Use llama_ prefix for structs in global namespace.
* Delete obsolete comment from an earlier revision.
* Change eos to eob in llama_beam and llama_beam_view structs.
2023-08-25  convert.py : Get rope scale from HuggingFace models (#2772)  [Nigel Bosch]
* Get rope scale from HF models
* Save rope scale only for linear scaling
* Rewrite for clarity
2023-08-25  llama-bench : add model sizes (#2771)  [slaren]
* llama-bench : add model sizes
* more compact markdown output
* back to GiB
* adjust column sizes
2023-08-25  convert.py : export rope freq_base when converting CodeLlama from an HF model (#2773)  [slaren]
2023-08-25  server : display token probabilities in the UI (#2489)  [Jhen-Jie Hong]
* server : add n_probs param in chat UI
* server : keep message data array & show in probabilities component
* server : add simple popover component
* server : fix completion_probabilities undefined if n_probs not set
* server : implement Probabilites component
* server : handle bytes
* server : cap n_probs at 10 for easy scrolling
* server : adjust for dark/light mode
* server : fix regenerated prompt
* server : update index.html.hpp
* server : convert prob to percentage + show original value as div title
* server : fix Probabilites not used if an empty str is included
* server : skip byte pairs when displaying probabilities
* server : remove array check of completion_probabilities in messages
* skip empty array or byte pair (> 1) in Probabilites
* generate index.html.hpp
* fix incorrect prob conversion if the str is already a known token
* use final response to show probabilities on stop
* revert unnecessary change
* correct probabilities usage
* remove unused function
* always send partial response to get correct probs of last to_send
* fix typo
* fix content of format_final_response
* refactor probs render & make pColor transparent if not found
* send empty string when stop_pos reached in partial
* avoid unnecessary empty data event & send rest of partial tokens on stop
* use <br /> for new line
* skip -1 tok in loop to avoid sending '' on end
* trim last new lines on stop
* revert unnecessary change
2023-08-25  ci : pip install gguf in editable mode (#2782)  [Georgi Gerganov]
ggml-ci
2023-08-25  gguf : export objects to user code (#2780)  [M. Yusuf Sarıgöz]
* gguf export more objects to user code
* gguf export all objects to user code for now
* gguf : bump version
2023-08-25  ROCm Port (#1087)  [Henri Vasserman]
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25  cuda : add RoPE kernel for mode == 2 (NeoX) (#2760)  [Georgi Gerganov]
* cuda : add RoPE kernel for mode == 2 (NeoX)
* falcon : do not offload the embeddings layer
2023-08-25  gguf : make gguf pip-installable  [M. Yusuf Sarıgöz]
* gitignore : add dist and rm pyproject.toml
* gguf: prepare as Pip package
* gguf: prepare as Pip package
* gguf : fix line endings
* requirements : add gguf
* gguf : update readme with build notes
* gguf : update readme with build notes
* gguf : add notes for tests
2023-08-25  ggml-alloc : enlarge size of parse_seq (#2776)  [Shouzheng Liu]
Since we also store barriers in this array, we need to double its size.
2023-08-24  Added `enum` to `llama_token_get_type` return type (#2774)  [Marcus Dunn]
2023-08-24  convert.py : try to determine n_ctx automatically for CodeLlama (#2770)  [slaren]
2023-08-24  gguf : add rope_freq_base parameter for CodeLlama (#2769)  [slaren]
2023-08-24  falcon : write file type  [Georgi Gerganov]
2023-08-24  metal : bug-fix when enabling ggml-alloc (#2757)  [Shouzheng Liu]
* metal : better memory alloc w/ concurrency dispatch. The ggml-alloc should only free tensors at memory barriers.
* ggml-alloc : avoid returning silently. In certain cases, the allocate_node() function may silently return without performing any memory allocation.
2023-08-24  convert : auto-determine model name based on dir + scripts update  [Georgi Gerganov]