Date  Commit message  Author
2023-08-21  HellaSwag: split token evaluation into batches if needed (#2681)  Kawrakow
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-20  ggml : move all type info to ggml_type_traits (#2663)  slaren
2023-08-20  More efficient HellaSwag implementation (#2677)  Kawrakow
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-19  server : better default prompt (#2646)  Georgi Gerganov
2023-08-19  server : update xxd usage for compatibility with older versions (#2649)  Jhen-Jie Hong
  * server : update xxd usage for compatibility with older versions
  * remove unused $func
2023-08-18  Add link to Clojure bindings to README (#2659)  Adrian
2023-08-18  readme : incoming BREAKING CHANGE  Georgi Gerganov
2023-08-18  llama : add benchmark example (#2626)  slaren
  * llama : add benchmark example
  * add to examples CMakeLists.txt
  * fix msvc build
  * add missing include
  * add Bessel's correction to stdev calculation (see the sketch below)
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
  * improve markdown formatting
  * add missing include
  * print warning if NDEBUG is not defined
  * remove n_prompt and n_gen from the matrix, use each value separately instead
  * better checks for non-optimized builds
  * llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call
  * fix json formatting
  * add sql output
  * add basic cpu and gpu info (linux/cuda only)
  * markdown: also show values that differ from the default
  * markdown: add build id
  * cleanup
  * improve formatting
  * formatting
  ---------
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
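  Bessel's correction divides the sum of squared deviations by (n - 1) rather than n, removing the bias that comes from estimating the mean from the same sample. A minimal C++ sketch of the corrected stdev (the helper name is illustrative, not the benchmark's actual code):

    #include <cmath>
    #include <numeric>
    #include <vector>

    // Sample standard deviation with Bessel's correction: divide by
    // (n - 1) instead of n for an unbiased variance estimate.
    static double sample_stdev(const std::vector<double> & v) {
        if (v.size() < 2) {
            return 0.0;
        }
        const double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
        double sq_sum = 0.0;
        for (const double x : v) {
            sq_sum += (x - mean) * (x - mean);
        }
        return std::sqrt(sq_sum / (v.size() - 1));
    }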
2023-08-18  readme : add link to Rust bindings (#2656)  mdrokz
2023-08-18  perplexity : more meaningful ETA number - 2 decimal points  Georgi Gerganov
2023-08-17  Fix unicode in grammars (fixes #2501) (#2553)  Evan Jones
  * Fix unicode in grammars (fixes #2501) (see the sketch below)
  * add more comments
  * fix test-llama-grammar
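  Grammar rules match against Unicode code points, so multi-byte UTF-8 sequences must be decoded before matching. A hedged sketch of the decoding step (a minimal decoder written for illustration; it assumes well-formed input and is not the parser's actual code):

    #include <cstdint>
    #include <utility>

    // Decode one UTF-8 sequence into a code point; returns the code
    // point and the number of bytes consumed. No validation.
    static std::pair<uint32_t, int> decode_utf8(const char * src) {
        const uint8_t * s = (const uint8_t *) src;
        if (s[0] < 0x80) { // 0xxxxxxx: ASCII, 1 byte
            return { s[0], 1 };
        }
        if (s[0] < 0xE0) { // 110xxxxx 10xxxxxx: 2 bytes
            return { (uint32_t) (s[0] & 0x1F) << 6 | (s[1] & 0x3F), 2 };
        }
        if (s[0] < 0xF0) { // 1110xxxx + 2 continuation bytes
            return { (uint32_t) (s[0] & 0x0F) << 12 |
                     (uint32_t) (s[1] & 0x3F) << 6  | (s[2] & 0x3F), 3 };
        }
        // 11110xxx + 3 continuation bytes
        return { (uint32_t) (s[0] & 0x07) << 18 |
                 (uint32_t) (s[1] & 0x3F) << 12 |
                 (uint32_t) (s[2] & 0x3F) << 6  | (s[3] & 0x3F), 4 };
    }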
2023-08-18  server : support for saving templates in browser LocalStorage (#2486)  staviq
  * support for templates in browser LocalStorage
  * sync accepted #2409 fix from upstream
  * convert autosave invocation to useEffect
  * Apply suggestions from code review
    Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
  * Regen index.html.cpp, suggested from code review
  ---------
  Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-17  README: fix LLAMA_CUDA_MMV_Y documentation (#2647)  Johannes Gäßler
2023-08-17  [Zig] Fixing Zig build and improvements (#2554)  Henri Vasserman
  * Fix zig after console.o was split
  * Better include and flag management
  * Change LTO to option
2023-08-17  Add --cfg-negative-prompt-file option for examples (#2591)  Kerfuffle
2023-08-17  llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)  Georgi Gerganov
  ggml-ci
2023-08-17  tests : adds simple llama grammar tests (#2618)  drbh
  * adds simple llama grammar tests
  * fix lint and add Makefile
  * 0 terminate code_points
  * avoid dangling pointers in candidate cleanup
  * cleanup grammar at end of test
2023-08-17  ggml-alloc : fix discrepancy between measure & eval (#2639)  Shouzheng Liu
  The GGML memory allocator consistently places a tensor within the optimal-fit memory block, which is the smallest block capable of accommodating the tensor's size. During the measurement phase, the final block is generously sized, ensuring it never qualifies as the optimal-fit block as long as there exists another block capable of accommodating the tensor. Nevertheless, in the evaluation phase, the last block is constrained in size and could potentially qualify as the optimal-fit block. Consequently, there exists the possibility of a tensor being allocated to a different region during evaluation, leading to more memory fragmentation in our scratch buffer. This recent commit guarantees uniform behavior of the allocator across both the measurement and evaluation phases, eliminating discrepancies between the two.
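  In allocator terms, the paragraph above describes best-fit selection: scan the free blocks and take the smallest one that still holds the request. A hedged C++ sketch of that loop (struct and function names are illustrative, not ggml-alloc's real internals):

    #include <cstddef>
    #include <vector>

    struct free_block {
        size_t size;
        void * addr;
    };

    // Best-fit: the smallest free block with size >= the request. The
    // discrepancy described above arises because the measure phase's
    // oversized final block never wins this comparison, while the eval
    // phase's smaller final block can.
    static free_block * best_fit(std::vector<free_block> & blocks, size_t size) {
        free_block * best = nullptr;
        for (free_block & b : blocks) {
            if (b.size >= size && (best == nullptr || b.size < best->size)) {
                best = &b;
            }
        }
        return best;
    }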
2023-08-16  cmake : install ggml-meta.metal if LLAMA_METAL (#2449)  Kolen Cheung
2023-08-16  metal : print error of load pipeline state (#2564)  Jhen-Jie Hong
  * metal : print error of load pipeline state
  * metal : return null if load pipeline failed
2023-08-16  metal : enable ggml-alloc (#2627)  Shouzheng Liu
  * metal : enable ggml-alloc
    Make ggml-alloc work with concurrent dispatch.
  * style-fix
    Co-authored-by: slaren <slarengh@gmail.com>
  ---------
  Co-authored-by: slaren <slarengh@gmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16  metal : matrix-matrix multiplication kernel (#2615)  Shouzheng Liu
  * metal : matrix-matrix multiplication kernel
    This commit removes MPS and uses custom matrix-matrix multiplication kernels for all quantization types. It also adds grouped-query attention to support LLaMA 2 70B.
  * metal : fix performance degradation from gqa
    Integers are slow on the GPU, and 64-bit divides are extremely slow. GQA introduces a 64-bit divide that cannot be optimized out by the compiler, which costs ~8% of inference performance. This commit calculates part of the offset with a 32-bit divide instead (see the sketch below). Naturally, this limits the size of a single matrix to ~4 GB, which should suffice for the near future.
  * metal : fix bugs for GQA and perplexity test
    I mixed up ne02 and nb02 in the previous commit.
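  A hedged sketch of the 32-bit-divide idea from the gqa fix above (names are illustrative, not the Metal kernel's actual variables): GQA maps several query heads to one KV head, and keeping the divide in 32-bit arithmetic, widening only afterwards, is what caps a single matrix at ~4 GB.

    #include <cstdint>

    // Compute the KV-head byte offset for a query head. The divide stays
    // in 32 bits because 64-bit divides are extremely slow on the GPU;
    // the result is widened to 64 bits only after the divide.
    static uint64_t kv_head_offset(uint32_t q_head, uint32_t n_head, uint32_t n_head_kv,
                                   uint64_t bytes_per_kv_head) {
        const uint32_t kv_head = q_head / (n_head / n_head_kv); // 32-bit divide
        return (uint64_t) kv_head * bytes_per_kv_head;          // widen afterwards
    }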
2023-08-15  scripts : add helper script to get wikitext  Georgi Gerganov
2023-08-15  server : add missing /json-schema-to-grammar.mjs (#2616)  Jhen-Jie Hong
  fixes #2611
2023-08-14  metal : return null instead of exit(1) (#2573)  Jhen-Jie Hong
2023-08-14  server : add --numa support (#2524)  Cheng Shao
2023-08-14  llama : add missing enum keyword in function signatures (#2610)  Kamil Tomšík
2023-08-14  CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596)  Johannes Gäßler
2023-08-14  server : fix default grammar by using empty string in the UI (#2604)  Jhen-Jie Hong
2023-08-14  server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)  Jhen-Jie Hong
  * server : implement json-schema-to-grammar.mjs by following the python impl
  * server : add grammar support in chat.mjs
  * server : implement grammar param in the UI
  * server : generate .hpp
  * server : remove trailing whitespaces
  * server : generate .hpp
  * server : fix sort of prop pairs
  * server : optimize regex & iteration
2023-08-13  Enhance compatibility with Windows 7 and below (#2592)  vxiiduu
  * Enhance Windows 7 compatibility
  * Clean away unnecessary preprocessor conditional
2023-08-13  test : add simple grammar parsing tests (#2594)  drbh
  * adds simple grammar parsing tests
  * adds cassert header
2023-08-13  CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590)  Johannes Gäßler
2023-08-12  Adding support for llama2.c models (#2559)  byte-6174
2023-08-12  server: fixed wrong variable name in timing json (#2579)  Equim
  * server: fixed wrong variable name in timing json
  * remove redundant entry
2023-08-10  Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.  DannyDaemonic
2023-08-10  Add --n-predict -2 for stopping generation on full context (#2565)  Christian Demsar
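  Semantically, -n -2 keeps generating until the context window is full. A hedged sketch of the stopping rule (variable names follow llama.cpp conventions, n_past for tokens already in the context and n_ctx for the context size, but this is not the example's actual code):

    // n_predict >= 0 limits generated tokens, -1 means unlimited,
    // -2 means stop as soon as the context is full.
    static bool should_stop(int n_past, int n_ctx, int n_gen, int n_predict) {
        if (n_predict == -2) {
            return n_past >= n_ctx; // full context: stop instead of swapping it
        }
        if (n_predict >= 0) {
            return n_gen >= n_predict;
        }
        return false; // -1: no limit
    }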
2023-08-10  Fix grammar-based sampling issue in server (#2566)  Martin Krasser
2023-08-09  ggml-alloc: Don't try to re-use buffers of external tensors (#2562)  Sam Spilsbury
  * ggml-alloc: Don't try to re-use buffers of external tensors
    They might be weights that came from another context, so we have no control over them (and they might be re-used elsewhere, so writing to them would be a bad idea).
  * ggml-alloc: >= when checking for out-of-bounds (see the sketch below)
    Co-authored-by: slaren <slarengh@gmail.com>
  ---------
  Co-authored-by: slaren <slarengh@gmail.com>
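  A hedged sketch of the check described above (field and function names are illustrative, not ggml-alloc's real ones):

    #include <cstddef>
    #include <cstdint>

    // A tensor whose data pointer lies outside [buf, buf + buf_size)
    // belongs to another context (e.g. model weights), so its memory
    // must never be handed out for reuse. Note the >= on the upper
    // bound, per the second fix in the commit.
    static bool is_external_tensor(const uint8_t * buf, size_t buf_size,
                                   const uint8_t * data) {
        return data < buf || data >= buf + buf_size;
    }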
2023-08-09  add log_callback to llama_context_params for custom logging (#2234)  grahameth
  * add log_callback to llama_context_params for custom logging (see the sketch below)
  * Fix macro expansion on gcc
  * Add struct llama_state for global variables and move log_callback there
  * Turn log level into enum and some minor changes
  * Remove model_for_logging parameter (not needed anymore)
  * Convert remaining fprintf(stderr, ...) calls to use new macros
  * Fix enum and initialize g_state
  * Fix log calls after merge
  * Fix missing static
  * Add back all the new lines in the logging strings
  * Add comment for llama_log_callback and replace remaining printf calls
  ---------
  Co-authored-by: grahameth <->
  Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
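  A hedged sketch of hooking the custom logging described above, assuming the llama_log_set / llama_log_callback API this commit introduced (check llama.h in your tree for the exact enum and signature):

    #include <cstdio>
    #include "llama.h"

    // Forward all library log messages to a FILE* passed as user_data.
    static void my_log(enum llama_log_level level, const char * text, void * user_data) {
        (void) level; // could filter, e.g. only LLAMA_LOG_LEVEL_ERROR
        fputs(text, (FILE *) user_data);
    }

    int main(void) {
        llama_log_set(my_log, stderr); // library messages now go through my_log
        // ... load the model and run inference as usual ...
        return 0;
    }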
2023-08-09  CUDA: tuned mul_mat_q kernels (#2546)  Johannes Gäßler
2023-08-08  Allow passing grammar to completion endpoint (#2532)  Martin Krasser
2023-08-08  CUDA: tighter VRAM scratch size for 65b/70b (#2551)  Johannes Gäßler
2023-08-08  llm.vim : multiline autocompletion, get rid of "^@" (#2543)  chaihahaha
2023-08-08  vim : bring back simple llm.vim example  Georgi Gerganov
2023-08-08  vim : streaming and more (#2495)  AustinMroz
  * Update Vim plugin
  * Remove getbufoneline usage, add input bind example
    getbufoneline() appears to be a recently added function and has been replaced with getbufline for compatibility. An additional example explains how to add a keybind that works in insert mode.
2023-08-07  Add --rope-scale parameter (#2544)  klosax
  * common.cpp : Add --rope-scale parameter
  * README.md : Add info about using linear rope scaling (see the sketch below)
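  Linear RoPE scaling divides positions by the scale factor before the rotary angles are computed, stretching the model's trained context window by that factor. A hedged C++ sketch of the idea (illustrative only; the real math lives in ggml's rope ops):

    #include <cmath>
    #include <vector>

    // Rotary angles for one position with linear scaling applied:
    // theta_i = (pos / rope_scale) * freq_base^(-2i / n_dims).
    static std::vector<float> rope_angles(int pos, int n_dims, float freq_base,
                                          float rope_scale) {
        std::vector<float> angles(n_dims / 2);
        const float p = (float) pos / rope_scale; // rope_scale = 2.0 halves every position
        for (int i = 0; i < n_dims / 2; i++) {
            angles[i] = p * std::pow(freq_base, -2.0f * i / n_dims);
        }
        return angles;
    }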
2023-08-07  ggml : mul mat tweaks (#2372)  Georgi Gerganov
  * ggml : mul mat wip
    ggml-ci
  * ggml : alternative thread distribution for mul_mat
    ggml-ci
  * ggml : mul_mat block tiling attempt (see the sketch below)
  * ggml : mul_mat threads yield
    ggml-ci
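  Block tiling partitions the matrix multiplication so each pass works on cache-resident tiles. A generic hedged sketch of the technique (not ggml's actual kernel, which also distributes tiles across threads and operates on quantized blocks):

    #include <algorithm>
    #include <cstddef>

    // Blocked (tiled) matrix multiplication, C = A * B, row-major.
    // A: m x k, B: k x n, C: m x n.
    static void mul_mat_tiled(const float * A, const float * B, float * C,
                              int m, int n, int k, int tile) {
        std::fill(C, C + (size_t) m * n, 0.0f);
        for (int i0 = 0; i0 < m; i0 += tile)
        for (int l0 = 0; l0 < k; l0 += tile)
        for (int j0 = 0; j0 < n; j0 += tile)
            for (int i = i0; i < std::min(i0 + tile, m); i++)
            for (int l = l0; l < std::min(l0 + tile, k); l++) {
                const float a = A[(size_t) i * k + l];
                for (int j = j0; j < std::min(j0 + tile, n); j++)
                    C[(size_t) i * n + j] += a * B[(size_t) l * n + j];
            }
    }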
2023-08-07  ggml : pad result of ggml_nbytes()  Georgi Gerganov
2023-08-07  ggml : change params pointer (style change) (#2539)  Georgi Gerganov
  ggml-ci