2024-02-09  llama : do not cap thread count when MoE on CPU (#5419)  (Paul Tsochantaris)
    * Not capping thread count when MoE inference is running on CPU
    * Whitespace

2024-02-09  readme : add JavaScript/Wasm repo (#5415)  (Marko Tasic)

2024-02-09  ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)  (Michael Podvitskiy)

2024-02-09  Fix Vulkan crash on APUs with very little device memory (#5424)  (0cc4m)
    * Fix Vulkan crash on APUs with very little device memory
    * Fix debug output function names

2024-02-08  CUDA: more warps for mmvq on NVIDIA (#5394)  (Johannes Gäßler)

2024-02-08  llama : do not print "offloading layers" message in CPU-only builds (#5416)  (slaren)

2024-02-08  Fix f16_sycl cpy call from Arc (#5411)  (Abhilash Majumder)
    * fix f16_sycl cpy call
    * rm old logic
    * add fp16 build CI
    * use macro
    * format fix

2024-02-08  llava : add missing .py, and fix paths in README.md (#5414)  (Daniel Bevenius)
    This commit adds the missing .py extension to the convert-image-encoder-to-gguf
    script. It also fixes the paths for the `model` and `mmproj` options in the
    example llava-cli command.
    Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2024-02-08  fix trailing whitespace (#5407)  (Johannes Gäßler)

2024-02-08  llama : fix MiniCPM (#5392)  (runfuture)
    * fix bug for norm_rms_eps missing
    * to align with the same order as convert.py for model write
    * fix: undo HF models permute tensor
    * update for flake8 lint

2024-02-08  llava: fix typo/formatting in README.md (#5405)  (Daniel Bevenius)
    This commit fixes a typo in the README.md file for the llava example which is
    causing the formatting to look a little off:
    Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
    Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2024-02-08  sampling: fix top_k <= 0 (#5388)  (Johannes Gäßler)
    * sampling: fix top_k <= 0
    * Update llama.cpp
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

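    A minimal sketch of the idea behind the fix, with illustrative names rather
    than the actual llama.cpp API: a non-positive top_k should disable the filter
    (keep all candidates) instead of truncating the list to nothing.

        #include <stddef.h>

        /* Illustrative sketch only, not the llama.cpp implementation:
         * treat top_k <= 0 as "disabled" so the candidate list is left intact. */
        typedef struct {
            float * score;   /* per-candidate logits/probabilities */
            size_t  size;    /* number of candidates */
        } candidates_t;

        static void sample_top_k(candidates_t * cands, int k) {
            if (k <= 0 || (size_t) k > cands->size) {
                k = (int) cands->size;   /* keep everything instead of keeping nothing */
            }
            /* ... partial-sort the k best-scoring candidates to the front ... */
            cands->size = (size_t) k;
        }
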
2024-02-08  tests : .gitignore obj files  (Georgi Gerganov)

2024-02-07  CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)  (Michael Podvitskiy)
    Co-authored-by: Jared Van Bortel <jared@nomic.ai>

2024-02-07  fix typo in readme (#5399)  (Ebey Abraham)
    Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>

2024-02-07  Add Ava in the list of llama.cpp UIs (#4362)  (Kamil Tomšík)

2024-02-07  CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)  (Johannes Gäßler)

2024-02-07  [SYCL] update install make by w64devkit (#5297)  (Neo Zhang Jianyu)

2024-02-07  llava-cli : always tokenize special tokens (#5382)  (Xiao-Yong Jin)
    * llava-cli: tokenize special tokens in prompt
    * llava-cli: use the escape CLI argument, remove incomplete separate escaping process

2024-02-07  Basic Vulkan Multi-GPU implementation (#5321)  (0cc4m)
    * Initial Vulkan multi-gpu implementation
      Move most global variables into backend context
    * Add names to backend device functions
    * Add further missing cleanup code
    * Reduce code duplication in tensor split layer assignment
    * generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
    * Only do device info print in the beginning and initialize one backend for cpu assist
      Add missing cleanup code
    * Rework backend memory management to make sure devices and buffers get properly allocated and freed
    * Rename cpu assist free function
    Co-authored-by: slaren <slarengh@gmail.com>

2024-02-07  readme : modernize (#5379)  (Eve)
    * first cleanup, update everything to Llama 2 and remove outdated content
    * Delete SHA256SUMS
    * make build instructions generic
    * recommend Q4_K_M quantization method
    * Update README.md

2024-02-07  readme : update ui list (#5354)  (Ben Williams)

2024-02-07  llama : add MiniCPM support (#5346)  (runfuture)
    * support minicpm arch.
    * fix tab/space typo.
    * convert minicpm model via convert-hf-gguf.py
    * try to make tokenizer work
    * fix bug for quantize minicpm
    * fix for flake8 lint
    * remove convert-minicpm.py
    * fix for editorconfig
    * correct minicpm model type (size)
    * constants expanded for minicpm
    * Minor change of the constant names for minicpm

2024-02-07  server : update `/props` with "total_slots" value (#5373)  (Justin Parker)
    * include total "num_slots" in default_generation_settings_for_props
    * cleanup total_slots return value in /props endpoint
    * update /props endpoint docs with total_slots
    * remove num_slots from default_generation_settings_for_props
    * update /props endpoint section

2024-02-06  convert : fix TypeError on GPT-2 vocab.json (#5288)  (Sang-Kil Park)

2024-02-06  server : remove model.json endpoint (#5371)  (Alexey Parfenov)

2024-02-06  CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)  (Johannes Gäßler)

2024-02-06  Update README.md (#5366)  (Kawrakow)
    Add some links to quantization related PRs

2024-02-06  Slight quantization improvement for Q4_K and Q5_K (#5361)  (Kawrakow)
    * Q4_K: slightly better quantization
    * Q5_K: slightly better quantization
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-06  readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)  (BarfingLemurs)

2024-02-06  CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)  (Johannes Gäßler)

2024-02-06  server : include total "num_slots" in props endpoint (#5349)  (Justin Parker)

2024-02-06  server : add `dynatemp_range` and `dynatemp_exponent` (#5352)  (Michael Coppola)
    * server: added `dynatemp_range` and `dynatemp_exponent`
    * Update README.md
    Co-authored-by: Michael Coppola <info@michaeljcoppola.com>

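    For context, a rough sketch of how an entropy-based dynamic temperature can
    use these two parameters; the function name and the exact mapping are
    assumptions for illustration, not the server's implementation. The idea is
    that `dynatemp_range` defines a window around the base temperature and
    `dynatemp_exponent` shapes how the normalized entropy of the candidate
    distribution is mapped into that window.

        #include <math.h>

        /* Illustrative sketch (not the actual server/llama.cpp code): map the
         * normalized entropy of the candidate distribution into the window
         * [temp - range, temp + range], shaped by `exponent`. */
        float dynamic_temperature(const float * probs, int n,
                                  float temp, float range, float exponent) {
            const float min_temp = temp - range > 0.0f ? temp - range : 0.0f;
            const float max_temp = temp + range;

            float entropy = 0.0f;
            for (int i = 0; i < n; ++i) {
                if (probs[i] > 0.0f) entropy -= probs[i] * logf(probs[i]);
            }
            const float max_entropy = logf((float) n);   /* entropy of a uniform distribution */
            const float norm = max_entropy > 0.0f ? entropy / max_entropy : 0.0f;

            return min_temp + (max_temp - min_temp) * powf(norm, exponent);
        }
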
2024-02-06  server : various fixes for the prompt field in /completion (#5300)  (Niall Coates)
    server : fix deadlock when prompt array contains strings and numbers
    server : removed an unnecessary generation when generating multi-prompts
    server : removed an unnecessary assert

2024-02-06  py : handle byte tokens in `get_token_type` (#5341)  (Georgi Gerganov)
    * py : handle byte tokens in `get_token_type`
    * py : fix empty bytes arg

2024-02-05  make: Use ccache for faster compilation (#5318)  (Johannes Gäßler)
    * make: Use ccache for faster compilation

2024-02-05  README: updated introduction (#5343)  (Johannes Gäßler)
    * README: updated introduction
    * readme : update
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-02-05  ggml : make use of ggml-quants.h possible in C++ code (#5338)  (Kawrakow)
    * Make use of ggml-quants.h possible in C++ code
    * One cannot possibly be defining static_assert in a C++ compilation
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-05  ggml : avoid duplicating function calls using MIN/MAX macros (#5325)  (Dr. Tom Murphy VII Ph.D)
    * Avoid duplicating function calls when using MIN/MAX macros.
      Since these macros copy "a" and "b", they ask the compiler to evaluate one
      of them twice. The compiler has no problem removing the duplication in
      something like MAX(0, x + 2), but in some cases we are calling functions,
      and those calls just happen twice. By explicitly evaluating the expression
      first, we get smaller and faster code without duplicate calls. See
      ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh
      The code behaves exactly the same.
    * Update ggml.c
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

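    The pitfall described above in a minimal C example (illustrative, not the
    ggml.c source): a naive MAX macro expands both arguments, so a function call
    passed to it may run twice; evaluating it into a local first avoids the
    extra call.

        #define MAX(a, b) ((a) > (b) ? (a) : (b))

        extern float expensive(float x);      /* stands in for any non-trivial call */

        float bad(float x) {
            return MAX(1.0f, expensive(x));   /* expensive(x) appears twice after expansion */
        }

        float good(float x) {
            const float y = expensive(x);     /* evaluate once */
            return MAX(1.0f, y);              /* only cheap reads are duplicated */
        }
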
2024-02-05  iq3_xxs: guards for the no-imatrix situation (#5334)  (Kawrakow)
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-05  py : fix internlm2-hf convert to gguf (#5305)  (Guoteng)
    * py : fix internlm2-hf convert to gguf
    * ggml-ci

2024-02-05  iq2_xxs: tune quantization (#5320)  (Kawrakow)
    We get slightly better PPL, and we cut quantization time by nearly half.
    The trick is to first quantize without forcing points onto the E8 lattice.
    We can then use a narrower search range around the block scale obtained that way.
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

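    The two-pass idea above as a generic sketch; the function names, window
    width, and step count are assumptions for illustration, not the
    ggml-quants.c code.

        #include <float.h>

        /* Illustrative two-pass scale search: pass 1 picks a block scale with a
         * cheap, unconstrained quantizer (its result is passed in as free_scale);
         * pass 2 searches a narrow window around it with the expensive,
         * lattice-constrained quantizer supplied by the caller. */
        typedef float (*quant_err_fn)(const float * x, int n, float scale);

        float tune_scale(const float * x, int n,
                         float free_scale,             /* result of the cheap pass   */
                         quant_err_fn constrained_err  /* expensive constrained pass */) {
            float best_scale = free_scale;
            float best_err   = FLT_MAX;
            for (int i = -4; i <= 4; ++i) {            /* narrow window around free_scale */
                const float s   = free_scale * (1.0f + 0.02f * (float) i);
                const float err = constrained_err(x, n, s);
                if (err < best_err) { best_err = err; best_scale = s; }
            }
            return best_scale;
        }
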
2024-02-05  server : allow to get default generation settings for completion (#5307)  (Alexey Parfenov)

2024-02-05  common : add dynamic temperature parameters to main example cli (#5295)  (l3utterfly)
    * added dynamic temp params in main
    * added help text

2024-02-05  scripts : fix typos, cleanup (#5303)  (Georgi Gerganov)

2024-02-05  scripts : add non-interactive server-llm.sh (#5303)  (Нияз Гарифзянов)
    * Update server-llm.sh
      Add a --non-interactive flag that allows running the script without asking for permission
    * Update scripts/server-llm.sh
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-02-05  readme : add CodeShell models to the supported models list (#5330)  (chiranko)

2024-02-05  [SYCL] Fix cpy with dims of 3 (#5289)  (AidanBeltonS)
    * Fix cpy with dims of 3
    * rm asserts
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

2024-02-04  flake.lock: Update  (github-actions[bot])
    Flake lock file updates:
    • Updated input 'flake-parts':
      'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
      → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
    • Updated input 'flake-parts/nixpkgs-lib':
      'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
      → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
    • Updated input 'nixpkgs':
      'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
      → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)

2024-02-04  Adding some imatrix tools (#5302)  (Kawrakow)
    * imatrix: adding --combine and --continue-from
    * imatrix: be able to start from a specific chunk
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>