2024-03-16  ggml : add AVX512F SIMD (#6088)  (AmirAli Mirian)
2024-03-16  gritlm : add initial README.md (#6086)  (Daniel Bevenius)
* gritlm : add initial README.md to examples/gritlm
  This commit adds a suggestion for an initial README.md for the gritlm example.
* squash! Use the `scripts/hf.sh` script to download the model file.
* squash! Fix editorconfig-checker error in examples/gritlm/README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-03-16  readme : add wllama as a wasm binding (#6100)  (Xuan Son Nguyen)
2024-03-16  common : refactor nested if causing error C1061 on MSVC (#6101)  (DAN™)
* Refactor nested if causing error C1061 on MSVC.
* Revert and remove the `else` branches.
* Add flag to track found arguments.
2024-03-16  ci : close inactive issues with workflow (#6053)  (Pierrick Hymbert)
* issues: ci - close inactive issues with workflow
* ci: close issue, change workflow schedule time
2024-03-15  llama : fix Baichuan2 13B (#6092)  (slaren)
2024-03-15  llama : add support for control vectors (#5970)  (Theia Vogel)
* control vector api and implementation
* control-vectors : minor code style updates
* disable control vector when data == nullptr
  use -1 for disabled range (also on init) in case we ever support
  controlling layer 0 (embeddings)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
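The idea behind the control-vector commit above can be sketched in a few lines: a learned direction vector is added, scaled by a strength, to a layer's hidden state, and a layer with no data (the `data == nullptr` case) is left untouched. This is an illustrative sketch, not the llama.cpp API; `apply_control_vector` and `strength` are hypothetical names.

```python
# Hypothetical sketch of control-vector steering: shift a layer's activations
# along a learned direction, scaled by a strength. A disabled layer (vector is
# None, mirroring the data == nullptr case) passes activations through unchanged.

def apply_control_vector(hidden_state, control_vector, strength):
    """Return hidden_state shifted by strength * control_vector, elementwise."""
    if control_vector is None:  # control vector disabled for this layer
        return hidden_state
    return [h + strength * c for h, c in zip(hidden_state, control_vector)]

hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]
print(apply_control_vector(hidden, direction, 0.5))  # [1.0, -1.0, 1.5]
```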
2024-03-15  llama : add Command-R support (#6033)  (Andrew Canis)
Information about the Command-R 35B model (128k context) can be found at:
https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:
1) New hyper parameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert the model to GGUF format yourself:
1) Download the Command-R Hugging Face safetensors:
   git lfs install
   git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01
2) Run:
   python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
2024-03-15  llava : change API to pure C style for Rust FFI bindgen (#6079)  (Ting Lou)
Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>
2024-03-15  cuda : disable unused cudaLaunchHostFunc code (#6078)  (slaren)
2024-03-15  fix set main gpu error (#6073)  (Neo Zhang Jianyu)
2024-03-15  make : ggml-metal.o depends on ggml.h  (Georgi Gerganov)
2024-03-15  [SYCL] Fix non-intel device selection (#6042)  (AidanBeltonS)
* Fix non-intel device selection
* Update ggml-sycl.cpp
* Update ggml-sycl.cpp

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-03-15  gguf : add support for I64 and F64 arrays (#6062)  (Ondřej Čertík)
* gguf : add support for I64 and F64 arrays

  GGML currently does not support I64 or F64 arrays, and they are not often used
  in machine learning. However, if the need arises in the future, it would be
  nice to add them now, so that the types sit next to the other types I8, I16,
  I32 in the enums, and their type numbers are reserved.

  Furthermore, with this addition the GGUF format becomes very usable for most
  computational applications of NumPy (being compatible with the most common
  NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster and more
  versatile alternative to the `npz` format, and a simpler alternative to the
  `hdf5` format.

  The change in this PR seems small, not significantly increasing the
  maintenance burden. I tested this from Python using GGUFWriter/Reader and
  `gguf-dump`, as well as from C; everything seems to work.

* Fix compiler warnings
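The NumPy compatibility argument above boils down to a small dtype-to-tensor-type mapping. The sketch below illustrates the idea; it is not the gguf-py API, and the `GGML_TYPE_*` values shown are assumed from the ggml.h enum around these commits, so verify them against your checkout before relying on them.

```python
# Illustrative mapping from NumPy dtype names to GGML tensor type ids, as the
# commits above describe for I8/I16/I32/I64/F64 support. The numeric values are
# assumptions based on ggml.h at the time; double-check against your tree.

GGML_TYPE_FOR_DTYPE = {
    "float32": 0,   # GGML_TYPE_F32
    "float16": 1,   # GGML_TYPE_F16
    "int8":    24,  # GGML_TYPE_I8
    "int16":   25,  # GGML_TYPE_I16
    "int32":   26,  # GGML_TYPE_I32
    "int64":   27,  # GGML_TYPE_I64
    "float64": 28,  # GGML_TYPE_F64
}

def ggml_type_for(dtype_name: str) -> int:
    """Resolve a NumPy dtype name to a GGML tensor type id, or fail loudly."""
    try:
        return GGML_TYPE_FOR_DTYPE[dtype_name]
    except KeyError:
        raise ValueError(f"unsupported dtype for GGUF: {dtype_name}") from None

print(ggml_type_for("int64"))
```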
2024-03-15  llama : add Orion chat template (#6066)  (Xuan Son Nguyen)
2024-03-15  llama-bench : use random tokens to improve accuracy with mixtral (#6069)  (slaren)
2024-03-14  llama : fix integer overflow during quantization (#6063)  (Georgi Gerganov)
2024-03-14  gguf : fix resource leaks (#6061)  (Steve Grubb)
There are several places where a gguf context is allocated, but a call to
gguf_free is missing in some error paths. Also, on Linux, llama-bench was
missing an fclose.
2024-03-14  gguf-py : bump version to 0.8.0 (#6060)  (Ondřej Čertík)
2024-03-14  llama : support models without vocabulary (#5798)  (Michael Podvitskiy)
* additional methods to read model and ctx parameters
* vocab size as part of the model metadata
* models without vocabulary, convert.py part
* models without vocabulary, llama.cpp part
* PR clean up
* converter script fixes
* llama_vocab_type update (renamed the new key)
* pr review fixes
* revert function renaming
* one more NoVocab assert
2024-03-14  embedding : add EOS token if not present (#899)  (Georgi Gerganov)
2024-03-14  gguf-py : fix dtype check (#6045)  (Georgi Gerganov)
2024-03-14  readme : improve readme for Llava-1.6 example (#6044)  (Jian Liao)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14  server : disable debug release type sanitizer, simplify trigger (#6047)  (Pierrick Hymbert)
- increase timeout for server
- do not fail fast
2024-03-14  llama : fix typo  (Georgi Gerganov)
2024-03-14  llama : optimize defrag moves + fix fragmentation calculation (#6037)  (Michael Podvitskiy)
* attempt to reduce the impact of a worst-case scenario
* fragmentation calculation fix
* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14  gguf-py : add support for I8, I16 and I32 (#6045)  (Ondřej Čertík)
* Refactor dtype handling to be extensible
  This code is equivalent to before, but it is now prepared to easily add more
  NumPy dtypes.
* Add support for I8, I16 and I32
  These types are allowed in the GGUF specification.
* Add support for I8, I16 and I32 to gguf_writer
* Add support for I8, I16 and I32 to gguf_reader
2024-03-14  ggml : designate enum vals for integer types (#6050)  (Georgi Gerganov)
2024-03-14  embedding : print all resulting embeddings (#899)  (Georgi Gerganov)
2024-03-14  metal : build metallib + fix embed path (#6015)  (Georgi Gerganov)
* metal : build metallib + fix embed path ggml-ci
* metal : fix embed build + update library load logic ggml-ci
* metal : fix embedded library build ggml-ci
* ci : fix iOS builds to use embedded library
2024-03-14  embedding : print cosine similarity (#899)  (Georgi Gerganov)
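The cosine-similarity printout added to the embedding example computes the standard dot(a, b) / (|a| * |b|) between two embedding vectors. A minimal, self-contained sketch (not the llama.cpp implementation):

```python
# Minimal cosine similarity between two embedding vectors, as printed by the
# embedding example: dot product divided by the product of the vector norms.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate (all-zero) embedding; define similarity as 0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```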
2024-03-13  readme : update details about running llama in Termux on Android (#6039)  (Linwei Wang)
2024-03-13  readme : update API changes and hot topics  (Georgi Gerganov)
2024-03-13  grammar : handle missing "root" node (#6004)  (Clint Herron)
2024-03-13  llama : add pipeline parallelism support (#6017)  (slaren)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism
  Tested to work correctly with both `main` and `parallel` examples.
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
  default increased to 4 (from 2); changing this value may improve performance
  for some systems, but increases memory usage
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-causal attention
* llama : do not limit n_batch to n_ctx with non-causal attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
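The n_batch/n_ubatch distinction introduced above can be sketched simply: a logical batch of n_tokens is processed in micro-batches of at most n_ubatch tokens, which is what gives the scheduler units it can pipeline across GPUs. This is an illustrative sketch of the splitting only, not llama.cpp code; `split_ubatches` is a hypothetical name.

```python
# Sketch of the batch/micro-batch split behind the -ub/--ubatch-size option:
# cover n_tokens with chunks of at most n_ubatch tokens each.

def split_ubatches(n_tokens: int, n_ubatch: int):
    """Yield (start, size) ranges covering n_tokens in chunks of n_ubatch."""
    for start in range(0, n_tokens, n_ubatch):
        yield start, min(n_ubatch, n_tokens - start)

print(list(split_ubatches(10, 4)))  # [(0, 4), (4, 4), (8, 2)]
```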
2024-03-13  test-backend-ops : skip CPU backend by default (#6028)  (slaren)
2024-03-13  Update get version (#6025)  (AidanBeltonS)
2024-03-13  Server: Use multi-task for embeddings endpoint (#6001)  (Xuan Son Nguyen)
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
2024-03-12  ci : remove tidy-review (#6021)  (slaren)
2024-03-12  ggml : reuse quantum structs across backends (#5943)  (Georgi Gerganov)
* ggml : reuse quant blocks across backends ggml-ci
* ggml : define helper constants only for CUDA and SYCL ggml-ci
* ggml : define helper quantum constants for SYCL ggml-ci
2024-03-12  ggml : fix UB in IQ2_S and IQ3_S (#6012)  (Georgi Gerganov)
2024-03-12  sycl : update IQ1_S kernels (WIP - not working!) (#5995)  (Georgi Gerganov)
* sycl : try to fix after IQ1_S changes
* sycl : iq1s_grid -> iq1s_grid_gpu
* sycl : fix grid type
2024-03-11  grammar : fix unnecessarily retained pointer to rules (#6003)  (gliptic)
2024-03-11  1.5 bit: we can do even better (#5999)  (Kawrakow)
* iq1_s : we can do even better
  Spent one of the 4 scale bits on the sign of a 0.125 shift. I.e., quants are
  now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same
  performance as before. PPL(LLaMA-v2-7B) is now 11.85!
* iq1_s : make scalar and AVX2 work with the new version
* iq1_s : make Neon work with the new version. ~10% drop in performance, so it
  will need some more work.
* iq1_s : make Metal work with the new version
* iq1_s : very slightly faster dequantize on Metal
* iq1_s : fix dequantize on the CPU

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
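The 1.5-bit scheme described above stores quants as -1 + delta, delta, and 1 + delta, with delta = +/-0.125 chosen per block by spending one scale bit on the sign of the shift. A small sketch enumerating the representable (unscaled) levels; the function name is illustrative, not the ggml implementation:

```python
# Enumerate the unscaled iq1_s quant levels for a block, given the sign bit of
# the 0.125 shift described in the commit message above.

def iq1s_values(delta_sign: int):
    """Unscaled quant levels for a block whose shift-sign bit is +1 or -1."""
    delta = 0.125 * delta_sign
    return [-1.0 + delta, delta, 1.0 + delta]

print(iq1s_values(+1))  # [-0.875, 0.125, 1.125]
print(iq1s_values(-1))  # [-1.125, -0.125, 0.875]
```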
2024-03-11  llama : more consistent names of count variables (#5994)  (Georgi Gerganov)
* llama : more consistent names of count variables ggml-ci
* llama : n_parallel -> n_seq_max
* common : fix param name
* examples : fix param name
2024-03-11  llama : refactor unicode stuff (#5992)  (Georgi Gerganov)
* llama : refactor unicode stuff ggml-ci
* unicode : names
* make : fix c++ compiler
* unicode : names
* unicode : straighten tables
* zig : fix build
* unicode : put nfd normalization behind API ggml-ci
* swift : fix build
* unicode : add BOM
* unicode : add <cstdint> ggml-ci
* unicode : pass cpts as const ref
2024-03-11  Update server docker image URLs (#5997)  (Jakub N)
2024-03-11  Server: format error to json (#5961)  (Xuan Son Nguyen)
* server: format error to json
* server: do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
2024-03-11  ggml, ci : Windows ARM runner and build fixes (#5979)  (Michael Podvitskiy)
* windows arm ci
* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64
* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`
* fix `error C2065: '__fp16': undeclared identifier`
2024-03-11  server : maintain chat completion id for streaming responses (#5988)  (Minsoo Cheong)
* server : maintain chat completion id for streaming responses
* Update examples/server/utils.hpp
* Update examples/server/utils.hpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>