Age  Commit message  Author
2024-03-15  llava : change API to pure C style for Rust FFI bindgen (#6079)  (Ting Lou)
Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>
2024-03-15  cuda : disable unused cudaLaunchHostFunc code (#6078)  (slaren)
2024-03-15  fix set main gpu error (#6073)  (Neo Zhang Jianyu)
2024-03-15  make : ggml-metal.o depends on ggml.h  (Georgi Gerganov)
2024-03-15  [SYCL] Fix non-intel device selection (#6042)  (AidanBeltonS)
* Fix non-intel device selection
* Update ggml-sycl.cpp
  Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Update ggml-sycl.cpp
  Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-03-15  gguf : add support for I64 and F64 arrays (#6062)  (Ondřej Čertík)
* gguf : add support for I64 and F64 arrays

  GGML currently does not support I64 or F64 arrays, and they are not often used in machine learning. However, if the need arises in the future, it is convenient to add them now, so that the types sit next to the other types I8, I16, I32 in the enums, and so that their type numbers are reserved.

  Furthermore, with this addition the GGUF format becomes very usable for most computational applications of NumPy (being compatible with the most common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster and more versatile alternative to the `npz` format, and a simpler alternative to the `hdf5` format.

  The change in this PR is small and should not significantly increase the maintenance burden. I tested this from Python using GGUFWriter/Reader and `gguf-dump`, as well as from C; everything seems to work.
* Fix compiler warnings
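The NumPy compatibility the commit describes amounts to a dtype-to-GGUF-array-type mapping. A minimal sketch of that idea, with the caveat that the lookup table below is illustrative and not taken from gguf-py's actual code:

```python
# Illustrative mapping from NumPy dtype names to GGUF array element types.
# The type names follow the commit above; the table itself is a hypothetical
# sketch, not gguf-py's real dispatch code.
GGUF_TYPE_FOR_DTYPE = {
    "int8": "I8",
    "int16": "I16",
    "int32": "I32",
    "int64": "I64",    # added by this commit
    "float32": "F32",
    "float64": "F64",  # added by this commit
}

def gguf_type_for(dtype_name: str) -> str:
    """Return the GGUF array type for a NumPy dtype name, or raise."""
    try:
        return GGUF_TYPE_FOR_DTYPE[dtype_name]
    except KeyError:
        raise ValueError(f"dtype {dtype_name!r} has no GGUF array type")
```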
2024-03-15  llama : add Orion chat template (#6066)  (Xuan Son Nguyen)
2024-03-15  llama-bench : use random tokens to improve accuracy with mixtral (#6069)  (slaren)
2024-03-14  llama : fix integer overflow during quantization (#6063)  (Georgi Gerganov)
2024-03-14  gguf : fix resource leaks (#6061)  (Steve Grubb)
There are several places where a gguf context is allocated, and a call to gguf_free is missing in some error paths. Also, on Linux, llama-bench was missing an fclose.
2024-03-14  gguf-py : bump version to 0.8.0 (#6060)  (Ondřej Čertík)
2024-03-14  llama : support models without vocabulary (#5798)  (Michael Podvitskiy)
* additional methods to read model and ctx parameters
* vocab size as a part of the model metadata
* models without vocabulary, convert.py part
* models without vocabulary, llama.cpp part
* PR clean up
* converter script fixes
* llama_vocab_type update (renamed the new key)
* PR review fixes
* revert function renaming
* one more NoVocab assert
2024-03-14  embedding : add EOS token if not present (#899)  (Georgi Gerganov)
2024-03-14  gguf-py : fix dtype check (#6045)  (Georgi Gerganov)
2024-03-14  readme : improve readme for the LLaVA-1.6 example (#6044)  (Jian Liao)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14  server : disable debug release type sanitizer, simplify trigger (#6047)  (Pierrick Hymbert)
- increase timeout for server
- do not fail fast
2024-03-14  llama : fix typo  (Georgi Gerganov)
2024-03-14  llama : optimize defrag moves + fix fragmentation calculation (#6037)  (Michael Podvitskiy)
* attempt to reduce the impact of a worst-case scenario
* fragmentation calculation fix
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14  gguf-py : add support for I8, I16 and I32 (#6045)  (Ondřej Čertík)
* Refactor dtype handling to be extensible

  This code is equivalent to before, but now it is prepared to easily add more NumPy dtypes.
* Add support for I8, I16 and I32

  These types are allowed in the GGUF specification.
* Add support for I8, I16 and I32 to gguf_writer
* Add support for I8, I16, I32 to gguf_reader
2024-03-14  ggml : designate enum vals for integer types (#6050)  (Georgi Gerganov)
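Designating explicit enum values matters for a serialized format: once type IDs are written into files on disk, inserting a new enum member must not silently renumber the existing ones. A small sketch of the principle; the numeric values below are illustrative placeholders, not ggml's actual type IDs:

```python
from enum import IntEnum

# Sketch of why a serialized format pins its enum values explicitly.
# The numbers here are hypothetical, NOT ggml's real tensor type IDs.
class TensorType(IntEnum):
    F32 = 0
    F16 = 1
    # ... quantized types would occupy the middle of the range ...
    I8  = 24
    I16 = 25
    I32 = 26

# Because every member carries an explicit number, adding a new type later
# (e.g. I64 = 27) cannot shift the IDs that existing files already rely on.
```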
2024-03-14  embedding : print all resulting embeddings (#899)  (Georgi Gerganov)
2024-03-14  metal : build metallib + fix embed path (#6015)  (Georgi Gerganov)
* metal : build metallib + fix embed path
  ggml-ci
* metal : fix embed build + update library load logic
  ggml-ci
* metal : fix embedded library build
  ggml-ci
* ci : fix iOS builds to use embedded library
2024-03-14  embedding : print cosine similarity (#899)  (Georgi Gerganov)
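The cosine similarity printed by the embedding example is the standard normalized dot product. A dependency-free sketch of the metric (the embedding example itself computes this in C++ over llama.cpp output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors.

    Returns dot(a, b) / (|a| * |b|); similarity with a zero vector is
    defined here as 0.0 by convention.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```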
2024-03-13  readme : update details about running llama in Termux on Android (#6039)  (Linwei Wang)
2024-03-13  readme : update API changes and hot topics  (Georgi Gerganov)
2024-03-13  grammar : handle missing "root" node (#6004)  (Clint Herron)
2024-03-13  llama : add pipeline parallelism support (#6017)  (slaren)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs
  ggml-ci
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism
  Tested to work correctly with both `main` and `parallel` examples.
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
  default increased to 4 (from 2); changing this value may improve performance for some systems, but increases memory usage
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-causal attention
* llama : do not limit n_batch to n_ctx with non-causal attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
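The batch-size clamping rules in the commit above (limit n_batch and n_ubatch to n_ctx with causal attention, but not with non-causal attention) can be sketched as follows. The function name and shape are hypothetical, not llama.cpp's actual API:

```python
def clamp_batch_sizes(n_ctx, n_batch, n_ubatch, causal_attn=True):
    """Sketch of the batch-size rules from the pipeline-parallelism commit.

    Assumptions (illustrative, not the real implementation):
    - with causal attention, a logical batch larger than the context window
      cannot be used, so n_batch is capped at n_ctx;
    - the micro-batch (n_ubatch) never exceeds the logical batch (n_batch).
    """
    if causal_attn:
        n_batch = min(n_batch, n_ctx)
    n_ubatch = min(n_ubatch, n_batch)
    return n_batch, n_ubatch
```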
2024-03-13  test-backend-ops : skip CPU backend by default (#6028)  (slaren)
2024-03-13  Update get version (#6025)  (AidanBeltonS)
2024-03-13  server : use multi-task for embeddings endpoint (#6001)  (Xuan Son Nguyen)
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
2024-03-12  ci : remove tidy-review (#6021)  (slaren)
2024-03-12  ggml : reuse quantum structs across backends (#5943)  (Georgi Gerganov)
* ggml : reuse quant blocks across backends
  ggml-ci
* ggml : define helper constants only for CUDA and SYCL
  ggml-ci
* ggml : define helper quantum constants for SYCL
  ggml-ci
2024-03-12  ggml : fix UB in IQ2_S and IQ3_S (#6012)  (Georgi Gerganov)
2024-03-12  sycl : update IQ1_S kernels (WIP - not working!) (#5995)  (Georgi Gerganov)
* sycl : try to fix after IQ1_S changes
* sycl : iq1s_grid -> iq1s_grid_gpu
* sycl : fix grid type
2024-03-11  grammar : fix unnecessarily retained pointer to rules (#6003)  (gliptic)
2024-03-11  1.5 bit: we can do even better (#5999)  (Kawrakow)
* iq1_s: we can do even better

  Spent one of the 4 scale bits on the sign of a 0.125 shift, i.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85!
* iq1_s: make scalar and AVX2 work with the new version
* iq1_s: make Neon work with the new version; ~10% drop in performance, so it will need some more work
* iq1_s: make Metal work with the new version
* iq1_s: very slightly faster dequantize on Metal
* iq1_s: fix dequantize on the CPU
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
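The new IQ1_S value grid described above is small enough to enumerate: each quant in {-1, 0, 1} is shifted by a per-block delta of +/-0.125, whose sign costs one of the four scale bits. A sketch of the resulting dequantized values (the helper is illustrative, not the actual kernel code):

```python
def iq1s_values(delta_sign):
    """Enumerate the IQ1_S dequantized values for one sign choice.

    Per the commit message: quants are -1 + delta, delta, 1 + delta,
    where delta is +0.125 or -0.125 depending on the stored sign bit.
    """
    delta = 0.125 if delta_sign >= 0 else -0.125
    return [q + delta for q in (-1, 0, 1)]
```

Note that 0.125 is exactly representable in binary floating point, so the shifted grid is exact.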
2024-03-11  llama : more consistent names of count variables (#5994)  (Georgi Gerganov)
* llama : more consistent names of count variables
  ggml-ci
* llama : n_parallel -> n_seq_max
* common : fix param name
* examples : fix param name
2024-03-11  llama : refactor unicode stuff (#5992)  (Georgi Gerganov)
* llama : refactor unicode stuff
  ggml-ci
* unicode : names
* make : fix c++ compiler
* unicode : names
* unicode : straighten tables
* zig : fix build
* unicode : put nfd normalization behind API
  ggml-ci
* swift : fix build
* unicode : add BOM
* unicode : add <cstdint>
  ggml-ci
* unicode : pass cpts as const ref
2024-03-11  Update server docker image URLs (#5997)  (Jakub N)
2024-03-11  server : format error to json (#5961)  (Xuan Son Nguyen)
* server : format error to json
* server : do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
2024-03-11  ggml, ci : Windows ARM runner and build fixes (#5979)  (Michael Podvitskiy)
* windows arm ci
* fix `error C2078: too many initializers` with the ggml_vld1q_u32 macro for MSVC ARM64
* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`
* fix `error C2065: '__fp16': undeclared identifier`
2024-03-11  server : maintain chat completion id for streaming responses (#5988)  (Minsoo Cheong)
* server : maintain chat completion id for streaming responses
* Update examples/server/utils.hpp
* Update examples/server/utils.hpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11  cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)  (Gilad S)
2024-03-11  llama : fix F16/F32 downcast + improve names (#5980)  (Georgi Gerganov)
2024-03-11  Better 1.5 bit quantization (#5971)  (Kawrakow)
* Trying blocks of 16 for IQ1_S - seems slightly better
* iq1s_blocks16: Adjust scale fudge factor to 1.125
* iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw
  This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights.
* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
* iq1s_blocks16: scalar and AVX2 dot products
* iq1s_blocks16: CUDA dot product
* iq1s_blocks16: Metal works, Neon does not
  Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now.
* iq1s_blocks16: fixed Neon
* iq1s_blocks16: very slightly faster TG on Metal
  Still pathetic at 37 t/s
* iq1s_blocks16: speedup Metal by packing the codebook into uint32_t's
* Formatting
* iq1s_blocks16: uint32_t codebook is also better in CUDA
  TG-128 is now 204 t/s, up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants
* iq1s_blocks16: slightly faster Neon dot product
* iq1s_blocks16: faster AVX2 dot product
* iq1s_blocks16: adjust to ggml-common.h
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11  [SYCL] Add q3_s and q1_s (#5886)  (Abhilash Majumder)
* Add q3_s and q1_s
* fix compilation
* fix build
* fix build
* fix build
* enable ops
* rm macro
* increase grid space
2024-03-11  [SYCL] Add support for SYCL Nvidia target (#5738)  (AidanBeltonS)
* Add support for nvidia target in CMake
* Update sycl read-me for Nvidia target
* Fix errors
2024-03-10  metal : move mm_id indices to shared mem (#5982)  (Georgi Gerganov)
2024-03-10  android : fix utf8 decoding error (#5935)  (Dean)
* examples : fix utf8 decoding error

  Some models have a tokenizer that decodes an id into an incomplete utf8 sequence, so we need to validate and wait for the next token. One example model is https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf, and an example of such a token is 18137.
* android : minor
---------
Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
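The validation described above amounts to checking whether a byte buffer ends in a truncated UTF-8 multi-byte sequence; if it does, the decoder should buffer it and wait for the next token. A hypothetical sketch of that check (not the actual llama.cpp code, which does this in C++):

```python
def utf8_is_complete(buf: bytes) -> bool:
    """Return False if buf ends in a truncated UTF-8 multi-byte sequence.

    UTF-8 continuation bytes match 0b10xxxxxx; lead bytes 0b110xxxxx,
    0b1110xxxx, 0b11110xxx announce 1, 2, or 3 continuation bytes.
    """
    i = len(buf)
    n_cont = 0
    # Walk back over at most 3 trailing continuation bytes.
    while i > 0 and n_cont < 3 and (buf[i - 1] & 0xC0) == 0x80:
        i -= 1
        n_cont += 1
    if i == 0:
        return True  # no lead byte found; nothing to wait for
    lead = buf[i - 1]
    if (lead & 0xE0) == 0xC0:
        need = 1  # 2-byte sequence
    elif (lead & 0xF0) == 0xE0:
        need = 2  # 3-byte sequence
    elif (lead & 0xF8) == 0xF0:
        need = 3  # 4-byte sequence
    else:
        need = 0  # ASCII (or not a lead byte)
    return n_cont >= need
```

When the check returns False, the caller holds the bytes back and appends the next token's bytes before trying to decode again.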
2024-03-10  readme : update hot topics  (Georgi Gerganov)