2024-03-14  ggml : designate enum vals for integer types (#6050)  [Georgi Gerganov]
2024-03-14  embedding : print all resulting embeddings (#899)  [Georgi Gerganov]
2024-03-14  metal : build metallib + fix embed path (#6015)  [Georgi Gerganov]
  * metal : build metallib + fix embed path ggml-ci
  * metal : fix embed build + update library load logic ggml-ci
  * metal : fix embedded library build ggml-ci
  * ci : fix iOS builds to use embedded library
2024-03-14  embedding : print cosine similarity (#899)  [Georgi Gerganov]
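The cosine similarity printed between embedding vectors follows the usual definition, dot(a, b) / (|a| * |b|). A minimal standalone sketch of that computation (illustrative only, not the example's own code):

```cpp
#include <cmath>
#include <cstddef>

// Cosine similarity between two embedding vectors of length n.
// Illustrative sketch, not the code from the embedding example itself.
static float cosine_similarity(const float * a, const float * b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) {
        return 0.0f; // degenerate (all-zero) embedding
    }
    return (float)(dot / (std::sqrt(na) * std::sqrt(nb)));
}
```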
2024-03-13  readme : update details about running llama in Termux on Android (#6039)  [Linwei Wang]
2024-03-13  readme : update API changes and hot topics  [Georgi Gerganov]
2024-03-13  grammar : handle missing "root" node (#6004)  [Clint Herron]
2024-03-13  llama : add pipeline parallelism support (#6017)  [slaren]
  * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci
  * server : add -ub, --ubatch-size parameter
  * fix server embedding test
  * llama : fix Mamba inference for pipeline parallelism
    Tested to work correctly with both `main` and `parallel` examples.
  * llama : limit max batch size to n_batch
  * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
    default increased to 4 (from 2)
    changing this value may improve performance for some systems, but increases memory usage
  * fix hip build
  * fix sycl build (disable cpy_tensor_async)
  * fix hip build
  * llama : limit n_batch and n_ubatch to n_ctx during context creation
  * llama : fix norm backend
  * batched-bench : sync after decode
  * swiftui : sync after decode
  * ggml : allow ggml_get_rows to use multiple threads if they are available
  * check n_ubatch >= n_tokens with non-causal attention
  * llama : do not limit n_batch to n_ctx with non-causal attn
  * server : construct batch with size of llama_n_batch
  * ggml_backend_cpu_graph_compute : fix return value when alloc fails
  * llama : better n_batch and n_ubatch comment
  * fix merge
  * small fix
  * reduce default n_batch to 2048
  ---------
  Co-authored-by: Francis Couture-Harpin <git@compilade.net>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
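To illustrate the n_batch / n_ubatch relationship introduced above: a logical batch of up to n_batch tokens is processed in physical micro-batches of at most n_ubatch tokens, and those micro-batches are what can be pipelined across devices. A minimal standalone sketch of that splitting loop, with assumed default values (not the actual llama.cpp scheduler):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical illustration: split a logical batch (up to n_batch tokens)
// into physical micro-batches of at most n_ubatch tokens each.
int main() {
    const int n_batch  = 2048; // logical batch size (default reduced to 2048 in this PR)
    const int n_ubatch = 512;  // physical micro-batch size (-ub, --ubatch-size)

    std::vector<int> tokens(std::min(1500, n_batch)); // some pending tokens to decode

    for (size_t i = 0; i < tokens.size(); i += n_ubatch) {
        const size_t n = std::min((size_t) n_ubatch, tokens.size() - i);
        // each micro-batch can be in flight on a different pipeline stage,
        // which is what enables pipeline parallelism across multiple GPUs
        std::printf("micro-batch: tokens [%zu, %zu)\n", i, i + n);
    }
    return 0;
}
```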
2024-03-13  test-backend-ops : skip CPU backend by default (#6028)  [slaren]
2024-03-13  Update get version (#6025)  [AidanBeltonS]
2024-03-13  Server: Use multi-task for embeddings endpoint (#6001)  [Xuan Son Nguyen]
  * use multitask for embd endpoint
  * specify types
  * remove redundant {"n_predict", 0}
2024-03-12  ci : remove tidy-review (#6021)  [slaren]
2024-03-12  ggml : reuse quantum structs across backends (#5943)  [Georgi Gerganov]
  * ggml : reuse quant blocks across backends ggml-ci
  * ggml : define helper constants only for CUDA and SYCL ggml-ci
  * ggml : define helper quantum constants for SYCL ggml-ci
2024-03-12  ggml : fix UB in IQ2_S and IQ3_S (#6012)  [Georgi Gerganov]
2024-03-12  sycl : update IQ1_S kernels (WIP - not working!) (#5995)  [Georgi Gerganov]
  * sycl : try to fix after IQ1_S changes
  * sycl : iq1s_grid -> iq1s_grid_gpu
  * sycl : fix grid type
2024-03-11  grammar : fix unnecessarily retained pointer to rules (#6003)  [gliptic]
2024-03-11  1.5 bit: we can do even better (#5999)  [Kawrakow]
  * iq1_s: we can do even better
    Spent one of the 4 scale bits on the sign of a 0.125 shift, i.e. quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125.
    CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85!
  * iq1_s: make scalar and AVX2 work with the new version
  * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work.
  * iq1_s: make Metal work with new version
  * iq1_s: very slightly faster dequantize on Metal
  * iq1_s: fix dequantize on the CPU
  ---------
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
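A standalone sketch of the value grid described above, under assumed naming (not the actual IQ1_S kernel code): each quant maps to one of -1, 0, +1, shifted by a per-block delta of +/- 0.125 whose sign comes from the repurposed scale bit, then scaled by the block scale.

```cpp
#include <cstdio>

// Hypothetical illustration of the 1.5-bit value grid after this change:
// a ternary quant q in {-1, 0, +1} plus a per-block shift delta = +/- 0.125,
// scaled by the per-block scale d. Not the actual IQ1_S dequantization code.
static float dequant_1p5bit(int q /* -1, 0 or +1 */, bool shift_sign, float d) {
    const float delta = shift_sign ? -0.125f : 0.125f;
    return d * ((float) q + delta);
}

int main() {
    const float d = 2.0f; // example block scale
    for (int q = -1; q <= 1; ++q) {
        std::printf("q=%+d -> %+.3f (delta +0.125) / %+.3f (delta -0.125)\n",
                    q, dequant_1p5bit(q, false, d), dequant_1p5bit(q, true, d));
    }
    return 0;
}
```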
2024-03-11  llama : more consistent names of count variables (#5994)  [Georgi Gerganov]
  * llama : more consistent names of count variables ggml-ci
  * llama : n_parallel -> n_seq_max
  * common : fix param name
  * examples : fix param name
2024-03-11  llama : refactor unicode stuff (#5992)  [Georgi Gerganov]
  * llama : refactor unicode stuff ggml-ci
  * unicode : names
  * make : fix c++ compiler
  * unicode : names
  * unicode : straighten tables
  * zig : fix build
  * unicode : put nfd normalization behind API ggml-ci
  * swift : fix build
  * unicode : add BOM
  * unicode : add <cstdint> ggml-ci
  * unicode : pass cpts as const ref
2024-03-11  Update server docker image URLs (#5997)  [Jakub N]
2024-03-11  Server: format error to json (#5961)  [Xuan Son Nguyen]
  * server: format error to json
  * server: do not crash on grammar error
  * fix api key test case
  * revert limit max n_predict
  * small fix
  * correct coding style
  * update completion.js
  * launch_slot_with_task
  * update docs
  * update_slots
  * update webui
  * update readme
2024-03-11  ggml, ci : Windows ARM runner and build fixes (#5979)  [Michael Podvitskiy]
  * windows arm ci
  * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64
  * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`
  * fix `error C2065: '__fp16': undeclared identifier`
2024-03-11  server : maintain chat completion id for streaming responses (#5988)  [Minsoo Cheong]
  * server: maintain chat completion id for streaming responses
  * Update examples/server/utils.hpp
  * Update examples/server/utils.hpp
  ---------
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11  cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)  [Gilad S]
2024-03-11  llama : fix F16/F32 downcast + improve names (#5980)  [Georgi Gerganov]
2024-03-11  Better 1.5 bit quantization (#5971)  [Kawrakow]
  * Trying blocks of 16 for IQ1_S - seems slightly better
  * iq1s_blocks16: Adjust scale fudge factor to 1.125
  * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw.
    This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights.
  * iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
  * iq1s_blocks16: scalar and AVX2 dot products
  * iq1s_blocks16: CUDA dot product
  * iq1s_blocks16: Metal works, Neon does not
    Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now.
  * iq1s_blocks16: fixed Neon
  * iq1s_blocks16: very slightly faster TG on Metal
    Still pathetic at 37 t/s
  * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's
  * Formatting
  * iq1s_blocks16: uint32_t codebook is also better in CUDA
    TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants
  * iq1s_blocks16: slightly faster Neon dot product
  * iq1s_blocks16: faster AVX2 dot product
  * iq1s_blocks16: adjust to ggml-common.h
  ---------
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
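The "Use 2*<x^2> as sigma2" item refers to the per-block term used to weight quantization errors during the search for the best 1.5-bit encoding. A standalone sketch of that computation under assumed naming (not the actual quantization code):

```cpp
#include <cstddef>

// Hypothetical illustration: sigma2 used as a weight-adjustment term when
// quantizing a block of weights. Per the commit message it is taken as
// 2 * <x^2>, i.e. twice the mean of the squared block values.
static float block_sigma2(const float * x, size_t n) {
    double sum_x2 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum_x2 += (double) x[i] * x[i];
    }
    return (float)(2.0 * sum_x2 / (double) n);
}
```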
2024-03-11  [SYCL] Add q3_s and q1_s (#5886)  [Abhilash Majumder]
  * Add q3_s and q1_s
  * fix compilation
  * fix build
  * fix build
  * fix build
  * enable ops
  * rm macro
  * increase grid space
2024-03-11  [SYCL] Add support for SYCL Nvidia target (#5738)  [AidanBeltonS]
  * Add support for nvidia target in CMake
  * Update sycl read-me for Nvidia target
  * Fix errors
2024-03-10  metal : move mm_id indices to shared mem (#5982)  [Georgi Gerganov]
2024-03-10  android : fix utf8 decoding error (#5935)  [Dean]
  * examples: fix utf8 decoding error
    Some models have a tokenizer that decodes an id into an incomplete utf8 sequence, so we need to validate and wait for the next token. One example is https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and an example of such a token is 18137.
  * android : minor
  ---------
  Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
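The fix described above hinges on detecting that a decoded piece ends partway through a multi-byte UTF-8 sequence, so the caller can buffer it and wait for the next token before displaying anything. A standalone sketch of that check (illustrative only, not the code used in the example):

```cpp
#include <string>

// Hypothetical illustration: return true if `s` ends with a complete UTF-8
// sequence, false if the last code point is still missing continuation bytes
// (in which case the caller should wait for the next decoded token).
static bool utf8_is_complete(const std::string & s) {
    // walk back over trailing continuation bytes (10xxxxxx)
    size_t i = s.size();
    size_t cont = 0;
    while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return cont == 0; // empty, or nothing but continuation bytes
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t expected = 1;
    if      ((lead & 0x80) == 0x00) expected = 1; // ASCII
    else if ((lead & 0xE0) == 0xC0) expected = 2; // 2-byte sequence
    else if ((lead & 0xF0) == 0xE0) expected = 3; // 3-byte sequence
    else if ((lead & 0xF8) == 0xF0) expected = 4; // 4-byte sequence
    return cont + 1 >= expected;
}
```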
2024-03-10  readme : update hot topics  [Georgi Gerganov]
2024-03-10  sync : ggml  [Georgi Gerganov]
2024-03-10  ggml : try fix 32-bit arm compat (whisper/1938)  [Georgi Gerganov]
  * ggml : try fix 32-bit arm compat
  * ggml : fix cont
2024-03-10  ggml : remove __constant__ specifier for CUDA tables (#5940)  [Georgi Gerganov]
2024-03-10  server: ci: windows build and tests (#5968)  [Pierrick Hymbert]
  * server: ci: windows build and tests
  * server: ci: remove tmp push branch
  * server: ci: EOF EOL
  * Use builti
    Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
  * server: tests: server graceful shutdown, then kill, then hard kill
  * server: tests: remove python2 unicode string
  * server: tests: remove wrong comment on server starting, close_fds is always true
  * server: tests: server kill, if pid exists
  * server: tests: remove dependency to killall
  * server: tests: ci windows: pid exists better handling
  ---------
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-10  llama : add support for GritLM (#5959)  [DAN™]
  * add gritlm example
  * gritlm results match
  * tabs to spaces
  * comment out debug printing
  * rebase to new embed
  * gritlm embeddings are back babeee
  * add to gitignore
  * allow to toggle embedding mode
  * Clean-up GritLM sample code.
  * Fix types.
  * Flush stdout and output ending newline if streaming.
  * mostly style fixes; correct KQ_mask comment
  * add causal_attn flag to llama_cparams
  * gritlm : minor
  * llama : minor
  ---------
  Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10  grammar : verify parsed state (#5950)  [Clint Herron]
2024-03-10  nix: update flake.lock (#5969)  [Georgi Gerganov]
  Flake lock file updates:
  • Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
    → 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-09  server: benchmark: chat/completions scenario and other llm servers comparison (#5941)  [Pierrick Hymbert]
  * server: bench: init a bench scenario with K6
    See #5827
  * server: bench: EOL EOF
  * server: bench: PR feedback and improved k6 script configuration
  * server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading
    server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS
    server: bench: increase truncated rate to 80% before failing
  * server: bench: fix doc
  * server: bench: change gauge custom metrics to trend
  * server: bench: change gauge custom metrics to trend
    server: bench: add trend custom metrics for total tokens per second average
  * server: bench: doc: add an option to debug http request
  * server: bench: filter out dataset sequences that are too short or too long
  * server: bench: allow filtering out conversations in the dataset based on an env variable
  * server: bench: fix assistant message sent instead of user message
  * server: bench: fix assistant message sent instead of user message
  * server : add defrag thold parameter
  * server: bench: select prompts based on the current iteration id, not randomly, to make the bench more reproducible
  ---------
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-09  server : print chat template info  [Georgi Gerganov]
2024-03-09  perplexity : support using multiple sequences to allow larger batch sizes (#5946)  [slaren]
  * perplexity : support using multiple sequences to allow larger batch sizes ggml-ci
  * set cparams.n_parallel to the number of sequences
  * print tested n_ctx, add assert
2024-03-09  readme : update hot topics  [Georgi Gerganov]
2024-03-09  ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951)  [Georgi Gerganov]
2024-03-09  server : fix metrics init (#5964)  [Georgi Gerganov]
2024-03-09  ggml : remove old quantization functions (#5942)  [Georgi Gerganov]
  * ggml : remove old quantization functions ggml-ci
  * ggml : simplify ggml_quantize_chunk ggml-ci
  * ggml : restrict correctness ggml-ci
  * ggml : remove hist data from the quantization API ggml-ci
  * tests : remove hist usage in test-backend-ops ggml-ci
  * vulkan : remove hist and fix typo
2024-03-09  server : clarify some items in the readme (#5957)  [Georgi Gerganov]
  * server : clarify some items in the readme
  * server : fix typo
2024-03-09  server : normalize embeddings (#5956)  [SeungWon Jeong]
  * output normalized embedding in '/v1/embeddings'
  * common : reuse llama_embd_normalize
  * common : better normalize impl
  ---------
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
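Normalizing an embedding here means scaling it to unit Euclidean length before it is returned by '/v1/embeddings', so that cosine similarity between two normalized vectors reduces to a plain dot product. A standalone sketch of that operation under assumed naming (not the shared llama_embd_normalize implementation itself):

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical illustration: scale an embedding vector to unit L2 norm.
static void embd_normalize(const float * inp, float * out, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += (double) inp[i] * inp[i];
    }
    const double norm = sum > 0.0 ? std::sqrt(sum) : 1.0; // avoid division by zero
    for (size_t i = 0; i < n; ++i) {
        out[i] = (float)(inp[i] / norm);
    }
}
```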
2024-03-09  tests : gitignore ggml-common.h  [Georgi Gerganov]
2024-03-09  server : fix passing prompt as tokens (#5955)  [Alexey Parfenov]
  * server: fix passing prompt as tokens
  * Update examples/server/server.cpp
  ---------
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-09  ggml : add ggml-common.h to deduplicate shared code (#5940)  [Georgi Gerganov]
  * ggml : add ggml-common.h to shared code ggml-ci
  * scripts : update sync scripts
  * sycl : reuse quantum tables ggml-ci
  * ggml : minor
  * ggml : minor
  * sycl : try to fix build