2024-04-06  ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)  (Pierrick Hymbert)
  * ci: bench: support sse and fix prompt processing time
  * server: add tokens usage in stream mode
  * ci: bench: README.md EOL
  * ci: bench: remove total pp and tg as it is not accurate
  * ci: bench: fix case when there is no token generated
  * ci: bench: change to the 95th percentile for pp and tg, as it is closer to what the server exports in metrics
  * ci: bench: fix finish reason rate
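For downstream consumers of the change above: a streamed completion can now report token counts in its final chunk. A minimal client-side sketch, assuming the stream follows the OpenAI-style SSE format and that the `usage` object uses the OpenAI field names `prompt_tokens` and `completion_tokens` (the exact payload shape is an assumption here), using nlohmann/json as the server examples do:

```cpp
// Sketch: extract token usage from the last SSE chunk of a streamed
// /v1/chat/completions response. Field names assume the OpenAI schema;
// verify against the actual server output.
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

void handle_sse_line(const std::string & line) {
    const std::string prefix = "data: ";
    if (line.rfind(prefix, 0) != 0 || line == "data: [DONE]") {
        return; // not a data line, or end-of-stream sentinel
    }
    auto chunk = nlohmann::json::parse(line.substr(prefix.size()));
    if (chunk.contains("usage") && !chunk["usage"].is_null()) {
        const auto & u = chunk["usage"];
        std::printf("prompt: %d tokens, completion: %d tokens\n",
                    u.value("prompt_tokens", 0),
                    u.value("completion_tokens", 0));
    }
}
```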
2024-04-05  gguf.py : add licence and version to gguf writer (#6504)  (Brian)
2024-04-05  readme : update UI list (#6503)  (Hoang Nguyen)
  * Add MindMac to UI list
  * Update proprietary description
  Co-authored-by: slaren <slarengh@gmail.com>
2024-04-05  bench : make n_batch and n_ubatch configurable in Batched bench (#6500)  (Ting Sun)
  * bench: make n_batch and n_ubatch configurable
  * bench: update doc for batched bench
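For reference, the two knobs map to fields of `llama_context_params`; a minimal sketch of wiring them through when creating a context (the field names `n_batch`/`n_ubatch` follow llama.h of this period and should be treated as an assumption):

```cpp
// Sketch: expose the logical batch (n_batch) and physical micro-batch
// (n_ubatch) when creating a context, which is what batched-bench now
// lets you vary from the command line.
#include <cstdint>
#include "llama.h"

llama_context * make_bench_ctx(llama_model * model,
                               uint32_t n_batch, uint32_t n_ubatch) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_batch  = n_batch;   // max tokens submitted per llama_decode call
    cparams.n_ubatch = n_ubatch;  // max tokens processed per compute graph
    return llama_new_context_with_model(model, cparams);
}
```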
2024-04-05  [SYCL] Fixed minor bug when enabling FP16 for non-Intel targets (#6464)  (Ouadie EL FAROUKI)
  * moved INTEL_MKL guard from gemm_impl to gemm (wrapper)
  * Update ggml-sycl.cpp
  Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-04  readme : add Dot to UI list (#6487)  (alexpinel)
2024-04-04  readme : fix typo (#6481)  (Jun Jie)
2024-04-04  server: add cURL support to server Dockerfiles (#6474)  (Ed Lepedus)
  * server: add cURL support to `full.Dockerfile`
  * server: add cURL support to `full-cuda.Dockerfile` and `server-cuda.Dockerfile`
  * server: add cURL support to `full-rocm.Dockerfile` and `server-rocm.Dockerfile`
  * server: add cURL support to `server-intel.Dockerfile`
  * server: add cURL support to `server-vulkan.Dockerfile`
  * fix typo in `server-vulkan.Dockerfile`
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-04  ci: exempt master branch workflows from getting cancelled (#6486)  (Minsoo Cheong)
  * ci: exempt master branch workflows from getting cancelled
  * apply to bench.yml
2024-04-04  build CI: Name artifacts (#6482)  (Ewout ter Hoeven)
  Name the artifacts in the build CI so that they get uploaded with separate names, instead of all being put into the same `artifact` ZIP. It might be possible to further simplify the packing step (in future PRs).
2024-04-04  server: allow penalizing repetition of newlines on server webpage (#6431)  (Shakhar Dasgupta)
2024-04-04  ci: bench: fix concurrency for workflow trigger dispatch with sha1 (#6478)  (Pierrick Hymbert)
2024-04-04  Correct README link (#6458)  (limitedAtonement)
  The README is called README.md.
2024-04-04  ci: bench: add more ftype, fix triggers and bot comment (#6466)  (Pierrick Hymbert)
  * ci: bench: change trigger path to not spawn on each PR
  * ci: bench: add more file types for phi-2: q8_0 and f16; do not show the comment by default
  * ci: bench: add seed parameter in k6 script
  * ci: bench: artefact name for the perf job
  * Add iteration in the commit status, reduce the auto-comment again
  * ci: bench: add per-slot metric in the commit status
  * Fix trailing spaces
2024-04-04  common: remove duplicate check for curl (#6471)  (Daniel Bevenius)
  This commit removes one of the two identical checks for curl being NULL in llama_load_model_from_url.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
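As context, the pattern being deduplicated looks roughly like the following: the handle from `curl_easy_init()` is checked for NULL exactly once, right after creation. A simplified sketch (not the actual llama_load_model_from_url implementation), using the standard libcurl easy API:

```cpp
// Sketch: single NULL check after curl_easy_init(), instead of
// repeating the same check later in the function.
#include <cstdio>
#include <curl/curl.h>

bool download_model(const char * url) {
    CURL * curl = curl_easy_init();
    if (!curl) { // the one and only handle check
        std::fprintf(stderr, "failed to initialize libcurl\n");
        return false;
    }
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}
```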
2024-04-04  examples : add GBNF validator program (#5948)  (Clint Herron)
  * Revising GBNF validator program to be much simpler.
  * Changing from streams to using cstdio.
  * Adding final newline character.
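For illustration, the streams-to-cstdio change amounts to reading files with the C stdio API rather than `<fstream>`. A minimal sketch of reading a whole grammar file that way (a generic pattern, not the validator's actual code):

```cpp
// Sketch: read an entire file into a std::string using cstdio
// instead of C++ stream classes.
#include <cstdio>
#include <string>

static bool read_file(const char * path, std::string & out) {
    FILE * f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    out.resize(size > 0 ? (size_t) size : 0);
    size_t n_read = std::fread(out.data(), 1, out.size(), f);
    std::fclose(f);
    return n_read == out.size();
}
```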
2024-04-04  server : remove obsolete --memory-f32 option  (Georgi Gerganov)
2024-04-04  server : add option to disable KV offload (#6468)  (Xiao-Yong Jin)
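As a usage sketch, disabling KV offload maps to the `offload_kqv` flag in `llama_context_params` (the flag name is taken from llama.h of this period; treat the exact spelling as an assumption):

```cpp
// Sketch: keep the KV cache in host memory by turning off KV offload
// when creating a context. Assumes llama_context_params exposes an
// `offload_kqv` boolean, as in llama.h around this time.
#include "llama.h"

llama_context * make_ctx_no_kv_offload(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.offload_kqv = false; // KV cache stays on the CPU
    return llama_new_context_with_model(model, cparams);
}
```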
2024-04-04  convert : fix for lint error complaining of bare except (#6470)  (Clint Herron)
2024-04-03  A few small fixes to server's README docs (#6428)  (Fattire)
  * Fix minor typo ("tonen") in the server README.
  * Grammar/style fixes to the server README: went through the file looking for inconsistencies in how defaults and flag options are presented, plus typos and grammar issues. Not perfect, but hopefully improved.
  * Remove an extra space before a newline.
2024-04-03  server : handle exception on wrong type in request (#6452)  (JH23X)
  Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
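For context, the fix follows the standard pattern of catching nlohmann/json type errors around request parsing and answering with a 400 instead of crashing. A minimal sketch of that pattern (generic; not the server's exact handler):

```cpp
// Sketch: turn a wrong-typed JSON field into a 400 response instead of
// an uncaught exception. nlohmann::json::exception is the base class of
// parse_error, type_error, out_of_range, etc.
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

// Returns an HTTP-style status code; fills `prompt` on success.
int parse_request(const std::string & body, std::string & prompt) {
    try {
        auto j = nlohmann::json::parse(body);
        prompt = j.at("prompt").get<std::string>(); // throws if not a string
        return 200;
    } catch (const nlohmann::json::exception & e) {
        std::fprintf(stderr, "bad request: %s\n", e.what());
        return 400;
    }
}
```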
2024-04-03  llama : add SEA-LION support (#6448)  (bryanSwk)
  * initial commit for sealion support
  * add sealion support
  * minor fix
  * q/k ln and pos_embd only if required
  * Apply suggestions from code review
  * minor : clear whitespaces
  Co-authored-by: bryan <bryansiow@aisingapore.org>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03  ci : update checkout, setup-python and upload-artifact to latest (#6456)  (Ewout ter Hoeven)
  * CI: Update actions/checkout to v4
  * CI: Update actions/setup-python to v5
  * CI: Update actions/upload-artifact to v4
2024-04-03  server: add cURL support to `server.Dockerfile` (#6461)  (Ed Lepedus)
2024-04-03  readme : add feature-rich Rust bindings (#6465)  (Francisco Melo)
2024-04-03  security : create policy (#6354)  (Joyce)
  * Create SECURITY.md
  * Fix links in SECURITY.md
  * minor fixes
  Signed-off-by: Joyce <joycebrum@google.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03  Missing tokenizer.model error during gguf conversion (#6443)  (Abhishek Gopinath K)
  Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-04-03  Add OpenChat, Alpaca, Vicuna chat templates (#6397)  (kaizau)
  * Add openchat chat template
  * Add chat template test for openchat
  * Add chat template for vicuna
  * Add chat template for orca-vicuna
  * Add EOS for vicuna templates
  * Combine vicuna chat templates
  * Add tests for openchat and vicuna chat templates
  * Add chat template for alpaca
  * Add separate template name for vicuna-orca
  * Remove alpaca, match deepseek with jinja output
  * Regenerate chat template test with add_generation_prompt
  * Separate deepseek bos from system message
  * Match openchat template with jinja output
  * Remove BOS token from templates, unprefix openchat
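As a usage sketch, these templates are consumed through `llama_chat_apply_template`. The call below assumes the signature in llama.h of this period, that a shorthand template name like "vicuna" is accepted in the `tmpl` argument, and that a null model pointer is allowed when a template is named explicitly; all three are assumptions to verify against your header:

```cpp
// Sketch: render a hypothetical two-message chat with the "vicuna"
// template. Signature and semantics follow llama.h as of early 2024.
#include <cstdio>
#include <cstdint>
#include <vector>
#include "llama.h"

void render_vicuna_chat() {
    llama_chat_message chat[] = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };
    std::vector<char> buf(1024);
    int32_t n = llama_chat_apply_template(
        /* model   */ nullptr,   // use the named template, not model metadata
        /* tmpl    */ "vicuna",
        /* chat    */ chat,
        /* n_msg   */ 2,
        /* add_ass */ true,      // append the assistant prefix
        buf.data(), (int32_t) buf.size());
    if (n > 0 && n <= (int32_t) buf.size()) {
        std::printf("%.*s\n", n, buf.data());
    }
}
```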
2024-04-03  readme : update hot topics  (Georgi Gerganov)
2024-04-03  ggml : mul_mat_id use the same tensor for all the experts (#6387)  (slaren)
  * ggml : update mul_mat_id to use the same tensor for all the experts
  * update cuda
  * update metal
  * update test-backend-ops
  * fix cuda
  * Update ggml-metal.m
  * update convert.py and convert-hf-to-gguf.py, including convert.py for mixtral hf models
  * cuda : support non-pow-2 number of experts
  * allow quantize to work for split and merged experts models in the same way
  * cleanup + disable mmap automatically with split tensors models
  * update imatrix
  * test-backend-ops : test qwen argsort
  * update grok model loading
  * llama : add merged experts tensors to the grok tensor map
  * gguf : bump version
  * fix quantizing of merged experts
  * convert-hf-to-gguf.py : update grok (untested)
  * make linter happy
  * cuda/argsort : use shared memory instead of pool memory
  * convert : fix grok tensor names
  * metal : add support for non-pow-2 argsort
  * llama : more loader cleanup, better error checking
  * cuda : fix warning
  * llama : still use mmap for loading old models, but copy the data to a host buffer
  * llama : remove ffn tensor counting + add sanity check
  * convert : fix handling of n_experts == None
  * imatrix : fix ncall counters
  * llama : produce error if imatrix size does not match
  * quantize : terminate on errors + trace logs
  * metal : pad shared memory to 16 bytes
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
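Conceptually, the refactor merges each layer's per-expert 2-D weight tensors into a single 3-D tensor indexed by expert, which is what lets quantize, the mmap handling, and the backends treat split and merged models uniformly. A rough ggml-level sketch of the two layouts (shapes and names are illustrative, not the loader's actual code):

```cpp
#include "ggml.h"

// Sketch: per-expert 2-D tensors vs. a single 3-D tensor holding all
// experts, the layout that mul_mat_id now indexes into.
ggml_tensor * make_merged_experts(ggml_context * ctx,
                                  int64_t n_embd, int64_t n_ff, int64_t n_expert) {
    // old layout: one tensor per expert, repeated n_expert times
    //   ggml_tensor * gate_i = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_embd, n_ff);

    // new layout: every expert in one tensor, expert index on dim 2
    return ggml_new_tensor_3d(ctx, GGML_TYPE_F16, n_embd, n_ff, n_expert);
}
```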
2024-04-03  [SYCL] Disable iqx on Windows as a workaround (#6435)  (Meng, Hengyu)
  * disable iqx on windows as WA
  * array instead of global_memory
2024-04-01  flake.lock: Update (#6402)  (Georgi Gerganov)
  Flake lock file updates:
  • Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
    → 'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089' (2024-03-29)
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-04-01  compare-llama-bench.py: fix long hexsha args (#6424)  (Johannes Gäßler)
2024-04-01  ci: server: verify deps are coherent with the commit (#6409)  (Pierrick Hymbert)
  * ci: server: verify deps are coherent with the commit
  * ci: server: change the ref to build, as it is now a pull event target
2024-03-31  readme : update hot topics  (Georgi Gerganov)
2024-03-30  ci: bench: fix "Resource not accessible by integration" on PR event (#6393)  (Pierrick Hymbert)
2024-03-29  Fedora build update (#6388)  (Mohammadreza Hendiani)
  * fixed deprecated address (applied in three places)
  * Added the 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing (applied in three places); explanation of the licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions
  * reverted back to only the MIT license
2024-03-29  split: allow --split-max-size option (#6343)  (Xuan Son Nguyen)
  * split by max size
  * clean up arg parse
  * split: ok
  * add dry run option
  * error on 0 tensors
  * be positive
  * remove next_metadata_size
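To illustrate, `--split-max-size` takes a human-readable size such as `500M` or `2G`. A small, hypothetical parser for that suffix convention (illustrative only; the accepted suffixes here are an assumption, and the real option parsing lives in the gguf-split example):

```cpp
// Sketch: parse a size argument like "500M" or "2G" into bytes.
// Hypothetical helper, not the actual gguf-split code.
#include <cstdint>
#include <cstdlib>
#include <string>

static int64_t parse_split_max_size(const std::string & arg) {
    char * end = nullptr;
    int64_t n = std::strtoll(arg.c_str(), &end, 10);
    if (end == arg.c_str() || n < 0) {
        return -1; // not a number
    }
    switch (*end) {
        case 'M':  return n * 1024 * 1024;
        case 'G':  return n * 1024 * 1024 * 1024;
        case '\0': return n; // plain bytes
        default:   return -1; // unknown suffix
    }
}
```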
2024-03-29  Vulkan k-quant mmq and ggml-backend offload functionality (#6155)  (0cc4m)
  * Fix Vulkan no-kv-offload incoherence
  * Add k-quant mul mat mat shaders
  * Rework working buffer allocation, reduces VRAM use noticeably; clean up CPU-assist code, replaced with the ggml-backend offload function
  * Default to all dedicated GPUs
  * Add fallback for integrated GPUs if no dedicated GPUs are found
  * Add debug info on which device is allocating memory
  * Fix Intel dequant issue and validation issue
  * Fix Vulkan GGML_OP_GET_ROWS implementation
  * Clean up merge artifacts
  * Remove Vulkan warning
2024-03-29  sync : ggml (#6351)  (Georgi Gerganov)
  * sync : ggml
  * cuda : move GGML_CUDA_DMMV constants to dmmv.cuh
  Co-authored-by: slaren <slarengh@gmail.com>
2024-03-29  [Model] Add support for xverse (#6301)  (hxer7963)
  * Support converting xverse models to gguf format
  * Convert xverse models to gguf; add LLM_ARCH_XVERSE inference in llama.cpp; add an xverse entry under Supported models in README.md
  * gguf-py: remove redundant logs
  * llama: remove the init_mapping_prefetch custom parameter
  * llama.cpp: include the changes from #6122 to exclude the unused outputs of the last layers
  * Fix format issues; remove duplicate set of kqv_out to llm_build_kv
  * Update llama.cpp
  Co-authored-by: willhe <willhe@xverse.cn>
  Co-authored-by: willhe <hexin@xverse.cn>
2024-03-29  ci : fix BGE wget (#6383)  (Georgi Gerganov)
2024-03-29  readme : add project (#6356)  (zhouwg)
  * readme: add Android UI binding
  * Update README.md
2024-03-29  cmake : add explicit metal version options (#6370)  (Matt Clayton)
  * cmake: add explicit metal version options
  * Update CMakeLists.txt
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-29  llama : remove redundant reshape in build_kv_store (#6369)  (Daniel Bevenius)
  This commit removes the reshape of the V matrix in build_kv_store. The motivation is that the V matrix already has the shape:

  ```console
  (gdb) p *v_cur
  $46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
         ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
         op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>}, flags = 0,
         grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0,
         0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
         view_src = 0x0, view_offs = 0, data = 0x0,
         name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0,
         padding = "\000\000\000\000\000\000\000"}
  ```

  And after reshaping this tensor we get:

  ```console
  (gdb) p *ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens)
  $44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
         ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
         op = GGML_OP_RESHAPE, op_params = {0 <repeats 16 times>}, flags = 0,
         grad = 0x0, src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
         0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
         view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0,
         name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0,
         padding = "\000\000\000\000\000\000\000"}
  ```

  The `src` and `view_src` fields differ, but the dimensions are the same, so the reshape call is not needed and can be removed.
  * llama : add assert
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
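To see why the reshape is removable: reshaping a tensor to its own dimensions only creates a view with identical extents. A standalone sketch against the ggml C API demonstrating the same-shaped reshape (illustrative, not the build_kv_store code itself):

```cpp
#include <cassert>
#include "ggml.h"

// Sketch: a reshape to the tensor's existing dimensions yields a view
// with the same extents, matching the gdb output quoted above.
void reshape_noop_demo() {
    ggml_init_params params = {
        /* mem_size   */ 1024 * 1024,
        /* mem_buffer */ nullptr,
        /* no_alloc   */ true, // shapes only, no tensor data
    };
    ggml_context * ctx = ggml_init(params);

    ggml_tensor * v  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 512);
    ggml_tensor * vr = ggml_reshape_2d(ctx, v, 4096, 512); // same shape

    assert(vr->ne[0] == v->ne[0] && vr->ne[1] == v->ne[1]);
    assert(vr->view_src == v); // vr is only a view of v

    ggml_free(ctx);
}
```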
2024-03-29  convert : allow conversion of Mistral HF models (#6144)  (Pedro Cuenca)
  * Allow conversion of Mistral HF models
  * Homogenize Llama, Mistral, Mixtral under the same entry
  * Fix tokenizer, permute tensors
  * Use sentencepiece tokenizer, or fall back to hfft
  * convert-hf : small fix for mypy
  * convert-hf : fix duplicated block_count
  * convert-hf : add vocab size to metadata
  Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-03-28  readme : add notice for UI list  (Georgi Gerganov)
2024-03-28  [SYCL] Revisited & updated SYCL build documentation (#6141)  (Ouadie EL FAROUKI)
  * Revisited & updated SYCL build documentation
  * removed outdated comment
  * addressed PR comments
  * trimmed trailing whitespace
  * added new end line
2024-03-28  convert : refactor vocab selection logic (#6355)  (Jared Van Bortel)
2024-03-28  llava : fix MobileVLM (#6364)  (Ziang Wu)
  * fix empty bug
  * Update MobileVLM-README.md: added more results on devices (several follow-up README updates)
  * Update examples/llava/MobileVLM-README.md
  * Update MobileVLM-README.md: remove gguf links
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>