Age | Commit message | Author
2024-05-31 | scripts: update compare_llama_bench.py [no ci] (#7673) | Johannes Gäßler
2024-05-31 | Improve HIP compatibility (#7672) | Daniele
2024-05-31 | readme : link homebrew discussion | Georgi Gerganov
2024-05-31 | ggml : fix loongson compile warnings (#7537) | Georgi Gerganov
* ggml : fix loongson compile warnings ggml-ci
* Fix loongarch quantize test failure and an unexpected error introduced during rebase.
* tests : disable json test due to lack of Python on the CI node ggml-ci
Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-05-31 | Somehow '**' got lost (#7663) | Galunid
2024-05-31 | Add convert.py removal to hot topics (#7662) | Galunid
2024-05-31 | [no ci] docs: add aikit to readme (#7650) | Sertaç Özercan
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-05-30 | Fixed painfully slow single process builds. (#7326) | JohnnyB
* Fixed painfully slow single process builds.
* Added nproc for systems that don't default to nproc
2024-05-31 | llama : cache llama_token_to_piece (#7587) | Georgi Gerganov
* llama : cache llama_token_to_piece ggml-ci
* llama : use vectors and avoid has_cache ggml-ci
* llama : throw on unknown tokenizer types ggml-ci
* llama : print a log of the total cache size
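The caching idea in the entry above lends itself to a short sketch: precompute the piece for every token id into a plain vector so later lookups are a single index. `token_to_piece` and `build_piece_cache` here are hypothetical stand-ins for illustration, not the actual llama.cpp API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-token detokenization routine.
static std::string token_to_piece(int32_t token) {
    return "piece_" + std::to_string(token);
}

// Build the cache once: one precomputed piece per token id, stored in a plain
// vector so later lookups are a simple index instead of repeated detokenization.
static std::vector<std::string> build_piece_cache(int32_t n_vocab) {
    std::vector<std::string> cache;
    cache.reserve(n_vocab);
    size_t size_total = 0;
    for (int32_t id = 0; id < n_vocab; ++id) {
        cache.push_back(token_to_piece(id));
        size_total += cache.back().size(); // the commit also logs the total cache size
    }
    (void) size_total;
    return cache;
}
```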
2024-05-31 | Fix conan badge display [no ci] (#7645) | Martin Delille
2024-05-31 | Add brew installation instruction to README [no ci] (#7616) | Manuel
2024-05-30 | readme : add Conan badge (#7638) | Martin Delille
2024-05-30 | github: add contact links to issues and convert question into research [no ci] (#7612) | Brian
2024-05-30 | Move convert.py to examples/convert-legacy-llama.py (#7430) | Galunid
* Move convert.py to examples/convert-no-torch.py
* Fix CI, scripts, readme files
* convert-no-torch -> convert-legacy-llama
* Move vocab thing to vocab.py
* Fix convert-no-torch -> convert-legacy-llama
* Fix lost convert.py in ci/run.sh
* Fix imports
* Fix gguf not imported correctly
* Fix flake8 complaints
* Fix check-requirements.sh
* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE
* Review fixes
2024-05-30 | faster avx512 exp implementation (#7551) | Chris Elrod
* faster avx512 exp implementation
* x->r
* improve accuracy, handle special cases
* remove `e`
2024-05-30 | ggml : fix loongarch build (O2 issue) (#7636) | junchao-loongson
2024-05-30 | README: explain parallel build [no ci] (#7618) | Johannes Gäßler
2024-05-30 | [SYCL] fix intel docker (#7630) | Meng, Hengyu
* Update main-intel.Dockerfile
* workaround for https://github.com/intel/oneapi-containers/issues/70
* reset intel docker in CI
* add missed in server
2024-05-30 | gguf-py : Add tokenizer.ggml.pre to gguf-new-metadata.py (#7627) | Galunid
2024-05-29 | metal : remove invalid asserts (#7617) | Georgi Gerganov
2024-05-29 | metal : add missing asserts (#7617) | Georgi Gerganov
2024-05-29 | ggml : fix YARN + add tests + add asserts (#7617) | Georgi Gerganov
* tests : add rope tests ggml-ci
* ggml : fixes (hopefully) ggml-ci
* tests : add non-cont tests ggml-ci
* cuda : add asserts for rope/norm + fix DS2 ggml-ci
* ggml : assert contiguousness
* tests : reduce RoPE tests ggml-ci
2024-05-29 | cuda : non-cont concat support (#7610) | Georgi Gerganov
* tests : add non-cont concat tests
* cuda : non-cont concat support ggml-ci
2024-05-29 | llama-bench : add support for the RPC backend (#7435) | Radoslav Gerganov
2024-05-29 | ggml : use atomic_flag for critical section (#7598) | slaren
* ggml : use atomic_flag for critical section
* add windows shims
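The critical-section change above amounts to a spinlock built on an atomic flag. Below is a minimal C++ sketch of that technique; the real ggml code is C and adds the mentioned shims for Windows, so treat this as an illustration of the idea rather than the actual implementation.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Spinlock on std::atomic_flag: test_and_set() spins until the flag is clear,
// clear() releases it.
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;
static long long g_counter = 0;

static void critical_increment() {
    while (g_lock.test_and_set(std::memory_order_acquire)) {
        // busy-wait until the current holder releases the lock
    }
    ++g_counter; // protected region
    g_lock.clear(std::memory_order_release);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([] { for (int j = 0; j < 100000; ++j) critical_increment(); });
    }
    for (auto & t : workers) { t.join(); }
    return g_counter == 4 * 100000 ? 0 : 1;
}
```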
2024-05-29 | scripts : remove mpi remnants | Georgi Gerganov
2024-05-29 | sync : ggml | Georgi Gerganov
2024-05-29 | ggml : restore ggml_rope_xpos_inplace (ggml/0) | Georgi Gerganov
ggml-ci
2024-05-29 | Add Arc A750 and Arch linux to readme-sycl.md as verified GPU model and Linux distro (#7605) | Akarshan Biswas
2024-05-29 | ggml : fix typo in ggml.c (#7603) | zhouwg
2024-05-29 | [SYCL] Align GEMM dispatch (#7566) | Meng, Hengyu
* align GEMM dispatch
2024-05-28 | Tokenizer WPM fixes (#7500) | jaime-m-p
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
2024-05-28 | sycl : fix assert (#7563) | Georgi Gerganov
2024-05-28 | llama : support small Granite models (#7481) | Giuseppe Scrivano
* Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/llama.cpp/issues/7116; still needs some more changes to properly support Granite.
* llama: honor add_space_prefix from the model configuration. Propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer.
* llama: add support for small Granite models. It works only for the small 3b and 8b models. The convert-hf-to-gguf.py script uses the vocabulary size of the Granite models to detect Granite and set the correct configuration.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
2024-05-28 | vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (#7552) | k.h.lai
2024-05-28 | rpc : resource management rework (#7562) | Radoslav Gerganov
* rpc : resource management rework
* address review comments
2024-05-28 | Add support for DeepseekV2ForCausalLM (#7519) | fairydreaming
* common : increase max number of experts to 160
* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by the DeepSeek-V2 MLA (multi-head latent attention) architecture
* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier
* convert-hf : add model conversion support for DeepseekV2ForCausalLM
* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models
* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)
* llama : add inference support for LLM_ARCH_DEEPSEEK2
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
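The new scale_w / w_scale arguments described above can be pictured with a small sketch: after the router has picked the top-k experts, their routing weights are optionally multiplied by a constant factor before the expert outputs are combined. The function below is a hypothetical simplification on plain floats; the real code operates on ggml tensors inside llm_build_moe_ffn().

```cpp
#include <vector>

// Simplified view of the scale_w / w_scale idea: optionally rescale the
// routing weights of the selected experts by a constant factor.
static void moe_scale_weights(std::vector<float> & selected_weights,
                              bool scale_w, float w_scale) {
    if (!scale_w) {
        return; // other architectures leave the router weights untouched
    }
    for (float & w : selected_weights) {
        w *= w_scale;
    }
}
```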
2024-05-28 | tests : fix test-tokenizer-0.sh | Georgi Gerganov
2024-05-28 | llama : handle unknown utf8 bytes (#7588) | Georgi Gerganov
2024-05-28 | github: add refactor to issue template (#7561) | Brian
* github: add refactor issue template [no ci]
* Update 07-refactor.yml
2024-05-28 | [SYCL] fix ggml_sycl_mul_mat_id() to match the change of api (#7436) | Neo Zhang
* fix mul_mat_id to match the change of api
* rm comment
* rm unused or duplicated code, rename as per review comment
2024-05-28 | ggml : generalize GGML_OP_CONCAT (#7563) | Georgi Gerganov
* ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci
* tests : add dim != 2 tests
* metal : generalize concat kernel
* tests : naming
* cuda : generalize concat kernel ggml-ci
* sycl : add warning and assert
* ggml : fix op params handling
* metal : bugfix kernel ggml-ci
* ggml : reimplement CPU and Metal
* cuda : add asserts ggml-ci
* ggml : fix ptrs ggml-ci
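Generalizing GGML_OP_CONCAT means concatenation is no longer tied to one hard-coded dimension. The sketch below shows the shape bookkeeping for concatenating two contiguous row-major buffers along an arbitrary dimension; it is an illustration on flat std::vector data (outermost dimension first), not the ggml kernel itself, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Concatenate two contiguous row-major tensors along dimension `dim`.
// Shapes are given outermost-first (last dimension varies fastest) and must
// match in every dimension except `dim`.
static std::vector<float> concat_dim(const std::vector<float> & a, const std::vector<int64_t> & ne_a,
                                     const std::vector<float> & b, const std::vector<int64_t> & ne_b,
                                     int dim) {
    assert(ne_a.size() == ne_b.size());
    const int n = (int) ne_a.size();

    int64_t inner = 1, outer = 1;
    for (int i = dim + 1; i < n; ++i) { assert(ne_a[i] == ne_b[i]); inner *= ne_a[i]; }
    for (int i = 0; i < dim;     ++i) { assert(ne_a[i] == ne_b[i]); outer *= ne_a[i]; }

    const int64_t block_a = ne_a[dim] * inner; // contiguous slab per outer index in a
    const int64_t block_b = ne_b[dim] * inner; // contiguous slab per outer index in b

    std::vector<float> out;
    out.reserve((size_t) (outer * (block_a + block_b)));
    for (int64_t o = 0; o < outer; ++o) {
        out.insert(out.end(), a.begin() + o * block_a, a.begin() + (o + 1) * block_a);
        out.insert(out.end(), b.begin() + o * block_b, b.begin() + (o + 1) * block_b);
    }
    return out;
}
```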
2024-05-28 | server: do not remove whitespace at the start of a completion chunk (#7524) | mgroeber9110
2024-05-28 | Markdownish code block fix (#7571) | Nathan Epstein
* markdownish codeblock fix
* updating regexes
2024-05-28 | llava : update clip.h (#7580) | Ikko Eltociear Ashimine
overriden -> overridden
2024-05-28 | update HIP_UMA #7399 (#7414) | Djip007
* update HIP_UMA #7399: add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled. Gets 2x on prompt eval and 1.5x on token gen with ROCm 6.0 on a Ryzen 7940HX iGPU (780M/gfx1103).
* simplify code, more consistent style
Co-authored-by: slaren <slarengh@gmail.com>
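A rough sketch of the UMA path the change above speeds up, assuming a HIP toolchain: allocate managed (unified) memory and then advise the runtime to treat it as coarse-grained. The helper name is hypothetical; in llama.cpp the equivalent logic sits inside the ggml HIP backend behind the LLAMA_HIP_UMA option.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Allocate a buffer in unified memory and mark it coarse-grained, which is the
// advice this commit adds to avoid fine-grained coherence overhead on iGPUs.
static void * alloc_uma_buffer(size_t size, int device) {
    void * ptr = nullptr;
    if (hipMallocManaged(&ptr, size) != hipSuccess) {
        return nullptr;
    }
    // best effort: ignore failure on platforms without coarse-grain support
    (void) hipMemAdvise(ptr, size, hipMemAdviseSetCoarseGrain, device);
    return ptr;
}

int main() {
    void * buf = alloc_uma_buffer(1 << 20, /*device =*/ 0);
    std::printf("UMA buffer: %p\n", buf);
    if (buf != nullptr) {
        (void) hipFree(buf);
    }
    return 0;
}
```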
2024-05-28 | adding in x64 targets to cmake presets (#7574) | kunnis
2024-05-27 | make: add --device-debug to NVCC debug flags (#7542) | Johannes Gäßler
2024-05-27 | Allow multiple copy function pointers for CUDA graph kernel param updates (#7565) | agray3
CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, but this caused a bug in cases where 2 such kernels exist. This fixes the issue by using a vector to allow multiple function pointers to be stored and checked against. Fixes #7942
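The fix described above replaces a single remembered copy-kernel pointer with a collection. Below is a small sketch of that bookkeeping, with illustrative types and names rather than the actual ggml-cuda symbols.

```cpp
#include <algorithm>
#include <vector>

// Track every CPY kernel seen while capturing the graph instead of just one,
// then match nodes against the whole set when updating kernel parameters.
using kernel_ptr = void (*)();

struct cpy_kernel_registry {
    std::vector<kernel_ptr> ptrs; // previously a single pointer

    // remember a kernel function pointer if it is not already known
    void record(kernel_ptr p) {
        if (std::find(ptrs.begin(), ptrs.end(), p) == ptrs.end()) {
            ptrs.push_back(p);
        }
    }

    // check whether a node's kernel is one of the recorded CPY kernels
    bool is_cpy_kernel(kernel_ptr p) const {
        return std::find(ptrs.begin(), ptrs.end(), p) != ptrs.end();
    }
};
```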
2024-05-27 | Fix q_xxs using mul_mat_q (#7459) | AidanBeltonS