Age  Commit message  Author
2024-05-21  tests : test-tokenizer-0.sh print more info (#7402)  Georgi Gerganov
2024-05-21  examples: cache hf model when --model not provided (#7353)  Amir
* examples: cache hf model when --model not provided
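A hedged illustration of the kind of invocation this caching affects, assuming the `--hf-repo`/`--hf-file` download flags and the `main` example binary of that era; the repository and file names are placeholders:

    # Hypothetical sketch (repo/file names are placeholders): when --model is omitted
    # and a Hugging Face repo is given, the downloaded GGUF should now be cached
    # locally instead of being re-fetched on every run.
    ./main --hf-repo some-org/some-model-GGUF \
           --hf-file some-model.Q4_K_M.gguf \
           -p "Hello"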
2024-05-21  CUDA: deduplicate mmq code (#7397)  Johannes Gäßler
2024-05-21  Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)  jaime-m-p
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-20  Tokenizer SPM fixes for phi-3 and llama-spm (#7375)  jaime-m-p
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-21  llama : remove Persimmon (#7408)  Georgi Gerganov
* llama : remove Persimmon
* requirements : remove
2024-05-20  perplexity: update README FP16 results [no ci] (#7413)  Johannes Gäßler
2024-05-20  rpc : track allocated buffers (#7411)  Radoslav Gerganov
* rpc : track allocated buffers
  ref: #7407
* rpc : pack rpc_tensor tightly
2024-05-20  server : fix temperature + disable some tests (#7409)  Georgi Gerganov
* server : fix temperature
* server : disable tests relying on parallel determinism
* ci : change server Debug -> RelWithDebInfo
2024-05-20  [SYCL] Update SYCL upscale operation (#7321)  AidanBeltonS
* Update SYCL upscale operation
* Formatting
* Remove messages
2024-05-20  Update README.md (#7410)  Bingan
2024-05-20  ggml-opencl, llama: using reserve() if count already known (#7272)  Herman Semenov
2024-05-20  ggml : add loongarch lsx and lasx support (#6454)  junchao-loongson
* add loongarch lsx and lasx optimize code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2
---------
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20  server : tuning tests (#7388)  Georgi Gerganov
* server : don't pass temperature as string
* server : increase timeout
* tests : fix the fix 0.8f -> 0.8
  ggml-ci
* tests : set explicit temperature
2024-05-20  server : return error on too large embedding input (#7389)  Georgi Gerganov
2024-05-20  tests : fix --keep_split -> --keep-split (#7374)  Georgi Gerganov
2024-05-20  Add provisions for Windows support for BF16 code, including CMake provision for enabling AVX512_BF16 (#7258)  Srihari-mcw
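A hedged build sketch for enabling this path; the `LLAMA_AVX512_BF16` option name is assumed from the PR description rather than quoted from it, so treat it as illustrative:

    # Hypothetical sketch: build with the AVX512_BF16 code path enabled via CMake
    # (option name assumed; requires a CPU and compiler with AVX512_BF16 support)
    cmake -B build -DLLAMA_AVX512_BF16=ON
    cmake --build build --config Release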
2024-05-20  llama : remove MPI backend (#7395)  slaren
2024-05-19  quantize : fix --keep-split check (#7374)  Fred Douglas
2024-05-19  Vulkan Embedding Fix (#7360)  0cc4m
* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
2024-05-19  ggml : fix another case of quants nans (#7387)  slaren
2024-05-19  ggml: implement quantized KV cache for FA (#7372)  Johannes Gäßler
2024-05-19  server: add test for token probs (#7347)  Johannes Gäßler
2024-05-19  server: fix seed being reported back (#7382)  Johannes Gäßler
2024-05-19  Add StableLM2 pre-tokenizer (#7349)  Anas Ahouzi
* Add StableLM pre-tokenizer
* Fix space
* Fix trailing whitespace
2024-05-19  cuda : clear error after buffer allocation failure (#7376)  slaren
2024-05-19  labeler.yml: Use settings from ggerganov/llama.cpp [no ci] (#7363)  Brian
https://github.com/actions/labeler#using-configuration-path-input-together-with-the-actionscheckout-action
recommends using the checkout action so that the labeler runs with the correct repo context when applying settings for PR labels, e.g.:

    steps:
      - uses: actions/checkout@v4 # Uploads repository content to the runner
        with:
          repository: "owner/repositoryName" # One of the available inputs; visit https://github.com/actions/checkout#readme to find more
      - uses: actions/labeler@v5
        with:
          configuration-path: 'path/to/the/uploaded/configuration/file'
2024-05-19  cmake : update android comments (#7341)  Georgi Gerganov
2024-05-19  Capture CUDA logging output (#7298)  fraxy-v
* logging: output capture in cuda module
* fix compile error
* fix: vsnprintf terminates with 0, string use not correct
* post review
* Update llama.cpp
  Co-authored-by: slaren <slarengh@gmail.com>
* Update llama.cpp
  Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-05-18  ci : re-enable sanitizer runs (#7358)  Georgi Gerganov
* Revert "ci : temporary disable sanitizer builds (#6128)"
  This reverts commit 4f6d1337ca5a409dc74aca8c479b7c34408a69c0.
* ci : trigger
2024-05-18  android : use "ci-android" branch for CI (#7341)  Georgi Gerganov
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
  ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
2024-05-18  CUDA: deduplicate FlashAttention code (#7352)  Johannes Gäßler
2024-05-18  server: correct --threads documentation [no ci] (#7362)  Johannes Gäßler
2024-05-18  cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263)  Engininja2
2024-05-18  llama : add support for larger Granite Code Models (20B, 34B) (#7324)  Steffen Röcker
Tie the weights for ARCH_STARCODER to support the larger Granite code models. Partially addresses ggerganov/issues/7116.
There still remain a few things to fix. Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false` (see the sketch below).
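A hedged usage sketch for that flag; only the `--override-kv` argument comes from the commit message, while the binary name, model file, and prompt are placeholders:

    # Hypothetical invocation: suppress the automatically added BOS token
    # for a converted Granite Code model (model filename is a placeholder)
    ./main -m granite-34b-code.Q4_K_M.gguf \
           --override-kv tokenizer.ggml.add_bos_token=bool:false \
           -p "def fibonacci(n):"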
2024-05-18  perplexity : ndot progress and show stats with < 100 tasks (#7348)  strawberrymelonpanda
Fix a floating point error in the ndot progress printing, and allow the end stats to be shown for lower task counts when running multiple-choice tasks.
2024-05-18  Update and fix Vulkan soft_max and argsort implementations (#7237)  0cc4m
* Update and fix Vulkan softmax implementation
* Update and fix Vulkan argsort implementation
2024-05-18  github-actions-labeler: initial commit (#7330)  Brian
* github-actions-labeler: initial commit [no ci]
* github actions: remove priority auto labeling [no ci]
2024-05-18  convert : fix set_vocab_sentencepiece (#6866)  Georgi Gerganov
* convert : fix set_vocab_sentencepiece
* Update convert-hf-to-gguf.py
2024-05-18  ggml : fix quants nans when all the group weights are very close to zero (#7313)  slaren
2024-05-18  cmake : fix typo in AMDGPU_TARGETS (#7356)  Engininja2
2024-05-18  Unicode codepoint flags for custom regexs (#7245)  jaime-m-p
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM
2024-05-17  CUDA: faster large batch FA without tensor cores (#7314)  Johannes Gäßler
2024-05-17  ROCm: use native CMake HIP support (#5966)  Gavin Zhao
Supersedes #4024 and #4813.

CMake's native HIP support has become the recommended way to add HIP code into a project (see [here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)). This PR makes the following changes:

1. The environment variable `HIPCXX` or CMake option `CMAKE_HIP_COMPILER` should be used to specify the HIP compiler. Notably this shouldn't be `hipcc`, but ROCm's clang, which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously this was controlled by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`. Note that since native CMake HIP support is not yet available on Windows, on Windows we fall back to the old behavior.
2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the GPU architectures to build for. Previously this was controlled by `GPU_TARGETS`.
3. Updated the Nix recipe to account for these new changes.
4. The GPU targets to build against in the Nix recipe are now consistent with the supported GPU targets in nixpkgs.
5. Added CI checks for HIP on both Linux and Windows. On Linux, we test both the new and old behavior.

The most important part of this PR is the separation of the HIP compiler from the C/C++ compiler. This allows users to choose a different C/C++ compiler if desired, compared to the current situation where, when building for ROCm support, everything must be compiled with ROCm's clang.

The Makefile used `GPU_TARGETS`, but the README says to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of `GPU_TARGETS` in the Makefile has been updated to `AMDGPU_TARGETS`.

Thanks to the suggestion of @jin-eld, to maintain backwards compatibility (and not break too many downstream users' builds), if `CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using the original behavior and emit a warning that recommends switching to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but `CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS` to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new HIP support.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
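A hedged sketch of the new configure flow described above; `HIPCXX` and `CMAKE_HIP_ARCHITECTURES` come from the commit text, while the `LLAMA_HIPBLAS` option and the gfx1100 target are assumptions for illustration:

    # Hypothetical sketch: use CMake's native HIP support with ROCm's clang
    # (LLAMA_HIPBLAS and the gfx1100 architecture are assumed, adjust for your setup)
    HIPCXX="$ROCM_PATH/llvm/bin/clang" \
    cmake -B build \
          -DLLAMA_HIPBLAS=ON \
          -DCMAKE_HIP_ARCHITECTURES=gfx1100
    cmake --build build --config Release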
2024-05-17  rpc : set SO_REUSEADDR for the server socket (#7320)  Radoslav Gerganov
ref: #7293
2024-05-17  Added a single test function script and fixed debug-test.sh to be more robust (#7279)  Brian
* run-single-test.sh: added a single test function script and fix debug-test.sh to be more robust
* debug-test.sh: combined execute and gdb test mode via -g flag
* debug-test.sh: refactor
* debug-test: refactor for clarity
* debug-test.sh: comment style changes
* debug-test.sh: fix gdb
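A hedged usage sketch for the -g mode mentioned above; the script path and test name are placeholders, only the flag itself comes from the commit message:

    # Hypothetical invocations (paths and test name assumed for illustration)
    ./scripts/debug-test.sh test-tokenizer-0        # build and execute a single test
    ./scripts/debug-test.sh -g test-tokenizer-0     # same test, but run it under gdb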
2024-05-17  py : convert-hf-to-gguf-update improvements (#7340)  Aarni Koskela
* convert-hf-to-gguf-update: automate updating
* convert-hf-to-gguf-update: improve download
* share requests session for performance
* create directories only when needed, don't skip downloads when empty directory encountered
* be more graceful about errors
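A hedged invocation sketch; the script name matches the commit, while the token argument and its format are assumptions (the script downloads tokenizer files from Hugging Face, so it needs a read token):

    # Hypothetical usage: refresh pre-tokenizer definitions and download tokenizer test data
    python3 convert-hf-to-gguf-update.py hf_xxxxxxxxxxxxxxxxxxxx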
2024-05-17  llama : use n_embd_head_v when reshaping kqv (#7327)  fairydreaming
* llama : use n_embd_head_v instead of n_embd_head_k when reshaping kqv
* llama : use n_embd_v_gqa and n_embd_head_v instead of n_embd_k_gqa and n_embd_head_k when making a view of cached value vectors.
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-17  tokenization: add warning for double BOS (#7332)  Johannes Gäßler
2024-05-17  ggml-quants, llama : removed excess checks (#7274)  Herman Semenov