* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert
* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg
* make: Use ccache for faster compilation
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Make use of ggml-quants.h possible in C++ code
* Do not define static_assert in a C++ compilation; there it is a keyword, not a macro (see the sketch after this entry)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
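A minimal sketch of the kind of guard the second bullet describes; the exact macro in ggml-quants.h may differ, so treat this as an assumption about the shape of the fix:

```c
/* In C++, static_assert is a keyword and must not be defined as a
 * macro; in C11 the underlying feature is _Static_assert. Guarding
 * the definition lets the same header compile in both languages. */
#if !defined(__cplusplus) && !defined(static_assert)
#define static_assert(cond, msg) _Static_assert(cond, msg)
#endif

/* Works when compiled as either C11 or C++11. */
static_assert(sizeof(int) >= 2, "int is too small");
```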
* Avoid duplicating function calls when using MIN/MAX macros.
Since these macros expand both "a" and "b" into the ternary, one of the two is evaluated twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but when an argument is a function call, that call simply happens twice.
By evaluating the expression into a local first, we get smaller and faster code without the duplicate calls; see ggml_rope_yarn_corr_dims in Compiler Explorer (and the sketch after this entry):
https://godbolt.org/z/Ee4KMrvKh
Behavior is exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
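A minimal C sketch of the double-evaluation pitfall described above; expensive() is a hypothetical stand-in for the real calls in ggml.c:

```c
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Hypothetical costly function standing in for real calls in ggml.c.
static float expensive(float x) {
    printf("expensive() called\n");
    return x * x;
}

int main(void) {
    float x = 3.0f;

    // The macro duplicates its argument, so expensive() runs twice:
    // ((0.0f) > (expensive(x)) ? (0.0f) : (expensive(x)))
    float y1 = MAX(0.0f, expensive(x));

    // Evaluating into a local first keeps it to a single call:
    float t  = expensive(x);
    float y2 = MAX(0.0f, t);

    printf("%f %f\n", y1, y2);
    return 0;
}
```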
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* py : fix internlm2-hf convert to gguf
* ggml-ci
We get slightly better PPL, and we cut quantization time nearly in half.
The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale obtained
that way (a sketch of the idea follows this entry).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
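A heavily simplified sketch of the two-pass idea, assuming hypothetical helpers: plain rounding stands in for the real E8-lattice projection, and the search window is illustrative, not the actual ggml-quants.c code:

```c
#include <math.h>

// Pass 1 stand-in: pick a scale with no lattice constraint, e.g. so
// the largest magnitude maps to the top of an illustrative quant range.
static float unconstrained_best_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    return amax / 7.0f; // hypothetical qmax = 7
}

// Stand-in for the (expensive) lattice step: plain rounding here; the
// real code snaps the scaled points onto the E8 lattice.
static float lattice_error(const float * x, int n, float d) {
    float err = 0.0f;
    for (int i = 0; i < n; ++i) {
        float e = x[i] - roundf(x[i] / d) * d;
        err += e * e;
    }
    return err;
}

// Pass 2: search only a narrow window around the pass-1 scale instead
// of a wide scan, so the expensive step runs far fewer times.
float quantize_block_scale(const float * x, int n) {
    float d0 = unconstrained_best_scale(x, n);
    if (d0 == 0.0f) return 0.0f; // all-zero block
    float best_d = d0, best_err = lattice_error(x, n, d0);
    for (int is = -4; is <= 4; ++is) {
        float d   = d0 * (1.0f + 0.025f * is);
        float err = lattice_error(x, n, d);
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```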
* added dynamic temp params in main
* added help text
* Update server-llm.sh
Add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
→ 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
→ 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
→ 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
CMake's option() is specifically for booleans.
Fixes #5158
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
* get max alloc size from device prop
* fix macro typo
* update guide for make installation, memory, gguf model link; remove todo for windows build
* add VS install requirement
* update for GPU device check
* update help of llama-bench
* fix grammar issues
llama_batch_init allocates memory for a fixed number of tokens.
However, llama_batch_free only freed memory for the number of tokens
that had been added to the batch.
This changeset uses a null-terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached (see the sketch
after this entry). It also renames the first parameter from `n_tokens`
to `n_tokens_alloc` to more clearly indicate that this value is the
number of tokens allocated for the batch, not the number of tokens in
the batch.
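A hedged sketch of the freeing pattern described above, using a simplified stand-in struct; the field names mirror llama_batch, but the type and function here are illustrative:

```c
#include <stdlib.h>

// Simplified stand-in for llama_batch: seq_id is a null-terminated
// array of per-token id arrays, i.e. seq_id[n_tokens_alloc] == NULL.
struct toy_batch {
    int  * token;
    int ** seq_id;
};

void toy_batch_free(struct toy_batch b) {
    free(b.token);
    if (b.seq_id) {
        // walk to the NULL sentinel instead of relying on how many
        // tokens were actually added to the batch
        for (int i = 0; b.seq_id[i] != NULL; ++i) {
            free(b.seq_id[i]);
        }
        free(b.seq_id);
    }
}
```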
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* rename no_mmap to mmap; show mmap in the printer when it is not the default value
* update guide for mmap
* move position to reduce model reload
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
* Fix another Vulkan CPY buffer size bug
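For context, the tanh-based GELU approximation and an algebraically equivalent tanh-free form, via the identity tanh(v) = 1 - 2/(exp(2v) + 1); this illustrates the kind of rewrite the first bullet describes and is not necessarily the exact shader change:

```c
#include <math.h>

// Common tanh-based GELU approximation.
float gelu_tanh(float x) {
    const float c = 0.79788456f; // sqrt(2/pi)
    float v = c * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + tanhf(v));
}

// Equivalent form without tanh, using tanh(v) = 1 - 2/(exp(2v) + 1):
// 0.5*x*(1 + tanh(v)) = x * (1 - 1/(exp(2v) + 1))
float gelu_no_tanh(float x) {
    const float c = 0.79788456f;
    float v = c * (x + 0.044715f * x * x * x);
    return x * (1.0f - 1.0f / (expf(2.0f * v) + 1.0f));
}
```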
* support InternLM2 inference
* add add_space_prefix KV pair
* build vulkan as object
* vulkan ci
* llama : remove LLAMA_MAX_DEVICES from llama.h
ggml-ci
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* server : remove LLAMA_MAX_DEVICES
ggml-ci
* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
ggml-ci
* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
* readme : add deprecation notice
* readme : change deprecation notice to "remove" and fix url
* llama : remove gpu includes from llama.h
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
* New Features:
  1. Sum_Rows:
     - fix CUDA kernel overflow
     - fix block shape error when nrows is too big
  2. Im2Col:
     - support batch in CUDA
     - support f32 to f32 on both CPU and CUDA
  3. DepthWiseConv:
     - supported via Im2Col && MulMat
  4. Pool_2d:
     - support avg pooling in CUDA
  5. HardSigmoid:
     - implemented in CUDA
  6. HardSwish:
     - implemented in CUDA
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* Add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <slarengh@gmail.com>