Age | Commit message | Author |
|
* ggml : change ggml_scale to take a float instead of a tensor
* ggml : fix CPU implementation
* tests : fix test-grad0
ggml-ci
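A minimal sketch of the updated call, assuming the post-change signature `ggml_scale(ctx, a, float)`; previously the scale factor had to be passed as a 1-element tensor:
```
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_f32(a, 1.0f);

    // new API: the scale factor is a plain float instead of a 1-element tensor
    struct ggml_tensor * b = ggml_scale(ctx, a, 0.125f);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, b);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    ggml_free(ctx);
    return 0;
}
```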
|
|
|
|
|
|
* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo
* [github][workflows][docker]: adds `jlumbroso/free-disk-space`
|
|
* llama : initial ggml-backend integration
* add ggml-metal
* cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set
* add ggml_backend_buffer_clear
zero-init KV cache buffer
* add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data (see the sketch after this commit)
* disable gpu backends with ngl 0
* more accurate mlock
* unmap offloaded part of the model
* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap
* update quantize and lora
* update session copy/set to use ggml-backend
ggml-ci
* use posix_fadvise instead of posix_fadvise64
* ggml_backend_alloc_ctx_tensors_from_buft : remove old print
* llama_mmap::align_offset : use pointers instead of references for out parameters
* restore progress_callback behavior
* move final progress_callback call to load_all_data
* cuda : fix fprintf format string (minor)
* do not offload scales
* llama_mmap : avoid unmapping the same fragments again in the destructor
* remove unnecessary unmap
* metal : add default log function that prints to stderr, cleanup code
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
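A small illustrative helper built on the calls named above (`ggml_backend_buffer_is_host`, `ggml_backend_tensor_get`); the helper itself, `read_tensor_data`, is hypothetical:
```
#include "ggml.h"
#include "ggml-backend.h"
#include <string.h>

// Hypothetical helper: read a tensor's data regardless of which backend owns it.
// When the buffer is host-accessible the data pointer is used directly (no copy);
// otherwise ggml_backend_tensor_get copies it out of the device buffer.
static void read_tensor_data(const struct ggml_tensor * t, void * dst) {
    const size_t nbytes = ggml_nbytes(t);
    if (t->buffer != NULL && ggml_backend_buffer_is_host(t->buffer)) {
        memcpy(dst, t->data, nbytes);
    } else {
        ggml_backend_tensor_get(t, dst, 0, nbytes);
    }
}
```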
|
|
* allowed getting n_batch from llama_context in c api
* changed to use `uint32_t` instead of `int`
* changed to use `uint32_t` instead of `int` in `llama_n_ctx`
* Update llama.h
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
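A small usage sketch, assuming `llama_n_batch` and the updated `uint32_t` return type of `llama_n_ctx` described above:
```
#include "llama.h"
#include <stdint.h>
#include <stdio.h>

static void print_ctx_limits(const struct llama_context * ctx) {
    const uint32_t n_ctx   = llama_n_ctx(ctx);   // now returns uint32_t
    const uint32_t n_batch = llama_n_batch(ctx); // newly exposed in the C API
    printf("n_ctx = %u, n_batch = %u\n", n_ctx, n_batch);
}
```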
|
|
|
|
* AMD ROCm: handle UMA memory VRAM expansions
This resolves #2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.
Without this, AMD ROCm users with shared CPU/GPU memory are usually
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.
Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.
* clarify build process for ROCm on linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
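A rough sketch of the UMA-aware mapping this describes, in the spirit of the existing hip-to-cuda compatibility defines in ggml-cuda.cu; the exact macro set used upstream is an assumption here:
```
#if defined(GGML_USE_HIPBLAS)
  #if defined(GGML_HIP_UMA)
    // UMA build: allocate through managed memory so the "VRAM" can grow into shared system RAM
    #define cudaMalloc hipMallocManaged
  #else
    // regular build: plain device (VRAM) allocation
    #define cudaMalloc hipMalloc
  #endif
  // use hipHostMalloc instead of the deprecated hipMallocHost for pinned host memory
  #define cudaMallocHost(ptr, size) hipHostMalloc(ptr, size, hipHostMallocDefault)
#endif
```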
|
|
Regression of 139882392258671ffe5acdfcadc0bc08572d6eef
HIP doesn't have trap, only abort
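Presumably the fix boils down to a compatibility define along these lines (the exact form is an assumption):
```
#if defined(GGML_USE_HIPBLAS)
// HIP device code has no __trap() intrinsic, so fall back to abort()
#define __trap() abort()
#endif
```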
|
|
|
|
|
|
|
|
Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:
```
Traceback (most recent call last):
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
model_instance.set_vocab()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
self._set_vocab_gpt2()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
self._load(Path(path))
File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
self._try_load_merges_txt(path)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
for line in fp:
File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
|
|
* Update ggml-cuda.cu
* Update ggml-cuda.cu
* Update ggml-cuda.cu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* cuda : replace asserts in wrong architecture checks with __trap
* make bad_arch noreturn, remove returns
|
|
|
|
|
|
* CUDA: make MoE tensors contiguous for batch size>1
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
|
|
regression of #4490
Adds defines for two new datatypes:
cublasComputeType_t and cudaDataType_t.
Currently using the deprecated hipblasDatatype_t since the newer ones are very recent.
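Presumably the new defines map the CUDA-side type names onto the HIP type that is still widely available, e.g. something like this (the exact mapping is an assumption):
```
#if defined(GGML_USE_HIPBLAS)
// map the CUDA-side type names used in the cuBLAS calls onto hipBLAS;
// hipblasDatatype_t is deprecated, but its replacements are too recent to require yet
#define cublasComputeType_t hipblasDatatype_t
#define cudaDataType_t      hipblasDatatype_t
#endif
```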
|
|
|
|
|
|
* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec (see the sketch after this commit)
ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove obsolete comment
---------
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
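A sketch of the two attention changes called out above ("scale Q instead of KQ" and `ggml_mul_mat_set_prec`), assuming the ggml API added here; the helper and tensor names are illustrative:
```
#include "ggml.h"

// build the attention scores with the Q scaling applied up front and the
// K*Q matmul forced to F32 accumulation, as described in the commit
static struct ggml_tensor * attn_scores(struct ggml_context * ctx,
                                        struct ggml_tensor * k,
                                        struct ggml_tensor * q,
                                        float kq_scale) {
    q = ggml_scale(ctx, q, kq_scale);          // scale Q instead of scaling KQ afterwards
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);  // higher precision for this matmul
    return kq;
}
```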
|
|
|
|
|
|
|
|
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force to use n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench
---------
Co-authored-by: jhen <developer@jhen.me>
|
|
|
|
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
|
|
|
|
|
|
Fix bug in identifying the grammar.
|
|
|
|
|
|
|
|
* lora : add support for non-llama models
ggml-ci
* avoid leaking ggml_context on failure
cleanup
ggml-ci
* lora : allow 1d tensors
* lora : include embd and output layers in size calculation
* fix style
|
|
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
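The server code itself is C++, but the check this adds presumably amounts to something like the following (illustrative only; the header format and helper name are assumptions):
```
#include <stdbool.h>
#include <string.h>

// accept a request only when the Authorization header carries the configured key
// as "Bearer <key>"; with no key configured, authentication is disabled
static bool check_api_key(const char * auth_header, const char * api_key) {
    if (api_key == NULL || api_key[0] == '\0') {
        return true;
    }
    const char * prefix = "Bearer ";
    if (auth_header == NULL || strncmp(auth_header, prefix, strlen(prefix)) != 0) {
        return false;
    }
    return strcmp(auth_header + strlen(prefix), api_key) == 0;
}
```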
|
|
* ggml : group mul_mat_id rows by matrix (cpu only)
* remove mmid parameters from mm forward
* store row groups in wdata and calculate only once in GGML_TASK_INIT
ggml-ci
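An illustrative sketch of the row-grouping idea (not the actual ggml code): during GGML_TASK_INIT, bucket the rows by the expert matrix their id selects, so each matrix is applied to all of its rows at once:
```
#include <stdint.h>

// group row indices by the matrix (expert) id they select; row_groups holds the
// row indices for matrix m starting at row_groups[m*n_rows], group_sizes[m] their count
static void group_rows_by_matrix(const int32_t * ids, int n_rows, int n_mats,
                                 int * row_groups, int * group_sizes) {
    for (int m = 0; m < n_mats; ++m) {
        group_sizes[m] = 0;
    }
    for (int r = 0; r < n_rows; ++r) {
        const int m = ids[r];
        row_groups[m*n_rows + group_sizes[m]++] = r;
    }
}
```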
|
|
* ggml : use ggml_row_size where possible
ggml-ci
* ggml : move ggml_nbytes_split to ggml-cuda.cu
|
|
ggml-ci
|
|
|
|
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
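A small sketch of the new helper, assuming the `ggml_row_size(type, n_elements)` signature described above; it keeps the size computation in integer arithmetic instead of the float-based `ggml_type_sizef`:
```
#include "ggml.h"
#include <stdio.h>

int main(void) {
    const int64_t n_embd = 4096;
    const int64_t n_rows = 32000;

    // bytes for one row of n_embd Q4_K elements, then for the whole matrix
    const size_t row_bytes   = ggml_row_size(GGML_TYPE_Q4_K, n_embd);
    const size_t total_bytes = row_bytes*(size_t) n_rows;

    printf("row = %zu bytes, total = %zu bytes\n", row_bytes, total_bytes);
    return 0;
}
```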
|
|
|
|
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variables/functions, add types to class variables and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|
|
|
|
(#4446)
|
|
* sync : ggml (SD ops, tests, kernels)
ggml-ci
* cuda : restore im2col
ggml-ci
* metal : fix accuracy of dequantization kernels
ggml-ci
* cuda : restore correct im2col
ggml-ci
* metal : try to fix moe test by reducing expert size
ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|
|
|