Age  Commit message  Author
2023-12-21  ggml : change ggml_scale to take a float instead of tensor (#4573)  Georgi Gerganov
* ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci
2023-12-21  gguf-py : fix broken link  Georgi Gerganov
2023-12-21  gguf : simplify example dependencies  Georgi Gerganov
2023-12-21  ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)  Samuel Maynard
* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo * [github][workflows][docker]: adds `jlumbroso/free-disk-space`
2023-12-21  llama : initial ggml-backend integration (#4520)  slaren
* llama : initial ggml-backend integration
* add ggml-metal
* cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
  access all tensor data with ggml_backend_tensor_get/set
* add ggml_backend_buffer_clear
  zero-init KV cache buffer
* add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data
* disable gpu backends with ngl 0
* more accurate mlock
* unmap offloaded part of the model
* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap
* update quantize and lora
* update session copy/set to use ggml-backend
ggml-ci
* use posix_fadvise instead of posix_fadvise64
* ggml_backend_alloc_ctx_tensors_from_buft : remove old print
* llama_mmap::align_offset : use pointers instead of references for out parameters
* restore progress_callback behavior
* move final progress_callback call to load_all_data
* cuda : fix fprintf format string (minor)
* do not offload scales
* llama_mmap : avoid unmapping the same fragments again in the destructor
* remove unnecessary unmap
* metal : add default log function that prints to stderr, cleanup code
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
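The posix_fadvise item above can be illustrated with a minimal sketch. This is not the llama.cpp code (which is C++); it only shows, using Python's standard library, the general pattern of hinting sequential access before reading a memory-mapped file. The helper names `read_sequentially` and `demo` and the file contents are made up for the illustration.

```python
import mmap
import os
import tempfile


def read_sequentially(path: str) -> bytes:
    """Map a file read-only after hinting that it will be read sequentially."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.posix_fadvise only exists on some Unix platforms, so guard the call.
        if hasattr(os, "posix_fadvise"):
            # offset=0, len=0 applies the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:]
    finally:
        os.close(fd)


def demo() -> bytes:
    """Write a small placeholder file (hypothetical data) and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"fake model weights")
        path = tmp.name
    try:
        return read_sequentially(path)
    finally:
        os.remove(path)
```

The advice is only a hint: the kernel may read ahead more aggressively, but the bytes returned are the same with or without it.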
2023-12-21  llama : allow getting n_batch from llama_context in c api (#4540)  Marcus Dunn
* allowed getting n_batch from llama_context in c api
* changed to use `uint32_t` instead of `int`
* changed to use `uint32_t` instead of `int` in `llama_n_ctx`
* Update llama.h
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21  metal : fix `ggml_metal_log` vargs (#4373)  Finn Voorhees
2023-12-21  cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)  Erik Garrison
* AMD ROCm: handle UMA memory VRAM expansions
  This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU.
  Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers.
  Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later.
* clarify build process for ROCm on linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
2023-12-21  ggml-cuda: Fix HIP build by adding define for __trap (#4569)  arlo-phoenix
Regression of 139882392258671ffe5acdfcadc0bc08572d6eef HIP doesn't have trap, only abort
2023-12-21  common : remove incorrect --model-draft default (#4568)  Jared Van Bortel
2023-12-21  CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)  Johannes Gäßler
2023-12-21  readme : update coding guidelines  Georgi Gerganov
2023-12-21  py : open merges file as 'utf-8' (#4566)  howlger
Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:
```
Traceback (most recent call last):
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
    model_instance.set_vocab()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
    self._set_vocab_gpt2()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
    self._load(Path(path))
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
    self._try_load_merges_txt(path)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
    for line in fp:
  File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
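A minimal sketch of the kind of change the commit title describes: passing an explicit encoding to `open()` so the file is not decoded with the platform default (cp1252 in the traceback above). The helper names `load_merges` and `write_sample_merges` and the sample content are hypothetical; this is not the actual gguf-py code.

```python
import os
import tempfile
from typing import List


def load_merges(path: str) -> List[str]:
    """Read a merges file line by line, always decoding it as UTF-8."""
    # Without encoding=..., open() uses the locale's preferred encoding,
    # which is commonly cp1252 on Windows.
    with open(path, "r", encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in fp]


def write_sample_merges() -> str:
    """Write a tiny made-up merges file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as fp:
        # U+0141 is encoded as the bytes C5 81 in UTF-8; byte 0x81 has no
        # mapping in Python's cp1252 codec, which is what the traceback reports.
        fp.write("\u0120 \u0141\n".encode("utf-8"))
    return path
```

Calling `load_merges(write_sample_merges())` reads the line back intact regardless of the locale, whereas decoding the same bytes as cp1252 raises `UnicodeDecodeError`.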
2023-12-21  cuda : better error message for ggml_get_rows (#4561)  bobqianic
* Update ggml-cuda.cu * Update ggml-cuda.cu * Update ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21  cuda : replace asserts in wrong architecture checks with __trap (#4556)  slaren
* cuda : replace asserts in wrong architecture checks with __trap * make bad_arch noreturn, remove returns
2023-12-21  llama : disable per-tensor info prints on model load (#4562)  Johannes Gäßler
2023-12-21  Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)  LoganDark
2023-12-20  CUDA: Faster Mixtral prompt processing (#4538)  Johannes Gäßler
* CUDA: make MoE tensors contiguous for batch size>1 * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>
2023-12-19  ggml : fixed check for _MSC_VER (#4535)  Eric Sommerlade
Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
2023-12-18  ggml-cuda: Fix HIP build (#4528)  arlo-phoenix
regression of #4490 Adds defines for two new datatypes cublasComputeType_t, cudaDataType_t. Currently using deprecated hipblasDatatype_t since newer ones very recent.
2023-12-18  llama.swiftui : add tinyllama 1.1B F16  Georgi Gerganov
2023-12-18  llama.swiftui : add more models  Georgi Gerganov
2023-12-18  llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)  Ebey Abraham
* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec
ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove obsolete comment
---------
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-18  llama : fix try_override for bool_value which always return true (#4519)  hankcs
2023-12-17  decode : fix logits_valid for legacy API (#4516)  Jared Van Bortel
2023-12-17  readme : update hot topics  Georgi Gerganov
2023-12-17  llama.swiftui : add bench functionality (#4483)  Georgi Gerganov
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force to use n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench
---------
Co-authored-by: jhen <developer@jhen.me>
2023-12-17  gguf-py : fail fast on nonsensical special token IDs (#4489)  Jared Van Bortel
2023-12-17  build : Check the ROCm installation location (#4485)  Matheus Gabriel Alves Silva
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
2023-12-17  finetune : keep allocs alive until all allocations are done (#4486)  slaren
2023-12-17  server : disable llm logs if SERVER_VERBOSE is off (#3792)  olexiyb
2023-12-17  server : fix grammar being ignored (#4494)  AdithyanI
Fix bug in identifying the grammar.
2023-12-17  server : fix possible ambiguity in content type charset (#4501)  Alexey Parfenov
2023-12-17  server : allow requests larger than 8K (#4500)  mzcu
2023-12-17  Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506)  Bach Le
2023-12-16  lora : add support for non-llama models (#3333)  slaren
* lora : add support for non-llama models
ggml-ci
* avoid leaking ggml_context on failure
  cleanup
ggml-ci
* lora : allow 1d tensors
* lora : include embd and output layers in size calculation
* fix style
2023-12-15  llama : sanity checks for access to logits (#4274)  Jared Van Bortel
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15  server : add optional API Key Authentication example (#4441)  ShadovvBeast
* Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15  ggml : group mul_mat_id rows by matrix (cpu only) (#4480)  slaren
* ggml : group mul_mat_id rows by matrix (cpu only) * remove mmid parameters from mm forward * store row groups in wdata and calculate only once in GGML_TASK_INIT ggml-ci
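The grouping idea named in this commit can be sketched in a few lines. The sketch assumes that each input row carries the id of the matrix it should be multiplied with, and that rows sharing an id are collected so each matrix is handled once. The function name `group_rows_by_matrix` and the sample ids are hypothetical, and the real implementation is C code that stores the groups in wdata.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_rows_by_matrix(matrix_ids: Sequence[int]) -> Dict[int, List[int]]:
    """Map each matrix id to the list of row indices that selected it."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for row, matrix_id in enumerate(matrix_ids):
        groups[matrix_id].append(row)
    return dict(groups)
```

For example, with per-row ids `[2, 0, 2, 1, 0]`, rows 0 and 2 end up in the group for matrix 2, rows 1 and 4 in the group for matrix 0, and row 3 in the group for matrix 1, so each matrix is visited once for all of its rows.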
2023-12-14  ggml : use ggml_row_size where possible (#4472)  slaren
* ggml : use ggml_row_size where possible ggml-ci * ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14  ggml : remove n_dims from ggml_tensor (#4469)  slaren
ggml-ci
2023-12-14  py : add protobuf dependency (#4466)  wonjun Jang
2023-12-14  ggml : add ggml_row_size() (fixes llama out of space) (#4461)  LostRuins
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
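A minimal sketch of the idea behind a row-size helper: compute the byte size of a row of block-quantized values with integer arithmetic only, instead of multiplying by a fractional bytes-per-element value and casting. The function name `row_size` is hypothetical and this is not the ggml source; the example constants assume a Q4_0-style layout of 18 bytes per block of 32 elements.

```python
def row_size(type_size: int, block_size: int, n_elements: int) -> int:
    """Bytes needed for n_elements values stored in fixed-size blocks.

    type_size  -- bytes per block (assumed 18 for a Q4_0-style block)
    block_size -- elements per block (assumed 32 for a Q4_0-style block)
    """
    if n_elements % block_size != 0:
        raise ValueError("row length must be a multiple of the block size")
    # Multiply first, then divide: the division is exact because n_elements
    # is a whole number of blocks, so no floating point value is involved.
    return type_size * n_elements // block_size
```

Under those assumed constants, a row of 4096 elements is 128 blocks, so `row_size(18, 32, 4096)` is 2304 bytes.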
2023-12-14  ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453)  Georgi Gerganov
2023-12-14  convert : support loading vocab from fast tokenizer config (#3633)  wonjun Jang
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variable/functions, add types to class variable and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14  readme : update supported model list (#4457)  BarfingLemurs
2023-12-13  server : fix handling of characters that span multiple tokens when streaming (#4446)  shibe2
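The problem named in this commit title, a single character whose UTF-8 bytes arrive split across streamed pieces, can be illustrated with a small sketch. It uses Python's incremental decoder to hold back incomplete byte sequences until the rest arrives. This is not the server's C++ fix; the function name `stream_text` and the sample chunks are made up for the illustration.

```python
import codecs
from typing import Iterable, Iterator


def stream_text(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text from byte chunks without emitting partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        # The decoder buffers a trailing incomplete sequence internally and
        # returns only the characters that are complete so far.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left; a dangling partial sequence becomes U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, the euro sign is the three bytes E2 82 AC; feeding `[b"\xe2\x82", b"\xac!"]` yields nothing for the first chunk and the complete text for the second, instead of sending a broken character to the client.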
2023-12-13  sync : ggml (SD ops, tests, kernels) (#4444)  Georgi Gerganov
* sync : ggml (SD ops, tests, kernels)
ggml-ci
* cuda : restore im2col
ggml-ci
* metal : fix accuracy of dequantization kernels
ggml-ci
* cuda : restore correct im2col
ggml-ci
* metal : try to fix moe test by reducing expert size
ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13  build : detect host compiler and cuda compiler separately (#4414)  Jared Van Bortel
2023-12-13  common : add `--version` option to show build info in CLI (#4433)  Siwen Yu