* py : fix missing added_tokens_dict for SPM vocab
* py : pad with unknown tokens when data is missing
ggml-ci
* py : fix BPE vocab conversion
ggml-ci
* py : fix padded dummy tokens (I hope)
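The padding fix is easiest to see in miniature: when the model header declares a larger vocab than the vocab file actually provides, the converter fills the tail with dummy unknown tokens. A minimal sketch of the idea, assuming a flat token list (names hypothetical, not the convert script's actual API):

```python
def pad_vocab(tokens: list[str], vocab_size: int, unk_token: str = "<unk>") -> list[str]:
    """Pad a token list with dummy unknown tokens up to vocab_size."""
    if len(tokens) >= vocab_size:
        return tokens
    # Dummy entries get unique names so reverse lookups stay well defined.
    return tokens + [f"{unk_token}_{i}" for i in range(len(tokens), vocab_size)]

print(len(pad_vocab(["<unk>", "<s>", "</s>", "hello"], vocab_size=8)))  # 8
```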
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
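The change itself is a one-liner; in Python terms it amounts to something like the following (hypothetical names, shown only to make the log-message change concrete):

```python
import logging

def tokenize_training_data(path: str) -> None:
    # Include the file name so it is clear which file is being tokenized
    # when running the finetune example.
    logging.info("tokenizing training data from %s", path)
    # ... tokenization itself would follow here ...
```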
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/34fed993f1674c8d06d58b37ce1e0fe5eebcb9f5' (2023-12-01)
→ 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/e92039b55bcd58469325ded85d4f58dd5a4eaf58?dir=lib' (2023-11-29)
→ 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/cfc3698c31b1fb9cdcf10f36c9643460264d0ca8' (2023-12-27)
→ 'github:NixOS/nixpkgs/317484b1ead87b9c1b8ac5261a8d2dd748a0492d' (2024-01-08)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement
* Whitespace
* Collecting command buffer completions on single thread
* Whitespace
* Reduce diff noise
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
* Replace loop of dispatch_async with dispatch_apply
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
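For context, dispatch_apply submits an entire index range to a queue and returns only when every iteration has finished, replacing a hand-rolled loop of dispatch_async calls plus completion tracking. A rough Python analogue of that pattern (illustrative only; the real change is in Objective-C):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_buffer(i: int) -> int:
    return i * i  # stand-in for the per-command-buffer work

with ThreadPoolExecutor() as pool:
    # One batched map over the whole range, blocking until all items
    # complete (analogous to dispatch_apply), instead of submitting each
    # item separately and tracking completions by hand.
    results = list(pool.map(encode_buffer, range(8)))
print(results)
```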
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.
* Fixed some issues and bugs in the grammar generator. Improved documentation.
* Update pydantic_models_to_grammar.py
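The generator's premise: given a pydantic model, emit a GBNF grammar that constrains sampling to JSON matching that model. A toy illustration of the field-to-rule mapping (deliberately simplified; the real logic lives in examples/pydantic_models_to_grammar.py):

```python
from pydantic import BaseModel

class Book(BaseModel):
    title: str
    pages: int

# Map Python types to (pre-existing) GBNF rule names; toy subset only.
TYPE_RULES = {str: "string", int: "number"}

def toy_grammar(model: type[BaseModel]) -> str:
    fields = model.model_fields  # pydantic v2; v1 uses __fields__
    pairs = ' "," '.join(
        f'"\\"{name}\\":" {TYPE_RULES[field.annotation]}'
        for name, field in fields.items()
    )
    return f'root ::= "{{" {pairs} "}}"'

print(toy_grammar(Book))
# root ::= "{" "\"title\":" string "," "\"pages\":" number "}"
```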
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship between
the application code and the GPU modules.
This change does nothing unless the build defines -DGGML_MULTIPLATFORM,
which causes back-references and function pointers to conform to the MS
ABI, which is supported by NVCC, ROCm, Xcode, GCC, and Clang across
platforms.
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* speculative: expose draft threading
* fix usage format
* accept -td and -tbd args
* speculative: revert default behavior when -td is unspecified
* fix trailing whitespace
* llama : minor fix indent
* llama : check LLAMA_TRACE env for extra logging
ggml-ci
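LLAMA_TRACE follows the usual pattern of an environment-variable debug switch; a minimal rendering of the pattern in Python (illustrative, not the C++ implementation):

```python
import os

LLAMA_TRACE = os.environ.get("LLAMA_TRACE") is not None

def trace(msg: str) -> None:
    # Extra logging is emitted only when the env var is set, so the
    # hot path stays quiet by default.
    if LLAMA_TRACE:
        print(f"[trace] {msg}")

trace("decoding batch of 32 tokens")  # prints only if LLAMA_TRACE is set
```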
* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with a different number of bits,
which is not handled in the inference implementation
(it is assumed that all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
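To see why MoE models trip over a "more bits for every third tensor" schedule: a dense model has one ffn_down per layer, so whole layers move together, but an MoE layer has one ffn_down per expert, and a global tensor counter then lands different experts of the same layer on different types. A toy illustration (schedule and counts invented):

```python
# Toy model: a global tensor counter decides which tensors get more bits.
def pick_type(i: int) -> str:
    return "Q5_K" if i % 3 == 0 else "Q4_K"  # invented schedule

n_experts, n_layers = 4, 2
i = 0
for layer in range(n_layers):
    types = []
    for expert in range(n_experts):
        types.append(pick_type(i))
        i += 1
    print(f"layer {layer}: {types}")
# layer 0: ['Q5_K', 'Q4_K', 'Q4_K', 'Q5_K']  <- experts disagree within a layer
```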
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad
* log a little bit more info on iOS
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
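The guard boils down to refusing very low-bit quantization when no importance matrix was provided, since Q2_K-class quants degrade badly without one. A sketch of that check (names invented; the real check is in the C++ quantize path):

```python
LOW_BIT_TYPES = {"Q2_K", "Q2_K_S"}  # types that need an importance matrix

def check_quantization(qtype: str, imatrix_path: str | None) -> None:
    if qtype in LOW_BIT_TYPES and imatrix_path is None:
        raise ValueError(
            f"{qtype} quantization without an importance matrix "
            "gives poor quality; pass an imatrix file"
        )

check_quantization("Q2_K", "data.imatrix")  # ok
# check_quantization("Q2_K", None)          # would raise ValueError
```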
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits
llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits
llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.
* llama : only save and restore used logits
For a batch size of 512 this reduces the saved state by around 62 MB in
the best case, which can be a lot when saving on every message to allow
regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
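The quoted 62 MB figure is consistent with typical LLaMA constants, reserving float32 logits for a full batch rather than only the tokens actually used. A quick back-of-the-envelope check (vocab size assumed, not taken from the source):

```python
n_vocab = 32_000      # typical LLaMA vocab size (assumption)
n_batch = 512
bytes_per_logit = 4   # float32

reserved = n_vocab * n_batch * bytes_per_logit
print(reserved / 1024**2)  # 62.5 MiB of logits at batch size 512
```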
The fix should just be running `sudo apt-get update`.
* add the parameter --no-display-prompt; combined with --log-disable, it displays only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineState calls
ggml-ci
* metal : fix Metal3 family check
ggml-ci
* metal : check for simdgroup matrix mul. feature
ggml-ci
* fix deadlock
* don't ruin all whitespace