Age | Commit message (Collapse) | Author |
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Enable MLA-3 in crippled GGUFs: WIP
* Enable MLA-3 in crippled GGUFs: seems to work
* Add newly created tensors to model.tensors_by_name
Else they don't get run-time repacked.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding GPU offload policy
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding ability to use THP on Linux
* Use the actual page size4 used for mmap also in munmap
* Add -thp to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* A better way to measure the cost of ggml_barrier
* Smart expert selection
* Add ser option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* This reduces compute buffer size for MLA
* This should accomplish it for standard attention
* Much better
* Better concat for contiguous tensors
If all the op does is to concatenate the second tensor
to the first, why would we want to have a loop?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
The `-mla` command line option turns into an int from a bool.
mla = 0: use standard attention
mla = 1: use MLA with transposed cache
mla > 1: use MLA without transposed cache
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Give the user the option to override where model weights are stored
* Fix ggml_nbytes() problem and cleanup
For a tensor with zero elements ggml_nbytes() was returning
uint64_t::max, and this was causing graph allocation failure.
* Add timing info to CUDA graph evaluation
* Add more timing info
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate))
in case the MUL_MAT_ID op for the down experts is the next
op in the graph.
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* examples : add new sweep-bench benchmark
* Change documentation to reference ik_llama.cpp
* Made it compile with ik_llama
* Fix JSONL output
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
|
|
* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-tinme-repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
removed most of it when implementing the quants with per row scale,
bit it was stull lurking in ggml_copy. Not sure if these were the last
remnants of ggmil-style row sizes, or if there are still places left
* llama.cpp: get rid of the the 1d K cache assumption. Create and manage
the K-cache as a 2D tensor so we can have per row meta data as needed
by q8_KV.
Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Load all MoE experts during warmup
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Unify warmup to one token
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
|
|
* Deepseek MLA Optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make MLA optional
* Remove some unnecessary copies in the MLA attention
* Deepseek MLA Optimizations V2 (#195)
* Avoid allocating MHA KV cache when MLA is turned on
* Added missing gguf-py file
* Added final optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make sure we do have wk_b and wv_b before enabling MLA
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use type_k and type_v to set the types of the MLA caches
They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.
* Better gemm strategy when nth > nhead
It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we ind the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.
---------
Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Be able to repack tensors at run time
* Repack: also add bf16 as repackable type
* Repack: make sure number of rows is a multiple of the packing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q6_0 - basics + AVX2/Zen4 working
* Adding q6_0: CUDA dequantize works, but not mmvq
* Adding q6_0: CUDA mmvq works
* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache
* Add q6_0 to CPU flash attention
Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?
* q6_0: slightly better kv-cache result
Better than q8_0+q4_0, but not as good as q8_0+iq4_nl
* q6_0: works on ARM_NEON
* q6_0: dequantize works on Metal, but not vector dot product
* q6_0: it now works on Metal
Outperforms q5_0 by a significant margin. E.g.
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 44.02 ± 0.08 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 40.13 ± 0.12 |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 500.55 ± 0.32 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 448.02 ± 0.27 |
* q6_0: can now be used for kv-cache on Metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4 Flash Attnetion: WIP bf16
* Zen4 Flash Attnetion: bf16 seems to be working
* Zen4 Flash Attnetion: improving bf16
* Zen4 Flash Attnetion: improving bf16
It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, so at
most 4 kb for the head sizes we support, so we can
just allocate on the stack instead of reserving and
passing a work buffer in ggml.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
For some models the same tensor is used for token embeddings and
output. This tensor tends to be named token_embedding.weight rather
than output.weight, which prevernts us from collecting imatrix data
for this tensor. With this commit we can tell the name of the
output tensor to the imatrix tool.
|
|
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples
* find result_norm/result_embd tensors properly; update output allocation logic
* only use embd output for pooling_type NONE
* get rid of old causal_attn accessor
* take out attention_type; add in llama_set_embeddings
* bypass logits when doing non-NONE pooling
|
|
* common: fix warning
* Update common/common.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* add control-vector-generator
* calc diff
* add comments
* proof-of-concept stdlib implementation
Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.
* param parsing, refactor, comments
Added basic command-line parameters for outfile and one each positive/negative prompt.
Refactored some messy code in PCA computation and GGUF exporting.
Left a bunch of comments regarding further work needed.
* example template completions
Implements an example template set built from the positive/negative prompts like the control vector Python implementation.
* add multi prompts, multi-thread for PCA
* fix mem error
* add debugs
* fix matrix transpose multiplication
you have got to be kidding me
* preliminary template/multiprompt support
model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish
* fix zero output & param parsing, functional templating
fixed a bug where the output file had no tensor data/was all zero
fixed a bug where single hyphen flags were not being correctly parsed
implements creation of templated prompts from input (still need to adapt based on model)
* fix square_diff matmul index range and CRLF->LF line endings
fixed a logic error where square_diff would not multiply all rows
fixed a formatting error where the provided completions.txt had CRLF line endings
* add command-line args for num threads, num completions file lines, always reload model
refactored a few things and did what the commit message says on the tin
* code aestheticization
* fix compiler warnings
* in-series multithreading for prompt embedding?
added commented-out code to attempt to start implementing mutlithreading for embedding in main
* remove unnecessary multithreading
* interim fix memory leak
* translated everything but PCA (I think)
* tentatively translate the rest
* fix ggml errors and make new ones
at least it compiles and runs
* fix cb_eval
* temporary commit while I move dev environments
it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent
* update debug statements
* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped
* update comments
* (wip) refactor
* clean up PCA ggml implementation
* fix shape of v_diff_original
* add n_batch for pca
* working version
* remember to copy back the last_eigenvector
* fix n_completions
* bring back n_completions
* default n_pca_batch to 20
* fix macos build
* add to makefile all targets
* use ggml_format_name
* add readme
* fix .editorconfig
* use ggml_backend_tensor_copy
* attemp to fix compile problem on mac
* fix compile warn
* reuse allocr
* move param parser to common
* better error handling
* clean up a bit
* add print_usage
* shorten help msg
* beautify help msg
* escape prompt by default
* change compile target to llama-cvector-generator
* typo
* disable GPU for PCA
* code style
---------
Co-authored-by: Christian Zhou-Zheng <christianzhouzheng@gmail.com>
|
|
* url: save -mu download to new cache location
* url: fs_get_cache_file_path util
* url: tweak sig of fs_get_cache_file
|
|
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
|
|
|
|
* imatrix : migrate to gpt_params
ggml-ci
* imatrix : add --save-frequency cli arg
* common : fix --no-ppl
|
|
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
|
|
ggml-ci
|
|
* Finish Vulkan mul_mat_id implementation
* Add Vulkan sum_rows and div ops
* Fix MUL_MAT_ID matrix matrix shader
* Fix MUL_MAT_ID matrix vector shader dispatch size
* Fix MUL_MAT_ID matrix vector shader and dispatch code
* Update Vulkan CPU offload for MUL_MAT_ID
* Fix crash when using split mode none and setting a main GPU
|
|
This also flips the default behavior of the output to not include control token by default.
|
|
* main : don't print special tokens with --grammar
The CLI interface was recently changed to print special control tokens
like the </s> stop message one. This token shouldn't be printed if the
grammar flag was passed, unless the grammar specifies it, because that
breaks shell-scriptability.
* main: use seperate stream for control characters
* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out
* main: dprintf isn't part of the IEEE POSIX standard. Just use write().
* main: remove --ctrl-token-fd-out in favor for fcntl() based detection
* common.cpp: accidentally removed --interactive-first
* main: only merge stdout and control token if not in conversation or grammar mode
* main: rejig control token descriptor handling
* main: must check pipe status on very top of program
* main: renamed --no-special from --ctrl-token-no-out and other refactoring
* main: refactor ctrl_token_no_out --> no_special
* llama: rename llama_token_is_control_token() to llama_token_is_control()
* main: remove special token file descriptor feature (#5)
---------
Co-authored-by: Brian <mofosyne@gmail.com>
|
|
* Add SVE support for q4_0_q8_0 q8_0_q8_0
* remove ifdef
|
|
* fix missing slash in fs_get_cache_directory()
* use LOCALAPPDATA for fs_get_cache_directory()
* better code style
|
|
* common : normalize naming style
ggml-ci
* common : match declaration / definition order
* zig : try to fix build
|
|
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
* examples: cache hf model when --model not provided
|
|
|
|
* ggml : add RPC backend
The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).
* set TCP_NODELAY
* add CI workflows
* Address review comments
* fix warning
* implement llama_max_devices() for RPC
* Address review comments
* Address review comments
* wrap sockfd into a struct
* implement get_alignment and get_max_size
* add get_device_memory
* fix warning
* win32 support
* add README
* readme : trim trailing whitespace
* Address review comments
* win32 fix
* Address review comments
* fix compile warnings on macos
|
|
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug
./llamafile -m foo.gguf -p bar --grammar 'root::="'
Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
|
|
@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.
|
|
|
|
|
|
|
|
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
|
|
* args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf)
* args: main & server now call gpt_params_handle_model_default
* args: define DEFAULT_MODEL_PATH + update cli docs
* curl: check url of previous download (.json metadata w/ url, etag & lastModified)
* args: fix update to quantize-stats.cpp
* curl: support legacy .etag / .lastModified companion files
* curl: rm legacy .etag file support
* curl: reuse regex across headers callback calls
* curl: unique_ptr to manage lifecycle of curl & outfile
* curl: nit: no need for multiline regex flag
* curl: update failed test (model file collision) + gitignore *.gguf.json
|
|
Co-authored-by: root <root@nenya.lothlorien.ca>
|
|
* merged the changes from deepseeker models to main branch
* Moved regex patterns to unicode.cpp and updated unicode.h
* Moved header files
* Resolved issues
* added and refactored unicode_regex_split and related functions
* Updated/merged the deepseek coder pr
* Refactored code
* Adding unicode regex mappings
* Adding unicode regex function
* Added needed functionality, testing remains
* Fixed issues
* Fixed issue with gpt2 regex custom preprocessor
* unicode : fix? unicode_wstring_to_utf8
* lint : fix whitespaces
* tests : add tokenizer tests for numbers
* unicode : remove redundant headers
* tests : remove and rename tokenizer test scripts
* tests : add sample usage
* gguf-py : reader prints warnings on duplicate keys
* llama : towards llama3 tokenization support (wip)
* unicode : shot in the dark to fix tests on Windows
* unicode : first try custom implementations
* convert : add "tokenizer.ggml.pre" GGUF KV (wip)
* llama : use new pre-tokenizer type
* convert : fix pre-tokenizer type writing
* lint : fix
* make : add test-tokenizer-0-llama-v3
* wip
* models : add llama v3 vocab file
* llama : adapt punctuation regex + add llama 3 regex
* minor
* unicode : set bomb
* unicode : set bomb
* unicode : always use std::wregex
* unicode : support \p{N}, \p{L} and \p{P} natively
* unicode : try fix windows
* unicode : category support via std::regex
* unicode : clean-up
* unicode : simplify
* convert : add convert-hf-to-gguf-update.py
ggml-ci
* lint : update
* convert : add falcon
ggml-ci
* unicode : normalize signatures
* lint : fix
* lint : fix
* convert : remove unused functions
* convert : add comments
* convert : exercise contractions
ggml-ci
* lint : fix
* cmake : refactor test targets
* tests : refactor vocab tests
ggml-ci
* tests : add more vocabs and tests
ggml-ci
* unicode : cleanup
* scripts : ignore new update script in check-requirements.sh
* models : add phi-3, mpt, gpt-2, starcoder
* tests : disable obsolete
ggml-ci
* tests : use faster bpe test
ggml-ci
* llama : more prominent warning for old BPE models
* tests : disable test-tokenizer-1-bpe due to slowness
ggml-ci
---------
Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
|
|
* imatrix: save the dataset file used in the output file
* llama: support kv overrides type string string
* common: factorize KV Overrides parsing between common and server
* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656
* llama: remove kv override str_value initialization as it does not compile on some toolchain
* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`
* quantize: add imatrix filename in KV
* llama: add llama_model_kv_override_free
* common: add llama_model_kv_override_free
common: free kv override if used after model loading
* llama: finally move the string KV override value to the stack
* llama : minor
* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.
Co-authored-by: slaren <slarengh@gmail.com>
* kv override: ensure string termination
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* add basic tensor data validation function
* add --check-tensors command line argument
tensor validation is disabled by default and can be enabled by adding
`--check-tensors` to the command line arguments.
quantize always validates tensors.
|