path: root/ggml-sycl.cpp
author: jaime-m-p <167997752+jaime-m-p@users.noreply.github.com> 2024-06-18 18:40:52 +0200
committer: GitHub <noreply@github.com> 2024-06-18 18:40:52 +0200
commit: 37bef8943312d91183ff06d8f1214082a17344a5
tree: 7713dc5aceb3b181568db3d21b1383762de41c4a /ggml-sycl.cpp
parent: 91c188d6c296bd3384f2a02a83b71187aa3d18b3
tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
Diffstat (limited to 'ggml-sycl.cpp')
0 files changed, 0 insertions, 0 deletions