author | jaime-m-p <167997752+jaime-m-p@users.noreply.github.com> | 2024-05-28 21:46:34 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-05-28 21:46:34 +0200 |
commit | 02c1ecad07f0e2d2febe8196271bcc64bdc9c006 (patch) | |
tree | 2208298e9ac6bd0743787d02f35b527f7db47d0b /ggml.c | |
parent | 6bd12ce409f949012935b7d1b15a21ffa473a565 (diff) |
Tokenizer WPM fixes (#7500)
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
- Fix Unicode edge-case combinations.
- Split by whitespace in the same pass.
* Discard all tokens when no match is found.
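
For readers unfamiliar with WPM (WordPiece) tokenization, the last two items can be illustrated with a minimal sketch. This is not the code from this commit: the function names, the `##` continuation prefix (borrowed from BERT-style vocabularies), and the `[UNK]` token are assumptions for illustration only, and the actual implementation may represent word boundaries differently. It shows a single preprocessing pass that splits on whitespace and punctuation, followed by greedy longest-match lookup that discards every partial piece of a word when some remainder cannot be matched.

```python
import unicodedata


def preprocess(text):
    """Split text into words in one pass (hypothetical helper).

    Whitespace separates words; each punctuation character becomes
    its own word.
    """
    words, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                words.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match WordPiece for a single word.

    Assumes BERT-style "##" prefixes on non-initial pieces. If any
    remainder of the word has no match, all pieces collected so far
    are discarded and the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # discard partial pieces for this word
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Preprocess, then apply WordPiece to each word."""
    return [piece for word in preprocess(text) for piece in wordpiece(word, vocab)]
```

With a toy vocabulary such as `{"un", "##aff", "##able"}`, the word `unaffable` splits into three pieces, while `unaffq` yields only the unknown token even though `un` and `##aff` matched first.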
Diffstat (limited to 'ggml.c')
0 files changed, 0 insertions, 0 deletions