author | wonjun Jang <strutive07@gmail.com> | 2023-12-14 17:09:34 +0900 |
---|---|---|
committer | GitHub <noreply@github.com> | 2023-12-14 10:09:34 +0200 |
commit | 873637afc7924f435ac44c067630a28e82eefa7b | |
tree | 82feb6a53b328eca8552304aca5007f26f768cff /examples | |
parent | 0353a1840134b24b07ab61fd4490192f28c4db6b |
convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variables/functions, add types to class variables and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
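
For context on the "add bytes_to_unicode function" entry above: the name matches the helper used by GPT-2-style byte-level BPE tokenizers, which maps each of the 256 byte values to a printable Unicode character. The sketch below follows that well-known GPT-2 mapping; it is an assumption that the function added in this commit behaved the same way, since the diff itself is not shown here (and a later entry removes byte_encoder).

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map every byte value (0-255) to a printable Unicode character.

    Sketch of the GPT-2 style mapping; assumed, not taken from this commit.
    """
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted
    # to code points starting at 256 so they become visible characters.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


mapping = bytes_to_unicode()
# Inverse table, useful for turning token text back into raw bytes.
inverse = {ch: b for b, ch in mapping.items()}
```

With this mapping the space byte becomes "Ġ" and the newline byte becomes "Ċ", which is why those characters appear in byte-level BPE vocabularies.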
Diffstat (limited to 'examples')
0 files changed, 0 insertions, 0 deletions