path: root/unicode.h
Age | Commit message | Author
2024-03-26 | wpm : portable unicode tolower (#6305) | Jared Van Bortel
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
2024-03-11 | llama : refactor unicode stuff (#5992) | Georgi Gerganov
* llama : refactor unicode stuff ggml-ci
* unicode : names
* make : fix c++ compiler
* unicode : names
* unicode : straighten tables
* zig : fix build
* unicode : put nfd normalization behind API ggml-ci
* swift : fix build
* unicode : add BOM
* unicode : add <cstdint> ggml-ci
* unicode : pass as cpts as const ref
2024-03-01 | unicode : switch to multimap based nfd_map (#5799) | Douglas Hanley
* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* dont construct new locale every time
2024-02-28 | llama : improve BERT tokenization (#5740) | Douglas Hanley
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-26 | unicode : reuse iterator (#5726) | Georgi Gerganov
2024-02-13 | tests : multi-thread the tokenizer tests (#5474) | Georgi Gerganov
* tests : multi-thread the tokenizer tests ggml-ci
* unicode : fix data race for unidentified codepoints ggml-ci
* unicode : minor style fixes ggml-ci
2024-01-21 | add `#include <string>` to unicode.h (#5051) | bobqianic
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-10-03 | Work on the BPE tokenizer (#3252) | goerch
* Work on the BPE tokenizer
  Tokenizer tests work for Falcon-7B
* Try to fix build problem
* Fix debug assertion failure
* Fix MSVC Unicode BOM problem
* Cleanup and an improvement
* Fix compiler warning
* Cleanup
* Test doesn't work over the full range of Unicodes
* Update .gitignore and Makefile
* Another Makefile rule
* Testing Aquila
* Moving byte decoding back to `token_to_piece` ...
  ... because everyone is using it.
* Guarding some unusable code pathes
* Streamlining code and adding some more assertions
  Important change: I'm classifying added tokens as control tokens now for BPE.
* Adding a comment
* Adding another assertion
* Fixed vocabulary guarding assertions
* Fix PR for recent change
* Fix PR for recent change
* Fix for compiler warning
* Fix PR for recent change
* Fix PR for recent change
* Fix PR for recent change
* Fix for compiler warning
* Fixes for more compiler warnings
* Remove unused code
* Fix initialization of static maps
* Add scores and token types back, adapt gptneox
* Update llama.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update unicode.h
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update unicode.h
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Ported Starcoder and added some assertions
* Fix coding style
* Apply @jploski 's fix for missing tokens
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>