ik_llama.cpp.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2024-03-26	wpm : portable unicode tolower (#6305)	Jared Van Bortel
	Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
2024-03-11	llama : refactor unicode stuff (#5992)	Georgi Gerganov
	* llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref
2024-03-01	unicode : switch to multimap based nfd_map (#5799)	Douglas Hanley
	* switch to multimap based nfd_map due to compile time issues * simplify multimap keys * dont construct new locale every time
2024-02-28	llama : improve BERT tokenization (#5740)	Douglas Hanley
	* implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-26	unicode : reuse iterator (#5726)	Georgi Gerganov

2024-02-13	tests : multi-thread the tokenizer tests (#5474)	Georgi Gerganov
	* tests : multi-thread the tokenizer tests ggml-ci * unicode : fix data race for unidentified codepoints ggml-ci * unicode : minor style fixes ggml-ci
2024-01-21	add `#include <string>` to unicode.h (#5051)	bobqianic
	Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-10-03	Work on the BPE tokenizer (#3252)	goerch
	* Work on the BPE tokenizer Tokenizer tests work for Falcon-7B * Try to fix build problem * Fix debug assertion failure * Fix MSVC Unicode BOM problem * Cleanup and an improvement * Fix compiler warning * Cleanup * Test doesn't work over the full range of Unicodes * Update .gitignore and Makefile * Another Makefile rule * Testing Aquila * Moving byte decoding back to `token_to_piece` ... ... because everyone is using it. * Guarding some unusable code pathes * Streamlining code and adding some more assertions Important change: I'm classifying added tokens as control tokens now for BPE. * Adding a comment * Adding another assertion * Fixed vocabulary guarding assertions * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fix PR for recent change * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fixes for more compiler warnings * Remove unused code * Fix initialization of static maps * Add scores and token types back, adapt gptneox * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update unicode.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update unicode.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Ported Starcoder and added some assertions * Fix coding style * Apply @jploski 's fix for missing tokens --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>