llama : more tokenizer fixes (#2810)

* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
author: Georgi Gerganov <ggerganov@gmail.com> 2023-08-27 14:19:19 +0300
committer: GitHub <noreply@github.com> 2023-08-27 14:19:19 +0300
commit: edd4c1481708fcd788b0e423268304fd26e2b125 (patch)
tree: 2e7db62ea4816dc18f2518a08c36b6ea480eff05 /examples/embd-input
parent: 1591e2e590762011b43b10a9b6e04f13f98f2aa5 (diff)
1 file changed, 1 insertion, 1 deletion
diff --git a/examples/embd-input/embd-input-lib.cpp b/examples/embd-input/embd-input-lib.cpp
index 8a6ad882..036bdb39 100644
--- a/examples/embd-input/embd-input-lib.cpp
+++ b/examples/embd-input/embd-input-lib.cpp
@@ -214,7 +214,7 @@ const char * sampling(struct MyModel * mymodel) {
     if (id == llama_token_eos(ctx)) {
         ret = "</s>";
     } else {
-        ret = llama_token_to_str(ctx, id);
+        ret = llama_token_to_piece(ctx, id);
     }
     eval_id(mymodel, id);
     return ret.c_str();
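
The rename from `llama_token_to_str` to `llama_token_to_piece` follows the commit message's distinction between a single token's piece and the decoded text of a whole sequence. The sketch below illustrates that distinction in isolation. It is not the llama.cpp implementation: `piece_to_text`, `detokenize`, and `kSpaceMarker` are hypothetical names, and it assumes a SentencePiece-style vocabulary in which U+2581 marks a space and the tokenizer prefixes the input text with one whitespace character.

```cpp
#include <string>
#include <vector>

// Assumption: SentencePiece-style pieces mark a preceding space with
// U+2581, which is the byte sequence E2 96 81 in UTF-8.
static const std::string kSpaceMarker = "\xE2\x96\x81";

// Hypothetical helper: turn one raw vocabulary piece into printable text
// by replacing each space marker with an ordinary space.
std::string piece_to_text(const std::string & piece) {
    std::string out;
    size_t pos = 0;
    while (pos < piece.size()) {
        if (piece.compare(pos, kSpaceMarker.size(), kSpaceMarker) == 0) {
            out += ' ';
            pos += kSpaceMarker.size();
        } else {
            out += piece[pos++];
        }
    }
    return out;
}

// Hypothetical helper: decode a whole sequence. The pieces are
// concatenated, then the single leading space that the tokenizer
// prefixed to the input text is dropped.
std::string detokenize(const std::vector<std::string> & pieces) {
    std::string text;
    for (const auto & p : pieces) {
        text += piece_to_text(p);
    }
    if (!text.empty() && text[0] == ' ') {
        text.erase(0, 1);
    }
    return text;
}
```

Under these assumptions, a single piece keeps its leading space, which is what a caller streaming tokens one at a time (as `sampling` does above) would want, while only the full-sequence decode strips the space introduced by the whitespace prefix.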