From edd4c1481708fcd788b0e423268304fd26e2b125 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Sun, 27 Aug 2023 14:19:19 +0300
Subject: llama : more tokenizer fixes (#2810)

* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
---
 examples/embd-input/embd-input-lib.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'examples/embd-input/embd-input-lib.cpp')

diff --git a/examples/embd-input/embd-input-lib.cpp b/examples/embd-input/embd-input-lib.cpp
index 8a6ad882..036bdb39 100644
--- a/examples/embd-input/embd-input-lib.cpp
+++ b/examples/embd-input/embd-input-lib.cpp
@@ -214,7 +214,7 @@ const char * sampling(struct MyModel * mymodel) {
     if (id == llama_token_eos(ctx)) {
         ret = "";
     } else {
-        ret = llama_token_to_str(ctx, id);
+        ret = llama_token_to_piece(ctx, id);
     }
     eval_id(mymodel, id);
     return ret.c_str();
-- 
cgit v1.2.3
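
Note on "pieces" versus decoded text: the commit message separates a token's piece from the text obtained by detokenizing a whole sequence, and says the input is now prefixed with whitespace before tokenization. The standalone sketch below illustrates that idea only. It assumes SentencePiece's convention of marking spaces with U+2581, and the helpers piece_to_text and detokenize are hypothetical names that are not part of llama.cpp; the actual implementation is not shown in this patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// UTF-8 encoding of U+2581, the SentencePiece whitespace marker.
static const std::string kSpaceMarker = "\xe2\x96\x81";

// Hypothetical helper (not a llama.cpp function): map every whitespace
// marker in a single piece to an ASCII space.
std::string piece_to_text(const std::string & piece) {
    std::string out;
    std::size_t pos = 0;
    while (pos < piece.size()) {
        if (piece.compare(pos, kSpaceMarker.size(), kSpaceMarker) == 0) {
            out += ' ';
            pos += kSpaceMarker.size();
        } else {
            out += piece[pos++];
        }
    }
    return out;
}

// Hypothetical helper: concatenate the pieces of a sequence, then drop the
// one leading space that was prefixed to the input before tokenization.
std::string detokenize(const std::vector<std::string> & pieces) {
    std::string text;
    for (const std::string & p : pieces) {
        text += piece_to_text(p);
    }
    if (!text.empty() && text[0] == ' ') {
        text.erase(0, 1);
    }
    return text;
}
```

Under these assumptions a single piece keeps its leading space, which suits streaming one token at a time as sampling() does, while the leading space is removed only when a full sequence is decoded.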