author | Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> | 2023-10-22 12:14:56 -0600
---|---|---
committer | GitHub <noreply@github.com> | 2023-10-22 21:14:56 +0300
commit | a5e7dbd6141128bfa3c40a19c2945a181df625d3 (patch) |
tree | 14cb15291418d4f591d7a58d8239eb02b966b595 | /convert-baichuan-hf-to-gguf.py
parent | d3956aea53369455008159cc405ed4c496976692 (diff) |
llama : validate special token ids are in range when loading GGUF model (#3635)
* Add validation for special token ids to llama.cpp
Small optimization for llama_byte_to_token SPM mode
* Fix BPE newline check, only I could break something so simple
* Killll meeeeee
* Account for GGUF_KEY_KEY only setting when the key exists
* Minor code cleanups.
* Fix convert.py error msg when added tokens are out of range
* Make gguf SpecialVocab vocab size-aware
Update conversion scripts accordingly
* Avoid a string copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Diffstat (limited to 'convert-baichuan-hf-to-gguf.py')
-rwxr-xr-x | convert-baichuan-hf-to-gguf.py | 2 |
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/convert-baichuan-hf-to-gguf.py b/convert-baichuan-hf-to-gguf.py
index a1783f71..3b64ecb8 100755
--- a/convert-baichuan-hf-to-gguf.py
+++ b/convert-baichuan-hf-to-gguf.py
@@ -230,7 +230,7 @@
 gguf_writer.add_token_list(tokens)
 gguf_writer.add_token_scores(scores)
 gguf_writer.add_token_types(toktypes)
-special_vocab = gguf.SpecialVocab(dir_model)
+special_vocab = gguf.SpecialVocab(dir_model, n_vocab = len(tokens))
 special_vocab.add_to_gguf(gguf_writer)

 # TENSORS
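The one-line patch above passes the vocabulary size into `gguf.SpecialVocab`, so that special token ids (BOS, EOS, etc.) can be checked against the actual vocabulary before being written to the GGUF file. The following is a minimal sketch, not the actual llama.cpp or gguf-py implementation, of the kind of range check this commit introduces; the function name `validate_special_token_id` is hypothetical:

```python
# Hypothetical illustration of validating a special token id against the
# vocabulary size, as this commit does for ids loaded from a GGUF model.
def validate_special_token_id(token_id: int, n_vocab: int, name: str) -> bool:
    """Return True if token_id is a valid index into a vocab of n_vocab entries."""
    if 0 <= token_id < n_vocab:
        return True
    # An out-of-range id is reported and skipped rather than written/used.
    print(f"warning: special token '{name}' id {token_id} "
          f"is out of range [0, {n_vocab}); ignoring it")
    return False

# A BOS id of 1 fits a 32000-token vocab; an EOS id of 50000 does not.
ok_bos = validate_special_token_id(1, 32000, "bos")
ok_eos = validate_special_token_id(50000, 32000, "eos")
```

With `n_vocab = len(tokens)` supplied by the conversion script, the check can run at conversion time instead of failing only when the model is loaded.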