author     Kawrakow <iwankawrakow@gmail.com>  2025-01-10 15:06:00 +0200
committer  GitHub <noreply@github.com>        2025-01-10 15:06:00 +0200
commit     b1363b6177661556750c110cf876e044e61af365 (patch)
tree       5314e735bffc0eba02dd6c028e01cdd5fc863b02 /src
parent     3e6851621c54e8424196810f2798811f069bcff1 (diff)
Falcon3 changes (#168)
* Add Falcon3 pre-tokenizer (same as llama3)

* q8_k16: use integer arithmetic to sum row values

The existing implementation, which simply sums up the quantized values in f32, works fine for the original BitNet models and also for the TriLM ternary models. But for Falcon3 I see a significant difference between the CPU and the GPU perplexity. If I use the q8_K16 int8_t quants to sum up the values in a row, the CPU-GPU PPL difference becomes much smaller, and we get a lower PPL than Microsoft BitNet, which claims to be "lossless".

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
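For illustration only, a minimal C++ sketch of the integer-arithmetic row sum described above. The block layout (one f32 scale per 16 int8_t quants) and the names block_q8_sketch, row_sum_f32, and row_sum_int are assumptions invented for this sketch; they are not the actual q8_K16 definitions in the repository.

// Sketch: sum a quantized row with integer arithmetic instead of
// accumulating dequantized f32 values. Block layout is an assumption,
// NOT the real q8_K16 layout.
#include <cstdint>
#include <vector>

struct block_q8_sketch {
    float  d;       // per-block scale (assumed layout)
    int8_t qs[16];  // quantized values
};

// (a) existing-style sum: dequantize each value, accumulate in f32
float row_sum_f32(const std::vector<block_q8_sketch> & row) {
    float sum = 0.0f;
    for (const auto & b : row) {
        for (int j = 0; j < 16; ++j) {
            sum += b.d * b.qs[j];   // f32 rounding error accumulates across the row
        }
    }
    return sum;
}

// (b) integer-arithmetic sum: accumulate the int8 quants exactly in an
//     int32, then apply the block scale once
float row_sum_int(const std::vector<block_q8_sketch> & row) {
    float sum = 0.0f;
    for (const auto & b : row) {
        int32_t isum = 0;
        for (int j = 0; j < 16; ++j) {
            isum += b.qs[j];        // exact integer accumulation
        }
        sum += b.d * (float) isum;  // single f32 multiply per block
    }
    return sum;
}

Accumulating the int8_t quants in an int32_t is exact, leaving only one f32 rounding step per block, which is presumably why the CPU and GPU row sums agree more closely in this scheme.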
Diffstat (limited to 'src')
-rw-r--r--  src/llama.cpp | 3
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/llama.cpp b/src/llama.cpp
index 37653478..54b9b118 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -5552,7 +5552,8 @@ static void llm_load_vocab(
         } else if (
                 tokenizer_pre == "llama3" ||
                 tokenizer_pre == "llama-v3" ||
-                tokenizer_pre == "llama-bpe") {
+                tokenizer_pre == "llama-bpe"||
+                tokenizer_pre == "falcon3") {
             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
             vocab.tokenizer_ignore_merges = true;
             vocab.tokenizer_add_bos = true;