author | Kawrakow <iwankawrakow@gmail.com> | 2025-01-27 18:53:47 +0200
---|---|---
committer | GitHub <noreply@github.com> | 2025-01-27 18:53:47 +0200
commit | f725576345582144dfebd7f1e6c8ac93eb1eb0ca (patch)
tree | 12de4f7a7c4c9c75e1df955764200102e901a29d /src/llama.cpp
parent | d9c4ea48d1e41d8f7215ff1c094d75e7229b65e2 (diff)
Minor performance improvements (#179)
* Try interleaving 8 rows for iq4_xs
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches maximum performance at just 2 threads and is slightly
higher than with 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14.28 t/s @ 4 threads).
* Try interleaving 8 iq4_xs rows
It is also faster on AVX2.
This is the NEON implementation. It is a tiny bit faster than
4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression (see the layout sketch below).
* Cleanup
* 8-rows interleaved q8_0 (AVX2)
* 8-rows interleaved q8_0 (Zen4)
* 8-rows interleaved q8_0 (Zen4) - slightly better
PP-512 is now 284 t/s compared to 257 t/s with 4 interleaved rows.
TG-128 reaches a peak of 8.16 t/s at just 2 threads, compared
to 7.95 t/s @ 4 threads before.
* 8-rows interleaved q8_0 (NEON)
PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.
* FA: repack Q8_0 to Q8_0_R8
* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)
* FA: repack Q8_0 to Q8_0_R8 (NEON)
Very slightly faster than the general-purpose gemm, slightly
slower than the D = 128 special-case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128: we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point in keeping the special-purpose implementation.
* q4_0_r8 (AVX2)
* q4_0_r8 (NEON)
A tiny bit faster PP (~128 vs ~126 t/s), same TG.
* q4_0_r8 (Zen4)
Somehow only marginally faster?
268 t/s vs 261 t/s
* q4_0_r8 (Zen4) - slightly better
282 t/s for a pure q4_0 L3-8B quantization.
* Apply platform specific modifications when repacking
E.g., on NEON it is useful to pre-apply q ^ 0x88 to q4_0 (see the sketch below).
This results in a ~3% performance improvement.
Hence,
* Changed the signature of the repack_X functions to take a
bool argument indicating whether the repacking is done online and,
if so, apply modifications as appropriate while repacking.
* Added iqk_modify_tensor to apply the modifications, while loading
the model, to models that have already been repacked. Caveat:
just like rtr, this needs mmap disabled (otherwise one would
need to move the data to a non-mmap-ed buffer, which is much more
complicated).
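For reference, here is the idea behind the q ^ 0x88 pre-modification. A q4_0 block stores 32 quants as unsigned 4-bit values (two per byte) that dequantize as d * (q - 8); XOR-ing each nibble with 8 makes a later signed 4-bit read yield q - 8 directly, so the subtraction drops out of the matmul inner loop. A minimal sketch, assuming the standard q4_0 block layout; the function name and call site are hypothetical, not the repository's actual code.

```cpp
// Sketch of the NEON-oriented pre-modification for q4_0: XOR every packed byte
// with 0x88, i.e. flip bit 3 of both nibbles. For q in 0..15, (q ^ 8) read as a
// signed 4-bit value equals q - 8, exactly the offset q4_0 dequantization needs.
#include <cstdint>
#include <cstddef>

constexpr int QK4_0 = 32;

struct block_q4_0 {
    uint16_t d;                // fp16 scale
    uint8_t  qs[QK4_0 / 2];    // 32 quants, two 4-bit values per byte
};

static void pre_xor_q4_0(block_q4_0 * blocks, size_t n_blocks) {
    for (size_t ib = 0; ib < n_blocks; ++ib) {
        for (int j = 0; j < QK4_0 / 2; ++j) {
            blocks[ib].qs[j] ^= 0x88;   // done once at repack/load time, not per matmul
        }
    }
}
```

The NEON kernel can then extract signed nibbles (shift left by 4 plus arithmetic shift right for the low nibble, a plain arithmetic shift for the high one) and feed them straight into the int8 dot products, without a separate subtract-8 step per block.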
* Apply platform specific modifications when repacking
On Zen4 we can pre-convert the signed quants in q8_0_r4 and
q8_k_r8 to unsigned, thus avoiding these operations in the matrix
multiplications (see the sketch below). With this change we hit
PP-512 = 382.40 t/s (q8_k_r8)
PP-512 = 306.92 t/s (q8_0_r4)
for L3-8B on a Ryzen-7950X using a q8_0 KV-cache.
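The motivation: the AVX512-VNNI instruction _mm512_dpbusd_epi32 multiplies unsigned bytes from its first operand with signed bytes from its second, so a kernel that keeps the weight quants signed has to bias them by 128 on every use and compensate for the bias. Doing the bias once at repack/load time removes that work from the hot loop. A minimal sketch, using plain q8_0 blocks for simplicity; the names are illustrative, not the repository's q8_0_r4/q8_k_r8 code.

```cpp
// Sketch of the Zen4 pre-modification: convert signed q8 quants to their
// biased (unsigned) form once, so the matmul can feed them directly into
// dpbusd-style unsigned x signed dot products. The +128 bias is compensated
// elsewhere using the per-block sums stored with the activation quants.
#include <cstdint>
#include <cstddef>

constexpr int QK8_0 = 32;

struct block_q8_0 {
    uint16_t d;             // fp16 scale
    int8_t   qs[QK8_0];     // signed 8-bit quants
};

static void pre_bias_q8_0(block_q8_0 * blocks, size_t n_blocks) {
    for (size_t ib = 0; ib < n_blocks; ++ib) {
        for (int j = 0; j < QK8_0; ++j) {
            // q -> q + 128 (mod 256); the byte is subsequently read as uint8_t
            blocks[ib].qs[j] = (int8_t)(blocks[ib].qs[j] ^ 0x80);
        }
    }
}
```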
* Process up to 16 columns per kernel call for q8_k_r8
This brings PP-512 up to 389 t/s.
* Be able to load Deepseek-v2-Lite
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'src/llama.cpp')
-rw-r--r-- | src/llama.cpp | 16
1 file changed, 13 insertions, 3 deletions
```diff
diff --git a/src/llama.cpp b/src/llama.cpp
index 836fd97a..b6a4a06d 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -7650,7 +7650,7 @@ static bool llm_load_tensors(
                         layer.ffn_up = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
                     } else {
                         layer.ffn_gate_inp = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert});
-                        layer.ffn_exp_probs_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert} );
+                        layer.ffn_exp_probs_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, 1);
 
                         GGML_ASSERT(n_expert > 0);
                         GGML_ASSERT(n_expert_used > 0);
@@ -8014,6 +8014,16 @@ static bool llm_load_tensors(
         }
     }
 
+    if (!ml.use_mmap) {
+        int n_modified = 0;
+        for (auto& it : model.tensors_by_name) {
+            if (ggml_backend_buffer_is_host(it.second->buffer)) {
+                if (iqk_modify_tensor(it.second)) ++n_modified;
+            }
+        }
+        if (n_modified > 0) printf("============ Modified %d tensors\n", n_modified);
+    }
+
     if (!ml.use_mmap && ml.repack_tensors) {
         int n_repacked = 0;
         for (auto& it : model.tensors_by_name) {
@@ -16910,8 +16920,8 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
             else chunk_size_multiplier = 8;
         }
         else if (new_type == GGML_TYPE_Q4_0_R4) {
-            if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_Q4_0;
-            else chunk_size_multiplier = 4;
+            if (tensor->ne[1] % 8 != 0) new_type = GGML_TYPE_Q4_0;
+            else chunk_size_multiplier = 8;
         }
         else if (new_type == GGML_TYPE_Q5_0_R4) {
             if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_Q5_0;
```