From d9c4ea48d1e41d8f7215ff1c094d75e7229b65e2 Mon Sep 17 00:00:00 2001
From: Kawrakow
Date: Mon, 27 Jan 2025 16:50:07 +0200
Subject: Interleave 8 rows (Q8_0, IQ4_XS) (#178)

* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly higher
than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and
14.28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2. This is the NEON implementation.
It is a tiny bit faster than 4 interleaved rows (~0.5%).
So, this looks like a winner given the Zen4/AVX2 improvement
without an associated NEON regression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches a peak of 8.16 t/s at just 2 threads compared to
7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly slower
than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point in having the special purpose implementation.

---------

Co-authored-by: Iwan Kawrakow
---
 src/llama.cpp | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'src/llama.cpp')

diff --git a/src/llama.cpp b/src/llama.cpp
index c2bc5cc0..836fd97a 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -16906,8 +16906,8 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
             else chunk_size_multiplier = 4;
         }
         else if (new_type == GGML_TYPE_IQ4_XS_R4) {
-            if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_IQ4_XS;
-            else chunk_size_multiplier = 4;
+            if (tensor->ne[1] % 8 != 0) new_type = GGML_TYPE_IQ4_XS;
+            else chunk_size_multiplier = 8;
         }
         else if (new_type == GGML_TYPE_Q4_0_R4) {
             if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_Q4_0;
@@ -16922,8 +16922,8 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
             else chunk_size_multiplier = 4;
         }
         else if (new_type == GGML_TYPE_Q8_0_R4) {
-            if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_Q8_0;
-            else chunk_size_multiplier = 4;
+            if (tensor->ne[1] % 8 != 0) new_type = GGML_TYPE_Q8_0;
+            else chunk_size_multiplier = 8;
         }
         else if (new_type == GGML_TYPE_Q2_K_R4) {
             if (tensor->ne[1] % 4 != 0) new_type = GGML_TYPE_Q2_K;
--
cgit v1.2.3
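
For reference, below is a minimal C++ sketch of the divisibility rule these hunks change. It uses stand-in enum and struct names (the real ggml type enum and tensor struct are not reproduced here): a tensor whose second dimension ne[1] (the number of rows) is not a multiple of 8 can no longer be repacked into the 8-row interleaved layout, so the quantizer falls back to the plain IQ4_XS / Q8_0 type; otherwise the chunk size multiplier becomes 8, presumably so that each quantization chunk covers whole groups of 8 interleaved rows.

// Minimal, self-contained sketch (not the actual llama.cpp source) of the
// fallback rule changed above. The enum and struct are hypothetical stand-ins.
#include <cstdint>
#include <cstdio>

enum stub_type { STUB_Q8_0, STUB_Q8_0_R4, STUB_IQ4_XS, STUB_IQ4_XS_R4 };

struct stub_tensor { int64_t ne[2]; };   // ne[1] = number of rows

static void apply_interleave_rule(const stub_tensor & t, stub_type & new_type,
                                  int & chunk_size_multiplier) {
    if (new_type == STUB_IQ4_XS_R4) {
        if (t.ne[1] % 8 != 0) new_type = STUB_IQ4_XS;   // cannot interleave 8 rows
        else chunk_size_multiplier = 8;
    }
    else if (new_type == STUB_Q8_0_R4) {
        if (t.ne[1] % 8 != 0) new_type = STUB_Q8_0;     // fall back to plain Q8_0
        else chunk_size_multiplier = 8;
    }
}

int main() {
    stub_tensor t{{4096, 14336}};        // 14336 rows -> divisible by 8, keep R4 type
    stub_type   type = STUB_Q8_0_R4;
    int         mult = 1;
    apply_interleave_rule(t, type, mult);
    std::printf("type=%d chunk_size_multiplier=%d\n", (int)type, mult);
    return 0;
}

Built with any C++11 compiler, the example prints type=1 chunk_size_multiplier=8, i.e. the interleaved type is kept because the row count is divisible by 8.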