From d9c4ea48d1e41d8f7215ff1c094d75e7229b65e2 Mon Sep 17 00:00:00 2001
From: Kawrakow <iwankawrakow@gmail.com>
Date: Mon, 27 Jan 2025 16:50:07 +0200
Subject: Interleave 8 rows (Q8_0, IQ4_XS) (#178)

* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14/28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2.

This is the NEON implementation. It is tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without associated NEON egression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
---
 ggml/src/ggml-quants.c | 1 -
 1 file changed, 1 deletion(-)

(limited to 'ggml/src/ggml-quants.c')

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 23ac9915..391d9e2e 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -936,7 +936,6 @@ void quantize_row_q8_0(const float * restrict x, void * restrict vy, int64_t k)
 
 #if defined(__ARM_NEON)
     for (int i = 0; i < nb; i++) {
-        int i4 = i/4, ir = i%4;
         float32x4_t srcv [8];
         float32x4_t asrcv[8];
         float32x4_t amaxv[8];
-- 
cgit v1.2.3