ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info
author: snadampal <87143774+snadampal@users.noreply.github.com> 2024-02-11 07:22:33 -0600
committer: GitHub <noreply@github.com> 2024-02-11 15:22:33 +0200
commit: a07d0fee1f05c5c1dc49948ae1a3293db017275f (patch)
tree: 06614ff1364269493e4853333ced56802abd7284 /tests
parent: e4640d8fdf56f14a6db3d092bcd3d2d315cb5d04 (diff)
2 files changed, 2 insertions, 2 deletions
diff --git a/tests/test-quantize-fns.cpp b/tests/test-quantize-fns.cpp
index 43df8022..5e92d574 100644
--- a/tests/test-quantize-fns.cpp
+++ b/tests/test-quantize-fns.cpp
@@ -87,7 +87,7 @@ static float dot_product_error(
     vdot.from_float(test_data2, tmp_q2.data(), test_size);
 
     float result = INFINITY;
-    qfns.vec_dot(test_size, &result, tmp_q1.data(), tmp_q2.data());
+    qfns.vec_dot(test_size, &result, 0, tmp_q1.data(), 0, tmp_q2.data(), 0, 1);
 
     const float dot_ref = dot_product(test_data1, test_data2, test_size);
 
diff --git a/tests/test-quantize-perf.cpp b/tests/test-quantize-perf.cpp
index 8ec81734..48d9fae3 100644
--- a/tests/test-quantize-perf.cpp
+++ b/tests/test-quantize-perf.cpp
@@ -346,7 +346,7 @@ int main(int argc, char * argv[]) {
                     printf("    %zu values (%.2f MB)\n", size, 4*size/(float)(1024*1024));
                     auto quantize_fn = [&](void) -> float {
                         float result;
-                        qfns.vec_dot(size, &result, test_q1, test_q2);
+                        qfns.vec_dot(size, &result, 0, test_q1, 0, test_q2, 0, 1);
                         return result;
                     };
                     size_t quantized_size = ggml_row_size(type, size);
author	snadampal <87143774+snadampal@users.noreply.github.com>	2024-02-11 07:22:33 -0600
committer	GitHub <noreply@github.com>	2024-02-11 15:22:33 +0200
commit	a07d0fee1f05c5c1dc49948ae1a3293db017275f (patch)
tree	06614ff1364269493e4853333ced56802abd7284 /tests
parent	e4640d8fdf56f14a6db3d092bcd3d2d315cb5d04 (diff)