ik_llama.cpp.git

author	Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-18 11:39:32 +0300
committer	Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-18 11:39:32 +0300
commit	744eb9ffa955fa3557cc835995e45448c3c06bcb (patch)
tree	b4e7e894597d6486d866b1814d576236f694d999 /ggml.c
parent	6a132862fd3826d241c0c6f43e5f91450626eeb2 (diff)

iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2

I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrix multiplication, where Q and K have the shape 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats requires the row size to be a multiple of the SIMD vector size (16 on the Ryzen-7950X, 8 on the Ryzen-5975), so with a row size of 100 this matrix multiplication was falling back to ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in a nearly 20% performance boost for PP-512 and ~3% for TG-128! With a context of 2048, PP performance increases by nearly 70%!
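To illustrate the idea of finishing a row with a single __m128 step, here is a minimal sketch of a float dot product for row sizes that are a multiple of 4. It is not the actual iqk_mul_mat kernel: the function name dot_f32_mult4 is hypothetical, it handles one row pair at a time, and it falls back to a scalar loop when the compiler is not targeting AVX2 (build with e.g. -mavx2 to get the SIMD path).

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper for illustration only (not the iqk_mul_mat code).
 * Computes the dot product of two float rows of length n.
 * Assumption: n is a multiple of 4, but not necessarily a multiple of 8. */
static float dot_f32_mult4(const float *x, const float *y, size_t n) {
#if defined(__AVX2__)
    __m256 acc8 = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration using 256-bit vectors. */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        acc8 = _mm256_add_ps(acc8, _mm256_mul_ps(vx, vy));
    }

    /* Fold the upper and lower 128-bit halves into one __m128 accumulator. */
    __m128 acc4 = _mm_add_ps(_mm256_castps256_ps128(acc8),
                             _mm256_extractf128_ps(acc8, 1));

    /* Tail: if 4 floats remain, process them with a single 128-bit step. */
    if (i + 4 <= n) {
        acc4 = _mm_add_ps(acc4, _mm_mul_ps(_mm_loadu_ps(x + i),
                                           _mm_loadu_ps(y + i)));
        i += 4;
    }

    /* Horizontal sum of the 4 lanes. */
    acc4 = _mm_add_ps(acc4, _mm_movehl_ps(acc4, acc4));
    acc4 = _mm_add_ss(acc4, _mm_movehdup_ps(acc4));
    return _mm_cvtss_f32(acc4);
#else
    /* Scalar fallback so the sketch still builds without AVX2. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
#endif
}
```

With a row size of 100, as in the Q*K case above, the main loop covers the first 96 values and the __m128 step covers the remaining 4, so no scalar cleanup is needed.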

Diffstat (limited to 'ggml.c')

0 files changed, 0 insertions, 0 deletions

