author: Iwan Kawrakow <iwan.kawrakow@gmail.com> 2024-07-18 14:00:56 +0300
commit: 8db01c0804b603cb76bbee82ebb1a144c8d3592e (patch)
tree:   c668a7fbf539881c1f2508829973f914f1f8f5a1 /iqk_mul_mat.cpp
parent: 744eb9ffa955fa3557cc835995e45448c3c06bcb (diff)
iqk_mul_mat: attention matrix multiplications
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat, each performing
one n_kv_embed x n_token 2D multiplication using all nth threads.
Instead, with this PR, if n_head is a multiple of nth, each thread
performs n_head/nth of the n_kv_embed x n_token 2D multiplications
on its own.
This improves PP-512 (32 threads) for Bitnet-3B to 433 t/s, up from
409 t/s. It is beneficial in other cases too: for LLaMA-7B, q4_K_S
goes from 193 t/s to 201 t/s, and fp16 from 139 t/s to 144 t/s.
All these numbers are for a Ryzen-7950X CPU.