path: root/include/llama.h
authorKawrakow <48489457+ikawrakow@users.noreply.github.com>2024-07-27 08:44:18 +0200
committerGitHub <noreply@github.com>2024-07-27 08:44:18 +0200
commitf62615b44f7df586cb58ed9fffca59b96820117b (patch)
tree422a2b063fd1ba3ef9090c701f4980359d7a4a18 /include/llama.h
parent154e0d75fccf1784fe9ff6fd76a630b66563da3d (diff)
Simdify and multi-thread tanh (#4)
Gemma-2 performance seemed lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for soft-capping the final output. ggml computed tanh with a single thread. Combined with tanh(x) being a fairly expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and SIMD-ifying the tanh computation, I observe a 33% gain in prompt processing speed (!!!). TG is of course memory bound, but despite this we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).

SIMD-ifying: we have tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1), so we can just reuse Justine Tunney's SIMD exp implementation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'include/llama.h')
0 files changed, 0 insertions, 0 deletions