path: root/llama.cpp
authorIwan Kawrakow <iwan.kawrakow@gmail.com>2024-06-19 19:51:39 +0300
committerIwan Kawrakow <iwan.kawrakow@gmail.com>2024-06-22 12:02:52 +0300
commite73ae1f6d31074f774741a592382ec62a9de6dbf (patch)
treee9fc2d42af4a5894703d715af5d3b1d48edd0e0a /llama.cpp
parent7f968d51b4eb6f403bb7dbc1a5bbf98491ff293b (diff)
bitnet(scale in a separate tensor): mul -> scale on CUDA
On CUDA we do not have access to the tensor data until we hit the kernel, hence this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we get without the extra scale operation. Prompt processing (PP) is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.
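As an illustration of the idea behind the commit (not the actual patch): when the second operand of a multiplication is a single-element tensor whose value is readable at graph-build time, the element-wise mul can be replaced by a scale with a host-side float, which is cheaper. On CUDA the value is not readable on the host at that point, which is what the commit works around; the sketch below assumes host-resident data purely for illustration, and mul_or_scale is a hypothetical helper name.

    #include "ggml.h"

    // Hypothetical helper: use ggml_scale when the multiplier is a single
    // F32 value that is readable on the host, otherwise fall back to the
    // element-wise ggml_mul.
    static struct ggml_tensor * mul_or_scale(struct ggml_context * ctx,
                                             struct ggml_tensor  * a,
                                             struct ggml_tensor  * b) {
        if (ggml_nelements(b) == 1 && b->type == GGML_TYPE_F32 && b->data != NULL) {
            const float s = ((const float *) b->data)[0];
            return ggml_scale(ctx, a, s);   // scalar passed by value, no second tensor read in the kernel
        }
        return ggml_mul(ctx, a, b);         // generic element-wise multiply
    }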
Diffstat (limited to 'llama.cpp')
0 files changed, 0 insertions, 0 deletions