path: root/llama.cpp
authorIwan Kawrakow <iwan.kawrakow@gmail.com>2024-06-19 19:51:39 +0300
committerIwan Kawrakow <iwan.kawrakow@gmail.com>2024-06-22 12:02:52 +0300
commite73ae1f6d31074f774741a592382ec62a9de6dbf (patch)
treee9fc2d42af4a5894703d715af5d3b1d48edd0e0a /llama.cpp
parent7f968d51b4eb6f403bb7dbc1a5bbf98491ff293b (diff)
bitnet(scale in a separate tensor): mul -> scale on CUDA
On CUDA we do not have access to the tensor data until we hit the kernel, hence this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we get without the extra scale operation. Prompt processing (PP) is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.
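As an illustration of the idea behind the commit (not the actual patch): when the second operand of a multiplication is a single-element tensor whose value is readable at graph-build time, the element-wise mul can be replaced by a scale with a host-side float, which is cheaper. On CUDA the value is not readable on the host at that point, which is what the commit works around; the sketch below assumes host-resident data purely for illustration, and mul_or_scale is a hypothetical helper name.

    #include "ggml.h"

    // Hypothetical helper: use ggml_scale when the multiplier is a single
    // F32 value that is readable on the host, otherwise fall back to the
    // element-wise ggml_mul.
    static struct ggml_tensor * mul_or_scale(struct ggml_context * ctx,
                                             struct ggml_tensor  * a,
                                             struct ggml_tensor  * b) {
        if (ggml_nelements(b) == 1 && b->type == GGML_TYPE_F32 && b->data != NULL) {
            const float s = ((const float *) b->data)[0];
            return ggml_scale(ctx, a, s);   // scalar passed by value, no second tensor read in the kernel
        }
        return ggml_mul(ctx, a, b);         // generic element-wise multiply
    }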
Diffstat (limited to 'llama.cpp')
0 files changed, 0 insertions, 0 deletions