author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-17 10:17:05 +0300 |
---|---|---|
committer | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-17 10:17:05 +0300 |
commit | 7024ecfeb4c6ac9b5e1c7351b8775ad829214a8b (patch) | |
tree | 21f85d1969ad4718d6958491590ebc2b1609f8b7 /ggml.c | |
parent | febb8bbea024f4b965f80eab273754adc6ee52e8 (diff) |
iq1bn: faster AVX2
Instead of shuffling quant data into a 128-bit register containing
8-bit ints and then converting to 16-bit, we directly shuffle into
a 256-bit register containing 16-bit ints.
TG-128 @ 2 threads goes from 18.3 to 21.6 t/s.
TG-128 performance now saturates at 8 threads, reaching 60.4 t/s.
There is almost no impact on PP-512 (322 -> 323 t/s). I guess we
amortize the dequantization cost well enough there that we don't
gain much.
We get close to 100 GB/s single-threaded float32 throughput:
./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
iq1_bn
vec_dot_q
4096 values (0.02 MB)
min cycles/32 vals : 3.87
avg cycles/32 vals : 4.40
float32 throughput : 98.27 GB/s
quantized throughput : 4.99 GB/s
Diffstat (limited to 'ggml.c')
0 files changed, 0 insertions, 0 deletions