path: root/ggml.c
author    Iwan Kawrakow <iwan.kawrakow@gmail.com>  2024-07-17 10:17:05 +0300
committer Iwan Kawrakow <iwan.kawrakow@gmail.com>  2024-07-17 10:17:05 +0300
commit    7024ecfeb4c6ac9b5e1c7351b8775ad829214a8b (patch)
tree      21f85d1969ad4718d6958491590ebc2b1609f8b7 /ggml.c
parent    febb8bbea024f4b965f80eab273754adc6ee52e8 (diff)
iq1bn: faster AVX2
Instead of shuffling the quant data into a 128-bit register of 8-bit ints and then widening to 16 bits, we shuffle directly into a 256-bit register of 16-bit ints.

TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads, reaching 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s); I guess we amortize the dequantization cost well enough there that we don't gain much. Single-threaded, we get close to 100 GB/s of float32 throughput:

./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
  iq1_bn
    vec_dot_q
      4096 values (0.02 MB)
      min cycles/32 vals    :  3.87
      avg cycles/32 vals    :  4.40
      float32 throughput    : 98.27 GB/s
      quantized throughput  :  4.99 GB/s
Diffstat (limited to 'ggml.c')
0 files changed, 0 insertions, 0 deletions