author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-17 10:17:05 +0300 |
---|---|---|
committer | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-17 10:17:05 +0300 |
commit | 7024ecfeb4c6ac9b5e1c7351b8775ad829214a8b (patch) | |
tree | 21f85d1969ad4718d6958491590ebc2b1609f8b7 /ggml.c | |
parent | febb8bbea024f4b965f80eab273754adc6ee52e8 (diff) |
iq1bn: faster AVX2
Instead of shuffling quant data into a 128-bit register containing
8-bit ints and then converting to 16-bit, we directly shuffle into
a 256-bit register containing 16-bit ints.
TG-128 @ 2 threads goes from 18.3 to 21.6 t/s.
TG-128 performance now saturates at 8 threads, reaching 60.4 t/s.
There is almost no impact on PP-512 (322 -> 323 t/s). I guess we
amortize the dequantization cost well enough there that we don't
gain much.
We get close to 100 GB/s single-threaded float32 throughput:
./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
iq1_bn
vec_dot_q
4096 values (0.02 MB)
min cycles/32 vals : 3.87
avg cycles/32 vals : 4.40
float32 throughput : 98.27 GB/s
quantized throughput : 4.99 GB/s
Diffstat (limited to 'ggml.c')
0 files changed, 0 insertions, 0 deletions