author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-06-08 13:47:02 +0300
---|---|---
committer | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-06-22 12:02:50 +0300
commit | 8a80a31ddd5f3239ab1da6deff1efcdf4f43d1d9 |
tree | 78ccaf60d6f17dbead0658ac1057006920d5c324 /examples/train-text-from-scratch/train-text-from-scratch.cpp |
parent | 81409a02f3c10a74ea23167f1782a951d026ab49 |
iqk_mul_mat: fix q8_0
I was happily using _mm256_packs_epi32() to pack the int32_t
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But in theory this can overflow (_mm256_packs_epi32()
saturates any value outside the int16_t range), so it is better
to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine the 4 dot
products using int32_t additions. This is (almost) as fast, unlike
_mm256_hadd_epi32(), which seems excessively slow on the Ryzen 9 7950X.
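To make the trade-off concrete, here is a small portable C sketch that models the intrinsics with plain arrays instead of calling them, so it runs on any CPU. It assumes, hypothetically, that each dot product arrives as a vector of 8 int32_t partial sums, each lane holding 4 int8_t x int8_t products; the commit does not describe its exact data layout. All helper names (vec8, sat16, unpacklo32, unpackhi32, unpack64, add32, reduce4, total_after_narrowing) are made up for illustration. The reduction uses the epi32 and epi64 unpack variants, which is one standard way to sum 4 vectors without _mm256_hadd_epi32(); the commit message does not say which variants iqk_mul_mat uses, so treat this as a sketch, not the actual kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one __m256i holding 8 int32_t lanes. The AVX2 unpack
   intrinsics work within each 128-bit half: lanes 0..3 and lanes 4..7. */
typedef struct { int32_t v[8]; } vec8;

/* Models the narrowing done by _mm256_packs_epi32: signed saturation to int16_t. */
static int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Models _mm256_unpacklo_epi32: per half, [a0 b0 a1 b1]. */
static vec8 unpacklo32(vec8 a, vec8 b) {
    vec8 r;
    for (int h = 0; h < 8; h += 4) {
        r.v[h + 0] = a.v[h + 0]; r.v[h + 1] = b.v[h + 0];
        r.v[h + 2] = a.v[h + 1]; r.v[h + 3] = b.v[h + 1];
    }
    return r;
}

/* Models _mm256_unpackhi_epi32: per half, [a2 b2 a3 b3]. */
static vec8 unpackhi32(vec8 a, vec8 b) {
    vec8 r;
    for (int h = 0; h < 8; h += 4) {
        r.v[h + 0] = a.v[h + 2]; r.v[h + 1] = b.v[h + 2];
        r.v[h + 2] = a.v[h + 3]; r.v[h + 3] = b.v[h + 3];
    }
    return r;
}

/* Models _mm256_unpacklo_epi64 (hi == 0): per half, [a0 a1 b0 b1],
   and _mm256_unpackhi_epi64 (hi != 0): per half, [a2 a3 b2 b3]. */
static vec8 unpack64(vec8 a, vec8 b, int hi) {
    vec8 r;
    int o = hi ? 2 : 0;
    for (int h = 0; h < 8; h += 4) {
        r.v[h + 0] = a.v[h + o]; r.v[h + 1] = a.v[h + o + 1];
        r.v[h + 2] = b.v[h + o]; r.v[h + 3] = b.v[h + o + 1];
    }
    return r;
}

/* Models _mm256_add_epi32: lane-wise int32_t addition. */
static vec8 add32(vec8 a, vec8 b) {
    vec8 r;
    for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* Sums each of four vectors of partial sums using only unpacks and
   int32_t additions: no horizontal add and no narrowing to int16_t. */
static void reduce4(vec8 a, vec8 b, vec8 c, vec8 d, int32_t out[4]) {
    vec8 ab = add32(unpacklo32(a, b), unpackhi32(a, b));
    vec8 cd = add32(unpacklo32(c, d), unpackhi32(c, d));
    vec8 s  = add32(unpack64(ab, cd, 0), unpack64(ab, cd, 1));
    /* s now holds per-half sums: [a_lo b_lo c_lo d_lo | a_hi b_hi c_hi d_hi]. */
    for (int i = 0; i < 4; ++i) out[i] = s.v[i] + s.v[i + 4];
}

/* Simplified model of narrowing first: each partial sum is saturated to
   int16_t as _mm256_packs_epi32 would do, then the lanes are added. */
static int32_t total_after_narrowing(vec8 a) {
    int32_t t = 0;
    for (int i = 0; i < 8; ++i) t += sat16(a.v[i]);
    return t;
}
```

With all 32 quants of a block equal to 127, every lane holds 4 * 127 * 127 = 64516, so narrowing clamps each lane to 32767 and the block total drops from 516128 to 262136, while reduce4() keeps the exact int32_t totals.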
Diffstat (limited to 'examples/train-text-from-scratch/train-text-from-scratch.cpp')
0 files changed, 0 insertions, 0 deletions