author    Kawrakow <iwankawrakow@gmail.com>   2025-06-18 15:30:56 +0300
committer GitHub <noreply@github.com>         2025-06-18 15:30:56 +0300
commit    c410cc72bbfcbdef9ce552b425ab7abbeb250dff
tree      a89b0a94dd7cdf99aef9ee3d0f1abbd48d7a3c3e /ggml/src/iqk/iqk_quantize.cpp
parent    dc96820ddb45c639ea4e149e4bbfcb0b67fbcc2b
Much faster CPU prompt processing (part 3) (#534)
* Repack q4_0 and q8_0 to q8_0_R8
q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss in the 32-bit <-> 16-bit
scale conversions (see the sketch after this message).
* Change q8_2_x4 to store int16_t sums
With that, q4_0 now works.
I need to check all quants that use q8_2_x4!
* q5_0, and use a dequantizing template
* q6_0
129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.
* iq4_nl
137 t/s -> 293 t/s. iq4_nl_r4 is at 251 t/s.
* q4_1: 135 t/s -> 262 t/s
* q5_1: 125 t/s -> 253 t/s
* iq4_xs
178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.
* q2_K
202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
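Why the int16_t sums fix the q4_0 regression, in isolation: below is a minimal, self-contained C++ sketch. It is not the ggml code; fp32_to_bf16/bf16_to_fp32 are simplified stand-ins for GGML_FP32_TO_BF16/ggml_bf16_to_fp32 (truncation instead of round-to-nearest-even), and the values are made up. bf16 keeps only 8 significant bits, so a product d * isum squeezed through bf16 retains roughly 2-3 decimal digits; rounding d to bf16 once (it is stored that way anyway) and keeping the int16_t sum exact makes the fp32 product exact with respect to the stored values.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Simplified stand-ins for GGML_FP32_TO_BF16 / ggml_bf16_to_fp32 (truncation
// instead of round-to-nearest-even; good enough to show the effect).
static uint16_t fp32_to_bf16(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return (uint16_t)(u >> 16);
}
static float bf16_to_fp32(uint16_t bits) {
    uint32_t u = (uint32_t)bits << 16;
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

int main() {
    float   d    = 0.0123456f;  // block scale (made-up value)
    int16_t isum = 1234;        // exact sum of a block's int8 quants

    // Old scheme: the product d * isum is squeezed through bf16, keeping only
    // 8 significant bits (~2-3 decimal digits).
    float old_s = bf16_to_fp32(fp32_to_bf16(d * (float)isum));

    // New scheme: d is rounded to bf16 once (it is stored that way anyway),
    // isum stays exact as int16_t, and the product is formed in fp32. An
    // 8-bit significand times a 15-bit integer fits in fp32's 24-bit
    // mantissa, so the product is exact with respect to the stored values.
    float d_bf16 = bf16_to_fp32(fp32_to_bf16(d));
    float new_s  = d_bf16 * (float)isum;

    printf("fp32 product: %.6f  bf16 product: %.6f  bf16 scale * int16 sum: %.6f\n",
           d * (float)isum, old_s, new_s);
    return 0;
}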
Diffstat (limited to 'ggml/src/iqk/iqk_quantize.cpp')
 ggml/src/iqk/iqk_quantize.cpp | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/ggml/src/iqk/iqk_quantize.cpp b/ggml/src/iqk/iqk_quantize.cpp
index 9261d02e..abd4be61 100644
--- a/ggml/src/iqk/iqk_quantize.cpp
+++ b/ggml/src/iqk/iqk_quantize.cpp
@@ -875,14 +875,12 @@ void quantize_row_q8_1_x4_T(const float * x, Block * y, int64_t k) {
                 y[i].d = GGML_FP32_TO_FP16(d);
             }
         } else {
+            auto t = GGML_FP32_TO_BF16(d);
+            d = ggml_bf16_to_fp32(t);
             if (i < nb4) {
-                auto t = GGML_FP32_TO_BF16(d);
                 y4[i4].d[ir] = t.bits;
-                d = ggml_bf16_to_fp32(t);
             } else {
-                auto t = GGML_FP32_TO_BF16(d);
                 y[i].d = t.bits;
-                d = ggml_bf16_to_fp32(t);
             }
         }
         const float id = d > 0 ? 1/d : 0.f;
@@ -916,9 +914,11 @@ void quantize_row_q8_1_x4_T(const float * x, Block * y, int64_t k) {
             }
         } else {
             if (i < nb4) {
-                y4[i4].d[ir+4] = GGML_FP32_TO_BF16(d * isum).bits;
+                auto i16 = (int16_t *)y4[i4].d;
+                i16[ir+4] = isum;
             } else {
-                y[i].s = GGML_FP32_TO_BF16(d * isum).bits;
+                auto i16 = (int16_t *)&y[i].s;
+                i16[0] = isum;
             }
         }
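To make the second hunk concrete, here is a sketch of the resulting block layout and of how a consumer would read the sum back. block_q8_2_x4 and block_sum are hypothetical stand-ins, not the real types from iqk_quantize.cpp, and the real consumers are the iqk dot-product kernels; the point is only that the upper half of the d[] array, viewed as int16_t, now holds exact sums that the reader scales in fp32.

#include <cstdint>
#include <cstring>

// Hypothetical mirror of the changed layout, not the real iqk struct:
// d[0..3] hold bf16 scale bits for four sub-blocks; after this commit the
// same array, aliased as int16_t, holds the four raw sums in d[4..7].
struct block_q8_2_x4 {
    uint16_t d[8];            // d[0..3]: bf16 scales, d[4..7]: int16_t sums
    int8_t   qs[4 * 32];      // four sub-blocks of 32 int8 quants
};

static float bf16_bits_to_fp32(uint16_t bits) {
    uint32_t u = (uint32_t)bits << 16;
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// What a consumer now does: take the exact int16_t sum and scale it in fp32,
// instead of decoding a bf16-rounded product d * isum.
static float block_sum(const block_q8_2_x4 &y, int ir) {
    auto i16 = (const int16_t *)y.d;   // same aliasing as in the diff
    return bf16_bits_to_fp32(y.d[ir]) * (float)i16[ir + 4];
}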