ik_llama.cpp.git


author	Kawrakow <iwankawrakow@gmail.com>	2025-06-18 15:30:56 +0300
committer	GitHub <noreply@github.com>	2025-06-18 15:30:56 +0300
commit	c410cc72bbfcbdef9ce552b425ab7abbeb250dff (patch)
tree	a89b0a94dd7cdf99aef9ee3d0f1abbd48d7a3c3e /src/llama.cpp
parent	dc96820ddb45c639ea4e149e4bbfcb0b67fbcc2b (diff)

Much faster CPU prompt processing (part 3) (#534)

* Repack q4_0 and q8_0 to q8_0_R8

  q8_0 is fine, but I observe a very significant PPL increase for q4_0.
  Best guess: precision loss with the 32 bit <-> 16 bit scale conversions.

* Change q8_2_x4 to store int16_t sums

  With that q4_0 now works.
  I need to check all quants that use q8_2_x4!

* q5_0 and use a dequantizing template

* q6_0

  129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.

* iq4_nl

  137 t/s -> 293 t/s. iq4_nl is at 251 t/s.

* q4_1: 135 t/s -> 262 t/s

* q5_1: 125 t/s -> 253 t/s

* iq3_xs

  178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.

* q2_K

  202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'src/llama.cpp')

0 files changed, 0 insertions, 0 deletions

