path: root/ggml/include
Age | Commit message | Author
2025-06-08 | Fix non rpc build error (#506) | firecoperana

* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
  Add more checks which prevent RPC server from crashing if invalid input is received from client
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
  RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency.
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
  The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.
  This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:
  - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type.
  - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds.
  - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.
  - **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.
  These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
  removed comments and unnecessary returns
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
  rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr:
  - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links.
  - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
  The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
  Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
  Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
  Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
  Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
  ---------
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
  Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
  # Conflicts:
  # examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
  Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tcp no delay for rpc
* add back webui
* fix rpc function not found error
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
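The validation pattern described in #13069 - reject an invalid `ggml_type` and check offsets against buffer bounds instead of aborting - looks roughly like the sketch below. This is a simplified illustration, not the actual ggml-rpc.cpp code; the `rpc_tensor_msg` struct and function names are made up for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include "ggml.h"

// Illustrative wire-format struct; the real rpc_tensor layout differs.
struct rpc_tensor_msg {
    uint32_t type;   // ggml_type as sent by the client
    uint64_t ne[4];  // tensor dimensions
    uint64_t data;   // offset into the destination buffer
};

// Sketch: validate the client-provided type before creating the tensor.
static ggml_tensor * deserialize_tensor_sketch(ggml_context * ctx, const rpc_tensor_msg * msg) {
    if (msg->type >= GGML_TYPE_COUNT) {
        fprintf(stderr, "invalid tensor type: %u\n", msg->type);
        return nullptr;  // graceful failure instead of GGML_ABORT
    }
    return ggml_new_tensor_4d(ctx, (ggml_type) msg->type,
                              msg->ne[0], msg->ne[1], msg->ne[2], msg->ne[3]);
}

// Sketch: bounds check for a SET_TENSOR-style write into a backend buffer.
static bool set_tensor_checked_sketch(uint8_t * buf_base, size_t buf_size,
                                      uint64_t offset, const void * data, uint64_t size) {
    if (offset > buf_size || size > buf_size - offset) {  // overflow-safe comparison
        fprintf(stderr, "out-of-bounds write rejected (offset=%llu, size=%llu)\n",
                (unsigned long long) offset, (unsigned long long) size);
        return false;  // report failure to the client, keep serving
    }
    memcpy(buf_base + offset, data, size);
    return true;
}
```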
2025-06-08 | Revert "Rpc improvement (#480)" | Iwan Kawrakow
This reverts commit 8a5f8573aefc23282200041abbfa12886083334a.
2025-06-08 | Rpc improvement (#480) | firecoperana

* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
  Add more checks which prevent RPC server from crashing if invalid input is received from client
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
  RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency.
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
  The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.
  This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:
  - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type.
  - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds.
  - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.
  - **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.
  These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
  removed comments and unnecessary returns
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
  rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr:
  - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links.
  - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
  The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
  Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
  Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
  Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
  Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed.
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
  ---------
  Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
  # Conflicts:
  # ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
  Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
  # Conflicts:
  # examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
  Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tcp no delay for rpc
* add back webui
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
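The "simpler direct comparisons" used for input size validation in `graph_compute` boil down to checking the client-reported counts against the actual payload size before touching any data. A rough sketch, with an assumed (not actual) message layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed payload layout: [n_nodes][node ids ...][n_tensors][serialized tensors ...].
// The real RPC message format in ggml-rpc.cpp differs; this only shows the check.
static bool validate_graph_msg_sketch(const uint8_t * msg, size_t msg_size,
                                      size_t tensor_wire_size) {
    if (msg_size < sizeof(uint32_t)) return false;
    uint32_t n_nodes;
    memcpy(&n_nodes, msg, sizeof(n_nodes));

    // Required bytes for the node list plus the tensor count, computed in 64 bits.
    uint64_t need = sizeof(uint32_t) + (uint64_t) n_nodes * sizeof(uint64_t) + sizeof(uint32_t);
    if (need > msg_size) return false;

    uint32_t n_tensors;
    memcpy(&n_tensors, msg + sizeof(uint32_t) + (size_t) n_nodes * sizeof(uint64_t), sizeof(n_tensors));

    need += (uint64_t) n_tensors * tensor_wire_size;
    return need <= msg_size;  // reject if the reported counts do not fit the actual message
}
```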
2025-05-23 | Trellis quants with CPU inference (#441) | Andrew Chan

* WIP
* WIP
* WIP
* Testing Trellis quantization
  Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
  rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
  I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
  so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
  With blocks of 32 and 16 bits per group of 8 the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster.
* iq2_kt - this is better
  Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.
* iq2_kt - even better
  Re-quantize after determining block scales (at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
  Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
  We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
  We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance.
* iq2_kt: faster f16 CUDA dot product
  We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.
* Minor
* Adding iq3_kt
  3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
  PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
  PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
  PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
  Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt: speed up quantization
  Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
  We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406, PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
  We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642, PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
  We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642, PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
  We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297, PPL(LLaMA-2-7B, 4096) = 6.3913. Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
  We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627, PPL(LLaMA-2-7B, 4096) = 6.3825. Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
  15 bits per group of 4, plus 8 bit scales for blocks of 32. This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
  at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
  It was working for 4.125 bpw. But after changing to 4.0 bpw there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
  Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude plus 1 bpw for the sign. It gives a visible improvement in the PPL vs bpw plot, but that comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also noticed that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17 t/s eval)
* flatten 3inst iters + avx2 (0.3 t/s eval)
* iq3_kt (0.3 t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22 t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slightly faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
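The clustering trick mentioned above (copy each cluster's candidate code points into contiguous memory once, then only scan the nearest cluster) is generic enough to sketch; the distance metric, cluster count, and struct layout here are illustrative, not the actual iq2_kt/iq3_kt implementation. Searching only one cluster is approximate, which is the accepted trade-off for the large speedup.

```cpp
#include <limits>
#include <vector>

// Sketch of cluster-accelerated nearest-codepoint search over groups of 8 weights.
struct ClusteredCodebook {
    int ngroup = 8;                  // values per code point (group of 8 weights)
    std::vector<float> centroids;    // ncluster x ngroup cluster centers
    std::vector<float> points;       // all code points, stored cluster by cluster (contiguous)
    std::vector<int>   offset;       // ncluster+1 prefix offsets into `points` (in points)
    std::vector<int>   index;        // original code index of each reordered point

    static float dist2(const float * a, const float * b, int n) {
        float s = 0; for (int i = 0; i < n; ++i) { float d = a[i] - b[i]; s += d * d; } return s;
    }

    // Find the code index closest to x (ngroup floats): pick the nearest centroid first,
    // then scan only that cluster's points, which are laid out sequentially in memory.
    int nearest(const float * x) const {
        const int ncluster = (int) offset.size() - 1;
        int best_c = 0; float best_d = std::numeric_limits<float>::max();
        for (int c = 0; c < ncluster; ++c) {
            float d = dist2(x, centroids.data() + (size_t) c * ngroup, ngroup);
            if (d < best_d) { best_d = d; best_c = c; }
        }
        int best_p = offset[best_c]; best_d = std::numeric_limits<float>::max();
        for (int p = offset[best_c]; p < offset[best_c + 1]; ++p) {
            float d = dist2(x, points.data() + (size_t) p * ngroup, ngroup);
            if (d < best_d) { best_d = d; best_p = p; }
        }
        return index[best_p];        // map back to the original code index
    }
};
```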
2025-05-20 | Bug fixes from mainline (#439) | Kawrakow
* Add __syncthreads() to the new FA kernel * Clearing padding --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 | IQ5_KS_R4: row-interleaved IQ5_KS (#426) | Kawrakow
* iq5_ks_r4: basics * iq5_ks_r4: Zen4 works * iq5_ks_r4: AVX2 works * iq5_ks_r4: NEON * Fix iq5_ks on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 | Adding IQ5_KS - 5.25 bpw quants (#422) | Kawrakow

* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
  But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 | GPU offload policy (#405) | Kawrakow
* Adding GPU offload policy * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 | Add copyright notices (#317) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 | Use bf16 instead of fp16 block scales for q8_1 (#292) | Kawrakow

* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
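The switch amounts to storing the block scale (and the precomputed block sum) as bf16 rather than f16: bf16 keeps the full f32 exponent range and converts with a cheap bit shift. A minimal sketch; the block layout is illustrative, not the exact ik_llama.cpp struct.

```cpp
#include <cstdint>
#include <cstring>

// f32 -> bf16 with round-to-nearest-even on the truncated bits (NaN handling omitted).
static inline uint16_t fp32_to_bf16(float f) {
    uint32_t u; memcpy(&u, &f, sizeof(u));
    u += 0x7fff + ((u >> 16) & 1);   // round to nearest even
    return (uint16_t) (u >> 16);
}

// bf16 -> f32 is just a shift back into the high half.
static inline float bf16_to_fp32(uint16_t h) {
    uint32_t u = (uint32_t) h << 16;
    float f; memcpy(&f, &u, sizeof(f));
    return f;
}

// Illustrative q8_1-style block: 32 int8 quants plus a bf16 scale and a bf16 block sum.
struct block_q8_1_bf16 {
    uint16_t d;      // bf16 scale
    uint16_t s;      // bf16 d * sum(q[i]), used by dot products that need the block sum
    int8_t   qs[32];
};
```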
2025-03-02 | SER - Smart Expert Reduction (#239) | Kawrakow
* A better way to measure the cost of ggml_barrier * Smart expert selection * Add ser option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-23 | Fused MoE ffn_up and ffn_gate (#229) | Kawrakow

* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
  We get ~13% speedup for PP-512 and ~2% for TG-128 for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate)) in case the MUL_MAT_ID op for the down experts is the next op in the graph.
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
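The fused op evaluates act(gate_proj(x)) * up_proj(x) for each routed expert in a single pass instead of three separate graph ops. A scalar sketch of the idea (not the actual ggml kernel; SiLU is assumed as the activation):

```cpp
#include <cmath>
#include <cstddef>

static inline float silu(float x) { return x / (1.0f + expf(-x)); }

// Fused "up * unary(gate)" for one token and one selected expert:
//   y[j] = silu(dot(W_gate[j], x)) * dot(W_up[j], x)
// W_gate and W_up are [n_ff x n_embd] row-major weights of that expert.
static void fused_up_gate(const float * W_gate, const float * W_up, const float * x,
                          float * y, size_t n_ff, size_t n_embd) {
    for (size_t j = 0; j < n_ff; ++j) {
        float g = 0.0f, u = 0.0f;
        const float * wg = W_gate + j * n_embd;
        const float * wu = W_up   + j * n_embd;
        for (size_t i = 0; i < n_embd; ++i) {  // one pass over x feeds both dot products
            g += wg[i] * x[i];
            u += wu[i] * x[i];
        }
        y[j] = silu(g) * u;                    // activation applied and multiplied in place
    }
}
```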
2025-02-19 | Q8_KV: 8-bit quantization type targeting the KV cache (#208) | Kawrakow

* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
  We get 225.7 t/s for L3-8B. In comparison q8_0 without run-time repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
  We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
  This required quite a few fixes in ggml and llama.cpp:
  * ggml: do not calculate row size as n/block_size*type_size. I had removed most of it when implementing the quants with per row scale, but it was still lurking in ggml_copy. Not sure if these were the last remnants of ggml-style row sizes, or if there are still places left.
  * llama.cpp: get rid of the 1D K-cache assumption. Create and manage the K-cache as a 2D tensor so we can have per row meta data as needed by q8_KV.
  Using q8_KV for K-cache results in non-negligible performance gains. More details to follow, but for DeepSeek-Lite with MLA, we get 18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
  We get PP-512 = 167 t/s for L3-8B without interleaving! We do the interleaving on the fly, so I wonder if this could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
  On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s). This makes no sense whatsoever as the q8_KV_r8 GEMM is basically the q8_k_r8 GEMM with the unnecessary block stuff removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
  This is faster - 350 t/s. Why? Much better than the 290 t/s we had before, but still slower than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
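Unlike the block-wise q8 types, q8_KV stores plain 8-bit quants with a single scale per row, which is why the K-cache had to become a proper 2D tensor with per-row metadata. A sketch of per-row quantization; the row layout is an assumption, not the exact q8_KV format.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Quantize one row of n floats to int8 with a single per-row scale.
// Assumed row layout: [float d][int8 q[n]]; the real q8_KV layout may differ.
static void quantize_row_q8_per_row(const float * x, uint8_t * dst, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));

    const float d  = amax / 127.0f;               // per-row scale
    const float id = d > 0.0f ? 1.0f / d : 0.0f;

    memcpy(dst, &d, sizeof(float));
    int8_t * q = (int8_t *) (dst + sizeof(float));
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) roundf(x[i] * id);        // amax maps to +/-127, so no clamping needed
    }
}
```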
2025-02-09 | Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications (#194) | Kawrakow
* iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_m_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (AVX2/Zen4) * iq1_s_r4: Use Q8_K_128 instead of Q8_1_X4 for gemm (Neon) * iq1_m_r4: Use Q8_K_128 instead of Q8_0_X4 for gemm (Neon) * Simdify q8_K128 quantization also on Neon * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-08 | Revert #79 (#192) | Kawrakow
* Revert "Do not quantize activations if not necessary (#79)" This reverts commit 0bf4d99774aa3b6d00ef564acbc4dc211e45db33. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 | Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189) | Kawrakow
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving * Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving * Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 | IQ1_M_R4: better 1.75 bpw quants (#187) | Kawrakow
* iq1_m_r4: basics (quantize/dequantize) * iq1_m_r4: Zen4 gemm * iq1_m_r4: neon gemm * iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4 With the deltas being per group of 8, we cannot make use of the q8 sums stored in q8_1, so we get a tiny gain by using q8_0_x4. * iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 | IQ1_S_R4: better 1.5 bpw quants (#185) | Kawrakow

* iq1_s_r4: basics - quantize/dequantize
* iq1_s_r4: gemm/gemv works on AVX2/Zen4
* Don't forget to make sure we have a multiple of 4 rows per thread
* iq1_s_r4: this is better
* iq1_s_r4: fix Zen4 after AVX2 changes
* iq1_s_r4: NEON gemm/gemv
* iq1_s_r4: more bits for shared experts
  With this mix we arrive at PPL(512) = 9.4140 for Deepseek-Lite using 1.766 bpw for the repeating layers. On the Ryzen-7950X we get PP-512 = 494 t/s and TG-128 = 52 t/s @ 16 threads.
* Forgotten counter increment
* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv
* Compiler warnings
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-01-15 | CPU Flash Attention improvements (#172) | Kawrakow

* Slightly faster FA for bf16 KV cache
  ~2-3% sort of thing. Sadly, when we go beyond 8k tokens, the advantage kind of goes away.
* Slightly faster FA for Q8_0 KV cache
* FA: allow bf16 for V-cache with any supported K-cache
  E.g., -ctk q8_0 -ctv bf16 is slightly faster than -ctk q8_0 -ctv q8_0 on Zen4 for not too long context lengths (say, <= 4096).
* FA: much better bf16 kv-cache speed for large contexts
  We now hit 122 t/s for LLaMA-3.1-8B (quantized as iq4_xs and run-time-repacked) with a context of 32768. IIRC, the previous best for such large context was ~90 t/s. Non-negligible improvement at 16384 and 8192 as well: 173.4 and 214 t/s.
* FA: slightly better quantized kv-cache speed for large contexts
  E.g., for q8_0 and context of 32768, we are now at 113 t/s for LLaMA-3.1-8B. Also simplified the quantized K*Q multiplication.
* Fix q8_0 KV cache when not using FA - WIP (AVX2)
  1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use those to quantize activations for quants that use Q8_0 or Q8_1 as their vec_dot type.
  2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1
  3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type
  4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than GGML_TYPE_Q8_0_X4 as the K and V types
  5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4 in iqk_mul_mat
  Also added an optimization in ggml_compute_forward_mul_mat when ne12*ne13 > 1 (K*Q and V*softmax(K*Q)) to process ne12*ne13/GCD(ne12*ne13, nthread) heads using nthread/GCD(ne12*ne13, nthread) threads per head. This results in a non-negligible performance gain for large contexts. Question: why is it not allowed to use quantized V-cache when not using FA?
* Fix q8_0 KV cache when not using FA - NEON
* Fix AVX2
  Again the issue with _mm256_maddubs_epi16 overflowing that I keep forgetting.
* FA: don't use large Q steps on AVX2 for fp16 K-cache
* On Zen4 it is also better to not use large Q steps for fp16 K-cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
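The ne12*ne13 optimization splits the thread pool so that several of the per-head matrix multiplications run concurrently, each handled by a smaller group of threads. One way to do that partitioning is sketched below; the names and layout are illustrative, not the actual ggml_compute_forward_mul_mat code.

```cpp
#include <cstdint>
#include <numeric>   // std::gcd (C++17)

// Split nthread threads over nbatch = ne12*ne13 independent matrix multiplications
// (e.g. one per attention head): g batches run concurrently, each with nthread/g threads.
struct mm_split { int n_concurrent; int threads_per_batch; int batch; int tid_in_batch; };

static mm_split split_threads_over_batches(int ith, int nthread, int64_t ne12, int64_t ne13) {
    const int nbatch = (int) (ne12 * ne13);
    const int g      = std::gcd(nbatch, nthread);  // batches processed simultaneously
    mm_split s;
    s.n_concurrent      = g;
    s.threads_per_batch = nthread / g;
    s.batch             = ith / s.threads_per_batch;  // first batch this thread works on
    s.tid_in_batch      = ith % s.threads_per_batch;
    return s;
}
// Each thread then loops over batches s.batch, s.batch + g, s.batch + 2g, ... and splits
// the rows of each batch among its s.threads_per_batch peers as usual.
```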
2025-01-10 | Be able to re-quantize MS BitNet I2_S models (#169) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 | IQ3_S_R4 (#162) | Kawrakow
* iq3_s_r4: WIP * iq3_s_r4: Zen4 * iq3_s_r4: slightly better Zen4 * iq3_s_r4: AVX2 * iq3_s_r4: NEON * iq3_s_r4: rearrange quants * iq3_s_r4: rearranged quants - AVX2 * iq3_s_r4: rearranged quants - NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 | IQ2_S_R4 (#156) | Kawrakow
* iq2_s_r4: Zen4 * Minor * iq2_s_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 | IQ2_XS_R4 (#155) | Kawrakow
* iq2_xs_r4: Zen4 * iq2_xs_r4: AVX2 * iq2_xs_r4: slightly better matrix x vector on AVX2 * iq2_xs_r4: NEON - not much better than iq2_xs * iq2_xs_r4: slightly better NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 | IQ2_XXS_R4 (#154) | Kawrakow

* iq2_xxs_r4: Zen4
  Disappointing gain: 134.7 t/s -> 151.1 t/s for PP-512. TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread
* Minor
* iq2_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 | IQ3_XXS_R4 (#153) | Kawrakow

* iq3_xxs_r4: 1st shot on Zen4
  PP-512: 107 t/s -> 137 t/s
  TG-128 (1 thread): 2.64 t/s -> 3.44 t/s
* iq4_xxs_r4: WIP
* iq4_xxs_r4: 1st shot at AVX2
  Note: there is a bug in the AVX2 implementation for nrc_y = 1 for IQ quants with blocks of 32. I have fixed it for now by using the nrc_y > 1 implementation (which works) also for nrc_y = 1.
* iq3_xxs_r4: NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 | IQ4_KS_R4 (#150) | Kawrakow
* iq4_ks_r4: Zen4 * iq4_ks_r4: AVX2 * iq4_ks_r4: WIP * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: slightly better Zen4 * iq4_ks_r4: NEON * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 | IQ5_K_R4 (#149) | Kawrakow
* iq5_k_r4: Zen4 Much slower than the others. * iq5_k_r5: WIP * Minor * iq5_k_r4: fix AVX2 nrc_y = 1 case * iq5_k_r4: better Zen4 But TG is still slower than iq5_k * iq5_k_r4: slightly better AVX2 * iq5_k_r4: NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 | IQ2_K_R4 (#146) | Kawrakow
* iq2_k_r4: Zen4 * iq2_k_r4: NEON * iq2_k_r4: better matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 | IQ3_K_R4 (#145) | Kawrakow
* iq3_k_r4 WIP * iq3_k_r4: Zen4 * iq3_k_r4: AVX2 * iq3_k_r4: NEON * iq3_k_r4: faster matrix x vector multiplication on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-15 | BF16_R16 - 16 interleaved bf16 rows (#142) | Kawrakow

* Not working bf16_r4
* Adding bf16_r8
  Small performance gain compared to bf16 - 258 t/s vs 234 t/s. I guess this is still sub-optimal.
* bf16_rx: Very slightly faster by interleaving 16 rows
  258 t/s -> 263 t/s
* Rename bf16_r4 to bf16_r16
  We are interleaving 16 rows now.
* Cleanup unused stuff
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-14 | Q8_K_R8: Fastest quantized matrix multiplications (#141) | Kawrakow

* q8_k_r8: fastest matrix multiplication known to human kind
  We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!
* q8_k_r8: AVX2
  I was worried that we don't have enough vector registers on AVX2, but it looks like it handles it just fine. We get PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX. Slightly slower than the Zen4 version with double the threads, but still a huge upgrade compared to Q8_0_R4.
* q8_k_r4: NEON
  We get PP-512(LLaMA-3.1-8B) = 159.2 t/s. Compare this to the 128 t/s we have for Q8_0_R4.
* q8_k_r4: go to signed ints
  Why?
  * On AVX2 _mm256_maddubs_epi16() may overflow, so we need to stay within the signed int range and use _mm256_sign_epi8. Not yet tested on the AVX2 machine, but expect major slowdown.
  * It is almost 10% faster on ARM_NEON. Somehow the veorrq_u8() needed to convert from unsigned to signed seems to be extremely slow on the M2-Max.
  * We only lose ~0.5% in performance on Zen4 (there the exclusive or that we now use to convert from signed to unsigned seems to be much faster than on M2-Max).
* Shut up useless compiler warnings
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
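The signed-int issue: `_mm256_maddubs_epi16` multiplies unsigned by signed bytes and its pairwise 16-bit sums can saturate, so the usual idiom is to feed it |x| and a sign-adjusted y via `_mm256_sign_epi8`. A minimal sketch of that idiom (not the actual q8_k_r8 kernel):

```cpp
#include <immintrin.h>

// Dot product of 32 signed int8 pairs using the maddubs trick: maddubs multiplies
// *unsigned* a by *signed* b, so pass |x| as the unsigned operand and flip y's signs
// to match x's. For quants in [-127, 127] the products stay in int16 range; the one
// unrepresentable edge case is x == -128 && y == -128, which q8 quantization avoids.
static inline __m256i dot_q8_pairs_avx2(__m256i x, __m256i y) {
    const __m256i ax  = _mm256_sign_epi8(x, x);           // |x|
    const __m256i sy  = _mm256_sign_epi8(y, x);           // y * sign(x)
    const __m256i p16 = _mm256_maddubs_epi16(ax, sy);     // pairwise products summed to int16
    return _mm256_madd_epi16(p16, _mm256_set1_epi16(1));  // widen to int32 lanes
}
```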
2024-12-12 | IQ4_K_R4 (#138) | Kawrakow

* iq4_k_r4: WIP
* iq4_k_r4: Zen4 and hopefully AVX2
  On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s for iq4_k. Applying the extra shift costs a ~6% performance penalty.
* iq4_k_r4: AVX2
  PP-512 = 227.60 t/s. The shifts are really costly.
* iq4_k_r4: NEON
  We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 | Q2_K_R4 (#136) | Kawrakow
* q2_k_r4: Zen4 PP-512(LLaMA-3.1-8B) = 256 t/s * q3_k_r4: AVX2 * q2_k_r4: AVX2 We get PP-512(LLaMA-3.1-8B) = 287 t/s. Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow forgot to push upstream. * q2_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 106.2 t/s. TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S. * Make sure rows per thread are a multiple of 4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 | Q3_K_R4 (#134) | Kawrakow

* q3_k_r4: Zen4 works, but not as good as it should be
  238 t/s, so slightly slower than q6_k_r4.
* q3_k_r4: NEON
  We get PP-512(LLaMA-3.1-8B) = 106.9 t/s. This is 1.93X faster than q3_K_S!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 | Q5_K_R4 (#132) | Kawrakow
* q5_k_r4: WIP * q5_k_r4: Zen4 and AVX2 We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4. Q5_K_S has PP-512 = 190 t/s. * q5_k_r4: NEON We get PP-512(LLaMA-3.1-8B) = 96.1 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 | Q6_K_R4 (#130) | Kawrakow

* Adding q6_k_r4
* q6_k_r4: 1st functional AVX2 version
* q6_k_r4: AVX2 and simple Zen4
  "Simple" as in processing 4 instead of 8 rows at once. On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs 195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s vs 5.38 t/s for Q6_K.
* q6_k_r4: 1st NEON version
  PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K. TG-128 is slightly lower than q6_K for low number of threads, becomes very slightly better at 8 threads.
* q6_k_r4: slightly faster NEON
  PP-512(LLaMA-3.1-8B) = 83.25 t/s
* q6_k_r4: slightly faster Zen4
  238.3 t/s -> 243.2 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-09 | Q4_K_R4 (#129) | Kawrakow

* Something is still wrong
* Simply don't see what is wrong
* q4_k_r4: finally works on Zen4
  I had forgotten to prevent token_embd.weight being quantized with q4_k_r4!
* q4_k_r4: AVX2
  We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX. This is ~30% better than Q4_K_S.
* q4_k_r4: NEON
  We get PP-512(LLaMA-3.1-8B) = 110 t/s. Not quite as good as q4_0_r4, but still a massive improvement compared to the 69 t/s for q4_K.
* q4_k_r4: slightly better AVX2
  PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX
* Minor
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 | Faster IQ4_XS_R4 on Zen4 (#128) | Kawrakow
* Faster iq4_xs_r4 on Zen4 The trick is to simply prepare the Q8 block sums for blocks of 32 as floats. This brings PP-512 up to 254.6 t/s from 224 t/s. * Fix broken matrix x vector product on Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
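The trick is to emit, already during Q8 activation quantization, the per-32 block sums as ready-to-use floats so the GEMM kernel can fold them in without extra integer work. A small sketch of such a quantization side product; the layout and function name are illustrative, not the actual iqk code.

```cpp
#include <cmath>
#include <cstdint>

// Quantize 256 activations to int8 in blocks of 32 and also emit, per block,
// the float value d * sum(q) that the GEMM kernel can use directly.
static void quantize_q8_with_float_bsums(const float * x, int8_t * q, float * d_out,
                                         float * bsums /* 8 floats */) {
    float amax = 0.0f;
    for (int i = 0; i < 256; ++i) amax = fmaxf(amax, fabsf(x[i]));
    const float d  = amax / 127.0f;
    const float id = d > 0.0f ? 1.0f / d : 0.0f;
    *d_out = d;
    for (int ib = 0; ib < 8; ++ib) {   // 8 blocks of 32
        int sum = 0;
        for (int i = 0; i < 32; ++i) {
            int v = (int) roundf(x[32 * ib + i] * id);
            q[32 * ib + i] = (int8_t) v;
            sum += v;
        }
        bsums[ib] = d * sum;           // block sum already scaled to a float
    }
}
```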
2024-12-08 | Rename iq4_nl_x4 to iq4_nl_r4 (#126) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06 | iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124) | Kawrakow

* Adding iq2_bn_r4
  This Zen4-only implementation achieves PP-512 = 826 t/s (!!!) for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.
* Make sure rows per thread are a multiple of the number of interleaved rows
  With this I can run iq2_bn_r4 with 32 threads and this increases PP-512 to 872 t/s.
* iq2_bn_r4: 1st shot at NEON
  PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s for Bitnet-1.58b-3B). TG-128 is ~5% slower.
* iq2_bn_r4: NEON
  PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn for 1 thread, but saturates to about the same 93 t/s at 8 threads.
* iq2_bn_r4: Experimenting on NEON
  The matrix x vector multiplication is erratic. iq2_bn_r4 is faster at 1, 2, and 4 threads, but saturates to a lower t/s at 8 threads compared to iq2_bn. iq2_bn actually manages 99 t/s at 8 threads and not 93 as I wrote in the last commit. iq2_bn_r4 performance has huge fluctuations at 4 and 8 threads.
* Some cleanup
* iq2_bn_r4: AVX2
  As expected, PP is slightly slower as we just don't have enough vector registers (690 vs 710 t/s). TG is slightly faster (18.2 vs 16.7 t/s at 1 thread).
* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector
  It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.
* iq2_bn_r4: simdify q8_K16 quantization (AVX2)
  PP-512 becomes 834 t/s and TG-128 now saturates to the same performance as iq2_bn for 4 threads.
* iq2_bn_r4: simdify q8_K16 quantization (NEON)
  PP-512 is now 304.7 t/s, and TG-128 @ 8 threads very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)
* iq2_bn_r4: fix AVX2 after breaking it two commits ago
* iq2_bn_r4: better AVX2
  As we don't have enough vector registers on AVX2, it is better to do two passes per row, needing only half of the accumulator registers that way. With this, we now beat iq2_bn PP also on AVX2 by a small margin.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04 | IQ4_XS_R4 (#123) | Kawrakow
* Adding iq4_xs_r4 This is a 1st working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4. * iq4_xs_r4: WIP * iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4 * iq4_xs_r4: NEON We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs! * DRY --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 | Q5_0_R4 (#121) | Kawrakow

* Adding q5_0_r4
  We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X. We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.
* q5_0_r4: NEON
  We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max, up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp is no longer funny: they get 26.5 t/s for Q5_0. For TG, we are not able to fully saturate memory bandwidth and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets 20.6 t/s for Q5_0.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 | Q8_0_R4 (#120) | Kawrakow
* Adding q8_0_r4 We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared to 175.6 t/s for Q8_0. * q8_0_r4: NEON We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max. * q8_0_r4: Zen4 matrix-vector specialization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 | Q4_0_R4 (#119) | Kawrakow
* Adding iq4_0_r4 - q4_0 repacked We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU, so ~5-6% faster than iq4_nl_x4. * q4_0_r4: NEON Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 | IQ4_NL_X4 (#118) | Kawrakow

* Adding iq4_nl_x4
  Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s on the Ryzen-7950X! This is faster than any other quant and ~40% faster than iq4_nl.
* iq4_nl_x4: getting amazing
  This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s!
* iq4_nl_x4: AVX2
  Here we gain only 25% compared to iq4_nl
* iq4_nl_x4: NEON
  On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from 82.4 t/s for iq4_nl.
* iq4_nl_x4: minor NEON improvement and cleanup
  This gets us to 110.3 t/s. In comparison, IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s.
* iq4_nl_x4: NEON specialization for matrix x vector
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
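All the *_x4 / *_r4 types follow the same repacking idea: interleave 4 consecutive rows block by block so the kernel can load one block from each of the 4 rows with a single contiguous read and keep 4 accumulators live. A generic sketch follows; the block type is a stand-in, and the real repacking also rearranges quants inside each block, which is omitted here.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in block: 32 weights stored as 16 bytes of packed 4-bit quants + fp16 scale.
struct block_x {
    uint16_t d;
    uint8_t  qs[16];
};

// Repack a [nrow x nblock] row-major array of blocks so that every group of 4 rows
// is interleaved: the output layout is [row_group][block][row_in_group].
static void repack_r4(const block_x * in, block_x * out, size_t nrow, size_t nblock) {
    for (size_t r4 = 0; r4 < nrow / 4; ++r4) {
        for (size_t b = 0; b < nblock; ++b) {
            for (size_t k = 0; k < 4; ++k) {
                out[(r4 * nblock + b) * 4 + k] = in[(4 * r4 + k) * nblock + b];
            }
        }
    }
}
// A matrix x vector kernel can now stream block b of 4 rows from one contiguous chunk,
// reusing the loaded activation block 4 times while keeping 4 accumulators in registers.
```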
2024-10-31 | Faster MoE inference (#112) | Kawrakow

* multi_add: WIP
* multi_add: CPU works
* multi_add: CUDA
* multi_add: simplify
* multi_add: Metal
* Metal: speed up mul_mat_id
  For the Granite-1B MoE model PP-512 goes from 156 t/s to 890 t/s, so nearly a 6X speedup!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 | Remove forgotten IQ1_TN, IQ2_TN enum values | Iwan Kawrakow
2024-10-25 | Bitnet changes (#106) | Kawrakow

* Adapting iq2_bn to work without separate scale tensors
  Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales).
* Adapting iq1_bn to work without separate scale tensors
* Adapting iq2_bn: CUDA dequantize
* Adapting iq2_bn: CUDA works
* Adapting iq1_bn: CUDA works
* Adapting iq1_bn, iq2_bn: NEON
* Adapting iq1_bn, iq2_bn: Metal
  Dequantize works, but there is still something wrong with the dot products.
* WIP
  Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels.
* Remove iq1_tn and iq2_tn - Part 1
  Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn.
* Remove iq1_tn and iq2_tn - Part 2
* Bitnet: use the standard llm_build_kv to build self attention
  My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail).
* Revert "Avoid rebuild of GGML graph for each token (#98)"
  This reverts commit f2d315b46f7aacc7df4b86bd8acba387b30e11ca. As far as I can tell, the commit breaks Metal TG.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-20 | Avoid rebuild of GGML graph for each token (#98) | agray3
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
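A rough sketch of the caching pattern: build the graph once, and on later tokens only refresh the parameters that change (the KV-cache view offsets), unless caching is disabled via the environment variable. Apart from GGML_DISABLE_GRAPH_CACHING, all names here are illustrative rather than the actual llama.cpp code.

```cpp
#include <cstdlib>
#include "ggml.h"

// Illustrative cache; the real implementation lives in llama.cpp's graph build path.
struct graph_cache {
    ggml_cgraph * graph    = nullptr;
    int           n_tokens = -1;     // topology changes when the batch shape changes
};

static ggml_cgraph * get_graph(graph_cache & cache, ggml_context * ctx,
                               int n_tokens, int kv_head,
                               ggml_cgraph * (*build)(ggml_context *, int),
                               void (*update_kv_params)(ggml_cgraph *, int)) {
    const bool disabled = getenv("GGML_DISABLE_GRAPH_CACHING") != nullptr;
    if (disabled || cache.graph == nullptr || cache.n_tokens != n_tokens) {
        cache.graph    = build(ctx, n_tokens);   // full rebuild
        cache.n_tokens = n_tokens;
    } else {
        update_kv_params(cache.graph, kv_head);  // patch per-token KV offsets in place
    }
    return cache.graph;
}
```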
2024-10-16 | Adding IQ4_KSS: 4.0 bpw quants (#89) | Kawrakow

* iq4_kss: WIP
* iq4_kss: CUDA dequantize works
  So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot.
* iq4_kss: slightly better quantization
* iq4_kss: another small quantization improvement
* iq4_kss: CUDA works
  TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss.
* iq4_kss: new bit arrangement - CUDA and Zen4 work
  Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads.
* iq4_kss: ARM_NEON. Predictably very slow
* iq4_kss: Metal
  PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad.
* iq4_kss: somewhat faster Metal dot product
  45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0
* iq4_kss: AVX2
  Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads.
* iq4_kss: very slightly faster Metal dot product
  48.7 t/s -> 49.3 t/s
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>