Age | Commit message | Author
2025-06-11 | IQ2_XXS: much faster CPU prompt processing (#515) | Kawrakow
* Much faster iq2_xxs GEMM: PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main. * iq2_xxs: q8_2_x4 GEMM * iq2_xxs: use template for q8_2_x4 GEMM * Fix AVX2 * Cleanup * NEON is not working yet, so still use Q8_K GEMM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-10 | Fix Compile error (C2668) (#508) | Gaolingx
* cmake: force MSVC compiler charset to utf-8 * build: apply MSVC /bigobj option to c/cpp files only * Update CMakeLists.txt * Fix Compile error (C2668) * revert hsum_float_8x8
2025-06-09 | Docs update (#509) | saood06
* use npm as deps manager and vite as bundler * update XTC docs --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-06-08 | Fix non-RPC build error (#506) | firecoperana
* Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - **Error Propagation:** - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. 
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. * fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tcp no delay for rpc * add back webui * fix rpc function not found error --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>
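The validation pattern described in the commit above (reject out-of-range tensor types before allocating anything, and bounds-check data/offset against the destination buffer instead of aborting) can be illustrated with a small standalone C++ sketch. The struct fields, the helper name validate_set_tensor, and the MY_TYPE_COUNT constant below are illustrative stand-ins, not the actual ggml-rpc definitions.

    #include <cstdint>
    #include <cstdio>

    constexpr uint32_t MY_TYPE_COUNT = 40;   // placeholder; the real code checks against GGML_TYPE_COUNT

    struct rpc_tensor_msg {                  // hypothetical wire-format fields
        uint32_t type;
        uint64_t offset;
        uint64_t size;
    };

    struct buffer_view {                     // hypothetical destination buffer
        uint8_t *base;
        uint64_t capacity;
    };

    // Return false instead of aborting, mirroring the "graceful error handling" above.
    static bool validate_set_tensor(const rpc_tensor_msg &msg, const buffer_view &buf) {
        // Type validation: reject before any tensor is created.
        if (msg.type >= MY_TYPE_COUNT) {
            fprintf(stderr, "rpc: invalid tensor type %u\n", msg.type);
            return false;
        }
        // Bounds check with overflow-safe arithmetic: offset + size must fit in the buffer.
        if (msg.offset > buf.capacity || msg.size > buf.capacity - msg.offset) {
            fprintf(stderr, "rpc: write of %llu bytes at offset %llu exceeds buffer of %llu bytes\n",
                    (unsigned long long)msg.size, (unsigned long long)msg.offset,
                    (unsigned long long)buf.capacity);
            return false;
        }
        return true;
    }

The second check deliberately avoids computing offset + size directly, so a malicious client cannot rely on 64-bit wrap-around to slip past the comparison.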
2025-06-08 | Revert "Rpc improvement (#480)" | Iwan Kawrakow
This reverts commit 8a5f8573aefc23282200041abbfa12886083334a.
2025-06-08 | Rpc improvement (#480) | firecoperana
* Add RPC backend in device list to override tensors. * rpc : prevent crashes on invalid input (#9040) Add more checks which prevent RPC server from crashing if invalid input is received from client # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : print error message when failed to connect endpoint (#9042) * Fix RPC error * Add vulkan, sycl to rpc backend * add thread in rpc cpu backend * add cache folder and other improvement in rpc * add header file * support for models with non-512 aligned tensors * rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency. # Conflicts: # ggml/src/ggml-rpc.cpp * fix(rpc): Improve input validation and error handling (#13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - **Error Propagation:** - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. 
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> # Conflicts: # ggml/src/ggml-rpc.cpp * rpc : fix cache directory initialization (#13188) Signed-off-by: xiaofei <hbuxiaofei@gmail.com> # Conflicts: # examples/rpc/rpc-server.cpp * rpc : avoid uninitialized memory in serialize_tensor (#13210) Zero out the name and padding buffers. * fix merge error * Add hello command in RPC * bug fix * add rpc header * fix bug for missing rpc names * add tcp no delay for rpc * add back webui --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> Signed-off-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: matt23456 <matt23456> Co-authored-by: Ville Vesilehto <ville@vesilehto.fi> Co-authored-by: xiaofei <hbuxiaofei@gmail.com> Co-authored-by: Justin Santa Barbara <justinsb@google.com>
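The "add tcp no delay for rpc" item above refers to disabling Nagle's algorithm on the RPC sockets, so the many small request/response messages exchanged per token are sent immediately rather than buffered for coalescing. A minimal POSIX sketch of setting TCP_NODELAY follows; the actual ggml-rpc code paths and error handling may differ, and on Windows the option value is passed as a char pointer.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Disable Nagle's algorithm on an already-connected socket so small RPC
    // messages are flushed right away (sketch only).
    static bool set_tcp_no_delay(int sockfd) {
        int flag = 1;
        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
    }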
2025-06-08 | Update AUTHORS | Kawrakow
2025-06-08 | Webui improvement (#481) | firecoperana
* update webui * add token/s in webui * add webui files * fix webui first message disappearing in some browsers * add missing html files --------- Co-authored-by: firecoperana <firecoperana>
2025-06-07 | Add an endpoint that lists all the saved prompt caches to server (#502) | saood06
2025-06-07 | Fix #499 (#501) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-06 | Make prompt cache saving and restoring MLA aware (#497) | saood06
* Remove kv_l, kvt_l and just use k_l and v_l * Hopefully take care of missing V cache (MLA) * Fix save and restore when there is no V cache * Fix double print * Update write_kv_cache_data and read_kv_cache_data to be MLA aware --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 | IQ1_M_R4 CUDA implementation (#494) | Kawrakow
* iq1_m_r4: CUDA dequantize * iq1_m_r4: CUDA dequantize --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 | MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493) | Kawrakow
* MMQ for iq4_ks_r4 * MMQ for iq5_ks_r4 * Add forgotten file * Another forgotten file --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 | Faster CPU prompt processing for Trellis quants and MoE models (#488) | Kawrakow
* Also do the dequantize approach for mul_mat_id * Also do the dequantize approach for iqk_moe_fused_up_gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 | CUDA implementation for IQ1_S_R4 (#492) | Kawrakow
* iq1_s_r4: CUDA dequantize * iq1_s_r4: CUDA GEMV * iq1_s_r4: MMQ on CUDA Requires Turing or better (will fall back to dequantize+cuBLAS on older cards). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 | Adding top-n-sigma sampler (#489) | Kawrakow
* Adding top-n-sigma sampler * Fix typos in XTC PR * Update README.md for main and server * More README * More README --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
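Top-n-sigma, as described in the top-nσ sampling paper, keeps only the tokens whose logit lies within n standard deviations of the maximum logit. The sketch below shows that filtering rule in isolation; it is a hedged illustration and does not use this repository's actual sampler structures or parameter names.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Keep indices of candidates whose logit >= max_logit - n * stddev(logits).
    static std::vector<int> top_n_sigma_filter(const std::vector<float> &logits, float n) {
        const float max_logit = *std::max_element(logits.begin(), logits.end());
        float mean = 0.0f;
        for (float l : logits) mean += l;
        mean /= logits.size();
        float var = 0.0f;
        for (float l : logits) var += (l - mean) * (l - mean);
        const float sigma = std::sqrt(var / logits.size());
        std::vector<int> kept;
        for (int i = 0; i < (int)logits.size(); ++i) {
            if (logits[i] >= max_logit - n * sigma) kept.push_back(i);
        }
        return kept;
    }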
2025-06-03 | Adding the XTC sampler (#486) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
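XTC ("exclude top choices") works in the opposite direction of most truncation samplers: with probability xtc_probability it removes every token whose probability is at or above xtc_threshold except the least likely of them, steering generation away from the most predictable continuations. The sketch below follows the common XTC description; the parameter names and candidate struct are assumptions, not necessarily what this repository's sampler uses.

    #include <random>
    #include <vector>

    struct candidate { int id; float p; };  // token id and its probability (illustrative)

    // Apply the XTC rule to candidates sorted by descending probability (sketch only).
    static void apply_xtc(std::vector<candidate> &cands, float xtc_probability,
                          float xtc_threshold, std::mt19937 &rng) {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        if (dist(rng) >= xtc_probability) return;   // sampler not triggered this time
        int last_above = -1;                        // least probable token above the threshold
        for (int i = 0; i < (int)cands.size(); ++i) {
            if (cands[i].p >= xtc_threshold) last_above = i;
        }
        if (last_above <= 0) return;                // fewer than two tokens above the threshold
        // Drop everything above the threshold except that last (least probable) token.
        cands.erase(cands.begin(), cands.begin() + last_above);
    }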
2025-06-03 | convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483) | Nexes the Elder
* Direct conversion from fp16 to Q6_0 * forgotten comma * More precise info
2025-06-01 | Minor (~2%) iq2_ks TG performance improvement on CUDA (#468) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-01 | Trellis quants: faster CPU prompt processing (#482) | Kawrakow
* Experimenting with dequant + f32 GEMM For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s. * Experimenting with dequant + f32 GEMM iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s * Experimenting with dequant + f16 GEMM on NEON iq2_kt: PP512 = 79 t/s from 42 t/s iq3_kt: PP512 = 81 t/s from 35 t/s Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after multiplying with the row scale before doing the multiply-adds. * Experimenting with dequant + f16 GEMM on NEON iq4_kt: PP512 = 86 t/s from 29 t/s * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
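The speedup above comes from a change of strategy for prompt processing: instead of multiplying directly against the trellis-encoded weights, each weight row is decoded to floats once and the product is then computed with the ordinary float GEMM path. The sketch below shows the shape of that approach with hypothetical helper names; the real code lives in the iqk GEMM machinery and is heavily vectorized.

    #include <cstddef>
    #include <vector>

    // Stand-in signature for a per-row trellis decoder (the real ones are the
    // iq*_kt dequantize kernels; this signature is an assumption for the sketch).
    using dequant_row_fn = void (*)(const void *quantized_row, float *dst, int n_cols);

    // Dequantize-then-GEMM sketch: decode each weight row to f32 once and reuse it
    // for every token in the batch, amortizing the decode cost over the whole prompt.
    static void gemm_via_dequant(dequant_row_fn dequantize_row,
                                 const char *W_q, size_t row_stride_bytes,
                                 const float *activations,   // n_tokens x n_cols
                                 float *out,                 // n_tokens x n_rows
                                 int n_rows, int n_cols, int n_tokens) {
        std::vector<float> row(n_cols);
        for (int r = 0; r < n_rows; ++r) {
            dequantize_row(W_q + (size_t)r * row_stride_bytes, row.data(), n_cols);
            for (int t = 0; t < n_tokens; ++t) {
                float sum = 0.0f;
                for (int c = 0; c < n_cols; ++c) {
                    sum += row[c] * activations[(size_t)t * n_cols + c];
                }
                out[(size_t)t * n_rows + r] = sum;
            }
        }
    }

Decoding each row once and reusing it for all tokens in the batch is why this pays off for prompt processing (large batches) rather than single-token generation. The f16 overflow note is consistent with half precision topping out at 65504: folding the per-row scale into the decoded values before the multiply-adds, as the commit describes, keeps the partial sums in range.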
2025-06-01 | Metal implementation for the trellis quants (#475) | Kawrakow
* iq2_kt: Metal dequantize * iq2_kt: Metal GEMV Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B * iq3_kt: Metal dequantize * iq3_kt: Metal GEMV Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B. Flipping signs is a costly affair. * iq4_kt: Metal dequantize - getting NaNs * iq4_kt: Metal GEMV - also not working * iq4_kt: Metal still not working * Disable iq4_kt on Metal for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-31 | forgotten refs and typo (#478) | Nexes the Elder
2025-05-30 | Replace MLA-specific KV cache with the standard KV cache (#469) | Kawrakow
* Remove kv_l, kvt_l and just use k_l and v_l * Hopefully take care of missing V cache (MLA) * Replace MLA-specific KV cache with the standard KV cache V2 (#473) * Fix save and restore when there is no V cache * Fix double print --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: saood06 <saood05@gmail.com>
2025-05-29 | NEON implementation for trellis quants (#471) | Kawrakow
* iq2_kt: NEON implementation * iq3_kt: NEON implementation * iq4_kt: not working NEON implementation * iq4_kt: NEON implementation Have to use f32 arithmetic, else I get gibberish? Correspondingly ridiculously slow. * Cleanup * iq4_kt: slightly faster TG on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-28 | set cache_prompt default to true (#465) | saood06
2025-05-27 | CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462) | Kawrakow
* CUDA: iq4_ks_r4 GEMV and GEMM * CUDA: iq5_ks_r4 GEMV and GEMM --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-26 | CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461) | Kawrakow
* CUDA: iq4_k_r4 dequantize * CUDA: iq4_k_r4 GEMV ~10% slower than iq4_k. * CUDA: slightly faster iq4_k_r4 GEMV * CUDA: slightly faster iq4_k_r4 GEMV We are now within 3% of iq4_k * CUDA: iq5_k_r4 dequantize * CUDA: iq5_k_r4 GEMV ~3% slower than iq5_k. * CUDA: iq3_k_r4 dequantize * CUDA: iq3_k_r4 GEMV * CUDA: slightly faster iq3_k_r4 GEMV * CUDA: iq2_k_r4 GEMV * CUDA: faster iq2_k_r4 GEMV --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-25 | Add missing gguf-py constants (#458) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-24 | Legacy quants conversion schemes in convert_hf_to_gguf.py (#449) | Nexes the Elder
* Legacy quants conversion schemes in convert_hf_to_gguf.py This is notably useful for making smaller conversions to generate an iMatrix file. `Q4_0`/`Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors. `Q5_0`/`Q5_1` here use q8_0 for the embeddings, output, attn_k and attn_v tensors. Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022 Original author @chentyjpm Also, 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file. * forgotten IQ5_KS case mention
2025-05-24 | Faster IQ3_KT and IQ4_KT (#453) | Kawrakow
* Somewhat faster iq3_kt (AVX2) * Cleanup * Slightly faster iq4_kt * Slightly faster iq4_kt PP is now almost 50% better than original, TG is ~20% better * Cleanup * Very slightly faster iq4_kt TG --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 | Fix bug in MMVQ kernel (#446) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 | Fix MSVC compilation (#448) | Kawrakow
* Fix MSVC compilation * MSVC cannot capture constexpr in lambdas * Arghhh --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 | Fix typo in non-AVX2 code branch (#445) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 | Trellis quants with CPU inference (#441) | Andrew Chan
* WIP * WIP * WIP * Testing Trellis quantization Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable. * Testing Trellis quantization: 4-bit quantized block scales rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw. * Testing Trellis quantization: playing with scales and generators * iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B). * iq2_kt: CUDA dequantize so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs. * WIP * WIP * WIP - try larger blocks With blocks of 32 and 16 bits per group of 8 the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster. * iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower. * iq2_kt - even better Re-quantize after determining block scales (at the expense of much longer quantization time). * iq2_kt: CUDA dot product Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S when forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU. * iq2_kt: very slightly faster CUDA dot product * iq2_kt: f16 CUDA dot product We arrive at 112 t/s. * iq2_kt: faster f16 CUDA dot product We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance. * iq2_kt: faster f16 CUDA dot product We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16. * Minor * Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot. * Forgotten change * WIP * WIP * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants. * WIP * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892 * iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v. * iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds! * iq3_kt: CUDA dot product * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179 * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920 * Adding iq4_kt - not competitive at this point * WIP * WIP * iq4_kt: CUDA dot product * iq4_kt: minor tweaks * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920 * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 PPL(LLaMA-2-7B, 4096) = 6.3913 Ah, quantization is faster too. About 20% faster. * iq3_kt: small improvements and faster quantization * iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627 PPL(LLaMA-2-7B, 4096) = 6.3825 Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX. * iq3_kt: small progress * WIP * iq4_kt: go to 4.0 bpw 15 bits per group of 4, plus 8 bit scales for blocks of 32. This gives a slightly better PPL than iq4_kss. * iq4_kt: very slightly better at the expense of much longer quantization time. * iq4_kt: failed attempt to adjust CUDA dot product It was working for 4.125 bpw. But after changing to 4.0 bpw there is something wrong and I don't see the bug. * DRY * DRY * iq4_kt: CUDA dot product works * DRY * Report actual bpw * Minor tweaks * Checkpoint Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude plus 1 bpw for the sign. It gives a visible improvement in the PPL vs bpw plot, but that comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also noticed that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so leaving it for later. * WIP for IQ2_KT * WIP - working basic iq2_kt * still super slow (0.17t/s eval) * flatten 3inst iters + avx2 (0.3t/s eval) * iq3_kt (0.3t/s eval) and renames * wip buggy iq4_KT * fix (0.22t/s eval) * naming and remove unused fn * cleanup * more cleanup * delete unused and noncompiling mmvq functions * Some performance tweaks * Slightly faster iq2_kt * port Trellis struct to iq3_kt, iq4_kt * oops untracked files --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
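The quantization-speed trick described in the two commits above (group the codebook points into clusters, locate the nearest cluster first, then scan only that cluster's points, which were copied to contiguous memory at setup time) can be sketched as follows. The data layout and names are illustrative, not the actual iq*_kt quantization code.

    #include <cfloat>
    #include <cstddef>
    #include <vector>

    // Illustrative cluster-pruned nearest-point search.
    struct clustered_codebook {
        int dim = 0;
        std::vector<float> centroids;       // n_clusters x dim
        std::vector<float> points;          // all codebook points, stored cluster after cluster
        std::vector<size_t> cluster_begin;  // prefix offsets into `points`, size n_clusters + 1
    };

    static size_t nearest_in_cluster(const clustered_codebook &cb, const float *x) {
        const int d = cb.dim;
        const size_t n_clusters = cb.cluster_begin.size() - 1;
        // 1) Find the nearest cluster centroid.
        size_t best_c = 0; float best_cd = FLT_MAX;
        for (size_t c = 0; c < n_clusters; ++c) {
            float dist = 0.0f;
            for (int i = 0; i < d; ++i) {
                const float diff = x[i] - cb.centroids[c * d + i];
                dist += diff * diff;
            }
            if (dist < best_cd) { best_cd = dist; best_c = c; }
        }
        // 2) Scan only that cluster's points; they are contiguous, so access is sequential.
        size_t best_p = cb.cluster_begin[best_c]; float best_pd = FLT_MAX;
        for (size_t p = cb.cluster_begin[best_c]; p < cb.cluster_begin[best_c + 1]; ++p) {
            float dist = 0.0f;
            for (int i = 0; i < d; ++i) {
                const float diff = x[i] - cb.points[p * d + i];
                dist += diff * diff;
            }
            if (dist < best_pd) { best_pd = dist; best_p = p; }
        }
        return best_p;   // index of the chosen codebook point
    }

Restricting the scan to a single cluster trades a little accuracy (the true nearest point can sit in a neighbouring cluster) for the large quantization-time reductions quoted in the log.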
2025-05-23 | gguf-split : update (#444) | Nexes the Elder
gguf-split : improve --split and --merge logic (#9619) * make sure params --split and --merge are not specified at the same time * update gguf-split params parse logic * Update examples/gguf-split/gguf-split.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: slaren <slarengh@gmail.com> --------- gguf-split : add basic checks (#9499) * gguf-split : do not overwrite existing files when merging * gguf-split : error when too many arguments are passed Authored-by: slaren <slarengh@gmail.com>
2025-05-22 | Streamline a bit the quant strategies (#443) | Nexes the Elder
* Streamline a bit the quant strategies No change over the existing patterns, except for the bump for attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA). The rest is applying the existing patterns to the new IQ_K quants. Also, a Q8_0 for attn_q had slipped into the 8-expert MoE rule; I removed it because that tensor is much bigger than attn_k or attn_v. * remove <=8 experts condition.
2025-05-22 | Refactor iqk_mul_mat.cpp (#435) | Kawrakow
* Refactor iqk: WIP * Refactor iqk: Factor out float GEMM (AVX2/AVX512) * Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512) * Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4 * Refactor iqk: Factor out GEMM for repacked legacy quants * Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV * Refactor iqk: Factor out GEMM for repacked i-quants * Refactor iqk: GEMM kernels are refactored on AVX2/AVX512 * Refactor iqk: factor out 1-bit quants (NEON) * Refactor iqk: factor out k-quants (NEON) * Refactor iqk: factor out floats (NEON) * Also iq4_xs belongs to k-quants * Refactor iqk: factor out iqk quants (NEON) * Refactor iqk: factor out legacy quants (NEON) * Refactor iqk: factor out repacked legacy quants (NEON) * Refactor iqk: factor out repacked k-quants (NEON) * Refactor iqk: factor out repacked iqk quants (NEON) * Refactor iqk: GEMM kernels are refactored on NEON * Refactor iqk: FA compiles Whether it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X * Refactor iqk: FA refactored (Zen4) Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before. * Adding forgotten file * Most helpers don't need to be templates Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX * Fix bf16 * Refactor iqk: FA refactored (NEON) * Forgotten MMQ ref and typo (#431) * Adding forgotten iq5_k_r4 * Fix iq4_k_r4 on NEON * Fix iq4_ks on NEON It was broken before the refactoring (the shifts were not correctly applied). * Fix q8_0 on NEON * Fix q6_0 K cache --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
2025-05-20 | Bug fixes from mainline (#439) | Kawrakow
* Add __syncthreads() to the new FA kernel * Clearing padding --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-18 | Forgotten MMQ ref and typo (#431) | Nexes the Elder
2025-05-17 | Option to enable/disable the IQK CPU FA kernels (#429) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 | Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428) | Kawrakow
* Zen4: faster PP for iq4_ks and iq5_ks * Zen4: faster PP for iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 | IQ5_KS_R4: row-interleaved IQ5_KS (#426) | Kawrakow
* iq5_ks_r4: basics * iq5_ks_r4: Zen4 works * iq5_ks_r4: AVX2 works * iq5_ks_r4: NEON * Fix iq5_ks on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 | Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427) | Kawrakow
* Fix IQ4_K on AVX2 * Fix IQ4_KS on AVX2 * Fix IQ5_K on AVX2 * Fix IQ6_K on AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 | Adding forgotten template instance for iq5_ks (#424) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 | Adding IQ5_KS - 5.25 bpw quants (#422) | Kawrakow
* iq5_ks: basics * iq5_ks: quantize * iq5_ks: CUDA dequantize works * iq5_ks: dot product works on CUDA * iq5_ks: MMQ works * iq5_ks: Zen4 * iq5_ks: AVX2 But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2. * iq5_ks: NEON * iq5_ks: Metal dequantize * iq5_ks: Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 | Fix standard attention on the CPU (#421) | Kawrakow
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 | CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418) | Kawrakow
* MMQ for iq2_k * This works * MMQ for iq3_k * MMQ for iq2_ks * Fix iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 | CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417) | Kawrakow
* MMQ for iq4_k: WIP (not working) * MMQ for iq4_k: working now * MMQ for iq5_k * Cleanup * MMQ for iq5_k: slightly faster * MMQ for iq6_k --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 | Fix SER (CUDA) (#416) | Kawrakow
* Fixing SER bugs * Cleanup * This seems to fix it. * This seems to work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 | Fix SER (CPU) (#415) | Kawrakow
* Fixing SER bugs * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>