Age | Commit message | Author |
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q6_K dequantizing GEMM
* Much easier: just use different vec_dot types!
* WIP
* Finally q6_K x q8_2_x4 dot product works
* Very slightly better
* We don't need the changes in ggml.c
* Fix AVX2
* iq2_xs
* Fix AVX2
* iq2_s
* q3_K
* Fix q8_k_r8 on Zen4
* q3_K: repack to q8_k_r8 instead of q8_0_r8
With that we hit 360 t/s for LLaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.
* q3_K: don't scale when all quants in a block are <= 127 when repacking (see the sketch after this commit message)
* iq2_s: repack to q8_k_r8 instead of q8_0_r8
* iq2_xs: repack to q8_k_r8
* WIP
* iq2_xs: repack to q8_k_r8
* iq3_xxs: repack to q8_k_r8
* iq3_s: use q8_k_r8
* iq1_s: repack to q8_k_r8
* iq1_m: repack to q8_k_r8
* iq1_m: slightly faster
* Slightly faster
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
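A minimal sketch of the repacking tweak above (hypothetical helper, not the repo's actual q8_k_r8 code): if every quant in a block already fits into [-127, 127], it can be stored as int8 unchanged, skipping the rescale and the rounding error it brings. (The ~7% figure above is simply 386/360 ≈ 1.07.)
```cpp
// Hedged sketch, not the actual repacking code in this repo.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: repack one block of integer quants into int8.
static void repack_block_to_int8(const std::vector<int> & q, float d_in,
                                 int8_t * out, float & d_out) {
    int amax = 0;
    for (int v : q) amax = std::max(amax, std::abs(v));
    if (amax <= 127) {
        // All quants already fit into int8: store them unchanged and keep
        // the original block scale, avoiding an extra quantization step.
        d_out = d_in;
        for (size_t i = 0; i < q.size(); ++i) out[i] = (int8_t) q[i];
    } else {
        // Otherwise rescale so the largest magnitude maps to 127.
        const float scale = 127.f / amax;
        d_out = d_in / scale;
        for (size_t i = 0; i < q.size(); ++i) out[i] = (int8_t) std::lround(scale * q[i]);
    }
}
```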
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* q4_K: dequantize to q8_1_r8 for batch >= 32
We get 268 t/s, up from 186 t/s (a dispatch sketch follows this commit message).
* q4_K: GEMM with q8_2_X4
* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8
* Remove the scales, they are not needed
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
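A tiny control-flow sketch of the batch-size dispatch described above, with hypothetical function names (the actual iqk GEMM code is organized differently): below the threshold the repacking cost is not amortized, so the direct vec_dot path is used.
```cpp
// Hedged sketch with hypothetical function names; the threshold is taken
// from the commit message above (batch >= 32).
constexpr int kRepackBatchThreshold = 32;

void mul_mat_q4_K(int n_rows, int n_cols, int batch) {
    if (batch >= kRepackBatchThreshold) {
        // Repack the q4_K rows to a q8_1_r8-style layout once, then run the
        // int8 GEMM over all batch columns (repacking cost amortized).
        // repack_q4_K_to_q8_1_r8(...);  gemm_q8_1_r8(...);
    } else {
        // Small batch: repacking would dominate, use the direct vec_dot path.
        // gemm_q4_K_direct(...);
    }
    (void) n_rows; (void) n_cols; (void) batch;
}
```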
|
|
* Convert existing News to new format
* Update with new ones
* Add more links and minor fix
* more minor fixes
* requested changes
* Add old PRs
* Add more old PRs
* Add all IQK quants
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: firecoperana <firecoperana>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
TG is slightly faster too - 24.4 vs 23.1 t/s on the
Ryzen-5975WX
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Much faster iq2_xxs GEMM
PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.
* iq2_xxs: q8_2_x4 GEMM
* iq2_xxs: use template for q8_2_x4 GEMM (sketched after this commit message)
* Fix AVX2
* Cleanup
* NEON is not working yet, so still use Q8_K GEMM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
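A hedged sketch of what "use template for q8_2_x4 GEMM" might look like in spirit (types and names are hypothetical): the GEMM skeleton is templated on the dot-product functor, so switching the activation type only swaps the vec_dot implementation.
```cpp
// Hedged sketch; the VecDot functor and data layouts are hypothetical.
#include <cstddef>

template <typename VecDot>
static void gemm(const void * x_rows, const void * y_cols, float * dst,
                 size_t nrows, size_t ncols, VecDot dot) {
    for (size_t i = 0; i < nrows; ++i)
        for (size_t j = 0; j < ncols; ++j)
            dst[i * ncols + j] = dot(x_rows, i, y_cols, j);  // one dot product per pair
}

// Usage (hypothetical functor): gemm(X, Y, dst, nr, nc, VecDot_iq2xxs_q8_2_x4{});
```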
|
|
* cmake: force MSVC compiler charset to utf-8
* build: apply MSVC /bigobj option to c/cpp files only
* Update CMakeLists.txt
* Fix Compile error (C2668)
* revert hsum_float_8x8
|
|
* use npm as deps manager and vite as bundler
* update XTC docs
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
|
|
* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
# Conflicts:
# ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully (a condensed validation sketch follows this commit message).
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
# examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tcp no delay for rpc
* add back webui
* fix rpc function not found error
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
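A condensed sketch of the validation pattern described in the fix(rpc) items above, using simplified, hypothetical wire structures (the real `ggml-rpc.cpp` differs): validate the type before calling `ggml_new_tensor_4d`, and write bounds checks so the arithmetic itself cannot overflow.
```cpp
// Hedged sketch of the validation pattern; rpc_tensor_msg is a simplified,
// hypothetical stand-in for the real ggml-rpc wire structures.
#include <cstdint>
#include <cstdio>
#include "ggml.h"

struct rpc_tensor_msg {
    uint32_t type;
    uint64_t data_offset;
    uint64_t data_size;
};

static ggml_tensor * deserialize_tensor_checked(ggml_context * ctx, const rpc_tensor_msg & m) {
    // Type validation: reject out-of-range types before ggml_new_tensor_4d.
    if (m.type >= GGML_TYPE_COUNT) {
        fprintf(stderr, "rpc: invalid tensor type %u\n", (unsigned) m.type);
        return nullptr;
    }
    // ... fill in the real shape from the message ...
    return ggml_new_tensor_4d(ctx, (ggml_type) m.type, 1, 1, 1, 1);
}

static bool check_in_bounds(uint64_t buf_size, const rpc_tensor_msg & m) {
    // Bounds check: offset + size must stay inside the target buffer,
    // written so the addition cannot overflow.
    return m.data_offset <= buf_size && m.data_size <= buf_size - m.data_offset;
}
```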
|
|
This reverts commit 8a5f8573aefc23282200041abbfa12886083334a.
|
|
* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency (a socket-level sketch follows this commit message).
# Conflicts:
# ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
# examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tcp no delay for rpc
* add back webui
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
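A socket-level sketch of the RPC_CMD_SET_TENSOR change noted above (the helpers and the command enum are assumptions, not the actual ggml-rpc client code): since the reply is known to be empty, the client can skip the blocking read and avoid paying the round-trip latency several times per token.
```cpp
// Hedged sketch; send_bytes/recv_bytes and rpc_cmd are assumed helpers,
// not the actual ggml-rpc API.
#include <cstddef>
#include <cstdint>

bool send_bytes(int sockfd, const void * buf, size_t len);   // assumed helper
bool recv_bytes(int sockfd, void * buf, size_t len);         // assumed helper

enum class rpc_cmd : uint8_t { SET_TENSOR = 1, GET_TENSOR = 2 /* ... */ };

static bool send_command(int sockfd, rpc_cmd cmd, const void * payload, size_t n) {
    if (!send_bytes(sockfd, &cmd, sizeof(cmd))) return false;
    if (!send_bytes(sockfd, &n,   sizeof(n)))   return false;
    if (!send_bytes(sockfd, payload, n))        return false;
    if (cmd == rpc_cmd::SET_TENSOR) {
        return true;                  // empty reply: do not block on recv
    }
    uint64_t reply_size = 0;          // other commands: read the reply header
    return recv_bytes(sockfd, &reply_size, sizeof(reply_size));
}
```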
|
|
|
|
* update webui
* add token/s in webui
* add webui files
* fix webui first message disappearing in some browsers
* add missing html files
---------
Co-authored-by: firecoperana <firecoperana>
|
|
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Remove kv_l, kvt_l and just use k_l and v_l
* Hopefully take care of missing V cache (MLA)
* Fix save and restore when there is no V cache
* Fix double print
* Update write_kv_cache_data and read_kv_cache_data to be MLA aware
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_m_r4: CUDA dequantize
* iq1_m_r4: CUDA dequantize
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* MMQ for iq4_ks_r4
* MMQ for iq5_ks_r4
* Add forgotten file
* Another forgotten file
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Also do the dequantize approach for mul_mat_id
* Also do the dequantize approach for iqk_moe_fused_up_gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* iq1_s_r4: CUDA dequantize
* iq1_s_r4: CUDA GEMV
* iq1_s_r4: MMQ on CUDA
Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
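A hedged sketch of the dispatch implied by the last item above (kernel and helper names are hypothetical): Turing corresponds to CUDA compute capability 7.5, so the MMQ path is only taken on cc >= 750, with a dequantize + cuBLAS fallback otherwise.
```cpp
// Hedged sketch; kernel/helper names are hypothetical.
#include <cuda_runtime.h>

void mul_mat_iq1_s_r4(int device /*, tensor arguments ... */) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    const int cc = 100 * prop.major + 10 * prop.minor;   // Turing = 750
    if (cc >= 750) {
        // launch_mmq_iq1_s_r4(...);            // quantized matmul kernel
    } else {
        // dequantize_iq1_s_r4_to_f16(...);     // then cublasGemmEx on the f16 data
    }
}
```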
|
|
* Adding top-n-sigma sampler (a sketch of the idea follows this commit message)
* Fix typos in XTC PR
* Update README.md for main and server
* More README
* More README
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
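A hedged sketch of the top-n-sigma idea: keep only tokens whose logit lies within n standard deviations of the maximum logit. The function and parameter names are hypothetical; the sampler in this repo may differ in details.
```cpp
// Hedged sketch of top-n-sigma filtering; names are hypothetical.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<int> top_n_sigma_filter(const std::vector<float> & logits, float n) {
    float max_logit = -INFINITY, mean = 0.f;
    for (float l : logits) { max_logit = std::max(max_logit, l); mean += l; }
    mean /= logits.size();
    float var = 0.f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / logits.size());
    const float threshold = max_logit - n * sigma;
    std::vector<int> kept;
    for (int i = 0; i < (int) logits.size(); ++i)
        if (logits[i] >= threshold) kept.push_back(i);   // survivors go on to softmax/sampling
    return kept;
}
```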
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Direct conversion from fp16 to Q6_0
* forgotten comma
* More precise info
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds (illustrated after this commit message).
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
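A scalar illustration of the f16 overflow note above (the real code uses NEON f16 intrinsics; `_Float16` compiler support is assumed): fp16 tops out at 65504, so folding the row scale into the values before the multiply-adds keeps the partial products and the accumulator in range.
```cpp
// Hedged scalar illustration; the real kernels use NEON f16 intrinsics.
#include <cstdint>

using f16 = _Float16;   // assumption: the compiler provides _Float16

void dequant_row_scaled(const int8_t * q, f16 * out, int n, float row_scale) {
    for (int i = 0; i < n; ++i)
        out[i] = (f16)(row_scale * q[i]);   // row scale applied up front
}

f16 dot_f16(const f16 * x, const f16 * y, int n) {
    f16 acc = (f16) 0;
    for (int i = 0; i < n; ++i)
        acc = (f16)(acc + x[i] * y[i]);     // intermediates now stay within f16 range
    return acc;
}
```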
|
|
* iq2_kt: Metal dequantize
* iq2_kt: Metal GEMV
Performance is actually quite decent: 52 t/s on my M2-Max for LLaMA-3.1-8B
* iq3_kt: Metal dequantize
* iq3_kt: Metal GEMV
Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LLaMA-3.1-8B.
Flipping signs is a costly affair.
* iq4_kt: Metal dequantize - getting NaNs
* iq4_kt: Metal GEMV - also not working
* iq4_kt: Metal still not working
* Disable iq4_kt on Metal for now
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* Remove kv_l, kvt_l and just use k_l and v_l
* Hopefully take care of missing V cache (MLA)
* Replace MLA-specific KV cache with the standard KV cache V2 (#473)
* Fix save and restore when there is no V cache
* Fix double print
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: saood06 <saood05@gmail.com>
|
|
* iq2_kt: NEON implementation
* iq3_kt: NEON implementation
* iq4_kt: not working NEON implementation
* iq4_kt: NEON implementation
Have to use f32 arithmetic else I get gibberish?
Correspondingly ridiculously slow.
* Cleanup
* iq4_kt: slightly faster TG on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* CUDA: iq4_ks_r4 GEMV and GEMM
* CUDA: iq5_ks_r4 GEMV and GEMM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* CUDA: iq4_k_r4 dequantize
* CUDA: iq4_k_r4 GEMV
~10% slower than iq4_k.
* CUDA: slightly faster iq4_k_r4 GEMV
* CUDA: slightly faster iq4_k_r4 GEMV
We are now within 3% of iq4_k
* CUDA: iq5_k_r4 dequantize
* CUDA: iq5_k_r4 GEMV
~3% slower than iq5_k.
* CUDA: iq3_k_r4 dequantize
* CUDA: iq3_k_r4 GEMV
* CUDA: slightly faster iq3_k_r4 GEMV
* CUDA: iq2_k_r4 GEMV
* CUDA: faster iq2_k_r4 GEMV
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Legacy quants conversion schemes in convert_hf_to_gguf.py
This is notably in order to make smaller conversions for generating an imatrix file.
`Q4_0`,`Q4_1` here use q5_0 for the embeddings, output, attn_k and attn_v tensors.
`Q5_0`,`Q5_1` here use q8_0 for the embeddings, output, attn_k and attn_v tensors.
Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022
Original author: @chentyjpm
Also fixes 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.
* forgotten IQ5_KS case mention
|
|
* Somewhat faster iq3_kt (AVX2)
* Cleanup
* Slightly faster iq4_kt
* Slightly faster iq4_kt
PP is now almost 50% better than original, TG is ~20% better
* Cleanup
* Very slightly faster iq4_kt TG
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fix MSVC compilation
* MSVC cannot capture constexpr in lambdas (illustrated after this commit message)
* Arghhh
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
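An illustration of the MSVC quirk mentioned above, under the assumption that it is the usual constexpr-in-lambda issue: GCC/Clang accept the un-captured use of a local constexpr inside a lambda, some MSVC versions do not, and re-declaring the constant inside the lambda sidesteps the problem.
```cpp
// Hedged illustration of the workaround; the constant name is hypothetical.
#include <cstdio>

void example() {
    constexpr int kBlockSize = 32;
    // May fail on some MSVC versions even though no capture should be needed:
    // auto f = [](int i) { return i / kBlockSize; };

    auto f = [](int i) {
        constexpr int kBlockSize = 32;   // workaround: re-declare inside the lambda
        return i / kBlockSize;
    };
    printf("%d\n", f(96));
    (void) kBlockSize;
}
```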
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* WIP
* WIP
* WIP
* Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster (sketched after this commit message).
* iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
* iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S with forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).
My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
* iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
* Minor
* Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B, 4096) = 6.3913
Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B, 4096) = 6.3825
Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17t/s eval)
* flatten 3inst iters + avx2 (0.3t/s eval)
* iq3_kt (0.3t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slighty faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
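A hedged sketch of the cluster-pruned codebook search described in the quantization-speedup items above (the data layout is hypothetical; the repo's quantizer differs): pick the nearest cluster centroid first, then brute-force only that cluster's points, which are stored contiguously so they can be scanned sequentially.
```cpp
// Hedged sketch of cluster-pruned nearest-codebook-point search.
#include <cstddef>
#include <limits>
#include <vector>

struct Clusters {
    int dim;                                    // values per codebook point
    std::vector<float> centroids;               // n_clusters * dim
    std::vector<std::vector<float>> points;     // per cluster: points stored contiguously
    std::vector<std::vector<int>>   codes;      // code id of each point in the cluster
};

static float dist2(const float * a, const float * b, int d) {
    float s = 0.f;
    for (int i = 0; i < d; ++i) { const float t = a[i] - b[i]; s += t * t; }
    return s;
}

int nearest_code(const Clusters & C, const float * x) {
    // 1. Find the nearest cluster centroid.
    int best_c = 0; float best = std::numeric_limits<float>::max();
    for (size_t c = 0; c < C.codes.size(); ++c) {
        const float d = dist2(x, &C.centroids[c * C.dim], C.dim);
        if (d < best) { best = d; best_c = (int) c; }
    }
    // 2. Brute force only inside that cluster (contiguous, sequential access).
    const auto & pts = C.points[best_c];
    int best_code = -1; best = std::numeric_limits<float>::max();
    for (size_t j = 0; j * C.dim < pts.size(); ++j) {
        const float d = dist2(x, &pts[j * C.dim], C.dim);
        if (d < best) { best = d; best_code = C.codes[best_c][j]; }
    }
    return best_code;
}
```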
|
|
gguf-split : improve --split and --merge logic (#9619)
* make sure params --split and --merge are not specified at same time
* update gguf-split params parse logic
* Update examples/gguf-split/gguf-split.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
---------
gguf-split : add basic checks (#9499)
* gguf-split : do not overwrite existing files when merging
* gguf-split : error when too many arguments are passed
Authored-by: slaren <slarengh@gmail.com>
|
|
* Streamline the quant strategies a bit
No change over the existing patterns, except for a bump for attn_k and attn_v in models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA).
The rest applies the existing patterns to the new IQ_K quants.
Also, a Q8_0 for attn_q had slipped into the 8-expert MoE rule; I removed it because that tensor is much bigger than attn_k or attn_v.
* remove <=8 experts condition.
|
|
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
Whether it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
It was broken before the refactoring (the shifts were not correctly
applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
|
|
* Add __syncthreads() to the new FA kernel
* Clearing padding
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|