author | Kawrakow <iwankawrakow@gmail.com> | 2024-09-27 08:16:06 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-09-27 08:16:06 +0300 |
commit | 6dec4af4b6e65eb72e646a6f8b10d77c9d306281 (patch) | |
tree | b69a6dfdd024ccf6a4d7490666664cbac4bc65ce /ggml/src/ggml-common.h | |
parent | 546f3ef349a7082fbc349897c3c7246baed2a6c6 (diff) |
Adding ability to have meta data per tensor row (#61)
* POC: per row scale
This is a POC of how to work around opinionated ggml to
have scales per row rather than per block.
Only implemented for Zen4 and only for iq2_tn.
* POC per row scale: iq2_tn on NEON
* POC per row scale: iq2_tn on Metal
* Per row scale Metal templates
* iq1_tn: shrink to 1.625 bpw (NEON and Metal)
* POC per row scale: CUDA
* POC per row scale: add CUDA TODOs
There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.
* Per row scales - CUDA
The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.
* Update IQ1_TN and IQ2_TN bpw shown to user
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
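The row-size assumption called out in the CUDA TODO above can be illustrated with a minimal sketch. The helper names and the `row_meta_size` parameter are hypothetical, introduced here only for illustration; they are not the ggml API and are not taken from this commit.

```c
#include <stddef.h>

/* Hypothetical helper (not a ggml function): the classic assumption that a
 * row consists of nothing but whole quantization blocks. */
static size_t row_size_blocks_only(size_t type_size, size_t n_per_row, size_t block_size) {
    return type_size * n_per_row / block_size;
}

/* Hypothetical helper (not a ggml function): a row that also carries
 * row_meta_size bytes of per-row metadata, such as a single scale.
 * The actual size and placement of the metadata are assumptions here. */
static size_t row_size_with_row_meta(size_t type_size, size_t n_per_row, size_t block_size,
                                     size_t row_meta_size) {
    return row_meta_size + row_size_blocks_only(type_size, n_per_row, block_size);
}
```

For example, a 4096-element row of 64-byte blocks holding 256 weights each occupies 1024 bytes under the blocks-only formula, and 1028 bytes if an assumed 4-byte scale is stored per row. Code that strides through tensor data with the first formula would drift by 4 bytes per row, which is the kind of issue the TODO warns about when tensors are split between GPUs.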
Diffstat (limited to 'ggml/src/ggml-common.h')
-rw-r--r-- | ggml/src/ggml-common.h | 7 |
1 files changed, 3 insertions, 4 deletions
diff --git a/ggml/src/ggml-common.h b/ggml/src/ggml-common.h
index 40a4b53c..bb0c4864 100644
--- a/ggml/src/ggml-common.h
+++ b/ggml/src/ggml-common.h
@@ -400,14 +400,13 @@ static_assert(sizeof(block_iq2_bn) == QK_IQ2BN/4, "wrong iq2_bn block size/paddi
 // TriLM - implemented as 2.0625 bpw
 //
 typedef struct {
-    uint8_t qs[54];
+    uint8_t qs[52];
 } block_iq1_tn;
-static_assert(sizeof(block_iq1_tn) == 54, "wrong iq1_tn block size/padding");
+static_assert(sizeof(block_iq1_tn) == 52, "wrong iq1_tn block size/padding");
 typedef struct {
-    ggml_half d;
     uint8_t qs[QK_K/4];
 } block_iq2_tn;
-static_assert(sizeof(block_iq2_tn) == sizeof(ggml_half) + QK_K/4, "wrong iqt_bn block size/padding");
+static_assert(sizeof(block_iq2_tn) == QK_K/4, "wrong iqt_bn block size/padding");
 
 // Used by IQ1_M quants
 typedef union {
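The block sizes in this diff can be related to the bits-per-weight figures in the commit message with a short calculation. This is a sketch under two assumptions: that `QK_K` is 256 (the usual ggml super-block size) and that each `block_iq1_tn` and `block_iq2_tn` covers `QK_K` weights. The struct definitions mirror the post-commit versions shown in the diff; `bits_per_weight` is a hypothetical helper, not part of ggml.

```c
#include <stddef.h>
#include <stdint.h>

#define QK_K 256  /* assumption: usual ggml super-block size */

/* Mirrors of the post-commit block layouts from the diff. */
typedef struct { uint8_t qs[52];     } block_iq1_tn;
typedef struct { uint8_t qs[QK_K/4]; } block_iq2_tn;

/* Hypothetical helper: bits per weight = 8 * bytes per block / weights per block.
 * Per-row metadata is not included; it amortizes toward zero for long rows. */
static double bits_per_weight(size_t block_bytes, size_t weights_per_block) {
    return 8.0 * (double)block_bytes / (double)weights_per_block;
}
```

Under these assumptions, `bits_per_weight(sizeof(block_iq1_tn), QK_K)` gives 1.625 (down from 1.6875 with the old 54-byte block), and `bits_per_weight(sizeof(block_iq2_tn), QK_K)` gives 2.0 (down from 2.0625 when each block also carried a 2-byte `ggml_half d`).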