author    Kawrakow <48489457+ikawrakow@users.noreply.github.com>  2024-02-18 18:16:55 +0200
committer GitHub <noreply@github.com>  2024-02-18 18:16:55 +0200
commit    bd2d4e393b2b7d2a1b2e201058e26017c9728ead (patch)
tree      5c51109459cf1a25fc92fdb11d420895e16785ac /llama.h
parent    c8e0d7efeb7634ecc2e9832e879ab9fca4510e71 (diff)
1.5 bit quantization (#5453)
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output
  At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics
  Dequantize works, but not dot product
* iq1_s: Metal works, but quite slow
  As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'llama.h')
-rw-r--r--  llama.h | 1 +
1 file changed, 1 insertion(+), 0 deletions(-)
diff --git a/llama.h b/llama.h
index f4ec6ea6..5a97abcc 100644
--- a/llama.h
+++ b/llama.h
@@ -100,6 +100,7 @@ extern "C" {
LLAMA_FTYPE_MOSTLY_Q2_K_S = 21, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q3_K_XS = 22, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ3_XXS = 23, // except 1d tensors
+ LLAMA_FTYPE_MOSTLY_IQ1_S = 24, // except 1d tensors
LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
};
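For reference, a minimal sketch of how the new file type might be requested through the quantization API already declared in this header (llama_model_quantize_default_params() and llama_model_quantize()). The file paths and thread count here are hypothetical placeholders, and the sketch omits the importance-matrix input that the low-bpw IQ quants are meant to be used with.

    #include <stdio.h>
    #include "llama.h"

    // Sketch: quantize an F16 GGUF model to the IQ1_S file type added in
    // this commit. Paths are placeholders, not real files.
    int main(void) {
        llama_model_quantize_params params = llama_model_quantize_default_params();
        params.ftype   = LLAMA_FTYPE_MOSTLY_IQ1_S; // the enum value added above
        params.nthread = 4;                        // worker threads, arbitrary choice

        // llama_model_quantize() returns 0 on success.
        if (llama_model_quantize("model-f16.gguf", "model-iq1_s.gguf", &params) != 0) {
            fprintf(stderr, "quantization failed\n");
            return 1;
        }
        return 0;
    }

The quantize example program in the same repository exposes this type under the name IQ1_S, so the same result should be reachable from the command line without writing any C.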