path: root/llama.h
author    jiez <373447296@qq.com>        2024-04-25 18:29:35 +0800
committer GitHub <noreply@github.com>    2024-04-25 13:29:35 +0300
commit    1966eb2615242f224bf9ca939db8905ab6a174a0 (patch)
tree      3da33a1b5f816723e195a4936d44c4bef2eaa06a /llama.h
parent    784e11dea1f5ce9638851b2b0dddb107e2a609c8 (diff)
quantize : add '--keep-split' to quantize model into shards (#6688)
* Implement '--keep-split' to quantize model into several shards
* Add test script
* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Split model correctly even if tensor id is out-of-order
* Update llama_model_quantize_params
* Fix preci failures

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Diffstat (limited to 'llama.h')
-rw-r--r--  llama.h  1
1 file changed, 1 insertion(+), 0 deletions(-)
diff --git a/llama.h b/llama.h
index 0eb2a1e9..8aa76367 100644
--- a/llama.h
+++ b/llama.h
@@ -288,6 +288,7 @@ extern "C" {
bool quantize_output_tensor; // quantize output.weight
bool only_copy; // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
bool pure; // quantize all tensors to the default type
+ bool keep_split; // quantize to the same number of shards
void * imatrix; // pointer to importance matrix data
void * kv_overrides; // pointer to vector containing overrides
} llama_model_quantize_params;
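
For context, a minimal sketch of how a caller might enable the new field through the C API declared in llama.h. The entry points llama_model_quantize_default_params(), llama_model_quantize(), llama_backend_init()/llama_backend_free() and the ftype constant are from the header as of this commit; the input and output file names are hypothetical placeholders.

    #include "llama.h"

    int main(void) {
        llama_backend_init();

        // Start from the library defaults, then opt in to shard preservation.
        llama_model_quantize_params params = llama_model_quantize_default_params();
        params.ftype      = LLAMA_FTYPE_MOSTLY_Q4_K_M; // target quantization type
        params.keep_split = true;                      // keep the same number of shards as the input

        // With keep_split enabled, quantizing a split model such as
        // "input-00001-of-00003.gguf" (hypothetical name) is intended to
        // produce matching output shards instead of one merged file.
        uint32_t rc = llama_model_quantize("input-00001-of-00003.gguf",
                                           "output.gguf", &params);

        llama_backend_free();
        return rc == 0 ? 0 : 1;
    }

On the command line this corresponds to the new flag added by this change, e.g. ./quantize --keep-split input-00001-of-00003.gguf output.gguf Q4_K_M (paths hypothetical).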