author     Kawrakow <iwankawrakow@gmail.com>      2025-05-12 07:49:51 +0300
committer  GitHub <noreply@github.com>            2025-05-12 07:49:51 +0300
commit     f27cd405422307e02dffa8949ac30bc56b4d2900 (patch)
tree       722b742827684815ca2cc0fb6379edd4edd2f3fd /include/llama.h
parent     465569dff8b49a195450a0eb1974fd72a32fcebc (diff)
Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP
* Enable MLA-3 in crippled GGUFs: seems to work
* Add newly created tensors to model.tensors_by_name, else they don't get run-time repacked.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'include/llama.h')
-rw-r--r--  include/llama.h | 1 +
1 file changed, 1 insertion(+), 0 deletions(-)
diff --git a/include/llama.h b/include/llama.h
index f1511548..0f3ae862 100644
--- a/include/llama.h
+++ b/include/llama.h
@@ -325,6 +325,7 @@ extern "C" {
     struct llama_model_params {
         int32_t n_gpu_layers; // number of layers to store in VRAM
+        int32_t mla;          // MLA implementation to use (only applicable to DeepSeek models at this point)
         enum llama_split_mode split_mode; // how to split the model across multiple GPUs
         // main_gpu interpretation depends on split_mode:
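
Since the hunk only shows the new struct field, here is a minimal C sketch of how a caller might set it when loading a DeepSeek model. The calls llama_backend_init, llama_model_default_params, llama_load_model_from_file, and llama_free_model are the pre-existing llama.cpp C API of this era and are assumed unchanged in the fork; the value 3 is an assumption tied to the MLA-3 variant named in the commit message, and the model filename is purely illustrative.

/*
 * Usage sketch for the new `mla` field (not part of this commit;
 * assumptions are flagged in the comments below).
 */
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload as many layers as fit in VRAM
    mparams.mla          = 3;  // assumption: 3 selects the MLA-3 path named in the
                               // commit message; check the fork for accepted values

    // Model path is illustrative; per the field's comment, `mla` only
    // applies to DeepSeek-architecture models.
    struct llama_model * model = llama_load_model_from_file("deepseek-v2-lite.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        llama_backend_free();
        return 1;
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}

Exposing the selector as a plain int32_t in llama_model_params (rather than a new API entry point) means existing callers keep compiling and pick up the default behavior unless they opt in, which fits the commit's goal of making MLA usable with GGUFs converted by mainline llama.cpp.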