summaryrefslogtreecommitdiff
path: root/llama.h
diff options
context:
space:
mode:
author    DAN™ <dranger003@gmail.com>              2024-03-10 11:56:30 -0400
committer GitHub <noreply@github.com>              2024-03-10 17:56:30 +0200
commit    bcebd7dbf62fd7b293d5ed089023e4e733269c71 (patch)
tree      da8a1c4a76dfa9044a2bda8d1c58caaedd34bf4d /llama.h
parent    2960eae847f8dbde23be6d170a61bcf44ebf32de (diff)
llama : add support for GritLM (#5959)
* add gritlm example
* gritlm results match
* tabs to spaces
* comment out debug printing
* rebase to new embed
* gritlm embeddings are back babeee
* add to gitignore
* allow to toggle embedding mode
* Clean-up GritLM sample code.
* Fix types.
* Flush stdout and output ending newline if streaming.
* mostly style fixes; correct KQ_mask comment
* add causal_attn flag to llama_cparams
* gritml : minor
* llama : minor

---------

Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Diffstat (limited to 'llama.h')
-rw-r--r--  llama.h | 4
1 files changed, 4 insertions, 0 deletions
diff --git a/llama.h b/llama.h
index 7a107c7f..c8e05aad 100644
--- a/llama.h
+++ b/llama.h
@@ -643,6 +643,10 @@ extern "C" {
// n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)
LLAMA_API void llama_set_n_threads(struct llama_context * ctx, uint32_t n_threads, uint32_t n_threads_batch);
+ // Set whether to use causal attention or not
+ // If set to true, the model will only attend to the past tokens
+ LLAMA_API void llama_set_causal_attn(struct llama_context * ctx, bool causal_attn);
+
// Set abort callback
LLAMA_API void llama_set_abort_callback(struct llama_context * ctx, ggml_abort_callback abort_callback, void * abort_callback_data);