author | Minsoo Cheong <54794500+mscheong01@users.noreply.github.com> | 2024-03-05 03:24:00 +0900 |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-03-04 20:24:00 +0200 |
commit | 6d341ab6c53cd51f2921d986d0090cc8b049b39a (patch) | |
tree | f212b497e210c8c73fe52369f6bc81297c7b1dab /common/common.h | |
parent | 4ffcdce2ff877ebb683cd217ea38faf20faa5ffe (diff) |
speculative : implement stochastic speculative sampling (#5625)
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix #5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix r random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
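
The bullets above mention accepting or rejecting drafted tokens stochastically and sampling "from residual distribution on draft accept failure". As a rough illustration only, here is a minimal sketch of the generic speculative sampling rule: accept a drafted token with probability min(1, p_target / p_draft), otherwise resample from the normalized max(0, p_target - p_draft). This is not the code from this commit; `verify_result` and `verify_draft_token` are hypothetical names, and the sketch assumes both inputs are full-vocabulary probability vectors of equal length.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical result type (not part of llama.cpp).
struct verify_result {
    bool accepted; // true if the drafted token was kept
    int  token;    // token to emit: the drafted one, or a resampled one
};

// Hypothetical helper (not part of llama.cpp): verify one drafted token.
// Assumes p_target and p_draft are probability distributions over the same
// vocabulary, and that `drafted` is a valid index into both.
verify_result verify_draft_token(const std::vector<float> & p_target,
                                 const std::vector<float> & p_draft,
                                 int drafted,
                                 std::mt19937 & rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const double r = unif(rng);

    // Accept with probability min(1, p_target / p_draft).
    // Written without a division so that p_draft == 0 is handled safely.
    if (r * p_draft[drafted] < p_target[drafted]) {
        return { true, drafted };
    }

    // Rejected: build the residual weights max(0, p_target - p_draft).
    std::vector<double> residual(p_target.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < p_target.size(); ++i) {
        residual[i] = std::max(0.0, double(p_target[i]) - double(p_draft[i]));
        sum += residual[i];
    }

    // Degenerate case (the two distributions coincide): fall back to
    // sampling from the target distribution directly.
    if (sum <= 0.0) {
        std::discrete_distribution<int> from_target(p_target.begin(), p_target.end());
        return { false, from_target(rng) };
    }

    // std::discrete_distribution normalizes the weights internally.
    std::discrete_distribution<int> from_residual(residual.begin(), residual.end());
    return { false, from_residual(rng) };
}
```

A caller would seed one `std::mt19937` and invoke the helper once per drafted position, stopping at the first rejection. When the two distributions are identical the drafted token is always kept, and when the target assigns it zero probability it is always replaced.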
Diffstat (limited to 'common/common.h')
-rw-r--r-- | common/common.h | 3 |
1 files changed, 1 insertions, 2 deletions
diff --git a/common/common.h b/common/common.h
index b2868833..977ce419 100644
--- a/common/common.h
+++ b/common/common.h
@@ -53,11 +53,10 @@ struct gpt_params {
     int32_t n_ctx = 512; // context size
     int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
     int32_t n_keep = 0; // number of tokens to keep from initial prompt
-    int32_t n_draft = 8; // number of tokens to draft during speculative decoding
+    int32_t n_draft = 5; // number of tokens to draft during speculative decoding
     int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
     int32_t n_parallel = 1; // number of parallel sequences to decode
     int32_t n_sequences = 1; // number of sequences to decode
-    float p_accept = 0.5f; // speculative decoding accept probability
     float p_split = 0.1f; // speculative decoding split probability
     int32_t n_gpu_layers = -1; // number of layers to store in VRAM (-1 - use default)
     int32_t n_gpu_layers_draft = -1; // number of layers to store in VRAM for the draft model (-1 - use default)