| author | Georgi Gerganov <ggerganov@gmail.com> | 2023-10-22 22:53:08 +0300 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2023-10-22 22:53:08 +0300 |
| commit | 438c2ca83045a00ef244093d27e9ed41a8cb4ea9 (patch) | |
| tree | 28e31cb62c99afe935a8bce3fb45b46e6442e891 /examples/server/chat.mjs | |
| parent | 9e70cc03229df19ca2d28ce23cc817198f897278 (diff) | |
server : parallel decoding and multimodal (#3677)
* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now support multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alpha
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavailable
* some ci fixes
* fix ci make build undefined ref errors
* fix prompts longer than ctx, as proposed in #3639
* fixed premature end due to stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api like OpenAI
* multimodal support enabled by default
* fix make build errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images
n_tokens is incremented internally by llama_batch_add
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : batch has to be allocated for n_parallel sequences
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handling in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool
---------
Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
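For readers following the chat.mjs changes below, the client-side contract added by this PR boils down to: send `cache_prompt` and `slot_id` with each `/completion` request, and remember the `slot_id` echoed back in every streamed `data:` message so that follow-up requests land on the same slot (and its cached prompt prefix). The sketch below is illustrative only and is modelled on the diff in this commit; the server address, prompt text and `n_predict` value are placeholders, and it assumes Node 18+ for the global `fetch`.

```js
// Minimal sketch (not part of the commit) of the per-request fields the
// updated chat.mjs relies on: cache_prompt + slot_id on the way in, and the
// slot_id echoed back in each streamed "data: " message on the way out.
const API_URL = 'http://127.0.0.1:8080'

let slot_id = -1 // -1 lets the server pick a free slot on the first request

async function complete(prompt) {
    const result = await fetch(`${API_URL}/completion`, {
        method: 'POST',
        body: JSON.stringify({
            prompt,
            n_predict: 128,
            cache_prompt: true,  // reuse the prompt prefix cached in the slot
            slot_id: slot_id,    // route follow-up requests to the same slot
            stream: true,
        })
    })

    let answer = ''
    for await (const chunk of result.body) {
        const t = Buffer.from(chunk).toString('utf8')
        if (!t.startsWith('data: ')) continue
        const message = JSON.parse(t.substring(6))
        slot_id = message.slot_id // remember which slot served this client
        answer += message.content
        if (message.stop) break
    }
    return answer
}
```

Passing `slot_id: -1` on the first request lets the server choose any free slot; reusing the returned id afterwards is what makes prompt caching effective across turns of the same conversation.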
Diffstat (limited to 'examples/server/chat.mjs')
-rw-r--r-- | examples/server/chat.mjs | 11 |
1 file changed, 11 insertions, 0 deletions
```diff
diff --git a/examples/server/chat.mjs b/examples/server/chat.mjs
index 87f4d292..219ebb51 100644
--- a/examples/server/chat.mjs
+++ b/examples/server/chat.mjs
@@ -7,6 +7,11 @@ const args = process.argv.slice(2);
 const grammarJsonSchemaFile = args.find(
     (_, index) => args[index - 1] === "--grammar-json-schema"
 );
+
+const no_cached_prompt = args.find(
+    (_, index) => args[index - 1] === "--no-cache-prompt"
+) ?? "false";
+
 const grammarFile = args.find((_, index) => args[index - 1] === "--grammar");
 
 // Example usage: function,arguments
@@ -30,6 +35,9 @@ if (grammarFile) {
     grammar = readFileSync(grammarFile, 'utf-8')
 }
 
+// for cached prompt
+let slot_id = -1;
+
 const API_URL = 'http://127.0.0.1:8080'
 
 const chat = [
@@ -76,6 +84,8 @@ async function chat_completion(question) {
             top_p: 0.9,
             n_keep: n_keep,
             n_predict: 256,
+            cache_prompt: no_cached_prompt === "false",
+            slot_id: slot_id,
             stop: ["\n### Human:"], // stop completion after generating this
             grammar,
             stream: true,
@@ -92,6 +102,7 @@ async function chat_completion(question) {
         const t = Buffer.from(chunk).toString('utf8')
         if (t.startsWith('data: ')) {
             const message = JSON.parse(t.substring(6))
+            slot_id = message.slot_id
             answer += message.content
             process.stdout.write(message.content)
             if (message.stop) {
```
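One detail of the new `--no-cache-prompt` flag that is easy to miss: it is parsed as a value-taking option (the argument *after* the flag becomes its value), and `cache_prompt` is only sent as `true` while that value remains at its default of `"false"`. Below is a small self-contained sketch of the parsing behaviour, using the same variable names as the diff; the invocations shown in the comments are illustrative.

```js
// Illustrative only; mirrors the argument parsing added to chat.mjs.
//
//   node chat.mjs                         -> cache_prompt sent as true
//   node chat.mjs --no-cache-prompt true  -> cache_prompt sent as false
const args = process.argv.slice(2);

// value of the argument that follows "--no-cache-prompt", defaulting to "false"
const no_cached_prompt = args.find(
    (_, index) => args[index - 1] === "--no-cache-prompt"
) ?? "false";

// caching stays enabled only while the flag keeps its default value
const cache_prompt = no_cached_prompt === "false";
console.log({ no_cached_prompt, cache_prompt });
```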