Age | Commit message (Collapse) | Author |
|
|
|
* clip: enable CUDA backend
* add missing kernels
* add enough padding for alignment
* remove ggml_repeat of clip.cpp
* add metal backend
* llava : fixes
- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"
* fix warning in example.
|
|
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup deleting a long
standing TODO comment.
This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.
See Mozilla-Ocho/llamafile@1cd334f
|
|
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.
See mozilla-Ocho/llamafile@711344b
|
|
|
|
|
|
* Fix main-cmake-pkg compilation
* Use glob to load common files
* cmake : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* fix infinite loop
* slight UI simplification, clearer UX
* clearer UI text, add timings to completion log
|
|
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
|
|
This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab: 32000
print_params: n_ctx: 128
print_params: n_embd: 4096
print_params: n_ff: 11008
print_params: n_head: 32
print_params: n_head_kv: 32
print_params: n_layer: 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
With this comit the output will look like this:
```console
print_params: n_vocab : 32000
print_params: n_ctx : 128
print_params: n_embd : 4096
print_params: n_ff : 11008
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
|
|
* initial commit, going through initializations
* main loop finished, starting to debug
* BUG: generates gibberish/repeating tokens after a while
* kv_cache management
* Added colors to distinguish drafted tokens (--color). Updated README
* lookup : fix token positions in the draft batch
* lookup : use n_draft from CLI params
* lookup : final touches
---------
Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* ggml : change ggml_scale to take a float instead of tensor
* ggml : fix CPU implementation
* tests : fix test-grad0
ggml-ci
|
|
|
|
|
|
|
|
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force to use n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench
---------
Co-authored-by: jhen <developer@jhen.me>
|
|
|
|
|
|
Fix bug in identifying the grammar.
|
|
|
|
|
|
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
ggml-ci
|
|
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
(#4446)
|
|
* Set a more typical Top P setting as the default
* Update temp max
|
|
|
|
|
|
Fix small typo.
|
|
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4) xaedes added
ggml_allocr * alloc = NULL;
... (many lines in between)
if (alloc) {
ggml_allocr_free(alloc);
}
Which is correct, but it's easy to lose context after many lines in between.
On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.
alloc = ggml_allocr_new(...)
... (short lines of code)
ggml_allocr_free(alloc)
This happens a few times, but alloc is never set to NULL, and many lines below,
we still have
if (alloc) {
ggml_allocr_free(alloc);
}
which causes a double-free.
|
|
|
|
* speculative: add some colors
* minor : add braces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Samplers sequence order w parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
|
|
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
|
|
|
|
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84ae3bcbf0d617b7ee6a5413bcbd58af)
|
|
|
|
|
|
* Fix token_to_piece implementation in Swift
* Fix errors
|
|
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
|
|
* * add --log-disable to disable logging to file in the server example
* * typo fix
|
|
* * add multiprompt support
* * cleanup
* * more cleanup
* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* * remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
|
* ShareGPT4 compatibility (vision encoder only loading)
Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)
Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)
Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them
* Update convert-image-encoder-to-gguf.py
|
|
* main : Call llama_log_set to use LOG_TEE
* tabs to spaces
|
|
docs: update how to run
|
|
* fix oai proxy
fix generation not stoped while bot stop talking in chat mode
fix possible `slot_id` not exist
response for cors (and pre flight)
* oai proxy: workaround for some client (such as Chatbox)
* use stop as separator to replace hardcoded `\n`
|