|
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for converting a
llava-1.6 model and running it using llava-cli.
The motivation is that having explicit instructions, similar to the
existing 1.5 instructions, will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* server: use llama_chat_apply_template
* server: remove trailing space
* server: fix format_chat
* server: fix help message
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: fix formatted_chat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Add maid to ui list
* Specify licence
|
|
* add build support for embedded metal library
* Update Makefile
---------
Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* Update ggml_sycl_op_mul_mat_vec_q
* Apply suggestions from code review
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* revert suggestion on macro
* fix bug
* Add quant type GGML_TYPE_IQ1_S to the unsupported list
* fix format
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
|
Author: Philip Taron <philip.taron@gmail.com>
Date: Tue Feb 13 20:28:02 2024 +0000
|
|
|
|
up ggml_vk_instance_init()
|
|
|
|
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
|
|
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
|
|
Closes #5304
|
|
* cuda : ignore "peer access already enabled" errors
* fix hip
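The entry above makes the CUDA backend tolerate "peer access already enabled" errors instead of treating them as failures. Below is a hedged host-side C++ sketch of that pattern using the CUDA runtime API; enable_peer_access is an illustrative helper, not the actual ggml backend code.

    // Sketch: treat cudaErrorPeerAccessAlreadyEnabled as a no-op when wiring up
    // peer-to-peer access between GPUs, instead of aborting on it.
    #include <cuda_runtime.h>
    #include <cstdio>

    static void enable_peer_access(int dev, int peer) {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, dev, peer);
        if (!can_access) {
            return; // peer access not supported between these two devices
        }
        cudaSetDevice(dev);
        const cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
        if (err == cudaErrorPeerAccessAlreadyEnabled) {
            (void) cudaGetLastError(); // already on: clear the sticky error and continue
        } else if (err != cudaSuccess) {
            fprintf(stderr, "peer access %d -> %d: %s\n", dev, peer, cudaGetErrorString(err));
        }
    }

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (i != j) {
                    enable_peer_access(i, j);
                }
            }
        }
        return 0;
    }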
|
|
|
|
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
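The entry above adds minLength/maxLength support to the JSON-schema-to-grammar converter (examples/json-schema-to-grammar.py). The converter itself is a Python script; the snippet below is only a hypothetical C++ sketch of the underlying idea, namely unrolling a bounded repetition into a GBNF-style rule. The helper name repeat_rule and the exact rule shape are assumptions, not the converter's output format.

    // Sketch: unroll a {min,max} repetition into a GBNF-style rule body.
    // repeat_rule("char", 2, 4) produces: char char (char (char)?)?
    #include <iostream>
    #include <string>

    static std::string repeat_rule(const std::string & item, int min_count, int max_count) {
        std::string out;
        for (int i = 0; i < min_count; ++i) {
            out += (i ? " " : "") + item; // mandatory occurrences
        }
        std::string tail; // optional occurrences, nested so order is preserved
        for (int i = max_count; i > min_count; --i) {
            tail = "(" + item + (tail.empty() ? "" : " " + tail) + ")?";
        }
        if (!tail.empty()) {
            out += out.empty() ? tail : " " + tail;
        }
        return out;
    }

    int main() {
        // A string with minLength 2 and maxLength 4 could then be expressed as:
        //   root ::= "\"" char char (char (char)?)? "\""
        std::cout << "root ::= \"\\\"\" " << repeat_rule("char", 2, 4) << " \"\\\"\"\n";
    }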
|
|
|
|
|
|
ggml-ci
|
|
|
|
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate the Metal library embedding assembly on the fly during the build process
|
|
* cmake : pass -Werror through -Xcompiler
ggml-ci
* make, cmake : enable CUDA errors on warnings
ggml-ci
|
|
|
|
|
|
* rm unwanted sycl compile options
* fix bug
* fix bug
* format fix
|
|
|
|
This is a follow-up to commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors"), but this time
the change is to the BakLLaVA-specific part of the surgery script.
I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Fixed the baby-llama issue (see issue #4830)
* minor : fix whitespaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama: add llama_chat_apply_template
* test-chat-template: remove redundant vector
* chat_template: do not use std::string for buffer
* add clarification for llama_chat_apply_template
* llama_chat_apply_template: add zephyr template
* llama_chat_apply_template: correct docs
* llama_chat_apply_template: use term "chat" everywhere
* llama_chat_apply_template: change variable name to "tmpl"
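The entry above introduces the llama_chat_apply_template API. The sketch below shows one way to call it, assuming the llama.h signature of this period: the model pointer may be NULL when an explicit template string is supplied, and the return value is the length the formatted chat needs. Treat those details as assumptions if the API has since changed.

    // Sketch: format a small chat with llama_chat_apply_template (llama.h).
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include "llama.h"

    int main() {
        std::vector<llama_chat_message> chat = {
            { "system", "You are a helpful assistant." },
            { "user",   "Hello!"                       },
        };

        // A ChatML-style template string; passing NULL here would instead use the
        // template stored in the model's metadata (assumption based on llama.h docs).
        const char * tmpl =
            "{% for message in messages %}"
            "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n'}}"
            "{% endfor %}"
            "{% if add_generation_prompt %}{{'<|im_start|>assistant\n'}}{% endif %}";

        std::vector<char> buf(1024);
        int32_t n = llama_chat_apply_template(nullptr, tmpl, chat.data(), chat.size(),
                                              /*add_ass=*/true, buf.data(), (int32_t) buf.size());
        if (n > (int32_t) buf.size()) {
            buf.resize(n); // buffer was too small: grow to the reported length and retry
            n = llama_chat_apply_template(nullptr, tmpl, chat.data(), chat.size(),
                                          true, buf.data(), (int32_t) buf.size());
        }
        if (n >= 0) {
            printf("%.*s\n", (int) n, buf.data());
        }
        return 0;
    }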
|
|
* cuda : fix nans in soft_max
* metal : fix nans in soft_max
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
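The entry above fixes NaNs in the CUDA and Metal soft_max kernels; the actual fix lives in the commit itself. For context only, this is the scalar, numerically stable soft_max (max-subtraction) that such kernels follow: without the shift, exp() of large logits overflows and inf/inf turns into NaN.

    // Reference scalar soft_max with the max-subtraction trick.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    static void soft_max(std::vector<float> & x) {
        float max_val = -INFINITY;
        for (float v : x) {
            max_val = std::max(max_val, v);
        }
        float sum = 0.0f;
        for (float & v : x) {
            v = std::exp(v - max_val); // largest exponent becomes 0, so no overflow
            sum += v;
        }
        for (float & v : x) {
            v /= sum;
        }
    }

    int main() {
        std::vector<float> logits = { 1000.0f, 999.0f, 998.0f }; // naive exp() would overflow
        soft_max(logits);
        for (float p : logits) {
            printf("%f\n", p);
        }
    }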
|
|
Added 1.5-bit quantization to README.md
|
|
* #ifdef out some NUMA code blocks for Android due to lack of support
* added some __ANDROID__ #ifdef gates around the NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper
* changed the gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc
* harmonized the #if defined blocks for the NUMA code on __gnu_linux__, since that is the only model being followed anyway
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
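The entry above gates the NUMA code on __gnu_linux__ and falls back to a raw syscall for getcpu on glibc older than 2.29, which is the release that added the getcpu() wrapper. A self-contained sketch of that gating is shown below; current_cpu is an illustrative helper, not the function used in ggml.

    // Sketch: __gnu_linux__ / glibc-version gating for getcpu.
    #if defined(__gnu_linux__)
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
        #include <sched.h>        // getcpu() wrapper (glibc >= 2.29)
        #include <sys/syscall.h>  // SYS_getcpu
        #include <unistd.h>       // syscall()
    #endif

    #include <cstdio>

    static unsigned int current_cpu(void) {
    #if defined(__gnu_linux__)
        unsigned int cpu  = 0;
        unsigned int node = 0;
        #if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 29))
            getcpu(&cpu, &node);                    // wrapper is available
        #else
            syscall(SYS_getcpu, &cpu, &node, NULL); // pre-2.29 glibc: raw syscall
        #endif
        return cpu;
    #else
        return 0; // other platforms: the NUMA code is compiled out entirely
    #endif
    }

    int main() {
        printf("running on cpu %u\n", current_cpu());
    }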
|
|
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
|
|
|
|
ggml-ci
|
|
|
|
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
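The entry above surfaces min_keep as its own sampling parameter. As an illustration of its meaning (this is not the llama.cpp implementation): a truncation sampler such as top-p must keep at least min_keep candidates, even when the probability threshold is reached earlier.

    // Illustrative min_keep semantics for a top-p style cutoff.
    // Candidates are assumed to be sorted by probability, highest first.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    static size_t top_p_cutoff(const std::vector<float> & sorted_probs, float p, size_t min_keep) {
        float  cum  = 0.0f;
        size_t keep = sorted_probs.size();
        for (size_t i = 0; i < sorted_probs.size(); ++i) {
            cum += sorted_probs[i];
            if (cum >= p && i + 1 >= min_keep) { // respect min_keep even once p is reached
                keep = i + 1;
                break;
            }
        }
        return keep;
    }

    int main() {
        std::vector<float> probs = { 0.70f, 0.15f, 0.08f, 0.04f, 0.03f };
        // With p = 0.5 alone only the first candidate survives; min_keep = 3 keeps three.
        printf("keep %zu candidates\n", top_p_cutoff(probs, 0.5f, 3));
    }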
|
|
|
|
|
|
|
|
* server: enrich health endpoint with available slots, return 503 if no slots are available
* server: document the new "no slot available" status in the README.md
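The entry above extends the health endpoint with slot information and returns 503 when no slot is free. A hypothetical sketch of that check follows; the struct and field names are illustrative, not the server's actual types.

    // Sketch: map "no idle slot" to HTTP 503 Service Unavailable.
    #include <cstdio>
    #include <vector>

    struct slot { bool busy; };

    static int health_status(const std::vector<slot> & slots, int & idle_out) {
        idle_out = 0;
        for (const slot & s : slots) {
            if (!s.busy) {
                idle_out++;
            }
        }
        return idle_out > 0 ? 200 : 503;
    }

    int main() {
        std::vector<slot> slots = { { true }, { true } }; // both slots busy
        int idle   = 0;
        int status = health_status(slots, idle);
        printf("{\"status\": %d, \"slots_idle\": %d, \"slots_total\": %zu}\n",
               status, idle, slots.size());
    }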
|
|
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix the LF in the printed usage for the new --n-predict option
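The entry above ensures a client request cannot raise n_predict above the server's --n-predict limit. A hedged sketch of such clamping logic (illustrative only, not the server's code):

    // Sketch: a per-request n_predict may lower, but never raise, the server-wide
    // --n-predict limit; a negative value means "unset"/"unlimited".
    #include <algorithm>
    #include <cstdio>

    static int effective_n_predict(int server_limit, int requested) {
        if (server_limit < 0) return requested;     // no server-side cap configured
        if (requested    < 0) return server_limit;  // client left it unset: use the cap
        return std::min(requested, server_limit);   // never exceed the cap
    }

    int main() {
        printf("%d\n", effective_n_predict(128, 4096)); // clamped to 128
        printf("%d\n", effective_n_predict(128,   -1)); // defaults to 128
        printf("%d\n", effective_n_predict( -1, 4096)); // no cap: 4096
    }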
|
|
This updates the server queue to support graceful shutdown of the server on signals.
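A generic sketch of this shutdown pattern is shown below, assuming nothing about the server's internals: the signal handler only flips a flag, and the processing loop drains and exits once it sees it.

    // Sketch: signal-driven graceful shutdown of a worker/queue loop.
    #include <atomic>
    #include <chrono>
    #include <csignal>
    #include <cstdio>
    #include <thread>

    static std::atomic<bool> g_running{true};

    static void signal_handler(int /*sig*/) {
        g_running.store(false); // only flip a flag: that is all that is safe here
    }

    int main() {
        std::signal(SIGINT,  signal_handler);
        std::signal(SIGTERM, signal_handler);

        while (g_running.load()) {
            // ... pop and process queued tasks here ...
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }

        // Reached after SIGINT/SIGTERM: finish in-flight work, then exit cleanly.
        printf("shutting down gracefully\n");
        return 0;
    }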
|
|
|
|
|
|
This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.
The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been modified
and would have to be checked out again to undo this. If this can be
avoided, I think it would be preferable.
I did not perform this change for BakLLaVA models as I am not sure
how that part works.
|
|
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output
At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics
Dequantize works, but not dot product
* iq1_s: Metal works, but quite slow
As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
→ 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
|
|
* ggml : avoid recomputing alibi slopes (CPU)
* llama : reuse hparams.f_max_alibi_bias in all cases
ggml-ci
* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)
ggml-ci
* ggml : handle all SRCs (do not break on first null)
ggml-ci
* tests : do not use slope for large soft_max
accumulates too much error
ggml-ci
* ggml : alternative ALiBi without extra tensor
We compute the slopes in the kernel
ggml-ci
* cuda : add ALiBi support in ggml_soft_max_ext
ggml-ci
* ggml : deprecate ggml_alibi
* ggml : support multi-sequence ALiBi (Metal)
ggml-ci
* cuda : add multi-seq ALiBi + remove F16 soft_max
ggml-ci
* ggml : update deprecation message
* ggml : fix pos ptr when no ALiBi
ggml-ci
* cuda : fix performance (pow -> powf)
* cuda : precompute ALiBi constants
* metal : pre-compute ALiBi slopes
ggml-ci
* llama : init kq_pos only if needed
ggml-ci
* test-backend-ops : add null pos test to soft_max
test-backend-ops : replace soft_max tests
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
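Several of the steps above compute the ALiBi slopes directly in the kernel instead of passing an extra tensor. A standalone sketch of the per-head slope formula used by ggml-style implementations follows; treat the exact expressions as an assumption.

    // Sketch: per-head ALiBi slopes computed on the fly (no extra tensor).
    // Heads up to the largest power of two use a geometric sequence based on m0;
    // the remaining heads use the interleaved half-step sequence based on m1.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    static float alibi_slope(uint32_t h, uint32_t n_head, float max_bias) {
        const uint32_t n_head_log2 = 1u << (uint32_t) std::floor(std::log2((float) n_head));
        const float m0 = std::pow(2.0f, -(max_bias       ) / n_head_log2);
        const float m1 = std::pow(2.0f, -(max_bias / 2.0f) / n_head_log2);
        return h < n_head_log2 ? std::pow(m0, (float) (h + 1))
                               : std::pow(m1, (float) (2*(h - n_head_log2) + 1));
    }

    int main() {
        const uint32_t n_head   = 12;   // e.g. a 12-head model
        const float    max_bias = 8.0f; // corresponds to hparams.f_max_alibi_bias
        for (uint32_t h = 0; h < n_head; ++h) {
            printf("head %2u: slope = %g\n", h, alibi_slope(h, n_head, max_bias));
        }
    }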
|
|
* feat(ci): add an option to fail on compile warning
* Update CMakeLists.txt
* minor : fix compile warnings
ggml-ci
* ggml : fix unreachable code warnings
ggml-ci
* ci : disable fatal warnings for windows, ios and tvos
* ggml : fix strncpy warning
* ci : disable fatal warnings for MPI build
* ci : add fatal warnings to ggml-ci
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|