Age | Commit message | Author |
|
There are a couple of things to note in this architecture:
1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.
More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
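A minimal sketch, in ggml terms, of what point 1 means: the same token-embedding tensor is used both for the input lookup and as the output projection, so no separate lm_head weight is stored. Tensor names and shapes here are illustrative, not the actual Gemma build code in llama.cpp.

```cpp
#include "ggml.h"

// Sketch: the same token-embedding matrix serves both as the input lookup
// table and as the output (logits) projection, so no separate lm_head tensor
// is stored. Shapes follow ggml's [ne0, ne1] convention and are illustrative.
static struct ggml_tensor * tied_embedding_logits(
        struct ggml_context * ctx,
        struct ggml_tensor  * tok_embd,   // [n_embd, n_vocab]
        struct ggml_tensor  * tokens,     // [n_tokens], I32 token ids
        struct ggml_tensor  * hidden) {   // [n_embd, n_tokens], final hidden states
    // input side: embedding lookup through tok_embd
    struct ggml_tensor * inp = ggml_get_rows(ctx, tok_embd, tokens);
    (void) inp; // would feed the transformer blocks in a real graph

    // output side: the very same tok_embd acts as the output projection,
    // producing [n_vocab, n_tokens] logits
    return ggml_mul_mat(ctx, tok_embd, hidden);
}
```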
|
|
* [SYCL] context: add name
* name should start with SYCL*
|
|
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq call that was missing inside the brackets
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
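A rough sketch of the scheme the log above describes: blocks of 32 weights stored as 4-bit indices into a non-linear 16-entry codebook, scaled by a per-block factor. The codebook values and the nibble layout below are illustrative placeholders, not the real kvalues_iq4nl table or block layout in ggml.

```cpp
#include <cstdint>

// Illustrative non-linear 4-bit codebook: 16 signed levels, denser near zero.
// Placeholder values -- not the real kvalues_iq4nl table from ggml-quants.
static const int8_t kvalues_sketch[16] = {
    -127, -96, -72, -52, -36, -24, -14, -6, 2, 10, 20, 32, 48, 68, 92, 120,
};

// Dequantize one block of 32 weights: a per-block scale d plus 16 bytes of
// packed 4-bit indices (the nibble layout here is illustrative).
static void dequantize_block_32(float d, const uint8_t qs[16], float out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = d * kvalues_sketch[qs[i] & 0x0F];
        out[2*i + 1] = d * kvalues_sketch[qs[i] >> 4];
    }
}
```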
|
|
* server: initial working llava 1.6 support
* move clip_image to header
* remove commented code
* remove c++ style from header
* remove todo
* expose llava_image_embed_make_with_clip_img
* fix zig build
|
|
|
|
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.
The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* server: use llama_chat_apply_template
* server: remove trailing space
* server: fix format_chat
* server: fix help message
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: fix formatted_chat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Add maid to ui list
* Specify licence
|
|
* add build support for embedded metal library
* Update Makefile
---------
Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* Update ggml_sycl_op_mul_mat_vec_q
* Apply suggestions from code review
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* revert suggestion on macro
* fix bug
* Add quant type GGML_TYPE_IQ1_S to unsupported
* fix format
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
|
Author: Philip Taron <philip.taron@gmail.com>
Date: Tue Feb 13 20:28:02 2024 +0000
|
|
|
|
up ggml_vk_instance_init()
|
|
|
|
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
|
|
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
|
|
Closes #5304
|
|
* cuda : ignore peer access already enabled errors
* fix hip
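For context, the first bullet amounts to treating cudaErrorPeerAccessAlreadyEnabled as a no-op rather than a fatal error. A simplified host-side C++ sketch using the CUDA runtime API, not the actual backend code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Enable peer access from `device` to `peer`, treating "already enabled" as success.
static bool enable_peer_access(int device, int peer) {
    cudaSetDevice(device);
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        // not a real failure: clear the sticky error state and carry on
        (void) cudaGetLastError();
        return true;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "peer access %d -> %d failed: %s\n", device, peer, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```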
|
|
|
|
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
ggml-ci
|
|
|
|
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate the Metal library embedding assembly on the fly during the build process
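A sketch of the embedding idea from the C/C++ side: the build emits a small assembly stub that `.incbin`s ggml-metal.metal between two symbols, and the runtime reads the source text from those symbols instead of loading a file. The symbol names below are assumptions made for this sketch.

```cpp
#include <string>

// Provided by a generated assembly stub along the lines of:
//     _start_symbol: .incbin "ggml-metal.metal"
//     _end_symbol:
// The symbol names are assumptions for this sketch.
extern const char ggml_metallib_start[];
extern const char ggml_metallib_end[];

// Return the embedded Metal source as a string, so the backend can compile it
// directly instead of locating ggml-metal.metal on disk at runtime.
static std::string embedded_metal_source() {
    return std::string(ggml_metallib_start, ggml_metallib_end - ggml_metallib_start);
}
```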
|
|
* cmake : pass -Werror through -Xcompiler
ggml-ci
* make, cmake : enable CUDA errors on warnings
ggml-ci
|
|
|
|
|
|
* rm unwanted sycl compile options
* fix bug
* fix bug
* format fix
|
|
|
|
This is a follow-up to commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors"), but this time
the change is to the BakLLaVA-specific part of the surgery script.
I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Fixed the baby-llama issue (see issue #4830)
* minor : fix whitespaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama: add llama_chat_apply_template
* test-chat-template: remove redundant vector
* chat_template: do not use std::string for buffer
* add clarification for llama_chat_apply_template
* llama_chat_apply_template: add zephyr template
* llama_chat_apply_template: correct docs
* llama_chat_apply_template: use term "chat" everywhere
* llama_chat_apply_template: change variable name to "tmpl"
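A minimal usage sketch for the new API, assuming the llama.h prototype of the time (llama_chat_message with role/content fields, and a call that writes into a caller-provided buffer and returns the required length); check the header for the exact signature before relying on it.

```cpp
#include "llama.h"

#include <cstdint>
#include <string>
#include <vector>

// Format a short conversation with the model's built-in chat template
// (tmpl == nullptr). Returns the formatted prompt, or "" on failure.
static std::string format_chat_prompt(const llama_model * model) {
    std::vector<llama_chat_message> chat = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };

    std::string buf(1024, '\0');
    int32_t res = llama_chat_apply_template(model, /*tmpl =*/ nullptr,
                                            chat.data(), chat.size(),
                                            /*add_ass =*/ true,
                                            &buf[0], (int32_t) buf.size());
    if (res < 0) {
        return "";
    }
    if ((size_t) res > buf.size()) {
        // the call reports the required size when the buffer is too small
        buf.resize(res);
        res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                        true, &buf[0], (int32_t) buf.size());
    }
    buf.resize(res > 0 ? res : 0);
    return buf;
}
```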
|
|
* cuda : fix nans in soft_max
* metal : fix nans in soft_max
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
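For context on the class of bug: a soft_max row that is fully masked with -INF can end up as 0/0 = NaN unless handled. A scalar C++ sketch of the safe formulation, not the actual CUDA/Metal kernels:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Numerically safe softmax over one row: subtracting the row max avoids
// overflow, and a fully masked row (all -INF) yields zeros instead of NaNs.
static void soft_max_row(const float * x, float * y, size_t n) {
    float max_val = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < n; ++i) {
        max_val = std::fmax(max_val, x[i]);
    }

    const bool all_masked = max_val == -std::numeric_limits<float>::infinity();

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        // exp(-INF - max) == 0, so masked positions contribute nothing
        y[i] = all_masked ? 0.0f : std::exp(x[i] - max_val);
        sum += y[i];
    }

    const float inv_sum = sum > 0.0f ? 1.0f/sum : 0.0f;
    for (size_t i = 0; i < n; ++i) {
        y[i] *= inv_sum;
    }
}
```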
|
|
Added 1.5-bit to README.md
|
|
* #ifdef out some NUMA code blocks for Android due to lack of support
* added some __ANDROID__ #ifdef gates around the NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper
* Changed the gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc
* Harmonized the #if defined blocks for the NUMA code on __gnu_linux__, since that is the only model being followed anyway (a sketch of the resulting gating pattern follows below)
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
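A hedged sketch of the final gating pattern described above: NUMA-specific code behind __gnu_linux__, with a raw getcpu syscall fallback when glibc is older than 2.29 (the version that introduced the getcpu() wrapper). The exact macros and helpers in the real change may differ.

```cpp
#if defined(__gnu_linux__)
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

// Return the CPU the calling thread is running on, or -1 on failure.
static int current_cpu(void) {
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 29))
    unsigned int cpu = 0, node = 0;
    return getcpu(&cpu, &node) == 0 ? (int) cpu : -1;
#else
    // glibc < 2.29 has no getcpu() wrapper: fall back to the raw syscall
    unsigned int cpu = 0, node = 0;
    return syscall(SYS_getcpu, &cpu, &node, nullptr) == 0 ? (int) cpu : -1;
#endif
}
#else
// non-glibc platforms (including Android/Bionic): NUMA code path disabled
static int current_cpu(void) { return -1; }
#endif
```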
|
|
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
|
|
|
|
ggml-ci
|
|
|
|
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
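For readers unfamiliar with the parameter: min_keep puts a floor on how many candidates the truncating samplers (top-p and friends) may keep. A generic sketch of the idea, not the server's actual implementation:

```cpp
#include <cstddef>
#include <vector>

struct candidate { int token; float p; }; // candidates sorted by p, descending

// Keep tokens until the cumulative probability reaches top_p, but never keep
// fewer than min_keep candidates.
static void top_p_truncate(std::vector<candidate> & cands, float top_p, size_t min_keep) {
    float cum = 0.0f;
    size_t keep = cands.size();
    for (size_t i = 0; i < cands.size(); ++i) {
        cum += cands[i].p;
        // stop only once both the mass threshold and the min_keep floor are met
        if (cum >= top_p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    cands.resize(keep);
}
```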
|
|
|
|
|
|
|
|
* server: enrich health endpoint with available slots, return 503 if no slots are available (see the sketch below)
* server: document new status no slot available in the README.md
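A hedged sketch of the behaviour described in the first bullet, with illustrative field names (the real endpoint lives in examples/server): report slot usage and answer 503 when no slot is idle.

```cpp
#include <string>
#include <utility>

// Build the /health response: HTTP status plus a small JSON body.
// Field names are illustrative, not necessarily what the server returns.
static std::pair<int, std::string> health_response(int slots_idle, int slots_processing) {
    const bool ok = slots_idle > 0;
    const std::string body =
        std::string("{\"status\":\"") + (ok ? "ok" : "no slot available") + "\","
        "\"slots_idle\":"       + std::to_string(slots_idle) + ","
        "\"slots_processing\":" + std::to_string(slots_processing) + "}";
    return { ok ? 200 : 503, body };
}
```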
|
|
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix the LF in the printed usage for the new --n-predict option
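The second bullet boils down to clamping the per-request value against the server-wide setting; a sketch with assumed variable names:

```cpp
#include <algorithm>
#include <cstdint>

// If the server was started with a global --n-predict limit, a client request
// may only lower it, never raise or unset it. Variable names are illustrative.
static int32_t effective_n_predict(int32_t server_n_predict, int32_t request_n_predict) {
    if (server_n_predict < 0) {
        return request_n_predict;              // no server-side cap configured
    }
    if (request_n_predict < 0) {
        return server_n_predict;               // "unlimited" request gets capped
    }
    return std::min(server_n_predict, request_n_predict);
}
```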
|
|
This updates the server queue to support graceful shutdown of the server on signals.
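A condensed sketch of the usual pattern (a signal handler flips an atomic flag; the queue loop polls it with a timed wait, since notifying a condition variable from a signal handler is not async-signal-safe, and then drains and exits); the actual server code is more involved.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <csignal>
#include <mutex>

static std::atomic<bool> g_running{true};
static std::condition_variable g_cv;
static std::mutex g_mutex;

// Signal handler: only flip the flag; the queue loop does the real work.
static void on_shutdown_signal(int) {
    g_running.store(false);
}

// Queue loop: wake up periodically (or on new work) and exit cleanly when asked.
static void queue_loop() {
    std::signal(SIGINT,  on_shutdown_signal);
    std::signal(SIGTERM, on_shutdown_signal);

    while (g_running.load()) {
        std::unique_lock<std::mutex> lock(g_mutex);
        // timed wait: the flag is re-checked on a short interval because the
        // handler cannot safely notify the condition variable
        g_cv.wait_for(lock, std::chrono::milliseconds(100), [] {
            return !g_running.load(); // or: pending tasks are available
        });
        // ... pop and process pending tasks here ...
    }
    // ... drain remaining tasks, release slots, free the model ...
}
```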
|
|
|
|
|