author    Kawrakow <48489457+ikawrakow@users.noreply.github.com>  2024-09-05 07:46:47 +0300
committer GitHub <noreply@github.com>  2024-09-05 07:46:47 +0300
commit    7b1b2b2c06c1729139135c9e47611af7161de6f7 (patch)
tree      ab79924dbb9f2ff780dd669fa65f826aae74d0b7 /examples
parent    f17d0d72f565bf24d6eb8aa67d6618cdc143961d (diff)
Zen4 Flash Attention - bf16 support (#38)
* Zen4 Flash Attention: WIP bf16

* Zen4 Flash Attention: bf16 seems to be working

* Zen4 Flash Attention: improving bf16

* Zen4 Flash Attention: improving bf16

  It is better (slightly faster) to first convert Q to bf16 before processing each
  block of q_step rows. This requires D*q_step*sizeof(bf16) bytes, i.e. at most
  4 kB for the head sizes we support, so we can simply allocate the buffer on the
  stack instead of reserving and passing a work buffer in ggml.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
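The reasoning above hinges on the size bound D*q_step*sizeof(bf16) <= 4 kB, which is what makes a stack buffer viable. Below is a minimal, self-contained C++ sketch of that idea only, not the actual ggml/iqk flash-attention code: the fp32_to_bf16 helper, the example values D = 128 and q_step = 8, and the buffer names are illustrative assumptions.

#include <cstdint>
#include <cstring>
#include <cstdio>

// Illustrative f32 -> bf16 conversion (round to nearest even, NaN handling omitted).
static inline uint16_t fp32_to_bf16(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    u += 0x7fff + ((u >> 16) & 1);   // round to nearest even before truncating
    return uint16_t(u >> 16);
}

int main() {
    constexpr int D      = 128;  // example head size
    constexpr int q_step = 8;    // example number of Q rows processed per block

    // D*q_step*sizeof(bf16) = 128*8*2 = 2048 bytes <= 4 kB, so a plain stack
    // buffer suffices; no ggml work buffer needs to be reserved and passed around.
    uint16_t q_bf16[D * q_step];

    // Stand-in for a block of q_step rows of Q in f32.
    float q_f32[D * q_step];
    for (int i = 0; i < D * q_step; ++i) q_f32[i] = 0.01f * float(i);

    // Convert the whole block up front, before iterating over K/V.
    for (int i = 0; i < D * q_step; ++i) q_bf16[i] = fp32_to_bf16(q_f32[i]);

    std::printf("stack buffer size: %zu bytes\n", sizeof(q_bf16));
    return 0;
}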
Diffstat (limited to 'examples')
-rw-r--r--   examples/llama-bench/llama-bench.cpp   3
1 file changed, 3 insertions(+), 0 deletions(-)
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 813d7bae..fc77be50 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -306,6 +306,9 @@ static ggml_type ggml_type_from_name(const std::string & s) {
if (s == "f16") {
return GGML_TYPE_F16;
}
+ if (s == "bf16") {
+ return GGML_TYPE_BF16;
+ }
if (s == "q8_0") {
return GGML_TYPE_Q8_0;
}
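With this change, ggml_type_from_name() in llama-bench recognizes "bf16" alongside "f16" and the quantized types, so bf16 can presumably be selected wherever llama-bench accepts a type name (for example the KV-cache type options), allowing the new bf16 flash-attention path to be benchmarked.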