author     Kawrakow <48489457+ikawrakow@users.noreply.github.com>   2024-08-20 17:15:47 +0300
committer  GitHub <noreply@github.com>                              2024-08-20 17:15:47 +0300
commit     d259a50ca6fd3a0821abe6a16b73c0b19c5b4651
tree       4f83bbbbbbd9323192d8c0bceb51de5b0fb620c2 /examples
parent     a325745000114a43c1546323f91720db503ed0a9
Fused soft cap and SIMD-ified GeLU (#9)
* Softcap: WIP
Fuses scale + tanh + scale as used for soft-capping in some
models.
CPU only for now. ~1.4% speedup for PP-512 on Gemma2-9b, no effect on TG.
Somewhat surprisingly, the improvement does not increase as I
go to longer contexts. Gemma2 applies the softcap to K*Q, which
grows quadratically with context length, so I would have
expected the benefit of fusing scale, tanh, scale to grow
with context. But no such luck.
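For reference, the fused op computes y = s_after * tanh(s_before * x)
in a single pass over the tensor instead of three separate ops.
A minimal scalar sketch (the function and parameter names are
illustrative, not the actual kernel; for Gemma2's attention softcap
the parameterization would be s_before = 1/cap, s_after = cap):

    #include <cmath>

    // Hypothetical scalar reference for the fused soft-cap:
    // one pass instead of scale -> tanh -> scale as three ops.
    static void softcap_ref(const float * x, float * y, int n,
                            float s_before, float s_after) {
        for (int i = 0; i < n; ++i) {
            y[i] = s_after * tanhf(s_before * x[i]);
        }
    }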
* softcap: CUDA
~1% speedup for Gemma2-9b
* softcap: Metal and NEON
About 1% speedup.
* SIMD-ified GeLU
Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2.
It looks like the GeLU operation is memory-bound on my CPUs
after SIMD-ifying it, so by not using the 128 kB GeLU lookup
table we gain a small additional advantage.
On the M2-Max the lookup table is slightly faster than the SIMD
version, so I left the lookup table in place for ARM_NEON.
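For context, what the SIMD code vectorizes is the usual tanh-based
GeLU approximation; the fp16 lookup table tabulates essentially this
function for every 16-bit input pattern (2^16 entries x 2 bytes =
128 kB). A scalar sketch (the actual vector kernels are not shown):

    #include <cmath>

    // Scalar form of the tanh approximation of GeLU:
    // 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    static inline float gelu_ref(float x) {
        const float a = 0.044715f;    // cubic coefficient
        const float s = 0.79788456f;  // sqrt(2/pi)
        return 0.5f*x*(1.0f + tanhf(s*x*(1.0f + a*x*x)));
    }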
* softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512)
Not that I have encountered this in practice, but just to be sure.
This does it for AVX512 and AVX2; ARM_NEON still needs a guard.
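The failure mode being guarded against: when tanh is computed via
exp, a large argument overflows to infinity and the ratio becomes
inf/inf = NaN. A scalar sketch of the idea (the clamp bound is
illustrative; the real kernels do the equivalent in SIMD registers):

    #include <cmath>

    // exp(2x) overflows to inf for large x, and then
    // (inf - 1) / (inf + 1) evaluates to NaN.
    // Clamping first is safe: tanh(x) saturates to +/-1
    // long before |x| reaches 10.
    static inline float tanh_guarded(float x) {
        x = fminf(fmaxf(x, -10.0f), 10.0f);  // illustrative bound
        const float e2x = expf(2.0f * x);
        return (e2x - 1.0f) / (e2x + 1.0f);
    }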
* llama-bench: add ability to turn off warmup runs
So we don't need to wait forever on, e.g., benchmarks involving
long contexts.
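With the -w flag added in the diff below, a long-context benchmark
can skip the warmup pass, e.g. (the model path is a placeholder):

    ./llama-bench -m model.gguf -p 8192 -n 0 -w 0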
* softcap, tanh: avoid NaNs for large arguments (NEON)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
-rw-r--r--   examples/llama-bench/llama-bench.cpp   24
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 42918bfc..813d7bae 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -237,6 +237,7 @@ struct cmd_params {
     ggml_numa_strategy numa;
     int reps;
     bool verbose;
+    bool warmup;
     output_formats output_format;
     output_formats output_format_stderr;
 };
@@ -263,6 +264,7 @@ static const cmd_params cmd_params_defaults = {
     /* numa                 */ GGML_NUMA_STRATEGY_DISABLED,
     /* reps                 */ 5,
     /* verbose              */ false,
+    /* warmup               */ true,
     /* output_format        */ MARKDOWN,
     /* output_format_stderr */ NONE,
 };
@@ -295,6 +297,7 @@ static void print_usage(int /* argc */, char ** argv) {
     printf("  -o, --output <csv|json|md|sql>      (default: %s)\n", output_format_str(cmd_params_defaults.output_format));
     printf("  -oe, --output-err <csv|json|md|sql> (default: %s)\n", output_format_str(cmd_params_defaults.output_format_stderr));
     printf("  -v, --verbose                       (default: %s)\n", cmd_params_defaults.verbose ? "1" : "0");
+    printf("  -w, --warmup <0|1>                  (default: %s)\n", cmd_params_defaults.warmup ? "1" : "0");
     printf("\n");
     printf("Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.\n");
 }
@@ -338,6 +341,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
     params.output_format_stderr = cmd_params_defaults.output_format_stderr;
     params.reps = cmd_params_defaults.reps;
     params.numa = cmd_params_defaults.numa;
+    params.warmup = cmd_params_defaults.warmup;
 
     for (int i = 1; i < argc; i++) {
         arg = argv[i];
@@ -555,6 +559,12 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
             invalid_param = !output_format_from_str(argv[i], params.output_format_stderr);
         } else if (arg == "-v" || arg == "--verbose") {
             params.verbose = true;
+        } else if (arg == "-w" || arg == "--warmup") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.warmup = std::stoi(argv[i]);
         } else {
             invalid_param = true;
             break;
@@ -1429,12 +1439,14 @@ int main(int argc, char ** argv) {
         llama_kv_cache_clear(ctx);
 
         // warmup run
-        if (t.n_prompt > 0) {
-            //test_prompt(ctx, std::min(t.n_batch, std::min(t.n_prompt, 32)), 0, t.n_batch, t.n_threads);
-            test_prompt(ctx, t.n_prompt, 0, t.n_batch, t.n_threads);
-        }
-        if (t.n_gen > 0) {
-            test_gen(ctx, 1, 0, t.n_threads);
+        if (params.warmup) {
+            if (t.n_prompt > 0) {
+                //test_prompt(ctx, std::min(t.n_batch, std::min(t.n_prompt, 32)), 0, t.n_batch, t.n_threads);
+                test_prompt(ctx, t.n_prompt, 0, t.n_batch, t.n_threads);
+            }
+            if (t.n_gen > 0) {
+                test_gen(ctx, 1, 0, t.n_threads);
+            }
         }
 
         for (int i = 0; i < params.reps; i++) {