author     Kawrakow <48489457+ikawrakow@users.noreply.github.com>   2024-08-20 17:15:47 +0300
committer  GitHub <noreply@github.com>                              2024-08-20 17:15:47 +0300
commit     d259a50ca6fd3a0821abe6a16b73c0b19c5b4651
tree       4f83bbbbbbd9323192d8c0bceb51de5b0fb620c2 /examples
parent     a325745000114a43c1546323f91720db503ed0a9
Fused soft cap and SIMD-ified GeLU (#9)
* Softcap: WIP
Fuses scale + tanh + scale as used for soft-capping in some
models.
CPU only for now. ~1.4% speedup for PP-512 on Gemma2-9b, no effect on TG.
Somewhat surprisingly, the improvement does not increase as I
go to longer contexts. Gemma2 applies the softcap to K*Q, which
grows quadratically with context length, so I would have
expected the benefit of fusing scale, tanh, scale to grow
with context. But no such luck.
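For reference, the fused op computes y = s_after * tanh(s_before * x)
in a single pass over the tensor instead of three separate ops.
A minimal scalar sketch (the function and parameter names are
illustrative, not the actual kernel; for Gemma2's attention softcap
the parameterization would be s_before = 1/cap, s_after = cap):

    #include <cmath>

    // Hypothetical scalar reference for the fused soft-cap:
    // one pass instead of scale -> tanh -> scale as three ops.
    static void softcap_ref(const float * x, float * y, int n,
                            float s_before, float s_after) {
        for (int i = 0; i < n; ++i) {
            y[i] = s_after * tanhf(s_before * x[i]);
        }
    }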
* softcap: CUDA
~1% speedup for Gemma2-9b
* softcap: Metal and NEON
About 1% speedup.
* SIMD-ified GeLU
Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2.
It looks like the GeLU operation is memory-bound on my CPUs
after SIMD-ifying it, so by not using the 128 kB GeLU lookup
table we gain a small additional advantage.
On the M2-Max the lookup table is slightly faster than the SIMD
version, so I left the lookup table in place for ARM_NEON.
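For context, what the SIMD code vectorizes is the usual tanh-based
GeLU approximation; the fp16 lookup table tabulates essentially this
function for every 16-bit input pattern (2^16 entries x 2 bytes =
128 kB). A scalar sketch (the actual vector kernels are not shown):

    #include <cmath>

    // Scalar form of the tanh approximation of GeLU:
    // 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    static inline float gelu_ref(float x) {
        const float a = 0.044715f;    // cubic coefficient
        const float s = 0.79788456f;  // sqrt(2/pi)
        return 0.5f*x*(1.0f + tanhf(s*x*(1.0f + a*x*x)));
    }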
* softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512)
Not that I have encountered this in practice, but just to be sure.
This does it for AVX512 and AVX2; ARM_NEON still needs a guard.
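The failure mode being guarded against: when tanh is computed via
exp, a large argument overflows to infinity and the ratio becomes
inf/inf = NaN. A scalar sketch of the idea (the clamp bound is
illustrative; the real kernels do the equivalent in SIMD registers):

    #include <cmath>

    // exp(2x) overflows to inf for large x, and then
    // (inf - 1) / (inf + 1) evaluates to NaN.
    // Clamping first is safe: tanh(x) saturates to +/-1
    // long before |x| reaches 10.
    static inline float tanh_guarded(float x) {
        x = fminf(fmaxf(x, -10.0f), 10.0f);  // illustrative bound
        const float e2x = expf(2.0f * x);
        return (e2x - 1.0f) / (e2x + 1.0f);
    }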
* llama-bench: add ability to turn off warmup runs
So we don't need to wait forever on, e.g., benchmarks involving
long contexts.
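With the -w flag added in the diff below, a long-context benchmark
can skip the warmup pass, e.g. (the model path is a placeholder):

    ./llama-bench -m model.gguf -p 8192 -n 0 -w 0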
* softcap, tanh: avoid NaNs for large arguments (NEON)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
-rw-r--r--   examples/llama-bench/llama-bench.cpp   24
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 42918bfc..813d7bae 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -237,6 +237,7 @@ struct cmd_params {
     ggml_numa_strategy numa;
     int reps;
     bool verbose;
+    bool warmup;
     output_formats output_format;
     output_formats output_format_stderr;
 };
@@ -263,6 +264,7 @@ static const cmd_params cmd_params_defaults = {
     /* numa                 */ GGML_NUMA_STRATEGY_DISABLED,
     /* reps                 */ 5,
     /* verbose              */ false,
+    /* warmup               */ true,
     /* output_format        */ MARKDOWN,
     /* output_format_stderr */ NONE,
 };
@@ -295,6 +297,7 @@ static void print_usage(int /* argc */, char ** argv) {
     printf("  -o, --output <csv|json|md|sql>      (default: %s)\n", output_format_str(cmd_params_defaults.output_format));
     printf("  -oe, --output-err <csv|json|md|sql> (default: %s)\n", output_format_str(cmd_params_defaults.output_format_stderr));
     printf("  -v, --verbose                       (default: %s)\n", cmd_params_defaults.verbose ? "1" : "0");
+    printf("  -w, --warmup <0|1>                  (default: %s)\n", cmd_params_defaults.warmup ? "1" : "0");
     printf("\n");
     printf("Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.\n");
 }
@@ -338,6 +341,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
     params.output_format_stderr = cmd_params_defaults.output_format_stderr;
     params.reps = cmd_params_defaults.reps;
     params.numa = cmd_params_defaults.numa;
+    params.warmup = cmd_params_defaults.warmup;
 
     for (int i = 1; i < argc; i++) {
         arg = argv[i];
@@ -555,6 +559,12 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
             invalid_param = !output_format_from_str(argv[i], params.output_format_stderr);
         } else if (arg == "-v" || arg == "--verbose") {
             params.verbose = true;
+        } else if (arg == "-w" || arg == "--warmup") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.warmup = std::stoi(argv[i]);
         } else {
             invalid_param = true;
             break;
@@ -1429,12 +1439,14 @@ int main(int argc, char ** argv) {
         llama_kv_cache_clear(ctx);
 
         // warmup run
-        if (t.n_prompt > 0) {
-            //test_prompt(ctx, std::min(t.n_batch, std::min(t.n_prompt, 32)), 0, t.n_batch, t.n_threads);
-            test_prompt(ctx, t.n_prompt, 0, t.n_batch, t.n_threads);
-        }
-        if (t.n_gen > 0) {
-            test_gen(ctx, 1, 0, t.n_threads);
+        if (params.warmup) {
+            if (t.n_prompt > 0) {
+                //test_prompt(ctx, std::min(t.n_batch, std::min(t.n_prompt, 32)), 0, t.n_batch, t.n_threads);
+                test_prompt(ctx, t.n_prompt, 0, t.n_batch, t.n_threads);
+            }
+            if (t.n_gen > 0) {
+                test_gen(ctx, 1, 0, t.n_threads);
+            }
         }
 
         for (int i = 0; i < params.reps; i++) {