| author | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-24 07:57:47 +0200 |
|---|---|---|
| committer | Iwan Kawrakow <iwan.kawrakow@gmail.com> | 2024-07-24 08:04:47 +0200 |
| commit | 2e49f0172f6c11b286a410039ad87433099bc1b9 | |
| tree | 0c3689efb3f86fa7c660f03bde8b5b94b4527118 | |
| parent | abb740c9a4b65dd6b2facc4780a1e9f2f515bd86 | |
ggml: thread synchronization on Arm
On x86, slaren was generous enough to add _mm_pause() to the busy-wait
loop in ggml_barrier(), but every other architecture just busy-spins,
loading an atomic int on every iteration and thus forcing cache
synchronization between the cores. This causes a massive drop in
performance on my M2-Max laptop when using 8 threads. The closest
approximation to _mm_pause() on Arm appears to be
__asm__ __volatile__("isb\n");
After adding this to the busy-wait loop, performance with 8 threads
recovers to the expected level.
-rw-r--r-- | ggml.c | 2 |
1 file changed, 2 insertions, 0 deletions
```diff
@@ -19142,6 +19142,8 @@ static void ggml_barrier(struct ggml_compute_state * state) {
                 }
             #if defined(__SSE3__)
                 _mm_pause();
+            #elif defined __ARM_NEON
+                __asm__ __volatile__("isb\n");
             #endif
             }
             sched_yield();
```
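For readers who want to see the pattern outside the diff context, here is a minimal, self-contained sketch of a spin-wait loop with an architecture-specific relax hint. This is not the ggml code itself: the names spin_relax and spin_wait_for_change are hypothetical and the counter layout is simplified, but the #if / #elif ladder mirrors the one added in this commit.

```c
#include <stdatomic.h>

#if defined(__SSE3__)
#include <immintrin.h>   // _mm_pause()
#endif

// Architecture-specific "relax" hint for busy-wait loops.
static inline void spin_relax(void) {
#if defined(__SSE3__)
    _mm_pause();                       // x86: pause hint, eases memory-order speculation while spinning
#elif defined(__ARM_NEON)
    __asm__ __volatile__("isb\n");     // Arm: ISB briefly stalls the front end, approximating a pause
#endif
}

// Spin until *flag changes from old_value, relaxing on every iteration
// instead of hammering the cache line that holds the atomic counter.
static void spin_wait_for_change(atomic_int * flag, int old_value) {
    while (atomic_load_explicit(flag, memory_order_relaxed) == old_value) {
        spin_relax();
    }
    atomic_thread_fence(memory_order_acquire);   // order later reads after the observed change
}
```

The relaxed load plus relax hint is what keeps the waiting cores from generating constant cross-core cache traffic, which is the behaviour the commit message identifies as the cause of the slowdown on the M2-Max.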