cpu: aarch64: add JIT ASIMD implementations for exp-based eltwise by Anndrey24 · Pull Request #4834 · uxlfoundation/oneDNN

Anndrey24 · 2026-03-16T10:08:55Z

Description

This commit introduces JIT ASIMD implementations for 7 out of 8 exp-based eltwise ops (excluding gelu_erf), following the existing JIT SVE algorithms:

tanh
logistic
elu
mish
swish
gelu_tanh
soft_relu
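For orientation, the seven ops listed above can each be written in terms of `exp`. The following is a minimal scalar sketch of the standard formulas, not the JIT kernels from this PR. The namespace `eltwise_ref`, the function names, and the default `alpha = 1` are assumptions made for illustration, and the naive `soft_relu` overflows for large inputs.

```cpp
#include <cassert>
#include <cmath>

// Scalar reference formulas, each expressed through exp().
// Illustrative only: names and alpha defaults are assumptions, and this is
// not how the vectorised JIT code evaluates them.
namespace eltwise_ref {

// logistic(x) = 1 / (1 + exp(-x))
inline float logistic(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// tanh(x) = 1 - 2 / (exp(2x) + 1)
inline float tanh_exp(float x) {
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}

// elu(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
inline float elu(float x, float alpha = 1.0f) {
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}

// soft_relu(x) = log(1 + exp(x)); naive form, exp() overflows for x > ~88
inline float soft_relu(float x) { return std::log1p(std::exp(x)); }

// mish(x) = x * tanh(soft_relu(x))
inline float mish(float x) { return x * tanh_exp(soft_relu(x)); }

// swish(x) = x * logistic(alpha * x)
inline float swish(float x, float alpha = 1.0f) {
    return x * logistic(alpha * x);
}

// gelu_tanh(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    return 0.5f * x
            * (1.0f + tanh_exp(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}

} // namespace eltwise_ref
```

Because all seven reduce to one or two `exp` evaluations plus simple arithmetic, they can plausibly share a single vectorised `exp` routine.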

Certain optimisations (e.g. removing unnecessary ldr/mov instructions and reducing the number of preserved auxiliary registers) were also applied to those SVE implementations, resulting in performance improvements.

Performance improvements (f32)

The table above summarises the average benchdnn speedup for each op, aggregated over the following shapes:

1536x384
1539x387
1024x4096
1025x4099
4096x4096
4099x4099

For each shape, measurements were collected with 1, 4, 16, and 64 threads; the reported value is the average across all tested configurations.
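The aggregation described above covers 6 shapes and 4 thread counts, i.e. 24 configurations per op. A minimal sketch of that grid and of the averaging step follows. It assumes an arithmetic mean of per-configuration speedups, which the description does not state explicitly, and `Config`, `make_grid`, and `mean_speedup` are hypothetical helper names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One benchmarked configuration: a problem shape and a thread count.
struct Config {
    std::string shape;
    int threads;
};

// Builds the shape x thread-count grid listed in the PR description.
inline std::vector<Config> make_grid() {
    const std::vector<std::string> shapes = {"1536x384", "1539x387",
            "1024x4096", "1025x4099", "4096x4096", "4099x4099"};
    const std::vector<int> thread_counts = {1, 4, 16, 64};
    std::vector<Config> grid;
    for (const auto &shape : shapes)
        for (int t : thread_counts)
            grid.push_back({shape, t});
    return grid;
}

// Arithmetic mean of per-configuration speedups (baseline time / new time).
// Assumption: the PR's "average" is a plain arithmetic mean.
inline double mean_speedup(const std::vector<double> &speedups) {
    assert(!speedups.empty());
    double sum = 0.0;
    for (double s : speedups)
        sum += s;
    return sum / static_cast<double>(speedups.size());
}
```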

The comparison baseline depends on the target platform:

c6g
- tanh, logistic, and elu are compared against acl
- mish, swish, gelu_tanh, and soft_relu are compared against ref
c7g / c8g
- all ops are compared against the previous jit_sve implementation

src/cpu/aarch64/injectors/jit_uni_eltwise_injector.cpp

jondea

This is absolutely excellent, thank you. I wasn't expecting SVE speedups too, but that's great!

What were the single threaded comparisons between ACL and the new asimd impls? ACL threading typically has a bit of overhead, and eltwise is particularly hard to scale to 64 threads, so I'm worried that the high thread count/small problems may be tipping the scales when you take the average. I don't think we need to be faster than ACL single threaded to get this in, but it would tell us that there are further optimizations (which could come later).

I've left some comments, but I'll probably have another look when I get another chance. It's a very dense PR, with lots of great stuff!

Anndrey24 · 2026-03-17T17:04:48Z

> What were the single threaded comparisons between ACL and the new asimd impls? ACL threading typically has a bit of overhead, and eltwise is particularly hard to scale to 64 threads, so I'm worried that the high thread count/small problems may be tipping the scales when you take the average. I don't think we need to be faster than ACL single threaded to get this in, but it would tell us that there are further optimizations (which could come later).

@jondea As you said, the JIT:ASIMD vs ACL speedups on c6g tend to increase with higher thread counts, though the JIT implementation also seems to match ACL's single-threaded performance (tanh looks like a bit of an outlier, though).
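The per-thread-count breakdown discussed in this exchange can be sketched as follows: rather than one average over every configuration, speedups are grouped by thread count so that the single-threaded result is visible on its own. `Sample` and `mean_speedup_by_threads` are hypothetical names, and any values passed in are placeholders rather than measurements from this PR.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Speedup measured for one configuration at a given thread count.
struct Sample {
    int threads;
    double speedup;
};

// Returns the arithmetic mean speedup for each thread count separately,
// so a large gain at 64 threads cannot hide a regression at 1 thread.
inline std::map<int, double> mean_speedup_by_threads(
        const std::vector<Sample> &samples) {
    std::map<int, double> sum;
    std::map<int, int> count;
    for (const auto &s : samples) {
        sum[s.threads] += s.speedup;
        ++count[s.threads];
    }
    for (auto &kv : sum)
        kv.second /= count[kv.first];
    return sum;
}
```

With this grouping, the entry for 1 thread answers the single-threaded question directly, independent of how well either implementation scales.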

This is a non-functional commit which removes redundant `ZRegS(IDX(...))` wrappers in the eltwise injector.

Sqvid · 2026-03-20T16:08:22Z

src/cpu/aarch64/injectors/jit_uni_eltwise_injector.cpp

-    TRegS vmm_dst = vmm_aux1, vmm_src_shift = vmm_aux1, vmm_coeff = vmm_aux1,
-          vmm_pol = vmm_aux2, vmm_indices = vmm_aux3, vmm_tmp = vmm_aux3,
-          vmm_src_pos = vmm_aux4, vmm_sign = vmm_aux4;
+    TRegS vmm_dst = vmm_aux0, vmm_src_shift = vmm_aux0, vmm_coeff = vmm_aux0,


nit: would be nice to have these on separate lines for readability.
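As an illustration of the suggestion above, the three aliases visible on the added line of the diff could be declared one per line. This is a sketch only: the `TRegS` struct here is a stand-in so the snippet compiles on its own, `alias_demo` is a hypothetical wrapper, and the aliases that the diff excerpt cuts off are not reproduced.

```cpp
#include <cassert>

// Stand-in for the injector's register type (hypothetical; the real type
// comes from the JIT assembler used by the injector).
struct TRegS {
    int idx;
};

// Returns true when all three aliases refer to the same register.
inline bool alias_demo() {
    const TRegS vmm_aux0 {0};
    // One alias per line: each name-to-register mapping is easy to scan,
    // and a later change to one alias touches a single line in the diff.
    TRegS vmm_dst = vmm_aux0;
    TRegS vmm_src_shift = vmm_aux0;
    TRegS vmm_coeff = vmm_aux0;
    return vmm_dst.idx == vmm_src_shift.idx
            && vmm_src_shift.idx == vmm_coeff.idx;
}
```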

vpirogov · 2026-03-20T21:18:34Z

@jondea, @Sqvid, just a reminder that merging PRs requires write permissions. So @Anndrey24 would need help when it's ready.
