cpu: aarch64: add JIT ASIMD implementations for exp-based eltwise #4834

Open
Anndrey24 wants to merge 2 commits into uxlfoundation:main from Anndrey24:asimd-eltwise

Anndrey24 commented Mar 16, 2026

Description

This commit introduces JIT ASIMD implementations for 7 out of 8 exp-based eltwise ops (excluding gelu_erf), following the existing JIT SVE algorithms:

  • tanh
  • logistic
  • elu
  • mish
  • swish
  • gelu_tanh
  • soft_relu
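
For readers unfamiliar with these ops, the scalar formulas below sketch why each of them reduces to an `exp` evaluation. This is only a reference sketch, not the JIT code from this PR: the `ref_*` function names are hypothetical, and the use of `alpha` follows the commonly used definitions of these activations, which is an assumption rather than something stated in the description.

```cpp
#include <cmath>

// Scalar reference formulas only; the JIT kernels vectorise the same maths.
// All names below are hypothetical and not taken from the PR.

inline float ref_logistic(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

inline float ref_tanh(float x) {
    // tanh(|x|) = (1 - e^{-2|x|}) / (1 + e^{-2|x|}); the sign is restored afterwards.
    // Using exp of a non-positive argument avoids overflow for large |x|.
    const float e = std::exp(-2.0f * std::fabs(x));
    const float t = (1.0f - e) / (1.0f + e);
    return x < 0.0f ? -t : t;
}

inline float ref_elu(float x, float alpha) {
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}

inline float ref_swish(float x, float alpha) {
    return x / (1.0f + std::exp(-alpha * x));
}

inline float ref_soft_relu(float x, float alpha) {
    // log(1 + e^{alpha * x}) / alpha
    return std::log1p(std::exp(alpha * x)) / alpha;
}

inline float ref_mish(float x) {
    // x * tanh(softplus(x))
    return x * ref_tanh(std::log1p(std::exp(x)));
}

inline float ref_gelu_tanh(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + ref_tanh(k * (x + 0.044715f * x * x * x)));
}
```

These naive forms can overflow for large inputs (for example `std::exp(x)` in `ref_mish`); they are meant to show the dependence on `exp`, not to be numerically robust.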

Certain optimisations (e.g. removing unnecessary ldr/mov instructions and reducing the number of preserved auxiliary registers) were also applied to those SVE implementations, resulting in performance improvements.

Performance improvements (f32)

[Image: eltwise_speedups]

The table above summarises the average benchdnn speedup for each op, aggregated over the following shapes:

  • 1536x384
  • 1539x387
  • 1024x4096
  • 1025x4099
  • 4096x4096
  • 4099x4099

For each shape, measurements were collected using 1, 4, 16, and 64 threads, the reported value being the average across all tested configurations.
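
The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the description says "average" without specifying which kind, so an arithmetic mean of per-configuration speedups is assumed here, and the `Sample` struct, its fields, and `mean_speedup` are hypothetical names rather than anything from benchdnn or the PR.

```cpp
#include <cstddef>
#include <vector>

// One benchmark configuration: a shape run at a given thread count.
// (Hypothetical type; timings would come from benchdnn output.)
struct Sample {
    int threads;
    double baseline_ms; // time of the baseline implementation
    double new_ms;      // time of the new implementation
};

// Speedup of one configuration: values above 1 mean the new code is faster.
inline double speedup(const Sample &s) {
    return s.baseline_ms / s.new_ms;
}

// Arithmetic mean of per-configuration speedups (assumed averaging method).
// threads == 0 selects every configuration; any other value keeps only
// the configurations run with that thread count.
inline double mean_speedup(const std::vector<Sample> &samples, int threads = 0) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const Sample &s : samples) {
        if (threads != 0 && s.threads != threads) continue;
        sum += speedup(s);
        ++n;
    }
    return n == 0 ? 0.0 : sum / static_cast<double>(n);
}
```

Calling `mean_speedup(samples, 1)` alongside `mean_speedup(samples)` gives a quick check of whether the overall figure is dominated by the higher thread counts.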

The comparison baseline depends on the target platform:

  • c6g
    • tanh, logistic, and elu are compared against acl
    • mish, swish, gelu_tanh, and soft_relu are compared against ref
  • c7g / c8g
    • all ops are compared against the previous jit_sve implementation

@Anndrey24 Anndrey24 requested a review from a team as a code owner March 16, 2026 10:08
@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Mar 16, 2026

jondea left a comment

This is absolutely excellent, thank you. I wasn't expecting SVE speedups too, but that's great!

What were the single threaded comparisons between ACL and the new asimd impls? ACL threading typically has a bit of overhead, and eltwise is particularly hard to scale to 64 threads, so I'm worried that the high thread count/small problems may be tipping the scales when you take the average. I don't think we need to be faster than ACL single threaded to get this in, but it would tell us that there are further optimizations (which could come later).

I've left some comments, but I'll probably have another look when I get another chance. It's a very dense PR, with lots of great stuff!

@Anndrey24

> What were the single threaded comparisons between ACL and the new asimd impls? ACL threading typically has a bit of overhead, and eltwise is particularly hard to scale to 64 threads, so I'm worried that the high thread count/small problems may be tipping the scales when you take the average. I don't think we need to be faster than ACL single threaded to get this in, but it would tell us that there are further optimizations (which could come later).

@jondea As you said, the JIT:ASIMD vs ACL speedups on c6g tend to increase with thread count, but the JIT implementation also seems to match ACL's single-threaded performance. (tanh looks like a bit of an outlier, though.)
[Image: asimd_vs_acl]

This is a non-functional commit which removes redundant `ZRegS(IDX(...))` wrappers in the eltwise injector.
TRegS vmm_dst = vmm_aux1, vmm_src_shift = vmm_aux1, vmm_coeff = vmm_aux1,
vmm_pol = vmm_aux2, vmm_indices = vmm_aux3, vmm_tmp = vmm_aux3,
vmm_src_pos = vmm_aux4, vmm_sign = vmm_aux4;
TRegS vmm_dst = vmm_aux0, vmm_src_shift = vmm_aux0, vmm_coeff = vmm_aux0,

nit: would be nice to have these on separate lines for readability.
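
A possible shape for that suggestion, shown only for the three aliases visible in the new line of the diff. This is a sketch, not the PR's code: `TRegS` below is a hypothetical stand-in for the Xbyak register type so that the fragment compiles on its own, and `aliases_share_register` is a made-up wrapper.

```cpp
// Hypothetical stand-in for the Xbyak register type used in the diff,
// defined here only so that this fragment is self-contained.
struct TRegS {
    int idx;
};

inline bool aliases_share_register() {
    TRegS vmm_aux0 {0};

    // One alias per line: each name documents a different use of vmm_aux0.
    TRegS vmm_dst = vmm_aux0;
    TRegS vmm_src_shift = vmm_aux0;
    TRegS vmm_coeff = vmm_aux0;

    // All three names refer to the same underlying register index.
    return vmm_dst.idx == vmm_src_shift.idx
            && vmm_src_shift.idx == vmm_coeff.idx;
}
```

Splitting the declaration this way leaves the aliasing unchanged while making it easier to see, line by line, which names reuse the same auxiliary register.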

@vpirogov

@jondea, @Sqvid, just a reminder that merging PRs requires write permissions. So @Anndrey24 would need help when it's ready.
