cpu: aarch64: add JIT ASIMD implementations for exp-based eltwise #4834
Anndrey24 wants to merge 2 commits into uxlfoundation:main
jondea left a comment:
This is absolutely excellent, thank you. I wasn't expecting SVE speedups too, but that's great!
What were the single-threaded comparisons between ACL and the new ASIMD impls? ACL threading typically has a bit of overhead, and eltwise is particularly hard to scale to 64 threads, so I'm worried that the high-thread-count, small-problem cases may be tipping the scales when you take the average. I don't think we need to be faster than ACL single-threaded to get this in, but it would tell us whether there are further optimizations to be had (which could come later).
I've left some comments, but I'll probably have another look when I get another chance. It's a very dense PR, with lots of great stuff!
@jondea As you said, JIT:ASIMD vs ACL speedups on c6g tend to increase with higher thread counts, though the JIT implementation also seems to match ACL's single-threaded performance.
This commit introduces JIT ASIMD implementations for 7 out of 8 exp-based eltwise ops (excluding `gelu_erf`), following the existing JIT SVE algorithms:
- `tanh`
- `logistic`
- `elu`
- `mish`
- `swish`
- `gelu_tanh`
- `soft_relu`

Certain optimisations (e.g. removing unnecessary `ldr`/`mov` instructions and reducing the number of preserved auxiliary registers) were also applied to those SVE implementations, resulting in performance improvements.
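For readers unfamiliar with why these seven ops are grouped as "exp-based", the following is a minimal scalar sketch of their standard mathematical definitions, each reduced to calls to `exp` (plus `log1p` for `soft_relu`). It is not the JIT algorithm from this PR: the namespace `ref_eltwise` and every function name in it are hypothetical, and the overflow guard threshold in `soft_relu` is an illustrative assumption.

```cpp
#include <cmath>

// Hypothetical scalar reference formulas; not part of oneDNN's API and
// not the vectorised algorithm used by the JIT kernels.
namespace ref_eltwise {

// logistic(x) = 1 / (1 + e^{-x})
inline float logistic(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// tanh(x) = (e^{2x} - 1) / (e^{2x} + 1), rewritten as 1 - 2 / (e^{2x} + 1)
// so that a large positive x saturates to 1 instead of producing inf/inf.
inline float tanh_exp(float x) {
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}

// elu(x) = x for x > 0, alpha * (e^x - 1) otherwise
inline float elu(float x, float alpha) {
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}

// soft_relu(x) = log(1 + e^{alpha * x}) / alpha
// For large z = alpha * x, log(1 + e^z) is z to within float precision;
// the cut-off of 20 is an assumption chosen for this sketch.
inline float soft_relu(float x, float alpha) {
    const float z = alpha * x;
    return (z > 20.0f ? z : std::log1p(std::exp(z))) / alpha;
}

// swish(x) = x * logistic(alpha * x)
inline float swish(float x, float alpha) {
    return x * logistic(alpha * x);
}

// mish(x) = x * tanh(soft_relu(x)) with alpha = 1
inline float mish(float x) {
    return x * tanh_exp(soft_relu(x, 1.0f));
}

// gelu_tanh(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    const float inner = sqrt_2_over_pi * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + tanh_exp(inner));
}

} // namespace ref_eltwise
```

Since every op funnels through the same `exp` evaluation, a faster vectorised `exp` would be expected to benefit all seven at once, which is consistent with them being implemented together here.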
This is a non-functional commit which removes redundant `ZRegS(IDX(...))` wrappers in the eltwise injector.
2981f9f to 950dea9
TRegS vmm_dst = vmm_aux1, vmm_src_shift = vmm_aux1, vmm_coeff = vmm_aux1,
        vmm_pol = vmm_aux2, vmm_indices = vmm_aux3, vmm_tmp = vmm_aux3,
        vmm_src_pos = vmm_aux4, vmm_sign = vmm_aux4;
TRegS vmm_dst = vmm_aux0, vmm_src_shift = vmm_aux0, vmm_coeff = vmm_aux0,
nit: would be nice to have these on separate lines for readability.
@jondea, @Sqvid, just a reminder that merging PRs requires write permissions. So @Anndrey24 would need help when it's ready.

Description
This commit introduces JIT ASIMD implementations for 7 out of 8 exp-based eltwise ops (excluding `gelu_erf`), following the existing JIT SVE algorithms:
- `tanh`
- `logistic`
- `elu`
- `mish`
- `swish`
- `gelu_tanh`
- `soft_relu`

Certain optimisations (e.g. removing unnecessary `ldr`/`mov` instructions and reducing the number of preserved auxiliary registers) were also applied to those SVE implementations, resulting in performance improvements.

Performance improvements (f32)
The table above summarises the average benchdnn speedup for each op, aggregated over the following shapes:
- 1536x384
- 1539x387
- 1024x4096
- 1025x4099
- 4096x4096
- 4099x4099

For each shape, measurements were collected using 1, 4, 16, and 64 threads, the reported value being the average across all tested configurations.
The comparison baseline depends on the target platform:
- `tanh`, `logistic`, and `elu` are compared against `acl`
- `mish`, `swish`, `gelu_tanh`, and `soft_relu` are compared against `ref`
- `jit_sve` implementation
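The aggregation described above, together with the review concern that high thread counts might dominate the average, can be sketched as follows. This is a hedged illustration only: it assumes the reported figure is an arithmetic mean of per-configuration speedups (the description says "average" without specifying which), and the `Sample` struct and both function names are hypothetical rather than part of benchdnn.

```cpp
#include <map>
#include <utility>
#include <vector>

// One benchmark configuration (hypothetical layout): a thread count plus
// the baseline and new execution times in the same unit.
struct Sample {
    int threads;
    double baseline_time;
    double new_time;
};

// Arithmetic mean of per-configuration speedups (baseline / new).
// Assumption: this matches the "average" reported in the description.
inline double mean_speedup(const std::vector<Sample> &samples) {
    if (samples.empty()) return 0.0;
    double sum = 0.0;
    for (const Sample &s : samples)
        sum += s.baseline_time / s.new_time;
    return sum / static_cast<double>(samples.size());
}

// The same mean, split by thread count, so that a large gain at 64 threads
// cannot hide a flat or negative result at 1 thread.
inline std::map<int, double> mean_speedup_by_threads(
        const std::vector<Sample> &samples) {
    std::map<int, std::pair<double, int>> acc; // threads -> (sum, count)
    for (const Sample &s : samples) {
        acc[s.threads].first += s.baseline_time / s.new_time;
        acc[s.threads].second += 1;
    }
    std::map<int, double> result;
    for (const auto &kv : acc)
        result[kv.first] = kv.second.first / kv.second.second;
    return result;
}
```

Reporting the per-thread breakdown alongside the overall mean would answer the single-threaded question directly without changing how the headline number is computed.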