sgemm: simplify kernel_x86_avx logic and reduce shuffle overhead #91
This PR simplifies the implementation of the AVX-based 8x8 SGEMM microkernel in `kernel_x86_avx`, again.

This refactor is motivated by the fact that instructions like `moveldup`, `movehdup`, `permute`, and `permute2f128` are all executed exclusively on port 5. On FMA-capable platforms, this creates a bottleneck that the `fmadd` throughput cannot hide, especially when combined with loads followed by permutes. The new implementation reduces port 5 pressure by avoiding these sequences entirely during the inner loop.

By the way, the theoretical maximum-FLOPS kernel on FMA-capable platforms should resemble a `6x16` layout using `_mm256_broadcast_ss` (from memory) followed by two `_mm256_fmadd_ps`. However, this comes at the cost of losing the transposition flexibility of square kernels that adapt to the row/column-major layout of C.

Performance
On Intel i7-10700K
Before:
After:
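
For reference, the `6x16` broadcast-plus-two-FMA structure mentioned in the description could be sketched roughly as follows. This is not the code in this PR: the function names, the packed layouts of `a` (k x 6) and `b` (k x 16), and the row-major `c` with leading dimension `ldc` are all assumptions made for illustration. The point is only that each step of the inner loop is one `_mm256_broadcast_ss` from memory feeding two `_mm256_fmadd_ps`, with no shuffle instructions.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch: C (6x16, row-major, leading dimension ldc)
   += A (packed k x 6) * B (packed k x 16). Names and layouts are assumed. */
__attribute__((target("avx2,fma")))
void sgemm_6x16_sketch(size_t k, const float *a, const float *b,
                       float *c, size_t ldc)
{
    assert(ldc >= 16);
    __m256 acc[6][2]; /* 12 accumulators; a real kernel would unroll by hand */
    for (int i = 0; i < 6; i++) {
        acc[i][0] = _mm256_loadu_ps(c + i * ldc);
        acc[i][1] = _mm256_loadu_ps(c + i * ldc + 8);
    }
    for (size_t p = 0; p < k; p++) {
        /* One 16-wide row of B, held in two vectors. */
        __m256 b0 = _mm256_loadu_ps(b + 16 * p);
        __m256 b1 = _mm256_loadu_ps(b + 16 * p + 8);
        for (int i = 0; i < 6; i++) {
            /* Broadcast one element of A straight from memory ... */
            __m256 ai = _mm256_broadcast_ss(a + 6 * p + i);
            /* ... and feed it to two FMAs, with no shuffles involved. */
            acc[i][0] = _mm256_fmadd_ps(ai, b0, acc[i][0]);
            acc[i][1] = _mm256_fmadd_ps(ai, b1, acc[i][1]);
        }
    }
    for (int i = 0; i < 6; i++) {
        _mm256_storeu_ps(c + i * ldc, acc[i][0]);
        _mm256_storeu_ps(c + i * ldc + 8, acc[i][1]);
    }
}

/* Compares the sketch against a scalar loop on small integer-valued inputs,
   so the results are exact. Returns 1 on match (or if the CPU lacks
   AVX2/FMA and the kernel cannot be run), 0 on mismatch. */
int sgemm_6x16_selfcheck(void)
{
    enum { K = 8 };
    float a[K * 6], b[K * 16], c[6 * 16], ref[6 * 16];
    if (!__builtin_cpu_supports("avx2") || !__builtin_cpu_supports("fma"))
        return 1;
    for (int p = 0; p < K; p++) {
        for (int i = 0; i < 6; i++)  a[p * 6 + i]  = (float)((p + i) % 5);
        for (int j = 0; j < 16; j++) b[p * 16 + j] = (float)((p + 2 * j) % 7);
    }
    for (int n = 0; n < 6 * 16; n++) c[n] = ref[n] = 1.0f;
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 16; j++)
            for (int p = 0; p < K; p++)
                ref[i * 16 + j] += a[p * 6 + i] * b[p * 16 + j];
    sgemm_6x16_sketch(K, a, b, c, 16);
    for (int n = 0; n < 6 * 16; n++)
        if (c[n] != ref[n]) return 0;
    return 1;
}
```

The sketch builds with GCC or Clang on x86 without extra flags because of the `target` attribute. The 12 accumulators, two B vectors, and one broadcast register fit within the 16 `ymm` registers, which is why the `6x16` shape is the natural choice here; the trade-off noted above is that a non-square tile cannot simply be transposed to follow C's layout.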