SongXiaoXi
Contributor

This PR further simplifies the implementation of the AVX-based 8x8 SGEMM microkernel in kernel_x86_avx.

This refactor is motivated by the fact that shuffle instructions such as moveldup, movehdup, permute, and permute2f128 all execute exclusively on port 5 on Intel cores. On FMA-capable platforms, the shuffles therefore become a bottleneck that fmadd throughput cannot hide, especially when loads are followed by permutes. The new implementation reduces port 5 pressure by avoiding these sequences entirely in the inner loop.

As an aside, the kernel with the highest theoretical FLOPS on FMA-capable platforms should resemble a 6x16 layout, in which each _mm256_broadcast_ss (from memory) is followed by two _mm256_fmadd_ps. However, this comes at the cost of losing the transposition flexibility of square kernels, which can adapt to either a row-major or column-major layout of C.
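To make the 6x16 idea concrete, here is a minimal sketch of such a microkernel. It is not code from this PR or from the crate: the function names, the packed layouts of `a` (6 values per step) and `b` (16 values per step), and the row-major 6x16 `c` tile are all assumptions made for illustration. It holds 12 accumulators, 2 vectors of B, and 1 broadcast value, which fits in the 16 ymm registers, and it uses no shuffle instructions in the loop.

```rust
// Hypothetical sketch of a 6x16 broadcast + FMA microkernel.
// Assumed layouts: a[p * MR + i] is packed A, b[p * NR + j] is packed B,
// c is a row-major MR x NR tile. None of these names come from the crate.
const MR: usize = 6;
const NR: usize = 16;

/// Scalar reference: c[i][j] += sum over p of a[p][i] * b[p][j].
fn kernel_6x16_scalar(k: usize, a: &[f32], b: &[f32], c: &mut [f32; MR * NR]) {
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                c[i * NR + j] += a[p * MR + i] * b[p * NR + j];
            }
        }
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn kernel_6x16_fma(k: usize, a: &[f32], b: &[f32], c: &mut [f32; MR * NR]) {
    use std::arch::x86_64::*;
    assert!(a.len() >= k * MR && b.len() >= k * NR);
    // 12 accumulators: two 8-lane vectors per row of the tile.
    let mut acc = [[_mm256_setzero_ps(); 2]; MR];
    for p in 0..k {
        // Two loads cover the 16 columns of B for this step.
        let b0 = _mm256_loadu_ps(b.as_ptr().add(p * NR));
        let b1 = _mm256_loadu_ps(b.as_ptr().add(p * NR + 8));
        for i in 0..MR {
            // One broadcast from memory feeds two FMAs.
            let ai = _mm256_broadcast_ss(&a[p * MR + i]);
            acc[i][0] = _mm256_fmadd_ps(ai, b0, acc[i][0]);
            acc[i][1] = _mm256_fmadd_ps(ai, b1, acc[i][1]);
        }
    }
    // Add the accumulators into the C tile.
    for i in 0..MR {
        let row = c.as_mut_ptr().add(i * NR);
        _mm256_storeu_ps(row, _mm256_add_ps(_mm256_loadu_ps(row), acc[i][0]));
        _mm256_storeu_ps(row.add(8), _mm256_add_ps(_mm256_loadu_ps(row.add(8)), acc[i][1]));
    }
}

/// Uses the FMA path when the CPU supports it, otherwise the scalar reference.
fn kernel_6x16(k: usize, a: &[f32], b: &[f32], c: &mut [f32; MR * NR]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            unsafe { kernel_6x16_fma(k, a, b, c) };
            return;
        }
    }
    kernel_6x16_scalar(k, a, b, c);
}

fn main() {
    let k = 32;
    // Small integer-valued inputs keep every sum exact in f32.
    let a: Vec<f32> = (0..k * MR).map(|i| (i % 7) as f32 - 3.0).collect();
    let b: Vec<f32> = (0..k * NR).map(|i| (i % 5) as f32 - 2.0).collect();
    let mut c_ref = [0.0f32; MR * NR];
    let mut c_out = [0.0f32; MR * NR];
    kernel_6x16_scalar(k, &a, &b, &mut c_ref);
    kernel_6x16(k, &a, &b, &mut c_out);
    println!("kernels agree: {}", c_ref == c_out);
}
```

Per step of the loop this issues 8 loads (2 for B, 6 broadcasts) against 12 FMAs. The tile is rectangular, though, so it cannot simply be transposed to match the layout of C the way a square 8x8 tile can.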

Performance

On an Intel Core i7-10700K
Before:

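| m | k | n | layout | type | average_ns | minimum_ns | median_ns | samples | GFLOPS | nc | kc | mc | threads |
|---|---|---|--------|------|------------|------------|-----------|---------|--------|----|----|----|---------|
| 384 | 384 | 384 | FCC | f32 | 974,601 | 973,492 | 974,549 | 1560 | 116.1975085188708 | | | | 0 |
| 450 | 450 | 450 | FCC | f32 | 1,594,266 | 1,584,060 | 1,590,237 | 1560 | 114.31592971310936 | | | | 0 |
| 512 | 512 | 512 | FCC | f32 | 2,334,892 | 2,289,947 | 2,357,830 | 1560 | 114.96696892190303 | | | | 0 |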

After:

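| m | k | n | layout | type | average_ns | minimum_ns | median_ns | samples | GFLOPS | nc | kc | mc | threads |
|---|---|---|--------|------|------------|------------|-----------|---------|--------|----|----|----|---------|
| 384 | 384 | 384 | FCC | f32 | 892269 | 891653 | 892212 | 1560 | 126.91935727902684 | | | | 0 |
| 450 | 450 | 450 | FCC | f32 | 1460456 | 1456068 | 1460988 | 1560 | 124.78979168150222 | | | | 0 |
| 512 | 512 | 512 | FCC | f32 | 2108191 | 2101841 | 2111378 | 1560 | 127.32976091824698 | | | | 0 |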
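For readers checking the numbers: the GFLOPS values above are consistent with counting 2·m·k·n floating-point operations and dividing by average_ns. The sketch below recomputes the 384 case under that assumption. The helper names `gflops` and `speedup` are made up for this example and are not part of the benchmark tool.

```rust
/// GFLOP/s for an (m x k) by (k x n) product, counting one multiply and one
/// add per inner-product term. Hypothetical helper, not the benchmark's code.
fn gflops(m: u64, k: u64, n: u64, average_ns: f64) -> f64 {
    // Floating-point operations per nanosecond equals GFLOP/s.
    (2 * m * k * n) as f64 / average_ns
}

/// Ratio of the old average time to the new one.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

fn main() {
    let before = gflops(384, 384, 384, 974_601.0);
    let after = gflops(384, 384, 384, 892_269.0);
    println!("before: {:.2} GFLOPS", before);
    println!("after:  {:.2} GFLOPS", after);
    println!("speedup: {:.3}x", speedup(974_601.0, 892_269.0));
}
```

For the 384 case this works out to a speedup of roughly 9% on average time.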
