sgemm: simplify kernel_x86_avx logic and reduce shuffle overhead #91

SongXiaoXi · 2025-06-03T06:58:15Z

This PR once again simplifies the implementation of the AVX-based 8x8 SGEMM microkernel in kernel_x86_avx.

This refactor is motivated by the fact that instructions like moveldup, movehdup, permute, and permute2f128 all execute exclusively on port 5. On FMA-capable platforms, this creates a bottleneck that fmadd throughput cannot hide, especially when loads are followed by permutes. The new implementation reduces port 5 pressure by avoiding these sequences entirely in the inner loop.
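To make the port 5 argument concrete, here is a deliberately simplified throughput model, not a measurement. It assumes FMAs can issue on two ports and shuffles on one, and it ignores the front end, load ports, and latency. The function name and the shuffle counts are hypothetical and are not taken from the old or new kernel.

```rust
/// Lower bound on cycles per k-step when FMAs can issue on `fma_ports`
/// ports and every shuffle must issue on a single port (port 5).
/// Simplified model: ignores front-end width, loads, and latencies.
fn cycles_per_k_step(fmas: u32, shuffles: u32, fma_ports: u32) -> f64 {
    let fma_cycles = fmas as f64 / fma_ports as f64;
    let shuffle_cycles = shuffles as f64; // one shuffle per cycle at best
    fma_cycles.max(shuffle_cycles)
}

fn main() {
    // An 8x8 f32 tile is 8 ymm accumulators, so one k-step needs 8 vector FMAs.
    let fmas = 8;
    // Hypothetical shuffle counts, chosen only to illustrate the bound.
    for &shuffles in [0u32, 4, 6, 8].iter() {
        println!(
            "shuffles per k-step = {}: at least {} cycles",
            shuffles,
            cycles_per_k_step(fmas, shuffles, 2)
        );
    }
}
```

Under this model, once a k-step needs more shuffles than half its FMA count, the shuffle port rather than the FMA units sets the lower bound.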

As an aside, the kernel with the highest theoretical FLOPS on FMA-capable platforms should resemble a 6x16 layout that uses _mm256_broadcast_ss (from memory) followed by two _mm256_fmadd_ps. However, this comes at the cost of the transposition flexibility of square kernels, which can adapt to the row-major or column-major layout of C.
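For reference, a minimal sketch of what such a 6x16 inner loop could look like. This is not code from this PR. The packing layout (A packed as k groups of 6 values, B packed as k groups of 16 values) and all function names are assumptions made for illustration, and a scalar fallback is included so the sketch runs on machines without AVX/FMA.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const MR: usize = 6; // rows of the tile
const NR: usize = 16; // columns of the tile (two ymm vectors)

/// Portable reference: acc[i][j] += a[p*MR + i] * b[p*NR + j].
fn kernel_6x16_scalar(k: usize, a: &[f32], b: &[f32], acc: &mut [[f32; NR]; MR]) {
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
            }
        }
    }
}

/// AVX/FMA version: per row, one broadcast from memory feeds two fmadds.
/// 12 accumulators + 2 B vectors + 1 broadcast are meant to fit in the
/// 16 ymm registers; the compiler's register allocation is not guaranteed.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn kernel_6x16_fma(k: usize, a: &[f32], b: &[f32], acc: &mut [[f32; NR]; MR]) {
    assert!(a.len() >= k * MR && b.len() >= k * NR);
    let mut c = [[_mm256_setzero_ps(); 2]; MR];
    for i in 0..MR {
        c[i][0] = _mm256_loadu_ps(acc[i].as_ptr());
        c[i][1] = _mm256_loadu_ps(acc[i].as_ptr().add(8));
    }
    for p in 0..k {
        let b0 = _mm256_loadu_ps(b.as_ptr().add(p * NR));
        let b1 = _mm256_loadu_ps(b.as_ptr().add(p * NR + 8));
        for i in 0..MR {
            let ai = _mm256_broadcast_ss(&a[p * MR + i]); // no shuffle needed
            c[i][0] = _mm256_fmadd_ps(ai, b0, c[i][0]);
            c[i][1] = _mm256_fmadd_ps(ai, b1, c[i][1]);
        }
    }
    for i in 0..MR {
        _mm256_storeu_ps(acc[i].as_mut_ptr(), c[i][0]);
        _mm256_storeu_ps(acc[i].as_mut_ptr().add(8), c[i][1]);
    }
}

/// Dispatches to the FMA kernel when the CPU supports it, else to the scalar one.
fn kernel_6x16(k: usize, a: &[f32], b: &[f32], acc: &mut [[f32; NR]; MR]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            unsafe { kernel_6x16_fma(k, a, b, acc) };
            return;
        }
    }
    kernel_6x16_scalar(k, a, b, acc);
}

fn main() {
    let k = 4;
    // Small integer inputs keep fused and unfused arithmetic bit-identical.
    let a: Vec<f32> = (0..k * MR).map(|x| (x % 5) as f32).collect();
    let b: Vec<f32> = (0..k * NR).map(|x| (x % 7) as f32).collect();
    let mut fast = [[0.0f32; NR]; MR];
    let mut reference = [[0.0f32; NR]; MR];
    kernel_6x16(k, &a, &b, &mut fast);
    kernel_6x16_scalar(k, &a, &b, &mut reference);
    assert_eq!(fast, reference);
    println!("6x16 tile matches the scalar reference");
}
```

The non-square tile is what removes the shuffles: each element of A is broadcast straight from memory, so nothing in the loop needs port 5. The price is that the tile can no longer simply be transposed to match the layout of C.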

Performance

On an Intel i7-10700K:
Before:

m	k	n	layout	type	average_ns	minimum_ns	median_ns	samples	GFLOPS
384	384	384	FCC	f32	974,601	973,492	974,549	1560	116.1975085188708
450	450	450	FCC	f32	1,594,266	1,584,060	1,590,237	1560	114.31592971310936
512	512	512	FCC	f32	2,334,892	2,289,947	2,357,830	1560	114.96696892190303

After:

m	k	n	layout	type	average_ns	minimum_ns	median_ns	samples	GFLOPS
384	384	384	FCC	f32	892,269	891,653	892,212	1560	126.91935727902684
450	450	450	FCC	f32	1,460,456	1,456,068	1,460,988	1560	124.78979168150222
512	512	512	FCC	f32	2,108,191	2,101,841	2,111,378	1560	127.32976091824698
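The GFLOPS column is consistent with 2·m·k·n floating-point operations divided by average_ns. The short sketch below recomputes it and the resulting speedup from the average_ns values in the two tables. The helper name `gflops` is made up for this illustration and is not part of the benchmark harness.

```rust
/// GFLOPS for an m x k x n matrix product that took `ns` nanoseconds.
/// A GEMM performs 2*m*k*n floating-point operations, and
/// FLOP per nanosecond is numerically equal to GFLOP per second.
fn gflops(m: u64, k: u64, n: u64, ns: f64) -> f64 {
    (2 * m * k * n) as f64 / ns
}

fn main() {
    // (size, average_ns before, average_ns after), copied from the tables.
    let rows = [
        (384u64, 974_601.0, 892_269.0),
        (450, 1_594_266.0, 1_460_456.0),
        (512, 2_334_892.0, 2_108_191.0),
    ];
    for &(s, before, after) in rows.iter() {
        let g_before = gflops(s, s, s, before);
        let g_after = gflops(s, s, s, after);
        println!(
            "{0}x{0}x{0}: {1:.2} -> {2:.2} GFLOPS ({3:.1}% faster)",
            s,
            g_before,
            g_after,
            (g_after / g_before - 1.0) * 100.0
        );
    }
}
```

By this calculation the throughput gain is roughly 9 to 11 percent across the three sizes.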

sgemm: simplify kernel_x86_avx logic and reduce shuffle overhead

0d09542
