leo16: Add simpler (but faster) AVX512 GFNI path#320
Conversation
For AVX512 simply use the extra registers and always use `VPTERNLOGD` independent of compilation settings.
📝 WalkthroughWalkthroughEnables previously-disabled AVX-512 GFNI branches in FFT encode/reconstruct paths and migrates assembly GFNI implementations from 512-bit ZMM operations to 256-bit Y-register GFNI sequences and adjusted broadcast/load/store patterns. Changes
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@galois_gen_nopshufb_amd64.s`:
- Around line 68424-68430: The CPU feature check for enabling AVX512+GFNI is
incorrect: update the options initialization that sets useAvx512GFNI to require
AVX512VL instead of AVX512DQ; specifically, in the options code where it
currently calls cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512DQ)
change the third flag to cpuid.AVX512VL so it becomes
cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512VL); this aligns the
feature gate used by the dispatcher (galois_amd64.go) and the tests that expect
GFNI+AVX512VL for Y-register 256-bit instructions like VGF2P8AFFINEQB and
VPTERNLOGD.
🧹 Nitpick comments (2)
galois_gen_nopshufb_amd64.s (1)
69480-69480: Function naming is misleading.These functions are named
*_gfni_avx512_7but only require AVX and AVX2, with no GFNI instructions in the implementation. Consider renaming to clarify they are non-GFNI fallback variants.Also applies to: 69534-69534
galois_gen_amd64.s (1)
126994-126996: Dead load of table01.Line 126994 loads
table01+32(FP)into AX which is immediately overwritten on line 126995. This variant doesn't use table01, so the load is harmless but redundant. Not a bug - the generated code pattern likely keeps parameter handling consistent across variants.
|
Curios to see how this affected benchmarks for gf16 post #317 |
|
@Wondertan It builds on top of that. Didn't really see too much of a consistent difference so some other factor may have been limiting elsewhere. But this can easily be faster on other microarchs (currently testing on a Zen4). There shouldn't be any scenario where it would be slower - and it has the potential to be faster since it is less ops and less memory reads. |
|
@Wondertan I think #322 is more impactful. This is mainly just cleanup. |
For AVX512 simply use the extra registers and always use
VPTERNLOGDindependent of compilation settings.So this re-enabled the code path with new code. And removes the AVX512 with shuffling.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.