leo16: Add simpler (but faster) AVX512 GFNI path #320

Merged: klauspost merged 2 commits into master from avx512-gfni-leopard on Jan 20, 2026


@klauspost (Owner) commented Jan 20, 2026

For AVX512, simply use the extra registers and always use `VPTERNLOGD`, independent of compilation settings.

This re-enables the code path with new code, and removes the AVX512 variant that used shuffling.
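
For readers unfamiliar with `VPTERNLOGD`: it combines three inputs bit by bit through an 8-entry truth table given as an immediate. A minimal Go sketch of one 32-bit lane follows. The helper name `ternlog32` is hypothetical, and the immediate `0x96` (three-way XOR) is only an assumed example; the PR text does not say which immediates the generated assembly uses.

```go
package main

import "fmt"

// ternlog32 emulates one 32-bit lane of VPTERNLOGD (hypothetical helper).
// For every bit position, the three input bits form an index
// (a is the high bit, c the low bit) into the truth table held in imm.
func ternlog32(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for bit := 0; bit < 32; bit++ {
		idx := (a>>bit&1)<<2 | (b>>bit&1)<<1 | (c >> bit & 1)
		out |= uint32(imm>>idx&1) << bit
	}
	return out
}

func main() {
	a, b, c := uint32(0xdeadbeef), uint32(0x01234567), uint32(0x89abcdef)
	// 0x96 has bits set at indices 1, 2, 4 and 7, the inputs with odd
	// parity, so it computes a XOR b XOR c in a single operation.
	fmt.Println(ternlog32(a, b, c, 0x96) == a^b^c)
}
```

Folding two XORs into one instruction is the kind of saving that makes the instruction attractive for accumulating Galois-field products.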

Summary by CodeRabbit

  • Refactor
    • Re-enabled and streamlined high-performance computation pathways for compatible CPUs.
  • Performance Improvements
    • Enables accelerated transform implementations on AVX-512/related-capable hardware for faster encoding and reconstruction.
  • Compatibility
    • Broadened support for additional CPU instruction subsets to improve execution on more modern processors.


@klauspost klauspost requested a review from Copilot January 20, 2026 11:35

coderabbitai bot commented Jan 20, 2026

Walkthrough

Enables previously-disabled AVX-512 GFNI branches in FFT encode/reconstruct paths and migrates assembly GFNI implementations from 512-bit ZMM operations to 256-bit Y-register GFNI sequences and adjusted broadcast/load/store patterns.
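
As background on what a GFNI sequence does, the affine instruction `VGF2P8AFFINEQB` multiplies each input byte by an 8x8 bit matrix over GF(2). Multiplication by a constant in GF(2^8) is linear, so it can be packed into such a matrix. The sketch below emulates a single byte in Go and checks it against a plain field multiply. The helper names are hypothetical, the polynomial 0x11d is an assumption inferred from the identifier `gf2p811dMulMatrices16`, and the library's real 16-bit tables are more involved than this 8-bit illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// gfMul multiplies a and b in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d).
func gfMul(a, b uint8) uint8 {
	var p uint8
	for i := 0; i < 8; i++ {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // reduce by the low byte of 0x11d
		}
		b >>= 1
	}
	return p
}

// mulMatrix packs "multiply by c" into the qword layout that
// GF2P8AFFINEQB expects: the row for output bit i lives in byte 7-i.
func mulMatrix(c uint8) uint64 {
	var m uint64
	for i := 0; i < 8; i++ {
		var row uint8
		for j := 0; j < 8; j++ {
			// Input bit j contributes to output bit i when bit i
			// of c*x^j is set.
			if gfMul(c, 1<<j)>>i&1 != 0 {
				row |= 1 << j
			}
		}
		m |= uint64(row) << (8 * (7 - i))
	}
	return m
}

// affineByte emulates GF2P8AFFINEQB on one byte with imm8 = 0:
// output bit i is the parity of (matrix byte 7-i AND x).
func affineByte(m uint64, x uint8) uint8 {
	var out uint8
	for i := 0; i < 8; i++ {
		row := uint8(m >> (8 * (7 - i)))
		out |= uint8(bits.OnesCount8(row&x)&1) << i
	}
	return out
}

func main() {
	m := mulMatrix(0x53)
	ok := true
	for x := 0; x < 256; x++ {
		if affineByte(m, uint8(x)) != gfMul(0x53, uint8(x)) {
			ok = false
		}
	}
	fmt.Println("matrix multiply matches field multiply:", ok)
}
```

One matrix per constant replaces the table lookups of a shuffle-based approach, which fits the description of fewer operations and memory reads.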

Changes

| Cohort / File(s) | Summary |
|---|---|
| GFNI branch enablement: `galois_amd64.go` | Removed the always-false guard so GFNI paths are taken when `o.useAvx512GFNI && o.useAVX512 && gf2p811dMulMatrices16 != nil` in `ifftDIT4` and `fftDIT4`. No other logic changes. |
| GFNI assembly rework: `galois_gen_amd64.s` | Replaced ZMM-centric GFNI sequences with 256-bit Y-register GFNI operations; changed broadcasts (e.g., `VPBROADCASTQ` → `VBROADCASTSD`), adjusted load/store widths (64→32/16-byte patterns), added explicit AVX2/AVX512VL/GFNI annotations, and updated loop/shuffle/accumulation instruction flows across all fft/ifft variants. |
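
The guard described in the first row can be sketched in Go as follows. This is an illustration only: the `options` struct is reduced to the two flags named above, the table type is a placeholder, and `selectPath` is a hypothetical function that returns a label instead of calling the assembly routines.

```go
package main

import "fmt"

// options holds only the two flags named in the summary; the real
// struct has more fields.
type options struct {
	useAVX512     bool
	useAvx512GFNI bool
}

// gf2p811dMulMatrices16 stands in for the GFNI matrix table. The element
// type here is a placeholder, not the library's actual type.
var gf2p811dMulMatrices16 *[256]uint64

// selectPath is a hypothetical dispatcher. With an always-false guard in
// front of the condition, the first branch could never be taken;
// removing that guard lets the three runtime checks decide.
func selectPath(o *options) string {
	if o.useAvx512GFNI && o.useAVX512 && gf2p811dMulMatrices16 != nil {
		return "avx512-gfni"
	}
	return "fallback"
}

func main() {
	o := &options{useAVX512: true, useAvx512GFNI: true}
	fmt.Println(selectPath(o)) // table not yet allocated
	gf2p811dMulMatrices16 = new([256]uint64)
	fmt.Println(selectPath(o)) // all three conditions hold
}
```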


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling a faster AVX512 GFNI path by removing the hard-coded false condition that previously disabled it. |


This comment was marked as resolved.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@galois_gen_nopshufb_amd64.s`:
- Around line 68424-68430: The CPU feature check for enabling AVX512+GFNI is
incorrect: update the options initialization that sets useAvx512GFNI to require
AVX512VL instead of AVX512DQ; specifically, in the options code where it
currently calls cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512DQ)
change the third flag to cpuid.AVX512VL so it becomes
cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512VL); this aligns the
feature gate used by the dispatcher (galois_amd64.go) and the tests that expect
GFNI+AVX512VL for Y-register 256-bit instructions like VGF2P8AFFINEQB and
VPTERNLOGD.
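
To see why the third flag matters, here is a self-contained Go sketch of the two gates. The `feature` bitset and `cpu.Supports` method are hypothetical stand-ins that imitate the shape of `cpuid.CPU.Supports`; they are not the real `cpuid` package, and the feature set used in `main` is invented for illustration.

```go
package main

import "fmt"

// feature is a hypothetical bitset standing in for cpuid feature IDs.
type feature uint

const (
	AVX512F feature = 1 << iota
	AVX512DQ
	AVX512VL
	GFNI
)

// cpu is a stand-in for the detected processor.
type cpu struct{ have feature }

// Supports reports whether every requested feature is present.
func (c cpu) Supports(fs ...feature) bool {
	for _, f := range fs {
		if c.have&f == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Invented feature set: AVX512F, AVX512DQ and GFNI, but no AVX512VL.
	c := cpu{have: AVX512F | AVX512DQ | GFNI}

	oldGate := c.Supports(AVX512F, GFNI, AVX512DQ) // gate as quoted above
	newGate := c.Supports(AVX512F, GFNI, AVX512VL) // gate as suggested

	// The old gate would enable 256-bit EVEX code on a CPU that lacks
	// AVX512VL; the suggested gate would not.
	fmt.Println(oldGate, newGate)
}
```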
🧹 Nitpick comments (2)
galois_gen_nopshufb_amd64.s (1)

69480-69480: Function naming is misleading.

These functions are named *_gfni_avx512_7 but only require AVX and AVX2, with no GFNI instructions in the implementation. Consider renaming to clarify they are non-GFNI fallback variants.

Also applies to: 69534-69534

galois_gen_amd64.s (1)

126994-126996: Dead load of table01.

Line 126994 loads table01+32(FP) into AX which is immediately overwritten on line 126995. This variant doesn't use table01, so the load is harmless but redundant. Not a bug - the generated code pattern likely keeps parameter handling consistent across variants.

@klauspost klauspost merged commit c9386ba into master Jan 20, 2026
17 checks passed
@klauspost klauspost deleted the avx512-gfni-leopard branch January 20, 2026 12:41

Wondertan commented Jan 21, 2026

Curious to see how this affected benchmarks for gf16 after #317.

@klauspost (Owner, Author)

@Wondertan It builds on top of that. I didn't see much of a consistent difference, so some other factor may have been the bottleneck.

But this can easily be faster on other microarchitectures (I am currently testing on a Zen4). There shouldn't be any scenario where it is slower, and it has the potential to be faster since it uses fewer ops and fewer memory reads.

@klauspost (Owner, Author)

@Wondertan I think #322 is more impactful. This is mainly just cleanup.

@klauspost klauspost changed the title Add simpler (but faster) AVX512 GFNI path leo16: Add simpler (but faster) AVX512 GFNI path Feb 2, 2026