leo16: Add simpler (but faster) AVX512 GFNI path by klauspost · Pull Request #320 · klauspost/reedsolomon

klauspost · 2026-01-20T11:35:26Z

For AVX512 simply use the extra registers and always use VPTERNLOGD independent of compilation settings.

This re-enables the code path with new code and removes the AVX512 path that used shuffling.
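As a rough illustration of why `VPTERNLOGD` is attractive here: the instruction evaluates an arbitrary three-input boolean function per bit, selected by an 8-bit truth table. The sketch below emulates one 32-bit lane in plain Go and checks that the immediate `0x96` yields a three-way XOR. The function name `ternlog32` is hypothetical, and the assumption that the assembly in this PR uses the XOR truth table is mine rather than something stated above.

```go
package main

import "fmt"

// ternlog32 emulates one 32-bit lane of VPTERNLOGD (hypothetical helper).
// For every bit position, the three input bits (a, b, c) form an index
// 0..7 that selects one bit of the truth table imm.
func ternlog32(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for bit := uint(0); bit < 32; bit++ {
		idx := ((a>>bit)&1)<<2 | ((b>>bit)&1)<<1 | (c>>bit)&1
		out |= uint32(imm>>idx&1) << bit
	}
	return out
}

func main() {
	a, b, c := uint32(0xDEADBEEF), uint32(0x12345678), uint32(0x0F0F0F0F)
	// 0x96 = 0b10010110 is the truth table of a XOR b XOR c.
	fmt.Println(ternlog32(a, b, c, 0x96) == a^b^c)
}
```

Folding two XORs into one instruction is the kind of saving a Galois-field butterfly can benefit from, since field addition is XOR.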

Summary by CodeRabbit

- Refactor
  - Re-enabled and streamlined high-performance computation pathways for compatible CPUs.
- Performance Improvements
  - Enables accelerated transform implementations on AVX-512/related-capable hardware for faster encoding and reconstruction.
- Compatibility
  - Broadened support for additional CPU instruction subsets to improve execution on more modern processors.


coderabbitai · 2026-01-20T11:35:39Z

Walkthrough

Enables previously-disabled AVX-512 GFNI branches in FFT encode/reconstruct paths and migrates assembly GFNI implementations from 512-bit ZMM operations to 256-bit Y-register GFNI sequences and adjusted broadcast/load/store patterns.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GFNI branch enablement `galois_amd64.go` | Removed the always-false guard so GFNI paths are taken when `o.useAvx512GFNI && o.useAVX512 && gf2p811dMulMatrices16 != nil` in `ifftDIT4` and `fftDIT4`. No other logic changes. |
| GFNI assembly rework `galois_gen_amd64.s` | Replaced ZMM-centric GFNI sequences with 256-bit Y-register GFNI operations; changed broadcasts (e.g., `VPBROADCASTQ` → `VBROADCASTSD`), adjusted load/store widths (64→32/16-byte patterns), added explicit AVX2/AVX512VL/GFNI annotations, and updated loop/shuffle/accumulation instruction flows across all fft/ifft variants. |
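For readers unfamiliar with the GFNI operations mentioned in the table, here is a minimal sketch of what the affine instruction (`VGF2P8AFFINEQB`) does to a single byte: it multiplies the byte, viewed as a bit vector, by an 8x8 bit matrix packed into a 64-bit word, then XORs in an immediate. The helper name `affineByte` is hypothetical and this is a scalar emulation only. How the library lays out its matrices for 16-bit field elements is not shown here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// affineByte emulates the per-byte behaviour of GF2P8AFFINEQB (hypothetical
// helper). Result bit i is the parity of (matrix byte 7-i AND x), XORed with
// bit i of imm.
func affineByte(matrix uint64, x, imm uint8) uint8 {
	var out uint8
	for i := uint(0); i < 8; i++ {
		row := uint8(matrix >> (8 * (7 - i)))
		parity := uint8(bits.OnesCount8(row&x) & 1)
		out |= (parity ^ (imm>>i)&1) << i
	}
	return out
}

func main() {
	// With this matrix each output bit copies the matching input bit.
	const identity = 0x0102040810204080
	fmt.Println(affineByte(identity, 0xA7, 0) == 0xA7)
}
```

Because the transform is linear over GF(2), multiplication by a fixed field constant can be expressed as such a matrix, which is what makes the instruction useful for Reed-Solomon arithmetic.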

Possibly related PRs

Add Leopard16 GFNI on amd64 #317: Modifies GFNI-based FFT/IFFT assembly and matrix setup that correspond to the now-enabled GFNI branches.
Add GFNI to leopard gf8 #314: Also re-enables GFNI branches and adjusts GFNI-related data paths, overlapping the same codepaths changed here.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling a faster AVX512 GFNI path by removing the hard-coded false condition that previously disabled it. |


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@galois_gen_nopshufb_amd64.s`:
- Around line 68424-68430: The CPU feature check for enabling AVX512+GFNI is
incorrect: update the options initialization that sets useAvx512GFNI to require
AVX512VL instead of AVX512DQ; specifically, in the options code where it
currently calls cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512DQ)
change the third flag to cpuid.AVX512VL so it becomes
cpuid.CPU.Supports(cpuid.AVX512F, cpuid.GFNI, cpuid.AVX512VL); this aligns the
feature gate used by the dispatcher (galois_amd64.go) and the tests that expect
GFNI+AVX512VL for Y-register 256-bit instructions like VGF2P8AFFINEQB and
VPTERNLOGD.
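The gating problem described above can be sketched without any CPU-detection dependency. The example below models feature flags as a hypothetical bitset (the `feature` type, its constants, and both helper functions are illustrative assumptions, not the library's code) and shows why requiring AVX512DQ instead of AVX512VL would admit a CPU that cannot run 256-bit EVEX-encoded instructions.

```go
package main

import "fmt"

// feature is a hypothetical stand-in for a CPU feature flag. Real code would
// query the CPU, for example through the cpuid package named in the comment.
type feature uint8

const (
	avx512F feature = 1 << iota
	avx512DQ
	avx512VL
	gfni
)

// supports reports whether every requested flag is present in have.
func supports(have feature, want ...feature) bool {
	for _, w := range want {
		if have&w == 0 {
			return false
		}
	}
	return true
}

// gfniAVX512Enabled mirrors the suggested gate: the 256-bit Y-register forms
// need AVX512VL, so that flag is required rather than AVX512DQ.
func gfniAVX512Enabled(have feature) bool {
	return supports(have, avx512F, gfni, avx512VL)
}

func main() {
	noVL := avx512F | avx512DQ | gfni // hypothetical CPU lacking AVX512VL
	withVL := avx512F | avx512VL | gfni
	fmt.Println(gfniAVX512Enabled(noVL), gfniAVX512Enabled(withVL))
}
```

A gate that checked AVX512DQ would return true for the first CPU in `main`, and the Y-register path could then fault at runtime.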

🧹 Nitpick comments (2)

galois_gen_nopshufb_amd64.s (1)

69480-69480: Function naming is misleading.

These functions are named *_gfni_avx512_7 but only require AVX and AVX2, with no GFNI instructions in the implementation. Consider renaming to clarify they are non-GFNI fallback variants.

Also applies to: 69534-69534

galois_gen_amd64.s (1)

126994-126996: Dead load of table01.

Line 126994 loads table01+32(FP) into AX which is immediately overwritten on line 126995. This variant doesn't use table01, so the load is harmless but redundant. Not a bug - the generated code pattern likely keeps parameter handling consistent across variants.


Wondertan · 2026-01-21T14:43:49Z

Curious to see how this affected benchmarks for gf16 after #317.

klauspost · 2026-01-21T14:56:41Z

@Wondertan It builds on top of that. I didn't see a consistent difference, so some other factor may be the limit elsewhere.

But this can easily be faster on other microarchitectures (I am currently testing on a Zen4). There shouldn't be any scenario where it is slower, and it has the potential to be faster since it uses fewer ops and fewer memory reads.

klauspost · 2026-01-21T16:21:03Z

@Wondertan I think #322 is more impactful. This is mainly just cleanup.

Add simpler (but faster) AVX512 GFNI path

0f9694a

For AVX512 simply use the extra registers and always use `VPTERNLOGD` independent of compilation settings.

klauspost requested a review from Copilot January 20, 2026 11:35

Copilot started reviewing on behalf of klauspost January 20, 2026 11:35

coderabbitai bot reviewed Jan 20, 2026


Check useAVX512 as well to check for AVX512VL

577f030

klauspost merged commit c9386ba into master Jan 20, 2026
17 checks passed

klauspost deleted the avx512-gfni-leopard branch January 20, 2026 12:41

coderabbitai bot mentioned this pull request Jan 21, 2026

Remove copying+zeroing from gf16 on amd64 #322

Merged

klauspost changed the title ~~Add simpler (but faster) AVX512 GFNI path~~ leo16: Add simpler (but faster) AVX512 GFNI path Feb 2, 2026

coderabbitai bot mentioned this pull request Feb 3, 2026

Fix SIGILL on non-avx512 GFNI, leopard16 #325

Merged

klauspost merged 2 commits into master from avx512-gfni-leopard
