Skip to content

Remove copying+zeroing from gf16 on amd64#322

Merged
klauspost merged 2 commits intomasterfrom
gf16-nocopy
Jan 22, 2026
Merged

Remove copying+zeroing from gf16 on amd64#322
klauspost merged 2 commits intomasterfrom
gf16-nocopy

Conversation

@klauspost
Copy link
Owner

@klauspost klauspost commented Jan 21, 2026

Similar to #321 but on GF16.

Before:
λ benchmark -k 1000 -m 200 -size=1000000 -codec leopard16
Benchmarking 1 block(s) of 1000 data (K) and 200 parity shards (M), each 1024 bytes using 16 threads. Total 1228800 bytes.

 * Encoded 122856 MiB in 10s. Speed: 12285.07 MiB/s (1000+200:1024)
 * Repaired 11271 MiB in 10.001s. Speed: 1127.01 MiB/s (1000+200:1024)

After:
λ benchmark -k 1000 -m 200 -size=1000000 -codec leopard16
Benchmarking 1 block(s) of 1000 data (K) and 200 parity shards (M), each 1024 bytes using 16 threads. Total 1228800 bytes.

 * Encoded 141020 MiB in 10s. Speed: 14101.95 MiB/s (1000+200:1024)
 * Repaired 10418 MiB in 10.001s. Speed: 1041.70 MiB/s (1000+200:1024)

Summary by CodeRabbit

  • New Features

    • Introduced optimized buffer pooling for improved memory management in encoding operations.
    • Enhanced FFT encoding performance with accelerated computation paths for the initial transformation layer.
  • Chores

    • Added internal architectural variants to support optimized FFT butterfly operations across different hardware configurations.

✏️ Tip: You can customize this high-level summary in your review settings.

Similar to #321 but on GF16.

```
Before:
λ benchmark -k 1000 -m 200 -size=1000000 -codec leopard16
Benchmarking 1 block(s) of 1000 data (K) and 200 parity shards (M), each 1024 bytes using 16 threads. Total 1228800 bytes.

 * Encoded 122856 MiB in 10s. Speed: 12285.07 MiB/s (1000+200:1024)
 * Repaired 11271 MiB in 10.001s. Speed: 1127.01 MiB/s (1000+200:1024)

After:
λ benchmark -k 1000 -m 200 -size=1000000 -codec leopard16
Benchmarking 1 block(s) of 1000 data (K) and 200 parity shards (M), each 1024 bytes using 16 threads. Total 1228800 bytes.

 * Encoded 141020 MiB in 10s. Speed: 14101.95 MiB/s (1000+200:1024)
 * Repaired 10418 MiB in 10.001s. Speed: 1041.70 MiB/s (1000+200:1024)
```
@coderabbitai
Copy link

coderabbitai bot commented Jan 21, 2026

📝 Walkthrough

Walkthrough

This PR adds destination-based variants of the 4-way inverse FFT butterfly operation (ifftDIT4Dst) with corresponding SIMD assembly implementations across multiple architectures (AVX-512, GFNI, AVX2). The encoder integrates these variants with a zero-buffer pool to fuse input copying with the first FFT layer, reducing intermediate allocations.

Changes

Cohort / File(s) Summary
Architecture-specific wrapper implementations
galois_amd64.go, galois_arm64.go, galois_noasm.go, galois_nopshufb_amd64.go, galois_ppc64le.go
Added non-exported ifftDIT4Dst wrapper function that mirrors existing ifftDIT4 but directs output to a separate dst parameter. Wrappers dispatch to reference implementations (ifftDIT4DstRef) or delegate through architecture-specific paths.
Assembly entry point declarations
galois_gen_amd64.go, galois_gen_nopshufb_amd64.go
Added public function declarations for destination-based assembly variants: 8 ifftDIT4_avx512_dst_* (with [128]uint8 tables), 8 ifftDIT4_gfni_dst_* (with [4]uint64 tables), and 8 ifftDIT4_gfni_avx512_dst_* variants.
Encoder integration and buffer pooling
leopard.go
Integrated ifftDIT4Dst into ifftDITEncoder with a 1MB zero-buffer pool and conditional SIMD-accelerated path. First FFT layer now fuses input copying with butterfly computation via ifftDIT4Dst when appropriate; fallback path preserves original behavior for non-SIMD systems or oversized shards.

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: optimizing GF16 on amd64 by removing copying and zeroing operations during the first FFT layer via destination-based butterfly variants.
Docstring Coverage ✅ Passed Docstring coverage is 93.15% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@galois_gen_nopshufb_amd64.s`:
- Around line 70418-70419: The code overwrites CX with table02 right after
loading table23 into CX so table23 is never used; to fix, keep the MOVQ that
loads table23 into CX and use CX to perform the required broadcasts into Y20–Y23
(add the VBROADCAST/VPBROADCAST sequences that populate Y20, Y21, Y22, Y23 from
CX), and change the subsequent MOVQ that loads table02+72(FP) so it uses a
different scratch register (or delay that load until after table23 is broadcast)
so table23 is not clobbered.
- Around line 70762-70763: The MOVQ sequence overwrites CX (loading table23 then
immediately replacing it with table02) so table23/table02 are never used; change
the second load to use DX (MOVQ table02+72(FP), DX), keep table23 in CX, then
add broadcasts of table23 from CX into Y20–Y23 and broadcasts of table02 from DX
into Y0–Y3, and insert the corresponding GFNI table lookup/XOR operations that
consume Y20–Y23 and Y0–Y3 where the function currently only uses table01; ensure
the new DX-based loads and GFNI uses mirror the existing pattern used for
table01 so registers aren’t clobbered.
- Around line 69841-69843: The code currently loads table01, table23, and
table02 into AX consecutively, overwriting AX and leaving table01/table23
unused; fix by loading each table parameter into its own register (e.g., use AX
for table01, CX for table23, DX for table02 or other free registers) in the
sequence around MOVQ table01+56(FP), MOVQ table23+64(FP), MOVQ table02+72(FP)
and then insert the missing GFNI operations that consume the registers holding
table01 and table23 (the GFNI instructions that should operate on the values
from the registers you assign) so all three tables are actually used instead of
only table02.
- Around line 70685-70693: The sequence overwrites register AX (table01) with
table23 and clobbers CX before table02 is used, so table01 and table02 are never
broadcast; fix by preserving each parameter in its own register: keep table01 in
AX and emit VBROADCASTSD using (AX) / 8(AX) / 16(AX) / 24(AX) as needed, load
table23 into a different register (e.g., DX) instead of AX and use VBROADCASTSD
on that register, and ensure table02 is loaded into a free register (e.g., CX or
another unused register) before being overwritten and then broadcast with
VBROADCASTSD; update the MOVQ and VBROADCASTSD lines around the symbols table01,
table23, and table02 accordingly so all three tables are actually used.
- Around line 70513-70515: The three MOVQ lines overwrite AX twice so table01
and table23 are never actually used; change the loads to move table01 and
table23 into their own registers (not AX) and apply the same broadcast sequence
used for table02 to those registers (i.e., load table01 into a separate
register, broadcast it, load table23 into another separate register, broadcast
it) so all three tables are preserved and broadcasted before use.
- Around line 70321-70322: The load of table01 into AX is immediately clobbered
by the next MOVQ that loads table23 into AX, so table01 is never used; fix by
preserving or broadcasting table01 before AX is overwritten — either perform the
required broadcast(s) from AX right after MOVQ table01+56(FP), AX, or load
table01 into a separate register (e.g., CX) and broadcast from that register,
then proceed to load table23 into AX; ensure any subsequent XMM broadcast
instructions reference the preserved/broadcasted table01 data rather than
relying on AX after it is overwritten.
- Around line 70005-70008: The function currently overwrites AX and CX so
table01 and table02 are never used (MOVQ table01+56(FP), AX then MOVQ
table23+64(FP), AX; MOVQ table02+72(FP), CX then MOVQ dist+48(FP), CX); fix by
loading table01 into AX and table23 into CX (do not reuse AX for table23), load
dist into a separate temporary register instead of CX, and load table02 into
vector registers Y0–Y3 so the GFNI operations can reference AX for table01, CX
for table23, and Y0–Y3 for table02; update the GFNI instruction operands
accordingly to use AX/CX/Y0–Y3 and remove the overwrites.
- Around line 69657-69659: The code loads table01 into AX then immediately
overwrites AX with table23, so table01 is never used; fix by loading table23
into CX (not AX) and ensure table02 is moved into a non-conflicting register
(e.g., change the MOVQ that currently writes table02+72(FP) into CX to use a
different register such as DX), then update all GFNI instructions that were
using AX for table23 to use CX instead so table01 remains in AX, table23 is in
CX, and table02 is in the chosen spare register. Use the symbols table01,
table23, table02, AX, CX, DX and the GFNI operation sites to locate and change
the loads and operand registers.
- Around line 70078-70080: The three MOVQ instructions are clobbering CX
repeatedly (MOVQ table23+64(FP), CX; MOVQ table02+72(FP), CX; MOVQ dist+48(FP),
CX), so table23 and table02 are never preserved or used; fix by loading each
table pointer into its own register (e.g., load table23 into a free register
such as DX or R8, table02 into another like SI or R9) instead of reusing CX,
then implement the missing GFNI operations that consume table23 and table02
using those registers, keeping CX for the dist value (or vice versa) and
ensuring existing usage of table01 remains unchanged (search for uses of
table01, table23, table02, dist and the MOVQ lines to update the loads and
insert the GFNI instructions).
- Around line 69750-69751: The code loads table23 into register CX then
immediately overwrites CX with table02, so the table23 parameter is never used;
modify the sequence so CX retains table23 (MOVQ table23+64(FP), CX) and perform
the intended GFNI instructions that use CX for table23 before loading table02
into CX (MOVQ table02+72(FP), CX), ensuring any subsequent GFNI ops reference CX
(for table23) and then reload CX when you need table02; update the GFNI usage
sites where table23 was meant to be applied to explicitly operate on CX after
the first MOVQ.
- Around line 69927-69928: The code loads table02 into DX then immediately
overwrites DX with dist (MOVQ table02+72(FP), DX followed by MOVQ dist+48(FP),
DX), so table02 is never used; fix by preserving table02 (load it into a
different register or push it into YMM/XMM/Y0-Y3 as suggested) before loading
dist, then implement the GFNI table02 multiplication using Y0-Y3 registers (or
the chosen preserved registers) so the table02 data is actually consumed by the
GFNI instructions; ensure subsequent uses reference the preserved register(s)
instead of DX so dist and table02 are both available.
- Around line 70597-70610: The table02 load into DX is being clobbered by the
later MOVQ 8(CX), DX before using table02 for VBROADCASTSD, so restore the
table02 usage: either move the MOVQ 8(CX), DX instruction until after you finish
broadcasting from DX (the VBROADCASTSD that should target Y16–Y19 for table02),
or load 8(CX) into a different spare register (e.g., R8/R9) instead of DX;
ensure the VBROADCASTSD sequence actually reads from (DX) for table02 and then
add the GFNI instructions that operate on Y0–Y3 as proposed.
🧹 Nitpick comments (1)
galois_gen_nopshufb_amd64.s (1)

70837-70840: Table parameters are loaded but never used.

Lines 70837-70839 load three table parameters into AX sequentially, each overwriting the previous value. None of these tables are referenced in the function. The dist parameter (line 70840) is used for pointer arithmetic, but the table pointers are discarded. This matches the identical pattern in ifftDIT4_gfni_dst_7, suggesting both dst_7 variants are intentionally designed without table lookups. The unused loads are harmless but can be removed if the API signature is simplified.

@klauspost klauspost merged commit 361ccd4 into master Jan 22, 2026
16 checks passed
@klauspost klauspost deleted the gf16-nocopy branch January 22, 2026 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant