Implement "real" AVX2 intrinsics and clean up x86 codegen #115
Catches a silly bug in the Intel `simd_eq` implementation.
I don't really have the domain knowledge to validate all of the logic in here, but it's good to see more testing. I've read through the code and pointed at what I can see which seems suspect.
Hopefully we can discuss at office hours, and see if anyone else is interested in reviewing this. But I'd be happy landing this by the end of this week if we don't get other review; it can always be reviewed post-merge.
It might be worth also running Vello's tests with this version (would it make sense to also run the benchmarks?)
```rust
let acceptable_wide_op = matches!(method, "load_interleaved_128")
    || matches!(method, "store_interleaved_128");
```
Just want to check that these don't need to be `load_interleaved_256`.
I believe there is no `load_interleaved_256`. The name `load_interleaved_128` is a bit confusing since it's actually performing a 512-bit load (aka 64 bytes); not sure where 128 comes in.
👍 I thought it was probably right - I was just playing a bit of "spot the difference" with the SSE4.2 version
Yeah, perhaps not the best name; the 128 was because it's basically interleaving in steps of 4.
I probably wouldn't have time to review this more carefully until next week, but as long as current vello_cpu works fine with those changes I would also be fine merging this with a cursory review. :)

All the Vello tests seem to pass! Updating Vello to use the new …

The tests of `vello_sparse_tests` should run with AVX2 as well in CI, I think.

I ran …
As discussed in #office hours > Renderer 2025-11-12, I think we're happy to semi-optimistically land this.
It doesn't change public API, all the tests pass, and it also passes Vello's tests. I've not carefully reviewed the codegen changes, however. For the sake of unblocking the stacked work, though, I think landing it early is worthwhile; we can always do a post-hoc review.
(If this isn't an accurate outcome from the meeting yesterday, let me know)
I'll go ahead and merge this since the existing tests, my new tests, and the Vello tests all pass. The current x86 code is a bit dodgy anyway (for example, equality comparisons being broken), and I think this PR is an improvement. This should unblock a fair amount of stuff.
This builds on top of #115. There are no functional changes to the generated code (besides what #115 does), but it cleans up the `fearless_simd_gen` code:

- The `Arch` trait has been removed. It operated at the wrong level of abstraction: it makes no sense to call e.g. `mk_avx2::make_method` with any `Arch` implementation other than `X86`.
- Many code generation functions in the AVX2 and SSE4.2 modules used to take the vector type along with its scalar and total bit widths. The former determines the latter, so we can stop passing all three and just pass the vector type.
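The width-redundancy point can be illustrated with a small sketch. The type and field names here are hypothetical, not the actual `fearless_simd_gen` types: once a helper has the vector type, both widths are derivable from it.

```rust
// Hypothetical sketch: a vector-type description from which the scalar
// and total bit widths can both be derived, so codegen helpers need
// only the type itself rather than all three values.
#[derive(Clone, Copy)]
struct VecType {
    scalar_bits: u32, // e.g. 8 for u8 lanes
    len: u32,         // e.g. 16 lanes
}

impl VecType {
    fn total_bits(self) -> u32 {
        self.scalar_bits * self.len
    }
}

fn main() {
    let u8x16 = VecType { scalar_bits: 8, len: 16 };
    assert_eq!(u8x16.total_bits(), 128);
}
```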
Resolves #114.
This may be best reviewed one commit at a time; one of them moves a lot of stuff around.
This PR updates the x86 codegen to use actual AVX2 intrinsics (the ones starting with `_mm256`). This is mostly straightforward, but there are a few operations that require special attention. I've included some other x86 codegen fixes and improvements that are somewhat interwoven:

I've added tests for several operations that were previously untested. Mainly these are 256-bit zip/unzip, widen/narrow, split/combine, and integer equality comparisons. Note that these test cases were generated by Claude.
The x86 codegen now actually generates the correct code for integer equality comparisons. Previously, it incorrectly generated "greater than" comparisons instead.
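As a portable illustration of why that bug matters (a plain-Rust model of the lane-wise semantics, not the generated intrinsic code): an equality comparison must produce an all-ones lane exactly where the inputs match, and a greater-than comparison produces a different mask.

```rust
// Plain-Rust models of lane-wise integer comparisons. Each lane of the
// result is all-ones (-1) where the predicate holds, else zero.
fn simd_eq(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    core::array::from_fn(|i| if a[i] == b[i] { -1 } else { 0 })
}

fn simd_gt(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    core::array::from_fn(|i| if a[i] > b[i] { -1 } else { 0 })
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [1, 0, 3, 9]);
    // The two predicates give different masks, so emitting the wrong
    // comparison silently corrupts anything built on simd_eq.
    assert_eq!(simd_eq(a, b), [-1, 0, -1, 0]);
    assert_eq!(simd_gt(a, b), [0, -1, 0, 0]);
}
```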
It also now uses the `blendv` family for "select" operations. Intel's manual says these are available starting in SSE4.1. Not sure if there's a reason this wasn't done before.

For SSE4.2-level unzip operations, I've changed the codegen.
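For reference, the byte-wise select semantics of the `blendv` family can be modeled in plain Rust as follows (this is a sketch of the documented behavior of `_mm_blendv_epi8`, not the generated code): each output byte is taken from `b` when the corresponding mask byte has its high bit set.

```rust
// Plain-Rust model of SSE4.1 _mm_blendv_epi8: select b[i] where the
// mask byte's high bit is set, otherwise a[i].
fn blendv_epi8(a: [u8; 16], b: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| if mask[i] & 0x80 != 0 { b[i] } else { a[i] })
}

fn main() {
    let a = [1u8; 16];
    let b = [2u8; 16];
    assert_eq!(blendv_epi8(a, b, [0xFF; 16]), [2u8; 16]);
    assert_eq!(blendv_epi8(a, b, [0x00; 16]), [1u8; 16]);
    // Only the high bit of each mask byte matters.
    assert_eq!(blendv_epi8(a, b, [0x7F; 16]), [1u8; 16]);
}
```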
Previously, for `unzip_low`, it would shuffle the inputs to put the even-indexed elements in both the lower and upper halves of the values, then use `unpacklo` to select just the lower halves. Likewise, for `unzip_high`, it would shuffle the inputs to put the odd-indexed elements in both halves, and use `unpacklo` once more.

I've changed this so that `unzip_low` and `unzip_high` both use a shuffle operation that moves the even-indexed elements into the lower halves and the odd-indexed elements into the upper halves. `unzip_low` uses `unpacklo` to select the lower halves, and `unzip_high` uses `unpackhi` to select the upper halves. This means that if the user calls both `unzip_low` and `unzip_high`, the shuffle operation's result can be shared.

I've implemented 8-bit multiplication based on this StackOverflow answer.
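A portable model of the result such an 8-bit multiply must produce (wrapping semantics). This models only the outcome, not the intrinsic sequence from the linked answer, which widens to 16 bits and recombines:

```rust
// Plain-Rust model of lane-wise 8-bit multiplication with wrapping
// semantics: multiply in 16 bits, then keep only the low byte.
fn mul_u8x16(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| (a[i] as u16 * b[i] as u16) as u8)
}

fn main() {
    assert_eq!(mul_u8x16([10; 16], [10; 16]), [100u8; 16]);
    // 200 * 3 = 600 = 0x258, whose low byte is 0x58 = 88.
    assert_eq!(mul_u8x16([200; 16], [3; 16]), [88u8; 16]);
}
```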
On the AVX2 side, most existing 128-bit operations have a straightforward 256-bit counterpart, but some are more involved:
The zip/unzip operations are a bit more complicated, since most AVX2 swizzle operations operate within each 128-bit lane. For 32-bit and larger operations, there are special "lane-crossing" shuffles we can use instead. Operations on smaller scalars require a combination of intra-lane and "lane-crossing" shuffles.
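A plain-Rust sketch of the 32-bit case (modeling the kind of lane-crossing permutation AVX2 provides, e.g. via `_mm256_permutevar8x32_epi32`, rather than the generated code itself): a single shuffle separates even- and odd-indexed elements, and its result can serve both `unzip_low` and `unzip_high`.

```rust
// One lane-crossing shuffle moves even indices into the low half and
// odd indices into the high half of an 8-element vector.
fn even_odd_shuffle(v: [u32; 8]) -> [u32; 8] {
    const IDX: [usize; 8] = [0, 2, 4, 6, 1, 3, 5, 7];
    core::array::from_fn(|i| v[IDX[i]])
}

// unzip_low keeps the even-indexed elements of (a, b); unzip_high keeps
// the odd-indexed ones. Both reuse the same shuffled intermediates.
fn unzip_low(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let (sa, sb) = (even_odd_shuffle(a), even_odd_shuffle(b));
    core::array::from_fn(|i| if i < 4 { sa[i] } else { sb[i - 4] })
}

fn unzip_high(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let (sa, sb) = (even_odd_shuffle(a), even_odd_shuffle(b));
    core::array::from_fn(|i| if i < 4 { sa[i + 4] } else { sb[i] })
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7];
    let b = [8, 9, 10, 11, 12, 13, 14, 15];
    assert_eq!(unzip_low(a, b), [0, 2, 4, 6, 8, 10, 12, 14]);
    assert_eq!(unzip_high(a, b), [1, 3, 5, 7, 9, 11, 13, 15]);
}
```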
Splitting a 256-bit vector to a 128-bit one, or combining two 128-bit vectors into a 256-bit one, can be done directly with AVX2 intrinsics.
Widen/narrow operations can be done a bit more efficiently in AVX2. Widening a u8x16 to a u16x16 can be done with a single `_mm256_cvtepu8_epi16`. Narrowing a u16x16 to a u8x16 can be done with two shuffles: one to extract the lower bits of each 16-bit value within each 128-bit lane, and one to combine the two lanes.

I've consolidated much of the x86 codegen from `x86_common.rs`, `arch/avx2.rs`, `arch/sse4_2.rs`, and `arch/x86_common.rs` into a single `arch/x86.rs` file. I did this in the middle of some other commits; sorry! The main AVX2 codegen was implemented before the reorganization, but the split/combine and widen/narrow ops were implemented afterwards.

In the future, I'd like to rework and tidy up the codegen a bit more. For instance, we're passing in things like vector types' widths alongside those very same vector types, which is redundant. The `Arch` trait is also very much not pulling its weight.
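The widen/narrow semantics mentioned above can be modeled in plain Rust (these sketches model the results, not the shuffle-based implementation):

```rust
// Zero-extending widen, as _mm256_cvtepu8_epi16 produces.
fn widen_u8x16(v: [u8; 16]) -> [u16; 16] {
    core::array::from_fn(|i| v[i] as u16)
}

// Truncating narrow: keep the low byte of each 16-bit element (the
// AVX2 version does this with one in-lane shuffle plus one
// lane-combining shuffle).
fn narrow_u16x16(v: [u16; 16]) -> [u8; 16] {
    core::array::from_fn(|i| v[i] as u8)
}

fn main() {
    assert_eq!(widen_u8x16([0xAB; 16]), [0x00ABu16; 16]);
    assert_eq!(narrow_u16x16([0x1234; 16]), [0x34u8; 16]);
}
```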