
Conversation

@valadaptive
Contributor

@valadaptive valadaptive commented Nov 12, 2025

Resolves #114.

This may be best reviewed one commit at a time; one of them moves a lot of stuff around.

This PR updates the x86 codegen to use actual AVX2 intrinsics (the ones starting with _mm256). This is mostly straightforward, but there are a few operations that require special attention. I've included some other x86 codegen fixes and improvements that are somewhat interwoven:

  • I've added tests for several operations that were previously untested. Mainly these are 256-bit zip/unzip, widen/narrow, split/combine, and integer equality comparisons. Note that these test cases were generated by Claude.

  • The x86 codegen now actually generates the correct code for integer equality comparisons. Previously, it incorrectly generated "greater than" comparisons instead.

  • It also now uses the blendv family for "select" operations. Intel's manual says these are available starting in SSE4.1. Not sure if there's a reason this wasn't done before.

  • For SSE4.2-level unzip operations, I've changed the codegen.

    Previously, for unzip_low, it would shuffle the inputs to put the even-indexed elements in both the lower and upper halves of the values, then use unpacklo to select just the lower halves. Likewise, for unzip_high, it would shuffle the inputs to put the odd-indexed elements in both halves, and use unpacklo once more.

    I've changed this so that unzip_low and unzip_high both use a shuffle operation that moves the even-indexed elements into the lower halves and the odd-indexed elements into the upper halves. unzip_low uses unpacklo to select the lower halves, and unzip_high uses unpackhi to select the upper halves. This means that if the user calls both unzip_low and unzip_high, the shuffle operation's result can be shared.

  • I've implemented 8-bit multiplication based on this StackOverflow answer.
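As a reference for the select semantics mentioned above, here's a scalar sketch of what the blendv family computes (the function name is just for illustration, not anything in the codebase): each output byte comes from the second operand when the corresponding mask byte's high bit is set, otherwise from the first.

```rust
/// Scalar model of `_mm_blendv_epi8` semantics: per byte, pick `b` when the
/// mask byte's high bit is set, otherwise pick `a`.
fn blendv_epi8_model(a: [u8; 16], b: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if mask[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    out
}

fn main() {
    // Comparison results on x86 are all-ones (0xFF) or all-zero bytes, so
    // testing only the high bit is sufficient for select.
    assert_eq!(blendv_epi8_model([1; 16], [2; 16], [0xFF; 16]), [2; 16]);
    assert_eq!(blendv_epi8_model([1; 16], [2; 16], [0x00; 16]), [1; 16]);
    println!("ok");
}
```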
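To make the unzip sharing concrete, here's a scalar sketch of the new scheme on a 4-element vector (illustrative only, not the generated intrinsics): both unzips start from the same even/odd split, so when a user calls both, that shuffle result can be reused.

```rust
/// Shared shuffle: even-indexed elements to the low half, odd-indexed
/// elements to the high half.
fn even_odd_split(v: [u32; 4]) -> [u32; 4] {
    [v[0], v[2], v[1], v[3]]
}

/// unzip_low: take the low halves of both shuffled inputs (what a
/// 64-bit-granularity unpacklo does).
fn unzip_low(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let (a, b) = (even_odd_split(a), even_odd_split(b));
    [a[0], a[1], b[0], b[1]]
}

/// unzip_high: take the high halves instead (unpackhi).
fn unzip_high(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let (a, b) = (even_odd_split(a), even_odd_split(b));
    [a[2], a[3], b[2], b[3]]
}

fn main() {
    let a = [0, 1, 2, 3];
    let b = [4, 5, 6, 7];
    assert_eq!(unzip_low(a, b), [0, 2, 4, 6]); // even-indexed elements
    assert_eq!(unzip_high(a, b), [1, 3, 5, 7]); // odd-indexed elements
}
```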
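For context on the 8-bit multiply: x86 has a 16-bit `mullo` but no 8-bit one, so the usual trick is to multiply the even and odd bytes of each 16-bit lane separately and recombine the low bytes of each product. A scalar sketch of that arithmetic (function name is illustrative):

```rust
/// Scalar model of the 8-bit multiply trick: each u16 lane holds two packed
/// bytes. Multiply the even (low) and odd (high) bytes separately with
/// 16-bit multiplies, keep the low byte of each product, and recombine.
fn mul_u8_pairs(a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    let mut out = [0u16; 8];
    for i in 0..8 {
        // Even bytes sit in the low half of each 16-bit lane; the 16-bit
        // product's low byte is the wrapped 8-bit product.
        let even = (a[i] & 0xFF).wrapping_mul(b[i] & 0xFF) & 0xFF;
        // Odd bytes: shift down, multiply, then shift the low byte back up.
        let odd = ((a[i] >> 8).wrapping_mul(b[i] >> 8) & 0xFF) << 8;
        out[i] = even | odd;
    }
    out
}

fn main() {
    // Each lane packs bytes (low = 200, high = 7); multiply by (3, 40).
    let a: [u16; 8] = [200 | (7 << 8); 8];
    let b: [u16; 8] = [3 | (40 << 8); 8];
    let r = mul_u8_pairs(a, b);
    assert_eq!(r[0] & 0xFF, (200u16 * 3) & 0xFF); // 600 wraps to 88
    assert_eq!(r[0] >> 8, (7u16 * 40) & 0xFF); // 280 wraps to 24
}
```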

On the AVX2 side, most existing 128-bit operations have a straightforward 256-bit counterpart, but some are more involved:

  • The zip/unzip operations are a bit more complicated, since most AVX2 swizzle operations operate within each 128-bit lane. For 32-bit and larger operations, there are special "lane-crossing" shuffles we can use instead. Operations on smaller scalars require a combination of intra-lane and "lane-crossing" shuffles.

  • Splitting a 256-bit vector to a 128-bit one, or combining two 128-bit vectors into a 256-bit one, can be done directly with AVX2 intrinsics.

  • Widen/narrow operations can be done a bit more efficiently in AVX2. Widening a u8x16 to a u16x16 can be done with a single _mm256_cvtepu8_epi16. Narrowing a u16x16 to a u8x16 can be done with two shuffles: one to extract the low byte of each 16-bit value within each 128-bit lane, and one to combine the two lanes.
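To illustrate the lane issue with a scalar sketch (names illustrative, not the generated code): AVX2's `unpacklo` interleaves within each 128-bit lane independently, which is not the same as a full-width zip, so a lane-crossing permute is needed to fix up the result.

```rust
/// Model of `_mm256_unpacklo_epi32`: interleaves the low two elements of
/// each 128-bit lane (four u32s) independently.
fn unpacklo_epi32(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    [
        a[0], b[0], a[1], b[1], // low 128-bit lane
        a[4], b[4], a[5], b[5], // high 128-bit lane
    ]
}

/// What zip_low should produce: interleave the low halves of the full
/// 256-bit vectors.
fn zip_low(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    [a[0], b[0], a[1], b[1], a[2], b[2], a[3], b[3]]
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7];
    let b = [10, 11, 12, 13, 14, 15, 16, 17];
    // The per-lane unpack pulls elements 4 and 5 into the result, which a
    // true zip_low must not do.
    assert_ne!(unpacklo_epi32(a, b), zip_low(a, b));
    assert_eq!(zip_low(a, b), [0, 10, 1, 11, 2, 12, 3, 13]);
}
```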
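A scalar sketch of those widen/narrow semantics (names illustrative): widening zero-extends each byte, and the two-shuffle narrow described above keeps the low byte of each 16-bit lane.

```rust
/// Widen: zero-extend each u8 to a u16 (what `_mm256_cvtepu8_epi16` does).
fn widen_u8x16(v: [u8; 16]) -> [u16; 16] {
    let mut out = [0u16; 16];
    for i in 0..16 {
        out[i] = v[i] as u16;
    }
    out
}

/// Narrow: keep the low byte of each u16, which is what the two-shuffle
/// sequence (extract low bytes per lane, then combine lanes) computes.
fn narrow_u16x16(v: [u16; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = v[i] as u8;
    }
    out
}

fn main() {
    let v: [u8; 16] = core::array::from_fn(|i| (i as u8) * 17);
    // The round-trip is lossless since widening is a zero-extension.
    assert_eq!(narrow_u16x16(widen_u8x16(v)), v);
    // Narrowing truncates: 0x1234 keeps only 0x34.
    assert_eq!(narrow_u16x16([0x1234; 16])[0], 0x34);
}
```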

I've consolidated much of the x86 codegen from x86_common.rs, arch/avx2.rs, arch/sse4_2.rs, and arch/x86_common.rs into a single arch/x86.rs file. I did this in the middle of some other commits; sorry! The main AVX2 codegen was implemented before the reorganization, but the split/combine and widen/narrow ops were implemented afterwards.

In the future, I'd like to rework and tidy up the codegen a bit more. For instance, we're passing in things like vector types' widths alongside those very same vector types, which is redundant. The Arch trait is also very much not pulling its weight.

Member

@DJMcNab DJMcNab left a comment


I don't really have the domain knowledge to validate all of the logic in here, but it's good to see more testing. I've read through the code and pointed at what I can see which seems suspect.
Hopefully we can discuss at office hours, and see if anyone else is interested in reviewing this. But I'd be happy landing this by the end of this week if we don't get other review; it can always be reviewed post-merge.

It might be worth also running Vello's tests with this version (would it make sense to also run the benchmarks?)

Comment on lines +82 to +83
let acceptable_wide_op = matches!(method, "load_interleaved_128")
|| matches!(method, "store_interleaved_128");
Member


Just want to check that these don't need to be load_interleaved_256

Contributor Author


I believe there is no load_interleaved_256. The name load_interleaved_128 is a bit confusing since it's actually performing a 512-bit load (aka 64 bytes); not sure where 128 comes in.

Member


👍 I thought it was probably right - I was just playing a bit of "spot the difference" with the sse4.2 version

Collaborator


Yeah, perhaps not the best name; the 128 was because it's basically interleaving in steps of 4.

@LaurenzV
Collaborator

I probably wouldn't have time to review this more carefully until next week, but as long as current vello_cpu works fine with those changes I would also be fine merging this with a cursory review. :)

@valadaptive
Contributor Author

All the Vello tests seem to pass! Updating Vello to use the new Level API is a bit tricky, and I don't know if all the tests are being run with AVX2 (enabling RUSTFLAGS="-Ctarget-cpu=x86-64-v3" panics with "hybrid renderer doesn't support SIMD"), but a cursory look suggests everything works properly.

@LaurenzV
Collaborator

The tests of vello_sparse_tests should run with AVX2 as well in CI, I think.

@valadaptive
Contributor Author

I ran cargo test --workspace --release in Vello and everything passes. I was a bit confused because some AVX2 tests appeared to be missing, but that's just a consequence of cargo test's output being shuffled around a bit. I've made my tentative Vello updates public at linebender/vello#1288; I'll just wait for the CI now.

Member

@DJMcNab DJMcNab left a comment


As discussed in #office hours > Renderer 2025-11-12, I think we're happy to semi-optimistically land this.

It doesn't change public API, all the tests pass, and it also passes Vello's tests. However, I've not carefully reviewed the codegen changes. For the sake of unblocking the stacked work, I think landing it early is worthwhile; we can always do a post-hoc review.

(If this isn't an accurate outcome from the meeting yesterday, let me know)

@valadaptive
Contributor Author

I'll go ahead and merge this since the existing tests, my new tests, and the Vello tests all pass. The current x86 code is a bit dodgy anyway (for example, equality comparisons being broken), and I think this PR is an improvement.

This should unblock a fair amount of stuff.

@valadaptive valadaptive added this pull request to the merge queue Nov 14, 2025
Merged via the queue into linebender:main with commit 2425ecd Nov 14, 2025
18 checks passed
@valadaptive valadaptive deleted the more-avx2 branch November 14, 2025 00:31
github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2025
This builds on top of #115. There are no functional changes to the
generated code (besides what #115 does), but cleans up the
`fearless_simd_gen` code:

- The `Arch` trait has been removed. It operated at the wrong level of
abstraction: it makes no sense to call e.g. `mk_avx2::make_method` with
any `Arch` implementation other than `X86`.

- Many code generation functions in the AVX2 and SSE4.2 modules used to
pass in the vector type along with its scalar and total bit widths. The
former provides the latter, so we can stop passing all three in and just
pass in the vector type.
github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2025
Split off from #115 to make review of that PR easier.


Successfully merging this pull request may close these issues.

Actually make use of AVX2's increased lane width
