Implement "real" AVX2 intrinsics and clean up x86 codegen #115
Catches a silly bug in the Intel `simd_eq` implementation.
I don't really have the domain knowledge to validate all of the logic in here, but it's good to see more testing. I've read through the code and pointed at what I can see which seems suspect.
Hopefully we can discuss at office hours, and see if anyone else is interested in reviewing this. But I'd be happy landing this by the end of this week if we don't get other review; it can always be reviewed post-merge.
It might be worth also running Vello's tests with this version (would it make sense to also run the benchmarks?)
```rust
let acceptable_wide_op = matches!(method, "load_interleaved_128")
    || matches!(method, "store_interleaved_128");
```
Just want to check that these don't need to be `load_interleaved_256`.
I believe there is no `load_interleaved_256`. The name `load_interleaved_128` is a bit confusing since it's actually performing a 512-bit load (aka 64 bytes); not sure where 128 comes in.
👍 I thought it was probably right - I was just playing a bit of "spot the difference" with the SSE4.2 version
Yeah, perhaps not the best name; the 128 was because it's basically interleaving in steps of 4.
I probably wouldn't have time to review this more carefully until next week, but as long as current vello_cpu works fine with those changes I would also be fine merging this with a cursory review. :)

All the Vello tests seem to pass! Updating Vello to use the new …

The tests of `vello_sparse_tests` should run with AVX2 as well in CI, I think.

I ran …
As discussed in #office hours > Renderer 2025-11-12, I think we're happy to semi-optimistically land this.
It doesn't change public API, all the tests pass, and it also passes Vello's tests. I've not carefully reviewed the codegen changes, however. For the sake of unblocking the stacked work, though, I think landing it early is worthwhile; we can always do a post-hoc review.
(If this isn't an accurate outcome from the meeting yesterday, let me know)
I'll go ahead and merge this since the existing tests, my new tests, and the Vello tests all pass. The current x86 code is a bit dodgy anyway (for example, equality comparisons being broken), and I think this PR is an improvement. This should unblock a fair amount of stuff.
This builds on top of #115. There are no functional changes to the generated code (besides what #115 does), but it cleans up the `fearless_simd_gen` code:

- The `Arch` trait has been removed. It operated at the wrong level of abstraction: it makes no sense to call e.g. `mk_avx2::make_method` with any `Arch` implementation other than `X86`.
- Many code generation functions in the AVX2 and SSE4.2 modules used to take the vector type along with its scalar and total bit widths. The former determines the latter, so we can stop passing all three and just pass the vector type.
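The width-redundancy point can be illustrated with a small sketch. The type and field names here are hypothetical, not the actual `fearless_simd_gen` types: once a helper has the vector type, both widths are derivable from it.

```rust
// Hypothetical sketch: a vector-type description from which the scalar
// and total bit widths can both be derived, so codegen helpers need
// only the type itself rather than all three values.
#[derive(Clone, Copy)]
struct VecType {
    scalar_bits: u32, // e.g. 8 for u8 lanes
    len: u32,         // e.g. 16 lanes
}

impl VecType {
    fn total_bits(self) -> u32 {
        self.scalar_bits * self.len
    }
}

fn main() {
    let u8x16 = VecType { scalar_bits: 8, len: 16 };
    assert_eq!(u8x16.total_bits(), 128);
}
```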
Resolves #114.
This may be best reviewed one commit at a time; one of them moves a lot of stuff around.
This PR updates the x86 codegen to use actual AVX2 intrinsics (the ones starting with `_mm256`). This is mostly straightforward, but there are a few operations that require special attention. I've included some other x86 codegen fixes and improvements that are somewhat interwoven:

I've added tests for several operations that were previously untested. Mainly these are 256-bit zip/unzip, widen/narrow, split/combine, and integer equality comparisons. Note that these test cases were generated by Claude.
The x86 codegen now actually generates the correct code for integer equality comparisons. Previously, it incorrectly generated "greater than" comparisons instead.
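As a portable illustration of why that bug matters (a plain-Rust model of the lane-wise semantics, not the generated intrinsic code): an equality comparison must produce an all-ones lane exactly where the inputs match, and a greater-than comparison produces a different mask.

```rust
// Plain-Rust models of lane-wise integer comparisons. Each lane of the
// result is all-ones (-1) where the predicate holds, else zero.
fn simd_eq(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    core::array::from_fn(|i| if a[i] == b[i] { -1 } else { 0 })
}

fn simd_gt(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    core::array::from_fn(|i| if a[i] > b[i] { -1 } else { 0 })
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [1, 0, 3, 9]);
    // The two predicates give different masks, so emitting the wrong
    // comparison silently corrupts anything built on simd_eq.
    assert_eq!(simd_eq(a, b), [-1, 0, -1, 0]);
    assert_eq!(simd_gt(a, b), [0, -1, 0, 0]);
}
```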
It also now uses the `blendv` family for "select" operations. Intel's manual says these are available starting in SSE4.1. Not sure if there's a reason this wasn't done before.

For SSE4.2-level unzip operations, I've changed the codegen.
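For reference, the byte-wise select semantics of the `blendv` family can be modeled in plain Rust as follows (this is a sketch of the documented behavior of `_mm_blendv_epi8`, not the generated code): each output byte is taken from `b` when the corresponding mask byte has its high bit set.

```rust
// Plain-Rust model of SSE4.1 _mm_blendv_epi8: select b[i] where the
// mask byte's high bit is set, otherwise a[i].
fn blendv_epi8(a: [u8; 16], b: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| if mask[i] & 0x80 != 0 { b[i] } else { a[i] })
}

fn main() {
    let a = [1u8; 16];
    let b = [2u8; 16];
    assert_eq!(blendv_epi8(a, b, [0xFF; 16]), [2u8; 16]);
    assert_eq!(blendv_epi8(a, b, [0x00; 16]), [1u8; 16]);
    // Only the high bit of each mask byte matters.
    assert_eq!(blendv_epi8(a, b, [0x7F; 16]), [1u8; 16]);
}
```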
Previously, for `unzip_low`, it would shuffle the inputs to put the even-indexed elements in both the lower and upper halves of the values, then use `unpacklo` to select just the lower halves. Likewise, for `unzip_high`, it would shuffle the inputs to put the odd-indexed elements in both halves, and use `unpacklo` once more.

I've changed this so that `unzip_low` and `unzip_high` both use a shuffle operation that moves the even-indexed elements into the lower halves and the odd-indexed elements into the upper halves. `unzip_low` uses `unpacklo` to select the lower halves, and `unzip_high` uses `unpackhi` to select the upper halves. This means that if the user calls both `unzip_low` and `unzip_high`, the shuffle operation's result can be shared.

I've implemented 8-bit multiplication based on this StackOverflow answer.
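A portable model of the result such an 8-bit multiply must produce (wrapping semantics). This models only the outcome, not the intrinsic sequence from the linked answer, which widens to 16 bits and recombines:

```rust
// Plain-Rust model of lane-wise 8-bit multiplication with wrapping
// semantics: multiply in 16 bits, then keep only the low byte.
fn mul_u8x16(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| (a[i] as u16 * b[i] as u16) as u8)
}

fn main() {
    assert_eq!(mul_u8x16([10; 16], [10; 16]), [100u8; 16]);
    // 200 * 3 = 600 = 0x258, whose low byte is 0x58 = 88.
    assert_eq!(mul_u8x16([200; 16], [3; 16]), [88u8; 16]);
}
```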
On the AVX2 side, most existing 128-bit operations have a straightforward 256-bit counterpart, but some are more involved:
The zip/unzip operations are a bit more complicated, since most AVX2 swizzle operations operate within each 128-bit lane. For 32-bit and larger operations, there are special "lane-crossing" shuffles we can use instead. Operations on smaller scalars require a combination of intra-lane and "lane-crossing" shuffles.
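A plain-Rust sketch of the 32-bit case (modeling the kind of lane-crossing permutation AVX2 provides, e.g. via `_mm256_permutevar8x32_epi32`, rather than the generated code itself): a single shuffle separates even- and odd-indexed elements, and its result can serve both `unzip_low` and `unzip_high`.

```rust
// One lane-crossing shuffle moves even indices into the low half and
// odd indices into the high half of an 8-element vector.
fn even_odd_shuffle(v: [u32; 8]) -> [u32; 8] {
    const IDX: [usize; 8] = [0, 2, 4, 6, 1, 3, 5, 7];
    core::array::from_fn(|i| v[IDX[i]])
}

// unzip_low keeps the even-indexed elements of (a, b); unzip_high keeps
// the odd-indexed ones. Both reuse the same shuffled intermediates.
fn unzip_low(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let (sa, sb) = (even_odd_shuffle(a), even_odd_shuffle(b));
    core::array::from_fn(|i| if i < 4 { sa[i] } else { sb[i - 4] })
}

fn unzip_high(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let (sa, sb) = (even_odd_shuffle(a), even_odd_shuffle(b));
    core::array::from_fn(|i| if i < 4 { sa[i + 4] } else { sb[i] })
}

fn main() {
    let a = [0, 1, 2, 3, 4, 5, 6, 7];
    let b = [8, 9, 10, 11, 12, 13, 14, 15];
    assert_eq!(unzip_low(a, b), [0, 2, 4, 6, 8, 10, 12, 14]);
    assert_eq!(unzip_high(a, b), [1, 3, 5, 7, 9, 11, 13, 15]);
}
```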
Splitting a 256-bit vector to a 128-bit one, or combining two 128-bit vectors into a 256-bit one, can be done directly with AVX2 intrinsics.
Widen/narrow operations can be done a bit more efficiently in AVX2. Widening a u8x16 to a u16x16 can be done with a single `_mm256_cvtepu8_epi16`. Narrowing a u16x16 to a u8x16 can be done with two shuffles: one to extract the lower bits of each 16-bit value within each 128-bit lane, and one to combine the two lanes.

I've consolidated much of the x86 codegen from `x86_common.rs`, `arch/avx2.rs`, `arch/sse4_2.rs`, and `arch/x86_common.rs` into a single `arch/x86.rs` file. I did this in the middle of some other commits; sorry! The main AVX2 codegen was implemented before the reorganization, but the split/combine and widen/narrow ops were implemented afterwards.

In the future, I'd like to rework and tidy up the codegen a bit more. For instance, we're passing in things like vector types' widths alongside those very same vector types, which is redundant. The `Arch` trait is also very much not pulling its weight.
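The widen/narrow semantics mentioned above can be modeled in plain Rust (these sketches model the results, not the shuffle-based implementation):

```rust
// Zero-extending widen, as _mm256_cvtepu8_epi16 produces.
fn widen_u8x16(v: [u8; 16]) -> [u16; 16] {
    core::array::from_fn(|i| v[i] as u16)
}

// Truncating narrow: keep the low byte of each 16-bit element (the
// AVX2 version does this with one in-lane shuffle plus one
// lane-combining shuffle).
fn narrow_u16x16(v: [u16; 16]) -> [u8; 16] {
    core::array::from_fn(|i| v[i] as u8)
}

fn main() {
    assert_eq!(widen_u8x16([0xAB; 16]), [0x00ABu16; 16]);
    assert_eq!(narrow_u16x16([0x1234; 16]), [0x34u8; 16]);
}
```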