
Commit 2425ecd

Implement "real" AVX2 intrinsics and clean up x86 codegen (#115)
Resolves #114. This may be best reviewed one commit at a time; one of them moves a lot of stuff around.

This PR updates the x86 codegen to use actual AVX2 intrinsics (the ones starting with `_mm256`). This is mostly straightforward, but a few operations require special attention. I've included some other x86 codegen fixes and improvements that are somewhat interwoven:

- I've added tests for several operations that were previously untested: mainly 256-bit zip/unzip, widen/narrow, split/combine, and integer equality comparisons. Note that these test cases were generated by Claude.
- The x86 codegen now actually generates the correct code for integer equality comparisons. Previously, it incorrectly generated "greater than" comparisons instead.
- It also now uses the `blendv` family for "select" operations. Intel's manual says these are available starting in SSE4.1. Not sure if there's a reason this wasn't done before.
- For SSE4.2-level unzip operations, I've changed the codegen. Previously, for `unzip_low`, it would shuffle the inputs to put the even-indexed elements in both the lower and upper halves of the values, then use `unpacklo` to select just the lower halves. Likewise, for `unzip_high`, it would shuffle the inputs to put the *odd*-indexed elements in both halves, and use `unpacklo` once more. I've changed this so that `unzip_low` and `unzip_high` both use a shuffle operation that moves the even-indexed elements into the lower halves and the odd-indexed elements into the upper halves. `unzip_low` then uses `unpacklo` to select the lower halves, and `unzip_high` uses `unpackhi` to select the upper halves. This means that if the user calls both `unzip_low` and `unzip_high`, the shuffle operation's result can be shared.
- I've implemented 8-bit multiplication based on [this StackOverflow answer](https://stackoverflow.com/questions/8193601/sse-multiplication-16-x-uint8-t).
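The 8-bit multiplication trick can be sketched as a scalar model (the function name here is hypothetical; the real codegen emits `_mm_mullo_epi16`, shift, and mask intrinsics rather than a loop). Since x86 has no 8-bit `mullo`, each 16-bit lane, which holds two `u8` elements, is multiplied twice:

```rust
/// Scalar model of the 8-bit multiply lowering: x86 has no `_mm_mullo_epi8`,
/// so each 16-bit lane (holding two u8 elements) goes through a 16-bit
/// `mullo` twice -- once as-is (the low byte of a 16-bit product is the
/// product of the two low bytes) and once shifted right by 8 -- and the two
/// byte products are recombined with a mask.
fn mul_u8_pairs(a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    let mut out = [0u16; 8];
    for i in 0..8 {
        // Even-indexed bytes: keep only the low byte of the full 16-bit product.
        let even = a[i].wrapping_mul(b[i]) & 0x00FF;
        // Odd-indexed bytes: shift down, multiply, shift the result back up
        // (the shift left naturally discards all but the low byte).
        let odd = (a[i] >> 8).wrapping_mul(b[i] >> 8) << 8;
        out[i] = even | odd;
    }
    out
}
```

For example, a lane holding the bytes `(3, 5)` multiplied by a lane holding `(7, 9)` yields `(21, 45)`, with each byte product wrapping modulo 256 just like the SIMD version.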
On the AVX2 side, most existing 128-bit operations have a straightforward 256-bit counterpart, but some are more involved:

- The zip/unzip operations are a bit more complicated, since most AVX2 swizzle operations operate *within* each 128-bit lane. For 32-bit and larger scalars, there are special "lane-crossing" shuffles we can use instead. Operations on smaller scalars require a combination of intra-lane and lane-crossing shuffles.
- Splitting a 256-bit vector into 128-bit halves, or combining two 128-bit vectors into a 256-bit one, can be done directly with AVX2 intrinsics.
- Widen/narrow operations can be done a bit more efficiently in AVX2. Widening a u8x16 to a u16x16 can be done with a single `_mm256_cvtepu8_epi16`. Narrowing a u16x16 to a u8x16 can be done with two shuffles: one to extract the low byte of each 16-bit value within each 128-bit lane, and one to combine the two lanes.

I've consolidated much of the x86 codegen from `x86_common.rs`, `arch/avx2.rs`, `arch/sse4_2.rs`, and `arch/x86_common.rs` into a single `arch/x86.rs` file. I did this in the middle of some other commits; sorry! The main AVX2 codegen was implemented before the reorganization, but the split/combine and widen/narrow ops were implemented afterwards.

In the future, I'd like to rework and tidy up the codegen a bit more. For instance, we're passing in things like vector types' widths alongside those very same vector types, which is redundant. The `Arch` trait is also very much not pulling its weight.
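The two-shuffle narrowing can be sketched as a scalar model (the function name is hypothetical, operating on the 32 bytes of a 256-bit register; the real codegen would emit a per-lane byte shuffle such as `_mm256_shuffle_epi8` followed by a lane-crossing permute):

```rust
/// Scalar model of the AVX2 u16x16 -> u8x16 narrow described above.
fn narrow_u16x16_model(src: [u8; 32]) -> [u8; 16] {
    // Step 1 (per-lane byte shuffle): within each 16-byte lane, gather the
    // even-indexed bytes (the low byte of each little-endian 16-bit element)
    // into that lane's low 8 bytes.
    let mut gathered = [0u8; 32];
    for lane in 0..2 {
        for i in 0..8 {
            gathered[lane * 16 + i] = src[lane * 16 + 2 * i];
        }
    }
    // Step 2 (lane-crossing permute): place the low 8 bytes of each lane
    // adjacently in the low 128 bits of the result.
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&gathered[..8]);
    out[8..].copy_from_slice(&gathered[16..24]);
    out
}
```

The key point the model captures is that step 1 cannot move bytes across the 128-bit lane boundary, so a second, lane-crossing operation is needed to bring the two half-lanes together.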
1 parent deab67c commit 2425ecd

File tree

14 files changed (+3127, −1779 lines)

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
```diff
@@ -18,6 +18,13 @@ This release has an [MSRV][] of 1.88.
 ### Added
 
 - All vector types now implement `Index` and `IndexMut`. ([#112][] by [@Ralith][])
+- 256-bit vector types now use native AVX2 intrinsics on supported platforms. ([#115][] by [@valadaptive][])
+- 8-bit integer multiplication is now implemented on x86. ([#115][] by [@valadaptive][])
+
+### Fixed
+
+- Integer equality comparisons now function properly on x86. Previously, they performed "greater than" comparisons.
+  ([#115][] by [@valadaptive][])
 
 ### Changed
 
@@ -27,6 +34,7 @@ This release has an [MSRV][] of 1.88.
   A consequence of this is that the available variants on `Level` are now dependent on the target features you are compiling with.
   The fallback level can be restored with the `force_support_fallback` cargo feature. We don't expect this to be necessary outside
   of tests.
+- Code generation for `select` and `unzip` operations on x86 has been improved. ([#115][] by [@valadaptive][])
 
 ### Removed
 
@@ -86,6 +94,7 @@ No changelog was kept for this release.
 
 [@Ralith]: https://github.com/Ralith
 [@DJMcNab]: https://github.com/DJMcNab
+[@valadaptive]: https://github.com/valadaptive
 
 [#75]: https://github.com/linebender/fearless_simd/pull/75
 [#76]: https://github.com/linebender/fearless_simd/pull/76
@@ -103,6 +112,7 @@ No changelog was kept for this release.
 [#96]: https://github.com/linebender/fearless_simd/pull/96
 [#99]: https://github.com/linebender/fearless_simd/pull/99
 [#105]: https://github.com/linebender/fearless_simd/pull/105
+[#115]: https://github.com/linebender/fearless_simd/pull/115
 
 [Unreleased]: https://github.com/linebender/fearless_simd/compare/v0.3.0...HEAD
 [0.3.0]: https://github.com/linebender/fearless_simd/compare/v0.3.0...v0.2.0
```

fearless_simd/src/generated/avx2.rs

Lines changed: 809 additions & 1011 deletions
Large diffs are not rendered by default.

fearless_simd/src/generated/sse4_2.rs

Lines changed: 54 additions & 122 deletions
Large diffs are not rendered by default.

fearless_simd_gen/src/arch/avx2.rs

Lines changed: 0 additions & 18 deletions
This file was deleted.

fearless_simd_gen/src/arch/mod.rs

Lines changed: 1 addition & 4 deletions
```diff
@@ -3,11 +3,8 @@
 
 pub(crate) mod fallback;
 pub(crate) mod neon;
-
-pub(crate) mod avx2;
-pub(crate) mod sse4_2;
 pub(crate) mod wasm;
-pub(crate) mod x86_common;
+pub(crate) mod x86;
 
 use proc_macro2::TokenStream;
```

fearless_simd_gen/src/arch/sse4_2.rs

Lines changed: 0 additions & 18 deletions
This file was deleted.

fearless_simd_gen/src/arch/x86.rs

Lines changed: 275 additions & 0 deletions
```rust
// Copyright 2025 the Fearless_SIMD Authors
// SPDX-License-Identifier: Apache-2.0 OR MIT

#![expect(
    unreachable_pub,
    reason = "TODO: https://github.com/linebender/fearless_simd/issues/40"
)]

use crate::arch::Arch;
use crate::types::{ScalarType, VecType};
use proc_macro2::{Ident, Span, TokenStream};
use quote::{format_ident, quote};

pub struct X86;

pub(crate) fn translate_op(op: &str) -> Option<&'static str> {
    Some(match op {
        "floor" => "floor",
        "sqrt" => "sqrt",
        "add" => "add",
        "sub" => "sub",
        "div" => "div",
        "and" => "and",
        "simd_eq" => "cmpeq",
        "simd_lt" => "cmplt",
        "simd_le" => "cmple",
        "simd_ge" => "cmpge",
        "simd_gt" => "cmpgt",
        "or" => "or",
        "xor" => "xor",
        "shl" => "shl",
        "shr" => "shr",
        "max" => "max",
        "min" => "min",
        "max_precise" => "max",
        "min_precise" => "min",
        "select" => "blendv",
        _ => return None,
    })
}

impl Arch for X86 {
    fn arch_ty(&self, ty: &VecType) -> TokenStream {
        let suffix = match (ty.scalar, ty.scalar_bits) {
            (ScalarType::Float, 32) => "",
            (ScalarType::Float, 64) => "d",
            (ScalarType::Float, _) => unimplemented!(),
            (ScalarType::Unsigned | ScalarType::Int | ScalarType::Mask, _) => "i",
        };
        let name = format!("__m{}{}", ty.scalar_bits * ty.len, suffix);
        let ident = Ident::new(&name, Span::call_site());
        quote! { #ident }
    }

    fn expr(&self, op: &str, ty: &VecType, args: &[TokenStream]) -> TokenStream {
        if let Some(op_name) = translate_op(op) {
            let sign_aware = matches!(op, "max" | "min");

            let suffix = match op_name {
                "and" | "or" | "xor" => coarse_type(*ty),
                "blendv" if ty.scalar != ScalarType::Float => "epi8",
                _ => op_suffix(ty.scalar, ty.scalar_bits, sign_aware),
            };
            let intrinsic = intrinsic_ident(op_name, suffix, ty.n_bits());
            quote! { #intrinsic ( #( #args ),* ) }
        } else {
            let suffix = op_suffix(ty.scalar, ty.scalar_bits, true);
            match op {
                "trunc" => {
                    let intrinsic = intrinsic_ident("round", suffix, ty.n_bits());
                    quote! { #intrinsic ( #( #args, )* _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) }
                }
                "neg" => match ty.scalar {
                    ScalarType::Float => {
                        let set1 = set1_intrinsic(ty.scalar, ty.scalar_bits, ty.n_bits());
                        let xor =
                            simple_intrinsic("xor", ScalarType::Float, ty.scalar_bits, ty.n_bits());
                        quote! {
                            #( #xor(#args, #set1(-0.0)) )*
                        }
                    }
                    ScalarType::Int => {
                        let set0 = intrinsic_ident("setzero", coarse_type(*ty), ty.n_bits());
                        let sub = simple_intrinsic("sub", ty.scalar, ty.scalar_bits, ty.n_bits());
                        let arg = &args[0];
                        quote! {
                            #sub(#set0(), #arg)
                        }
                    }
                    _ => unreachable!(),
                },
                "abs" => {
                    let set1 = set1_intrinsic(ty.scalar, ty.scalar_bits, ty.n_bits());
                    let andnot =
                        simple_intrinsic("andnot", ScalarType::Float, ty.scalar_bits, ty.n_bits());
                    quote! {
                        #( #andnot(#set1(-0.0), #args) )*
                    }
                }
                "copysign" => {
                    let a = &args[0];
                    let b = &args[1];
                    let set1 = set1_intrinsic(ty.scalar, ty.scalar_bits, ty.n_bits());
                    let and =
                        simple_intrinsic("and", ScalarType::Float, ty.scalar_bits, ty.n_bits());
                    let andnot =
                        simple_intrinsic("andnot", ScalarType::Float, ty.scalar_bits, ty.n_bits());
                    let or = simple_intrinsic("or", ScalarType::Float, ty.scalar_bits, ty.n_bits());
                    quote! {
                        let mask = #set1(-0.0);
                        #or(#and(mask, #b), #andnot(mask, #a))
                    }
                }
                "mul" => {
                    let suffix = op_suffix(ty.scalar, ty.scalar_bits, false);
                    let intrinsic = if matches!(ty.scalar, ScalarType::Int | ScalarType::Unsigned) {
                        intrinsic_ident("mullo", suffix, ty.n_bits())
                    } else {
                        intrinsic_ident("mul", suffix, ty.n_bits())
                    };

                    quote! { #intrinsic ( #( #args ),* ) }
                }
                "shrv" if ty.scalar_bits > 16 => {
                    let suffix = op_suffix(ty.scalar, ty.scalar_bits, false);
                    let name = match ty.scalar {
                        ScalarType::Int => "srav",
                        _ => "srlv",
                    };
                    let intrinsic = intrinsic_ident(name, suffix, ty.n_bits());
                    quote! { #intrinsic ( #( #args ),* ) }
                }
                _ => unimplemented!("{}", op),
            }
        }
    }
}

pub(crate) fn op_suffix(mut ty: ScalarType, bits: usize, sign_aware: bool) -> &'static str {
    use ScalarType::*;
    if !sign_aware && ty == Unsigned {
        ty = Int;
    }
    match (ty, bits) {
        (Float, 32) => "ps",
        (Float, 64) => "pd",
        (Float, _) => unimplemented!("{bits} bit floats"),
        (Int | Mask, 8) => "epi8",
        (Int | Mask, 16) => "epi16",
        (Int | Mask, 32) => "epi32",
        (Int | Mask, 64) => "epi64",
        (Unsigned, 8) => "epu8",
        (Unsigned, 16) => "epu16",
        (Unsigned, 32) => "epu32",
        (Unsigned, 64) => "epu64",
        _ => unreachable!(),
    }
}

/// Intrinsic name for the "int, float, or double" type (not as fine-grained as [`op_suffix`]).
pub(crate) fn coarse_type(vec_ty: VecType) -> &'static str {
    use ScalarType::*;
    match (vec_ty.scalar, vec_ty.n_bits()) {
        (Int | Unsigned | Mask, 128) => "si128",
        (Int | Unsigned | Mask, 256) => "si256",
        (Int | Unsigned | Mask, 512) => "si512",
        _ => op_suffix(vec_ty.scalar, vec_ty.scalar_bits, false),
    }
}

pub(crate) fn set1_intrinsic(ty: ScalarType, bits: usize, ty_bits: usize) -> Ident {
    use ScalarType::*;
    let suffix = match (ty, bits) {
        (Int | Unsigned | Mask, 64) => "epi64x",
        _ => op_suffix(ty, bits, false),
    };

    intrinsic_ident("set1", suffix, ty_bits)
}

pub(crate) fn simple_intrinsic(name: &str, ty: ScalarType, bits: usize, ty_bits: usize) -> Ident {
    let suffix = op_suffix(ty, bits, true);

    intrinsic_ident(name, suffix, ty_bits)
}

pub(crate) fn simple_sign_unaware_intrinsic(
    name: &str,
    ty: ScalarType,
    bits: usize,
    ty_bits: usize,
) -> Ident {
    let suffix = op_suffix(ty, bits, false);

    intrinsic_ident(name, suffix, ty_bits)
}

pub(crate) fn extend_intrinsic(
    ty: ScalarType,
    from_bits: usize,
    to_bits: usize,
    ty_bits: usize,
) -> Ident {
    let from_suffix = op_suffix(ty, from_bits, true);
    let to_suffix = op_suffix(ty, to_bits, false);

    intrinsic_ident(&format!("cvt{from_suffix}"), to_suffix, ty_bits)
}

pub(crate) fn cvt_intrinsic(from: VecType, to: VecType) -> Ident {
    let from_suffix = op_suffix(from.scalar, from.scalar_bits, false);
    let to_suffix = op_suffix(to.scalar, to.scalar_bits, false);

    intrinsic_ident(&format!("cvt{from_suffix}"), to_suffix, from.n_bits())
}

pub(crate) fn pack_intrinsic(from_bits: usize, signed: bool, ty_bits: usize) -> Ident {
    let unsigned = match signed {
        true => "",
        false => "u",
    };
    let suffix = op_suffix(ScalarType::Int, from_bits, false);

    intrinsic_ident(&format!("pack{unsigned}s"), suffix, ty_bits)
}

pub(crate) fn unpack_intrinsic(
    scalar_type: ScalarType,
    scalar_bits: usize,
    low: bool,
    ty_bits: usize,
) -> Ident {
    let suffix = op_suffix(scalar_type, scalar_bits, false);

    let low_pref = if low { "lo" } else { "hi" };

    intrinsic_ident(&format!("unpack{low_pref}"), suffix, ty_bits)
}

pub(crate) fn intrinsic_ident(name: &str, suffix: &str, ty_bits: usize) -> Ident {
    let prefix = match ty_bits {
        128 => "",
        256 => "256",
        512 => "512",
        _ => unreachable!(),
    };

    format_ident!("_mm{prefix}_{name}_{suffix}")
}

pub(crate) fn cast_ident(
    src_scalar_ty: ScalarType,
    dst_scalar_ty: ScalarType,
    scalar_bits: usize,
    ty_bits: usize,
) -> Ident {
    let prefix = match ty_bits {
        128 => "",
        256 => "256",
        512 => "512",
        _ => unreachable!(),
    };
    let src_name = coarse_type(VecType::new(
        src_scalar_ty,
        scalar_bits,
        ty_bits / scalar_bits,
    ));
    let dst_name = coarse_type(VecType::new(
        dst_scalar_ty,
        scalar_bits,
        ty_bits / scalar_bits,
    ));

    format_ident!("_mm{prefix}_cast{src_name}_{dst_name}")
}
```
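For readers unfamiliar with Intel's naming scheme, the composition done by `intrinsic_ident` can be modeled with plain strings (a hypothetical standalone sketch; the real function returns a `proc_macro2::Ident` via `format_ident!`):

```rust
// Plain-string model of `intrinsic_ident`: Intel intrinsic names follow the
// pattern `_mm{width}_{op}_{type suffix}`, where the width prefix is empty
// for 128-bit vectors.
fn intrinsic_name(name: &str, suffix: &str, ty_bits: usize) -> String {
    let prefix = match ty_bits {
        128 => "",
        256 => "256",
        512 => "512",
        _ => unreachable!(),
    };
    format!("_mm{prefix}_{name}_{suffix}")
}
```

For example, `intrinsic_name("add", "epi16", 256)` yields `_mm256_add_epi16`, and `intrinsic_name("mullo", "epi16", 128)` yields `_mm_mullo_epi16`.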
