Description
Sorry about that title, not sure what to call this. Godbolt link (change which function has export in front of it).
In simdjzon, a Zig port of simdjson, we have code like this:
fn must_be_2_3_continuation1(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: Chunk = @bitCast(prev2 -| @as(Chunk, @splat(0b11100000 - 1)));
    const is_fourth_byte: Chunk = @bitCast(prev3 -| @as(Chunk, @splat(0b11110000 - 1)));
    const i1xchunk_len = @Vector(chunk_len, i1);
    const result = @as(i1xchunk_len, @bitCast((is_third_byte | is_fourth_byte) > @as(@Vector(chunk_len, u8), @splat(0))));
    return @as(Chunk, @bitCast(@as(IChunk, result))) & @as(Chunk, @splat(0x80));
}
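Per byte, this computes 0x80 exactly when prev2 is a 3-byte lead (>= 0xE0) or prev3 is a 4-byte lead (>= 0xF0). A scalar Python model of the trick (a sketch for illustration; `sat_sub` and `impl1_byte` are my names, not simdjzon's):

```python
def sat_sub(a: int, b: int) -> int:
    """Unsigned saturating subtraction, modeling Zig's -| on u8."""
    return max(a - b, 0)

def impl1_byte(b2: int, b3: int) -> int:
    # sat_sub(b, 0xDF) is nonzero exactly when b >= 0xE0 (same for 0xEF / 0xF0)
    is_third = sat_sub(b2, 0b11100000 - 1)
    is_fourth = sat_sub(b3, 0b11110000 - 1)
    # the i1 bitcast + sign extension produces all-ones where the compare
    # is true; masking with 0x80 keeps only the high bit
    return 0x80 if (is_third | is_fourth) != 0 else 0
```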
The x86 codegen was:
.LCPI0_0:
.zero 16,223
.LCPI0_1:
.zero 16,239
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation1:
vpsubusb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpsubusb xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
vpor xmm0, xmm1, xmm0
vpxor xmm1, xmm1, xmm1
vpcmpeqb xmm0, xmm0, xmm1
vpandn xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
The aarch64 codegen was:
must_be_2_3_continuation1:
movi v2.16b, #223
uqsub v0.16b, v0.16b, v2.16b
movi v2.16b, #239
uqsub v1.16b, v1.16b, v2.16b
orr v0.16b, v1.16b, v0.16b
cmeq v0.16b, v0.16b, #0
movi v1.16b, #128
bic v0.16b, v1.16b, v0.16b
ret
I tried cleaning it up like so:
fn must_be_2_3_continuation2(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte = @select(u8, prev2 >= @as(Chunk, @splat(0b11100000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    const is_fourth_byte = @select(u8, prev3 >= @as(Chunk, @splat(0b11110000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    return (is_third_byte | is_fourth_byte);
}
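Per byte, the @select version reduces to a plain comparison; a scalar Python model (`impl2_byte` is my name for illustration):

```python
def impl2_byte(b2: int, b3: int) -> int:
    # @select picks 0x80 in lanes where the predicate holds, else 0
    is_third = 0x80 if b2 >= 0b11100000 else 0
    is_fourth = 0x80 if b3 >= 0b11110000 else 0
    return is_third | is_fourth
```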
Unfortunately, LLVM still did not see the optimization here. x86 emit:
.LCPI0_0:
.zero 16,224
.LCPI0_1:
.zero 16,240
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation2:
vpmaxub xmm2, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmaxub xmm3, xmm1, xmmword ptr [rip + .LCPI0_1]
vpcmpeqb xmm0, xmm0, xmm2
vpcmpeqb xmm1, xmm1, xmm3
vpor xmm0, xmm0, xmm1
vpsllw xmm0, xmm0, 7
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
aarch64 emit:
must_be_2_3_continuation2:
movi v2.16b, #223
cmhi v0.16b, v0.16b, v2.16b
movi v2.16b, #239
cmhi v1.16b, v1.16b, v2.16b
orr v0.16b, v0.16b, v1.16b
movi v1.16b, #128
and v0.16b, v0.16b, v1.16b
ret
(This is actually a bit shorter than the other emit, so maybe it's better? I'm not familiar with the relative cost of each instruction.)
Then I tried:
fn must_be_2_3_continuation3(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev2 >= @as(Chunk, @splat(0b11100000)));
    const is_fourth_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev3 >= @as(Chunk, @splat(0b11110000)));
    return @select(u8, @as(@Vector(chunk_len, bool), @bitCast(is_third_byte | is_fourth_byte)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
}
This compiled almost identically to implementation 2 on both targets; the only difference is that the register allocator swapped the operand order of the vector OR, which is irrelevant. I have included it just as an extra test case.
Lastly, I wrote this implementation.
export fn must_be_2_3_continuation4(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: Chunk = prev2 -| @as(Chunk, @splat(0b11100000 - 0x80));
    const is_fourth_byte: Chunk = prev3 -| @as(Chunk, @splat(0b11110000 - 0x80));
    return (is_third_byte | is_fourth_byte) & @as(Chunk, @splat(0x80));
}
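The trick here: subtracting 0x60 saturating leaves the result >= 0x80 exactly when the input was >= 0xE0 (and 0x70 / 0xF0 likewise), so the `& 0x80` mask alone recovers the answer with no compare. An exhaustive per-byte Python check that this agrees with implementation 1 over all 256x256 inputs (`sat_sub`, `impl1_byte`, `impl4_byte` are my names for illustration):

```python
def sat_sub(a: int, b: int) -> int:
    """Unsigned saturating subtraction, modeling Zig's -| on u8."""
    return max(a - b, 0)

def impl1_byte(b2: int, b3: int) -> int:
    return 0x80 if (sat_sub(b2, 0xDF) | sat_sub(b3, 0xEF)) != 0 else 0

def impl4_byte(b2: int, b3: int) -> int:
    # sat_sub(b2, 0x60) >= 0x80 exactly when b2 >= 0xE0, so the high bit
    # survives the & 0x80 mask in precisely those cases
    return (sat_sub(b2, 0x60) | sat_sub(b3, 0x70)) & 0x80

assert all(impl1_byte(b2, b3) == impl4_byte(b2, b3)
           for b2 in range(256) for b3 in range(256))
```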
x86 emit:
.LCPI0_0:
.zero 16,96
.LCPI0_1:
.zero 16,112
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation4:
vpsubusb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpsubusb xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
vpor xmm0, xmm1, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
aarch64 emit:
must_be_2_3_continuation4:
movi v2.16b, #96
uqsub v0.16b, v0.16b, v2.16b
movi v2.16b, #112
uqsub v1.16b, v1.16b, v2.16b
orr v0.16b, v1.16b, v0.16b
movi v1.16b, #128
and v0.16b, v0.16b, v1.16b
ret
As you can see, the last implementation finally gets optimal x86 codegen! The emit differs between the 1st and last implementations in the same way on x86 and aarch64: the optimization eliminates instructions on both platforms. However, implementations 2 and 3 (identical emit) use the same number of instructions on aarch64 as implementation 4, so if the instructions being switched out have identical cost, the emit from implementations 2/3 or from implementation 4 would be acceptable on aarch64.
Same Godbolt link (change which function has export in front of it).
Side note: on PowerPC, implementations 2 and 3 do not compile the same way; there the ordering is more like (1 and 2) < 3 < 4. Implementation 4 is still the best.