
Vector Saturating Subtractions should be flipped around when the result is AND'ed #79690

@Validark

Description

Sorry about the title; I'm not sure what to call this. Godbolt link (change which function has export in front of it).

In simdjzon, a Zig port of simdjson, we have code like this:

fn must_be_2_3_continuation1(prev2: Chunk, prev3: Chunk) Chunk {
    // Saturating subtract: the result is nonzero exactly when prev2 >= 0b11100000 / prev3 >= 0b11110000.
    const is_third_byte: Chunk = @bitCast(prev2 -| @as(Chunk, @splat(0b11100000 - 1)));
    const is_fourth_byte: Chunk = @bitCast(prev3 -| @as(Chunk, @splat(0b11110000 - 1)));
    const i1xchunk_len = @Vector(chunk_len, i1);
    // Compare against zero, then sign-extend the i1 mask and keep only bit 7.
    const result = @as(i1xchunk_len, @bitCast((is_third_byte | is_fourth_byte) > @as(@Vector(chunk_len, u8), @splat(0))));
    return @as(Chunk, @bitCast(@as(IChunk, result))) & @as(Chunk, @splat(0x80));
}
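(For context: Chunk, IChunk, and chunk_len come from the surrounding simdjzon code and are not shown above. The sketch below gives plausible definitions that would reproduce the 128-bit emit in this report; the exact chunk_len simdjzon picks depends on the target, so treat these as assumptions.)

const std = @import("std");
const chunk_len = 16; // assumption: 16-byte chunks, matching the xmm/16-byte NEON emit below
const Chunk = @Vector(chunk_len, u8);
const IChunk = @Vector(chunk_len, i8); // assumption: signed counterpart used for the i1 -> i8 sign extension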

The x86 codegen was:

.LCPI0_0:
        .zero   16,223
.LCPI0_1:
        .zero   16,239
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation1:
        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpor    xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpandn  xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret

The aarch64 codegen was:

must_be_2_3_continuation1:
        movi    v2.16b, #223
        uqsub   v0.16b, v0.16b, v2.16b
        movi    v2.16b, #239
        uqsub   v1.16b, v1.16b, v2.16b
        orr     v0.16b, v1.16b, v0.16b
        cmeq    v0.16b, v0.16b, #0
        movi    v1.16b, #128
        bic     v0.16b, v1.16b, v0.16b
        ret

I tried cleaning it up like so:

fn must_be_2_3_continuation2(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte  = @select(u8, prev2 >= @as(Chunk, @splat(0b11100000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    const is_fourth_byte = @select(u8, prev3 >= @as(Chunk, @splat(0b11110000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    return (is_third_byte | is_fourth_byte);
}

Unfortunately, LLVM still did not see the optimization here. x86 emit:

.LCPI0_0:
        .zero   16,224
.LCPI0_1:
        .zero   16,240
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation2:
        vpmaxub xmm2, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpmaxub xmm3, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpcmpeqb        xmm0, xmm0, xmm2
        vpcmpeqb        xmm1, xmm1, xmm3
        vpor    xmm0, xmm0, xmm1
        vpsllw  xmm0, xmm0, 7
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret

aarch64 emit:

must_be_2_3_continuation2:
        movi    v2.16b, #223
        cmhi    v0.16b, v0.16b, v2.16b
        movi    v2.16b, #239
        cmhi    v1.16b, v1.16b, v2.16b
        orr     v0.16b, v0.16b, v1.16b
        movi    v1.16b, #128
        and     v0.16b, v0.16b, v1.16b
        ret

(This is actually a bit shorter than the previous emit, so maybe it's better? I'm not familiar with how expensive each of these instructions is.)

Then I tried:

fn must_be_2_3_continuation3(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: std.meta.Int(.unsigned, chunk_len)  = @bitCast(prev2 >= @as(Chunk, @splat(0b11100000)));
    const is_fourth_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev3 >= @as(Chunk, @splat(0b11110000)));
    return @select(u8, @as(@Vector(chunk_len, bool), @bitCast(is_third_byte | is_fourth_byte)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
}

This compiled almost identically to implementation 2 on both targets discussed. The only difference is that the register allocator flipped the operand order of the vector OR, so there is no relevant difference. I have included it just as an extra test case.

Lastly, I wrote this implementation:

export fn must_be_2_3_continuation4(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: Chunk = prev2 -| @as(Chunk, @splat(0b11100000 - 0x80));
    const is_fourth_byte: Chunk = prev3 -| @as(Chunk, @splat(0b11110000 - 0x80));
    return (is_third_byte | is_fourth_byte) & @as(Chunk, @splat(0x80));
}
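Why this form is equivalent (my own scalar sketch, not code from simdjzon): subtracting 0b11100000 - 0x80 = 0x60 with saturation leaves bit 7 set exactly when the input byte is at least 0b11100000, so the trailing AND with 0x80 is all that is needed and the compare against zero disappears.

// Scalar version of the per-lane trick used above:
//   x <  0x60          -> x -| 0x60 saturates to 0x00       (bit 7 clear)
//   0x60 <= x <= 0xDF  -> x -| 0x60 lands in [0x00, 0x7F]   (bit 7 clear)
//   x >= 0xE0          -> x -| 0x60 lands in [0x80, 0x9F]   (bit 7 set)
fn is_third_byte_scalar(x: u8) u8 {
    return (x -| (0b11100000 - 0x80)) & 0x80; // 0x80 if x >= 0b11100000, else 0
}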

x86 emit:

.LCPI0_0:
        .zero   16,96
.LCPI0_1:
        .zero   16,112
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation4:
        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpor    xmm0, xmm1, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret

aarch64 emit:

must_be_2_3_continuation4:
        movi    v2.16b, #96
        uqsub   v0.16b, v0.16b, v2.16b
        movi    v2.16b, #112
        uqsub   v1.16b, v1.16b, v2.16b
        orr     v0.16b, v1.16b, v0.16b
        movi    v1.16b, #128
        and     v0.16b, v0.16b, v1.16b
        ret

As you can see, the last implementation finally gets optimal x86 codegen! The emit differs between the 1st and 4th implementations in the same way on x86 and aarch64, so this optimization eliminates instructions on both platforms. However, implementations 2 and 3 (which produce identical emit) use the same number of instructions on aarch64 as implementation 4, so if the instructions being swapped have identical cost, either the implementation 2/3 emit or the implementation 4 emit would be acceptable on aarch64.
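For completeness, here is a small exhaustive check (mine, not part of the original report) that the flipped saturating subtraction computes the same per-byte result as the subtract-then-compare form used by implementations 1-3, for the 0b11100000 threshold:

const std = @import("std");

test "flipped saturating subtract matches subtract-then-compare for every byte" {
    var x: u16 = 0;
    while (x <= 0xFF) : (x += 1) {
        const b: u8 = @intCast(x);
        // Implementations 1-3: a nonzero result of b -| (0b11100000 - 1) selects 0x80.
        const compared: u8 = if ((b -| (0b11100000 - 1)) != 0) 0x80 else 0;
        // Implementation 4: flip the constant so bit 7 already encodes the comparison.
        const flipped: u8 = (b -| (0b11100000 - 0x80)) & 0x80;
        try std.testing.expectEqual(compared, flipped);
    }
}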

Same Godbolt link (change which function has export in front of it)

Side note: on PowerPC, implementations 2 and 3 do not compile to the same code. The quality ordering there is more like (1 and 2) < 3 < 4; implementation 4 is still the best.

Labels

good first issue (https://github.com/llvm/llvm-project/contribute), llvm:instcombine (Covers the InstCombine, InstSimplify and AggressiveInstCombine passes), missed-optimization
