this is about a 1.5% difference with 32 threads, 0.5% difference with 2 threads, and no difference for single-threaded measurements. this might not be an issue on macOS targets, though i can't really check that. but this definitely applies to Linux targets, and probably BSDs. maybe Windows too, but i doubt it? it's also probably more of a Rust issue, but i'm not yet sure how to describe what exactly is happening. so, some notes here in case it inspires some tinkering :)
all call sites to functions provided by `librav1dasm.a` are indirected through the GOT rather than being direct calls like to any other internal function. which is to say, they're `ff 15 <addr>` calls like `call *addr(%rip)` rather than `e8 <offset>` calls like `call $+<offset>`. this gist is a standalone demonstration of the issue. here, though, i'm going to use the particularly hot loop in `decode_coefs` as a more relevant example. the Rust code is, roughly, here:
```rust
let level = &mut levels[level_off..];
ctx = get_lo_ctx(level, tx_class, &mut mag, lo_ctx_offsets, x, y, stride);
if tx_class == TxClass::TwoD {
    y |= x;
}
tok = rav1d_msac_decode_symbol_adapt4(&mut ts_c.msac, &mut lo_cdf[ctx as usize], 3);
if dbg {
    println!(
        "Post-lo_tok[{}][{}][{}][{}={}={}]: r={}",
        t_dim.ctx, chroma, ctx, i, rc_i, tok, ts_c.msac.rng,
    );
}
if tok == 3 {
```
("roughly" because the assembly that follows is from a change i was measuring and ought to PR, but for the region here it's about the same, with more branches, on `main`.)
`perf`-annotated assembly from a build via `cargo +nightly build --release`, transcoding `Chimera-AV1-8bit-1920x1080-6736kbps.ivf`, looks something like this:
```
0.63 │ add    0xc0(%rsp),%rsi
0.64 │ shr    $0x7,%ecx
0.67 │ cmp    $0x201,%eax
0.63 │ movzbl %dl,%eax
0.70 │ cmovae %edi,%ecx
0.60 │ add    (%rax,%rsi,1),%cl
0.62 │ movzbl %cl,%edi
0.69 │ cmp    $0x28,%dil
     │ ↓ ja   1a6c
0.65 │ mov    0x80(%rsp),%rax
0.72 │ lea    (%rax,%rdi,8),%rsi
1.26 │ mov    $0x3,%edx
0.67 │ mov    %r14,%rdi
8.39 │ → call *0x1e08a1(%rip)  # 2e13e0 <_DYNAMIC+0x460> ; ouch, 8% of time in this function we're here.
0.48 │ and    $0x3,%eax  ; we're in this function 10% of the time,
0.50 │ cmp    $0x3,%eax  ; so this instruction is 0.8% of total runtime!
     │ ↑ je   cb0
0.49 │ imul   $0x17ff41,%eax,%eax
0.50 │ mov    %eax,%edx
     │ shr    $0x9,%edx
0.49 │ mov    0x10(%rsp),%rdi
```
unfortunately `perf` doesn't figure out that the item at `0x2e13e0` is the address of `rav1d_msac_decode_symbol_adapt4_sse2`, so it might not be obvious what's going on here at first. but that function is statically linked! i found a related internals post from someone remarking on exactly the same issue, but otherwise i couldn't find obvious references on the Rust issue tracker.
bjorn3 describes the mechanics of why this is how it is in the replies, though it's all pretty unfortunate. with RELRO, the PLT-based approach of "call to a stub that resolves the symbol, then patches its GOT entry" goes out the window, as the GOT is read-only. so instead, GOT entries are resolved when the program is loaded, and calls go directly through GOT entries to avoid the PLT indirection. this is generally beneficial: functions do `call [&target]` instead of `call target_stub; target_stub: call [&target]`.
the relaxation bjorn3 mentions is a link-time optimization, where `call target_stub`, for a target that's local to the binary, can be rewritten to `call target` without other changes to the function. a five-byte call turns into a five-byte call with no one the wiser. with a call through the GOT, presumably ld (and lld?) give up on the opportunity to relax the `call [&target]` into `nop; call target`, or it's niche enough to not be an implemented optimization.
the relevant code in `session.rs` is the same as it was in 2022. if the target does not want `plt_by_default` and has `relro_level == Full`, you get calls through GOT entries and end up here. (that's how i arrived at Linux and BSDs: those are the targets that default to `RelroLevel::Full`.)
so, if you build a release binary specifically asking for PLT calls, like `RUSTFLAGS="-Z plt=yes" cargo +nightly build --release`, and decode the same video, the resulting `dav1d` is about 1.5% faster with 32 cores, and the above hot code now looks like:
```
0.67 │ lea    (%rsi,%rsi,4),%rsi
0.64 │ add    0xc0(%rsp),%rsi
0.60 │ shr    $0x7,%ecx
0.63 │ cmp    $0x201,%eax
0.60 │ movzbl %dl,%eax
0.63 │ cmovae %edi,%ecx
0.65 │ add    (%rax,%rsi,1),%cl
0.59 │ movzbl %cl,%edi
0.64 │ cmp    $0x28,%dil
0.00 │ ↓ ja   1b47
0.63 │ mov    0x90(%rsp),%rax
0.64 │ lea    (%rax,%rdi,8),%rsi
0.63 │ mov    $0x3,%edx
0.67 │ mov    0x28(%rsp),%r14
0.63 │ mov    %r14,%rdi
0.91 │ → call dav1d_msac_decode_symbol_adapt4_sse2
1.22 │ and    $0x3,%eax  ; also this is the % 4 in `rav1d_msac_decode_symbol_adapt4`.
0.71 │ cmp    $0x3,%eax  ; it's not strictly necessary.
     │ ↑ je   cf0
0.86 │ imul   $0x17ff41,%eax,%eax
0.75 │ mov    %eax,%edx
     │ shr    $0x9,%edx
0.69 │ mov    0x8(%rsp),%rdi
```
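to make the flag stick for a project rather than passing `RUSTFLAGS` by hand, something like this in `.cargo/config.toml` should be equivalent (nightly-only; i haven't exhaustively checked how it interacts with other flags):

```toml
[build]
rustflags = ["-Zplt=yes"]
```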
the fact that this has such an effect, i think, is symptomatic of the kind of diffused warm code that `rav1d` and `dav1d` have under load. i'd typically have expected the indirect calls to get lodged firmly in the branch predictor and not really be any worse than direct calls. the fact that (at least on Zen 4/Zen 5) it does stall at these indirect calls tells me that there might actually be a few thousand branches between calls to any one of the assembly routines, so the predicted target address gets evicted? direct calls are a lot easier at that point, because their target address doesn't rely on anything fancy, it's just part of the instruction :)
another effect that is really subtle: with indirect calls, LLVM correctly decides that it can be beneficial to load the target address once into a register, then call through that register multiple times. in a bunch of places this increases register pressure and causes some extra spills to preserve values whose registers get clobbered by some assembly routine's address. in the region i've emphasized here, you can see that the last instruction loads `rdi` from the next higher stack slot for exactly this reason: calls became direct, so there was no need to save a call target, so a spill didn't happen, so the stack slot wasn't needed, and the whole function's frame is 16 bytes smaller. kind of wild how much ends up shifting around!