
[Linux] calls to assembly routines have excessive indirection #1417

@iximeow

Description


this is about a 1.5% difference with 32 threads, a 0.5% difference with 2 threads, and no difference for single-threaded measurements. this might not be an issue on macOS targets, though i can't really check that. but it definitely applies to Linux targets, and probably the BSDs. maybe Windows too, but i doubt it? it's also probably more of a Rust issue, but i'm not yet sure how to describe exactly what's happening. so, some notes here in case it inspires some tinkering :)

all call sites to functions provided by librav1dasm.a are indirected through the GOT rather than being direct calls like those to any other internal function. which is to say, they're ff 15 <addr> calls like call *addr(%rip) rather than e8 <offset> calls like call $+<offset>. this gist is a standalone demonstration of the issue. here, though, i'm going to use the particularly hot loop in decode_coefs as a more relevant example. the Rust code is, roughly, here:

            let level = &mut levels[level_off..];
            ctx = get_lo_ctx(level, tx_class, &mut mag, lo_ctx_offsets, x, y, stride);
            if tx_class == TxClass::TwoD {
                y |= x;
            }
            tok = rav1d_msac_decode_symbol_adapt4(&mut ts_c.msac, &mut lo_cdf[ctx as usize], 3);
            if dbg {
                println!(
                    "Post-lo_tok[{}][{}][{}][{}={}={}]: r={}",
                    t_dim.ctx, chroma, ctx, i, rc_i, tok, ts_c.msac.rng,
                );
            }
            if tok == 3 {

("roughly" because the assembly that follows is from a change i was measuring and ought to PR, but for the region here it's about the same with more branches on main.)

perf-annotated assembly from a cargo +nightly build --release build, transcoding Chimera-AV1-8bit-1920x1080-6736kbps.ivf, looks something like this:

0.63 │        add    0xc0(%rsp),%rsi
0.64 │        shr    $0x7,%ecx
0.67 │        cmp    $0x201,%eax
0.63 │        movzbl %dl,%eax
0.70 │        cmovae %edi,%ecx
0.60 │        add    (%rax,%rsi,1),%cl
0.62 │        movzbl %cl,%edi
0.69 │        cmp    $0x28,%dil
     │      ↓ ja     1a6c
0.65 │        mov    0x80(%rsp),%rax
0.72 │        lea    (%rax,%rdi,8),%rsi
1.26 │        mov    $0x3,%edx
0.67 │        mov    %r14,%rdi
8.39 │      → call   *0x1e08a1(%rip)        # 2e13e0 <_DYNAMIC+0x460>  ; ouch, 8% of time in this function we're here.
0.48 │        and    $0x3,%eax                                         ;   we're in this function 10% of the time,
0.50 │        cmp    $0x3,%eax                                         ;   so this instruction is 0.8% of total runtime!
     │      ↑ je     cb0
0.49 │        imul   $0x17ff41,%eax,%eax
0.50 │        mov    %eax,%edx
     │        shr    $0x9,%edx
0.49 │        mov    0x10(%rsp),%rdi

unfortunately perf doesn't figure out that the item at 0x2e13e0 should be the address of rav1d_msac_decode_symbol_adapt4_sse2, so it might not be obvious what's going on here at first. but that function is statically linked! i found a related internals post from someone remarking on exactly the same issue, but otherwise i couldn't find obvious references on the Rust issue tracker.

bjorn3 describes the mechanics of why things are this way in the replies, though it's all pretty unfortunate. with full RELRO, the PLT-based approach of "call a stub that resolves the symbol and patches its GOT entry" goes out the window, since the GOT is read-only. so instead, GOT entries are resolved when the program is loaded, and calls go directly through GOT entries to avoid the PLT indirection. this is generally beneficial: functions do call [&target] instead of call target_stub; target_stub: jmp [&target].
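
to make that concrete, here's a minimal hypothetical sketch of the two call shapes - not rav1d code, symbol names made up - wrapped in global_asm! so it assembles as-is on x86_64:

    // not rav1d code: an illustrative sketch of the two call shapes.
    use std::arch::global_asm;

    global_asm!(
        // PLT scheme: the call site is a plain 5-byte direct call (e8), and the
        // indirection through the (lazily patched) GOT lives in the stub.
        "caller_via_plt:",
        "    call target_stub",
        "    ret",
        "target_stub:",
        "    jmp qword ptr [rip + got_slot]",
        // full-RELRO scheme: the GOT is resolved at load time and read-only, so
        // the stub is dropped and every call site is the 6-byte ff 15 form.
        "caller_via_got:",
        "    call qword ptr [rip + got_slot]",
        "    ret",
        // stand-in for a GOT entry holding a resolved target address.
        "got_slot:",
        "    .quad 0",
    );

    fn main() {}

either way there's one indirection per call, but in the RELRO case it sits at every call site on the hot path, which is exactly what the perf annotation above is showing.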

the relaxation bjorn3 mentions is a link-time optimization, where call target_stub, for a target that's local to the binary, can be rewritten to call target without other changes to the function. five-byte call turns into a five-byte call with no one the wiser. with a call through the GOT, presumably ld (and lld?) give up on the opportunity to relax the call [&target] into nop; call target, or it's niche enough to not be an implemented optimization.

the relevant code in session.rs is the same as it was in 2022. if the target does not want plt_by_default and has relro_level == Full, you get calls through GOT entries and end up here. (that's how i arrived at Linux and BSDs: those are the targets with default RelroLevel::Full)
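
paraphrasing rather than quoting, the check boils down to roughly this predicate (names here are illustrative, not the actual ones in session.rs):

    // a paraphrased sketch of the PLT decision, not a verbatim copy of rustc.
    fn use_plt(plt_flag: Option<bool>, plt_by_default: bool, full_relro: bool) -> bool {
        // an explicit -Z plt=yes/no always wins. otherwise the PLT is used only
        // when the target wants it by default or full RELRO is *not* in effect;
        // plt_by_default = false plus RelroLevel::Full means calls go through
        // GOT entries, which is the Linux/BSD case above.
        plt_flag.unwrap_or(plt_by_default || !full_relro)
    }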

so, if you build a release binary specifically asking for PLT calls, like RUSTFLAGS="-Z plt=yes" cargo +nightly build --release, and decode the same video, the resulting dav1d is about 1.5% faster with 32 cores, and the hot code above now looks like this:

  0.67 │        lea    (%rsi,%rsi,4),%rsi
  0.64 │        add    0xc0(%rsp),%rsi
  0.60 │        shr    $0x7,%ecx
  0.63 │        cmp    $0x201,%eax
  0.60 │        movzbl %dl,%eax
  0.63 │        cmovae %edi,%ecx
  0.65 │        add    (%rax,%rsi,1),%cl
  0.59 │        movzbl %cl,%edi
  0.64 │        cmp    $0x28,%dil
  0.00 │      ↓ ja     1b47
  0.63 │        mov    0x90(%rsp),%rax
  0.64 │        lea    (%rax,%rdi,8),%rsi
  0.63 │        mov    $0x3,%edx
  0.67 │        mov    0x28(%rsp),%r14
  0.63 │        mov    %r14,%rdi
  0.91 │      → call   dav1d_msac_decode_symbol_adapt4_sse2
  1.22 │        and    $0x3,%eax                   ; also this is the % 4 in `rav1d_msac_decode_symbol_adapt4`.
  0.71 │        cmp    $0x3,%eax                   ;   it's not strictly necessary.
       │      ↑ je     cf0
  0.86 │        imul   $0x17ff41,%eax,%eax
  0.75 │        mov    %eax,%edx
       │        shr    $0x9,%edx
  0.69 │        mov    0x8(%rsp),%rdi

the fact that this has such an effect is, i think, symptomatic of the kind of diffuse warm code that rav1d and dav1d have under load. i'd typically have expected the indirect calls to get lodged firmly in the branch predictor and not really be any worse than direct calls. the fact that (at least on Zen 4/Zen 5) execution does stall at these indirect calls tells me that there might actually be a few thousand branches between calls to any one assembly routine, so the predicted target address gets evicted? direct calls are a lot easier at that point because their target address doesn't rely on anything fancy - it's just part of the instruction :)

another really subtle effect: since these are indirect calls, LLVM correctly decides that it can be beneficial to load the target address into a register once, then call through it multiple times. in a bunch of places this increases register pressure and forces extra spills to keep the cached address live across code that would clobber it. in the region i've emphasized here you can see exactly this: the last instruction loads rdi from 0x8(%rsp), where the GOT version loaded it from 0x10(%rsp), for exactly this reason - calls became direct, so there was no need to keep a call target around, so a spill didn't happen, so the stack slot wasn't needed, and the whole function's frame is 16 bytes smaller. kind of wild how much ends up shifting around!
