this is about a 1.5% difference with 32 threads, 0.5% difference with 2 threads, and no difference for single-threaded measurements. this might not be an issue on macOS targets, though i can't really check that. but this definitely applies to Linux targets, and probably BSDs. maybe Windows too, but i doubt it? it's also probably more of a Rust issue, but i'm not yet sure how to describe what exactly is happening. so, some notes here in case it inspires some tinkering :)
all call sites to functions provided by `librav1dasm.a` are indirected through the GOT rather than being direct calls like to any other internal function. which is to say, they're `ff 15 <addr>` calls like `call *addr(%rip)` rather than `e8 <offset>` calls like `call $+<offset>`. this gist is a standalone demonstration of the issue. here, though, i'm going to use the particularly hot loop in `decode_coefs` as a more relevant example. the Rust code is, roughly, here:
```rust
let level = &mut levels[level_off..];
ctx = get_lo_ctx(level, tx_class, &mut mag, lo_ctx_offsets, x, y, stride);
if tx_class == TxClass::TwoD {
    y |= x;
}
tok = rav1d_msac_decode_symbol_adapt4(&mut ts_c.msac, &mut lo_cdf[ctx as usize], 3);
if dbg {
    println!(
        "Post-lo_tok[{}][{}][{}][{}={}={}]: r={}",
        t_dim.ctx, chroma, ctx, i, rc_i, tok, ts_c.msac.rng,
    );
}
if tok == 3 {
```
("roughly" because the assembly that follows is from a change i was measuring and ought to PR, but for the region here it's about the same, with more branches, on `main`.)
`perf`-annotated assembly from a build via `cargo +nightly build --release`, transcoding `Chimera-AV1-8bit-1920x1080-6736kbps.ivf`, looks something like this:
```
0.63 │ add    0xc0(%rsp),%rsi
0.64 │ shr    $0x7,%ecx
0.67 │ cmp    $0x201,%eax
0.63 │ movzbl %dl,%eax
0.70 │ cmovae %edi,%ecx
0.60 │ add    (%rax,%rsi,1),%cl
0.62 │ movzbl %cl,%edi
0.69 │ cmp    $0x28,%dil
     │ ↓ ja   1a6c
0.65 │ mov    0x80(%rsp),%rax
0.72 │ lea    (%rax,%rdi,8),%rsi
1.26 │ mov    $0x3,%edx
0.67 │ mov    %r14,%rdi
8.39 │ → call *0x1e08a1(%rip)  # 2e13e0 <_DYNAMIC+0x460> ; ouch, 8% of time in this function we're here.
0.48 │ and    $0x3,%eax  ; we're in this function 10% of the time,
0.50 │ cmp    $0x3,%eax  ; so this instruction is 0.8% of total runtime!
     │ ↑ je   cb0
0.49 │ imul   $0x17ff41,%eax,%eax
0.50 │ mov    %eax,%edx
     │ shr    $0x9,%edx
0.49 │ mov    0x10(%rsp),%rdi
```
unfortunately `perf` doesn't figure out that the item at `0x2e13e0` is the address of `rav1d_msac_decode_symbol_adapt4_sse2`, so it might not be obvious what's going on here at first. but that function is statically linked! i found a related internals post from someone remarking on exactly the same issue, but otherwise i couldn't find obvious references on the Rust issue tracker.
bjorn3 describes the mechanics of why this is how it is in the replies, though it's all pretty unfortunate. with RELRO, the PLT-based approach of "call to a stub that resolves the symbol, then patches its GOT entry" goes out the window, as the GOT is read-only. so instead, GOT entries are resolved when the program is loaded, and calls go directly through GOT entries to avoid the PLT indirection. this is generally beneficial: functions do `call [&target]` instead of `call target_stub; target_stub: call [&target]`.
the relaxation bjorn3 mentions is a link-time optimization, where `call target_stub`, for a target that's local to the binary, can be rewritten to `call target` without other changes to the function. a five-byte call turns into a five-byte call with no one the wiser. with a call through the GOT, presumably ld (and lld?) give up on the opportunity to relax the `call [&target]` into `nop; call target`, or it's niche enough to not be an implemented optimization.
the relevant code in `session.rs` is the same as it was in 2022. if the target does not want `plt_by_default` and has `relro_level == Full`, you get calls through GOT entries and end up here. (that's how i arrived at Linux and BSDs: those are the targets that default to `RelroLevel::Full`.)
so, if you build a release binary specifically asking for PLT calls, like `RUSTFLAGS="-Z plt=yes" cargo +nightly build --release`, and decode the same video, the resulting `dav1d` is about 1.5% faster with 32 cores, and the above hot code now looks like:
```
0.67 │ lea    (%rsi,%rsi,4),%rsi
0.64 │ add    0xc0(%rsp),%rsi
0.60 │ shr    $0x7,%ecx
0.63 │ cmp    $0x201,%eax
0.60 │ movzbl %dl,%eax
0.63 │ cmovae %edi,%ecx
0.65 │ add    (%rax,%rsi,1),%cl
0.59 │ movzbl %cl,%edi
0.64 │ cmp    $0x28,%dil
0.00 │ ↓ ja   1b47
0.63 │ mov    0x90(%rsp),%rax
0.64 │ lea    (%rax,%rdi,8),%rsi
0.63 │ mov    $0x3,%edx
0.67 │ mov    0x28(%rsp),%r14
0.63 │ mov    %r14,%rdi
0.91 │ → call dav1d_msac_decode_symbol_adapt4_sse2
1.22 │ and    $0x3,%eax  ; also this is the % 4 in `rav1d_msac_decode_symbol_adapt4`.
0.71 │ cmp    $0x3,%eax  ; it's not strictly necessary.
     │ ↑ je   cf0
0.86 │ imul   $0x17ff41,%eax,%eax
0.75 │ mov    %eax,%edx
     │ shr    $0x9,%edx
0.69 │ mov    0x8(%rsp),%rdi
```
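to make the flag stick for a project rather than passing `RUSTFLAGS` by hand, something like this in `.cargo/config.toml` should be equivalent (nightly-only; i haven't exhaustively checked how it interacts with other flags):

```toml
[build]
rustflags = ["-Zplt=yes"]
```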
the fact that this has such an effect, i think, is symptomatic of the kind of diffused warm code that `rav1d` and `dav1d` have under load. i'd typically have expected the indirect calls to get lodged firmly in the branch predictor and not really be any worse than direct calls. the fact that (at least on Zen 4/Zen 5) it does stall at these indirect calls tells me that there might actually be a few thousand branches between calls to any one of the assembly routines, so the predicted target address gets evicted? direct calls are a lot easier at that point, because their target address doesn't rely on anything fancy, it's just part of the instruction :)
another effect that is really subtle: with indirect calls, LLVM correctly decides that it can be beneficial to load the target address once into a register, then call through that register multiple times. in a bunch of places this increases register pressure and causes some extra spills to preserve values whose registers get clobbered by some assembly routine's address. in the region i've emphasized here, you can see that the last instruction loads `rdi` from the next higher stack slot for exactly this reason: calls became direct, so there was no need to save a call target, so a spill didn't happen, so the stack slot wasn't needed, and the whole function's frame is 16 bytes smaller. kind of wild how much ends up shifting around!