runtime: steal tasks from the LIFO slot #7431
Conversation
This commit changes the LIFO slot on multi-threaded runtime workers from being a mutable `Option<Notified<Arc<Handle>>>` to a new `AtomicNotified` type. This type implements a cell containing a nullable task pointer which can be swapped atomically. It's analogous to `AtomicCell`, but with an extra `PhantomData` to remember the task's scheduler type parameter, which would otherwise be erased by the conversion into a `*mut Header` pointer. This change is in preparation for a subsequent change to allow work-stealing from the LIFO slot (see: #4941).
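Conceptually, the new type is shaped something like this minimal sketch (not the actual implementation: `Header` stands in for tokio's internal task header type, and the real type converts `Notified<S>` to and from raw pointers rather than exposing them):

```rust
use std::marker::PhantomData;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Stand-in for the internal task header type.
struct Header;

/// A cell holding a nullable task pointer that can be swapped atomically.
/// The `PhantomData` remembers the scheduler type parameter `S`, which
/// would otherwise be erased by storing the task as a raw `*mut Header`.
struct AtomicNotified<S> {
    ptr: AtomicPtr<Header>,
    _scheduler: PhantomData<S>,
}

impl<S> AtomicNotified<S> {
    fn new() -> Self {
        Self {
            ptr: AtomicPtr::new(ptr::null_mut()),
            _scheduler: PhantomData,
        }
    }

    /// Atomically replace the slot's contents, returning the previous
    /// occupant (if any).
    fn swap(&self, task: *mut Header) -> Option<*mut Header> {
        let prev = self.ptr.swap(task, Ordering::AcqRel);
        if prev.is_null() {
            None
        } else {
            Some(prev)
        }
    }

    /// Atomically empty the slot, returning whatever task was in it.
    fn take(&self) -> Option<*mut Header> {
        self.swap(ptr::null_mut())
    }
}
```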
This way, it's accessible to the stealer. Leave all the LIFO *accounting* (i.e. deciding whether we hit the LIFO slot or not) up to the worker. Gotta figure out whether the load of LIFO presence will race...ugh.
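In sketch form, the resulting sharing structure looks roughly like this (building on the `AtomicNotified` sketch above; the ring-buffer fields of the real queue are elided):

```rust
use std::sync::Arc;

/// Shared state of a worker's run queue. Moving the LIFO slot in here
/// makes it reachable from both sides of the queue.
struct Inner<S> {
    // ... ring-buffer head/tail indices and task buffer elided ...
    /// The LIFO slot, now stealable because it lives in shared state.
    lifo_slot: AtomicNotified<S>,
}

/// The handle owned by the worker itself, which keeps doing all of the
/// LIFO accounting.
struct Local<S> {
    inner: Arc<Inner<S>>,
}

/// The handle cloned to remote workers for work-stealing; it can now
/// also take the task out of `lifo_slot`.
struct Steal<S>(Arc<Inner<S>>);
```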
This commit adds a test ensuring that if a task is notified to the LIFO slot by another task which then blocks the worker thread forever, the LIFO task is eventually stolen by another worker. I've confirmed that this test fails on the `master` branch, and passes after these changes.
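The scenario such a test exercises looks roughly like this (a simplified sketch with made-up names, not the actual test from this PR):

```rust
use std::time::Duration;

#[test]
fn lifo_task_is_stolen_from_blocked_worker() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .build()
        .unwrap();
    let (done_tx, done_rx) = std::sync::mpsc::channel::<()>();
    let (wake_tx, wake_rx) = tokio::sync::oneshot::channel::<()>();

    // This task parks until it is notified; the wakeup places it in the
    // notifying worker's LIFO slot.
    rt.spawn(async move {
        wake_rx.await.unwrap();
        done_tx.send(()).unwrap();
    });

    rt.spawn(async move {
        // Notify the other task...
        wake_tx.send(()).unwrap();
        // ...then block this worker thread forever without yielding.
        loop {
            std::thread::sleep(Duration::from_secs(1));
        }
    });

    // This recv only succeeds if another worker steals and runs the task.
    done_rx
        .recv_timeout(Duration::from_secs(60))
        .expect("the LIFO task should be stolen by another worker");

    // Tear down without waiting for the blocked worker thread.
    rt.shutdown_background();
}
```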
This test spawns a task that sometimes ends up in the LIFO slot and sometimes doesn't. This was previously fine as the LIFO slot didn't count for `worker_local_queue_depth`, but now it does. Thus, we have to make sure that task no longer exists before asserting about queue depth.
(shoutout to @mkeeter for nerd-sniping me into actually doing this)
```rust
// If we also grabbed the task from the LIFO slot, include that in the
// steal count as well.
dst_stats.incr_steal_count(n as u16 + lifo.is_some() as u16);
```
It's also possible to use `u16::from(bool)`, if you care about `cast_lossless` cleanliness.
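i.e., the line above could become:

```rust
dst_stats.incr_steal_count(n as u16 + u16::from(lifo.is_some()));
```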
Co-authored-by: Matt Keeter <[email protected]>
as suggested by @ADD-SP in #7431 (comment)
It's not easy for me to understand the test `lifo_stealable`, so I've proposed some comments to help with understanding.
I also have a question: do we have a performance regression test that covers the relevant code path? It would be good to confirm that there's no significant regression, as this PR changes the hot path a lot.
If there is a performance regression, we can discuss whether it's acceptable in exchange for a stealable LIFO slot.
```
@@ -116,32 +175,59 @@ fn steal_overflow() {
fn multi_stealer() {
```
For reviewers: this test is not changed, except that the inner `fn steal_tasks(..) -> usize` is moved out of this test so it can be reused in the `multi_stealer_lifo` test.
Yeah, that's correct.
Thanks @ADD-SP for the suggestions!
@ADD-SP thanks for your suggested comment improvements for the test! I ended up adding a bunch of comments in 0889f08 --- I wanted to also add a higher-level overview at the beginning of the test explaining what the different channels are used for, and I changed the names of some things. Let me know what you think? Hopefully it's a bit easier to understand now.
I think the existing benchmarks should cover the relevant code paths. I'll go run these benchmarks and report my results. However, I think it might be worthwhile to also compare performance on an ARM system. I would guess that if there's a noticeable performance regression from changing the LIFO slot from a mutable local pointer to a shared `AtomicNotified`, it would be more pronounced on ARM, which has a weaker memory model than x86. Unfortunately, I don't have easy access to a multi-core ARM machine to run these benchmarks on. If someone else with an ARM machine, like an Apple Silicon Mac, wants to run the benches on ARM, I'd love to see the results.
This FreeBSD CI run failure looks a bit weird --- it looks like maybe something timed out? I'm re-running it.
```rust
            .push_back_or_overflow(prev, self, &mut core.stats);
        true
    }
    None => false,
```
This boolean indicates whether other worker threads are notified that there is work available to steal. Since the lifo slot is now stealable, it should no longer be false in this scenario.
Is this something we could have caught with a loom test? Maybe put something in the lifo slot and then block the thread, concurrently with another thread doing nothing?
as per [this comment][1] from @Darksonn, we should notify other workers any time we push to the LIFO slot, as it is stealable now. this simplifies the logic in `schedule_local` a bit, as we can get rid of the `should_notify` bool.

[1]: #7431 (comment)
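In sketch form, the shape of the change (simplified; `lifo_slot_swap` is a placeholder name for the run queue's new atomic-swap API, not the actual method):

```rust
// Before: pushing to the LIFO slot only notified other workers when it
// displaced a previous occupant into the run queue, since the slot itself
// could not be stolen from.
let should_notify = match core.lifo_slot.replace(task) {
    Some(prev) => {
        core.run_queue
            .push_back_or_overflow(prev, self, &mut core.stats);
        true
    }
    None => false,
};
if should_notify {
    self.notify_parked_local();
}

// After: the slot is stealable, so notify unconditionally after pushing.
if let Some(prev) = core.run_queue.lifo_slot_swap(task) {
    core.run_queue
        .push_back_or_overflow(prev, self, &mut core.stats);
}
self.notify_parked_local();
```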
There is a failing test; could you take a look?
Motivation
Worker threads in the multi-threaded runtime include a per-worker LIFO
slot which stores the last task notified by another task running on that
worker. This allows the last-notified task to be polled first when the
currently running task completes, decreasing latency in message-passing
"ping-pong" scenarios.
However --- as described in #4941 --- there's an issue with this that
can cause severe problems in some situations: the task in the LIFO slot
cannot be stolen by a work-stealing worker thread. This means that if a
task notifies another task and then goes CPU bound for a long period of
time without yielding, the notified task will never be able to execute
until the task that notified it can yield. This can result in a very
severe latency bubble in some scenarios. See, for instance, #4323,
#6954, oxidecomputer/omicron#8334, etc.
As a workaround, PR #4936 added an unstable `runtime::Builder` option to disable the LIFO slot. However, this is a less-than-ideal
solution, as it means that applications which disable the LIFO slot due
to occasional usage patterns that cause latency bubbles when it is
enabled cannot benefit from the potential latency improvements it
offers in other usage patterns. And, it's an unstable option which the
user has to discover. In most cases, people whose programs contain usage
patterns that are pathological with regards to the LIFO slot don't know
this ahead of time: the typical narrative is that you write code that
happens to follow such a pattern, discover an unexpected latency spike
or hang in production, and then learn how to disable the LIFO slot. It
would be much nicer if the task in the LIFO slot could participate in
work-stealing like every other task.
Solution
This branch makes tasks in the LIFO slot stealable.
Broadly, I've taken the following approach:
- Add a new `AtomicNotified` type that implements an atomically swappable cell storing a `Notified` task (1220253), and use it to represent the LIFO slot instead of an `Option<Notified<Arc<Handle>>>`. This way, other workers can take a task out of the LIFO slot while work-stealing.
- Move the LIFO slot out of the `worker::Core` struct and into the run queue's `Inner` type (75d8116), making it shared state between the `Local` side of the queue owned by the worker itself and the `Steal` side used by remote workers to steal from the queue.
  There's a bunch of additional code in `worker::Core` for managing whether to actually run a task from the LIFO slot or not. I opted not to move any of this code into the run queue itself, as it depends on other bits of internal worker state. Instead, we just expose to the worker separate APIs for pushing/popping to/from the main queue and for pushing/popping to/from the LIFO slot, resulting in a fairly small diff to the worker's run loop.
- Change the work-stealing code to also steal the LIFO task (730a581 and cb27dda); see the sketch after this list. This actually turned out to be pretty straightforward: once we've stolen a chunk of tasks from the targeted worker's run queue, we now also grab whatever's in its LIFO slot as well. If we stole a LIFO task, it's returned from the `steal_into` method in lieu of the first task in the run queue, so that it gets to execute first, maintaining the latency improvement for recently-notified tasks. This was also simpler than trying to wedge it into the chunk of tasks to be pushed to the stealer's queue.
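The sketch promised above: a loose outline of the stealing path (schematic only; `Notified`, `Local`, and `Stats` are tokio internals, and `steal_half_into` / `lifo_slot_take` are invented names standing in for the real batch-steal and slot-swap logic):

```rust
fn steal_into(&self, dst: &mut Local, dst_stats: &mut Stats) -> Option<Notified> {
    // Steal roughly half of the victim's run queue into `dst`'s buffer.
    let n = self.steal_half_into(dst);

    // Now that the LIFO slot lives in the queue's shared state, also take
    // whatever task is sitting in it.
    let lifo = self.lifo_slot_take();

    if n == 0 && lifo.is_none() {
        // Nothing to steal.
        return None;
    }

    // If we also grabbed the task from the LIFO slot, include that in the
    // steal count as well.
    dst_stats.incr_steal_count(n as u16 + u16::from(lifo.is_some()));

    // Return the stolen LIFO task in lieu of the first queued task, so the
    // most-recently-notified task still runs first on the stealing worker.
    lifo.or_else(|| dst.pop())
}
```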
I've made the following test changes while working on this branch:

- Added new tests ensuring that a task in the LIFO slot is eventually stolen when it is blocked from running because the task that notified it then blocks indefinitely. I've confirmed that this test fails on `master` and passes on this branch. These are in addition to the existing work-stealing Loom tests, as tasks notified by I/O or timers are still woken to the normal run queue.
- Changed the `rt_unstable_metrics::worker_local_queue_depth` integration test, which was necessary as tasks in the LIFO slot now "count" towards the worker's queue depth. We now have to make sure the no-op task that's spawned has completed before spawning the tasks we actually attempt to count, as it seems to sometimes end up in the LIFO slot and sometimes not, causing the test to flake out.
Fixes #4941