runtime: steal tasks from the LIFO slot #7431

Open · wants to merge 17 commits into master

Conversation

@hawkw (Member) commented Jun 27, 2025

Motivation

Worker threads in the multi-threaded runtime include a per-worker LIFO
slot which stores the last task notified by another task running on that
worker. This allows the last-notified task to be polled first when the
currently running task completes, decreasing latency in message-passing
"ping-pong" scenarios.

However --- as described in #4941 --- there's an issue with this that
can cause severe problems in some situations: the task in the LIFO slot
cannot be stolen by a work-stealing worker thread. This means that if a
task notifies another task and then goes CPU bound for a long period of
time without yielding, the notified task will never be able to execute
until the task that notified it can yield. This can result in a very
severe latency bubble in some scenarios. See, for instance, #4323,
#6954, oxidecomputer/omicron#8334, etc.

As a workaround, PR #4936 added an unstable runtime::Builder option
to disable the LIFO slot. However, this is a less-than-ideal
solution, as it means that applications which disable the LIFO slot due
to occasional usage patterns that cause latency bubbles when it is
enabled cannot benefit from the potential latency improvements it
offers in other usage patterns. And, it's an unstable option which the
user has to discover. In most cases, people whose programs contain usage
patterns that are pathological with regards to the LIFO slot don't know
this ahead of time: the typical narrative is that you write code that
happens to follow such a pattern, discover an unexpected latency spike
or hang in production, and then learn how to disable the LIFO slot. It
would be much nicer if the task in the LIFO slot could participate in
work-stealing like every other task.

Solution

This branch makes tasks in the LIFO slot stealable.

Broadly, I've taken the following approach:

  1. Add a new AtomicNotified type that implements an atomically
    swappable cell storing a Notified task
    (1220253), and use it to represent
    the LIFO slot instead of an Option<Notified<Arc<Handle>>>. This
    way, other workers can take a task out of the LIFO slot while
    work-stealing. (A rough sketch of the idea appears after this list.)

  2. Move the LIFO slot out of the worker::Core struct and into the run
    queue's Inner type (75d8116),
    making it shared state between the Local side of the queue owned by
    the worker itself and the Steal side used by remote workers to
    steal from the queue.

    There's a bunch of additional code in worker::Core for managing
    whether to actually run a task from the LIFO slot or not. I opted
    not to move any of this code into the run queue itself, as it
    depends on other bits of internal worker state. Instead, we just
    expose to the worker separate APIs for pushing/popping to/from the
    main queue and for pushing/popping to/from the LIFO slot, resulting
    in a fairly small diff to the worker's run loop.

  3. Change the work-stealing code to also steal the LIFO task
    (730a581 and
    cb27dda). This actually turned out
    to be pretty straightforward: once we've stolen a chunk of tasks
    from the targeted worker's run queue, we now also grab whatever's in
    its LIFO slot as well. If we stole a LIFO task, it's returned from
    the steal_into method in lieu of the first task in the run queue, so
    that it gets to execute first, maintaining the latency improvement for
    recently-notified tasks. This was also simpler than trying to wedge it
    into the chunk of tasks to be pushed to the stealer's queue. (The
    sketch after this list also illustrates this.)

I've made the following test changes while working on this branch:

  • Added an integration test that reproduces a scenario where a task in
    the LIFO slot is blocked from running when a task notifies it and then
    blocks indefinitely (roughly sketched after this list). I've confirmed
    that this test fails on master and passes on this branch.
  • Added new Loom tests for work-stealing involving the LIFO slot.
    These are in addition to the existing work-stealing Loom tests, as
    tasks notified by I/O or timers are still woken to the normal run
    queue.
  • A small change to the
    rt_unstable_metrics::worker_local_queue_depth integration test,
    which was necessary as tasks in the LIFO slot now "count" towards the
    worker's queue depth. We now have to make sure the no-op task that's
    spawned has completed before spawning the tasks we actually attempt to
    count, as it seems to sometimes end up in the LIFO slot and sometimes
    not, causing the test to flake out.

Fixes #4941

hawkw added 3 commits June 27, 2025 14:15
This commit changes the LIFO slot on multi-threaded runtime workers from
being a mutable `Option<Notified<Arc<Handle>>>` to a new
`AtomicNotified` type. This
type implements a cell containing a nullable task pointer which can
be swapped atomically. It's analogous to `AtomicCell` but with the extra
`PhantomData` to remember the task's scheduler type parameter, which
would otherwise be erased by the conversion into a `*mut Header`
pointer.

This change is in preparation for a subsequent change to allow
work-stealing from the LIFO slot (see: #4941).
This way, it's accessible by the stealer. Leave all the LIFO *accounting*
(i.e. deciding whether we hit the LIFO slot or not) up to the worker.

Gotta figure out whether the load of lifo presence will race...ugh.
@github-actions bot added the R-loom-current-thread, R-loom-multi-thread, and R-loom-multi-thread-alt labels Jun 27, 2025
@hawkw removed the R-loom-current-thread label Jun 30, 2025
@github-actions bot added the R-loom-current-thread label Jun 30, 2025
This commit adds a test ensuring that if a task is notified to the LIFO
slot by another task which then blocks the worker thread forever, the
LIFO task is eventually stolen by another worker. I've confirmed that
this test fails on the `master` branch, and passes after these changes.
hawkw and others added 5 commits June 30, 2025 12:53
This test spawns a task that sometimes ends up in the LIFO slot and
sometimes doesn't. This was previously fine as the LIFO slot didn't
count for `worker_local_queue_depth`, but now it does. Thus, we have to
make sure that task no longer exists before asserting about queue depth.
@hawkw changed the title from "[WIP] runtime: steal from the LIFO slot" to "runtime: steal tasks from the LIFO slot" Jul 1, 2025
@hawkw self-assigned this Jul 1, 2025
@hawkw added the C-enhancement, A-tokio, and M-runtime labels Jul 1, 2025
@hawkw marked this pull request as ready for review July 1, 2025 19:39
@hawkw requested review from carllerche, Darksonn and ADD-SP July 1, 2025 19:39
@hawkw (Member, Author) commented Jul 1, 2025

(shoutout to @mkeeter for nerd-sniping me into actually doing this)

Comment on lines +463 to +465
// If we also grabbed the task from the LIFO slot, include that in the
// steal count as well.
dst_stats.incr_steal_count(n as u16 + lifo.is_some() as u16);
Member:

For reviewers: Rust defines the cast from bool to u16.

Reply:

It's also possible to use u16::from(bool), if you care about cast_lossless cleanliness.

@ADD-SP (Member) left a comment:

It's not easy for me to understand the test lifo_stealable, so I've proposed some comments to help with understanding.

I also have a question: do we have any performance regression tests that cover the relevant code path? It would be good to confirm there's no significant regression, as this PR changes the hot path a lot.

If there is a performance regression, we can discuss whether it's acceptable in exchange for a stealable LIFO slot.

@@ -116,32 +175,59 @@ fn steal_overflow() {
fn multi_stealer() {
Member:

For reviewers: This test is unchanged, except that the inner fn steal_tasks(..) -> usize is moved out of it for reuse in the multi_stealer_lifo test.

Member Author:

Yeah, that's correct.

@hawkw (Member, Author) commented Jul 3, 2025

@ADD-SP thanks for your suggested comment improvements for the test! I ended up adding a bunch of comments in 0889f08 --- I wanted to also add a higher-level overview at the beginning of the test explaining what the different channels are used for, and I changed the names of some things. Let me know what you think? Hopefully it's a bit easier to understand now.

@hawkw (Member, Author) commented Jul 3, 2025

@ADD-SP:

I also have a question, did we have some performance regression test that covers the relevant code path? It would be better confirming no significant regression as this PR changes the hot path a lot.

If there is a performance regression, we can discuss if this is acceptable for supporting steal-able LIFO slot.

I think the existing rt_multi_threaded.rs benchmarks should exercise the code changed here --- in particular, I think the rt_multi_ping_pong benchmark in that file exercises the performance improvement from the LIFO slot optimization, since it represents a situation where two tasks send messages back and forth.

I'll go run these benchmarks and report my results. However, I think it might be worthwhile to also compare performance on an ARM system. I would guess that any performance regression from changing the LIFO slot from a mutable local pointer to a shared atomic pointer would be more noticeable with ARM's weaker memory semantics. On x86 systems, changing accesses to the LIFO slot pointer to atomic accesses with AcqRel ordering should be basically the same as a non-atomic access, but these will result in different instructions on an ARM system.

Unfortunately, I don't have easy access to a multi-core ARM machine to run these benchmarks on. If someone else with an ARM machine, like an Apple Silicon Mac, wants to run the benches on ARM, I'd love to see the results.

@hawkw (Member, Author) commented Jul 3, 2025

This FreeBSD CI run failure looks a bit weird --- it looks like maybe something timed out? I'm re-running it.

.push_back_or_overflow(prev, self, &mut core.stats);
true
}
None => false,
@Darksonn (Contributor) commented Jul 4, 2025:

This boolean indicates whether other worker threads are notified that there is work available to steal. Since the lifo slot is now stealable, it should no longer be false in this scenario.

Contributor:

Is this something we could have caught with a loom test? Maybe put something in the lifo slot and then block the thread, concurrently with another thread doing nothing?

as per [this comment][1] from @Darksonn, we should notify other workers
any time we push to the LIFO slot, as it is stealable now. this
simplifies the logic in `schedule_local` a bit, as we can get rid of the
`should_notify` bool.

[1]: #7431 (comment)
@ADD-SP (Member) left a comment:

There is a failing test; could you take a look?

Labels
A-tokio Area: The main tokio crate C-enhancement Category: A PR with an enhancement or bugfix. M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR
Development

Successfully merging this pull request may close these issues.

rt: make the LIFO slot in the multi-threaded scheduler stealable
4 participants