time: reduce timer contention with min heap next deadline #7668
base: master
Conversation
This benchmark demonstrates the mutex contention issue described in tokio-rs#6504, specifically focusing on the drop path for timers that are registered but never fire. The benchmark creates 10,000 sleep timers, polls each once to initialize and register it with the timer wheel, then drops them before they fire. This simulates the common case of timeouts that don't fire (e.g., operations that complete before their timeout). Baseline results show severe contention: the 8-worker case is only ~1.5x faster than single-threaded. Refs: tokio-rs#6504
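As a rough illustration, a drop-path benchmark of this shape could look like the sketch below. This is not the actual benches/time_drop_sleep_contention.rs; the runtime configuration, timer count, and structure are illustrative assumptions.

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::task::Poll;
use std::time::{Duration, Instant};

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(8)
        .enable_time()
        .build()
        .unwrap();

    rt.block_on(async {
        let start = Instant::now();
        let handles: Vec<_> = (0..10_000)
            .map(|_| {
                tokio::spawn(async {
                    // A long sleep that will never fire during the benchmark.
                    let mut sleep = pin!(tokio::time::sleep(Duration::from_secs(60)));
                    // Poll once so the timer registers with the driver.
                    poll_fn(|cx| {
                        let _ = sleep.as_mut().poll(cx);
                        Poll::Ready(())
                    })
                    .await;
                    // `sleep` is dropped here without firing, exercising the
                    // deregistration path that contends on the timer lock.
                })
            })
            .collect();
        for h in handles {
            h.await.unwrap();
        }
        println!("elapsed: {:?}", start.elapsed());
    });
}
```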
Reduces lock contention in timer operations by registering timers in a per-worker HashMap for the multi-threaded runtime, while falling back to the global timer wheel for current_thread runtime and block_in_place.

Benchmark results (benches/time_drop_sleep_contention.rs):
- Single-threaded: 33.3ms → 32.7ms (no regression)
- Multi-threaded (8 workers): 21.6ms → 16.0ms (25.9% faster)

Refs tokio-rs#6504
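Roughly, the per-worker structure amounts to something like the following sketch. The `LocalTimers` type and its method names are hypothetical, not the actual patch; the expiry path is the same `retain` shown in the review excerpt below.

```rust
use std::collections::HashMap;
use std::task::Waker;
use std::time::Instant;

/// Hypothetical per-worker timer storage: no shared lock is taken to
/// register or drop a timer that stays on its worker.
#[derive(Default)]
struct LocalTimers {
    timers: HashMap<Instant, Vec<Waker>>,
}

impl LocalTimers {
    /// Called when a timer is polled for the first time on this worker.
    fn register(&mut self, deadline: Instant, waker: Waker) {
        self.timers.entry(deadline).or_default().push(waker);
    }

    /// Fires every waker whose deadline has passed; this is the O(n)
    /// scan that the review comments below call out as expensive.
    fn fire_expired(&mut self, now: Instant) {
        self.timers.retain(|&deadline, wakers| {
            (now < deadline) || {
                wakers.drain(..).for_each(Waker::wake);
                false
            }
        });
    }
}
```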
fn fire_expired_timers(&mut self, now: Instant) {
    self.timers.retain(|&deadline, wakers| {
        (now < deadline) || {
            wakers.drain(..).for_each(Waker::wake);
            false
        }
    });
}
This is a very expensive operation.
/// This is called from TimerEntry::poll_elapsed when a timer is registered.
/// The waker will be fired when fire_expired_timers() is called with a time
/// >= deadline.
pub(crate) fn register_timer(&mut self, deadline: Instant, waker: Waker) {
Apart from the O(n) time complexity, how do you de-register a timer?
@Darksonn @ADD-SP Yes, the retain is expensive, though still enough to outperform the global lock. I have a WIP locally that addresses the iteration and deregistration concerns. It is still a hashmap of instants to wakers, but with next-deadline tracking and weak waker references. The next-deadline tracking gets us to ~4x single-threaded performance in the eight-worker, 10,000-timer benchmark (baseline with the global wheel is around 1.5x, I think).

I wanted to explore a wheel-free solution initially because my fallible intuition told me a wheel was more than what was needed, and the global wheel felt out of place with tokio's design. Having said that, the wheel is already written and ready to go back to being worker state anytime. I also resisted reaching for it directly because I assumed there must have been a compelling reason to unify the worker wheels into one to support work stealing, even though, to my fallible and incomplete understanding, wakers are location transparent and gracefully no-op when called on a completed task.

The first thing I wanted to do to address the contention, because it felt small, manageable, and simple, was make a mailbox for the global wheel. I was really into that idea for a couple of hours. Then I started asking myself: how can we make this less complex, not more? That's how I started thinking about "easier" solutions than a wheel. But an even less invasive solution is to axe the mutex and return the wheel to the workers with driver polling. Driver wakes worker, worker wakes task. Does that seem reasonable?
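For context, the next-deadline tracking mentioned here could look roughly like the sketch below: a variant of the earlier per-worker map that caches the earliest registered deadline, so the expiry path can return immediately when nothing can have fired yet. Names are hypothetical and this is not the actual WIP; the weak-waker aspect is omitted.

```rust
use std::collections::HashMap;
use std::task::Waker;
use std::time::Instant;

#[derive(Default)]
struct LocalTimers {
    timers: HashMap<Instant, Vec<Waker>>,
    /// Earliest registered deadline, if any timers are pending.
    next_deadline: Option<Instant>,
}

impl LocalTimers {
    fn register(&mut self, deadline: Instant, waker: Waker) {
        self.timers.entry(deadline).or_default().push(waker);
        // Keep the cached minimum up to date: one comparison per insert.
        if self.next_deadline.map_or(true, |d| deadline < d) {
            self.next_deadline = Some(deadline);
        }
    }

    fn fire_expired(&mut self, now: Instant) {
        // Early out: skip the O(n) scan when nothing can have expired.
        if self.next_deadline.map_or(true, |d| now < d) {
            return;
        }
        self.timers.retain(|&deadline, wakers| {
            (now < deadline) || {
                wakers.drain(..).for_each(Waker::wake);
                false
            }
        });
        // Recompute the minimum only after an actual scan.
        self.next_deadline = self.timers.keys().min().copied();
    }
}
```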
@yakryder Thanks for your efforts! Not sure if you have read #7384 and #7467; I'm currently working on solving the lock contention issue fundamentally. It seems you already have some ideas, but I recommend reading my RFC (#7384) and PR (#7467) first, and then you may want to share your suggestions. If you have a better architectural design, please open an issue and describe it in detail before writing the code. Since fixing this lock contention issue typically involves significant code changes, it's best to reach consensus on the design first to avoid unnecessary work.
Maybe in the specific benchmark you came up with, but it's going to perform really badly in other scenarios. Imagine a normal application that has one million timers registered with relatively large durations. Every single time we hit this codepath, we're going to be spending a large amount of time checking every single timer again. Registering millions of timers is expected usage of Tokio, and it needs to perform well.
@ADD-SP Thank you for your thoughts and the links. I'm caught up now. I'm sorry for jumping in here -- I wouldn't have picked this issue had I known you were working on it. I have something simpler in mind for implementation. Would you be willing to give your thoughts?
@Darksonn Agreed, thank you. I will put the wheel back where it was originally. It is very well proven, and there are many people better qualified to do nuanced perf tuning in Rust.
It is OK with me if this PR's scope gets reduced to just the benchmark, or if it doesn't merge even if it provides a viable fix. I have found doing this work a very empowering and rewarding experience.
Some of these improvements may not actually be better, but they helped me rule things out as a newcomer (e.g. do not start timing until after tasks and sleeps are created).
@ADD-SP Sure thing 👍
Because of my ecosystem ignorance I was doing absurd numbers of iterations for the first few rounds. Bringing the iteration count down enabled getting quick feedback from a much larger number of timers.
This reverts commit 24a53ea.
Introduce GlobalTimerBuckets, a ring buffer of per-bucket locks for timers 0-120 seconds in the future. This reduces the occurrence and impact of global lock contention for short-lived timers, the common case. Timers > 120s fall back to the existing timer wheel.

When a timer is dropped, it must be removed from whichever storage it's in before the underlying memory is freed. Add try_remove() to safely remove from buckets, and update clear_entry() to call it.

Performance: preliminary results from a million-concurrent-timer benchmark show an 84x improvement in multi-threaded runs and 25x over single-threaded.
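A hedged sketch of the bucket layout this commit describes is shown below. Field names and signatures are hypothetical, not the actual patch; the firing side, which advances a tick counter and drains one bucket at a time, is omitted here but corresponds to the ref_time excerpt quoted in the review further down.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::task::Waker;
use std::time::{Duration, Instant};

const NUM_BUCKETS: usize = 120; // one bucket per second, up to 120s out

struct GlobalTimerBuckets {
    start: Instant,
    /// Each one-second bucket has its own lock, keyed here by a per-timer id.
    buckets: [Mutex<HashMap<u64, Waker>>; NUM_BUCKETS],
}

impl GlobalTimerBuckets {
    fn new() -> Self {
        GlobalTimerBuckets {
            start: Instant::now(),
            buckets: std::array::from_fn(|_| Mutex::new(HashMap::new())),
        }
    }

    fn bucket_index(&self, deadline: Instant) -> usize {
        (deadline.saturating_duration_since(self.start).as_secs() as usize) % NUM_BUCKETS
    }

    /// Returns false if the deadline is too far out and must use the wheel.
    fn try_insert(&self, id: u64, deadline: Instant, waker: Waker) -> bool {
        if deadline.saturating_duration_since(Instant::now())
            >= Duration::from_secs(NUM_BUCKETS as u64)
        {
            return false; // fall back to the existing timer wheel
        }
        self.buckets[self.bucket_index(deadline)]
            .lock()
            .unwrap()
            .insert(id, waker);
        true
    }

    /// Drop path: the entry must be removed from its bucket before the
    /// timer's memory is freed.
    fn try_remove(&self, id: u64, deadline: Instant) -> Option<Waker> {
        self.buckets[self.bucket_index(deadline)]
            .lock()
            .unwrap()
            .remove(&id)
    }
}
```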
Aware of multiple test failures:
I might not have time to do more work on this for a few days. If we don't have appetite for this or a similar approach, that's totally fine. Working on tokio has been a master class in concurrent Rust. I feel privileged to have been part of it.
We need to reach agreement on the architectural design before writing the code, as this work is complex.
let current_tick = self.ref_time.fetch_add(1, Ordering::AcqRel) + 1;

// ...

// Fire all timers in this bucket
let mut bucket = self.buckets[bucket_idx].timers.lock();
That was shards. This is fine-grained locks. I think you surfaced something very meaningful when you went looking for the timer benchmarks that supported sharding and came up empty -- the benchmark-driven rollout for per-worker wheels feels absolutely indispensable.

I will be rolling this back, but feel free to look at the benchmark numbers if curious.
///
/// This is used to calculate when the driver should wake up.
pub(crate) fn next_expiration_time(&self) -> Option<u64> {
    let next = self.next_wake.load(Ordering::Acquire);
Does this atomic load require extra synchronization?
Will be gone in the next diff.
/// Returns true if this timer is registered in the buckets (vs the wheel).
pub(super) fn is_in_buckets(&self) -> bool {
    self.in_buckets.load(Ordering::Relaxed)
}
Does this atomic load require extra synchronization?
Same as above 👍🏻
Thanks @ADD-SP! I am switching focus to a more modest optimization that will retain its value after #7467 lands. It's just an implementation detail of the wheel: tracking deadlines in a min heap on each level. The next-deadline calculation becomes O(1), whereas currently, in the worst case, we traverse every level behind the lock. It's definitely low-value in a world where timer wheels are back on workers, but I'd expect it to ease some pain if it were ready sooner.

I'd like to have that ready as a review candidate this week. I appreciate the consideration and feedback on the naive hashmap and per-bucket locks.
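A minimal sketch of that idea, with hypothetical types rather than tokio's actual wheel internals: keep a min-heap of deadlines alongside each wheel level, handle removals lazily, and compute the wheel's next expiration by peeking each level's heap instead of walking its slots.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

#[derive(Default)]
struct Level {
    /// Deadline tick -> number of live timers, standing in for the real slot lists.
    occupancy: HashMap<u64, usize>,
    /// Min-heap of deadlines (in ticks) that were inserted into this level.
    deadlines: BinaryHeap<Reverse<u64>>,
}

impl Level {
    fn insert(&mut self, deadline_tick: u64) {
        *self.occupancy.entry(deadline_tick).or_insert(0) += 1;
        self.deadlines.push(Reverse(deadline_tick));
    }

    fn remove(&mut self, deadline_tick: u64) {
        // Only the occupancy map is updated; the stale heap entry is
        // discarded lazily the next time it surfaces in next_deadline().
        if let Some(n) = self.occupancy.get_mut(&deadline_tick) {
            *n -= 1;
            if *n == 0 {
                self.occupancy.remove(&deadline_tick);
            }
        }
    }

    /// O(1) in the common case: peek the earliest live deadline, popping
    /// stale entries only when they surface.
    fn next_deadline(&mut self) -> Option<u64> {
        while let Some(&Reverse(tick)) = self.deadlines.peek() {
            if self.occupancy.contains_key(&tick) {
                return Some(tick);
            }
            self.deadlines.pop();
        }
        None
    }
}

/// The wheel's next expiration is the minimum across its levels.
fn next_expiration(levels: &mut [Level]) -> Option<u64> {
    levels.iter_mut().filter_map(Level::next_deadline).min()
}
```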
Ref #6504
POC for tracking next deadline using min heap