
Conversation


@yakryder yakryder commented Oct 10, 2025

Ref #6504

POC for tracking next deadline using min heap

This benchmark demonstrates the mutex contention issue described
in tokio-rs#6504, specifically focusing on the drop path for timers that
are registered but never fire.

The benchmark creates 10,000 sleep timers, polls each once to
initialize and register it with the timer wheel, then drops them
before they fire. This simulates the common case of timeouts
that don't fire (e.g., operations that complete before their
timeout).

Baseline results show severe contention: the 8-worker case is
only ~1.5x faster than single-threaded.

Refs: tokio-rs#6504
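
The drop-path pattern the benchmark exercises can be modeled with a std-only toy (illustrative only; this is not the actual benches/time_drop_sleep_contention.rs source, and `churn` and the id-keyed map are made-up stand-ins for timer registration): every register and every deregister serializes on one global lock, so adding workers buys little.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;

pub fn churn(global: &Mutex<BTreeMap<u64, &'static str>>, base: u64, n: u64) {
    for i in 0..n {
        let key = base + i;
        // Register: one global lock acquisition per timer creation.
        global.lock().unwrap().insert(key, "timer");
        // Drop before firing: a second acquisition on the drop path.
        global.lock().unwrap().remove(&key);
    }
}

fn main() {
    let global = Arc::new(Mutex::new(BTreeMap::new()));
    let handles: Vec<_> = (0..8u64)
        .map(|w| {
            let g = Arc::clone(&global);
            // Eight "workers" churning 10,000 timers each, all serializing
            // on the single shared lock.
            thread::spawn(move || churn(&g, w * 10_000, 10_000))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Every registered timer was dropped before "firing".
    assert!(global.lock().unwrap().is_empty());
    println!("done");
}
```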
Reduces lock contention in timer operations by registering timers in a
per-worker HashMap for the multi-threaded runtime, while falling back
to the global timer wheel for current_thread runtime and block_in_place.

Benchmark results (benches/time_drop_sleep_contention.rs):
- Single-threaded: 33.3ms → 32.7ms (no regression)
- Multi-threaded (8 workers): 21.6ms → 16.0ms (25.9% faster)

Refs tokio-rs#6504
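
The shape of that change can be sketched as follows (std-only and hypothetical; `register`, the id-based storage, and `GLOBAL_FALLBACK` are illustrative names, not the PR's code): per-worker storage needs no lock at all, and only contexts outside a worker pay for a shared lock.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Mutex;

// Shared fallback for contexts that are not on a worker thread
// (standing in for the current_thread / block_in_place cases).
static GLOBAL_FALLBACK: Mutex<Vec<(u64, u64)>> = Mutex::new(Vec::new());

thread_local! {
    // deadline -> timer ids, owned by this worker thread: no lock needed.
    static LOCAL_TIMERS: RefCell<HashMap<u64, Vec<u64>>> =
        RefCell::new(HashMap::new());
}

// Register a timer either in this worker's local map or in the fallback.
pub fn register(deadline: u64, id: u64, on_worker: bool) {
    if on_worker {
        LOCAL_TIMERS.with(|t| {
            t.borrow_mut().entry(deadline).or_default().push(id);
        });
    } else {
        GLOBAL_FALLBACK.lock().unwrap().push((deadline, id));
    }
}

fn main() {
    register(5, 1, true); // fast path: thread-local, uncontended
    register(9, 2, false); // fallback path: shared lock
    LOCAL_TIMERS.with(|t| assert_eq!(t.borrow().get(&5), Some(&vec![1])));
    assert_eq!(GLOBAL_FALLBACK.lock().unwrap().len(), 1);
    println!("ok");
}
```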
@github-actions github-actions bot added R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-time-driver Run loom time driver tests on this PR labels Oct 10, 2025
Comment on lines 1075 to 1081
```rust
fn fire_expired_timers(&mut self, now: Instant) {
    self.timers.retain(|&deadline, wakers| {
        (now < deadline) || {
            wakers.drain(..).for_each(Waker::wake);
            false
        }
    });
}
```
Contributor


This is a very expensive operation.

```rust
/// This is called from TimerEntry::poll_elapsed when a timer is registered.
/// The waker will be fired when fire_expired_timers() is called with a time
/// >= deadline.
pub(crate) fn register_timer(&mut self, deadline: Instant, waker: Waker) {
```
Member


Apart from the O(n) time complexity, how would a timer be de-registered?

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-time Module: tokio/time labels Oct 11, 2025
@yakryder
Author

@Darksonn @ADD-SP Yes, the retain is expensive, though it still outperforms the global lock in this benchmark.

I have a WIP locally that addresses the iteration and deregistration concerns. It is still a hashmap of instants to wakers, but it tracks the next deadline and holds weak waker references. The next-deadline tracking gets us to ~4x single-threaded performance in the eight-worker, 10,000-timer benchmark (the baseline with the global wheel is around 1.5x, I think).

I wanted to explore a wheel-free solution initially because my fallible intuition told me a wheel was more than what was needed, and the global wheel felt out of place in tokio's design. Having said that, the wheel is already written and ready to go back to being worker state at any time. I also resisted reaching for it directly because I assumed there must have been a compelling reason to unify the worker wheels into one to support work stealing, even though, to my fallible and incomplete understanding, wakers are location-transparent and gracefully no-op when called on a completed task.

The first thing I wanted to do to address the contention, because it felt small, manageable, and simple, was to make a mailbox for the global wheel. I was really into that idea for a couple of hours. Then I started asking myself: how can we make this less complex, not more? That's how I started thinking about "easier" solutions than a wheel. But an even less invasive solution is to axe the mutex and return the wheel to the workers, with driver polling: the driver wakes the worker, and the worker wakes the task.

Does that seem reasonable?

@ADD-SP
Member

ADD-SP commented Oct 13, 2025

@yakryder Thanks for your efforts!

Not sure if you have read #7384 and #7467; I'm currently working on solving the lock contention issue fundamentally.

It seems you already have some ideas, but I recommend reading my RFC (#7384) and PR (#7467) first; then you may want to share your suggestions.

If you have a better architectural design, please open an issue and describe it in detail before writing the code. Since fixing this lock contention issue typically involves significant code changes, it's best to reach consensus on the design first to avoid unnecessary work.

@ADD-SP ADD-SP added the S-waiting-on-author Status: awaiting some action (such as code changes) from the PR or issue author. label Oct 13, 2025
@Darksonn
Contributor

Yes, the retain is expensive, though still enough to outperform the global lock.

Maybe in the specific benchmark you came up with, but it's going to perform really badly in other scenarios. Imagine a normal application that has one million timers registered with relatively large durations. Every single time we hit this codepath, we're going to be spending a large amount of time checking every single timer again. Registering millions of timers is expected usage of Tokio, and it needs to perform well.
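
For context, the rescan cost comes from `retain` visiting every entry. A std-only illustration (not code from this thread; `fire_expired` and the `&'static str` payloads are made-up stand-ins for wakers) of an ordered map keyed by deadline, where firing touches only the expired prefix via `split_off`:

```rust
use std::collections::BTreeMap;

// Fire only the expired prefix: split_off(&(now + 1)) keeps every entry
// with deadline > now and returns it; what remains behind is the expired
// prefix. Cost is O(log n) plus the number of timers actually fired,
// instead of a retain() scan over every registered timer.
pub fn fire_expired(
    timers: &mut BTreeMap<u64, Vec<&'static str>>,
    now: u64,
) -> Vec<&'static str> {
    let still_pending = timers.split_off(&(now + 1));
    let expired = std::mem::replace(timers, still_pending);
    expired.into_values().flatten().collect()
}

fn main() {
    let mut timers = BTreeMap::new();
    timers.insert(5, vec!["a"]);
    timers.insert(10, vec!["b"]);
    timers.insert(99, vec!["c"]);
    assert_eq!(fire_expired(&mut timers, 10), vec!["a", "b"]);
    assert_eq!(timers.len(), 1); // only the deadline-99 entry remains
    println!("ok");
}
```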

@yakryder
Author

@ADD-SP Thank you for your thoughts and the links. I'm caught up now.

I'm sorry for jumping in here -- I wouldn't have picked this issue had I known you were working on it.

I have something simpler in mind for implementation.

Would you be willing to give your thoughts?

@yakryder
Author

@Darksonn Agreed, thank you.

I will put the wheel back where it was originally. It is very well proven, and there are many people better qualified to do nuanced perf tuning in Rust.

@yakryder
Author

It is OK with me if this PR's scope gets downgraded to just the benchmark or, even if it provides a viable fix, it does not merge. I have found doing this work a very empowering and rewarding experience.

@ADD-SP
Member

ADD-SP commented Oct 14, 2025

I have something simpler in mind for implementation.

If you mean a simpler implementation of #7467 , you could leave comments in #7467.

If you mean a simpler and new architectural design, please open an issue and explain it in detail before writing the code.

Some of these improvements may not actually be better, but they helped me rule things out as a newcomer (e.g. not starting timing until after tasks and sleeps are created).
@yakryder
Author

@ADD-SP Sure thing 👍

Because of my ecosystem ignorance I was doing absurd numbers of iterations for the first few rounds. Bringing the iteration count down made it possible to get quick feedback from a much larger number of timers.
@github-actions github-actions bot removed R-loom-time-driver Run loom time driver tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR labels Oct 21, 2025
Introduce GlobalTimerBuckets, a ring buffer of per-bucket locks for
timers 0-120 seconds in the future. This reduces occurrence and impact
of global lock contention for short-lived timers, the common case.

Timers > 120s fall back to the existing timer wheel. When a timer is
dropped, it must be removed from whichever storage it's in before the
underlying memory is freed. Add try_remove() to safely remove from
buckets, and update clear_entry() to call it.

Performance: Preliminary results from a million concurrent timer
benchmark show 84x improvement in multi-threaded runs and 25x over
single-threaded.
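
A compact sketch of the bucket idea (hypothetical and std-only; `TimerBuckets`, the id-based storage, and the one-second granularity are illustrative simplifications, not the PR's actual GlobalTimerBuckets code): a ring of independently locked buckets means each operation contends only on the one bucket its deadline maps to, rather than on a single global lock.

```rust
use std::sync::Mutex;

const HORIZON_SECS: u64 = 120;

// One lock per one-second bucket; operations on different deadlines touch
// different locks instead of a single global one.
pub struct TimerBuckets {
    buckets: Vec<Mutex<Vec<u64>>>, // timer ids, keyed by deadline second
}

impl TimerBuckets {
    pub fn new() -> Self {
        TimerBuckets {
            buckets: (0..HORIZON_SECS).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    /// Returns false if the deadline is beyond the horizon, in which case
    /// the caller falls back to the wheel.
    pub fn insert(&self, now_secs: u64, deadline_secs: u64, id: u64) -> bool {
        if deadline_secs.saturating_sub(now_secs) >= HORIZON_SECS {
            return false;
        }
        let idx = (deadline_secs % HORIZON_SECS) as usize;
        self.buckets[idx].lock().unwrap().push(id);
        true
    }

    /// Remove a dropped timer before its memory is freed (the try_remove
    /// role described in the commit message above).
    pub fn try_remove(&self, deadline_secs: u64, id: u64) -> bool {
        let idx = (deadline_secs % HORIZON_SECS) as usize;
        let mut bucket = self.buckets[idx].lock().unwrap();
        if let Some(pos) = bucket.iter().position(|&t| t == id) {
            bucket.swap_remove(pos);
            true
        } else {
            false
        }
    }

    /// Drain everything due in the bucket for `second`.
    pub fn fire(&self, second: u64) -> Vec<u64> {
        let idx = (second % HORIZON_SECS) as usize;
        std::mem::take(&mut *self.buckets[idx].lock().unwrap())
    }
}

fn main() {
    let b = TimerBuckets::new();
    assert!(b.insert(0, 5, 42));
    assert!(!b.insert(0, 500, 7)); // beyond the horizon: use the wheel
    assert!(b.try_remove(5, 42)); // dropped before firing
    assert!(b.fire(5).is_empty());
    println!("ok");
}
```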
@github-actions github-actions bot added the R-loom-time-driver Run loom time driver tests on this PR label Oct 25, 2025
@yakryder yakryder force-pushed the reduce-timer-contention branch from e7f1378 to 4fcc8ff Compare October 25, 2025 15:37
@yakryder
Author

yakryder commented Oct 25, 2025

Aware of multiple test failures:

  • loom time driver: will investigate bucket synchronization under concurrent execution
  • macOS full tests: will investigate
  • ARM/cross targets: will investigate
  • Spelling check: will address
  • Many more pending

@yakryder
Copy link
Author

I might not have time to do more work on this for a few days.
What really makes the most sense to me, from my fallible vantage point at this moment, is to convert the three lower levels of the wheel into one level backed by the new buckets, potentially with an eye toward making the time window for the "fast path" timers configurable.

If there's no appetite for this or a similar approach, that's totally fine. Working on tokio has been a master class in concurrent Rust. I feel privileged to have been part of it.

Member

@ADD-SP ADD-SP left a comment


We need to make an agreement on the architectural design before writing the code, as this work is complex.

```rust
let current_tick = self.ref_time.fetch_add(1, Ordering::AcqRel) + 1;

// Fire all timers in this bucket
let mut bucket = self.buckets[bucket_idx].timers.lock();
```
Member

@ADD-SP ADD-SP Oct 26, 2025


This looks similar to what we did in 1914e1e which was reverted in 1ae9434.

The reason we reverted it is that it introduced more lock contention on the hot path.

Author


That was sharding; this is fine-grained locks. I think you surfaced something very meaningful when you went looking for the timer benchmarks that supported sharding and came up empty: the benchmark-driven rollout for per-worker wheels feels absolutely indispensable.

I will be rolling this back, but feel free to look at the benchmark numbers if you're curious.

```rust
///
/// This is used to calculate when the driver should wake up.
pub(crate) fn next_expiration_time(&self) -> Option<u64> {
    let next = self.next_wake.load(Ordering::Acquire);
```
Member


Does this atomic load require extra synchronization?

Author


Will be gone in next diff


```rust
/// Returns true if this timer is registered in the buckets (vs the wheel).
pub(super) fn is_in_buckets(&self) -> bool {
    self.in_buckets.load(Ordering::Relaxed)
}
```
Member


Does this atomic load require extra synchronization?

Author


Same as above 👍🏻

@yakryder
Author

Thanks @ADD-SP! I am switching focus to a more modest optimization that will retain its value after #7467 lands. It's just an implementation detail of the wheel: tracking deadlines in a min heap at each level. The next-deadline calculation becomes O(1), whereas currently, in the worst case, we traverse every level behind a lock. It's definitely low-value in a world where timer wheels are back on workers, but I'd expect it to ease some pain if it were ready sooner.

I'd like to have that ready as a review candidate this week. I appreciate the consideration and feedback on the naive hashmap and per-bucket locks
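
A minimal std-only sketch of that idea (illustrative names like `LevelDeadlines`, not the PR's code): keeping a min-heap of pending deadline ticks alongside a wheel level makes the next expiration an O(1) peek instead of a scan over the level's slots.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Default)]
pub struct LevelDeadlines {
    // BinaryHeap is a max-heap; Reverse turns it into a min-heap of ticks.
    heap: BinaryHeap<Reverse<u64>>,
}

impl LevelDeadlines {
    pub fn insert(&mut self, when: u64) {
        self.heap.push(Reverse(when));
    }

    /// O(1): the earliest pending deadline, if any.
    pub fn next_deadline(&self) -> Option<u64> {
        self.heap.peek().map(|Reverse(when)| *when)
    }

    /// Pop every deadline that has expired at `now` (a real implementation
    /// would also lazily discard cancelled timers here).
    pub fn pop_expired(&mut self, now: u64) -> Vec<u64> {
        let mut fired = Vec::new();
        while let Some(Reverse(when)) = self.heap.peek().copied() {
            if when > now {
                break;
            }
            self.heap.pop();
            fired.push(when);
        }
        fired
    }
}

fn main() {
    let mut level = LevelDeadlines::default();
    for when in [30, 5, 17] {
        level.insert(when);
    }
    assert_eq!(level.next_deadline(), Some(5));
    assert_eq!(level.pop_expired(17), vec![5, 17]);
    assert_eq!(level.next_deadline(), Some(30));
    println!("ok");
}
```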

@yakryder yakryder changed the title time: reduce timer drop contention with per-worker timers time: reduce timer contention with min heap next deadline Oct 28, 2025