time: reduce timer contention with min heap next deadline #7668
base: master
Conversation
This benchmark demonstrates the mutex contention issue described in tokio-rs#6504, specifically focusing on the drop path for timers that are registered but never fire. The benchmark creates 10,000 sleep timers, polls each once to initialize and register it with the timer wheel, then drops them before they fire. This simulates the common case of timeouts that don't fire (e.g., operations that complete before their timeout). Baseline results show severe contention: the 8-worker case is only ~1.5x faster than single-threaded. Refs: tokio-rs#6504
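As a rough illustration, a drop-path benchmark of this shape could look like the sketch below. This is not the actual benches/time_drop_sleep_contention.rs; the runtime configuration, timer count, and structure are illustrative assumptions.

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::task::Poll;
use std::time::{Duration, Instant};

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(8)
        .enable_time()
        .build()
        .unwrap();

    rt.block_on(async {
        let start = Instant::now();
        let handles: Vec<_> = (0..10_000)
            .map(|_| {
                tokio::spawn(async {
                    // A long sleep that will never fire during the benchmark.
                    let mut sleep = pin!(tokio::time::sleep(Duration::from_secs(60)));
                    // Poll once so the timer registers with the driver.
                    poll_fn(|cx| {
                        let _ = sleep.as_mut().poll(cx);
                        Poll::Ready(())
                    })
                    .await;
                    // `sleep` is dropped here without firing, exercising the
                    // deregistration path that contends on the timer lock.
                })
            })
            .collect();
        for h in handles {
            h.await.unwrap();
        }
        println!("elapsed: {:?}", start.elapsed());
    });
}
```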
Reduces lock contention in timer operations by registering timers in a per-worker HashMap for the multi-threaded runtime, while falling back to the global timer wheel for current_thread runtime and block_in_place.

Benchmark results (benches/time_drop_sleep_contention.rs):
- Single-threaded: 33.3ms → 32.7ms (no regression)
- Multi-threaded (8 workers): 21.6ms → 16.0ms (25.9% faster)

Refs tokio-rs#6504
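Roughly, the per-worker structure amounts to something like the following sketch. The `LocalTimers` type and its method names are hypothetical, not the actual patch; the expiry path is the same `retain` shown in the review excerpt below.

```rust
use std::collections::HashMap;
use std::task::Waker;
use std::time::Instant;

/// Hypothetical per-worker timer storage: no shared lock is taken to
/// register or drop a timer that stays on its worker.
#[derive(Default)]
struct LocalTimers {
    timers: HashMap<Instant, Vec<Waker>>,
}

impl LocalTimers {
    /// Called when a timer is polled for the first time on this worker.
    fn register(&mut self, deadline: Instant, waker: Waker) {
        self.timers.entry(deadline).or_default().push(waker);
    }

    /// Fires every waker whose deadline has passed; this is the O(n)
    /// scan that the review comments below call out as expensive.
    fn fire_expired(&mut self, now: Instant) {
        self.timers.retain(|&deadline, wakers| {
            (now < deadline) || {
                wakers.drain(..).for_each(Waker::wake);
                false
            }
        });
    }
}
```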
fn fire_expired_timers(&mut self, now: Instant) {
    self.timers.retain(|&deadline, wakers| {
        (now < deadline) || {
            wakers.drain(..).for_each(Waker::wake);
            false
        }
    });
}
This is a very expensive operation.
/// This is called from TimerEntry::poll_elapsed when a timer is registered.
/// The waker will be fired when fire_expired_timers() is called with a time
/// >= deadline.
pub(crate) fn register_timer(&mut self, deadline: Instant, waker: Waker) {
Apart from the O(n) time complexity, how do you de-register a timer?
@Darksonn @ADD-SP Yes, the retain is expensive, though still enough to outperform the global lock. I have a WIP locally that addresses the iteration and deregistration concerns. It is still a hashmap of instants to wakers, but with next-deadline tracking and weak waker references. The next-deadline tracking gets us to ~4x single-threaded performance in the eight-worker, 10,000-timer benchmark (baseline with the global wheel is around 1.5x, I think).

I wanted to explore a wheel-free solution initially because my fallible intuition told me a wheel was more than what was needed, and the global wheel felt out of place with tokio's design. Having said that, the wheel is already written and ready to go back to being worker state anytime. I also resisted reaching for it directly because I assumed there must have been a compelling reason to unify the worker wheels into one to support work stealing, even though, to my fallible and incomplete understanding, wakers are location transparent and gracefully no-op when called on a completed task.

The first thing I wanted to do to address the contention, because it felt small, manageable, and simple, was make a mailbox for the global wheel. I was really into that idea for a couple of hours. Then I started asking myself: how can we make this less complex, not more? That's how I started thinking about "easier" solutions than a wheel. But an even less invasive solution is to axe the mutex and return the wheel to the workers with driver polling. Driver wakes worker, worker wakes task. Does that seem reasonable?
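For context, the next-deadline tracking mentioned here could look roughly like the sketch below: a variant of the earlier per-worker map that caches the earliest registered deadline, so the expiry path can return immediately when nothing can have fired yet. Names are hypothetical and this is not the actual WIP; the weak-waker aspect is omitted.

```rust
use std::collections::HashMap;
use std::task::Waker;
use std::time::Instant;

#[derive(Default)]
struct LocalTimers {
    timers: HashMap<Instant, Vec<Waker>>,
    /// Earliest registered deadline, if any timers are pending.
    next_deadline: Option<Instant>,
}

impl LocalTimers {
    fn register(&mut self, deadline: Instant, waker: Waker) {
        self.timers.entry(deadline).or_default().push(waker);
        // Keep the cached minimum up to date: one comparison per insert.
        if self.next_deadline.map_or(true, |d| deadline < d) {
            self.next_deadline = Some(deadline);
        }
    }

    fn fire_expired(&mut self, now: Instant) {
        // Early out: skip the O(n) scan when nothing can have expired.
        if self.next_deadline.map_or(true, |d| now < d) {
            return;
        }
        self.timers.retain(|&deadline, wakers| {
            (now < deadline) || {
                wakers.drain(..).for_each(Waker::wake);
                false
            }
        });
        // Recompute the minimum only after an actual scan.
        self.next_deadline = self.timers.keys().min().copied();
    }
}
```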
@yakryder Thanks for your efforts! Not sure if you have read #7384 and #7467; I'm currently working on solving the lock contention issue fundamentally. It seems you already have some ideas, but I recommend reading my RFC (#7384) and PR (#7467) first, and then you may want to share your suggestions. If you have a better architectural design, please open an issue and describe it in detail before writing the code. Since fixing this lock contention issue typically involves significant code changes, it's best to reach consensus on the design first to avoid unnecessary work.
Maybe in the specific benchmark you came up with, but it's going to perform really badly in other scenarios. Imagine a normal application that has one million timers registered with relatively large durations. Every single time we hit this codepath, we're going to be spending a large amount of time checking every single timer again. Registering millions of timers is expected usage of Tokio, and it needs to perform well.
@ADD-SP Thank you for your thoughts and the links. I'm caught up now. I'm sorry for jumping in here -- I wouldn't have picked this issue had I known you were working on it. I have something simpler in mind for implementation. Would you be willing to give your thoughts?
@Darksonn Agreed, thank you. I will put the wheel back where it was originally. It is very well proven, and there are many people better qualified to do nuanced perf tuning in Rust.
It is OK with me if this PR's scope gets reduced to just the benchmark, or if it doesn't merge even if it provides a viable fix. I have found doing this work a very empowering and rewarding experience.
Some of these improvements may not actually be better, but they helped me rule things out as a newcomer (e.g. do not start timing until after tasks and sleeps are created).
@ADD-SP Sure thing 👍
Because of my ecosystem ignorance I was doing absurd numbers of iterations for the first few rounds. Bringing the iteration count down enabled getting quick feedback from a much larger number of timers.
This reverts commit 24a53ea.
Introduce GlobalTimerBuckets, a ring buffer of per-bucket locks for timers 0-120 seconds in the future. This reduces the occurrence and impact of global lock contention for short-lived timers, the common case. Timers > 120s fall back to the existing timer wheel.

When a timer is dropped, it must be removed from whichever storage it's in before the underlying memory is freed. Add try_remove() to safely remove from buckets, and update clear_entry() to call it.

Performance: preliminary results from a million-concurrent-timer benchmark show an 84x improvement in multi-threaded runs and 25x over single-threaded.
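A hedged sketch of the bucket layout this commit describes is shown below. Field names and signatures are hypothetical, not the actual patch; the firing side, which advances a tick counter and drains one bucket at a time, is omitted here but corresponds to the ref_time excerpt quoted in the review further down.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::task::Waker;
use std::time::{Duration, Instant};

const NUM_BUCKETS: usize = 120; // one bucket per second, up to 120s out

struct GlobalTimerBuckets {
    start: Instant,
    /// Each one-second bucket has its own lock, keyed here by a per-timer id.
    buckets: [Mutex<HashMap<u64, Waker>>; NUM_BUCKETS],
}

impl GlobalTimerBuckets {
    fn new() -> Self {
        GlobalTimerBuckets {
            start: Instant::now(),
            buckets: std::array::from_fn(|_| Mutex::new(HashMap::new())),
        }
    }

    fn bucket_index(&self, deadline: Instant) -> usize {
        (deadline.saturating_duration_since(self.start).as_secs() as usize) % NUM_BUCKETS
    }

    /// Returns false if the deadline is too far out and must use the wheel.
    fn try_insert(&self, id: u64, deadline: Instant, waker: Waker) -> bool {
        if deadline.saturating_duration_since(Instant::now())
            >= Duration::from_secs(NUM_BUCKETS as u64)
        {
            return false; // fall back to the existing timer wheel
        }
        self.buckets[self.bucket_index(deadline)]
            .lock()
            .unwrap()
            .insert(id, waker);
        true
    }

    /// Drop path: the entry must be removed from its bucket before the
    /// timer's memory is freed.
    fn try_remove(&self, id: u64, deadline: Instant) -> Option<Waker> {
        self.buckets[self.bucket_index(deadline)]
            .lock()
            .unwrap()
            .remove(&id)
    }
}
```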
Aware of multiple test failures:
I might not have time to do more work on this for a few days. If we don't have appetite for this or a similar approach, that's totally fine. Working on tokio has been a master class in concurrent Rust. I feel privileged to have been part of it.
We need to reach agreement on the architectural design before writing the code, as this work is complex.
let current_tick = self.ref_time.fetch_add(1, Ordering::AcqRel) + 1;

// ...

// Fire all timers in this bucket
let mut bucket = self.buckets[bucket_idx].timers.lock();
That was shards. This is fine-grained locks. I think you surfaced something very meaningful when you went looking for the timer benchmarks that supported sharding and came up empty -- the benchmark-driven rollout for per-worker wheels feels absolutely indispensable.

I will be rolling this back, but feel free to look at the benchmark numbers if curious.
///
/// This is used to calculate when the driver should wake up.
pub(crate) fn next_expiration_time(&self) -> Option<u64> {
    let next = self.next_wake.load(Ordering::Acquire);
Does this atomic load require extra synchronization?
Will be gone in the next diff.
/// Returns true if this timer is registered in the buckets (vs the wheel).
pub(super) fn is_in_buckets(&self) -> bool {
    self.in_buckets.load(Ordering::Relaxed)
}
Does this atomic load require extra synchronization?
Same as above 👍🏻
Thanks @ADD-SP! I am switching focus to a more modest optimization that will retain its value after #7467 lands. It's just an implementation detail of the wheel: tracking deadlines in a min heap on each level. The next-deadline calculation becomes O(1), whereas currently, in the worst case, we traverse every level behind the lock. It's definitely low-value in a world where timer wheels are back on workers, but I'd expect it to ease some pain if it were ready sooner.

I'd like to have that ready as a review candidate this week. I appreciate the consideration and feedback on the naive hashmap and per-bucket locks.
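A minimal sketch of that idea, with hypothetical types rather than tokio's actual wheel internals: keep a min-heap of deadlines alongside each wheel level, handle removals lazily, and compute the wheel's next expiration by peeking each level's heap instead of walking its slots.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

#[derive(Default)]
struct Level {
    /// Deadline tick -> number of live timers, standing in for the real slot lists.
    occupancy: HashMap<u64, usize>,
    /// Min-heap of deadlines (in ticks) that were inserted into this level.
    deadlines: BinaryHeap<Reverse<u64>>,
}

impl Level {
    fn insert(&mut self, deadline_tick: u64) {
        *self.occupancy.entry(deadline_tick).or_insert(0) += 1;
        self.deadlines.push(Reverse(deadline_tick));
    }

    fn remove(&mut self, deadline_tick: u64) {
        // Only the occupancy map is updated; the stale heap entry is
        // discarded lazily the next time it surfaces in next_deadline().
        if let Some(n) = self.occupancy.get_mut(&deadline_tick) {
            *n -= 1;
            if *n == 0 {
                self.occupancy.remove(&deadline_tick);
            }
        }
    }

    /// O(1) in the common case: peek the earliest live deadline, popping
    /// stale entries only when they surface.
    fn next_deadline(&mut self) -> Option<u64> {
        while let Some(&Reverse(tick)) = self.deadlines.peek() {
            if self.occupancy.contains_key(&tick) {
                return Some(tick);
            }
            self.deadlines.pop();
        }
        None
    }
}

/// The wheel's next expiration is the minimum across its levels.
fn next_expiration(levels: &mut [Level]) -> Option<u64> {
    levels.iter_mut().filter_map(Level::next_deadline).min()
}
```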
Ref #6504
POC for tracking next deadline using min heap