Description
Summary
Please check out #6504 for the initial context.
This design reduces lock contention in the time module, which improves scalability for async servers that read/write async sockets with a timeout.
- Use a local wheel for each worker thread instead of a global wheel.
- Do not register the timer into the wheel on creation.
- Register the timer into the local wheel on the first `poll`.
- Cancel a timer by removing it from the local wheel directly if the timer is canceled by the local thread.
- Cancel a timer by sending the intrusive list entry over `std::sync::mpsc` to the worker thread that owns it if the timer is canceled by a remote thread.
Current Implementation
You can skip it if you are already familiar with the codebase.
There is a global wheel protected by `std::sync::Mutex`, so we have to acquire the `Mutex` before registering and cancelling a timer.
Resetting a timer doesn't acquire the `Mutex` as it just updates an `AtomicU64`.
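For readers skimming the codebase, here is a heavily simplified sketch of that layout; the types and field names below are illustrative stand-ins, not the actual tokio definitions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct Wheel {
    // hierarchical timer wheel, details omitted
    entries: Vec<u64>,
}

struct TimeDriver {
    // Shared by every worker thread: register and cancel both take this lock.
    wheel: Mutex<Wheel>,
}

struct TimerState {
    // Resetting a timer only stores a new tick here, so it skips the Mutex.
    deadline_tick: AtomicU64,
}

impl TimeDriver {
    fn register(&self, timer: &TimerState, tick: u64) {
        timer.deadline_tick.store(tick, Ordering::Release);
        self.wheel.lock().unwrap().entries.push(tick); // the contention point
    }

    fn reset(&self, timer: &TimerState, new_tick: u64) {
        // No lock needed: the driver re-reads the atomic when it fires.
        timer.deadline_tick.store(new_tick, Ordering::Release);
    }
}
```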
What's the problem?
Worker threads can process their owned sockets concurrently, which scales well.
However, since timers are heavily used for socket timeouts in async servers/clients, reading or writing an async socket with a timeout causes significant lock contention on the global wheel.
What have we tried?
commit: time: use sharding for timer implementation
This commit splits the global wheel into several wheels (indexed by thread_id), each wrapped in a `Mutex` and stored in the `Runtime`:
- To calculate the earliest timer, it locks all wheels one by one while parking the driver.
- To register or cancel a timer, it locks the specific wheel and inserts or removes the entry.
Unfortunately, this was reverted in 1ae9434 because the overhead of locking multiple `Mutex`es in hot paths is much higher than the global lock contention.
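A rough illustration of why parking became expensive under that scheme (the names below are hypothetical, not the reverted code itself):

```rust
use std::sync::Mutex;

struct ShardedDriver {
    // One wheel per shard, indexed by thread id.
    shards: Vec<Mutex<Vec<u64>>>,
}

impl ShardedDriver {
    fn register(&self, thread_id: usize, tick: u64) {
        // Only the shard owning this timer is locked.
        self.shards[thread_id % self.shards.len()]
            .lock()
            .unwrap()
            .push(tick);
    }

    fn next_expiration(&self) -> Option<u64> {
        // Parking the driver has to lock every shard one by one.
        self.shards
            .iter()
            .filter_map(|shard| shard.lock().unwrap().iter().min().copied())
            .min()
    }
}
```

Even when each lock is uncontended, paying N lock/unlock pairs per park turned out to be more expensive in hot paths than one contended global lock.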
How does the kernel handle this case?
I'm not a kernel developer, please correct me if I'm wrong.
For non-hrtimers, all timers are registered into a per-CPU timer wheel, and the kernel also locks that per-CPU wheel when registering/canceling a timer.
This looks just like what we did in time: use sharding for timer implementation (already been reverted).
However, I think the big difference is that the kernel spinlocks the per-CPU wheel, which is much more efficient than a userspace `Mutex` since the kernel controls its own scheduler and interrupts.
With the benefit of spinlocks in kernel context, the lock contention should be much lower than with a userspace `Mutex`.
So we cannot adopt the kernel's design directly.
How does Golang handle this case?
I'm not a Golang expert, please correct me if I'm wrong.
Golang uses a per-GOPROC 4-ary heap to register all timers.
Golang marks the timer as deleted when canceling it, and marked timers are removed while registering a new timer if they are at the top of the heap.
Because removal from the per-GOPROC heap is deferred, registering a new timer doesn't need to acquire a lock.
The cost of this design is that if too many (>= 1/4) marked timers are not at the top of the heap, the GOPROC must be stopped to scan the entire 4-ary heap, which triggers an O(n log n) operation at an arbitrary time and could lead to task starvation.
This might not be acceptable in tokio.
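For illustration only, here is a rough Rust re-creation of the lazy-deletion idea described above (this is not Go's actual runtime code, and all names here are made up):

```rust
use std::cell::Cell;
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::rc::Rc;

struct LazyTimer {
    when: u64,
    deleted: Cell<bool>, // cancel only flips this flag
}

#[derive(Default)]
struct LazyHeap {
    heap: BinaryHeap<Reverse<(u64, usize)>>, // (deadline, slot index), min-heap
    slots: Vec<Rc<LazyTimer>>,
    deleted: usize,
}

impl LazyHeap {
    fn insert(&mut self, timer: Rc<LazyTimer>) {
        // Opportunistically drop marked timers sitting at the top of the heap.
        while let Some(&Reverse((_, idx))) = self.heap.peek() {
            if self.slots[idx].deleted.get() {
                self.heap.pop();
                self.deleted -= 1;
            } else {
                break;
            }
        }
        let idx = self.slots.len();
        self.heap.push(Reverse((timer.when, idx)));
        self.slots.push(timer);
    }

    fn cancel(&mut self, timer: &Rc<LazyTimer>) {
        timer.deleted.set(true);
        self.deleted += 1;
        // If too many (>= 1/4) marked timers stay buried in the heap, the
        // whole heap has to be rebuilt: the O(n log n) scan mentioned above.
    }
}
```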
How does pingora handle this case?
Pingora faced a similar issue, so it has its own timeout implementation.
- Per-thread `RwLock<BTreeMap<u128, Timer>>`.
- A dedicated thread periodically polls all per-thread `BTreeMap`s, fires expired timers, and removes canceled timers.
- All timers are rounded to the next 10ms.
- While registering a new timer, if there is already a timer in the same 10ms slot, the existing timer is cloned instead of inserting a new one into the `BTreeMap`.
- Canceling a timer does nothing; its memory can only be released after the dedicated thread reaches that timestamp.
Pingora's solution handles socket timeouts well; we can borrow some ideas from it during the implementation (see the sketch below).
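A hypothetical sketch of that scheme (the real pingora types and names differ):

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};

struct Timer {
    fired: AtomicBool, // a real implementation would wake a task instead
}

type ThreadTimers = Arc<RwLock<BTreeMap<u128, Arc<Timer>>>>;

const RESOLUTION_MS: u128 = 10;

// Round the deadline up to the next 10ms slot.
fn round_up(deadline_ms: u128) -> u128 {
    (deadline_ms + RESOLUTION_MS - 1) / RESOLUTION_MS * RESOLUTION_MS
}

// Called by the thread that owns `timers`.
fn register(timers: &ThreadTimers, deadline_ms: u128) -> Arc<Timer> {
    let key = round_up(deadline_ms);
    let mut map = timers.write().unwrap();
    // Reuse (clone) an existing timer that lands in the same 10ms slot
    // instead of inserting a new entry into the BTreeMap.
    map.entry(key)
        .or_insert_with(|| Arc::new(Timer { fired: AtomicBool::new(false) }))
        .clone()
}

// Run by the dedicated checker thread for every per-thread map.
fn fire_expired(timers: &ThreadTimers, now_ms: u128) {
    let mut map = timers.write().unwrap();
    // Everything at or before `now_ms` is expired; keep the rest.
    let later = map.split_off(&(now_ms + 1));
    let expired = std::mem::replace(&mut *map, later);
    for (_, timer) in expired {
        timer.fired.store(true, Ordering::Release);
    }
}
```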
Proposed design
The initial idea came from @carllerche.
What does the new `TimerEntry` look like?
```rust
struct TimerEntry {
    // -- omitted --
    deadline: Instant,
    node: Option<NonNull<TimerShared>>, // stored into the intrusive list
    // -- omitted --
}
```
Where is the local wheel stored?
It is stored in the `Core` that also stores the `queue::Local`.
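Roughly, the per-worker state could look like this (field names are assumptions for illustration, not the actual tokio layout):

```rust
use std::sync::mpsc::Receiver;

struct Core {
    // run_queue: queue::Local<...>,   // the existing per-worker run queue
    /// Per-worker timer wheel; only the thread holding this Core touches it.
    wheel: LocalWheel,
    /// Receives timers that were canceled from other threads.
    cancel_rx: Receiver<TimerHandle>,
}

struct LocalWheel { /* hierarchical wheel, omitted */ }
struct TimerHandle { /* pointer to the shared timer state, omitted */ }
```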
How to register a new timer?
- Retrieve the runtime handle using the thread-local `CONTEXT`.
  - Panic on error (such as no context).
- Retrieve the local worker `Context`.
  - Register the timer into the local wheel if the `Core` is here.
  - Otherwise, push it into the inject timer list; this list will be drained while maintaining the local core (see the sketch below).
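A self-contained sketch of this registration flow; `CONTEXT`, `Worker`, and `Core` below are simplified stand-ins rather than the real tokio internals:

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};

struct TimerHandle(u64); // would point at the shared timer state

#[derive(Default)]
struct Worker {
    core: RefCell<Option<Core>>,             // taken away while the core is stolen
    inject_timers: Mutex<Vec<TimerHandle>>,  // drained when maintaining the core
}

#[derive(Default)]
struct Core {
    wheel: Vec<TimerHandle>, // stand-in for the real local timer wheel
}

thread_local! {
    // Stand-in for tokio's thread-local CONTEXT.
    static CONTEXT: RefCell<Option<Arc<Worker>>> = RefCell::new(None);
}

fn register_timer(timer: TimerHandle) {
    CONTEXT.with(|cx| {
        // Panic if called outside a runtime, mirroring today's behaviour.
        let cx = cx.borrow();
        let worker = cx.as_ref().expect("no runtime context");
        match worker.core.borrow_mut().as_mut() {
            // The Core is on this thread: register into the local wheel.
            Some(core) => core.wheel.push(timer),
            // Otherwise push into the inject timer list; it is drained while
            // maintaining the local core.
            None => worker.inject_timers.lock().unwrap().push(timer),
        }
    });
}
```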
How to cancel a timer?
- Just `drop` the handle if this timer has never been registered.
- Otherwise, mark the timer as deleted in the `AtomicU64`.
- Send the timer using the `std::sync::mpsc::Sender`; the `Receiver` is owned by the `Core`, and the sender was stored in the `TimerEntry` when it was created.
- The core drains the receiver and removes the timers from the local wheel (see the sketch below).
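A self-contained sketch of this cancellation path; all types are placeholders, and the fields are trimmed down compared to the `TimerEntry` shown earlier:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{Receiver, Sender};
use std::sync::Arc;

const STATE_CANCELLED: u64 = u64::MAX;

struct TimerShared {
    state: AtomicU64, // packs the deadline / cancelled marker
}

struct TimerEntry {
    shared: Arc<TimerShared>,
    // Stored when the timer is created, so any thread can hand the entry back
    // to the worker that owns the local wheel.
    cancel_tx: Sender<Arc<TimerShared>>,
    registered: bool,
}

impl Drop for TimerEntry {
    fn drop(&mut self) {
        // Never registered: nothing to remove, just drop the handle.
        if !self.registered {
            return;
        }
        // Mark the timer as cancelled, then ship it to the owning worker.
        self.shared.state.store(STATE_CANCELLED, Ordering::Release);
        let _ = self.cancel_tx.send(self.shared.clone());
    }
}

// Run by the owning worker while maintaining its core.
fn drain_cancelled(rx: &Receiver<Arc<TimerShared>>, wheel: &mut Vec<Arc<TimerShared>>) {
    for cancelled in rx.try_iter() {
        wheel.retain(|t| !Arc::ptr_eq(t, &cancelled));
    }
}
```

Draining the receiver only while maintaining the core keeps the remote cancel path down to an atomic store plus a channel send, with no lock on the wheel itself.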
How to wake up a timer?
Timers registered in the current worker thread's local wheel will be polled while maintaining the local core, and an `AtomicWaker` is used for waking up the task.
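A minimal sketch of the poll/fire handshake, assuming `futures::task::AtomicWaker` as a stand-in for tokio's internal `AtomicWaker` and omitting the wheel itself:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

use futures::task::AtomicWaker;

struct SharedState {
    fired: AtomicBool,
    waker: AtomicWaker,
}

struct Sleep {
    shared: Arc<SharedState>,
}

impl Future for Sleep {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // The first poll would also register the timer into the local wheel
        // (omitted here); every poll stores the task's waker before checking.
        self.shared.waker.register(cx.waker());
        if self.shared.fired.load(Ordering::Acquire) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// Called by the worker while maintaining its core and polling the local wheel.
fn fire(shared: &SharedState) {
    shared.fired.store(true, Ordering::Release);
    shared.waker.wake();
}
```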
Overhead of the extra memory allocation
Compared to the current design, this design requires an extra memory allocation for the node that is stored in the intrusive list.
This should not be a big concern: modern allocators scale well across cores, so the extra allocation should be cheaper than the global lock contention.
FAQ
Will be added progressively as the discussion progresses.
Will the timer be delayed too much if the worker thread is blocked for a long time?
Yes, timers registered on the local wheel might be delayed too much in this case, but I think this is not a big issue; please check out the #7384 (comment) for a detailed explanation.
Does sending pending cancellations through `std::sync::mpsc` cause extra memory allocations?
Yes, an extra slot will be allocated to store the value in the channel. The good news is that the standard `mpsc` optimises this case; please check out the #7384 (comment) for a detailed explanation.