Description
Summary
Please check out #6504 for the initial context.
This design reduces lock contention in the time module, which improves scalability for async servers that read/write async sockets with a timeout.
- Use a local wheel for each worker thread instead of a global wheel.
- Do not register the timer into the wheel on creation.
- Register the timer into the local wheel on the first `poll`.
- Cancel a timer by removing it from the local wheel directly if the timer is canceled by the local thread.
- Cancel a timer by sending the intrusive list entry over `std::sync::mpsc` to the worker thread that owns it if the timer is canceled by a remote thread.
Current Implementation
You can skip it if you are already familiar with the codebase.
There is a global wheel protected by `std::sync::Mutex`, so we have to acquire the `Mutex` before registering and cancelling a timer.
Resetting a timer doesn't acquire the `Mutex` as it just updates an `AtomicU64`.
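For readers skimming the codebase, here is a heavily simplified sketch of that layout; the types and field names below are illustrative stand-ins, not the actual tokio definitions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct Wheel {
    // hierarchical timer wheel, details omitted
    entries: Vec<u64>,
}

struct TimeDriver {
    // Shared by every worker thread: register and cancel both take this lock.
    wheel: Mutex<Wheel>,
}

struct TimerState {
    // Resetting a timer only stores a new tick here, so it skips the Mutex.
    deadline_tick: AtomicU64,
}

impl TimeDriver {
    fn register(&self, timer: &TimerState, tick: u64) {
        timer.deadline_tick.store(tick, Ordering::Release);
        self.wheel.lock().unwrap().entries.push(tick); // the contention point
    }

    fn reset(&self, timer: &TimerState, new_tick: u64) {
        // No lock needed: the driver re-reads the atomic when it fires.
        timer.deadline_tick.store(new_tick, Ordering::Release);
    }
}
```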
What's the problem?
Worker threads can process their owned sockets concurrently, which scales well.
However, since timers are heavily used for socket timeouts in async servers/clients, reading or writing an async socket with a timeout causes significant lock contention on the global wheel.
What have we tried?
commit: time: use sharding for timer implementation
This commit splits the global wheel into several wheels (indexed by thread_id), each wrapped in a `Mutex` and stored in the `Runtime`:
- To calculate the earliest timer, it locks all wheels one by one while parking the driver.
- To register or cancel a timer, it locks the specific wheel and inserts or removes the entry.
Unfortunately, this was reverted in 1ae9434 because the overhead of locking multiple `Mutex`es in hot paths is much higher than the global lock contention.
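A rough illustration of why parking became expensive under that scheme (the names below are hypothetical, not the reverted code itself):

```rust
use std::sync::Mutex;

struct ShardedDriver {
    // One wheel per shard, indexed by thread id.
    shards: Vec<Mutex<Vec<u64>>>,
}

impl ShardedDriver {
    fn register(&self, thread_id: usize, tick: u64) {
        // Only the shard owning this timer is locked.
        self.shards[thread_id % self.shards.len()]
            .lock()
            .unwrap()
            .push(tick);
    }

    fn next_expiration(&self) -> Option<u64> {
        // Parking the driver has to lock every shard one by one.
        self.shards
            .iter()
            .filter_map(|shard| shard.lock().unwrap().iter().min().copied())
            .min()
    }
}
```

Even when each lock is uncontended, paying N lock/unlock pairs per park turned out to be more expensive in hot paths than one contended global lock.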
How does the kernel handle this case?
I'm not a kernel developer, please correct me if I'm wrong.
For non-hrtimers, all timers are registered into a per-CPU timer wheel, and the kernel also locks that per-CPU wheel when registering/canceling a timer.
This looks just like what we did in time: use sharding for timer implementation (already been reverted).
However, I think the big difference is that the kernel spinlocks the per-CPU wheel, which is much more efficient than a userspace `Mutex` since the kernel controls its own scheduler and interrupts.
With the benefit of spinlocks in kernel context, the lock contention should be much lower than with a userspace `Mutex`.
So we cannot adopt the kernel's design directly.
How does Golang handle this case?
I'm not a Golang expert, please correct me if I'm wrong.
Golang uses a per-GOPROC 4-ary heap to register all timers.
Golang marks the timer as deleted when canceling it, and marked timers are removed while registering a new timer if they are at the top of the heap.
Because removal from the per-GOPROC heap is deferred, registering a new timer doesn't need to acquire a lock.
The cost of this design is that if too many (>= 1/4) marked timers are not at the top of the heap, the GOPROC must be stopped to scan the entire 4-ary heap, which triggers an O(n log n) operation at an arbitrary time and could lead to task starvation.
This might not be acceptable in tokio.
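For illustration only, here is a rough Rust re-creation of the lazy-deletion idea described above (this is not Go's actual runtime code, and all names here are made up):

```rust
use std::cell::Cell;
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::rc::Rc;

struct LazyTimer {
    when: u64,
    deleted: Cell<bool>, // cancel only flips this flag
}

#[derive(Default)]
struct LazyHeap {
    heap: BinaryHeap<Reverse<(u64, usize)>>, // (deadline, slot index), min-heap
    slots: Vec<Rc<LazyTimer>>,
    deleted: usize,
}

impl LazyHeap {
    fn insert(&mut self, timer: Rc<LazyTimer>) {
        // Opportunistically drop marked timers sitting at the top of the heap.
        while let Some(&Reverse((_, idx))) = self.heap.peek() {
            if self.slots[idx].deleted.get() {
                self.heap.pop();
                self.deleted -= 1;
            } else {
                break;
            }
        }
        let idx = self.slots.len();
        self.heap.push(Reverse((timer.when, idx)));
        self.slots.push(timer);
    }

    fn cancel(&mut self, timer: &Rc<LazyTimer>) {
        timer.deleted.set(true);
        self.deleted += 1;
        // If too many (>= 1/4) marked timers stay buried in the heap, the
        // whole heap has to be rebuilt: the O(n log n) scan mentioned above.
    }
}
```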
How does pingora handle this case?
Pingora faced a similar issue, so it has its own timeout implementation.
- Per-thread `RwLock<BTreeMap<u128, Timer>>`.
- A dedicated thread periodically polls all per-thread `BTreeMap`s, fires expired timers, and removes canceled timers.
- All timers are rounded to the next 10ms.
- While registering a new timer, if there is already a timer in the same 10ms slot, the existing timer is cloned instead of inserting a new one into the `BTreeMap`.
- Canceling a timer does nothing; its memory can only be released after the dedicated thread reaches that timestamp.
Pingora's solution handles socket timeouts well; we can borrow some ideas from it during the implementation (see the sketch below).
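A hypothetical sketch of that scheme (the real pingora types and names differ):

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};

struct Timer {
    fired: AtomicBool, // a real implementation would wake a task instead
}

type ThreadTimers = Arc<RwLock<BTreeMap<u128, Arc<Timer>>>>;

const RESOLUTION_MS: u128 = 10;

// Round the deadline up to the next 10ms slot.
fn round_up(deadline_ms: u128) -> u128 {
    (deadline_ms + RESOLUTION_MS - 1) / RESOLUTION_MS * RESOLUTION_MS
}

// Called by the thread that owns `timers`.
fn register(timers: &ThreadTimers, deadline_ms: u128) -> Arc<Timer> {
    let key = round_up(deadline_ms);
    let mut map = timers.write().unwrap();
    // Reuse (clone) an existing timer that lands in the same 10ms slot
    // instead of inserting a new entry into the BTreeMap.
    map.entry(key)
        .or_insert_with(|| Arc::new(Timer { fired: AtomicBool::new(false) }))
        .clone()
}

// Run by the dedicated checker thread for every per-thread map.
fn fire_expired(timers: &ThreadTimers, now_ms: u128) {
    let mut map = timers.write().unwrap();
    // Everything at or before `now_ms` is expired; keep the rest.
    let later = map.split_off(&(now_ms + 1));
    let expired = std::mem::replace(&mut *map, later);
    for (_, timer) in expired {
        timer.fired.store(true, Ordering::Release);
    }
}
```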
Proposed design
The initial idea came from @carllerche.
What does the new `TimerEntry` look like?
```rust
struct TimerEntry {
    // -- omitted --
    deadline: Instant,
    node: Option<NonNull<TimerShared>>, // stored into the intrusive list
    // -- omitted --
}
```
Where is the local wheel stored?
It is stored in the `Core` that also stores the `queue::Local`.
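Roughly, the per-worker state could look like this (field names are assumptions for illustration, not the actual tokio layout):

```rust
use std::sync::mpsc::Receiver;

struct Core {
    // run_queue: queue::Local<...>,   // the existing per-worker run queue
    /// Per-worker timer wheel; only the thread holding this Core touches it.
    wheel: LocalWheel,
    /// Receives timers that were canceled from other threads.
    cancel_rx: Receiver<TimerHandle>,
}

struct LocalWheel { /* hierarchical wheel, omitted */ }
struct TimerHandle { /* pointer to the shared timer state, omitted */ }
```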
How to register a new timer?
- Retrieve the runtime handle using the thread-local `CONTEXT`.
  - Panic on error (such as no context).
- Retrieve the local worker `Context`.
  - Register the timer into the local wheel if the `Core` is here.
  - Otherwise, push it into the inject timer list; this list will be drained while maintaining the local core (see the sketch below).
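A self-contained sketch of this registration flow; `CONTEXT`, `Worker`, and `Core` below are simplified stand-ins rather than the real tokio internals:

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};

struct TimerHandle(u64); // would point at the shared timer state

#[derive(Default)]
struct Worker {
    core: RefCell<Option<Core>>,             // taken away while the core is stolen
    inject_timers: Mutex<Vec<TimerHandle>>,  // drained when maintaining the core
}

#[derive(Default)]
struct Core {
    wheel: Vec<TimerHandle>, // stand-in for the real local timer wheel
}

thread_local! {
    // Stand-in for tokio's thread-local CONTEXT.
    static CONTEXT: RefCell<Option<Arc<Worker>>> = RefCell::new(None);
}

fn register_timer(timer: TimerHandle) {
    CONTEXT.with(|cx| {
        // Panic if called outside a runtime, mirroring today's behaviour.
        let cx = cx.borrow();
        let worker = cx.as_ref().expect("no runtime context");
        match worker.core.borrow_mut().as_mut() {
            // The Core is on this thread: register into the local wheel.
            Some(core) => core.wheel.push(timer),
            // Otherwise push into the inject timer list; it is drained while
            // maintaining the local core.
            None => worker.inject_timers.lock().unwrap().push(timer),
        }
    });
}
```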
How to cancel a timer?
- Just `drop` the handle if this timer has never been registered.
- Otherwise, mark the timer as deleted in the `AtomicU64`.
- Send the timer using the `std::sync::mpsc::Sender`; the `Receiver` is owned by the `Core`, and the sender was stored in the `TimerEntry` when it was created.
- The core drains the receiver and removes the timers from the local wheel (see the sketch below).
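A self-contained sketch of this cancellation path; all types are placeholders, and the fields are trimmed down compared to the `TimerEntry` shown earlier:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{Receiver, Sender};
use std::sync::Arc;

const STATE_CANCELLED: u64 = u64::MAX;

struct TimerShared {
    state: AtomicU64, // packs the deadline / cancelled marker
}

struct TimerEntry {
    shared: Arc<TimerShared>,
    // Stored when the timer is created, so any thread can hand the entry back
    // to the worker that owns the local wheel.
    cancel_tx: Sender<Arc<TimerShared>>,
    registered: bool,
}

impl Drop for TimerEntry {
    fn drop(&mut self) {
        // Never registered: nothing to remove, just drop the handle.
        if !self.registered {
            return;
        }
        // Mark the timer as cancelled, then ship it to the owning worker.
        self.shared.state.store(STATE_CANCELLED, Ordering::Release);
        let _ = self.cancel_tx.send(self.shared.clone());
    }
}

// Run by the owning worker while maintaining its core.
fn drain_cancelled(rx: &Receiver<Arc<TimerShared>>, wheel: &mut Vec<Arc<TimerShared>>) {
    for cancelled in rx.try_iter() {
        wheel.retain(|t| !Arc::ptr_eq(t, &cancelled));
    }
}
```

Draining the receiver only while maintaining the core keeps the remote cancel path down to an atomic store plus a channel send, with no lock on the wheel itself.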
How to wake up a timer?
Timers registered in the current worker thread's local wheel will be polled while maintaining the local core, and an `AtomicWaker` is used for waking up the task.
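A minimal sketch of the poll/fire handshake, assuming `futures::task::AtomicWaker` as a stand-in for tokio's internal `AtomicWaker` and omitting the wheel itself:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

use futures::task::AtomicWaker;

struct SharedState {
    fired: AtomicBool,
    waker: AtomicWaker,
}

struct Sleep {
    shared: Arc<SharedState>,
}

impl Future for Sleep {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // The first poll would also register the timer into the local wheel
        // (omitted here); every poll stores the task's waker before checking.
        self.shared.waker.register(cx.waker());
        if self.shared.fired.load(Ordering::Acquire) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// Called by the worker while maintaining its core and polling the local wheel.
fn fire(shared: &SharedState) {
    shared.fired.store(true, Ordering::Release);
    shared.waker.wake();
}
```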
Overhead of the extra memory allocation
Compared to the current design, this design requires an extra memory allocation for the node that is stored in the intrusive list.
This should not be a big concern: modern allocators scale well across cores, so the extra allocation should be cheaper than the global lock contention.
FAQ
Will be added progressively as the discussion progresses.
Will the timer be delayed too much if the worker thread is blocked for a long time?
Yes, timers registered on the local wheel might be delayed too much in this case, but I think this is not a big issue; please check out the #7384 (comment) for a detailed explanation.
Does sending pending cancellations through `std::sync::mpsc` cause extra memory allocations?
Yes, an extra slot will be allocated to store the value in the channel. The good news is that the standard `mpsc` optimises this case; please check out the #7384 (comment) for a detailed explanation.