-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Bug: graceful shutdowns delayed due to duplicate sleeps. #813
Description
Describe the bug
Graceful shutdowns take roughly double the expected time per runtime. When graceful_shutdown_timeout_seconds is set to a value higher than the default 5s, this is very noticeable. The code calls thread::sleep(shutdown_timeout) runs after rt.shutdown_timeout(shutdown_timeout), doubling the length of this window, for no obvious reason.
Pingora info
Pingora version: 0.7.0 (current main, but the issue has existed since graceful_shutdown_timeout_seconds was made configurable in PR #99 )
Rust version: cargo 1.86.0
Operating system version: Arch Linux 6.18, also reproduced on Debian 12
Steps to reproduce
- Configure a Pingora server with
graceful_shutdown_timeout_seconds = 30andgrace_period_seconds = 5 - Start the server with
--daemonbehind systemd (Type=forking) - Send SIGTERM (
systemctl stop) - Observe that shutdown takes ~65 seconds (5 + 30 + 30) instead of the expected ~35 seconds (5 + 30)
The relevant code is in pingora-core/src/server/mod.rs:
let join = thread::spawn(move || {
rt.shutdown_timeout(shutdown_timeout); // blocks up to shutdown_timeout
thread::sleep(shutdown_timeout) // then sleeps for shutdown_timeout again
});
Runtime::shutdown_timeout() already blocks the calling thread for up to the full timeout duration.
The thread::sleep afterward just adds dead time.
Expected results
Shutdown should complete in grace_period_seconds + graceful_shutdown_timeout_seconds. With my config that's 35 seconds.
Observed results
Shutdown takes grace_period_seconds + graceful_shutdown_timeout_seconds × 2. With my config that's 65 seconds, which exceeds my systemd TimeoutStopSec and gets the process SIGKILL'd every time.
Additional context
I think this was fine in the original 0.1.0 code where the shutdown timeout was hardcoded to 5 seconds, but when PR #99 made it configurable, the sleep started using the configurable value too, which turns it into a noticeable problem for anyone setting a longer timeout.
I'm currently patching it out locally by just removing the thread::sleep line. Haven't had any issues with that, runtimes shut down cleanly without the extra wait. I am doing daily production changes with my code so if I do not run into any problems with this change I will PR it.