Skip to content

Bug: graceful shutdowns delayed due to duplicate sleeps. #813

@jaw-sh

Description

@jaw-sh

Describe the bug

Graceful shutdowns take roughly double the expected time per runtime. When graceful_shutdown_timeout_seconds is set to a value higher than the default 5s, this is very noticeable. The code calls thread::sleep(shutdown_timeout) runs after rt.shutdown_timeout(shutdown_timeout), doubling the length of this window, for no obvious reason.

Pingora info

Pingora version: 0.7.0 (current main, but the issue has existed since graceful_shutdown_timeout_seconds was made configurable in PR #99 )
Rust version: cargo 1.86.0
Operating system version: Arch Linux 6.18, also reproduced on Debian 12

Steps to reproduce

  1. Configure a Pingora server with graceful_shutdown_timeout_seconds = 30 and grace_period_seconds = 5
  2. Start the server with --daemon behind systemd (Type=forking)
  3. Send SIGTERM (systemctl stop)
  4. Observe that shutdown takes ~65 seconds (5 + 30 + 30) instead of the expected ~35 seconds (5 + 30)

The relevant code is in pingora-core/src/server/mod.rs:

let join = thread::spawn(move || {
    rt.shutdown_timeout(shutdown_timeout);  // blocks up to shutdown_timeout
    thread::sleep(shutdown_timeout)         // then sleeps for shutdown_timeout again
});

Runtime::shutdown_timeout() already blocks the calling thread for up to the full timeout duration.
The thread::sleep afterward just adds dead time.

Expected results

Shutdown should complete in grace_period_seconds + graceful_shutdown_timeout_seconds. With my config that's 35 seconds.

Observed results

Shutdown takes grace_period_seconds + graceful_shutdown_timeout_seconds × 2. With my config that's 65 seconds, which exceeds my systemd TimeoutStopSec and gets the process SIGKILL'd every time.

Additional context

I think this was fine in the original 0.1.0 code where the shutdown timeout was hardcoded to 5 seconds, but when PR #99 made it configurable, the sleep started using the configurable value too, which turns it into a noticeable problem for anyone setting a longer timeout.

I'm currently patching it out locally by just removing the thread::sleep line. Haven't had any issues with that, runtimes shut down cleanly without the extra wait. I am doing daily production changes with my code so if I do not run into any problems with this change I will PR it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingduplicateThis issue or pull request already exists

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions