
Graceful shutdown behavior is undocumented and untested under Kubernetes lifecycle #70

@avelino

Description


Problem

The proxy handles SIGTERM and SIGINT with graceful shutdown logic (via src/serve.rs):

  1. Stop accepting new connections
  2. Finish in-flight requests
  3. Drain all connected backends in parallel
  4. Each backend gets 5 seconds for graceful shutdown() before being force-killed via kill_on_drop
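The drain step can be sketched as follows (a minimal std-only sketch with hypothetical names; the real implementation in src/serve.rs is async and holds child handles spawned with kill_on_drop(true)):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a connected backend; the real proxy holds an
// async child-process handle spawned with kill_on_drop(true).
struct Backend;

impl Backend {
    // Hypothetical graceful shutdown: send the shutdown request and wait
    // for the backend to exit cleanly.
    fn shutdown(self) {}
}

// Drain all backends in parallel. Each drain races a shared deadline; a
// backend that misses it is simply dropped, and kill_on_drop force-kills
// the underlying child process. Returns the number of clean shutdowns.
fn drain_backends(backends: Vec<Backend>, per_backend: Duration) -> usize {
    let deadline = Instant::now() + per_backend;
    let receivers: Vec<_> = backends
        .into_iter()
        .map(|b| {
            let (tx, rx) = mpsc::channel();
            thread::spawn(move || {
                b.shutdown();
                let _ = tx.send(());
            });
            rx
        })
        .collect();
    // Because all drains start together and race one shared deadline,
    // total wall time is bounded by per_backend, not per_backend * N.
    receivers
        .iter()
        .filter(|rx| {
            let remaining = deadline.saturating_duration_since(Instant::now());
            rx.recv_timeout(remaining).is_ok()
        })
        .count()
}
```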

This works well on bare metal and Docker, but Kubernetes has specific shutdown semantics that introduce edge cases:

1. terminationGracePeriodSeconds vs backend count

Kubernetes sends SIGTERM, then waits terminationGracePeriodSeconds (default: 30s) before sending SIGKILL. The proxy shuts down backends in parallel with a 5s timeout each, but:

  • If multiple backends stall on shutdown, the parallel join still waits up to 5s total (not 5s × N) — this is fine
  • However, the proxy also has a 10s overall shutdown timeout (src/serve.rs:1333): if the full drain exceeds 10s, it logs "shutdown timed out — forcing exit" and calls process::exit(1)
  • The process::exit(1) bypasses any remaining Kubernetes lifecycle hooks and may not flush all logs

The interaction between the proxy's internal 10s timeout, Kubernetes' default 30s grace period, and any preStop hooks is undocumented and untested.
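The internal watchdog described above follows a pattern like this (a std-only sketch with hypothetical names, not the actual src/serve.rs code):

```rust
use std::process;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run the drain on a helper thread and force-exit if it overruns the
// overall budget (src/serve.rs uses 10s). Returns true if the drain
// finished in time; the timeout branch never returns.
fn shutdown_with_watchdog(drain: impl FnOnce() + Send + 'static, budget: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        drain();
        let _ = tx.send(());
    });
    if rx.recv_timeout(budget).is_ok() {
        true
    } else {
        eprintln!("shutdown timed out — forcing exit");
        // process::exit skips destructors and any remaining lifecycle
        // hooks, and buffered logs may never be flushed.
        process::exit(1);
    }
}
```

Note that because the watchdog's budget (10s) is well under Kubernetes' default grace period (30s), the proxy will always self-terminate via process::exit(1) before SIGKILL arrives, making the exit-code-1 path the common one even when Kubernetes would have allowed more time.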

2. preStop hook and connection draining

Kubernetes removes the pod from the Service endpoints asynchronously with respect to SIGTERM delivery. This means:

  • New requests can arrive at the pod after SIGTERM is sent
  • The proxy stops accepting connections immediately on SIGTERM, so these late requests get a connection refused error
  • The standard mitigation is a preStop hook with a small sleep to allow endpoint propagation

There's no guidance or default preStop configuration for this.
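A manifest fragment along these lines would be the usual mitigation (hypothetical, not shipped by the project; container name and image are placeholders):

```yaml
# Hypothetical pod spec fragment. preStop runs before SIGTERM is sent,
# so the sleep gives endpoint controllers time to stop routing new
# traffic here before the proxy begins refusing connections.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: mcp-proxy
      image: example/mcp-proxy:latest
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]
```

The sleep duration plus the proxy's internal 10s shutdown budget must stay under terminationGracePeriodSeconds, or SIGKILL will arrive mid-drain.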

3. Stdio backends and PID namespace

When running in Kubernetes, the proxy spawns stdio backends as child processes. If the container uses a shared PID namespace (shareProcessNamespace: true), or if the container runtime sends signals to all processes in the cgroup, stdio backends may receive SIGTERM before the proxy has a chance to shut them down gracefully.

The proxy's kill_on_drop guarantee assumes it controls the lifecycle of its children. External signal delivery to children can cause:

  • Backends exiting before the proxy sends shutdown()
  • Broken pipe errors on the proxy's stdin/stdout transport
  • Race conditions in the reaper task
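One possible mitigation (a hypothetical sketch, not what the proxy currently does) is to spawn stdio backends in their own process group, so a group-directed SIGTERM aimed at the proxy does not also reach the children. This does not help if the runtime kills the whole cgroup (e.g. cgroup.kill), but it does shield against process-group signals:

```rust
use std::io;
use std::process::{Child, Command, Stdio};

// Hypothetical mitigation (Unix only): place the stdio backend in its own
// process group so signals sent to the proxy's group do not reach the
// child directly. The proxy then remains the sole sender of shutdown
// signals, preserving the kill_on_drop lifecycle assumption.
#[cfg(unix)]
fn spawn_stdio_backend(program: &str) -> io::Result<Child> {
    use std::os::unix::process::CommandExt;
    Command::new(program)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        // 0 = new process group whose pgid equals the child's pid.
        .process_group(0)
        .spawn()
}
```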

4. No readiness gate for startup

The proxy binds and starts serving immediately, but backend discovery is async and lazy. On startup in Kubernetes:

  • The pod reports Ready (assuming a simple /health check) before any backends are connected
  • The first request triggers backend connection, adding latency
  • If backend connection fails, the first N requests get errors while the proxy reports healthy

There's no startup probe configuration or readiness gate that accounts for the lazy initialization model.
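Until such a gate exists, a deployment could at least separate liveness from readiness along these lines (hypothetical fragment: /health is assumed from the issue text, /ready is a not-yet-implemented endpoint that would report "at least one backend connected", and port 8080 is a placeholder):

```yaml
# Hypothetical probe configuration for the lazy-initialization model.
containers:
  - name: mcp-proxy
    livenessProbe:
      httpGet: { path: /health, port: 8080 }
      periodSeconds: 10
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }   # hypothetical endpoint
      periodSeconds: 5
    startupProbe:
      httpGet: { path: /ready, port: 8080 }   # hypothetical endpoint
      failureThreshold: 30
      periodSeconds: 2
```

With a startupProbe in place, the kubelet withholds traffic until backends are actually reachable, instead of letting the first N requests fail while the pod reports healthy.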


Expected behavior

The proxy's shutdown behavior should be documented and tested under Kubernetes semantics: SIGTERM + grace period + preStop hooks + async endpoint removal. Any edge cases (late requests, PID namespace, startup readiness) should be explicitly addressed.

Metadata


Assignees

No one assigned

Labels

  • documentation: Docs improvements or additions
  • enhancement: New feature or improvement
  • infrastructure: Docker, Kubernetes, deployment
  • kubernetes: Kubernetes manifests, Helm, and cluster deployment
  • proxy: Serve/proxy mode (mcp serve)
