fix(certgen): trigger rolling restart of Rate Limit on cert rotation #8535

Open
OliverBailey wants to merge 2 commits into envoyproxy:main from OliverBailey:fix/ratelimit-ca-restart

Conversation

@OliverBailey
Contributor

Summary

⚠️ Depends on #8534 — that PR must be merged first. The diff here includes those changes; the net-new change in this PR is solely in internal/cmd/certgen.go.

Completes the fix for #4891.

Problem

Rate Limit loads its CA certificate once at startup and does not watch the mounted Secret volume for changes. After certgen --overwrite rotates certificates:

  1. The kubelet updates the /certs volume on disk for the Rate Limit pod.
  2. The running Rate Limit process continues verifying incoming client certs against the old CA in memory.
  3. Any Envoy pod that has already reloaded its new leaf cert via SDS is subsequently rejected by Rate Limit, causing mTLS failures.

Unlike Envoy (which uses SDS path-based reload) or Envoy Gateway (which uses GetConfigForClient to re-read certs per-connection), Rate Limit has no equivalent hot-reload path for its CA.

This is the root cause of the incident described in #4891: the failure was observed after a weekend rotation — pods had been running long enough that no natural restart had occurred since the previous cert write.

Fix

After writing rotated Secrets (--overwrite), patch the Rate Limit Deployment's pod-template annotation with the current timestamp. This is identical to what kubectl rollout restart does:

patched.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = metav1.Now().UTC().Format(time.RFC3339)
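
As a rough sketch of what reaches the API server, the annotation write above corresponds to a merge patch along the following lines. The restartPatch helper is hypothetical and uses only the standard library; the PR itself may build the patch differently (for example through a typed client).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

const restartedAtAnnotation = "kubectl.kubernetes.io/restartedAt"

// restartPatch builds a JSON merge patch that sets the restartedAt annotation
// on a Deployment's pod template. Any change to the pod template causes the
// Deployment controller to start a new rollout.
func restartPatch(timestamp string) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"metadata": map[string]any{
					"annotations": map[string]string{
						restartedAtAnnotation: timestamp,
					},
				},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := restartPatch(time.Now().UTC().Format(time.RFC3339))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Because the annotation value changes on every rotation, each certgen --overwrite run produces a distinct pod template and therefore a fresh rollout.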

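Kubernetes performs a rolling replacement of Rate Limit pods using the Deployment's existing RollingUpdate strategy. With the default maxUnavailable of 25%, this keeps at least 75% of the desired replicas available throughout the rollout. Note that a PodDisruptionBudget does not constrain the rollout: PDBs apply only to evictions made through the Eviction API, so availability during the restart is governed by the strategy's maxUnavailable and maxSurge settings.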
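
To make the availability bound concrete, here is a small sketch of how percentage-based rollout limits resolve against a replica count. It assumes the Kubernetes defaults of 25% for both maxUnavailable and maxSurge; the rolloutBounds helper is illustrative and does not model the controller's special case where both values resolve to zero.

```go
package main

import "fmt"

// rolloutBounds resolves percentage-based maxUnavailable and maxSurge against
// a replica count: maxUnavailable rounds down, maxSurge rounds up. It returns
// the minimum number of available pods and the maximum total number of pods
// that can exist during the rollout.
func rolloutBounds(replicas, maxUnavailablePct, maxSurgePct int) (minAvailable, maxTotal int) {
	maxUnavailable := replicas * maxUnavailablePct / 100 // integer division rounds down
	maxSurge := (replicas*maxSurgePct + 99) / 100        // rounds up
	return replicas - maxUnavailable, replicas + maxSurge
}

func main() {
	for _, replicas := range []int{1, 2, 4, 10} {
		minAvailable, maxTotal := rolloutBounds(replicas, 25, 25)
		fmt.Printf("replicas=%d minAvailable=%d maxTotal=%d\n", replicas, minAvailable, maxTotal)
	}
}
```

With 4 replicas this yields a floor of 3 available pods, which is the 75% figure quoted above; with 1 or 2 replicas maxUnavailable rounds down to 0, so the rollout proceeds by surging a new pod first.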

The restart is gated on --overwrite so it does not fire on the initial install, where Rate Limit has just started with the correct certs. If the Rate Limit Deployment does not exist (Rate Limit not enabled), the function is a no-op.

Why this depends on #8534

During the rolling restart, old and new Rate Limit pods run concurrently for a brief overlap period. The CA bundle written by #8534 ([newCA, previousCA]) ensures Envoy can authenticate against both old and new Rate Limit pods throughout that window. Without the bundle, the rolling restart itself would cause a brief mTLS failure as new Rate Limit pods come up with new leaf certs before Envoy has reloaded the new CA.
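
The effect of the bundle during the overlap window can be reproduced in isolation. The sketch below generates two throwaway CAs and checks a leaf signed by the new CA against a trust store holding only the old CA, and then against one holding both. All names here are made up for the demonstration and none of this code comes from the PR.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority is a throwaway CA used only for this demonstration.
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// sign creates a certificate from tmpl signed with parentKey and parses it back.
func sign(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, parentKey *ecdsa.PrivateKey) *x509.Certificate {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

// template returns a certificate template valid for the next 24 hours.
func template(name string, serial int64) *x509.Certificate {
	return &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
}

// newAuthority creates a self-signed CA.
func newAuthority(name string) authority {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := template(name, 1)
	tmpl.IsCA = true
	tmpl.BasicConstraintsValid = true
	tmpl.KeyUsage = x509.KeyUsageCertSign
	return authority{cert: sign(tmpl, tmpl, &key.PublicKey, key), key: key}
}

// issueLeaf signs a leaf certificate with this CA.
func (a authority) issueLeaf(name string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := template(name, 2)
	tmpl.KeyUsage = x509.KeyUsageDigitalSignature
	return sign(tmpl, a.cert, &key.PublicKey, a.key)
}

// trusted reports whether leaf chains to any of the given roots.
func trusted(leaf *x509.Certificate, roots ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, r := range roots {
		pool.AddCert(r)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	oldCA, nextCA := newAuthority("old-ca"), newAuthority("new-ca")
	leaf := nextCA.issueLeaf("rotated-leaf")
	fmt.Println("trusted with old CA only:", trusted(leaf, oldCA.cert))
	fmt.Println("trusted with bundle:", trusted(leaf, nextCA.cert, oldCA.cert))
}
```

The same bundle also accepts a leaf issued by the old CA, which is what lets peers on either side of the rotation authenticate each other while pods are being replaced.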

Note on single-replica deployments

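If Rate Limit has 1 replica, availability during the restart depends on the Deployment's strategy. With the Kubernetes defaults (maxSurge and maxUnavailable of 25%, which resolve to 1 and 0 for a single replica), the replacement pod is created and must become ready before the old pod is removed. With a Recreate strategy, or a RollingUpdate strategy that sets maxUnavailable to 1, there will be a brief gap between the old pod terminating and the new pod becoming ready. This is a pre-existing limitation of single-replica deployments and is not introduced by this change. A PodDisruptionBudget does not affect Deployment rollouts, so users with availability requirements should configure at least 2 replicas and a RollingUpdate strategy whose maxUnavailable leaves at least one pod running.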

@OliverBailey OliverBailey requested a review from a team as a code owner March 16, 2026 23:22
@netlify

netlify bot commented Mar 16, 2026

Deploy Preview for cerulean-figolla-1f9435 canceled.

Name Link
🔨 Latest commit c014347
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/69c05d4ecca83400089a2b5f

@OliverBailey OliverBailey force-pushed the fix/ratelimit-ca-restart branch 2 times, most recently from a0fe117 to 1cff50f Compare March 20, 2026 23:12
… disruption

When certgen --overwrite rotates certificates, the ca.crt field of each
control-plane Secret was replaced atomically with the new CA. Kubernetes
propagates Secret updates to pods via the kubelet volume sync loop, and
Envoy reloads its xDS TLS context via SDS: neither is instantaneous.
During the convergence window, pods that have picked up a new leaf cert
(signed by the new CA) are rejected by peers that still hold only the old
CA in their trust store, causing mTLS authentication failures.

This is the backwards-incompatible rotation problem described in envoyproxy#4891
and reproduced on v1.6.1 by users in that thread.

Fix: when updating an existing Secret that already contains a ca.crt,
bundle the outgoing CA together with the incoming CA so that every
component trusts both during the transition. Concretely,
CreateOrUpdateSecrets now calls bundleCACerts(newCA, oldCA) which:

  1. Starts the bundle with all certs from newCA (the freshly generated CA).
  2. Appends the first non-expired, non-duplicate cert from oldCA (the CA
     that was active at the previous rotation).
  3. Skips any further certs from oldCA.

The cap of one carry-over cert keeps the bundle at a maximum of two
entries regardless of how frequently rotations occur. The reasoning is:
by the time an operator runs certgen --overwrite a second time, all
components (kubelet sync period + SDS reload) will have converged on the
certs written during the first rotation. The CA from two rotations ago is
therefore never needed, and carrying it forward indefinitely would cause
unbounded bundle growth for long-lived CAs (e.g. the default 5-year
lifetime). The single carry-over is dropped automatically at the rotation
after it would have been needed.
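
The bundling rule described above can be sketched as follows. This is an illustrative reimplementation from the commit message, not the code in the PR: the real bundleCACerts may differ in signature and error handling, and parseCerts, contains, selfSignedCA, and countCerts are helpers invented for this example.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// parseCerts decodes every CERTIFICATE block in a PEM bundle, skipping
// blocks that fail to parse.
func parseCerts(pemBytes []byte) []*x509.Certificate {
	var certs []*x509.Certificate
	for {
		var block *pem.Block
		block, pemBytes = pem.Decode(pemBytes)
		if block == nil {
			return certs
		}
		if block.Type != "CERTIFICATE" {
			continue
		}
		if c, err := x509.ParseCertificate(block.Bytes); err == nil {
			certs = append(certs, c)
		}
	}
}

// contains reports whether c is already present in certs.
func contains(certs []*x509.Certificate, c *x509.Certificate) bool {
	for _, existing := range certs {
		if existing.Equal(c) {
			return true
		}
	}
	return false
}

// bundleCACerts returns all certs from newCA followed by at most one
// non-expired cert from oldCA that is not already in the bundle.
func bundleCACerts(newCA, oldCA []byte) []byte {
	now := time.Now()
	bundle := parseCerts(newCA)
	for _, old := range parseCerts(oldCA) {
		if now.After(old.NotAfter) || contains(bundle, old) {
			continue
		}
		bundle = append(bundle, old) // carry over exactly one previous CA
		break
	}
	var out bytes.Buffer
	for _, c := range bundle {
		pem.Encode(&out, &pem.Block{Type: "CERTIFICATE", Bytes: c.Raw})
	}
	return out.Bytes()
}

// selfSignedCA creates a throwaway CA in PEM form that expires validHours
// from now; a negative value produces an already expired certificate.
func selfSignedCA(name string, validHours int) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	notAfter := time.Now().Add(time.Duration(validHours) * time.Hour)
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             notAfter.Add(-48 * time.Hour),
		NotAfter:              notAfter,
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// countCerts returns the number of certificates in a PEM bundle.
func countCerts(pemBytes []byte) int { return len(parseCerts(pemBytes)) }

func main() {
	a, b, c := selfSignedCA("ca-a", 24), selfSignedCA("ca-b", 24), selfSignedCA("ca-c", 24)
	first := bundleCACerts(b, a)      // bundle after first rotation: b, a
	second := bundleCACerts(c, first) // bundle after second rotation: c, b
	fmt.Println(countCerts(first), countCerts(second))
}
```

Running two rotations in a row shows the cap in action: the second bundle carries the CA from the first rotation and drops the one from before it.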

The HMAC secret (envoy-oidc-hmac) carries no ca.crt and is unaffected.

Fixes envoyproxy#4891 (partial — Rate Limit CA hot-reload addressed separately)

Signed-off-by: Oliver Bailey <github@obailey.co.uk>

Rate Limit loads its CA certificate once at startup and does not watch
the mounted Secret volume for changes. After certgen --overwrite rotates
certificates, the kubelet updates the /certs volume on disk, but the
running Rate Limit process continues verifying client certs against the
old CA in memory. Any Envoy pod that has already reloaded its new leaf
cert via SDS is subsequently rejected by Rate Limit, causing mTLS
failures that persist until Rate Limit is manually restarted.

This was the root cause of the incident described in envoyproxy#4891 where the
failure was observed after a weekend rotation: the pods had been running
long enough that the previous restart (which would have loaded the fresh
CA) was well in the past.

Fix: after writing the rotated Secrets, patch the Rate Limit Deployment's
pod-template annotation with the current timestamp. This is the same
mechanism used by kubectl rollout restart. Kubernetes will then perform
a rolling replacement of Rate Limit pods using the Deployment's existing
RollingUpdate strategy, which by default keeps at least 75% of replicas
available at all times. A PodDisruptionBudget does not constrain the
rollout; only the strategy's maxUnavailable and maxSurge settings do.

The restart is gated on --overwrite so it does not fire on the initial
install (where Rate Limit has just started with the correct certs). If
the Rate Limit Deployment does not exist (Rate Limit not enabled) the
function is a no-op.

Note: this fix depends on the CA bundling change introduced in
fix/ca-bundle-rotation. During the rolling restart, old and new Rate
Limit pods run concurrently for a brief period. The CA bundle (new CA +
previous CA) written by the prior fix ensures that Envoy can authenticate
against both the old and new Rate Limit pod throughout the overlap window.

Fixes envoyproxy#4891

Signed-off-by: Oliver Bailey <github@obailey.co.uk>
@OliverBailey OliverBailey force-pushed the fix/ratelimit-ca-restart branch from 1cff50f to c014347 Compare March 22, 2026 21:21