fix(certgen): trigger rolling restart of Rate Limit on cert rotation by OliverBailey · Pull Request #8535 · envoyproxy/gateway

OliverBailey · 2026-03-16T23:22:42Z

Summary

⚠️ Depends on #8534 — that PR must be merged first. The diff here includes those changes; the net-new change in this PR is solely in internal/cmd/certgen.go.

Completes the fix for #4891.

Problem

Rate Limit loads its CA certificate once at startup and does not watch the mounted Secret volume for changes. After certgen --overwrite rotates certificates:

1. The kubelet updates the /certs volume on disk for the Rate Limit pod.
2. The running Rate Limit process continues verifying incoming client certs against the old CA in memory.
3. Any Envoy pod that has already reloaded its new leaf cert via SDS is subsequently rejected by Rate Limit, causing mTLS failures.

Unlike Envoy (which uses SDS path-based reload) or Envoy Gateway (which uses GetConfigForClient to re-read certs per-connection), Rate Limit has no equivalent hot-reload path for its CA.
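For contrast with the load-once behaviour described above, here is a minimal sketch of a per-connection CA reload built on Go's standard tls.Config.GetConfigForClient hook. It is not Envoy Gateway's actual code: the function names loadCAPool and reloadingServerConfig are hypothetical, and the path /certs/ca.crt is an assumption combining the /certs mount and the ca.crt key mentioned in this PR.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// loadCAPool reads a PEM bundle from path and returns it as a cert pool.
func loadCAPool(path string) (*x509.CertPool, error) {
	pemBytes, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("no certificates found in %s", path)
	}
	return pool, nil
}

// reloadingServerConfig returns a tls.Config whose client-CA pool is
// re-read from caPath on every incoming handshake, so a rotated CA file
// takes effect without restarting the process.
func reloadingServerConfig(caPath string, serverCert tls.Certificate) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		GetConfigForClient: func(*tls.ClientHelloInfo) (*tls.Config, error) {
			pool, err := loadCAPool(caPath)
			if err != nil {
				return nil, err
			}
			return &tls.Config{
				MinVersion:   tls.VersionTLS12,
				Certificates: []tls.Certificate{serverCert},
				ClientAuth:   tls.RequireAndVerifyClientCert,
				ClientCAs:    pool,
			}, nil
		},
	}
}

func main() {
	// Assumed path; adjust to wherever the Secret is actually mounted.
	cfg := reloadingServerConfig("/certs/ca.crt", tls.Certificate{})
	if _, err := cfg.GetConfigForClient(nil); err != nil {
		fmt.Println("CA not readable in this environment:", err)
	}
}
```

A process that instead builds its CertPool once at startup keeps trusting only the CA it read at that moment, which is the behaviour this PR works around with a restart.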

This is the root cause of the incident described in #4891: the failure was observed after a weekend rotation — pods had been running long enough that no natural restart had occurred since the previous cert write.

Fix

After writing rotated Secrets (--overwrite), patch the Rate Limit Deployment's pod-template annotation with the current timestamp. This is identical to what kubectl rollout restart does:

patched.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = metav1.Now().UTC().Format(time.RFC3339)

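Kubernetes then performs a rolling replacement of Rate Limit pods using the Deployment's existing RollingUpdate strategy. With the default maxUnavailable of 25%, this keeps at least 75% of replicas available at all times. Note that a PodDisruptionBudget applies to evictions and does not limit a Deployment rollout; availability during the restart is governed by the strategy's maxUnavailable and maxSurge settings.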
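The availability bound can be worked through with a small calculation. The sketch below assumes the Kubernetes defaults of maxUnavailable: 25% (rounded down) and maxSurge: 25% (rounded up); the helper name rollingBounds is hypothetical, and the strategy actually set on the Rate Limit Deployment is not shown in this PR.

```go
package main

import "fmt"

// rollingBounds returns the minimum number of available pods and the
// maximum total number of pods during a RollingUpdate, given integer
// percentages for maxUnavailable and maxSurge.
func rollingBounds(replicas, maxUnavailablePct, maxSurgePct int) (minAvailable, maxTotal int) {
	unavailable := replicas * maxUnavailablePct / 100 // percentage rounds down
	surge := (replicas*maxSurgePct + 99) / 100        // percentage rounds up
	return replicas - unavailable, replicas + surge
}

func main() {
	for _, n := range []int{1, 2, 4, 10} {
		minAvail, maxTotal := rollingBounds(n, 25, 25)
		fmt.Printf("replicas=%d minAvailable=%d maxTotal=%d\n", n, minAvail, maxTotal)
	}
}
```

With these defaults, 4 replicas give a floor of 3 available pods (75%), and 10 replicas give a floor of 8. A Deployment that overrides maxUnavailable or maxSurge will see different bounds.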

The restart is gated on --overwrite so it does not fire on the initial install, where Rate Limit has just started with the correct certs. If the Rate Limit Deployment does not exist (Rate Limit not enabled), the function is a no-op.
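To make the shape of the change concrete, here is a standard-library-only sketch of a merge-patch body that sets just the restart annotation on the pod template. It is an illustration, not the code in internal/cmd/certgen.go: the helper name restartPatch is hypothetical, and how the PR actually submits the patch (client, patch type, error handling) is not shown in this description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// restartedAtAnnotation is the annotation key quoted in the PR description.
const restartedAtAnnotation = "kubectl.kubernetes.io/restartedAt"

// restartPatch builds a JSON merge-patch body that touches only the
// pod-template annotation. Changing the pod template is what makes the
// Deployment controller roll the pods.
func restartPatch(timestamp string) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"metadata": map[string]any{
					"annotations": map[string]string{
						restartedAtAnnotation: timestamp,
					},
				},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := restartPatch(time.Now().UTC().Format(time.RFC3339))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Sending a patch rather than assigning into a fetched object also sidesteps one pitfall of the one-line form above: if the pod template has no annotations yet, the map is nil and must be initialised before the assignment.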

Why this depends on #8534

During the rolling restart, old and new Rate Limit pods run concurrently for a brief overlap period. The CA bundle written by #8534 ([newCA, previousCA]) ensures Envoy can authenticate against both old and new Rate Limit pods throughout that window. Without the bundle, the rolling restart itself would cause a brief mTLS failure as new Rate Limit pods come up with new leaf certs before Envoy has reloaded the new CA.

Note on single-replica deployments

If Rate Limit has 1 replica and the Deployment's strategy permits that replica to be unavailable (for example maxUnavailable: 1 with maxSurge: 0), there will be a brief gap between the old pod terminating and the new pod becoming ready. This is a pre-existing limitation of single-replica deployments and is not introduced by this change. Users with availability requirements should configure at least 2 replicas and keep maxUnavailable below the replica count; a PodDisruptionBudget does not constrain a rollout.

netlify · 2026-03-16T23:22:49Z

✅ Deploy Preview for cerulean-figolla-1f9435 canceled.

Name	Link
🔨 Latest commit	`c014347`
🔍 Latest deploy log	https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/69c05d4ecca83400089a2b5f

… disruption

When certgen --overwrite rotates certificates, the ca.crt field of each control-plane Secret was replaced atomically with the new CA. Kubernetes propagates Secret updates to pods via the kubelet volume sync loop, and Envoy reloads its xDS TLS context via SDS: neither is instantaneous. During the convergence window, pods that have picked up a new leaf cert (signed by the new CA) are rejected by peers that still hold only the old CA in their trust store, causing mTLS authentication failures. This is the backwards-incompatible rotation problem described in envoyproxy#4891 and reproduced on v1.6.1 by users in that thread.

Fix: when updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition. Concretely, CreateOrUpdateSecrets now calls bundleCACerts(newCA, oldCA) which:

1. Starts the bundle with all certs from newCA (the freshly generated CA).
2. Appends the first non-expired, non-duplicate cert from oldCA (the CA that was active at the previous rotation).
3. Skips any further certs from oldCA.

The cap of one carry-over cert keeps the bundle at a maximum of two entries regardless of how frequently rotations occur. The reasoning is: by the time an operator runs certgen --overwrite a second time, all components (kubelet sync period + SDS reload) will have converged on the certs written during the first rotation. The CA from two rotations ago is therefore never needed, and carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs (e.g. the default 5-year lifetime). The single carry-over is dropped automatically at the rotation after it would have been needed.

The HMAC secret (envoy-oidc-hmac) carries no ca.crt and is unaffected.

Fixes envoyproxy#4891 (partial; Rate Limit CA hot-reload addressed separately)

Signed-off-by: Oliver Bailey <github@obailey.co.uk>
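The three bundling steps in the commit message above can be sketched as follows. This is an illustrative reconstruction from that description only, not the bundleCACerts implementation in #8534: the names bundleCACertsSketch, parseCerts, containsCert and selfSignedCA are hypothetical, and the real function's signature and error handling may differ.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// parseCerts decodes every CERTIFICATE block in a PEM bundle,
// skipping blocks that are not certificates or fail to parse.
func parseCerts(bundle []byte) []*x509.Certificate {
	var certs []*x509.Certificate
	for {
		block, rest := pem.Decode(bundle)
		if block == nil {
			return certs
		}
		bundle = rest
		if block.Type != "CERTIFICATE" {
			continue
		}
		if c, err := x509.ParseCertificate(block.Bytes); err == nil {
			certs = append(certs, c)
		}
	}
}

// containsCert reports whether target is byte-for-byte present in certs.
func containsCert(certs []*x509.Certificate, target *x509.Certificate) bool {
	for _, c := range certs {
		if c.Equal(target) {
			return true
		}
	}
	return false
}

// bundleCACertsSketch keeps all of newCA, then appends at most one cert
// from oldCA: the first that is neither expired nor already in newCA.
func bundleCACertsSketch(newCA, oldCA []byte) []byte {
	fresh := parseCerts(newCA)
	out := append([]byte{}, bytes.TrimSpace(newCA)...)
	out = append(out, '\n')
	now := time.Now()
	for _, old := range parseCerts(oldCA) {
		if now.After(old.NotAfter) || containsCert(fresh, old) {
			continue
		}
		block := &pem.Block{Type: "CERTIFICATE", Bytes: old.Raw}
		return append(out, pem.EncodeToMemory(block)...) // cap: one carry-over
	}
	return out
}

// selfSignedCA generates a throwaway CA certificate for the demo below.
func selfSignedCA(name string, expired bool) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	notAfter := time.Now().Add(24 * time.Hour)
	if expired {
		notAfter = time.Now().Add(-time.Hour)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-48 * time.Hour),
		NotAfter:              notAfter,
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

func main() {
	first := selfSignedCA("ca-1", false)
	second := selfSignedCA("ca-2", false)
	third := selfSignedCA("ca-3", false)
	afterSecond := bundleCACertsSketch(second, first)     // [ca-2, ca-1]
	afterThird := bundleCACertsSketch(third, afterSecond) // [ca-3, ca-2]
	for _, c := range parseCerts(afterThird) {
		fmt.Println(c.Subject.CommonName)
	}
}
```

Running the demo prints ca-3 followed by ca-2: after two successive rotations the CA from two rotations ago (ca-1) has dropped out, which is the bounded-growth property the commit message argues for.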

Rate Limit loads its CA certificate once at startup and does not watch the mounted Secret volume for changes. After certgen --overwrite rotates certificates, the kubelet updates the /certs volume on disk, but the running Rate Limit process continues verifying client certs against the old CA in memory. Any Envoy pod that has already reloaded its new leaf cert via SDS is subsequently rejected by Rate Limit, causing mTLS failures that persist until Rate Limit is manually restarted.

This was the root cause of the incident described in envoyproxy#4891 where the failure was observed after a weekend rotation: the pods had been running long enough that the previous restart (which would have loaded the fresh CA) was well in the past.

Fix: after writing the rotated Secrets, patch the Rate Limit Deployment's pod-template annotation with the current timestamp. This is the same mechanism used by kubectl rollout restart. Kubernetes will then perform a rolling replacement of Rate Limit pods using the Deployment's existing RollingUpdate strategy, which by default keeps at least 75% of replicas available at all times. A PodDisruptionBudget does not limit a Deployment rollout; availability during the restart is governed by the strategy's maxUnavailable and maxSurge settings.

The restart is gated on --overwrite so it does not fire on the initial install (where Rate Limit has just started with the correct certs). If the Rate Limit Deployment does not exist (Rate Limit not enabled) the function is a no-op.

Note: this fix depends on the CA bundling change introduced in fix/ca-bundle-rotation. During the rolling restart, old and new Rate Limit pods run concurrently for a brief period. The CA bundle (new CA + previous CA) written by the prior fix ensures that Envoy can authenticate against both the old and new Rate Limit pods throughout the overlap window.

Fixes envoyproxy#4891

Signed-off-by: Oliver Bailey <github@obailey.co.uk>

OliverBailey requested a review from a team as a code owner March 16, 2026 23:22

OliverBailey force-pushed the fix/ratelimit-ca-restart branch 2 times, most recently from a0fe117 to 1cff50f Compare March 20, 2026 23:12

OliverBailey added 2 commits March 22, 2026 21:21

OliverBailey force-pushed the fix/ratelimit-ca-restart branch from 1cff50f to c014347 Compare March 22, 2026 21:21

OliverBailey wants to merge 2 commits into envoyproxy:main from OliverBailey:fix/ratelimit-ca-restart
