fix(certgen): bundle previous CA during cert rotation to prevent mTLS disruption #8534
OliverBailey wants to merge 2 commits into envoyproxy:main
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8534      +/-   ##
==========================================
+ Coverage   74.14%   74.15%   +0.01%
==========================================
  Files         242      242
  Lines       37749    37784      +35
==========================================
+ Hits        27989    28020      +31
- Misses       7806     7808       +2
- Partials     1954     1956       +2
… disruption

When certgen --overwrite rotates certificates, the ca.crt field of each control-plane Secret was replaced atomically with the new CA. Kubernetes propagates Secret updates to pods via the kubelet volume sync loop, and Envoy reloads its xDS TLS context via SDS; neither is instantaneous. During the convergence window, pods that have picked up a new leaf cert (signed by the new CA) are rejected by peers that still hold only the old CA in their trust store, causing mTLS authentication failures. This is the backwards-incompatible rotation problem described in envoyproxy#4891 and reproduced on v1.6.1 by users in that thread.

Fix: when updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition. Concretely, CreateOrUpdateSecrets now calls bundleCACerts(newCA, oldCA), which:

1. Starts the bundle with all certs from newCA (the freshly generated CA).
2. Appends the first non-expired, non-duplicate cert from oldCA (the CA that was active at the previous rotation).
3. Skips any further certs from oldCA.

The cap of one carry-over cert keeps the bundle at a maximum of two entries regardless of how frequently rotations occur. The reasoning is: by the time an operator runs certgen --overwrite a second time, all components (kubelet sync period + SDS reload) will have converged on the certs written during the first rotation. The CA from two rotations ago is therefore never needed, and carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs (e.g. the default 5-year lifetime). The single carry-over is dropped automatically at the rotation after it would have been needed.

The HMAC secret (envoy-oidc-hmac) carries no ca.crt and is unaffected.

Fixes envoyproxy#4891 (partial; Rate Limit CA hot-reload addressed separately)

Signed-off-by: Oliver Bailey <github@obailey.co.uk>
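The three steps in the commit message can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: the []byte PEM signature, the use of time.Now() for the expiry check, and the helpers parseCerts, appendPEM, containsCert, newCA and subjectNames are all assumptions made for the example.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// parseCerts decodes every CERTIFICATE block in a PEM bundle and skips
// anything that is not a parseable certificate (hypothetical helper).
func parseCerts(data []byte) []*x509.Certificate {
	var certs []*x509.Certificate
	for {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			return certs
		}
		if block.Type != "CERTIFICATE" {
			continue
		}
		if c, err := x509.ParseCertificate(block.Bytes); err == nil {
			certs = append(certs, c)
		}
	}
}

// appendPEM writes one certificate to the bundle in PEM form.
func appendPEM(buf *bytes.Buffer, c *x509.Certificate) {
	_ = pem.Encode(buf, &pem.Block{Type: "CERTIFICATE", Bytes: c.Raw})
}

// containsCert reports whether target is already present in certs.
func containsCert(certs []*x509.Certificate, target *x509.Certificate) bool {
	for _, c := range certs {
		if c.Equal(target) {
			return true
		}
	}
	return false
}

// bundleCACerts sketches the described behaviour (signature assumed):
// all certs from newCA, then at most one carried-over cert from oldCA.
func bundleCACerts(newCA, oldCA []byte) []byte {
	var out bytes.Buffer
	fresh := parseCerts(newCA)
	for _, c := range fresh { // step 1: every cert from the new CA
		appendPEM(&out, c)
	}
	now := time.Now()
	for _, old := range parseCerts(oldCA) {
		if now.After(old.NotAfter) || containsCert(fresh, old) {
			continue // expired or duplicate: not eligible for carry-over
		}
		appendPEM(&out, old) // step 2: first eligible old cert
		break                // step 3: skip any further old certs
	}
	return out.Bytes()
}

// newCA creates a throwaway self-signed CA for the demo; a negative
// validHours yields an already expired certificate (hypothetical helper).
func newCA(cn string, validHours int) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-48 * time.Hour),
		NotAfter:              time.Now().Add(time.Duration(validHours) * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// subjectNames lists the common names in a bundle, in order.
func subjectNames(bundle []byte) []string {
	var names []string
	for _, c := range parseCerts(bundle) {
		names = append(names, c.Subject.CommonName)
	}
	return names
}

func main() {
	ca1, ca2, ca3 := newCA("ca-1", 24), newCA("ca-2", 24), newCA("ca-3", 24)
	first := bundleCACerts(ca2, ca1)    // first rotation
	second := bundleCACerts(ca3, first) // second rotation reads the stored bundle
	fmt.Println(subjectNames(first))    // [ca-2 ca-1]
	fmt.Println(subjectNames(second))   // [ca-3 ca-2]
}
```

In this sketch the second rotation reads the two-entry bundle written by the first, carries over only ca-2, and drops ca-1, which is how the bundle stays capped at two entries.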
Thanks for the questions; good ones @arkodg

Does rotation impact existing connections?

No. TLS/mTLS verification only happens at the handshake. An established connection that completed its handshake before rotation doesn't get re-verified and won't be disrupted. For the Envoy ↔ Envoy Gateway xDS gRPC stream this is a single long-lived connection per Envoy pod, so rotation alone won't break anything in flight.

During the rotation window, won't the slower peer be unable to verify a newer cert?

You're right. The bundle is a targeted improvement, not a complete solution. What What it doesn't cover: if the already-updated pod is presenting its new leaf cert (signed by The mitigating factors are:
A fully race-free approach would require a two-phase rotation: push only the updated |
Summary
Fixes #4891 (partial — Rate Limit CA hot-reload addressed in a follow-up PR)
Problem
When certgen --overwrite rotates certificates, ca.crt in each control-plane Secret is replaced atomically with the new CA. Two propagation mechanisms are at play after that write:

- the kubelet volume sync loop, which propagates the Secret update to pods;
- SDS, through which Envoy reloads its xDS TLS context.

Neither is synchronous. During the convergence window a pod that has picked up a new leaf cert (signed by the new CA) is rejected by a peer that still holds only the old CA in its trust store, causing mTLS authentication failures. This is precisely the incident reproduced on v1.6.1 and described in the issue thread.
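The failure mode can be reproduced in isolation with Go's crypto/x509 package. The sketch below is illustrative only and is not taken from this PR: the authority type and the newAuthority, issueLeaf and trusts helpers are hypothetical names, and the trust stores are modelled as in-memory cert pools rather than mounted Secrets.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority is a throwaway CA: its certificate plus the signing key.
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// mustCert signs tmpl with the given parent and key and parses the result.
func mustCert(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, signer *ecdsa.PrivateKey) *x509.Certificate {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, signer)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

// newAuthority creates a self-signed CA with the given common name.
func newAuthority(cn string) authority {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	return authority{cert: mustCert(tmpl, tmpl, &key.PublicKey, key), key: key}
}

// issueLeaf signs a leaf certificate with this CA.
func (a authority) issueLeaf(cn string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	return mustCert(tmpl, a.cert, &key.PublicKey, a.key)
}

// trusts reports whether leaf verifies against a trust store holding roots.
func trusts(leaf *x509.Certificate, roots ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, r := range roots {
		pool.AddCert(r)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	oldCA, newCA := newAuthority("old-ca"), newAuthority("new-ca")
	oldLeaf, newLeaf := oldCA.issueLeaf("peer-a"), newCA.issueLeaf("peer-b")

	// Peer that has not synced yet: trust store holds only the old CA.
	fmt.Println(trusts(newLeaf, oldCA.cert)) // false
	// Atomic replacement: updated peer holds only the new CA.
	fmt.Println(trusts(oldLeaf, newCA.cert)) // false
	// Bundled trust store: updated peer accepts both generations.
	fmt.Println(trusts(oldLeaf, newCA.cert, oldCA.cert)) // true
	fmt.Println(trusts(newLeaf, newCA.cert, oldCA.cert)) // true
}
```

The first result is the rejection described above; the last two show why a trust store containing both CAs lets an updated peer accept leaf certs from either generation.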
Fix
When updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition.

bundleCACerts(newCA, oldCA):

1. Starts the bundle with all certs from newCA (the freshly generated CA).
2. Appends the first non-expired, non-duplicate cert from oldCA, the CA active at the previous rotation.
3. Skips any further certs from oldCA (break).

Why a maximum of two CAs
Carrying forward only one previous CA keeps the bundle at a maximum of two entries regardless of rotation frequency.

By the time an operator runs certgen --overwrite a second time, all components will have converged on the certs written during the first rotation (kubelet sync + SDS reload happen within seconds to a minute). The CA from two rotations ago is therefore never needed in practice. Carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs; the default lifetime is 5 years. The single carry-over is naturally dropped at the rotation after it would have been needed.

Scope
- The envoy-oidc-hmac Secret carries no ca.crt and is unaffected.
- Rate Limit CA hot-reload is addressed in a follow-up PR (fix/ratelimit-ca-restart → this branch). Rate Limit does not watch its CA file for changes; that PR triggers a rolling restart of the Rate Limit Deployment after rotation.

Testing
Added TestCreateOrUpdateSecretsBundlesCA and TestBundleCACerts covering: