From c0e06060424583689628b501ecb5d8f993fc95ea Mon Sep 17 00:00:00 2001
From: Maxi Wittich
Date: Tue, 5 Aug 2025 09:45:41 +0200
Subject: [PATCH] backporting docs

---
 .../pages/operations/pod_disruptions.adoc | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc
index c9bbf561a..7d1da11ac 100644
--- a/modules/concepts/pages/operations/pod_disruptions.adoc
+++ b/modules/concepts/pages/operations/pod_disruptions.adoc
@@ -117,3 +117,37 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow
 == Details
 Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.
+
+== Known issue with PDBs and certificate rotations
+PDBs together with certificate rotations can be problematic, for example when {commons-operator}[commons-operator] was unavailable and could not restart the Pods before their certificates expired.
+commons-operator uses the `evict` API in Kubernetes, which respects the PDB.
+If evicting a Pod would violate a PDB, the Pod is *not* restarted.
+Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and a PDB that only allows a single Pod to be unavailable.
+As soon as enough certificates of the ZookeeperCluster have expired, all Pods crash-loop, as they encounter expired certificates.
+Because only the container crash-loops (not the entire Pod), no new certificate is issued.
+Once commons-operator comes online again, it tries to `evict` a ZooKeeper Pod.
+However, this is prohibited, as the eviction would violate the PDB.
+
+NOTE: We have encountered this problem only in the specific case outlined above and only under these circumstances.
+
+=== Workaround
+If you encounter this situation, only manually deleting the affected Pods can resolve it.
+A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted this way.
+All restarted Pods get a new certificate, and the stacklet should become healthy again.
+
+=== Restore working state
+Delete the Pods, e.g. with `kubectl`:
+[source, bash]
+----
+kubectl delete pod -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk
+pod "simple-zk-server-default-0" deleted
+pod "simple-zk-server-default-1" deleted
+pod "simple-zk-server-default-2" deleted
+----
+
+=== Preventing this situation
+The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.
+
+A hacky way to prevent this situation is to disable PDBs for the specific stacklet.
+However, this has the downside that you lose the benefits of the PDB.
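+
+As a sketch of the latter, PDBs can be switched off per role in the stacklet's `roleConfig` (shown here for the `servers` role of the ZookeeperCluster above; the role name depends on the product):
+
+[source, yaml]
+----
+spec:
+  servers:
+    roleConfig:
+      podDisruptionBudget:
+        # Disabling the PDB lets evictions proceed, at the cost of losing disruption protection
+        enabled: false
+----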