From 657842055759026bb3dcfe40c6b187efe1949639 Mon Sep 17 00:00:00 2001 From: Maxi Wittich Date: Mon, 4 Aug 2025 17:21:46 +0200 Subject: [PATCH 1/5] Add documentation about PDBs --- .../pages/operations/pod_disruptions.adoc | 33 +++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc index c9bbf561a..d6c075c57 100644 --- a/modules/concepts/pages/operations/pod_disruptions.adoc +++ b/modules/concepts/pages/operations/pod_disruptions.adoc @@ -1,5 +1,6 @@ = Allowed Pod disruptions :k8s-pdb: https://kubernetes.io/docs/tasks/run-application/configure-pdb/ +:commons-operator: xref:commons-operator:index.adoc :description: Configure PodDisruptionBudgets (PDBs) to minimize planned downtime for Stackable products. Default values are based on fault tolerance and can be customized. Any downtime of our products is generally considered to be bad. @@ -117,3 +118,35 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow == Details Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details. + +== Known issues with PDBs and e.g. certificate rotations +PDBs together with certificate rotations can be problematic in case e.g. {commons-operator}[commons-operator] was unavailable until certificates expire. Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs a quorum and a PDB of `1`. If commons-operator comes back, it will rotate all expired certificates and thus `evict` pods. PDBs count exclusively for `eviction` and as the certificates are expired this pod will be stuck in `CrashLoopBackOff` forever unable to connect to the quorum, blocking PDB for the other pods. Other pods are unable to restart and rotate certificates. + +NOTE: We encountered this problem only with the specific outlined case above and only under these circumstances.
+ +=== Workaround +If you encounter this only manually deleting those pods can help out of this situation as it is no `eviction`. + +==== k9s +If you are using `k9s` you can start it in your terminal +[source, bash] +---- +k9s +---- +and type `0` to view all namespaces and then type e.g. `/zookeeper` and hit enter. Go with up and down to your pod and press `CTL + D` and confirm to delete the pod. Repeat with all other instances of the stuck product. + +==== kubectl +List your pods with +[source, bash] +---- +kubectl get pods -A +---- +and copy the name of the pod you want to delete. Type +[source, bash] +---- +kubectl delete pod zookeeper-server-default-0 +---- +to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product. + +=== Preventing this Situation entirely +Only disabling PDBs can guarantee to not run into this scenario. \ No newline at end of file From 56646d676a61eaf2e6d75337017f835ac595cc7e Mon Sep 17 00:00:00 2001 From: Maxi Wittich Date: Mon, 4 Aug 2025 17:40:03 +0200 Subject: [PATCH 2/5] Fix precommit --- modules/concepts/pages/operations/pod_disruptions.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc index d6c075c57..04469522d 100644 --- a/modules/concepts/pages/operations/pod_disruptions.adoc +++ b/modules/concepts/pages/operations/pod_disruptions.adoc @@ -149,4 +149,4 @@ kubectl delete pod zookeeper-server-default-0 ---- to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product. === Preventing this Situation entirely -Only disabling PDBs can guarantee to not run into this scenario. \ No newline at end of file +Only disabling PDBs can guarantee to not run into this scenario.
From 5912569daf70439a995c47be4eee365b16db044e Mon Sep 17 00:00:00 2001 From: Maximilian Wittich <56642549+Maleware@users.noreply.github.com> Date: Tue, 5 Aug 2025 09:42:08 +0200 Subject: [PATCH 3/5] Requested Changes Co-authored-by: Sebastian Bernauer --- .../pages/operations/pod_disruptions.adoc | 22 ++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc index 04469522d..dbd706bce 100644 --- a/modules/concepts/pages/operations/pod_disruptions.adoc +++ b/modules/concepts/pages/operations/pod_disruptions.adoc @@ -119,13 +119,22 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow == Details Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details. -== Known issues with PDBs and e.g. certificate rotations -PDBs together with certificate rotations can be problematic in case e.g. {commons-operator}[commons-operator] was unavailable until certificates expire. Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs a quorum and a PDB of `1`. If commons-operator comes back, it will rotate all expired certificates and thus `evict` pods. PDBs count exclusively for `eviction` and as the certificates are expired this pod will be stuck in `CrashLoopBackOff` forever unable to connect to the quorum, blocking PDB for the other pods. Other pods are unable to restart and rotate certificates. +== Known issue with PDBs and certificate rotations +PDBs together with certificate rotations can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before the certificates expire. +commons-operator uses the `evict` API in Kubernetes, which respects the PDB. +If evicting a Pod would violate a PDB, the eviction is denied and the Pod is *not* restarted.
+Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs to form a quorum to function and the PDB only allows a single Pod to be unavailable. +As soon as enough certificates of the ZookeeperCluster have expired, all Pods will crash-loop, as they encounter expired certificates. +As only the container crash-loops (not the entire Pod), no new certificate is issued. +As soon as commons-operator comes online again, it tries to `evict` a Zookeeper Pod. +However, this is prohibited, as the PDB would be violated. NOTE: We encountered this problem only with the specific outlined case above and only under these circumstances. === Workaround -If you encounter this only manually deleting those pods can help out of this situation as it is no `eviction`. +If you encounter this only manually deleting those pods can help out of this situation. +A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway. +All restarted Pods will get a new certificate, and the stacklet should turn healthy again. ==== k9s If you are using `k9s` you can start it in your terminal [source, bash] ---- k9s ---- and type `0` to view all namespaces and then type e.g. `/zookeeper` and hit enter. Go with up and down to your pod and press `CTL + D` and confirm to delete the pod. Repeat with all other instances of the stuck product. ==== kubectl List your pods with [source, bash] ---- kubectl get pods -A ---- and copy the name of the pod you want to delete. Type [source, bash] ---- kubectl delete pod zookeeper-server-default-0 ---- to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product. -=== Preventing this Situation entirely -Only disabling PDBs can guarantee to not run into this scenario. +=== Preventing this situation +The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire. + +A hacky way to prevent this situation could be to disable PDBs for the specific stacklet. +But this also has the downside that you are now missing the benefits of the PDB.
From b81d587e16d38e5c258678aa1ff393b6a3dad256 Mon Sep 17 00:00:00 2001 From: Maxi Wittich Date: Tue, 5 Aug 2025 09:45:41 +0200 Subject: [PATCH 4/5] Only kubectl delete command rather than options --- .../pages/operations/pod_disruptions.adoc | 22 +++++-------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc index dbd706bce..923f055b0 100644 --- a/modules/concepts/pages/operations/pod_disruptions.adoc +++ b/modules/concepts/pages/operations/pod_disruptions.adoc @@ -136,26 +136,16 @@ If you encounter this only manually deleting those pods can help out of this sit A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway. All restarted Pods will get a new certificate, and the stacklet should turn healthy again. -==== k9s -If you are using `k9s` you can start it in your terminal +=== Restore working state +Delete the pods with e.g. `kubectl`. [source, bash] ---- -k9s +kubectl delete pod -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk +pod "simple-zk-server-default-0" deleted +pod "simple-zk-server-default-1" deleted +pod "simple-zk-server-default-2" deleted ---- -and type `0` to view all namespaces and then type e.g. `/zookeeper` and hit enter. Go with up and down to your pod and press `CTL + D` and confirm to delete the pod. Repeat with all other instances of the stuck product. -==== kubectl -List your pods with -[source, bash] ----- -kubectl get pods -A ----- -and copy the name of the pod you want to delete. Type -[source, bash] ----- -kubectl delete pod zookeeper-server-default-0 ----- -to delete the instance with the name `zookeeper-server-default-0`. Repeat it for all instances of your product. === Preventing this situation The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.
From 6e776c302222b3f55ff0ca75aedb9e557692a22e Mon Sep 17 00:00:00 2001 From: Maximilian Wittich <56642549+Maleware@users.noreply.github.com> Date: Tue, 5 Aug 2025 10:59:31 +0200 Subject: [PATCH 5/5] Apply formatting changes Co-authored-by: Sebastian Bernauer --- modules/concepts/pages/operations/pod_disruptions.adoc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/modules/concepts/pages/operations/pod_disruptions.adoc b/modules/concepts/pages/operations/pod_disruptions.adoc index 923f055b0..657986692 100644 --- a/modules/concepts/pages/operations/pod_disruptions.adoc +++ b/modules/concepts/pages/operations/pod_disruptions.adoc @@ -123,6 +123,7 @@ Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR PDBs together with certificate rotations can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before the certificates expire. commons-operator uses the `evict` API in Kubernetes, which respects the PDB. If evicting a Pod would violate a PDB, the eviction is denied and the Pod is *not* restarted. + Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper] which needs to form a quorum to function and the PDB only allows a single Pod to be unavailable. As soon as enough certificates of the ZookeeperCluster have expired, all Pods will crash-loop, as they encounter expired certificates. As only the container crash-loops (not the entire Pod), no new certificate is issued. As soon as commons-operator comes online again, it tries to `evict` a Zookeeper Pod. However, this is prohibited, as the PDB would be violated. NOTE: We encountered this problem only with the specific outlined case above and only under these circumstances. === Workaround -If you encounter this only manually deleting those pods can help out of this situation. +If you encounter this, only manually deleting those pods can help you out of this situation. A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway.
All restarted Pods will get a new certificate, and the stacklet should turn healthy again.
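
For readers less familiar with PDB semantics, the object in play throughout these patches is a plain Kubernetes PodDisruptionBudget. Below is a minimal sketch of one that permits a single unavailable Pod; the name is hypothetical and the label selector simply reuses the labels from the `kubectl delete` example in the patch above, so neither necessarily matches what the Stackable operators actually generate:

[source, yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-zk-server  # hypothetical name
spec:
  # At most one selected Pod may be unavailable due to voluntary
  # disruptions. Only evictions (the `evict` API, `kubectl drain`)
  # consult this budget; `kubectl delete pod` bypasses it.
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: zookeeper
      app.kubernetes.io/instance: simple-zk
----

This is why the documented workaround relies on `kubectl delete pod`: an eviction that would push the selected Pods below this budget is denied, while a deletion proceeds regardless.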