
Commit 3f9aaf7

Maleware and sbernauer authored
Add known issue on PDBs and certificate rotation (#769)
* Add documentation about PDBs
* Fix precommit
* Requested Changes

Co-authored-by: Sebastian Bernauer <[email protected]>

* Only k delte command rather then options
* Apply formatting changes

Co-authored-by: Sebastian Bernauer <[email protected]>

---------

Co-authored-by: Sebastian Bernauer <[email protected]>
1 parent b112848 commit 3f9aaf7

1 file changed: +36 -0 lines changed
modules/concepts/pages/operations/pod_disruptions.adoc

Lines changed: 36 additions & 0 deletions
@@ -1,5 +1,6 @@
= Allowed Pod disruptions
:k8s-pdb: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
:commons-operator: xref:commons-operator:index.adoc
:description: Configure PodDisruptionBudgets (PDBs) to minimize planned downtime for Stackable products. Default values are based on fault tolerance and can be customized.

Any downtime of our products is generally considered to be bad.
@@ -117,3 +118,38 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow

== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

== Known issue with PDBs and certificate rotations
PDBs together with certificate rotations can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before their certificates expired.
commons-operator uses the `evict` API in Kubernetes, which respects the PDB.
If evicting a Pod would violate a PDB, the eviction is rejected and the Pod is *not* restarted.

Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and a PDB that allows only a single Pod to be unavailable.
As soon as enough certificates of the ZookeeperCluster have expired, all Pods will crash-loop, as they encounter expired certificates.
As only the container crash-loops (not the entire Pod), no new certificate is issued.
As soon as commons-operator comes online again, it tries to `evict` a ZooKeeper Pod.
However, this is prohibited: the crash-looping Pods already count as unavailable, so evicting one would violate the PDB.
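
Whether an eviction would currently be rejected can be read from the PDB status. The following is a minimal sketch and not part of the original documentation: it assumes `kubectl` access to the cluster, and the helper name `blocked_pdbs` and its namespace argument are hypothetical. It prints every PDB in a namespace whose `status.disruptionsAllowed` is 0, meaning any eviction of a covered Pod would be refused right now.

```shell
# Hypothetical helper: list PDBs in a namespace that currently allow no disruptions.
# Usage: blocked_pdbs [namespace]   (namespace defaults to "default")
blocked_pdbs() {
  # Print "<name><TAB><disruptionsAllowed>" per PDB, then keep the names with a value of 0.
  kubectl get pdb -n "${1:-default}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    awk -F'\t' '$2 == "0" { print $1 }'
}
```

In the ZooKeeper scenario described above, the PDB covering the crash-looping Pods would be expected to show up in this list.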

NOTE: We have encountered this problem only in the specific case outlined above and only under these circumstances.

=== Workaround
If you encounter this, the only way out is to delete the affected Pods manually.
A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway.
All restarted Pods will get a new certificate, and the stacklet should turn healthy again.

=== Restore working state
Delete the Pods, e.g. with `kubectl`.
[source, bash]
----
kubectl delete pod -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk
pod "simple-zk-server-default-0" deleted
pod "simple-zk-server-default-1" deleted
pod "simple-zk-server-default-2" deleted
----

=== Preventing this situation
The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.
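
One way to notice early that commons-operator is not running is to check its Deployment periodically. This is a sketch under assumptions, not an official procedure: the Deployment name `commons-operator-deployment` and the namespace are guesses that depend on how the operator was installed, and `operator_available` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the operator Deployment has at least one available replica.
# Usage: operator_available [deployment] [namespace]
# Both defaults are assumptions; adjust them to your installation.
operator_available() {
  available="$(kubectl get deployment "${1:-commons-operator-deployment}" \
    -n "${2:-default}" -o jsonpath='{.status.availableReplicas}' 2>/dev/null)"
  # An empty result (Deployment missing or no available replicas) counts as 0.
  [ "${available:-0}" -ge 1 ]
}
```

A monitoring job could then run something like `operator_available || echo "commons-operator is not available"` and alert on the message.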

A hacky way to prevent this situation could be to disable PDBs for the specific stacklet.
However, this has the downside that you lose the benefits of the PDB.
