
Commit 8f7cc3b

Revert "Add known issue on PDBs and certificate rotation (backport)"
This reverts commit a48569b.
1 parent a48569b commit 8f7cc3b

1 file changed: +0 -34 lines changed

modules/concepts/pages/operations/pod_disruptions.adoc

Lines changed: 0 additions & 34 deletions
@@ -117,37 +117,3 @@ This PDB allows only one Pod out of all the Namenodes and Journalnodes to be dow

== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

== Known issue with PDBs and certificate rotations
PDBs combined with certificate rotation can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before their certificates expired.
commons-operator uses the `evict` API in Kubernetes, which respects the PDB.
If an eviction would violate a PDB, the eviction is denied and the Pod is *not* restarted.
Assume a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and a PDB that allows only a single Pod to be unavailable.
As soon as enough certificates of the ZookeeperCluster have expired, all Pods crash-loop, as they encounter expired certificates.
Because only the container crash-loops (not the entire Pod), no new certificate is issued.
As soon as commons-operator comes online again, it tries to `evict` a ZooKeeper Pod.
However, this is prohibited because it would violate the PDB.
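
For illustration, here is a minimal sketch of such an eviction request, issued by hand against the eviction subresource (the same subresource `kubectl drain` uses). The namespace `default` and the Pod name are assumptions taken from the example stacklet below; if the eviction would violate the PDB, the API server rejects it with HTTP 429.

[source, bash]
----
# Ask the API server to evict one ZooKeeper Pod via the eviction subresource
kubectl create --raw /api/v1/namespaces/default/pods/simple-zk-server-default-0/eviction -f - <<EOF
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "simple-zk-server-default-0",
    "namespace": "default"
  }
}
EOF
# In the situation described above this fails with:
# "Cannot evict pod as it would violate the pod's disruption budget."
----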

NOTE: We encountered this problem only in the specific case outlined above and only under these circumstances.

=== Workaround
If you encounter this situation, only manually deleting the affected Pods can get you out of it.
A Pod deletion (unlike an eviction) does *not* respect PDBs, so the Pods can be restarted anyway.
All restarted Pods get a new certificate and the stacklet should become healthy again.

=== Restore working state
Delete the Pods with e.g. `kubectl`.

[source, bash]
----
kubectl delete pod -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk
pod "simple-zk-server-default-0" deleted
pod "simple-zk-server-default-1" deleted
pod "simple-zk-server-default-2" deleted
----
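
Afterwards the Pods are recreated and receive fresh certificates. As a quick check (assuming the same labels as in the command above), you can watch the Pods come back and the PDB recover:

[source, bash]
----
# Watch the recreated Pods until they are Running and Ready again
kubectl get pods -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk --watch

# The ALLOWED DISRUPTIONS column should rise above 0 again once the stacklet is healthy
kubectl get pdb
----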

=== Preventing this situation
The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.

A hacky way to prevent this situation is to disable PDBs for the specific stacklet, as sketched below.
But this also has the downside that you lose the benefits of the PDB.
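
As a rough sketch of what this could look like: the snippet below assumes the ZookeeperCluster CRD exposes a `roleConfig.podDisruptionBudget.enabled` switch on the `servers` role; verify the exact field names against your operator's CRD reference before using it.

[source, bash]
----
# Hypothetical sketch: turn off the operator-managed PDB for the simple-zk stacklet.
# The field path below is an assumption; check your operator's CRD reference.
kubectl patch zookeepercluster simple-zk --type merge \
  -p '{"spec":{"servers":{"roleConfig":{"podDisruptionBudget":{"enabled":false}}}}}'
----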
