= Allowed Pod disruptions
:k8s-pdb: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
:commons-operator: xref:commons-operator:index.adoc
:description: Configure PodDisruptionBudgets (PDBs) to minimize planned downtime for Stackable products. Default values are based on fault tolerance and can be customized.

Any downtime of our products is generally considered to be bad.
This PDB allows only one Pod out of all the Namenodes and Journalnodes to be down.

== Details
Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.

== Known issue with PDBs and certificate rotations
PDBs together with certificate rotations can be problematic if, for example, {commons-operator}[commons-operator] was unavailable to restart the Pods before the certificates expired.
commons-operator uses the Kubernetes `evict` API, which respects PDBs.
If evicting a Pod would violate a PDB, the Pod is *not* restarted.

Consider a product like xref:zookeeper:index.adoc[Apache ZooKeeper], which needs to form a quorum to function, and assume the PDB only allows a single Pod to be unavailable.
As soon as enough certificates of the ZookeeperCluster have expired, all Pods crash-loop because they encounter expired certificates.
Because only the container crash-loops (not the entire Pod), no new certificate is issued.
Once commons-operator comes back online, it tries to `evict` a ZooKeeper Pod.
However, this is prohibited, as the PDB would be violated.
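
Such a restrictive PDB looks roughly like the following. This is a sketch, not operator output: the resource name and labels are assumptions based on the `simple-zk` example used below and may differ in your cluster.

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-zk-server  # hypothetical name, matching the simple-zk example below
spec:
  maxUnavailable: 1  # only one matching Pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: zookeeper
      app.kubernetes.io/instance: simple-zk
----

With every Pod already crash-looping, `maxUnavailable: 1` is exhausted, so the eviction request is rejected.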

NOTE: We have encountered this problem only in the specific case outlined above, and only under these circumstances.

=== Workaround
If you encounter this situation, only manually deleting the affected Pods can get you out of it.
Unlike an eviction, a Pod deletion does *not* respect PDBs, so the Pods can be restarted anyway.
All restarted Pods get a new certificate, and the stacklet should become healthy again.

=== Restore working state
Delete the Pods, for example with `kubectl`:

[source,bash]
----
kubectl delete pod -l app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=simple-zk
pod "simple-zk-server-default-0" deleted
pod "simple-zk-server-default-1" deleted
pod "simple-zk-server-default-2" deleted
----

=== Preventing this situation
The best measure is to make sure that commons-operator is always running, so that it can restart the Pods before the certificates expire.

A hacky way to prevent this situation is to disable PDBs for the specific stacklet.
However, this has the downside that you then lose the benefits of the PDB.
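
As a sketch, disabling the operator-managed PDB for the ZooKeeper example above could look like this; the stacklet name is illustrative, and the `roleConfig.podDisruptionBudget` setting is assumed to be supported by your operator version:

[source,yaml]
----
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  servers:
    roleConfig:
      podDisruptionBudget:
        enabled: false  # the operator no longer creates a PDB for this role
----

With this in place, evictions of the `servers` Pods are never blocked by a PDB, at the cost of allowing unbounded voluntary disruptions.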