Description
What would you like to be added:
Add a webhook to prevent eviction of pods on kosmos NotReady nodes
Why is this needed:
When a kosmos node stays NotReady for more than 5 minutes, the node-controller of the controller-manager starts evicting its pods, which is equivalent to deleting them. This is not always appropriate: once the cluster reconnects, the pods end up being restarted.
The NotReady state of a kosmos node is more likely caused by a kosmos service outage or a cross-cluster network issue than by a physical node failure. We therefore need a mechanism that prevents the node-controller from deleting pods in this situation.
Since deletion is irreversible, one proposed solution is to intercept pod deletion requests issued by system:serviceaccount:kube-system:node-controller. A request should be intercepted only when certain conditions are met, such as utils.IsKosmosNode(node) && utils.IsNotReady(node) && v.needToPrevent(req.UserInfo.Username).
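The interception condition above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual kosmos implementation: the Node type is a simplified stand-in for the Kubernetes Node object, the kosmos.io/node label is a hypothetical way of marking kosmos nodes, and the helper names only mirror the ones mentioned in the proposal. A real webhook would be registered as a validating admission webhook for DELETE operations on pods and would look up the node the pod is scheduled on.

```go
package main

import "fmt"

// nodeControllerUser is the service account named in the proposal.
const nodeControllerUser = "system:serviceaccount:kube-system:node-controller"

// kosmosNodeLabel is a hypothetical label used here to mark kosmos nodes.
const kosmosNodeLabel = "kosmos.io/node"

// Node is a simplified stand-in for the Kubernetes Node object.
type Node struct {
	Name   string
	Labels map[string]string
	Ready  bool // simplified from the node's Ready condition
}

// isKosmosNode reports whether the node carries the (assumed) kosmos label.
func isKosmosNode(n Node) bool { return n.Labels[kosmosNodeLabel] == "true" }

// isNotReady reports whether the node is currently NotReady.
func isNotReady(n Node) bool { return !n.Ready }

// needToPrevent reports whether the requesting user is the node-controller.
func needToPrevent(username string) bool { return username == nodeControllerUser }

// shouldDenyPodDelete combines the three conditions from the proposal:
// the delete is rejected only when all of them hold.
func shouldDenyPodDelete(username string, n Node) bool {
	return isKosmosNode(n) && isNotReady(n) && needToPrevent(username)
}

func main() {
	kosmosDown := Node{
		Name:   "kosmos-member-1",
		Labels: map[string]string{kosmosNodeLabel: "true"},
		Ready:  false,
	}
	// Eviction by the node-controller on a NotReady kosmos node.
	fmt.Println("node-controller delete denied:", shouldDenyPodDelete(nodeControllerUser, kosmosDown))
	// The same delete issued by another user is not intercepted.
	fmt.Println("admin delete denied:", shouldDenyPodDelete("kubernetes-admin", kosmosDown))
}
```

Keeping the check scoped to the node-controller user means that manual deletions by operators, and evictions on ordinary nodes, still go through unchanged.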