Description
While using Helm to deploy the DHCP server (the dgxie service), we lost the Docker daemon on the master for an unknown reason. As a result, the kubelet could no longer start or keep running critical services such as the apiserver. A reboot restored both Docker and the kubelet.
However, some worker nodes lost their IPs because they could not renew their leases during the incident, which left us with an unhealthy Ceph cluster. Looking at the dgxie pod after the reboot, it was stuck in the ContainerCreating state because of the partial Ceph failure (its volume claim was stuck).
Everything finally recovered once we replaced the volume claims with empty volumes when creating the dgxie Helm service: from there the dgxie service could start, the nodes recovered their IPs, the Ceph cluster became healthy again, and any volume claim could be satisfied.
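To illustrate the workaround, here is a minimal Python sketch that swaps every PVC-backed volume in a pod spec for an `emptyDir`. It only manipulates a plain dictionary shaped like a Kubernetes pod spec; the function name `use_empty_dirs` and the volume, claim, image and ConfigMap names are hypothetical and not taken from the actual dgxie chart.

```python
import copy


def use_empty_dirs(pod_spec):
    """Return a copy of pod_spec where PVC-backed volumes become emptyDir volumes."""
    patched = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    for volume in patched.get("volumes", []):
        if "persistentVolumeClaim" in volume:
            # Drop the claim reference and use node-local scratch storage instead.
            del volume["persistentVolumeClaim"]
            volume["emptyDir"] = {}
    return patched


# Hypothetical pod spec: all names below are illustrative assumptions.
spec = {
    "containers": [{"name": "dgxie", "image": "example/dgxie"}],
    "volumes": [
        {"name": "dhcp-leases", "persistentVolumeClaim": {"claimName": "dgxie-leases"}},
        {"name": "config", "configMap": {"name": "dgxie-config"}},
    ],
}

patched = use_empty_dirs(spec)
```

Note that an `emptyDir` is deleted when the pod is removed from its node, so this removes the dependency on Ceph at the cost of persistence.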
So two things here:
- Deploying the dgxie service into the k8s cluster should provide an HA mechanism (DHCP is a critical service)
- We should work out what dgxie's true storage requirements are and ensure that its storage is resilient