Alert about spawning DHCP server into the cluster via helm (dgxie service) #41

While using helm to spawn the DHCP server (dgxie service), we lost the docker daemon on the master for an unknown reason. As a result, the kubelet could no longer start or keep running critical services such as the apiserver. A reboot brought docker and the kubelet back.
However, some worker nodes had lost their IPs (they could not renew their leases during the incident), which left us with an unhealthy ceph cluster. After the reboot, the dgxie pod was stuck in the ContainerCreating state because of the partial ceph failure (its volume claim was stuck).
Everything finally recovered once we replaced the volume claims with empty volumes when creating the dgxie helm service: the dgxie service could then start, the nodes recovered their IPs, the ceph cluster became healthy again, and volume claims could be satisfied again.
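To illustrate the workaround, here is a minimal sketch of the difference between the two volume definitions in a pod spec. It assumes that "empty volumes" means Kubernetes `emptyDir` volumes. The function name `dgxie_volume`, the volume name `dgxie-data`, and the claim name are hypothetical and do not come from the dgxie chart.

```python
import json


def dgxie_volume(name: str, use_pvc: bool, claim_name: str = "dgxie-data") -> dict:
    """Build one entry for the `volumes` list of a pod spec.

    Hypothetical helper: the names are placeholders, not the real chart values.
    """
    if use_pvc:
        # Depends on the storage backend (ceph here): the pod stays in
        # ContainerCreating until the claim is bound and mounted.
        return {"name": name, "persistentVolumeClaim": {"claimName": claim_name}}
    # emptyDir lives on the node's local disk, so the pod can start even
    # when the storage cluster is unhealthy. Data is lost when the pod is
    # removed from the node.
    return {"name": name, "emptyDir": {}}


if __name__ == "__main__":
    print(json.dumps(dgxie_volume("dgxie-data", use_pvc=True), indent=2))
    print(json.dumps(dgxie_volume("dgxie-data", use_pvc=False), indent=2))
```

The `emptyDir` variant removes the dependency of the DHCP server on ceph, at the cost of losing whatever dgxie stores in that volume whenever the pod is rescheduled.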

So two things here:
- Spawning the dgxie service into the k8s cluster should come with an HA mechanism (DHCP is a critical service)
- We should work out what the true storage requirements of dgxie are and make sure its storage is resilient


