This repo contains working code examples for our blog post on Distributed Deep Learning Training with Kubernetes.
Each of the two directories, nccl-tests/ and torchrun/, contains a Dockerfile and one or two Kubernetes manifests. Build the image from the Dockerfile and reference that image in the manifests.
In nccl-tests/, apply training.yaml first. Once all pods are ready,
apply launcher.yaml to trigger the tests.
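The ordering above can be scripted. The following is a minimal sketch, not part of this repo: it assumes kubectl is on your PATH and already points at the right cluster and namespace, and that it is run from inside nccl-tests/. The function names and the 300s timeout are illustrative choices. Note that `--all` waits on every pod in the current namespace, so in a shared namespace you would likely narrow it with a label selector.

```python
import subprocess


def nccl_test_commands(timeout="300s"):
    """Return the kubectl invocations for the nccl-tests/ workflow, in order.

    The timeout value is an assumption; pick one that suits your cluster.
    """
    return [
        # 1. Create the training pods.
        ["kubectl", "apply", "-f", "training.yaml"],
        # 2. Block until every pod in the current namespace reports Ready.
        ["kubectl", "wait", "--for=condition=Ready", "pod", "--all",
         f"--timeout={timeout}"],
        # 3. Only then start the launcher, which triggers the tests.
        ["kubectl", "apply", "-f", "launcher.yaml"],
    ]


def run_nccl_tests(timeout="300s"):
    """Run each step and stop at the first failure (check=True raises)."""
    for cmd in nccl_test_commands(timeout):
        subprocess.run(cmd, check=True)
```

Calling `run_nccl_tests()` executes the three steps in sequence. If the pods do not exist yet when the wait step starts, `kubectl wait` can fail with a "no matching resources" error, so a short retry may be needed in practice.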
You can read more in the blog!
In torchrun/, there is only one manifest since there is no launcher.
Note that the Dockerfile here expects you to provide your own training script.
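As a rough idea of what such a script has to handle, here is a minimal skeleton, assuming it is started by torchrun. It is not the script from the blog, and the file name and location the Dockerfile expects are not specified here. torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to each worker; the names `TorchrunEnv`, `read_torchrun_env` and `main` are hypothetical, and the fallback values are only there so the file also runs as a single process outside torchrun.

```python
import os
from dataclasses import dataclass


@dataclass
class TorchrunEnv:
    rank: int          # global rank of this worker
    local_rank: int    # rank of this worker on its node (usually the GPU index)
    world_size: int    # total number of workers
    master_addr: str   # rendezvous host
    master_port: int   # rendezvous port


def read_torchrun_env(env=os.environ):
    """Read the variables torchrun sets, with single-process fallbacks."""
    return TorchrunEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


def main(env=os.environ):
    cfg = read_torchrun_env(env)
    # A real training script would initialise the process group here,
    # e.g. torch.distributed.init_process_group(backend="nccl"),
    # select the GPU given by cfg.local_rank, and then run the training loop.
    print(f"worker {cfg.rank} of {cfg.world_size} "
          f"(local rank {cfg.local_rank}), "
          f"rendezvous at {cfg.master_addr}:{cfg.master_port}")
    return cfg


if __name__ == "__main__":
    main()
```

Keeping the environment parsing in its own function makes it easy to check the script locally before building the image.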