Distributed Training with Kubernetes

Code Examples

This repo contains working code examples for our blog post on Distributed Deep Learning Training with Kubernetes.

Each of the two directories, nccl-tests/ and torchrun/, contains a Dockerfile and one or two Kubernetes manifests. Build the image from the Dockerfile and reference that image in the manifests.
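As a rough sketch of the build step, the helper below builds an image from one of the example directories and pushes it to a registry. The image reference `registry.example.com/nccl-tests:latest` and the function name `build_and_push` are hypothetical placeholders, not names used by this repo. Substitute your own registry and tag, then set the same image reference in the manifests.

```shell
# Hypothetical image reference (assumption); replace with your own registry and tag.
IMAGE="${IMAGE:-registry.example.com/nccl-tests:latest}"
# Allows swapping in another container CLI; defaults to docker.
DOCKER="${DOCKER:-docker}"

build_and_push() {
  # $1 is the example directory that holds the Dockerfile (nccl-tests or torchrun).
  "$DOCKER" build -t "$IMAGE" "$1" || return 1
  # Push so that the cluster nodes can pull the image.
  "$DOCKER" push "$IMAGE"
}
```

For example, `build_and_push nccl-tests` run from the repository root builds and pushes the NCCL tests image. If your cluster can use locally built images, the push may be unnecessary.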

NCCL Tests

In nccl-tests/, apply training.yaml first. Once all pods are ready, apply launcher.yaml to trigger the tests. You can read more in the blog!
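The ordering described above could be scripted roughly as follows. This is a minimal sketch, assuming the commands run from the repository root, that the pods land in the current namespace, and that no unrelated pods live there. The function name `run_nccl_tests` and the 600-second timeout are illustrative choices, not part of the repo.

```shell
# Allows swapping in another client binary; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

run_nccl_tests() {
  # Step 1: create the resources defined in training.yaml.
  "$KUBECTL" apply -f nccl-tests/training.yaml || return 1
  # Step 2: block until every pod in the current namespace reports Ready.
  # Assumption: the namespace holds only these pods; the timeout is arbitrary.
  "$KUBECTL" wait --for=condition=Ready pod --all --timeout=600s || return 1
  # Step 3: apply the launcher, which triggers the tests.
  "$KUBECTL" apply -f nccl-tests/launcher.yaml
}
```

Waiting with `kubectl wait` instead of a fixed sleep keeps the launcher from starting before the pods are ready. If the namespace is shared, a label selector (`-l`) matching the labels in training.yaml would be a safer filter than `--all`.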

Torchrun

In torchrun/, there is only one manifest, since there is no launcher. Note that the Dockerfile here expects you to supply a training script. You can read more in the blog!
