
Distributed Tensor Parallelism with Megatron-LM and TrainJob #3178

@andreyvelich

Description


What you would like to be added?

Megatron-LM is a popular library for large-scale distributed training and inference. It implements Tensor Parallelism (TP) and Pipeline Parallelism (PP), enabling the training of LLMs that cannot fit on a single GPU even with FSDP.
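To illustrate how TP and PP interact with the cluster size: Megatron-LM factors the total GPU count into tensor-, pipeline-, and data-parallel degrees, so the TP and PP degrees must divide the world size evenly. A minimal sketch (the helper name is hypothetical, not a Megatron-LM API):

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Compute the data-parallel degree left after TP and PP.

    Megatron-LM factors the world size as world = TP * PP * DP,
    so TP * PP must divide the total GPU count evenly.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by tp*pp = {tp * pp}"
        )
    return world_size // (tp * pp)

# Example: 16 GPUs with 4-way tensor and 2-way pipeline parallelism
print(data_parallel_size(16, tp=4, pp=2))  # -> 2
```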

We can showcase a Kubernetes-native approach to running Megatron-LM jobs at scale using KF Trainer.

Since Megatron-LM uses torchrun as the primary entrypoint for distributed execution, it would be valuable to provide an example Notebook that leverages the torch-distributed runtime. Once @jaiakash finalizes support for GPU ARC runners, we can validate the TP Notebook through our E2E tests.
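A hedged sketch of the torchrun invocation such a Notebook might wrap. The `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` flags and the `pretrain_gpt.py` entrypoint come from upstream Megatron-LM; the model hyperparameters and the helper function itself are illustrative placeholders, not part of any existing runtime:

```python
def megatron_torchrun_cmd(nnodes: int, gpus_per_node: int, tp: int, pp: int,
                          master_addr: str = "localhost",
                          master_port: int = 29500) -> list[str]:
    """Assemble a torchrun command line for Megatron-LM GPT pretraining.

    The model hyperparameters below are placeholders; real jobs would
    pass the full Megatron-LM argument set (data paths, tokenizer, etc.).
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={gpus_per_node}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        "pretrain_gpt.py",
        f"--tensor-model-parallel-size={tp}",
        f"--pipeline-model-parallel-size={pp}",
        "--num-layers=24",
        "--hidden-size=1024",
        "--num-attention-heads=16",
        "--micro-batch-size=4",
    ]

# Example: 2 nodes x 8 GPUs, 8-way TP across each node, 2-way PP across nodes
print(" ".join(megatron_torchrun_cmd(nnodes=2, gpus_per_node=8, tp=8, pp=2)))
```

In a TrainJob, the torch-distributed runtime would supply the rendezvous details (node count, master address/port) per pod, so only the Megatron-LM script arguments would need to be specified by the user.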

Looking ahead, we can also explore whether introducing a dedicated runtime for Megatron-LM would make sense.

Useful resources:

cc @akshaychitneni @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @bigsur0 @kuizhiqing @zw0610 @johnugeorge @Ronkahn21 @ko3n1g

Why is this needed?

Demonstrate how Kubeflow Trainer can be used for large-scale tensor-parallel training with Megatron-LM.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
