
Distributed Tensor Parallelism with Megatron-LM and TrainJob #3178

@andreyvelich

Description


What you would like to be added?

Megatron-LM is a popular library for large-scale distributed training and inference. It implements Tensor Parallelism (TP) and Pipeline Parallelism (PP), enabling the training of LLMs that cannot fit on a single GPU even with FSDP.
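To illustrate how TP and PP interact with the cluster size: Megatron-LM factors the total GPU count into tensor-, pipeline-, and data-parallel degrees, so the TP and PP degrees must divide the world size evenly. A minimal sketch (the helper name is hypothetical, not a Megatron-LM API):

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Compute the data-parallel degree left after TP and PP.

    Megatron-LM factors the world size as world = TP * PP * DP,
    so TP * PP must divide the total GPU count evenly.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by tp*pp = {tp * pp}"
        )
    return world_size // (tp * pp)

# Example: 16 GPUs with 4-way tensor and 2-way pipeline parallelism
print(data_parallel_size(16, tp=4, pp=2))  # -> 2
```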

We can showcase a Kubernetes-native approach to running Megatron-LM jobs at scale using KF Trainer.

Since Megatron-LM uses torchrun as the primary entrypoint for distributed execution, it would be valuable to provide an example Notebook that leverages the torch-distributed runtime. Once @jaiakash finalizes support for GPU ARC runners, we can validate the TP Notebook through our E2E tests.
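A hedged sketch of the torchrun invocation such a Notebook might wrap. The `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` flags and the `pretrain_gpt.py` entrypoint come from upstream Megatron-LM; the model hyperparameters and the helper function itself are illustrative placeholders, not part of any existing runtime:

```python
def megatron_torchrun_cmd(nnodes: int, gpus_per_node: int, tp: int, pp: int,
                          master_addr: str = "localhost",
                          master_port: int = 29500) -> list[str]:
    """Assemble a torchrun command line for Megatron-LM GPT pretraining.

    The model hyperparameters below are placeholders; real jobs would
    pass the full Megatron-LM argument set (data paths, tokenizer, etc.).
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={gpus_per_node}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        "pretrain_gpt.py",
        f"--tensor-model-parallel-size={tp}",
        f"--pipeline-model-parallel-size={pp}",
        "--num-layers=24",
        "--hidden-size=1024",
        "--num-attention-heads=16",
        "--micro-batch-size=4",
    ]

# Example: 2 nodes x 8 GPUs, 8-way TP across each node, 2-way PP across nodes
print(" ".join(megatron_torchrun_cmd(nnodes=2, gpus_per_node=8, tp=8, pp=2)))
```

In a TrainJob, the torch-distributed runtime would supply the rendezvous details (node count, master address/port) per pod, so only the Megatron-LM script arguments would need to be specified by the user.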

Looking ahead, we can also explore whether introducing a dedicated runtime for Megatron-LM would make sense.

Useful resources:

cc @akshaychitneni @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @bigsur0 @kuizhiqing @zw0610 @johnugeorge @Ronkahn21 @ko3n1g

Why is this needed?

Demonstrate how Kubeflow Trainer can be used for large-scale tensor-parallel training with Megatron-LM.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
