What you would like to be added?
Megatron-LM is a popular library for large-scale distributed training and inference. It implements Tensor Parallelism (TP) and Pipeline Parallelism (PP), enabling the training of LLMs that cannot fit on a single GPU even with FSDP.
We can showcase a Kubernetes-native approach to running Megatron-LM jobs at scale using KF Trainer.
Since Megatron-LM uses torchrun as the primary entrypoint for distributed execution, it would be valuable to provide an example Notebook that leverages the torch-distributed runtime. Once @jaiakash finalizes support for GPU ARC runners, we can validate the TP Notebook through our E2E tests.
Looking ahead, we can also explore whether introducing a dedicated runtime for Megatron-LM would make sense.
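For illustration, here is a minimal sketch of what such a Notebook cell could look like, assuming the kubeflow-sdk `TrainerClient`/`CustomTrainer` API and the built-in `torch-distributed` runtime; the exact SDK signatures, runtime name, and resource keys are assumptions to verify against the current SDK, and the training function only shows Megatron-Core parallel-state setup rather than a full training loop:

```python
# Hypothetical sketch: launch a Megatron-Core tensor-parallel job through
# Kubeflow Trainer's torch-distributed runtime. SDK names and fields below
# are assumptions to verify against the current kubeflow-sdk release.
from kubeflow.trainer import CustomTrainer, TrainerClient


def megatron_tp_train():
    """Runs on every pod under torchrun; RANK/LOCAL_RANK are set by the runtime."""
    import os

    import torch
    from megatron.core import parallel_state

    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Split the world into 2-way tensor-parallel groups (no pipeline parallelism here).
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
    )

    # ... build the Megatron model/optimizer and run the training loop here ...

    torch.distributed.destroy_process_group()


client = TrainerClient()
job_name = client.train(
    trainer=CustomTrainer(
        func=megatron_tp_train,
        num_nodes=2,
        resources_per_node={"gpu": 4},  # assumed resource key
    ),
    runtime=client.get_runtime("torch-distributed"),
)
print(job_name)
```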
Useful resources:
- Megatron bridge for HF <-> Megatron model conversion: https://github.com/NVIDIA-NeMo/Megatron-Bridge
- https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/quickstart.html
- https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/parallelisms.html#enable-tensor-parallelism
- [Discussion] PyTorch Operator Improvement #1836
cc @akshaychitneni @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @bigsur0 @kuizhiqing @zw0610 @johnugeorge @Ronkahn21 @ko3n1g
Why is this needed?
Demonstrate how Kubeflow Trainer can be used for large-scale Tensor Parallel training with Megatron-LM.
Love this feature?
Give it a 👍 We prioritize the features with most 👍