Conversation

@yaoyu-33 (Contributor) commented Jan 21, 2026

Summary

This PR introduces Local Parallel Groups support for Megatron-Bridge, enabling a ProcessGroupCollection to be passed explicitly through function arguments instead of relying on Megatron-Core's global parallel state (mpu).

Key Features

1. New Config Flag

  • Added use_decentralized_pg flag to DistributedInitConfig
  • When enabled, creates ProcessGroupCollection using HyperCommGrid instead of mpu globals

2. ProcessGroupCollection Propagation

  • Updated training modules (train.py, gpt_step.py, vlm_step.py) to pass pg_collection explicitly
  • Removed direct parallel_state dependencies from the training loop (see the sketch below)
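
For illustration, a minimal sketch of what a step function looks like once the collection is passed explicitly; the function signature and the dp/pp attribute names on pg_collection are assumptions, not the exact Megatron-Bridge API:

import torch

# Hypothetical step function: group queries use the passed-in collection,
# not megatron.core.parallel_state globals. Attribute names are assumed.
def forward_step(state, data_iterator, model, pg_collection):
    dp_rank = torch.distributed.get_rank(group=pg_collection.dp)
    is_last_pp_stage = (
        torch.distributed.get_rank(group=pg_collection.pp)
        == torch.distributed.get_world_size(group=pg_collection.pp) - 1
    )
    # ... build the batch for dp_rank, compute the loss only on the last stage, etc.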

3. Model Provider Updates

  • Added pg_collection parameter to get_model() and provide_distributed_model()
  • Models receive process groups via an explicit parameter instead of global state (see the sketch below)
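
As a rough sketch of what a provider can do with the explicitly passed groups (the tp/pp attribute names on pg_collection are assumptions; the real providers in this PR may query different groups):

import torch

# Sketch: parallel sizes come from the passed groups rather than
# parallel_state.get_tensor_model_parallel_world_size() and friends.
def provide_model(pg_collection):
    tp_size = torch.distributed.get_world_size(group=pg_collection.tp)
    pp_size = torch.distributed.get_world_size(group=pg_collection.pp)
    # ... construct the Megatron model for this rank using these sizes and return it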

4. Checkpointing Updates

  • Refactored checkpointing to use ProcessGroupCollection instead of mpu globals (see the sketch below)
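
A small illustrative sketch of checkpoint gating keyed off the passed groups rather than mpu globals (the dp attribute name and the helper are assumptions, not the actual checkpointing code):

import torch

# Sketch: e.g. only data-parallel rank 0 writes each shard.
def is_checkpoint_writer(pg_collection):
    return torch.distributed.get_rank(group=pg_collection.dp) == 0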

5. Data Loading

  • Updated setup_data_iterators to accept a dp_group parameter for data sharding (see the sketch below)
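
For example, dp_group-based sharding might look like the following sketch (illustrative only, not the actual setup_data_iterators implementation):

import torch

# Sketch: shard the dataset by data-parallel rank using the explicit dp_group.
def shard_indices(num_samples, dp_group):
    dp_rank = torch.distributed.get_rank(group=dp_group)
    dp_size = torch.distributed.get_world_size(group=dp_group)
    # strided sharding: each data-parallel rank reads every dp_size-th sample
    return range(dp_rank, num_samples, dp_size)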

Examples

Added examples in examples/recipes/local_parallel_groups/:

  • pretrain_qwen3_simple.py: Simple approach, using a recipe with use_decentralized_pg=True
  • pretrain_qwen3_with_local_parallel_groups.py: Advanced approach, with manual HyperCommGrid and ProcessGroupCollection creation
  • README.md: Documentation for both approaches

Usage

Simple Approach

# Start from an existing recipe config and opt in to local parallel groups.
cfg = qwen3_4b_pretrain_config(...)
cfg.dist.use_decentralized_pg = True        # build a ProcessGroupCollection via HyperCommGrid
cfg.dist.use_gloo_process_groups = False    # Gloo groups are not supported in this mode
pretrain(config=cfg, forward_step_func=forward_step)
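
Advanced Approach

A rough sketch of the manual path from pretrain_qwen3_with_local_parallel_groups.py: build a HyperCommGrid, derive the groups, and wrap them in a ProcessGroupCollection. The import paths, constructor arguments, create_pg calls, and ProcessGroupCollection fields shown here are assumptions; see the example script and README for the exact API.

import torch
from megatron.core.hyper_comm_grid import HyperCommGrid
from megatron.core.process_groups_config import ProcessGroupCollection

torch.distributed.init_process_group(backend="nccl")

# 8 ranks laid out as tp=2 x pp=2 x dp=2 (dimension names and order are illustrative)
grid = HyperCommGrid(shape=[2, 2, 2], dim_names=["tp", "pp", "dp"])
tp_group = grid.create_pg("tp")
pp_group = grid.create_pg("pp")
dp_group = grid.create_pg("dp")

# Field names on ProcessGroupCollection are assumed here.
pg_collection = ProcessGroupCollection(tp=tp_group, pp=pp_group, dp=dp_group)
# The example script shows how this collection is handed to the training entry point.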

Use Cases

  • Reinforcement Learning: Multiple model instances (policy, value, reference) with different parallelism
  • Multi-Model Pipelines: Complex workflows requiring explicit control over communication
  • Testing/Debugging: Isolated process groups without global state side effects

Testing

  • Unit tests: tests/unit_tests/training/test_local_parallel_groups.py
  • Functional tests: tests/functional_tests/training/test_local_parallel_groups.py

# Run the example on 8 GPUs
torchrun --nproc_per_node=8 examples/recipes/local_parallel_groups/pretrain_qwen3_simple.py

Limitations

  • Gloo process groups are not supported (NCCL only)
  • ModelOpt sharded checkpointing is disabled when using local parallel groups

@shifangx (Contributor)
Hi @yaoyu-33, could you add a section "x. Optimizer Updates" to the PR description? We also changed the optimizer so that it receives pg_collection at init.
