Conversation

@yaoyu-33 (Contributor) commented Jan 21, 2026

Summary

This PR introduces Local Parallel Groups support for Megatron-Bridge, enabling a ProcessGroupCollection to be passed explicitly through function arguments instead of relying on Megatron-Core's global parallel state (mpu).

Key Features

1. New Config Flag

  • Added use_decentralized_pg flag to DistributedInitConfig
  • When enabled, creates ProcessGroupCollection using HyperCommGrid instead of mpu globals

2. ProcessGroupCollection Propagation

  • Updated training modules (train.py, gpt_step.py, vlm_step.py) to pass pg_collection explicitly
  • Removed direct parallel_state dependencies from the training loop (see the sketch below)
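
For illustration, a minimal sketch of what a step function looks like once the collection is passed explicitly; the function signature and the dp/pp attribute names on pg_collection are assumptions, not the exact Megatron-Bridge API:

import torch

# Hypothetical step function: group queries use the passed-in collection,
# not megatron.core.parallel_state globals. Attribute names are assumed.
def forward_step(state, data_iterator, model, pg_collection):
    dp_rank = torch.distributed.get_rank(group=pg_collection.dp)
    is_last_pp_stage = (
        torch.distributed.get_rank(group=pg_collection.pp)
        == torch.distributed.get_world_size(group=pg_collection.pp) - 1
    )
    # ... build the batch for dp_rank, compute the loss only on the last stage, etc.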

3. Model Provider Updates

  • Added pg_collection parameter to get_model() and provide_distributed_model()
  • Models receive process groups via an explicit parameter instead of global state (see the sketch below)
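
As a rough sketch of what a provider can do with the explicitly passed groups (the tp/pp attribute names on pg_collection are assumptions; the real providers in this PR may query different groups):

import torch

# Sketch: parallel sizes come from the passed groups rather than
# parallel_state.get_tensor_model_parallel_world_size() and friends.
def provide_model(pg_collection):
    tp_size = torch.distributed.get_world_size(group=pg_collection.tp)
    pp_size = torch.distributed.get_world_size(group=pg_collection.pp)
    # ... construct the Megatron model for this rank using these sizes and return it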

4. Checkpointing Updates

  • Refactored checkpointing to use ProcessGroupCollection instead of mpu globals (see the sketch below)
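
A small illustrative sketch of checkpoint gating keyed off the passed groups rather than mpu globals (the dp attribute name and the helper are assumptions, not the actual checkpointing code):

import torch

# Sketch: e.g. only data-parallel rank 0 writes each shard.
def is_checkpoint_writer(pg_collection):
    return torch.distributed.get_rank(group=pg_collection.dp) == 0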

5. Data Loading

  • Updated setup_data_iterators to accept a dp_group parameter for data sharding (see the sketch below)
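
For example, dp_group-based sharding might look like the following sketch (illustrative only, not the actual setup_data_iterators implementation):

import torch

# Sketch: shard the dataset by data-parallel rank using the explicit dp_group.
def shard_indices(num_samples, dp_group):
    dp_rank = torch.distributed.get_rank(group=dp_group)
    dp_size = torch.distributed.get_world_size(group=dp_group)
    # strided sharding: each data-parallel rank reads every dp_size-th sample
    return range(dp_rank, num_samples, dp_size)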

Examples

Added examples in examples/recipes/local_parallel_groups/:

  • pretrain_qwen3_simple.py: Simple approach, using a recipe with use_decentralized_pg=True
  • pretrain_qwen3_with_local_parallel_groups.py: Advanced approach, with manual HyperCommGrid and ProcessGroupCollection creation
  • README.md: Documentation for both approaches

Usage

Simple Approach

# Start from an existing recipe config and opt in to local parallel groups.
cfg = qwen3_4b_pretrain_config(...)
cfg.dist.use_decentralized_pg = True        # build a ProcessGroupCollection via HyperCommGrid
cfg.dist.use_gloo_process_groups = False    # Gloo groups are not supported in this mode
pretrain(config=cfg, forward_step_func=forward_step)
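
Advanced Approach

A rough sketch of the manual path from pretrain_qwen3_with_local_parallel_groups.py: build a HyperCommGrid, derive the groups, and wrap them in a ProcessGroupCollection. The import paths, constructor arguments, create_pg calls, and ProcessGroupCollection fields shown here are assumptions; see the example script and README for the exact API.

import torch
from megatron.core.hyper_comm_grid import HyperCommGrid
from megatron.core.process_groups_config import ProcessGroupCollection

torch.distributed.init_process_group(backend="nccl")

# 8 ranks laid out as tp=2 x pp=2 x dp=2 (dimension names and order are illustrative)
grid = HyperCommGrid(shape=[2, 2, 2], dim_names=["tp", "pp", "dp"])
tp_group = grid.create_pg("tp")
pp_group = grid.create_pg("pp")
dp_group = grid.create_pg("dp")

# Field names on ProcessGroupCollection are assumed here.
pg_collection = ProcessGroupCollection(tp=tp_group, pp=pp_group, dp=dp_group)
# The example script shows how this collection is handed to the training entry point.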

Use Cases

  • Reinforcement Learning: Multiple model instances (policy, value, reference) with different parallelism
  • Multi-Model Pipelines: Complex workflows requiring explicit control over communication
  • Testing/Debugging: Isolated process groups without global state side effects

Testing

  • Unit tests: tests/unit_tests/training/test_local_parallel_groups.py
  • Functional tests: tests/functional_tests/training/test_local_parallel_groups.py

# Run the example on 8 GPUs
torchrun --nproc_per_node=8 examples/recipes/local_parallel_groups/pretrain_qwen3_simple.py

Limitations

  • Gloo process groups are not supported (NCCL only)
  • ModelOpt sharded checkpointing is disabled when using local parallel groups

@shifangx (Contributor)
Hi @yaoyu-33, could you add a section "x. Optimizer Updates" to the PR description? We also changed the optimizer so that it receives pg_collection at init.
