
[megatron] feat: support Megatron-FSDP mode for Megatron backend#5423

Open
conver334 wants to merge 30 commits into verl-project:main from conver334:megatron-fsdp

Conversation

Contributor

@conver334 conver334 commented Feb 27, 2026

What does this PR do?

Support using Megatron-FSDP for SFT and RL.

Add Megatron-FSDP as a new training backend option for the Megatron engine. This is an implementation of #5244.

Key changes:

  • Add example scripts for GRPO (examples/grpo_trainer/run_qwen2-7b_math_megatron_fsdp.sh) and SFT (examples/sft/gsm8k/run_qwen_megatron_fsdp.sh); run them following the user guide docs/examples/megatron_fsdp_example.rst.
  • Add a use_megatron_fsdp config flag to McoreEngineConfig, so Megatron-FSDP can be enabled via a single config option.
  • When enabled, automatically configure the required DDP settings (distributed optimizer, Zero-3 sharding strategy, grad reduce overlap) with sensible defaults, while still allowing fine-grained overrides via override_ddp_config (see the sketch after this list).
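
A minimal sketch of the intended defaults-plus-override behavior, assuming a plain dict merge; the helper name build_ddp_config and the exact default values are illustrative assumptions, not the actual verl implementation:

    # Illustrative sketch only: names and defaults are assumptions, not verl's code.
    def build_ddp_config(use_megatron_fsdp: bool, override_ddp_config: dict | None) -> dict:
        ddp_config: dict = {}
        if use_megatron_fsdp:
            # Defaults described above: distributed optimizer, Zero-3-style
            # sharding, and overlapped gradient reduction.
            ddp_config.update(
                use_distributed_optimizer=True,
                data_parallel_sharding_strategy="optim_grads_params",  # Zero-3 style
                overlap_grad_reduce=True,
            )
        # Anything the user sets in override_ddp_config wins over the defaults.
        ddp_config.update(override_ddp_config or {})
        return ddp_config

    # Example: keep FSDP enabled but drop to optim+grads sharding without overlap.
    cfg = build_ddp_config(True, {"data_parallel_sharding_strategy": "optim_grads",
                                  "overlap_grad_reduce": False})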

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [megatron, model] feat: qwen3.5 example  #5381
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Loss is normal in SFT. Tested on 8×H100 with Moonlight-16B-A3B-Instruct, GSM8K SFT dataset.
[loss curve: moonlight_correct]

Loss is normal in SFT. Tested on 8×H100 with Qwen2.5-Math-7B, GSM8K SFT dataset.
[loss curve: qwen7b]

Reward is normal in GRPO. Tested on 8×H100 with Qwen2.5-Math-7B, GSM8K.
[reward curve: qwen7b-grpo-reward]

MFU
[W&B chart, 2026-04-13]

API and Usage Example

Enable Megatron-FSDP by setting three config flags:

    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.megatron.vanilla_mbridge=False \
    actor_rollout_ref.actor.megatron.use_megatron_fsdp=True \

The FSDP-specific DDP settings (sharding strategy, overlap, etc.) are auto-configured with defaults. Advanced users can override them:

    actor_rollout_ref.actor.megatron.override_ddp_config.data_parallel_sharding_strategy=optim_grads \
    actor_rollout_ref.actor.megatron.override_ddp_config.overlap_grad_reduce=False \

Design & Code Changes

Megatron-FSDP uses the same training loop as Megatron.
Conversion between HuggingFace format and Megatron-FSDP DTensor is implemented via NVIDIA-NeMo/Megatron-Bridge#1910.
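
The idea behind the DTensor-to-HF export can be illustrated with a short sketch; this is an assumption-level illustration (plain full_tensor() gathering, no parameter-name mapping), not the Megatron-Bridge code, and to_hf_state_dict is a hypothetical helper:

    # Illustrative sketch, not Megatron-Bridge: gather each sharded DTensor into a
    # full tensor before writing an HF-style state dict. The real bridge also maps
    # Megatron parameter names/layouts to their HuggingFace counterparts.
    import torch
    from torch.distributed.tensor import DTensor

    def to_hf_state_dict(fsdp_state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        hf_state = {}
        for name, param in fsdp_state_dict.items():
            # full_tensor() all-gathers the shards across the FSDP device mesh.
            hf_state[name] = param.full_tensor() if isinstance(param, DTensor) else param
        return hf_state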

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for Megatron-FSDP as a new training backend, including configuration flags, automatic DDP settings, and state lifecycle management. The overall implementation is sound, but I've identified a critical issue in the FSDP parameter synchronization logic that could lead to inconsistent model states during inference. Additionally, the new example script has a hardcoded model path, which impacts its portability. I've provided suggestions to fix these issues.

Comment thread: verl/utils/megatron_utils.py (outdated)
Comment thread: examples/grpo_trainer/run_qwen2-7b_math_megatron_fsdp.sh (outdated)
@ISEEKYAN ISEEKYAN marked this pull request as draft February 27, 2026 06:41
@ISEEKYAN ISEEKYAN changed the title from "[BREAKING][megatron] feat: support Megatron-FSDP as a new training backend" to "[megatron] feat: support Megatron-FSDP mode for Megatron backend" on Mar 4, 2026
conver334 and others added 3 commits April 10, 2026 01:01
Resolve 3 conflicts:
- megatron_utils.py: keep FSDP ddp_config in main's cleaner structure;
  take main's GDN-aware grad buffer handling
- megatron_checkpoint_manager.py: apply FSDP while-loop unwrap in main's
  should_generate_model_sections structure; use main's refactored PEFT
  save path with FSDP skip for HF checkpoint
- run_sft_engine.sh: keep both megatron_fsdp and automodel backends

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

yxs commented Apr 10, 2026

I tested both SFT and GRPO (without ALL_OFFLOAD) on 8×H100 with Qwen2.5-0.5B, Megatron-LM main, and Megatron-Bridge PR #1910:

  • SFT: loss decreasing normally (0.695 → 0.543 over 21 steps)
  • GRPO: 447 steps completed, reward improving (0.006 → 0.72), DTensor weight export works correctly

@conver334 All looks good to me now.

@conver334 conver334 marked this pull request as ready for review April 13, 2026 09:10
conver334 and others added 5 commits April 22, 2026 23:34
# Conflicts:
#	verl/workers/megatron_workers.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Simiao Zuo <simiaoz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Simiao Zuo <simiaoz@nvidia.com>
Update the Megatron-Bridge pin in the Megatron-FSDP CI job and example doc
to 6fea5bb (merge commit of NVIDIA-NeMo/Megatron-Bridge#3512), which is the
preferred version now that the HF<->Megatron-FSDP weight conversion PR has
landed. Also drop the now-merged Megatron-LM PR3191 and Megatron-Bridge
PR1910 checkouts from the example doc in favor of the same pinned commits
used in CI, and refresh the doc's "Last updated" date.

Co-authored-by: Claude
conver334 and others added 14 commits April 27, 2026 00:48
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: conver334 <conver334@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: conver334 <conver334@gmail.com>
Co-authored-by: OpenAI Codex
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Save and load Megatron-FSDP trainer checkpoints through PyTorch DCP, including model, optimizer, scheduler, and RNG state. Preserve HF export support through Megatron-Bridge and document the current example assumptions.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: conver334 <conver334@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>