
[trainer] feat: add mindspeedmm backend engine support on NPU; support Qwen3.5-27B and Qwen3.5-35B #6199

Open
OneMondy wants to merge 1 commit into verl-project:main from OneMondy:main

Conversation

@OneMondy

… Qwen3.5-27B, Qwen3.5-35B

What does this PR do?

Add a concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with a comma, like [megatron, fsdp, doc]
    • {type} is one of feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```
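Below is a minimal launch sketch, not taken from the PR itself: the `actor_rollout_ref.actor.mindspeed.*` keys are grounded in the review comments further down, while the `verl.trainer.main_ppo` entry point and the overall invocation shape are assumptions about how such a script would be run:

```bash
# Hypothetical invocation; only the mindspeed.* keys below appear in this PR's scripts.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron \
    +actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform \
    +actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_granularity=full \
    +actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1 \
    +actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_grad_reduce=True \
    +actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_param_gather=True
```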

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review; otherwise the reviewer may deprioritize this PR.

… Qwen3.5-27B, Qwen3.5-35B

Co-authored-by: pengnuoheng <18720048515@163.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for MindSpeed FSDP and refactors the MindSpeed Megatron backend, including new training scripts for Qwen3.5 models and a dedicated MindSpeed optimizer configuration. Key changes involve updating MindSpeedEngineConfig to support multiple strategies, enhancing VLM support through improved 3D position ID handling in TensorDict, and implementing MindSpeedFSDPEngineWithLMHead. Review feedback correctly identified several typos in strategy names (minspeed_megatron) within the shell scripts and missing imports for torch, re, and shutil in the MindSpeed engine implementation.
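The summary above says MindSpeedEngineConfig now accepts multiple strategies and that an unknown name fails an assertion in __post_init__. A minimal sketch of what that validation plausibly looks like; the supported-strategy tuple and the `llm_kwargs` field are assumptions for illustration, not code from this PR:

```python
from dataclasses import dataclass, field

# Hypothetical reconstruction; verl's actual MindSpeedEngineConfig may differ.
_SUPPORTED_STRATEGIES = ("mindspeed_megatron", "mindspeed_fsdp")


@dataclass
class MindSpeedEngineConfig:
    strategy: str = "mindspeed_megatron"
    llm_kwargs: dict = field(default_factory=dict)

    def __post_init__(self):
        # A typo such as "minspeed_megatron" trips this check at config-build time,
        # which is the assertion failure the review comments below refer to.
        assert self.strategy in _SUPPORTED_STRATEGIES, (
            f"unknown strategy {self.strategy!r}; expected one of {_SUPPORTED_STRATEGIES}"
        )
```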

```bash
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_grad_reduce=True
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_param_gather=True
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```

Severity: high

Typo in strategy name: minspeed_megatron should be mindspeed_megatron. This will cause an assertion failure in MindSpeedEngineConfig.__post_init__. Additionally, use + as the prefix for overriding configuration values in Hydra.

Suggested change (first line is the current script line, second the proposed replacement):
```bash
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
+actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
  1. Use + instead of ++ as the prefix for overriding configuration values in Hydra.
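For context (this note is not part of the bot's comment): Hydra distinguishes three override prefixes, illustrated below under the assumption that the base config already defines `strategy` but not the `llm_kwargs` entries:

```bash
# No prefix: override a key that already exists in the config.
actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
# +: add a key that is absent from the base config (errors if it already exists).
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
# ++: add the key if absent, override it if present.
++actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
```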

```bash
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_granularity=full
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```

Severity: high

Typo in strategy name: minspeed_megatron should be mindspeed_megatron. Additionally, use + as the prefix for overriding configuration values in Hydra.

Suggested change (current line, then proposed replacement):
```bash
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
+actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
  1. Use + instead of ++ as the prefix for overriding configuration values in Hydra.

```bash
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_granularity=full
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```

Severity: high

Typo in strategy name: minspeed_megatron should be mindspeed_megatron. Additionally, use + as the prefix for overriding configuration values in Hydra.

Suggested change (current line, then proposed replacement):
```bash
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
+actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
  1. Use + instead of ++ as the prefix for overriding configuration values in Hydra.

```bash
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_grad_reduce=True
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_param_gather=True
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```

Severity: high

Typo in strategy name: minspeed_megatron should be mindspeed_megatron. Additionally, use + as the prefix for overriding configuration values in Hydra.

Suggested change (current line, then proposed replacement):
```bash
actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
+actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
  1. Use + instead of ++ as the prefix for overriding configuration values in Hydra.

```python
if not self.mm_args.training.no_save_optim:
    state["optimizer"] = self.optimizer
if not self.mm_args.training.no_save_rng:
    state["extra_state"]["torch_rng_state"] = torch.get_rng_state()
```

Severity: high

The torch module is not imported in this scope. This will cause a NameError when attempting to save the RNG state.

Suggested change:
```diff
-state["extra_state"]["torch_rng_state"] = torch.get_rng_state()
+import torch
+state["extra_state"]["torch_rng_state"] = torch.get_rng_state()
```

if "lr_scheduler" in state["extra_state"]:
self.lr_scheduler.load_state_dict(state["extra_state"]["lr_scheduler"])
if not self.mm_args.training.no_load_rng and "torch_rng_state" in state["extra_state"]:
torch.set_rng_state(state["extra_state"]["torch_rng_state"])

Severity: high

The torch module is not imported in this scope. This will cause a NameError when attempting to load the RNG state.

Suggested change:
```diff
-torch.set_rng_state(state["extra_state"]["torch_rng_state"])
+import torch
+torch.set_rng_state(state["extra_state"]["torch_rng_state"])
```

Comment on lines +343 to +344:
```python
def _cleanup_old_checkpoints(self, base_path, max_to_keep):
    iter_pattern = re.compile(r"iter_(\d+)")
```

Severity: high

The re module is not imported in this file. This will cause a NameError when attempting to compile the iteration pattern.

Suggested change:
```diff
-def _cleanup_old_checkpoints(self, base_path, max_to_keep):
-    iter_pattern = re.compile(r"iter_(\d+)")
+def _cleanup_old_checkpoints(self, base_path, max_to_keep):
+    import re
+    iter_pattern = re.compile(r"iter_(\d+)")
```

Comment on lines +354 to +355:
```python
if os.path.isdir(old_dir):
    shutil.rmtree(old_dir)
```

Severity: high

The shutil module is not imported in this file. This will cause a NameError when attempting to remove old checkpoint directories.

Suggested change:
```diff
-if os.path.isdir(old_dir):
-    shutil.rmtree(old_dir)
+if os.path.isdir(old_dir):
+    import shutil
+    shutil.rmtree(old_dir)
```
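All three missing-import findings share one root cause, and the more idiomatic fix is a single set of module-level imports rather than the function-local imports shown in the suggestions. A hedged sketch of the whole cleanup helper with imports hoisted; only the `iter_(\d+)` directory naming is grounded in the snippets above, the rest is a guess at the PR's logic:

```python
import os
import re
import shutil


def cleanup_old_checkpoints(base_path: str, max_to_keep: int) -> None:
    """Keep the newest max_to_keep iter_<n> checkpoint dirs and delete the rest.

    Hypothetical reconstruction of this PR's _cleanup_old_checkpoints.
    """
    iter_pattern = re.compile(r"iter_(\d+)")
    # Collect (iteration, path) pairs for every matching checkpoint directory.
    checkpoints = []
    for name in os.listdir(base_path):
        match = iter_pattern.fullmatch(name)
        if match:
            checkpoints.append((int(match.group(1)), os.path.join(base_path, name)))
    # Oldest first; everything beyond the retention budget gets removed.
    checkpoints.sort()
    for _, old_dir in checkpoints[: max(0, len(checkpoints) - max_to_keep)]:
        if os.path.isdir(old_dir):
            shutil.rmtree(old_dir)
```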

@wuxibin89
Collaborator

Too many changes are coupled into this PR; please move all changes except mindspeedllm to a separate PR.
