[trainer] feat: add mindspeedmm backend engine support on NPU. Support Qwen3.5-27B, Qwen3.5-35B #6199
OneMondy wants to merge 1 commit into verl-project:main
Conversation
… Qwen3.5-27B, Qwen3.5-35B
Co-authored-by: pengnuoheng <18720048515@163.com>
Code Review
This pull request introduces support for MindSpeed FSDP and refactors the MindSpeed Megatron backend, including new training scripts for Qwen3.5 models and a dedicated MindSpeed optimizer configuration. Key changes involve updating MindSpeedEngineConfig to support multiple strategies, enhancing VLM support through improved 3D position ID handling in TensorDict, and implementing MindSpeedFSDPEngineWithLMHead. Review feedback correctly identified several typos in strategy names (minspeed_megatron) within the shell scripts and missing imports for torch, re, and shutil in the MindSpeed engine implementation.
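The comments below reference an assertion failure in `MindSpeedEngineConfig.__post_init__`. As a minimal sketch of that validation pattern — the field name, default, and accepted strategy values here are assumptions for illustration, not the actual verl source:

```python
from dataclasses import dataclass


@dataclass
class MindSpeedEngineConfig:
    # "strategy" selects the backend; values are illustrative.
    strategy: str = "mindspeed_megatron"

    def __post_init__(self):
        supported = {"mindspeed_megatron", "mindspeed_fsdp"}
        # A misspelling such as "minspeed_megatron" fails this check,
        # which is the assertion failure the comments below refer to.
        assert self.strategy in supported, f"Unsupported strategy: {self.strategy!r}"
```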
```diff
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_grad_reduce=True
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_param_gather=True
 actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```
Typo in strategy name: `minspeed_megatron` should be `mindspeed_megatron`. This will cause an assertion failure in `MindSpeedEngineConfig.__post_init__`. Additionally, use `+` as the prefix for overriding configuration values in Hydra.
Suggested change:
```diff
-actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
++actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
- Use `+` instead of `++` as the prefix for overriding configuration values in Hydra.
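For context, the three Hydra CLI override prefixes behave as follows (general Hydra semantics, not specific to this PR):

```
key=value     # override an existing key; errors if the key is absent
+key=value    # add a new key; errors if the key already exists
++key=value   # add or override, regardless of whether the key exists
```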
```diff
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_granularity=full
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
 actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```
Typo in strategy name: `minspeed_megatron` should be `mindspeed_megatron`. Additionally, use `+` as the prefix for overriding configuration values in Hydra.
Suggested change:
```diff
-actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
++actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
- Use `+` instead of `++` as the prefix for overriding configuration values in Hydra.
```diff
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_method=uniform
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_granularity=full
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
 actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```
Typo in strategy name: `minspeed_megatron` should be `mindspeed_megatron`. Additionally, use `+` as the prefix for overriding configuration values in Hydra.
Suggested change:
```diff
-actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
++actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
- Use `+` instead of `++` as the prefix for overriding configuration values in Hydra.
```diff
+actor_rollout_ref.actor.mindspeed.llm_kwargs.recompute_num_layers=1
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_grad_reduce=True
+actor_rollout_ref.actor.mindspeed.llm_kwargs.overlap_param_gather=True
 actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
```
Typo in strategy name: `minspeed_megatron` should be `mindspeed_megatron`. Additionally, use `+` as the prefix for overriding configuration values in Hydra.
Suggested change:
```diff
-actor_rollout_ref.actor.mindspeed.strategy=minspeed_megatron
++actor_rollout_ref.actor.mindspeed.strategy=mindspeed_megatron
```
References
- Use `+` instead of `++` as the prefix for overriding configuration values in Hydra.
```python
if not self.mm_args.training.no_save_optim:
    state["optimizer"] = self.optimizer
if not self.mm_args.training.no_save_rng:
    state["extra_state"]["torch_rng_state"] = torch.get_rng_state()
```
The `torch` module is not imported in this file. This will cause a `NameError` when calling `torch.get_rng_state()`.
| if "lr_scheduler" in state["extra_state"]: | ||
| self.lr_scheduler.load_state_dict(state["extra_state"]["lr_scheduler"]) | ||
| if not self.mm_args.training.no_load_rng and "torch_rng_state" in state["extra_state"]: | ||
| torch.set_rng_state(state["extra_state"]["torch_rng_state"]) |
The same missing import affects this snippet: `torch.set_rng_state()` is called, but `torch` is never imported.
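As a self-contained illustration of the save/restore pattern above — a minimal sketch using only public `torch` APIs; the `state` dict layout mirrors the reviewed snippets, everything else is illustrative:

```python
import torch

state = {"extra_state": {}}
# Save path: capture the global CPU RNG state.
state["extra_state"]["torch_rng_state"] = torch.get_rng_state()

torch.manual_seed(1234)  # perturb the RNG in between

# Load path: restore the saved RNG state if present.
if "torch_rng_state" in state["extra_state"]:
    torch.set_rng_state(state["extra_state"]["torch_rng_state"])
```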
```python
def _cleanup_old_checkpoints(self, base_path, max_to_keep):
    iter_pattern = re.compile(r"iter_(\d+)")
```
The `re` module is not imported in this file. This will cause a `NameError` when attempting to compile the iteration pattern.
Suggested change:
```diff
 def _cleanup_old_checkpoints(self, base_path, max_to_keep):
-    iter_pattern = re.compile(r"iter_(\d+)")
+    import re
+    iter_pattern = re.compile(r"iter_(\d+)")
```
```python
if os.path.isdir(old_dir):
    shutil.rmtree(old_dir)
```
The `shutil` module is not imported in this file. This will cause a `NameError` when removing old checkpoint directories.
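Putting the three import comments together, here is a hypothetical reconstruction of the cleanup helper with the missing imports hoisted to module level. Only the lines quoted above come from the PR; the directory-scanning logic in between is an assumption about what the helper does:

```python
import os
import re
import shutil


def _cleanup_old_checkpoints(base_path, max_to_keep):
    iter_pattern = re.compile(r"iter_(\d+)")
    # Collect (iteration, path) pairs for checkpoint directories.
    checkpoints = []
    for name in os.listdir(base_path):
        match = iter_pattern.fullmatch(name)
        if match:
            checkpoints.append((int(match.group(1)), os.path.join(base_path, name)))
    # Keep the newest max_to_keep checkpoints and delete the rest.
    checkpoints.sort(reverse=True)
    for _, old_dir in checkpoints[max_to_keep:]:
        if os.path.isdir(old_dir):
            shutil.rmtree(old_dir)
```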
Too many changes are coupled into this PR; please move all changes except for mindspeedllm into a separate PR.
… Qwen3.5-27B, Qwen3.5-35B
What does this PR do?
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`; separate multiple modules with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
```
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Send a message in the `ci-request` channel in the `verl` Slack workspace once the PR is ready for CI. (If not accessible, please try the Feishu group.)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.