Labels: bug (Something isn't working)
Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
With save_steps/eval_steps=5, everything works normally.
With save_steps/eval_steps=10, iterations 10 and 20 both save normally.
With save_steps/eval_steps=20, the following error occurs when saving weights at iteration 20 (the same happens with 200):
2026-01-09 17:40:31.169301 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/base.py", line 223, in save
2026-01-09 17:40:31.169303 master-0 >> [rank7]: async_request = self.async_save(sharded_state_dict, checkpoint_dir)
2026-01-09 17:40:31.169305 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169306 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 764, in async_save
2026-01-09 17:40:31.169307 master-0 >> [rank7]: ) = save_state_dict_async_plan(
2026-01-09 17:40:31.169309 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.16931 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/state_dict_saver.py", line 141, in save_state_dict_async_plan
2026-01-09 17:40:31.169312 master-0 >> [rank7]: all_local_plans = dist_wrapper.gather_object(local_plan)
2026-01-09 17:40:31.169313 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169315 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 135, in gather_object
2026-01-09 17:40:31.169318 master-0 >> [rank7]: dist.gather_object(
2026-01-09 17:40:31.169319 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169321 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.169322 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169324 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3301, in gather_object
2026-01-09 17:40:31.169325 master-0 >> [rank7]: gather(
2026-01-09 17:40:31.169327 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169328 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.16933 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169331 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4201, in gather
2026-01-09 17:40:31.169333 master-0 >> [rank7]: work = group.gather(output_tensors, input_tensors, opts)
2026-01-09 17:40:31.169334 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169336 master-0 >> [rank7]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
2026-01-09 17:41:12.871213 master-0 >> [rank7]:[W109 17:41:12.150540851 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2026-01-09 17:41:22.907089 master-0 >> W0109 17:41:22.906000 74 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 139 closing signal SIGTERM

Following #5594, I modified megatron/core/energy_monitor.py as below, but it had no effect:
# def _get_energy(self) -> int:
# """Get current energy consumption from NVML."""
# try:
# return nvmlDeviceGetTotalEnergyConsumption(self._handle)
# except NVMLError:
# return self._last_energy # return *something* if it errors
def _get_energy(self) -> int:
        return self._last_energy  # return *something* if it errors

How to Reproduce
The environment is the official image modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3,
with the following packages upgraded or installed:
pip install transformers==5.2.0
pip install peft -U
pip install ms-swift==4.0.0
pip install flash-linear-attention==0.4.1
pip install causal-conv1d  # https://github.com/Dao-AILab/causal-conv1d/releases/tag/v1.6.0
along with flash-attn v3.
Hardware: 8× H100 GPUs.
The fine-tuning script is as follows:
export SWIFT_PATCH_CONV3D=False
EXP_NAME="xxx"
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
megatron sft \
--model /workspace/models/pretrained/Qwen3.5-35B-A3B \
--dataset "/workspace/swift_channel.jsonl" \
--load_from_cache_file true \
--add_non_thinking_prefix true \
--split_dataset_ratio 0.02 \
--train_type lora \
--lora_rank 16 \
--lora_alpha 32 \
--target_modules all-linear \
--expert_model_parallel_size 4 \
--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 5e-4 \
--micro_batch_size 1 \
--global_batch_size 128 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--num_train_epochs 3 \
--finetune true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--cross_entropy_loss_fusion true \
--lr 5e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--wandb_project moe-train \
--wandb_exp_name "${EXP_NAME}" \
--output_dir "megatron_output/${EXP_NAME}" \
--eval_steps 20 \
--save_steps 20 \
--logging_steps 5 \
--max_length 32000 \
--dataloader_num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--attention_backend flash \
--padding_free false \
--use_rslora \
    --enable_channel_loss

The dataset is plain-text data.
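Since the RuntimeError above says "run with NCCL_DEBUG=INFO for details", one way to gather more context before relaunching the same `megatron sft` command is to export the standard NCCL and CUDA runtime debug variables (a minimal sketch; the subsystem list can be adjusted):

```shell
# Enable verbose NCCL logging so the underlying CUDA error is reported
# with context; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL
# environment variables.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
# Surface CUDA errors at the failing call instead of asynchronously:
export CUDA_LAUNCH_BLOCKING=1
```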
Other possibly relevant library versions:
megatron-core 0.15.0
pynvml 13.0.1
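To double-check the installed versions above on the affected machine, a small stdlib-only sketch (the package names are assumed PyPI distribution names; adjust the list as needed):

```python
# Print installed versions of the packages relevant to this report.
# importlib.metadata is part of the standard library (Python 3.8+).
from importlib.metadata import version, PackageNotFoundError

for pkg in ["megatron-core", "pynvml", "transformers", "peft",
            "ms-swift", "flash-linear-attention", "causal-conv1d"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} not installed")
```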
Additional Information
No response