
Qwen3.5-35B-A3B fine-tuning: NCCL Error 1: unhandled cuda error when saving weights #8228

@ooochen-30

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

save_steps/eval_steps=5: works fine
save_steps/eval_steps=10: iterations 10 and 20 both work fine
save_steps/eval_steps=20: fails at iteration 20 while saving weights with the error below (the same happens with 200)

2026-01-09 17:40:31.169301 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/base.py", line 223, in save
2026-01-09 17:40:31.169303 master-0 >> [rank7]: async_request = self.async_save(sharded_state_dict, checkpoint_dir)
2026-01-09 17:40:31.169305 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169306 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 764, in async_save
2026-01-09 17:40:31.169307 master-0 >> [rank7]: ) = save_state_dict_async_plan(
2026-01-09 17:40:31.169309 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.16931 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/state_dict_saver.py", line 141, in save_state_dict_async_plan
2026-01-09 17:40:31.169312 master-0 >> [rank7]: all_local_plans = dist_wrapper.gather_object(local_plan)
2026-01-09 17:40:31.169313 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169315 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 135, in gather_object
2026-01-09 17:40:31.169318 master-0 >> [rank7]: dist.gather_object(
2026-01-09 17:40:31.169319 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169321 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.169322 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169324 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3301, in gather_object
2026-01-09 17:40:31.169325 master-0 >> [rank7]: gather(
2026-01-09 17:40:31.169327 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169328 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.16933 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169331 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4201, in gather
2026-01-09 17:40:31.169333 master-0 >> [rank7]: work = group.gather(output_tensors, input_tensors, opts)
2026-01-09 17:40:31.169334 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169336 master-0 >> [rank7]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
2026-01-09 17:41:12.871213 master-0 >> [rank7]:[W109 17:41:12.150540851 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2026-01-09 17:41:22.907089 master-0 >> W0109 17:41:22.906000 74 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 139 closing signal SIGTERM
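The RuntimeError itself suggests re-running with NCCL debug logging enabled to surface the underlying CUDA error. A minimal sketch (these are standard NCCL environment variables, not settings from the original report):

```shell
# Enable NCCL debug output before relaunching the failing run, as the
# RuntimeError message suggests, to see the underlying CUDA error.
export NCCL_DEBUG=INFO
# Optionally narrow the logs to collective operations (the failure is in gather).
export NCCL_DEBUG_SUBSYS=COLL
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

Setting these in the shell before launching `megatron sft` makes every rank print NCCL's internal diagnostics to stderr.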

Following #5594, I modified megatron/core/energy_monitor.py as shown below, but it had no effect:

    # def _get_energy(self) -> int:
    #     """Get current energy consumption from NVML."""
    #     try:
    #         return nvmlDeviceGetTotalEnergyConsumption(self._handle)
    #     except NVMLError:
    #         return self._last_energy  # return *something* if it errors
    def _get_energy(self) -> int:
        return self._last_energy  # return *something* if it errors
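For reference, the guarded-fallback pattern that the original (now commented-out) implementation used can be sketched in isolation. The class and method names below are illustrative stand-ins, not the real Megatron-LM EnergyMonitor:

```python
# Illustrative stand-in for Megatron's EnergyMonitor._get_energy: query NVML,
# and fall back to the last known reading if the query raises.
class EnergyMonitorSketch:
    def __init__(self):
        self._last_energy = 0  # updated on each successful NVML query

    def _query_nvml(self) -> int:
        # Stand-in for nvmlDeviceGetTotalEnergyConsumption(self._handle);
        # here it always fails, mimicking an NVMLError.
        raise RuntimeError("NVML query failed")

    def _get_energy(self) -> int:
        try:
            return self._query_nvml()
        except Exception:
            return self._last_energy  # return *something* if it errors


monitor = EnergyMonitorSketch()
print(monitor._get_energy())  # falls back to the last known value, 0
```

The patch above goes one step further and skips the NVML query entirely, always returning the cached value.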


How to Reproduce

The environment is the official image modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3,
with the following libraries upgraded or installed:
pip install transformers==5.2.0
pip install peft -U
pip install ms-swift==4.0.0
pip install flash-linear-attention==0.4.1
pip install causal-conv1d  # https://github.com/Dao-AILab/causal-conv1d/releases/tag/v1.6.0
plus flash-attn v3.

The hardware is 8x H100 GPUs.
The fine-tuning script is as follows:

export SWIFT_PATCH_CONV3D=False

EXP_NAME="xxx"

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export NPROC_PER_NODE=8 
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 
megatron sft \
    --model /workspace/models/pretrained/Qwen3.5-35B-A3B \
    --dataset "/workspace/swift_channel.jsonl" \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --split_dataset_ratio 0.02 \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 5e-4 \
    --micro_batch_size 1 \
    --global_batch_size 128 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 3 \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --cross_entropy_loss_fusion true \
    --lr 5e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --wandb_project moe-train \
    --wandb_exp_name "${EXP_NAME}" \
    --output_dir "megatron_output/${EXP_NAME}" \
    --eval_steps 20 \
    --save_steps 20 \
    --logging_steps 5 \
    --max_length 32000 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --tensor_model_parallel_size 2 \
    --sequence_parallel true \
    --attention_backend flash \
    --padding_free false \
    --use_rslora \
    --enable_channel_loss

The training data is plain text.
Other possibly relevant library versions:
megatron-core 0.15.0
pynvml 13.0.1

Additional Information

No response
