
Qwen3.5-35B-A3B fine-tuning: NCCL Error 1: unhandled cuda error when saving weights #8228

@ooochen-30

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

save_steps/eval_steps=5: works fine
save_steps/eval_steps=10: iterations 10 and 20 both work fine
save_steps/eval_steps=20: fails at iteration 20 while saving weights with the error below (the same happens with 200)

2026-01-09 17:40:31.169301 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/base.py", line 223, in save
2026-01-09 17:40:31.169303 master-0 >> [rank7]: async_request = self.async_save(sharded_state_dict, checkpoint_dir)
2026-01-09 17:40:31.169305 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169306 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 764, in async_save
2026-01-09 17:40:31.169307 master-0 >> [rank7]: ) = save_state_dict_async_plan(
2026-01-09 17:40:31.169309 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.16931 master-0 >> [rank7]: File "/root/.cache/modelscope/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/state_dict_saver.py", line 141, in save_state_dict_async_plan
2026-01-09 17:40:31.169312 master-0 >> [rank7]: all_local_plans = dist_wrapper.gather_object(local_plan)
2026-01-09 17:40:31.169313 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169315 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 135, in gather_object
2026-01-09 17:40:31.169318 master-0 >> [rank7]: dist.gather_object(
2026-01-09 17:40:31.169319 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169321 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.169322 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169324 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3301, in gather_object
2026-01-09 17:40:31.169325 master-0 >> [rank7]: gather(
2026-01-09 17:40:31.169327 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
2026-01-09 17:40:31.169328 master-0 >> [rank7]: return func(*args, **kwargs)
2026-01-09 17:40:31.16933 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169331 master-0 >> [rank7]: File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4201, in gather
2026-01-09 17:40:31.169333 master-0 >> [rank7]: work = group.gather(output_tensors, input_tensors, opts)
2026-01-09 17:40:31.169334 master-0 >> [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-09 17:40:31.169336 master-0 >> [rank7]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
2026-01-09 17:41:12.871213 master-0 >> [rank7]:[W109 17:41:12.150540851 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2026-01-09 17:41:22.907089 master-0 >> W0109 17:41:22.906000 74 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 139 closing signal SIGTERM
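The RuntimeError itself suggests re-running with NCCL debug logging enabled to surface the underlying CUDA error. A minimal sketch (these are standard NCCL environment variables, not settings from the original report):

```shell
# Enable NCCL debug output before relaunching the failing run, as the
# RuntimeError message suggests, to see the underlying CUDA error.
export NCCL_DEBUG=INFO
# Optionally narrow the logs to collective operations (the failure is in gather).
export NCCL_DEBUG_SUBSYS=COLL
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

Setting these in the shell before launching `megatron sft` makes every rank print NCCL's internal diagnostics to stderr.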

Following #5594, I modified megatron/core/energy_monitor.py as shown below, but it had no effect:

    # def _get_energy(self) -> int:
    #     """Get current energy consumption from NVML."""
    #     try:
    #         return nvmlDeviceGetTotalEnergyConsumption(self._handle)
    #     except NVMLError:
    #         return self._last_energy  # return *something* if it errors
    def _get_energy(self) -> int:
        return self._last_energy  # return *something* if it errors
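For reference, the guarded-fallback pattern that the original (now commented-out) implementation used can be sketched in isolation. The class and method names below are illustrative stand-ins, not the real Megatron-LM EnergyMonitor:

```python
# Illustrative stand-in for Megatron's EnergyMonitor._get_energy: query NVML,
# and fall back to the last known reading if the query raises.
class EnergyMonitorSketch:
    def __init__(self):
        self._last_energy = 0  # updated on each successful NVML query

    def _query_nvml(self) -> int:
        # Stand-in for nvmlDeviceGetTotalEnergyConsumption(self._handle);
        # here it always fails, mimicking an NVMLError.
        raise RuntimeError("NVML query failed")

    def _get_energy(self) -> int:
        try:
            return self._query_nvml()
        except Exception:
            return self._last_energy  # return *something* if it errors


monitor = EnergyMonitorSketch()
print(monitor._get_energy())  # falls back to the last known value, 0
```

The patch above goes one step further and skips the NVML query entirely, always returning the cached value.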


How to Reproduce

The environment is the official image modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3,
with the following libraries upgraded or installed:
pip install transformers==5.2.0
pip install peft -U
pip install ms-swift==4.0.0
pip install flash-linear-attention==0.4.1
pip install causal-conv1d  # https://github.com/Dao-AILab/causal-conv1d/releases/tag/v1.6.0
plus flash-attn v3.

The hardware is 8x H100 GPUs.
The fine-tuning script is as follows:

export SWIFT_PATCH_CONV3D=False

EXP_NAME="xxx"

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export NPROC_PER_NODE=8 
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 
megatron sft \
    --model /workspace/models/pretrained/Qwen3.5-35B-A3B \
    --dataset "/workspace/swift_channel.jsonl" \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --split_dataset_ratio 0.02 \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 5e-4 \
    --micro_batch_size 1 \
    --global_batch_size 128 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 3 \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --cross_entropy_loss_fusion true \
    --lr 5e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --wandb_project moe-train \
    --wandb_exp_name "${EXP_NAME}" \
    --output_dir "megatron_output/${EXP_NAME}" \
    --eval_steps 20 \
    --save_steps 20 \
    --logging_steps 5 \
    --max_length 32000 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --tensor_model_parallel_size 2 \
    --sequence_parallel true \
    --attention_backend flash \
    --padding_free false \
    --use_rslora \
    --enable_channel_loss

The training data is plain text.
Other possibly relevant library versions:
megatron-core 0.15.0
pynvml 13.0.1

Additional Information

No response
