Error when fine-tuning deepseek_r1 with swift-megatron on a 4-machine × 8-card Ascend 910B3 environment #7330

@smallshallot

Description

Describe the bug
What the bug is, and how to reproduce, preferably with screenshots
On a 4-machine × 8-card Ascend 910B3 environment, fine-tuning deepseek_r1 with swift-megatron errors out. The launch scripts are as follows.
The master-node script node0.sh:

export PYTHONPATH=$PYTHONPATH:/NFS/zsj/train/swift/code/Megatron-LM
export MEGATRON_LM_PATH=/NFS/zsj/train/swift/code/Megatron-LM

nproc_per_node=8
nnodes=4

NNODES=$nnodes \
NODE_RANK=0 \
MASTER_ADDR=192.168.70.2 \
MASTER_PORT=29512 \
NPROC_PER_NODE=$nproc_per_node \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HCCL_SOCKET_IFNAME=enp67s0f0np0 \
megatron sft \
--model '/NFS/pruning models/pruning_output_models_80_final_layer_256_1201' \
--model_type deepseek_r1 \
--torch_dtype bfloat16 \
--dataset '/NFS/xk/train_data/ori_data/10456_1201.json' \
--save /NFS/xk/saved_models \
--train_type 'lora' \
--lora_rank 8 \
--lora_alpha 32 \
--freeze_parameters_ratio 1 \
--trainable_parameters_regex ".*.experts.*." \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--context_parallel_size 1 \
--sequence_parallel true \
--micro_batch_size 1 \
--global_batch_size 64 \
--recompute_granularity selective \
--recompute_modules core_attn \
--cross_entropy_loss_fusion true \
--no_gradient_accumulation_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--log_interval 5 \
--num_workers 4 

The scripts for worker nodes 1-3 are identical except for NODE_RANK:

export PYTHONPATH=$PYTHONPATH:/NFS/zsj/train/swift/code/Megatron-LM
export MEGATRON_LM_PATH=/NFS/zsj/train/swift/code/Megatron-LM

nproc_per_node=8
nnodes=4

NNODES=$nnodes \
NODE_RANK=1 \
MASTER_ADDR=192.168.70.2 \
MASTER_PORT=29512 \
NPROC_PER_NODE=$nproc_per_node \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HCCL_SOCKET_IFNAME=enp67s0f0np0 \
megatron sft \
--model '/NFS/pruning models/pruning_output_models_80_final_layer_256_1201' \
--model_type deepseek_r1 \
--torch_dtype bfloat16 \
--dataset '/NFS/xk/train_data/ori_data/10456_1201.json' \
--save /NFS/xk/saved_models \
--train_type 'lora' \
--lora_rank 8 \
--lora_alpha 32 \
--freeze_parameters_ratio 1 \
--trainable_parameters_regex ".*.experts.*." \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--context_parallel_size 1 \
--sequence_parallel true \
--micro_batch_size 1 \
--global_batch_size 64 \
--recompute_granularity selective \
--recompute_modules core_attn \
--cross_entropy_loss_fusion true \
--no_gradient_accumulation_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--log_interval 5 \
--num_workers 4 
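
Since the four per-node scripts differ only in NODE_RANK, a single parameterized launcher avoids copy-paste drift between the copies. A minimal sketch (launch.sh is a hypothetical name; it reuses the paths and flags from node0.sh above):

#!/bin/bash
# launch.sh -- run as `bash launch.sh <node_rank>` on each of the 4 machines.
set -euo pipefail
NODE_RANK=${1:?usage: bash launch.sh <node_rank 0-3>}

export PYTHONPATH=${PYTHONPATH:-}:/NFS/zsj/train/swift/code/Megatron-LM
export MEGATRON_LM_PATH=/NFS/zsj/train/swift/code/Megatron-LM

NNODES=4 \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=192.168.70.2 \
MASTER_PORT=29512 \
NPROC_PER_NODE=8 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HCCL_SOCKET_IFNAME=enp67s0f0np0 \
megatron sft \
  ...  # append the same training flags as in node0.sh above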

The error is as follows:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/cli/_megatron/sft.py", line 7, in <module>
[rank1]:     megatron_sft_main()
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank1]:     return MegatronSft(args).main()
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/llm/base.py", line 49, in main
[rank1]:     result = self.run()
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/train/sft.py", line 77, in run
[rank1]:     self.trainer.train(train_dataset, val_dataset, data_collator)
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/trainers/base.py", line 1098, in train
[rank1]:     pretrain(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/training/training.py", line 746, in pretrain
[rank1]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/trainers/base.py", line 498, in setup_model_and_optimizer
[rank1]:     model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/training/training.py", line 1116, in setup_model_and_optimizer
[rank1]:     model = get_model(model_provider_func, model_type)
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/training/training.py", line 942, in get_model
[rank1]:     model = build_model()
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/training/training.py", line 932, in build_model
[rank1]:     model = model_provider_func(
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/trainers/base.py", line 476, in new_model_provider_func
[rank1]:     model = model_provider_func(*_args, **kwargs)
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/model/model_provider.py", line 149, in model_provider
[rank1]:     model = megatron_model_meta.model_cls(
[rank1]:   File "/data1/aibd/zsj/model_train/ms-swift/swift/megatron/model/gpt_model.py", line 93, in __init__
[rank1]:     super().__init__(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 169, in __init__
[rank1]:     self.decoder = TransformerBlock(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/transformer_block.py", line 267, in __init__
[rank1]:     self._build_layers()
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/transformer_block.py", line 293, in _build_layers
[rank1]:     [
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/transformer_block.py", line 294, in <listcomp>
[rank1]:     build_layer(layer_spec, i + 1)
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/transformer_block.py", line 288, in build_layer
[rank1]:     module = build_module(layer_spec, config=layer_config, layer_number=layer_number)
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 104, in build_module
[rank1]:     raise type(e)(f"{str(e)} when instantiating {module.__name__}").with_traceback(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 97, in build_module
[rank1]:     return module(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 301, in __init__
[rank1]:     self.self_attention = build_module(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 104, in build_module
[rank1]:     raise type(e)(f"{str(e)} when instantiating {module.__name__}").with_traceback(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 97, in build_module
[rank1]:     return module(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/multi_latent_attention.py", line 243, in __init__
[rank1]:     super().__init__(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/multi_latent_attention.py", line 110, in __init__
[rank1]:     self.core_attention = build_module(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 104, in build_module
[rank1]:     raise type(e)(f"{str(e)} when instantiating {module.__name__}").with_traceback(
[rank1]:   File "/data1/aibd/zsj/model_train/Megatron-LM/megatron/core/transformer/spec_utils.py", line 97, in build_module
[rank1]:     return module(
[rank1]: TypeError: DotProductAttention.__init__() got an unexpected keyword argument 'k_channels' when instantiating DotProductAttention when instantiating MLASelfAttention when instantiating TransformerLayer
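
For context, the TypeError pattern points at a version mismatch: in megatron.core v0.12.x, MLASelfAttention builds its core-attention module with k_channels/v_channels keyword arguments, so the DotProductAttention class actually in effect here (MindSpeed monkey-patches Megatron classes at import time) appears to predate those parameters. A diagnostic sketch to print the signature that is live in the environment (the commented-out import is MindSpeed's documented adaptor hook):

python - <<'EOF'
import inspect
# import mindspeed.megatron_adaptor  # uncomment to inspect the patched class
from megatron.core.transformer.dot_product_attention import DotProductAttention
# If 'k_channels' is absent from this signature, the class in use predates the
# MLA core-attention changes and will fail exactly as in the traceback above.
print(inspect.signature(DotProductAttention.__init__))
EOF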

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
npu-smi version 25.2.0
4 machines × 8 cards, Ascend 910B3 environment
The environment was set up strictly following https://github.com/modelscope/ms-swift/blob/main/docs/source/BestPractices/NPU-support.md

# 1. Clone Megatron-LM and check out the core_v0.12.1 tag
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_v0.12.1
cd ..

# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout 0016137f0dcfeab3308e0d16994046740c0e4ad9
pip install -e .
cd ..
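
Since the traceback suggests mismatched revisions, it may be worth confirming on every node that the checkouts match what the NPU guide expects. A quick sanity-check sketch:

# Verify checked-out revisions (run on each of the 4 nodes).
(cd Megatron-LM && git describe --tags)   # expect: core_v0.12.1
(cd MindSpeed && git rev-parse HEAD)      # expect: 0016137f0dcfeab3308e0d16994046740c0e4ad9
pip show ms-swift torch torch_npu | grep -E '^(Name|Version)'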

Additional context
Add any other context about the problem here
