Node crashes when preparing datasets on multi-node environment #8234

@gonggongjohn

Description

Checklist / 检查清单

  • I have searched existing issues, and this is a new bug report. / 我已经搜索过现有的 issues,确认这是一个新的 bug report。

Bug Description / Bug 描述

The framework produced the following error when I tried to do distributed finetuning on a multi-node cluster:

[rank8]: Traceback (most recent call last):
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/cli/_megatron/sft.py", line 7, in <module>
[rank8]:     megatron_sft_main()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/megatron/pipelines/train/sft.py", line 88, in megatron_sft_main
[rank8]:     return MegatronSft(args).main()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/base.py", line 52, in main
[rank8]:     result = self.run()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/megatron/pipelines/train/sft.py", line 63, in run
[rank8]:     train_dataset, val_dataset = self._prepare_dataset()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/ray/base.py", line 168, in wrapper
[rank8]:     return func(self, *args, **kwargs)
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/train/sft.py", line 120, in _prepare_dataset
[rank8]:     train_dataset, val_dataset = self._get_dataset()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/train/sft.py", line 84, in _get_dataset
[rank8]:     train_dataset, val_dataset = load_dataset(
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 327, in load_dataset
[rank8]:     train_dataset = loader.load(dataset_syntax, dataset_meta, use_hf=use_hf)
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 159, in load
[rank8]:     dataset = self._load_dataset_path(
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 55, in _load_dataset_path
[rank8]:     dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/datasets/load.py", line 2084, in load_dataset
[rank8]:     builder_instance.download_and_prepare(
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/datasets/builder.py", line 860, in download_and_prepare
[rank8]:     with FileLock(lock_path) if is_local else contextlib.nullcontext():
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_api.py", line 550, in __enter__
[rank8]:     self.acquire()
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_api.py", line 498, in acquire
[rank8]:     self._acquire()
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_unix.py", line 49, in _acquire
[rank8]:     fd = os.open(self.lock_file, open_flags, open_mode)
[rank8]: FileExistsError: [Errno 17] File exists: '/mnt/data/ongoing/ms-cache/datasets/json/default-1d036539cdfac9c2/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092_builder.lock'
[rank8]:[W306 13:47:06.167415737 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

The dataset preparation step appears to be invoked on every node. Since all nodes share the same cache directory, some ranks crash while attempting to acquire the builder file lock for the temporary dataset files.
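A common workaround for this class of failure (offered here only as a sketch, not as the framework's intended fix) is to let a single rank materialize the shared dataset cache and have all other ranks wait on a barrier before reading it, so they only ever see a warm cache. A minimal sketch with `torch.distributed`, assuming the process group is already initialized and the cache directory is on a filesystem visible to all ranks; `prepare` stands in for whatever call builds the dataset:

```python
import torch.distributed as dist

def run_on_rank0_first(prepare):
    """Run `prepare` on global rank 0 first, then on the remaining ranks.

    Assumes the default process group (if any) is already initialized and
    that `prepare` writes its cache to a shared filesystem. `prepare` is a
    hypothetical callable standing in for the actual dataset-loading call.
    """
    in_dist = dist.is_available() and dist.is_initialized()
    rank = dist.get_rank() if in_dist else 0

    # Rank 0 builds the cache; every other rank skips this step for now.
    result = prepare() if rank == 0 else None

    if in_dist:
        dist.barrier()  # wait until rank 0 has finished writing the cache

    if rank != 0:
        # Non-zero ranks now load from the warm cache, so the
        # `*_builder.lock` file is held only briefly (or not at all).
        result = prepare()
    return result
```

With this pattern, `_prepare_dataset` on non-zero ranks never races rank 0 for the builder lock, which should avoid the `FileExistsError` above.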

How to Reproduce / 如何复现

The launch script used is the following:

export MODELSCOPE_CACHE=/mnt/data/ongoing/ms-cache
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export OMP_NUM_THREADS=14
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export NNODES=4
export SKIP_MULTIMODAL_MTP_VALIDATION=1

megatron sft \
    --model "/mnt/data/ongoing/models/Qwen3.5-35B-A3B" \
    --save_safetensors true \
    --dataset "/mnt/data/ongoing/sft-data/example-data.jsonl" \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --split_dataset_ratio 0.01 \
    --tuner_type full \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 4 \
    --global_batch_size 128 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 3 \
    --group_by_length true \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --decoder_first_pipeline_num_layers 24 \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --output_dir /mnt/data/ongoing/ms-swift/output/Qwen3.5-35B-A3B-test-train \
    --eval_steps 8000 \
    --save_steps 8000 \
    --max_length 32768 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --moe_expert_capacity_factor 2 \
    --mtp_num_layers 1 \
    --optimizer_cpu_offload true \
    --use_precision_aware_optimizer true \
    --optimizer_offload_fraction 0.62 \
    --attention_backend flash \
    --padding_free false

The SWIFT environment was built from source at commit 78d1aba (2026/03/06).

Versions of the most relevant libraries:

datasets==3.6.0
megatron-core==0.15.3
pyarrow==23.0.1
torch==2.10.0
transformers==5.2.0
transformer_engine_cu12==2.12.0

Additional Information / 补充信息

No response
