Node crashes when preparing datasets on multi-node environment #8234

@gonggongjohn

Description

Checklist / 检查清单

  • I have searched existing issues, and this is a new bug report. / 我已经搜索过现有的 issues,确认这是一个新的 bug report。

Bug Description / Bug 描述

The framework produced the following error when I tried to do distributed finetuning on a multi-node cluster:

[rank8]: Traceback (most recent call last):
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/cli/_megatron/sft.py", line 7, in <module>
[rank8]:     megatron_sft_main()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/megatron/pipelines/train/sft.py", line 88, in megatron_sft_main
[rank8]:     return MegatronSft(args).main()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/base.py", line 52, in main
[rank8]:     result = self.run()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/megatron/pipelines/train/sft.py", line 63, in run
[rank8]:     train_dataset, val_dataset = self._prepare_dataset()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/ray/base.py", line 168, in wrapper
[rank8]:     return func(self, *args, **kwargs)
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/train/sft.py", line 120, in _prepare_dataset
[rank8]:     train_dataset, val_dataset = self._get_dataset()
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/pipelines/train/sft.py", line 84, in _get_dataset
[rank8]:     train_dataset, val_dataset = load_dataset(
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 327, in load_dataset
[rank8]:     train_dataset = loader.load(dataset_syntax, dataset_meta, use_hf=use_hf)
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 159, in load
[rank8]:     dataset = self._load_dataset_path(
[rank8]:   File "/mnt/data/ongoing/ms-swift/swift/dataset/loader.py", line 55, in _load_dataset_path
[rank8]:     dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/datasets/load.py", line 2084, in load_dataset
[rank8]:     builder_instance.download_and_prepare(
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/datasets/builder.py", line 860, in download_and_prepare
[rank8]:     with FileLock(lock_path) if is_local else contextlib.nullcontext():
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_api.py", line 550, in __enter__
[rank8]:     self.acquire()
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_api.py", line 498, in acquire
[rank8]:     self._acquire()
[rank8]:   File "/mnt/data/miniconda/envs/swift-env/lib/python3.10/site-packages/filelock/_unix.py", line 49, in _acquire
[rank8]:     fd = os.open(self.lock_file, open_flags, open_mode)
[rank8]: FileExistsError: [Errno 17] File exists: '/mnt/data/ongoing/ms-cache/datasets/json/default-1d036539cdfac9c2/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092_builder.lock'
[rank8]:[W306 13:47:06.167415737 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

The dataset preparation step appears to be invoked on every node. Since all nodes share the same cache directory, some ranks crash while attempting to acquire the builder file lock for the temporary dataset files.
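A common workaround for this class of failure (offered here only as a sketch, not as the framework's intended fix) is to let a single rank materialize the shared dataset cache and have all other ranks wait on a barrier before reading it, so they only ever see a warm cache. A minimal sketch with `torch.distributed`, assuming the process group is already initialized and the cache directory is on a filesystem visible to all ranks; `prepare` stands in for whatever call builds the dataset:

```python
import torch.distributed as dist

def run_on_rank0_first(prepare):
    """Run `prepare` on global rank 0 first, then on the remaining ranks.

    Assumes the default process group (if any) is already initialized and
    that `prepare` writes its cache to a shared filesystem. `prepare` is a
    hypothetical callable standing in for the actual dataset-loading call.
    """
    in_dist = dist.is_available() and dist.is_initialized()
    rank = dist.get_rank() if in_dist else 0

    # Rank 0 builds the cache; every other rank skips this step for now.
    result = prepare() if rank == 0 else None

    if in_dist:
        dist.barrier()  # wait until rank 0 has finished writing the cache

    if rank != 0:
        # Non-zero ranks now load from the warm cache, so the
        # `*_builder.lock` file is held only briefly (or not at all).
        result = prepare()
    return result
```

With this pattern, `_prepare_dataset` on non-zero ranks never races rank 0 for the builder lock, which should avoid the `FileExistsError` above.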

How to Reproduce / 如何复现

The launch script used is the following:

export MODELSCOPE_CACHE=/mnt/data/ongoing/ms-cache
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export OMP_NUM_THREADS=14
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export NNODES=4
export SKIP_MULTIMODAL_MTP_VALIDATION=1

megatron sft \
    --model "/mnt/data/ongoing/models/Qwen3.5-35B-A3B" \
    --save_safetensors true \
    --dataset "/mnt/data/ongoing/sft-data/example-data.jsonl" \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --split_dataset_ratio 0.01 \
    --tuner_type full \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 4 \
    --global_batch_size 128 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 3 \
    --group_by_length true \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --decoder_first_pipeline_num_layers 24 \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --output_dir /mnt/data/ongoing/ms-swift/output/Qwen3.5-35B-A3B-test-train \
    --eval_steps 8000 \
    --save_steps 8000 \
    --max_length 32768 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --moe_expert_capacity_factor 2 \
    --mtp_num_layers 1 \
    --optimizer_cpu_offload true \
    --use_precision_aware_optimizer true \
    --optimizer_offload_fraction 0.62 \
    --attention_backend flash \
    --padding_free false

The SWIFT environment was built from source at commit 78d1aba (2026/03/06).

Versions of the most relevant libraries:

datasets==3.6.0
megatron-core==0.15.3
pyarrow==23.0.1
torch==2.10.0
transformers==5.2.0
transformer_engine_cu12==2.12.0

Additional Information / 补充信息

No response
