
Error when fine-tuning Qwen3.5-9B with LoRA on a V100 #8220

@wookpeckerjohn

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

python 3.10
pip install -U ms-swift # 4.0.0
pip install -U "transformers==5.3.0" "qwen_vl_utils==0.0.14" peft liger-kernel

GPU
V100-32G

Fine-tuning script
export SKIP_MULTIMODAL_MTP_VALIDATION=1
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export NPROC_PER_NODE=1
export CUDA_VISIBLE_DEVICES=0

swift sft \
    --model Qwen/Qwen3.5-9B \
    --tuner_type lora \
    --torch_dtype float16 \
    --dataset /home/powerop/work/split_word/ms-swift/qwen/data/train_data.jsonl \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing true \
    --max_length 2048 \
    --dataloader_num_workers 4 \
    --output_dir Output \
    --save_strategy epoch \
    --save_total_limit 100 \
    --save_only_model true \
    --deepspeed zero2

The following error is reported:
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248044}.
Before initializing optimizer states
MA 17.77 GB Max_MA 17.85 GB CA 17.95 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
After initializing optimizer states
MA 17.77 GB Max_MA 17.93 GB CA 18.11 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
After initializing ZeRO optimizer
MA 17.77 GB Max_MA 17.77 GB CA 18.11 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
Train: 0%| | 0/32630 [00:00<?, ?it/s][INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /home/powerop/work/split_word/ms-swift/Output/v10-20260306-114249/images
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in
[rank0]: sft_main()
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 354, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/base.py", line 52, in main
[rank0]: result = self.run()
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/ray/base.py", line 168, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 197, in run
[rank0]: return self.train(trainer)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 270, in train
[rank0]: trainer.train(resume_checkpoint)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/trainers/mixin.py", line 916, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1424, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1506, in _inner_training_loop
[rank0]: self._run_epoch(
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1701, in _run_epoch
[rank0]: batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, self.args.device)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 2102, in get_batch_samples
[rank0]: batch_samples.append(next(epoch_iterator))
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/dataloader/shard.py", line 93, in __iter__
[rank0]: for item in super().__iter__():
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
[rank0]: data = self._next_data()
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
[rank0]: return self._process_data(data, worker_id)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
[rank0]: data.reraise()
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
[rank0]: raise exception
[rank0]: TypeError: Caught TypeError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank0]: return self.collate_fn(data)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/base.py", line 1516, in data_collator
[rank0]: res = self._data_collator(batch, padding_to=padding_to)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/templates/qwen.py", line 461, in _data_collator
[rank0]: res['position_ids'] = self._get_position_ids(res)
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/templates/qwen.py", line 450, in _get_position_ids
[rank0]: position_ids, _ = get_rope_index(
[rank0]: File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 1548, in get_rope_index
[rank0]: input_token_type = mm_token_type_ids[batch_idx]
[rank0]: TypeError: 'NoneType' object is not subscriptable

Train: 0%| | 0/32630 [00:02<?, ?it/s]
[rank0]:[W306 11:43:17.903446885 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0306 11:43:19.498000 5880 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5921) of binary: /home/powerop/.conda/envs/split_word_train/bin/python3.10
Traceback (most recent call last):
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in
main()
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
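The root cause is the last frame: `get_rope_index` indexes `mm_token_type_ids[batch_idx]` without checking for `None`, which is what the collator passes when a batch carries no multimodal token-type information (e.g., text-only samples). A minimal sketch of the failure mode and a defensive fallback (the function name and signature here are illustrative, not the actual swift/transformers API):

```python
# Sketch of the failing pattern in get_rope_index: when a batch has no
# multimodal token-type ids, mm_token_type_ids is None, and indexing it
# raises TypeError: 'NoneType' object is not subscriptable.

def get_rope_index_sketch(input_ids, mm_token_type_ids=None):
    """Illustrative stand-in for the failing logic (hypothetical names)."""
    position_ids = []
    for batch_idx, row in enumerate(input_ids):
        # The real failing line is equivalent to:
        #   input_token_type = mm_token_type_ids[batch_idx]
        # A defensive fallback treats every token as plain text instead:
        if mm_token_type_ids is None:
            input_token_type = [0] * len(row)
        else:
            input_token_type = mm_token_type_ids[batch_idx]
        # Plain sequential positions suffice for text tokens (the real
        # function builds 3D M-RoPE indices for image/video tokens).
        position_ids.append(list(range(len(row))))
    return position_ids

print(get_rope_index_sketch([[1, 2, 3, 4]]))  # [[0, 1, 2, 3]]
```

With the guard removed, a text-only batch reproduces exactly the reported `TypeError`, which suggests the bug is triggered by samples without image/video fields rather than by the V100 or the LoRA settings.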

How to Reproduce

Same environment, fine-tuning script, and error output as in the Bug Description above.

Additional Information

No response

Labels: bug