Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
Environment: Python 3.10
```shell
pip install -U ms-swift  # 4.0.0
pip install -U "transformers==5.3.0" "qwen_vl_utils==0.0.14" peft liger-kernel
```
GPU: V100-32G

Fine-tuning script:
```shell
export SKIP_MULTIMODAL_MTP_VALIDATION=1
export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export MAX_PIXELS=1003520
export VIDEO_MAX_PIXELS=50176
export FPS_MAX_FRAMES=12
export NPROC_PER_NODE=1
export CUDA_VISIBLE_DEVICES=0
swift sft \
    --model Qwen/Qwen3.5-9B \
    --tuner_type lora \
    --torch_dtype float16 \
    --dataset /home/powerop/work/split_word/ms-swift/qwen/data/train_data.jsonl \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing true \
    --max_length 2048 \
    --dataloader_num_workers 4 \
    --output_dir Output \
    --save_strategy epoch \
    --save_total_limit 100 \
    --save_only_model true \
    --deepspeed zero2
```
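As a sanity check on the dataset (independent of the error below), a small script can confirm that every row of the jsonl parses and carries a conversational `messages` field, which is the schema ms-swift documents for SFT data; the field name here is an assumption, so adjust it if your data uses a different format:

```python
import json

def check_jsonl(path):
    """Return a list of (line_number, reason) for rows that fail a basic check.

    Assumes the ms-swift conversational schema: each line is a JSON object
    with a non-empty "messages" list. Adjust the key for other formats.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                bad.append((lineno, "not valid JSON"))
                continue
            msgs = row.get("messages")
            if not isinstance(msgs, list) or not msgs:
                bad.append((lineno, "missing or empty 'messages'"))
    return bad
```

Running `check_jsonl("train_data.jsonl")` and printing the result before launching training catches malformed rows earlier than a mid-epoch collator crash would.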
The following error is raised:
```
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248044}.
Before initializing optimizer states
MA 17.77 GB Max_MA 17.85 GB CA 17.95 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
After initializing optimizer states
MA 17.77 GB Max_MA 17.93 GB CA 18.11 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
After initializing ZeRO optimizer
MA 17.77 GB Max_MA 17.77 GB CA 18.11 GB Max_CA 18 GB
CPU Virtual Memory: used = 2.51 GB, percent = 5.6%
Train: 0%| | 0/32630 [00:00<?, ?it/s][INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /home/powerop/work/split_word/ms-swift/Output/v10-20260306-114249/images
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in <module>
[rank0]:     sft_main()
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 354, in sft_main
[rank0]:     return SwiftSft(args).main()
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/base.py", line 52, in main
[rank0]:     result = self.run()
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/ray/base.py", line 168, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 197, in run
[rank0]:     return self.train(trainer)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/pipelines/train/sft.py", line 270, in train
[rank0]:     trainer.train(resume_checkpoint)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/trainers/mixin.py", line 916, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1424, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1506, in _inner_training_loop
[rank0]:     self._run_epoch(
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 1701, in _run_epoch
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, self.args.device)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/trainer.py", line 2102, in get_batch_samples
[rank0]:     batch_samples.append(next(epoch_iterator))
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/dataloader/shard.py", line 93, in __iter__
[rank0]:     for item in super().__iter__():
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
[rank0]:     return self._process_data(data, worker_id)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
[rank0]:     raise exception
[rank0]: TypeError: Caught TypeError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank0]:     return self.collate_fn(data)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/base.py", line 1516, in data_collator
[rank0]:     res = self._data_collator(batch, padding_to=padding_to)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/templates/qwen.py", line 461, in _data_collator
[rank0]:     res['position_ids'] = self._get_position_ids(res)
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/swift/template/templates/qwen.py", line 450, in _get_position_ids
[rank0]:     position_ids, _ = get_rope_index(
[rank0]:   File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 1548, in get_rope_index
[rank0]:     input_token_type = mm_token_type_ids[batch_idx]
[rank0]: TypeError: 'NoneType' object is not subscriptable
Train: 0%| | 0/32630 [00:02<?, ?it/s]
[rank0]:[W306 11:43:17.903446885 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0306 11:43:19.498000 5880 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5921) of binary: /home/powerop/.conda/envs/split_word_train/bin/python3.10
Traceback (most recent call last):
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in <module>
    main()
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/powerop/.conda/envs/split_word_train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
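For context on the TypeError itself: per the traceback, `get_rope_index` indexes `mm_token_type_ids[batch_idx]` without checking for `None`, so any batch for which the collator did not produce `mm_token_type_ids` fails exactly like this. A minimal sketch of the pattern, plus an obvious guard; names follow the traceback, and the guard is only an illustration of one possible fix, not the library's actual code:

```python
# Reproduce the failure pattern from the traceback: indexing into
# mm_token_type_ids when the collator left it as None.
mm_token_type_ids = None  # what the collator produced for this batch
batch_idx = 0

try:
    input_token_type = mm_token_type_ids[batch_idx]
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable

# A defensive guard in the same spirit: fall back to treating every
# token as plain text when no multimodal token-type map is available.
def token_types_for(mm_token_type_ids, batch_idx, seq_len):
    if mm_token_type_ids is None:
        return [0] * seq_len  # 0 = text token (assumed convention)
    return mm_token_type_ids[batch_idx]

print(token_types_for(None, 0, 4))  # [0, 0, 0, 0]
```

This suggests the bug is triggered whenever a batch reaches the multimodal position-id path without multimodal inputs (for example, text-only rows in the dataset), though that diagnosis is an inference from the traceback rather than a confirmed root cause.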
How to Reproduce
Same environment, fine-tuning script, and error log as in the Bug Description above.
Additional Information
No response