Description
Thanks to your help, the dataset problem has been resolved. Thank you. A few more missing-data issues came up afterwards, but the missing items were all in the files I downloaded from the data hub, so they were easy to fix.
Now, however, training is blocked again, this time by "RuntimeError: shape '[16, 2048, 32, 128]' is invalid for input of size 33554432".
I understand that the shape comes from .view(bsz, q_len, self.num_heads, self.head_dim), and that 'bsz' and 'q_len' are determined by per_device_train_batch_size and model_max_length in "finetune_lora.sh". So I changed 'bsz' from 16 to 4, which only produced "RuntimeError: shape '[4, 2048, 32, 128]' is invalid for input of size 8388608". Changing 'q_len' did not help either.
I also checked whether the global batch size (128) of the KoLLaVA-v1.5-Synatra-7B fine-tuning recipe contributes to this problem, but that does not seem to be it either. (To keep the global batch size at 128, I set gradient_accumulation_steps to 4, since my server has 2 GPUs.)
Likewise this does not appear to be related to 'self.num_heads' or 'self.head_dim' at all, so which value should I change?
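A quick sanity check on the numbers in the error (a sketch, not a confirmed diagnosis; the 8 key/value heads below are my assumption, based on Mistral-style grouped-query attention, which a Synatra-7B base model would inherit):

```python
# Compare the element count the patched .view() asks for with the size
# reported by the RuntimeError.
bsz, q_len, num_heads, head_dim = 16, 2048, 32, 128

requested = bsz * q_len * num_heads * head_dim   # what .view(bsz, q_len, 32, 128) needs
reported = 33_554_432                            # "input of size 33554432" from the error

print(requested)                  # 134217728 -> exactly 4x the actual tensor size
print(reported // (bsz * q_len))  # 1024 = 8 * 128, i.e. 8 heads of dim 128, not 32

# Assumption: if the base model uses grouped-query attention with
# num_key_value_heads = 8 (as Mistral-7B does), the k/v projections output
# 8 * 128 = 1024 features per token, and a LLaMA-style attention patch that
# views them with num_heads = 32 would fail with exactly this element count.
num_key_value_heads = 8
print(bsz * q_len * num_key_value_heads * head_dim)  # 33554432, matches the error
```

The same arithmetic holds for the bsz=4 attempt: 4 * 2048 * 8 * 128 = 8388608, the size in the second error, which would explain why changing bsz or q_len cannot fix it.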
Thanks again. I am sharing just the error portion of the console output below.
wandb: 🚀 View run at https://wandb.ai/jiwon_ha/huggingface/runs/d5rk2eng
0%| | 0/4543 [00:00<?, ?it/s]/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
File "/home/work/testdataset1/KoLLaVA/llava/train/train_xformers.py", line 13, in
train()
File "/home/work/testdataset1/KoLLaVA/llava/train/train.py", line 933, in train
trainer.train()
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
loss = self.module(*inputs, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
return self.base_model(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/work/testdataset1/KoLLaVA/llava/model/language_model/llava_llama.py", line 88, in forward
return super().forward(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
outputs = self.model(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
return fn(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
return fn(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 487, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 262, in forward
outputs = run_function(*args)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/work/testdataset1/KoLLaVA/llava/train/llama_xformers_attn_monkey_patch.py", line 42, in xformers_forward
.view(bsz, q_len, self.num_heads, self.head_dim)
RuntimeError: shape '[16, 2048, 32, 128]' is invalid for input of size 33554432