Multi GPU not training #13

@haris-waqar123

Description

Even though I set `device_ids = "0,1,2,3"` in config.toml (the comment there says "set the gpu devices on which you want to train your model"), GPU 0 ends up running 4 processes while the remaining 3 GPUs each run only a single process.

Training then fails with the error below:

`Epoch 0:
2025-09-08 10:43:48.815435: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[the cpu_feature_guard message above is emitted once per spawned process/worker; 15 further timestamped copies omitted]
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x75e2ac6c5b20>
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1618, in __del__
self._shutdown_workers()
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1582, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.11/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 40, in wait
if not wait([self.sentinel], timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/connection.py", line 948, in wait
ready = selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2240211) is killed by signal: Aborted.
[four more "terminate called without an active exception" lines and two further DataLoader-worker tracebacks identical to the one above, ending in "RuntimeError: DataLoader worker (pid 2240991) is killed by signal: Aborted." and "RuntimeError: DataLoader worker (pid 2240210) is killed by signal: Aborted.", omitted]
W0908 10:44:40.490000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238616 via signal SIGTERM
W0908 10:44:40.491000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238618 via signal SIGTERM
W0908 10:44:40.491000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238619 via signal SIGTERM
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/train.py", line 192, in <module>
mp.spawn(
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/train.py", line 173, in main
trainer.train()
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/base/base_trainer.py", line 221, in train
self._train_epoch(epoch)
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/trainer/trainer.py", line 94, in _train_epoch
outputs = self.model(**batch)
^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1867, in forward
outputs = self.wav2vec2(
^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1467, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 831, in forward
layer_outputs = layer(
^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 669, in forward
hidden_states, attn_weights, _ = self.attention(
^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 556, in forward
value_states = self.v_proj(current_states).view(*kv_input_shape).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 102.00 MiB. GPU 1 has a total capacity of 39.38 GiB of which 5.38 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 130.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
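The "4 processes on GPU 0" pattern usually means every spawned rank initializes a CUDA context on `cuda:0` (for example via a bare `.cuda()` call) before it is pinned to its own device; those extra contexts also eat memory on that card. A minimal sketch of per-rank device pinning, as an assumption about the likely cause rather than a confirmed reading of this repo's train.py (`parse_device_ids`, `device_for_rank`, and `worker` are hypothetical helpers):

```python
def parse_device_ids(spec: str) -> list[int]:
    """Turn a config string like "0,1,2,3" into a list of GPU indices."""
    return [int(tok) for tok in spec.split(",") if tok.strip()]


def device_for_rank(device_ids: list[int], rank: int) -> str:
    """Map a spawned process rank to its own CUDA device string."""
    return f"cuda:{device_ids[rank]}"


def worker(rank: int, device_ids: list[int]) -> None:
    # Inside each mp.spawn() worker, pin the device FIRST, before any
    # tensor or model touches CUDA; otherwise the default context lands
    # on cuda:0 and shows up as an extra process in nvidia-smi.
    import torch  # deferred so the sketch stays importable without a GPU
    torch.cuda.set_device(device_for_rank(device_ids, rank))
    # ... then build the model on that device and wrap it in DDP with
    # device_ids=[device_ids[rank]] ...
```

Checking `nvidia-smi` after this change should show exactly one process per listed GPU.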

[attached image]
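As the OOM message itself suggests, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` can help with fragmentation, but allocator settings only take effect if they are in the environment before CUDA is first initialized. Once the duplicate contexts on GPU 0 are gone, the usual remaining levers are a smaller per-device batch size and/or gradient accumulation. A sketch, with hypothetical config keys not taken from this repo's config.toml:

```python
import os

# Allocator knobs must be set before torch initializes CUDA, so put this
# at the very top of train.py (or export it in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Hypothetical OOM-mitigation settings: shrink the per-GPU batch and
# recover the effective batch size via gradient accumulation.
example_config = {
    "batch_size": 4,                   # per-GPU batch size
    "gradient_accumulation_steps": 8,  # effective batch = 4 * 8 per GPU
}
```

The shell equivalent is `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py`.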
