Multi GPU not training #13

@haris-waqar123

Description

Even though I set `device_ids = "0,1,2,3"` in config.toml (the comment there says "set the gpu devices on which you want to train your model"), GPU 0 ends up running 4 processes while the remaining 3 GPUs each run only a single process.

Training then fails with the error below:

`Epoch 0:
2025-09-08 10:43:48.815435: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[the cpu_feature_guard message above is emitted once per spawned process/worker; 15 further timestamped copies omitted]
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x75e2ac6c5b20>
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1618, in __del__
self._shutdown_workers()
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1582, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.11/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 40, in wait
if not wait([self.sentinel], timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/connection.py", line 948, in wait
ready = selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2240211) is killed by signal: Aborted.
[four more "terminate called without an active exception" lines and two further DataLoader-worker tracebacks identical to the one above, ending in "RuntimeError: DataLoader worker (pid 2240991) is killed by signal: Aborted." and "RuntimeError: DataLoader worker (pid 2240210) is killed by signal: Aborted.", omitted]
W0908 10:44:40.490000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238616 via signal SIGTERM
W0908 10:44:40.491000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238618 via signal SIGTERM
W0908 10:44:40.491000 2238354 .venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 2238619 via signal SIGTERM
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/train.py", line 192, in <module>
mp.spawn(
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/train.py", line 173, in main
trainer.train()
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/base/base_trainer.py", line 221, in train
self._train_epoch(epoch)
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/trainer/trainer.py", line 94, in _train_epoch
outputs = self.model(**batch)
^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1867, in forward
outputs = self.wav2vec2(
^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1467, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 831, in forward
layer_outputs = layer(
^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 669, in forward
hidden_states, attn_weights, _ = self.attention(
^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 556, in forward
value_states = self.v_proj(current_states).view(*kv_input_shape).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nvme0n1-disk/ASR_Training/ASR-Wav2vec-Finetune/.venv/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 102.00 MiB. GPU 1 has a total capacity of 39.38 GiB of which 5.38 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 130.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
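The "4 processes on GPU 0" pattern usually means every spawned rank initializes a CUDA context on `cuda:0` (for example via a bare `.cuda()` call) before it is pinned to its own device; those extra contexts also eat memory on that card. A minimal sketch of per-rank device pinning, as an assumption about the likely cause rather than a confirmed reading of this repo's train.py (`parse_device_ids`, `device_for_rank`, and `worker` are hypothetical helpers):

```python
def parse_device_ids(spec: str) -> list[int]:
    """Turn a config string like "0,1,2,3" into a list of GPU indices."""
    return [int(tok) for tok in spec.split(",") if tok.strip()]


def device_for_rank(device_ids: list[int], rank: int) -> str:
    """Map a spawned process rank to its own CUDA device string."""
    return f"cuda:{device_ids[rank]}"


def worker(rank: int, device_ids: list[int]) -> None:
    # Inside each mp.spawn() worker, pin the device FIRST, before any
    # tensor or model touches CUDA; otherwise the default context lands
    # on cuda:0 and shows up as an extra process in nvidia-smi.
    import torch  # deferred so the sketch stays importable without a GPU
    torch.cuda.set_device(device_for_rank(device_ids, rank))
    # ... then build the model on that device and wrap it in DDP with
    # device_ids=[device_ids[rank]] ...
```

Checking `nvidia-smi` after this change should show exactly one process per listed GPU.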

[attached image]
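As the OOM message itself suggests, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` can help with fragmentation, but allocator settings only take effect if they are in the environment before CUDA is first initialized. Once the duplicate contexts on GPU 0 are gone, the usual remaining levers are a smaller per-device batch size and/or gradient accumulation. A sketch, with hypothetical config keys not taken from this repo's config.toml:

```python
import os

# Allocator knobs must be set before torch initializes CUDA, so put this
# at the very top of train.py (or export it in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Hypothetical OOM-mitigation settings: shrink the per-GPU batch and
# recover the effective batch size via gradient accumulation.
example_config = {
    "batch_size": 4,                   # per-GPU batch size
    "gradient_accumulation_steps": 8,  # effective batch = 4 * 8 per GPU
}
```

The shell equivalent is `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py`.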
