[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257
              
Unanswered

kfoynt asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment, 1 reply

Any news with this?
  
Hi,
I am trying to train a model using 2 GPUs on 1 node on SLURM, but I am getting the following error:
[E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
Traceback (most recent call last):
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 681, in <module>
    main(args)
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 668, in main
    trainer.fit(lightning_model, my_dataset)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 147, in setup_environment
    self.setup_distributed()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 198, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
TimeoutError: The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
srun: error: watgpu108: task 1: Exited with exit code 1

Here is my sbatch file:
#SBATCH --nodes=1
#SBATCH --mem=96GB
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

source activate jupyter-server
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py

and my trainer:
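One thing that is sometimes suggested for this kind of timeout is pinning the rendezvous endpoint explicitly in the sbatch script, so that every rank agrees on the same address and port. This is only a sketch of that idea, not a confirmed fix for this issue; the port number 12910 is an arbitrary choice, and `scontrol show hostnames` simply picks the first node in the allocation:

```shell
# Sketch (assumption, not a confirmed fix): set the c10d rendezvous
# endpoint explicitly before `srun`, so all ranks connect to the same
# host:port. Add these lines to the sbatch script above `srun`.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12910  # arbitrary free port; change if it is taken
```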
trainer = pl.Trainer(
    precision="bf16-mixed",
    accelerator="gpu",
    devices=2,
    num_nodes=1,
    strategy="ddp",
    max_epochs=100000,
)

I am very stuck on this. I have been googling and trying potential solutions for hours, but I still get the same problem.
I tried changing the backend to gloo, but I get the same issue.
Any help would be greatly appreciated.
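Since the timeout happens in `TCPStore` (the rendezvous step that runs before any backend collective), the same failure with both nccl and gloo suggests a plain TCP reachability problem rather than a backend issue. A minimal stdlib-only probe can check this; `can_connect` is a hypothetical helper, and the host/port values are taken from the error message above:

```python
# Stdlib-only probe: can this process open a TCP connection to the c10d
# rendezvous endpoint? If this returns False from a worker task, the hang
# is likely a networking/firewall issue, not a Lightning configuration one.
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port from the error message above; substitute your own
    # MASTER_ADDR and MASTER_PORT when running this on the cluster.
    print(can_connect("watgpu108", 19747))
```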