[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257

kfoynt · 2024-01-10T07:15:34Z

kfoynt
Jan 10, 2024

Hi

I am trying to train a model using 2 GPUs in 1 node on SLURM.
But I am getting the following error:

[E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747). Traceback (most recent call last): File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 681, in <module> main(args) File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 668, in main trainer.fit(lightning_model, my_dataset) File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch return function(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 947, in _run self.strategy.setup_environment() File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 147, in setup_environment self.setup_distributed() File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 198, in setup_distributed _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout) File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 290, in _init_dist_connection torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs) File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper func_return = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group store, rank, world_size = next(rendezvous_iterator) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store return TCPStore( ^^^^^^^^^ TimeoutError: The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747). srun: error: watgpu108: task 1: Exited with exit code 1

Here is my sbatch file:

#SBATCH --nodes=1
#SBATCH --mem=96GB
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

source activate jupyter-server

export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

srun python two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py

and my trainer

trainer = pl.Trainer(precision="bf16-mixed", accelerator="gpu", devices=2, num_nodes=1, strategy='ddp', max_epochs=100000)

I am very stuck on this. I have been googling and trying potential solutions for hours, but I still get the same problem.
I tried changing the backend to gloo, but I get the same issue.

Any help would be greatly appreciated.

oabuhamdan · 2024-06-21T21:23:30Z

oabuhamdan
Jun 21, 2024

Any news with this?

1 reply

kfoynt Jun 25, 2024
Author

There is a parameter in DDPStrategy

DDPStrategy(process_group_backend='nccl', timeout=datetime.timedelta(seconds=1800))

It is set by default to 1800. All you have to do is to increase it to whatever works for you.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257

Uh oh!

Uh oh!

kfoynt Jan 10, 2024

Replies: 1 comment · 1 reply

Uh oh!

oabuhamdan Jun 21, 2024

Uh oh!

kfoynt Jun 25, 2024 Author

kfoynt
Jan 10, 2024

Replies: 1 comment 1 reply

oabuhamdan
Jun 21, 2024

kfoynt Jun 25, 2024
Author