
Hello 瓜神, questions about the num_generations setting and the code hanging at runtime #70

@ZhangEnmao

Description


Hello 瓜神, I have run into the following two issues while reproducing the X-R1 tasks and would appreciate your reply.

1. As I understand it, num_generations is the number of rollouts GRPO performs per prompt. Combining that with num_processes (number of GPUs used for training), per_device_train_batch_size (batch size per GPU), and gradient_accumulation_steps (number of gradient-accumulation steps), the total_batch_size consumed by one gradient update should be: num_processes * per_device_train_batch_size * gradient_accumulation_steps * num_generations. Is this understanding correct? (See the sketch just below for the arithmetic as I read it.)
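For concreteness, here is a minimal sketch of the arithmetic under my interpretation above. The parameter names follow the repo's YAML keys and the values are from the reduced config I describe in question 2; whether num_generations really multiplies in this way is exactly what I am asking you to confirm.

# Effective batch size under the interpretation in question 1.
# The formula itself is the assumption I would like to confirm.
num_processes = 3                # GPUs used for training
per_device_train_batch_size = 2  # batch size per GPU
gradient_accumulation_steps = 6  # gradient-accumulation steps
num_generations = 2              # GRPO rollouts per prompt

total_batch_size = (num_processes
                    * per_device_train_batch_size
                    * gradient_accumulation_steps
                    * num_generations)
print(total_batch_size)  # 72 under this interpretation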

2. While reproducing the 1.5B and 3B runs, the code occasionally hangs and eventually throws an NCCL timeout error that kills the experiment. The environment is 4x3090, and the 0.5B code has already run successfully on it.
For example:
2025-04-18 13:08:57 - WARNING - latex2sympy2_extended.math_normalization - equations is deprecated, as it handled by the parser now
[rank0]:[E418 13:08:57.396667294 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 30419, last enqueued NCCL work: 30419, last completed NCCL work: 30418.
[rank0]:[E418 13:08:57.396709430 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E418 13:08:57.396733920 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E418 13:08:57.415528487 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30419, OpType=BROADCAST, NumelIn=25884, NumelOut=25884, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7571a2617446 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7571579cc772 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7571579d3bb3 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7571579d561d in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7571a277e5c0 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x9ca94 (0x7571b4c9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x7571b4d29c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30419, OpType=BROADCAST, NumelIn=25884, NumelOut=25884, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7571a2617446 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libc10.so)

I used the yaml configuration files provided in the project. I suspected insufficient GPU memory, so I reduced the configuration as far as I could (num_processes=3, per_device_train_batch_size=2, gradient_accumulation_steps=6, num_generations=2), but the error still occurs.
Do you have any experience resolving this kind of problem? Would scaling up the hardware be a viable fix, for example switching to 5x4090?
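In the meantime, here is a minimal sketch of a workaround I am considering: raising the NCCL collective timeout (the 1800000 ms watchdog limit visible in the log above) through accelerate's InitProcessGroupKwargs. This assumes the training entrypoint builds its own Accelerator and can pass kwargs_handlers; it only delays the watchdog and I have not verified that it resolves the underlying hang.

# Sketch: extend the NCCL watchdog timeout from 30 min to 2 h so that a long
# generation phase is not killed by the collective-timeout watchdog.
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])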

Looking forward to your reply. Best regards.
