
Hello 瓜神, questions about the num_generations setting and the code hanging at runtime #70

@ZhangEnmao

Description


Hello 瓜神, I have run into the following two issues while reproducing the X-R1 tasks and would appreciate your reply.

1. As I understand it, num_generations is the number of rollouts GRPO performs per prompt. Combining that with num_processes (number of GPUs used for training), per_device_train_batch_size (batch size per GPU), and gradient_accumulation_steps (number of gradient-accumulation steps), the total_batch_size consumed by one gradient update should be: num_processes * per_device_train_batch_size * gradient_accumulation_steps * num_generations. Is this understanding correct? (See the sketch just below for the arithmetic as I read it.)
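For concreteness, here is a minimal sketch of the arithmetic under my interpretation above. The parameter names follow the repo's YAML keys and the values are from the reduced config I describe in question 2; whether num_generations really multiplies in this way is exactly what I am asking you to confirm.

# Effective batch size under the interpretation in question 1.
# The formula itself is the assumption I would like to confirm.
num_processes = 3                # GPUs used for training
per_device_train_batch_size = 2  # batch size per GPU
gradient_accumulation_steps = 6  # gradient-accumulation steps
num_generations = 2              # GRPO rollouts per prompt

total_batch_size = (num_processes
                    * per_device_train_batch_size
                    * gradient_accumulation_steps
                    * num_generations)
print(total_batch_size)  # 72 under this interpretation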

2. While reproducing the 1.5B and 3B runs, the code occasionally hangs and eventually throws an NCCL timeout error that kills the experiment. The environment is 4x3090, and the 0.5B code has already run successfully on it.
For example:
2025-04-18 13:08:57 - WARNING - latex2sympy2_extended.math_normalization - equations is deprecated, as it handled by the parser now
[rank0]:[E418 13:08:57.396667294 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 30419, last enqueued NCCL work: 30419, last completed NCCL work: 30418.
[rank0]:[E418 13:08:57.396709430 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E418 13:08:57.396733920 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E418 13:08:57.415528487 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30419, OpType=BROADCAST, NumelIn=25884, NumelOut=25884, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7571a2617446 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7571579cc772 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7571579d3bb3 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7571579d561d in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7571a277e5c0 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x9ca94 (0x7571b4c9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x7571b4d29c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30419, OpType=BROADCAST, NumelIn=25884, NumelOut=25884, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7571a2617446 in /root/miniconda3/envs/xr1/lib/python3.11/site-packages/torch/lib/libc10.so)

I used the yaml configuration files provided in the project. I suspected insufficient GPU memory, so I reduced the configuration as far as I could (num_processes=3, per_device_train_batch_size=2, gradient_accumulation_steps=6, num_generations=2), but the error still occurs.
Do you have any experience resolving this kind of problem? Would scaling up the hardware be a viable fix, for example switching to 5x4090?
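In the meantime, here is a minimal sketch of a workaround I am considering: raising the NCCL collective timeout (the 1800000 ms watchdog limit visible in the log above) through accelerate's InitProcessGroupKwargs. This assumes the training entrypoint builds its own Accelerator and can pass kwargs_handlers; it only delays the watchdog and I have not verified that it resolves the underlying hang.

# Sketch: extend the NCCL watchdog timeout from 30 min to 2 h so that a long
# generation phase is not killed by the collective-timeout watchdog.
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])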

Looking forward to your reply. Best regards.
