Conversation


swarnaHub commented on Jul 22, 2025

What does this PR do? Please describe:
This PR enables the use of generative RMs (GenRMs) in GRPO training. Below is a summary of the main changes:

This is what a typical reward config looks like:

```yaml
reward:
  name: "generative_pairwise_verifier"
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pairwise_score_extractor"
    pair_type: "all_pairs"
```

If you want to use any pairwise RM, set name to "generative_pairwise_verifier"; for pointwise RMs, use "generative_pointwise_verifier". There is currently no option for k-wise judgments, but I am adding support for it now.
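For comparison, a pointwise setup would only differ in the name (and the extractor it points to). This is a rough sketch: the extractor name below is a hypothetical placeholder, not one defined in this PR, and pair_type is omitted since it only applies to pairwise RMs.

```yaml
reward:
  name: "generative_pointwise_verifier"
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pointwise_score_extractor"  # hypothetical extractor name
```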

The field "judgment_extractor" refers to a class that implements how we (1) prompt the RM, (2) extract scores/judgments from the judgment CoTs, and (3) aggregate multiple scores (e.g., if doing self-consistency).

Whether you use GenRMs in a reference-free or a reference-based manner (e.g., for math, you might have access to reference answers) need not be explicitly stated in the config. If the input file has a field for the reference answer, it will be used. All extractors now take the reference answer as an argument (which may be empty).
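To make that contract concrete, here is a minimal sketch of what such an extractor class might look like. The class and method names are illustrative assumptions, not the actual j1_pairwise_score_extractor interface in _rewards.py:

```python
from typing import List, Optional


class PairwiseScoreExtractor:
    """Illustrative sketch of a judgment extractor; names are hypothetical."""

    def build_prompt(
        self,
        prompt: str,
        response_a: str,
        response_b: str,
        reference_answer: Optional[str] = None,
    ) -> str:
        # (1) Prompt the RM: compare two rollouts, optionally grounded in a
        # reference answer if the input file provides one.
        ref = f"\nReference answer: {reference_answer}" if reference_answer else ""
        return (
            f"Question: {prompt}{ref}\n"
            f"Response A: {response_a}\nResponse B: {response_b}\n"
            "Think step by step, then answer with 'A' or 'B'."
        )

    def extract_score(self, judgment_cot: str) -> float:
        # (2) Extract a judgment from the RM's CoT: 1.0 if A wins, else 0.0.
        tokens = judgment_cot.strip().split()
        return 1.0 if tokens and tokens[-1].upper().strip(".") == "A" else 0.0

    def aggregate(self, scores: List[float]) -> float:
        # (3) Aggregate multiple sampled judgments (e.g., self-consistency).
        return sum(scores) / len(scores) if scores else 0.0
```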

Finally, "pair_type" refers to the particular setting in which a pairwise RM is used. There are currently three options: (1) "pivot" (all rollouts are judged against a reference rollout), (2) "random_pairs" (we randomly sample N pairwise comparisons), and (3) "all_pairs" (all N*(N-1) ordered pairs are constructed for judgment). See details in the GenerativePairwiseVerifier class in _rewards.py.
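As a rough illustration of the three settings (a simplified sketch, not the actual logic in GenerativePairwiseVerifier; build_pairs, num_random_pairs, and pivot_index are made-up names):

```python
import random
from itertools import permutations
from typing import List, Tuple


def build_pairs(
    num_rollouts: int,
    pair_type: str,
    num_random_pairs: int = 4,
    pivot_index: int = 0,
) -> List[Tuple[int, int]]:
    """Return (i, j) rollout index pairs to be sent to the pairwise RM."""
    ordered_pairs = list(permutations(range(num_rollouts), 2))  # N * (N - 1) pairs
    if pair_type == "all_pairs":
        return ordered_pairs
    if pair_type == "pivot":
        # Judge every other rollout against a single reference rollout.
        return [(i, pivot_index) for i in range(num_rollouts) if i != pivot_index]
    if pair_type == "random_pairs":
        return random.sample(ordered_pairs, min(num_random_pairs, len(ordered_pairs)))
    raise ValueError(f"Unknown pair_type: {pair_type}")


print(len(build_pairs(4, "all_pairs")))  # 4 * 3 = 12 judgments
```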

Unrelated change: this PR also adds support for the Skywork v2 RM.

facebook-github-bot added the CLA Signed label on Jul 22, 2025.
swarnaHub changed the title from "skywork and some qwrn changes" to "skywork and some qwen changes" on Jul 22, 2025.
swarnaHub changed the base branch from ot_merge to online_training on July 22, 2025 at 21:21.
swarnaHub requested a review from chenxwh on August 27, 2025 at 17:20.

```python
for k, v in self._config.loss_config.validation_vllm_sampling_params.items():
    policy_sampling_params.__setattr__(k, v)

# For a pairwise RM, need to sample at least two judgments
```
Contributor:

This is 2 rollouts per prompt, but I assume there are 2 copies of the prompt (2 different orders)? So shouldn't you only need 1 rollout for each order?

swarnaHub (Author):

Sorry, this is a typo! I meant 2 "rollouts" instead of "judgments".

How many judgment prompts are created out of those two rollouts (i.e., 2, with (a, b) and (b, a) covering both orders) is automatically handled by the "all_pairs" setting in the reward config. So with all_pairs we'll have N * (N - 1) judgments, and since N = 2, we'll have 2 judgments. Let me know if this makes sense!
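As a quick check of that count (a throwaway snippet, not code from the PR):

```python
from itertools import permutations

pairs = list(permutations(range(2), 2))  # N = 2 rollouts under "all_pairs"
print(pairs, len(pairs))                 # [(0, 1), (1, 0)] 2
```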

Contributor:

Yep, makes sense now.

@jacklanchantin (Contributor) commented:

@swarnaHub Is this ready for review? If so, you can change the status from draft.

swarnaHub marked this pull request as ready for review on August 27, 2025 at 18:55.
swarnaHub requested a review from cbalioglu as a code owner on August 27, 2025 at 18:55.
swarnaHub changed the title from "skywork and some qwen changes" to "RLLM" on Oct 29, 2025.