RLLM #1232
Conversation
```python
) in self._config.loss_config.validation_vllm_sampling_params.items():
    policy_sampling_params.__setattr__(k, v)

# For a pairwise RM, need to sample at least two judgments
```
This is 2 rollouts per prompt, but I assume there are 2 copies of the prompt (2 different orders)? So shouldn't you only need 1 rollout for each order?
Sorry, this is a typo! I meant 2 "rollouts" instead of "judgments".
How many judgment prompts are created out of those two rollouts (i.e., 2, with (a,b) and (b,a) covering both orders) is handled automatically by the "all_pairs" setting in the reward config. So, with "all_pairs", we'll have N * (N-1) judgments, and since N = 2, we'll have 2 judgments. Let me know if this makes sense!
yep, makes sense now
@swarnaHub Is this ready for review? If so, you can change the status from draft.
Force-pushed from 84e7bd7 to 4ea811d
Force-pushed from fce332b to 6703f1b
What does this PR do? Please describe:
This PR enables the use of generative reward models (GenRMs) in GRPO training. Below is a summary of the main changes.
This is what a typical reward config looks like:
```yaml
reward:
  name: "generative_pairwise_verifier"
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pairwise_score_extractor"
    pair_type: "all_pairs"
```
To use any pairwise RM, set `name` to "generative_pairwise_verifier"; for pointwise RMs, use "generative_pointwise_verifier". Currently there is no option for k-wise judgments, but I am adding support for that now.
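For example, switching to a pointwise RM only changes the `name` field (the pointwise extractor name below is hypothetical, shown just to illustrate the shape; `pair_type` is omitted since it only applies to pairwise RMs):

```yaml
reward:
  name: "generative_pointwise_verifier"  # pointwise instead of pairwise
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pointwise_score_extractor"  # hypothetical extractor name
```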
The field "judgment_extractor" refers to a class that implements how we (1) prompt the RM, (2) extract scores/judgments and (3) aggregate multiple scores (if doing SC or something) from the judgment CoTs.
Whether you use GenRMs in a reference-free or a reference-based manner (e.g., for math, you might have access to reference answers) need not be explicitly stated in the config. If the input file has a field for the reference answer, it will be used. All extractors now take the reference answer as an argument (which can be empty).
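Continuing the illustrative extractor sketch above, the two modes go through the same call and differ only in the reference argument:

```python
extractor = J1PairwiseScoreExtractor()
# Reference-based (e.g., math with a known answer in the input file):
p1 = extractor.make_prompt("What is 17 * 24?", "408", "412", reference_answer="408")
# Reference-free (no reference field in the input record):
p2 = extractor.make_prompt("Write a haiku about autumn.", "...", "...", reference_answer="")
```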
Finally, "pair_type" refers to the particular setting a pairwise RM will be used in. These have three options now: (1) "pivot" (all rollouts are judged against a reference rollout), (2) "random_pairs" (we randomly sample N pairwise comparisons), and (3) "all_pairs" (all N*(N-1) pairs are constructed for judgment). See details in the GenerativePairwiseVerifier class in _rewards.py.
Unrelated change: this PR also adds support for the Skywork v2 RM.