Add truncated importance sampling and DrGRPO args #1394

jacklanchantin · 2025-10-22T21:19:20Z

What does this PR do? Please describe:

Adds tis_imp_ratio_cap to enable truncated importance sampling (TIS) correction
Adds GrpoLossConfig adv_std_normalization (for DrGRPO)
Adds an if statement that skips the ref_logps computation for the KL term when beta == 0 (as done in DrGRPO)
Adds GrpoLossConfig loss_token_mean for normalizing the loss over all tokens
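
For readers unfamiliar with the two DrGRPO-related options, here is a minimal plain-Python sketch of what they plausibly control. It is not the code in `_grpo.py`: the function names, the `eps` constant, the use of the population standard deviation, and the reading of "token mean" as an average over every token in the batch (rather than an average of per-sequence means) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, std_normalization=True, eps=1e-6):
    """Center rewards within one rollout group.

    With std_normalization=True the centered rewards are also divided by the
    group standard deviation (GRPO-style). With False they are only centered,
    which is the DrGRPO-style variant. Hypothetical helper, not fairseq2 API.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not std_normalization:
        return centered
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [c / (sigma + eps) for c in centered]


def aggregate_loss(per_token_losses, token_mean=True):
    """Reduce a ragged batch of per-token losses to a scalar.

    token_mean=True averages over all tokens in the batch, so long sequences
    weigh more. token_mean=False averages each sequence first, then averages
    the per-sequence means. Hypothetical helper, not fairseq2 API.
    """
    if token_mean:
        flat = [x for seq in per_token_losses for x in seq]
        return sum(flat) / len(flat)
    return mean(sum(seq) / len(seq) for seq in per_token_losses)
```

The two reductions differ only when sequences have different lengths, which is why the choice matters for rollouts of varying length.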


Most importantly, this adds truncated importance sampling correction, as recommended by @uralik.
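
A rough sketch of the truncated importance sampling idea, in plain Python rather than the tensor code of the PR: each token's advantage is multiplied by the ratio between the trainer's probability and the rollout engine's probability for that token, with the ratio capped from above. The function name, the argument names other than `tis_imp_ratio_cap`, and the direction of the ratio (trainer over rollout) are assumptions; the default cap of 2 mirrors the "add tis ratio clamp = 2" commit message only.

```python
import math


def tis_scaled_advantages(train_logps, rollout_logps, advantages, tis_imp_ratio_cap=2.0):
    """Scale per-token advantages by a capped importance ratio.

    train_logps:   per-token log-probs under the model being trained
    rollout_logps: per-token log-probs reported by the rollout engine (e.g. vLLM)
    Hypothetical helper; in a tensor implementation the ratio would typically
    be treated as a constant (no gradient flows through it).
    """
    scaled = []
    for lp_train, lp_rollout, adv in zip(train_logps, rollout_logps, advantages):
        ratio = math.exp(lp_train - lp_rollout)  # pi_train / pi_rollout
        ratio = min(ratio, tis_imp_ratio_cap)    # truncate from above only
        scaled.append(adv * ratio)
    return scaled
```

When the trainer and the rollout engine agree on a token's probability, the ratio is 1 and the advantage passes through unchanged; the cap only bounds tokens the trainer finds much more likely than the rollout engine did.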

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
Did you read the contributor guideline?
Did you make sure that your PR does only one thing instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)


jacklanchantin · 2025-10-22T21:29:11Z

src/fairseq2/recipes/lm/_online_finetune/_grpo.py

+            )
+            per_token_scaled_advantage = per_token_scaled_advantage * tis_imp_ratio
+
+        if ref_logps is not None:


only use kl if ref_logps were computed

Nitpick: does this also mean that beta is non-zero? Would an assert make sense here, since execution should never reach this branch when beta is zero? Or could this if statement be conditioned on beta for better readability?

otherwise LGTM!
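
One way the suggestion above could look, as a plain-Python sketch and not the merged code: branch on `beta` directly and assert that reference log-probs exist whenever the KL term is active. The function name is hypothetical, and the per-token estimator used here, exp(d) - d - 1 with d = ref_logp - logp, is the one commonly used with GRPO; whether `_grpo.py` uses the same estimator is an assumption.

```python
import math


def kl_term(beta, logps, ref_logps=None):
    """Return the beta-weighted mean KL penalty, or 0.0 when beta is zero.

    Conditioning on beta makes the intent explicit: with beta == 0 the
    reference model never needs to be run, so ref_logps may stay None.
    Hypothetical helper, not fairseq2 API.
    """
    if beta == 0:
        return 0.0
    assert ref_logps is not None, "beta > 0 requires reference log-probs"
    total = 0.0
    for lp, ref_lp in zip(logps, ref_logps):
        delta = ref_lp - lp
        total += math.exp(delta) - delta - 1.0  # non-negative per-token estimate
    return beta * total / len(logps)
```

With this shape, a missing reference forward pass under a non-zero `beta` fails loudly instead of silently dropping the penalty.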

src/fairseq2/recipes/lm/_online_finetune/_grpo.py

uralik

Let's make sure this works for bs>1 before merging! (as discussed offline)


Jack Lanchantin and others added 17 commits October 14, 2025 13:36

drgrpo

fd7267f

get vllm logps

cb6f7a9

Update _wandb.py

d6acc63

remove beta check

7b72df9

Merge branch 'jacklanchantin/drgrpo' of github.com:facebookresearch/fairseq2 into jacklanchantin/drgrpo

7fc3b2f

format

502fa69

revert

79382d3

add importance sampling correction

97e8dca

dont run ref model forward if beta==0

54c9d98

add tis ratio clamp = 2

acb0840

clean up

50d21dd

configs

ccfa63b

clean up

bb49312

default

bd4b073

var name

6919a4c

var name

d910891

only use tis_imp_ratio_cap

b762625

meta-cla bot added the CLA Signed label Oct 22, 2025

Jack Lanchantin added 2 commits October 22, 2025 21:20

revert unrelated files

2f07a4f

clean up

a148e56

jacklanchantin commented Oct 22, 2025

src/fairseq2/recipes/lm/_online_finetune/_grpo.py (resolved)

jacklanchantin changed the title ~~Jacklanchantin/tis drgrpo~~ Add truncated importance sampling and DrGRPO args Oct 22, 2025

Jack Lanchantin added 2 commits October 22, 2025 21:34

fix type hint

d1eb63f

black/isort

0c50412

jacklanchantin marked this pull request as ready for review October 22, 2025 21:39

jacklanchantin requested a review from cbalioglu as a code owner October 22, 2025 21:39

jacklanchantin requested review from swarnaHub and uralik October 22, 2025 21:39

jacklanchantin commented Oct 22, 2025

src/fairseq2/recipes/lm/_online_finetune/_grpo.py (outdated, resolved)

swarnaHub approved these changes Oct 22, 2025


uralik requested changes Oct 22, 2025


jacklanchantin added 2 commits October 22, 2025 23:54

Allow batched inputs for get_vllm_logprobs

7c3f827

allow batch_sz > 1

b36dffe

jacklanchantin commented Oct 23, 2025

src/fairseq2/recipes/lm/_online_finetune/_grpo.py (outdated, resolved)

jacklanchantin and others added 2 commits October 23, 2025 11:37

Modify condition for reference log probabilities

de47d54

fix batch>1, microbatching

9313b0e

uralik self-requested a review October 27, 2025 22:46

uralik approved these changes Oct 27, 2025


uralik merged commit 606459b into online_training Oct 27, 2025
6 of 13 checks passed
