Conversation

@jacklanchantin (Contributor) commented Oct 14, 2025

What does this PR do? Please describe:

  • Adds GrpoLossConfig adv_std_normalization to toggle advantage standard-deviation normalization (disabled for DrGRPO)
  • Adds GrpoLossConfig loss_token_mean for normalizing the loss over all tokens rather than per sequence
  • Skips the reference log-prob computation for the KL term when beta == 0
  • Adds tis_imp_ratio_cap to apply a truncated importance sampling correction
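The options above can be sketched as a single token-level GRPO loss. This is a hedged illustration, not the fairseq2 implementation: the function name, tensor shapes, the k3 KL estimator, and the epsilon constant are assumptions made here for a self-contained example.

```python
import torch

def grpo_token_loss(
    logps,            # (B, T) trainer log-probs of sampled tokens (hypothetical shape)
    rollout_logps,    # (B, T) log-probs from the rollout/sampling engine
    ref_logps,        # (B, T) reference-model log-probs, or None when beta == 0
    advantages,       # (B,) group-relative advantages
    mask,             # (B, T) 1.0 for valid completion tokens, 0.0 for padding
    beta: float = 0.04,
    adv_std_normalization: bool = True,   # False -> DrGRPO-style, no std division
    loss_token_mean: bool = False,        # True -> mean over all tokens in the batch
    tis_imp_ratio_cap: float = 0.0,       # > 0 enables truncated importance sampling
):
    adv = advantages
    if adv_std_normalization:
        # standard GRPO divides by the group advantage std; DrGRPO drops this
        adv = adv / (adv.std() + 1e-4)
    per_token = -logps * adv.unsqueeze(-1)

    if tis_imp_ratio_cap > 0:
        # truncated importance sampling: clamp the (detached) ratio between
        # the trainer policy and the rollout policy that generated the samples
        imp = torch.exp(logps.detach() - rollout_logps).clamp(max=tis_imp_ratio_cap)
        per_token = per_token * imp

    if beta > 0:
        # KL penalty against the reference model (k3 estimator); when beta == 0
        # this branch is skipped, so ref_logps never needs to be computed
        kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1.0
        per_token = per_token + beta * kl

    if loss_token_mean:
        # normalize over all valid tokens in the batch
        return (per_token * mask).sum() / mask.sum().clamp(min=1)
    # default: per-sequence token mean, then batch mean
    seq = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return seq.mean()
```

With `beta=0.0`, `ref_logps=None` is accepted because the KL branch is never entered, matching the "skip ref logprob computation" behavior described above.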

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 14, 2025
@jacklanchantin jacklanchantin changed the title from "drgrpo" to "Importance Sampling Correction, and DrGRPO args" on Oct 22, 2025
@jacklanchantin jacklanchantin changed the title from "Importance Sampling Correction, and DrGRPO args" to "Change online training verifier" on Oct 22, 2025
