Contributor

@jacklanchantin jacklanchantin commented Oct 22, 2025

What does this PR do? Please describe:

  • Adds tis_imp_ratio_cap to enable truncated importance sampling (TIS) correction
  • Adds GrpoLossConfig adv_std_normarlization (for DrGRPO)
  • Skips the ref_logps computation for the KL term when beta == 0 (as done in DrGRPO)
  • Adds GrpoLossConfig loss_token_mean to normalize the loss over all tokens
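
To make the two GrpoLossConfig options above concrete, here is a minimal, framework-free sketch in plain Python. It is not the code from this PR: the function names `group_advantages` and `aggregate_loss` are hypothetical, the exact semantics of the flags are assumed from their names, and the use of the population standard deviation and of `eps` is an assumption as well.

```python
from statistics import mean, pstdev

def group_advantages(rewards, std_normalize=True, eps=1e-6):
    """Center rewards within one group of rollouts for the same prompt.

    With std_normalize=False only the mean is subtracted, which is what
    DrGRPO is assumed to want; with True the result is also divided by
    the group standard deviation, as in standard GRPO.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not std_normalize:
        return centered
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [c / (sigma + eps) for c in centered]

def aggregate_loss(per_token_losses, token_mean=True):
    """Reduce per-token losses (one list of valid tokens per sequence).

    token_mean=True averages over all tokens in the batch, so long
    sequences contribute more terms; token_mean=False averages within
    each sequence first and then across sequences.
    """
    if token_mean:
        flat = [x for seq in per_token_losses for x in seq]
        return sum(flat) / len(flat)
    return mean([sum(seq) / len(seq) for seq in per_token_losses])
```

The two reductions differ as soon as sequence lengths differ: for one sequence of three tokens with loss 1.0 each and one single-token sequence with loss 4.0, the token mean is 1.75 while the per-sequence mean is 2.5.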

Fixes #{issue number}

Most importantly, this adds truncated importance sampling correction, as recommended by @uralik.
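
For readers unfamiliar with the technique, the following is a rough sketch of truncated importance sampling on plain Python lists. It assumes the ratio is taken between the training policy's log-probabilities and the log-probabilities recorded by the rollout engine, and that it is capped from above by `tis_imp_ratio_cap`; the helper names `tis_ratio` and `apply_tis` and the argument names `train_logps` and `rollout_logps` are hypothetical and do not come from this PR. In a tensor implementation the ratio would typically also be detached from the gradient, which this sketch does not model.

```python
import math

def tis_ratio(train_logps, rollout_logps, cap):
    """Per-token importance ratio exp(train - rollout), truncated at `cap`."""
    return [min(math.exp(t - r), cap) for t, r in zip(train_logps, rollout_logps)]

def apply_tis(per_token_scaled_advantage, train_logps, rollout_logps, tis_imp_ratio_cap):
    """Scale each per-token advantage by its truncated importance ratio."""
    ratios = tis_ratio(train_logps, rollout_logps, tis_imp_ratio_cap)
    return [a * w for a, w in zip(per_token_scaled_advantage, ratios)]
```

Truncating only from above bounds the variance contributed by tokens that the training policy finds much more likely than the rollout engine did, at the cost of a small bias.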

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@meta-cla meta-cla bot added the CLA Signed label Oct 22, 2025
)
per_token_scaled_advantage = per_token_scaled_advantage * tis_imp_ratio

if ref_logps is not None:
Contributor Author

Only use the KL term if ref_logps were computed.

Nitpick: does this also imply that beta is non-zero? Would an assert make sense here, so that this branch is never reached when beta is zero? Or could the if statement be conditioned on beta directly, for better readability?

otherwise LGTM!
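
One possible shape for the suggestion above is sketched below, on plain Python lists. It conditions the branch on beta and asserts that the reference log-probabilities exist whenever the KL term is enabled. The function name `kl_penalty` is hypothetical, and the per-token estimator used here (exp(d) - d - 1 with d = ref - policy, a form commonly seen in GRPO implementations) is an assumption, since the PR's actual KL computation is not shown in this thread.

```python
import math

def kl_penalty(policy_logps, ref_logps, beta):
    """Per-token KL penalty, scaled by beta; all zeros when beta == 0."""
    if beta == 0:
        # KL is disabled, so ref_logps is allowed to be None (never computed).
        return [0.0 for _ in policy_logps]
    assert ref_logps is not None, "beta != 0 requires ref_logps to be computed"
    penalties = []
    for p, r in zip(policy_logps, ref_logps):
        d = r - p
        penalties.append(beta * (math.exp(d) - d - 1.0))
    return penalties
```

Keying the branch on beta makes the intent visible at the call site, while the assert catches the case where the KL term is enabled but the reference pass was skipped.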

@jacklanchantin jacklanchantin changed the title Jacklanchantin/tis drgrpo Add truncated importance sampling and DrGRPO args Oct 22, 2025
@jacklanchantin jacklanchantin marked this pull request as ready for review October 22, 2025 21:39
Contributor

@uralik uralik left a comment

Let's make sure this works for bs>1 before merging! (as discussed offline)

@uralik uralik self-requested a review October 27, 2025 22:46
@uralik uralik merged commit 606459b into online_training Oct 27, 2025
6 of 13 checks passed
4 participants