
Conversation

@Birdylx (Contributor) commented Dec 8, 2025

PR type

Bug Fix

PR information

Fix a bug in the phi loss: the coefficient should be gradient-stopped (detached), as described in the paper.

@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @Birdylx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the phi loss computation within the Chord reinforcement learning from human feedback (RLHF) trainer. The fix ensures that a specific coefficient, chord_sft_loss, is correctly detached from the gradient computation graph when used to calculate per_token_probs for the phi function. This adjustment is crucial for maintaining the integrity of the gradient flow and ensuring the training process adheres to the original paper's specifications, leading to more accurate and stable model optimization.

Highlights

  • Bug Fix in Phi Loss Calculation: Addressed a bug in the phi loss calculation within the Chord RLHF trainer, specifically ensuring that a coefficient's gradient is properly stopped as described in the relevant paper.
  • Gradient Detachment: Implemented chord_sft_loss.detach() when computing per_token_probs for the phi function, preventing unintended gradient flow through this term and aligning the implementation with the theoretical model.

@gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a critical bug in the CHORD-ϕ loss calculation. By detaching chord_sft_loss before computing the phi weighting factor, it ensures that gradients are calculated correctly, treating phi as a constant weight as described in the CHORD paper. This change is essential for the correct implementation of the algorithm.

chord_sft_loss = per_token_loss_func(outputs, labels)

if trainer.args.chord_enable_phi_function:
    per_token_probs = torch.exp(-chord_sft_loss)

critical

The phi weighting factor should be treated as a constant during backpropagation, as its purpose is to scale the SFT loss based on the model's confidence. By not detaching chord_sft_loss here, incorrect gradients are propagated through phi, which can lead to training instability or failure to converge. The gradient of the final SFT loss should only come from the original loss term, scaled by phi.

Suggested change
- per_token_probs = torch.exp(-chord_sft_loss)
+ per_token_probs = torch.exp(-chord_sft_loss.detach())
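
To illustrate the effect outside the trainer, here is a minimal, self-contained sketch. The toy per-token losses and the φ(p) = p · (1 − p) weighting form are assumptions chosen for illustration; only the per_token_probs line mirrors the patched code.

import torch

# Stand-in for the per-token SFT cross-entropy produced by per_token_loss_func.
chord_sft_loss = torch.tensor([0.5, 1.2, 0.3], requires_grad=True)

# Detaching makes the weight a constant with respect to the parameters,
# so backprop flows only through the original chord_sft_loss term.
per_token_probs = torch.exp(-chord_sft_loss.detach())
phi = per_token_probs * (1.0 - per_token_probs)  # illustrative φ form

weighted_sft_loss = (phi * chord_sft_loss).mean()
weighted_sft_loss.backward()
print(chord_sft_loss.grad)  # equals phi / 3: φ acts as a fixed per-token weight

With the detach in place, optimizing the weighted loss moves the model in the same direction as the plain SFT loss, just rescaled per token by φ.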

@hjh0119 (Collaborator) commented Dec 8, 2025

Thank you for your contribution.

Could you share a reference for why the phi function should stop gradients?

@Birdylx (Contributor, Author) commented Dec 8, 2025

@hjh0119 here is the original implementation: https://github.com/modelscope/Trinity-RFT/blob/613194d45fee0eef9145fb73dbda69cab17fd6f4/trinity/algorithm/policy_loss_fn/chord_policy_loss.py#L97
phi is a weight derived from the log-probability; we do not want gradients to flow through it, since it is not learnable.
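
For reference, a small stand-alone check of the two variants (the scalar loss value is a made-up stand-in; only the exp(-loss) weighting mirrors the code in this PR):

import torch

loss = torch.tensor([0.7], requires_grad=True)  # hypothetical per-token SFT loss

# Without detach: the gradient also flows through exp(-loss),
# picking up an extra (1 - loss) factor from the product rule.
(torch.exp(-loss) * loss).sum().backward()
print(loss.grad)  # exp(-0.7) * (1 - 0.7) ≈ 0.149

loss.grad = None

# With detach: the weight is a constant, so the gradient comes
# only from the weighted loss term itself.
(torch.exp(-loss.detach()) * loss).sum().backward()
print(loss.grad)  # exp(-0.7) ≈ 0.497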

@hjh0119 merged commit d77d409 into modelscope:main on Dec 8, 2025
2 of 3 checks passed
