RLLM #1232
Conversation
```python
) in self._config.loss_config.validation_vllm_sampling_params.items():
    policy_sampling_params.__setattr__(k, v)

# For a pairwise RM, need to sample at least two judgments
```
This is 2 rollouts per prompt, but I assume there are 2 copies of the prompt (2 different orders)? So shouldn't you only need 1 rollout for each order?
Sorry, this is a typo! I meant 2 "rollouts" instead of "judgments".
How many judgment prompts are created out of those two rollouts (i.e., 2, with (a,b) and (b,a) covering both orders) is handled automatically by the "all_pairs" setting in the reward config. So, with "all_pairs", we'll have N * (N-1) judgments, and since N = 2, we'll have 2 judgments. Let me know if this makes sense!
yep, makes sense now
@swarnaHub Is this ready for review? If so, you can change the status from draft.
Force-pushed from 84e7bd7 to 4ea811d
Force-pushed from fce332b to 6703f1b
What does this PR do? Please describe:
This PR enables the use of generative reward models (GenRMs) in GRPO training. Below is a summary of the main changes.
This is what a typical reward config looks like:
```yaml
reward:
  name: "generative_pairwise_verifier"
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pairwise_score_extractor"
    pair_type: "all_pairs"
```
To use any pairwise RM, set `name` to "generative_pairwise_verifier"; for pointwise RMs, use "generative_pointwise_verifier". Currently there is no option for k-wise judgments, but I am adding support for that now.
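For example, switching to a pointwise RM only changes the `name` field (the pointwise extractor name below is hypothetical, shown just to illustrate the shape; `pair_type` is omitted since it only applies to pairwise RMs):

```yaml
reward:
  name: "generative_pointwise_verifier"  # pointwise instead of pairwise
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pointwise_score_extractor"  # hypothetical extractor name
```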
The field "judgment_extractor" refers to a class that implements how we (1) prompt the RM, (2) extract scores/judgments and (3) aggregate multiple scores (if doing SC or something) from the judgment CoTs.
Whether you use GenRMs in a reference-free or a reference-based manner (e.g., for math, you might have access to reference answers) need not be explicitly stated in the config. If the input file has a field for the reference answer, it will be used. All extractors now take the reference answer as an argument (which can be empty).
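Continuing the illustrative extractor sketch above, the two modes go through the same call and differ only in the reference argument:

```python
extractor = J1PairwiseScoreExtractor()
# Reference-based (e.g., math with a known answer in the input file):
p1 = extractor.make_prompt("What is 17 * 24?", "408", "412", reference_answer="408")
# Reference-free (no reference field in the input record):
p2 = extractor.make_prompt("Write a haiku about autumn.", "...", "...", reference_answer="")
```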
Finally, "pair_type" refers to the particular setting a pairwise RM will be used in. These have three options now: (1) "pivot" (all rollouts are judged against a reference rollout), (2) "random_pairs" (we randomly sample N pairwise comparisons), and (3) "all_pairs" (all N*(N-1) pairs are constructed for judgment). See details in the GenerativePairwiseVerifier class in _rewards.py.
Unrelated change: this PR also adds support for the Skywork v2 RM.