rlzero template and fixes to rlzero scripts #1216
Conversation
Summary of Changes
Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the instruction-following capabilities of RLZero models by introducing refined chat templates. The changes aim to provide clearer, more structured prompts for tasks involving general problem-solving and code generation. By replacing an older, more complex template with two specialized ones and updating the corresponding training scripts, the PR seeks to improve how models interpret and respond to instructions, leading to more accurate and consistently formatted outputs.
Code Review
This pull request introduces new rlzero chat templates for standard and code-based problem-solving and updates several training scripts to use them. My review identified a recurring issue in three of the shell scripts where the stop_strings argument is inconsistent with the newly assigned chat templates. This mismatch will likely cause generations to run to their maximum length, which is inefficient and could impact training. I have provided suggestions to comment out the incorrect stop strings. The new chat templates themselves and the other script modifications appear correct.
  --pack_length 18432 \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
- --chat_template_name olmo_thinker \
+ --chat_template_name olmo_thinker_rlzero \
Bug: Stop string mismatch with new templates
The script uses --stop_strings "</answer>" but the olmo_thinker_rlzero template instructs the model to output answers in the form Answer: $Answer, not using <answer> tags. The stop string will never match, causing generation to continue until hitting max length or EOS token instead of stopping at the intended point.
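As an illustration of the failure mode, here is a minimal sketch of stop-string truncation (the helper `truncate_at_stop` is hypothetical, not the actual open-instruct/vLLM generation code): when the template's answer format never emits the configured stop string, nothing is ever truncated.

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut text at the end of the first stop string found, if any."""
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            return text[: idx + len(stop)]
    return text  # no stop string matched: generation runs to max length

# Old-style template: the model closes its answer with </answer>, so we stop early.
old_style = "<answer>42</answer> and some extra tokens..."
print(truncate_at_stop(old_style, ["</answer>"]))  # "<answer>42</answer>"

# New rlzero template: the model writes "Answer: $Answer" instead, so
# "</answer>" never matches and the output is left untouched.
new_style = "Answer: 42 and some extra tokens..."
print(truncate_at_stop(new_style, ["</answer>"]))  # unchanged
```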
cluster=ai2/augusta
python mason.py \
uv run?
changed
scripts/train/olmo3/7b_rlzero_mix.sh (Outdated)
  --budget ai2/oe-adapt \
  -- \
  source configs/beaker_configs/ray_node_setup.sh \&\& \
  python open_instruct/grpo_fast.py \
put this on the previous line to make clear that it's not a new command? or indent?
same for line 33
moved and added && to the front so it's clear
…/add-rlzero-template
#!/bin/bash

MODEL_NAME_OR_PATH="allenai/Olmo-3-1025-7B"
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
Bug: Duplicate Code dataset in mix script
The DATASETS variable includes allenai/Dolci-RLZero-Code-7B twice (positions 1 and 3), likely instead of including the Math dataset once. The same duplication appears in LOCAL_EVALS. Given separate scripts exist for code, IF, and math training, the mix script should probably include all four dataset types: Code, IF, Math, and General, not Code twice.
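A corrected assignment might look like the following sketch. Note that `allenai/Dolci-RLZero-Math-7B` is an assumed dataset name inferred from the per-task scripts mentioned in the review, not confirmed by this diff.

```shell
# Hypothetical fix: one entry per dataset type (Code, IF, Math, General).
# "allenai/Dolci-RLZero-Math-7B" is an assumed name mirroring the other
# Dolci-RLZero datasets; verify it exists before using.
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Math-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
```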
  --response_length 16384 \
  --pack_length 18432 \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --chat_template_name olmo_thinker_rlzero \
Bug: Template mismatch for Code dataset
The script uses olmo_thinker_rlzero template while including the Code dataset. The Code dataset requires olmo_thinker_code_rlzero template which formats responses with code blocks. Using the wrong template will cause the model to generate responses in the wrong format for code problems, breaking the expected output structure with markdown code fences.
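A hedged sketch of one way to avoid the mismatch: the template names below come from this PR, but the routing logic is purely illustrative, not code from the scripts. Since a run takes a single --chat_template_name flag, a mix that includes the Code dataset would need the code-aware template (or the datasets would need to be split across runs).

```shell
# Illustrative only: choose the chat template based on whether the Code
# dataset is in the mix. Template names come from this PR; the routing
# itself is a hypothetical sketch.
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
if printf '%s' "$DATASETS" | grep -q 'Dolci-RLZero-Code'; then
  CHAT_TEMPLATE=olmo_thinker_code_rlzero   # answers formatted with markdown code fences
else
  CHAT_TEMPLATE=olmo_thinker_rlzero        # answers formatted as "Answer: $Answer"
fi
echo "$CHAT_TEMPLATE"
```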
Note
Add RLZero chat templates and migrate OLMo3 7B training scripts to new datasets/templates with a new mixed runner and updated execution/eval settings.
- Add olmo_thinker_rlzero and olmo_thinker_code_rlzero prompting formats focused on step-by-step solving with explicit final answer/solution sections.
- Remove the olmo_thinker_r1_style entry and simplify message rendering for RLZero templates.
- Switch scripts to allenai/Olmo-3-1025-7B, RLZero datasets (allenai/Dolci-RLZero-*), and new chat templates (olmo_thinker_rlzero, olmo_thinker_code_rlzero).
- Run uv run open_instruct/grpo_fast.py after Beaker setup; adjust evals, frequencies, lengths, and other run params.
- Add 7b_rlzero_mix.sh combining code/IF/general datasets with judge/eval settings.

Written by Cursor Bugbot for commit c22dd97.