rlzero template and fixes to rlzero scripts #1216
Conversation
Summary of Changes
Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the instruction-following capabilities of RLZero models by introducing refined chat templates. The changes aim to provide clearer, more structured prompts for tasks involving general problem-solving and code generation. By replacing an older, more complex template with two specialized ones and updating the corresponding training scripts, the PR seeks to improve how models interpret and respond to instructions, leading to more accurate and consistently formatted outputs.
Code Review
This pull request introduces new rlzero chat templates for standard and code-based problem-solving and updates several training scripts to use them. My review identified a recurring issue in three of the shell scripts where the stop_strings argument is inconsistent with the newly assigned chat templates. This mismatch will likely cause generations to run to their maximum length, which is inefficient and could impact training. I have provided suggestions to comment out the incorrect stop strings. The new chat templates themselves and the other script modifications appear correct.
  --pack_length 18432 \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
- --chat_template_name olmo_thinker \
+ --chat_template_name olmo_thinker_rlzero \
Bug: Stop string mismatch with new templates
The script uses --stop_strings "</answer>" but the olmo_thinker_rlzero template instructs the model to output answers in the form Answer: $Answer, not using <answer> tags. The stop string will never match, causing generation to continue until hitting max length or EOS token instead of stopping at the intended point.
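As an illustration of the failure mode, here is a minimal sketch of stop-string truncation (the helper `truncate_at_stop` is hypothetical, not the actual open-instruct/vLLM generation code): when the template's answer format never emits the configured stop string, nothing is ever truncated.

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut text at the end of the first stop string found, if any."""
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            return text[: idx + len(stop)]
    return text  # no stop string matched: generation runs to max length

# Old-style template: the model closes its answer with </answer>, so we stop early.
old_style = "<answer>42</answer> and some extra tokens..."
print(truncate_at_stop(old_style, ["</answer>"]))  # "<answer>42</answer>"

# New rlzero template: the model writes "Answer: $Answer" instead, so
# "</answer>" never matches and the output is left untouched.
new_style = "Answer: 42 and some extra tokens..."
print(truncate_at_stop(new_style, ["</answer>"]))  # unchanged
```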
cluster=ai2/augusta
python mason.py \
uv run?
changed
scripts/train/olmo3/7b_rlzero_mix.sh (Outdated)
  --budget ai2/oe-adapt \
  -- \
  source configs/beaker_configs/ray_node_setup.sh \&\& \
  python open_instruct/grpo_fast.py \
put this on the previous line to make clear that it's not a new command? or indent?
same for line 33
moved and added && to the front so it's clear
…/add-rlzero-template
#!/bin/bash

MODEL_NAME_OR_PATH="allenai/Olmo-3-1025-7B"
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
Bug: Duplicate Code dataset in mix script
The DATASETS variable includes allenai/Dolci-RLZero-Code-7B twice (positions 1 and 3), likely instead of including the Math dataset once. The same duplication appears in LOCAL_EVALS. Given separate scripts exist for code, IF, and math training, the mix script should probably include all four dataset types: Code, IF, Math, and General, not Code twice.
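A corrected assignment might look like the following sketch. Note that `allenai/Dolci-RLZero-Math-7B` is an assumed dataset name inferred from the per-task scripts mentioned in the review, not confirmed by this diff.

```shell
# Hypothetical fix: one entry per dataset type (Code, IF, Math, General).
# "allenai/Dolci-RLZero-Math-7B" is an assumed name mirroring the other
# Dolci-RLZero datasets; verify it exists before using.
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Math-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
```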
  --response_length 16384 \
  --pack_length 18432 \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --chat_template_name olmo_thinker_rlzero \
Bug: Template mismatch for Code dataset
The script uses olmo_thinker_rlzero template while including the Code dataset. The Code dataset requires olmo_thinker_code_rlzero template which formats responses with code blocks. Using the wrong template will cause the model to generate responses in the wrong format for code problems, breaking the expected output structure with markdown code fences.
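A hedged sketch of one way to avoid the mismatch: the template names below come from this PR, but the routing logic is purely illustrative, not code from the scripts. Since a run takes a single --chat_template_name flag, a mix that includes the Code dataset would need the code-aware template (or the datasets would need to be split across runs).

```shell
# Illustrative only: choose the chat template based on whether the Code
# dataset is in the mix. Template names come from this PR; the routing
# itself is a hypothetical sketch.
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"
if printf '%s' "$DATASETS" | grep -q 'Dolci-RLZero-Code'; then
  CHAT_TEMPLATE=olmo_thinker_code_rlzero   # answers formatted with markdown code fences
else
  CHAT_TEMPLATE=olmo_thinker_rlzero        # answers formatted as "Answer: $Answer"
fi
echo "$CHAT_TEMPLATE"
```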
Note
Add RLZero chat templates and migrate OLMo3 7B training scripts to new datasets/templates with a new mixed runner and updated execution/eval settings.
- Add olmo_thinker_rlzero and olmo_thinker_code_rlzero prompting formats focused on step-by-step solving with explicit final answer/solution sections.
- Remove the olmo_thinker_r1_style entry and simplify message rendering for RLZero templates.
- Switch scripts to allenai/Olmo-3-1025-7B, RLZero datasets (allenai/Dolci-RLZero-*), and new chat templates (olmo_thinker_rlzero, olmo_thinker_code_rlzero).
- Run uv run open_instruct/grpo_fast.py after Beaker setup; adjust evals, frequencies, lengths, and other run params.
- Add 7b_rlzero_mix.sh combining code/IF/general datasets with judge/eval settings.

Written by Cursor Bugbot for commit c22dd97.