
Conversation

@mnoukhov
Contributor

@mnoukhov mnoukhov commented Nov 20, 2025

Note

Add RLZero chat templates and migrate OLMo3 7B training scripts to new datasets/templates with a new mixed runner and updated execution/eval settings.

  • Chat templates (dataset_transformation.py):
    • Add olmo_thinker_rlzero and olmo_thinker_code_rlzero prompting formats focused on step-by-step solving with explicit final answer/solution sections.
    • Remove legacy olmo_thinker_r1_style entry and simplify message rendering for RLZero templates.
  • Training scripts (scripts/train/olmo3/...):
    • Update 7B runs to use allenai/Olmo-3-1025-7B, RLZero datasets (allenai/Dolci-RLZero-*), and new chat templates (olmo_thinker_rlzero, olmo_thinker_code_rlzero).
    • Switch launch flow to uv run open_instruct/grpo_fast.py after Beaker setup; adjust evals, frequencies, lengths, and other run params.
    • Add mixed-task runner 7b_rlzero_mix.sh combining code/IF/general datasets with judge/eval settings.
    • Tweak math run config (nodes, lengths, evals) and remove unused gs model name.

Written by Cursor Bugbot for commit c22dd97. This will update automatically on new commits.

@gemini-code-assist
Contributor

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the instruction-following capabilities of RLZero models by introducing refined chat templates. The changes aim to provide clearer, more structured prompts for tasks involving general problem-solving and code generation. By replacing an older, more complex template with two specialized ones and updating the corresponding training scripts, the PR seeks to improve how models interpret and respond to instructions, leading to more accurate and consistently formatted outputs.

Highlights

  • New Chat Templates: Introduced two new chat templates, olmo_thinker_rlzero and olmo_thinker_code_rlzero, designed to provide clearer instructions for step-by-step problem-solving and structured code generation, respectively.
  • Template Refinement: Replaced the verbose olmo_thinker_r1_style template with more concise and task-specific templates, streamlining the prompt structure for better model adherence.
  • Script Updates: Updated several RLZero training scripts (7b_rlzero_code.sh, 7b_rlzero_instruction_following.sh, 7b_rlzero_math.sh, and the renamed 7b_rlzero_mix.sh) to utilize the newly defined chat templates.
  • File Renaming: The 7b_rlzero_general.sh script was renamed to 7b_rlzero_mix.sh to better reflect its purpose.
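For context, here is a hypothetical paraphrase of the prompt shape these templates encode. The real templates are chat templates in open_instruct/dataset_transformation.py; `render_rlzero_prompt` and its exact wording are illustrative assumptions, grounded only in the "Answer: $Answer" convention discussed in the review below.

```python
# Illustrative only: a plain-function paraphrase of the olmo_thinker_rlzero
# prompt shape. The real template is a chat template in
# open_instruct/dataset_transformation.py and its wording may differ.
def render_rlzero_prompt(question: str) -> str:
    return (
        "Solve the following problem step by step, showing your reasoning, "
        "then give your final answer on its own line in the form "
        "'Answer: $Answer'.\n\n"
        f"Problem: {question}"
    )

prompt = render_rlzero_prompt("What is 17 * 3?")
assert prompt.endswith("Problem: What is 17 * 3?")
```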


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new rlzero chat templates for standard and code-based problem-solving and updates several training scripts to use them. My review identified a recurring issue in three of the shell scripts where the stop_strings argument is inconsistent with the newly assigned chat templates. This mismatch will likely cause generations to run to their maximum length, which is inefficient and could impact training. I have provided suggestions to comment out the incorrect stop strings. The new chat templates themselves and the other script modifications appear correct.

  --pack_length 18432 \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
- --chat_template_name olmo_thinker \
+ --chat_template_name olmo_thinker_rlzero \

Bug: Stop string mismatch with new templates

The script uses --stop_strings "</answer>" but the olmo_thinker_rlzero template instructs the model to output answers in the form Answer: $Answer, not using <answer> tags. The stop string will never match, causing generation to continue until hitting max length or EOS token instead of stopping at the intended point.
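The mismatch can be sketched concretely; the completion string below is an invented example in the template's answer style, not real model output:

```python
# Invented example: an olmo_thinker_rlzero-style completion ends with
# "Answer: ...", so the old "</answer>" stop string can never match and
# generation only stops at max length or the EOS token.
completion = "First, 17 * 3 = 51.\nAnswer: 51"
stop_string = "</answer>"  # value still passed via --stop_strings in the script

assert stop_string not in completion  # the stop string never fires
```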


@mnoukhov mnoukhov changed the base branch from add-olmo3-scripts to main November 20, 2025 21:04

cluster=ai2/augusta

python mason.py \
Collaborator

uv run?

Contributor Author

changed

--budget ai2/oe-adapt \
-- \
source configs/beaker_configs/ray_node_setup.sh \&\& \
python open_instruct/grpo_fast.py \
Collaborator

put this on the previous line to make clear that it's not a new command? or indent?

same for line 33

Contributor Author

moved and added && to the front so it's clear
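The agreed layout can be sketched as follows (a hypothetical excerpt — most mason.py flags are elided, and the escaped \&\& is what the remote shell later interprets as a command separator):

```shell
# Hypothetical excerpt: the leading \&\& makes it obvious that grpo_fast.py
# continues the same remote command rather than starting a new one.
# (Most mason.py flags elided.)
python mason.py \
    --budget ai2/oe-adapt \
    -- source configs/beaker_configs/ray_node_setup.sh \
    \&\& uv run open_instruct/grpo_fast.py \
    --chat_template_name olmo_thinker_rlzero
```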

@mnoukhov mnoukhov enabled auto-merge November 22, 2025 20:15
#!/bin/bash

MODEL_NAME_OR_PATH="allenai/Olmo-3-1025-7B"
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"

Bug: Duplicate Code dataset in mix script

The DATASETS variable includes allenai/Dolci-RLZero-Code-7B twice (positions 1 and 3), likely instead of including the Math dataset once. The same duplication appears in LOCAL_EVALS. Given separate scripts exist for code, IF, and math training, the mix script should probably include all four dataset types: Code, IF, Math, and General, not Code twice.
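A minimal sketch of the suggested fix — note that allenai/Dolci-RLZero-Math-7B is an assumed name inferred from the sibling 7b_rlzero_math.sh script, not confirmed in this diff:

```shell
# Suggested mix with each dataset type appearing once. Dolci-RLZero-Math-7B
# is an assumed name inferred from 7b_rlzero_math.sh, not confirmed here.
DATASETS="allenai/Dolci-RLZero-Code-7B 1.0 allenai/Dolci-RLZero-IF-7B 1.0 allenai/Dolci-RLZero-Math-7B 1.0 allenai/Dolci-RLZero-General-7B 1.0"

# Sanity check: no dataset name should repeat in the mix.
dupes=$(echo "$DATASETS" | tr ' ' '\n' | grep '^allenai/' | sort | uniq -d)
[ -z "$dupes" ] && echo "no duplicate datasets"
```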


--response_length 16384 \
--pack_length 18432 \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template_name olmo_thinker_rlzero \

Bug: Template mismatch for Code dataset

The script uses olmo_thinker_rlzero template while including the Code dataset. The Code dataset requires olmo_thinker_code_rlzero template which formats responses with code blocks. Using the wrong template will cause the model to generate responses in the wrong format for code problems, breaking the expected output structure with markdown code fences.
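One possible resolution, assuming a single chat template applies to the whole run (flag names are taken from this PR; whether per-dataset templates are supported is not shown in the diff):

```shell
# Hypothetical excerpt: switch the run to the code-specific template added in
# this PR, assuming one chat template applies to the whole mixed run.
uv run open_instruct/grpo_fast.py \
    --response_length 16384 \
    --pack_length 18432 \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --chat_template_name olmo_thinker_code_rlzero
```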


@mnoukhov mnoukhov added this pull request to the merge queue Nov 22, 2025
Merged via the queue into main with commit 70b0472 Nov 22, 2025
6 checks passed