
Conversation

@mnoukhov (Contributor) commented Nov 18, 2025

Measure the training steps between:

  • adding a prompt to the queue
  • starting generation of the prompt
  • using the prompt for training

Rename param_prompt_Q to just prompt_Q since it no longer contains params.


Note

Adds staleness metrics (queue→generation→consume) wired through ActorManager, vLLM actor, batching, and logs; renames param_prompt_Q to prompt_Q.

  • Metrics/Observability:
    • Track training step staleness via GenerationResult (queued_training_step, generation_started_training_step).
    • ActorManager: store/expose current training step; main loop updates it each step.
    • vLLM actor: capture current step when dequeuing, propagate via request metadata; process_completed_request sets fields.
    • BatchStatistics: add staleness fields; compute staleness/queue_to_generation and staleness/generation_to_consume (plus their means) and log them in training metrics (see the sketch after this list).
    • Update tests to validate new fields.
  • Queue API:
    • Rename param_prompt_Q → prompt_Q across code paths (add_prompt_to_generator, data_preparation_thread, run_training, create_model_and_optimizer, main).
  • Misc:
    • Minor dashboard/manager plumbing for KV cache and timing is unchanged; step reporting is integrated with the metrics.
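
For concreteness, here is a minimal sketch of how the new fields could fit together. The field names (queued_training_step, generation_started_training_step, staleness_*) come from the note above, but the dataclass shapes and the metric helper are simplified stand-ins, not the actual open_instruct implementation:

from dataclasses import dataclass, field

@dataclass
class GenerationResult:
    # Simplified: the real result also carries the generated tokens, rewards, etc.
    queued_training_step: int | None = None              # step when the prompt entered prompt_Q
    generation_started_training_step: int | None = None  # step when the vLLM actor dequeued it

@dataclass
class BatchStatistics:
    # Per-prompt step records, plus the single step at which the batch was consumed.
    staleness_queue_steps: list = field(default_factory=list)
    staleness_generation_start_steps: list = field(default_factory=list)
    staleness_consumed_step: int | None = None

def staleness_metrics(stats: BatchStatistics) -> dict:
    # Queue -> generation: how many steps each prompt waited before generation began.
    q2g = [
        g - q
        for q, g in zip(stats.staleness_queue_steps, stats.staleness_generation_start_steps)
        if q is not None and g is not None
    ]
    # Generation -> consume: steps between generation start and training on the batch.
    g2c = [
        stats.staleness_consumed_step - g
        for g in stats.staleness_generation_start_steps
        if g is not None and stats.staleness_consumed_step is not None
    ]
    return {
        "staleness/queue_to_generation_mean": sum(q2g) / len(q2g) if q2g else 0.0,
        "staleness/generation_to_consume_mean": sum(g2c) / len(g2c) if g2c else 0.0,
    }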

Written by Cursor Bugbot for commit ac22910. This will update automatically on new commits.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a comprehensive system for measuring and reporting prompt staleness in the training pipeline. By tracking prompts from initial queuing, through generation, to consumption in training, the changes surface bottlenecks in the prompt processing flow and improve visibility into training efficiency. A key queue was also renamed for clarity and consistency.

Highlights

  • Staleness Metrics Introduction: New metrics track the 'staleness' of prompts throughout their lifecycle within the training pipeline. This includes measuring the time (in training steps) a prompt spends in the queue before generation, and the time from generation start until it is consumed for training.
  • Prompt Lifecycle Tracking: The system now records the current training step at three critical points for each prompt: when it is added to the queue, when its generation begins, and when it is used for training. These recorded steps enable the staleness calculations (see the sketch after this list).
  • Queue Renaming: The param_prompt_Q queue has been renamed to prompt_Q across the codebase to reflect its current role, as it no longer carries parameters.
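
A rough sketch of the step-recording plumbing described above. The method names (set_training_step, get_training_step) and the PromptRequest shape are illustrative assumptions; in the PR the step travels through vLLM request metadata rather than a plain dataclass:

import threading
from dataclasses import dataclass

@dataclass
class PromptRequest:
    prompt: str
    queued_training_step: int | None = None
    generation_started_training_step: int | None = None

class ActorManager:
    # Holds the trainer's current step; the main loop updates it once per step.
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._training_step = 0

    def set_training_step(self, step: int) -> None:
        with self._lock:
            self._training_step = step

    def get_training_step(self) -> int:
        with self._lock:
            return self._training_step

manager = ActorManager()
manager.set_training_step(7)

# Enqueue side: stamp the step at which the prompt entered the queue.
request = PromptRequest("2+2=", queued_training_step=manager.get_training_step())

# Dequeue side (vLLM actor): stamp the step at which generation actually began.
manager.set_training_step(9)
request.generation_started_training_step = manager.get_training_step()
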
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces staleness metrics to measure the number of training steps prompts spend in different stages of the pipeline, and also renames param_prompt_Q to prompt_Q. The refactoring is well executed. The logic for adding staleness metrics is mostly sound, but I've identified a critical bug in open_instruct/grpo_fast.py within the staleness calculation logic that would lead to a runtime error. I've provided a detailed comment and a code suggestion to fix this. Additionally, I've included a medium-severity comment for a minor code simplification in the same file. The other changes in actor_manager.py, queue_types.py, vllm_utils.py, and the test files look good and correctly support the new feature.

Comment on lines +2098 to +2114
# calculate staleness metrics
queue_steps = batch_stats.pop("staleness_queue_steps", [])
generation_steps = batch_stats.pop("staleness_generation_start_steps", [])
consumed_steps = batch_stats.pop("staleness_consumed_step", [])

queue_to_generation = []
generation_to_consume = []
for queue_step, generation_step, consume_step in zip(queue_steps, generation_steps, consumed_steps):
    queue_to_generation.append(generation_step - queue_step)
    generation_to_consume.append(consume_step - generation_step)

staleness_metrics = {
    "staleness/queue_to_generation": queue_to_generation,
    "staleness/queue_to_generation_mean": np.mean(queue_to_generation),
    "staleness/generation_to_consume": generation_to_consume,
    "staleness/generation_to_consume_mean": np.mean(generation_to_consume),
}

critical

There are a few issues in this block that will cause runtime errors and incorrect behavior:

  1. AttributeError: You are calling .pop() on batch_stats, which is a dataclass instance and does not have a pop method. You should be using batch_metrics, which is the dictionary created from asdict(batch_stats).
  2. TypeError: batch_metrics.pop("staleness_consumed_step", []) will return an integer. The subsequent zip call attempts to iterate over this integer (consumed_steps), which will raise a TypeError. The consumed step is the same for all items in the batch and should be treated as a scalar.
  3. None values: The queue_steps and generation_steps lists can contain None values. The subtraction generation_step - queue_step will raise a TypeError if either value is None. These cases should be handled.
  4. RuntimeWarning from np.mean: If queue_to_generation or generation_to_consume is an empty list (e.g., if all steps were None), np.mean will return NaN and raise a RuntimeWarning. It's better to handle this case and return 0.0 (both this and issue 2 are reproduced in the snippet below).
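
Both failure modes can be reproduced standalone as a quick sanity check, assuming only NumPy:

import warnings
import numpy as np

# Issue 2: zip() rejects the scalar outright.
try:
    list(zip([1, 2], 5))
except TypeError as e:
    print(e)  # zip argument #2 must support iteration

# Issue 4: np.mean of an empty list warns and yields NaN.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(np.mean([]))         # nan
    print(caught[0].category)  # <class 'RuntimeWarning'>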

I've provided a suggestion that fixes these issues by using batch_metrics, handling the scalar consumed_step correctly, checking for None before subtraction, and safely calculating the mean.

Suggested change
-# calculate staleness metrics
-queue_steps = batch_stats.pop("staleness_queue_steps", [])
-generation_steps = batch_stats.pop("staleness_generation_start_steps", [])
-consumed_steps = batch_stats.pop("staleness_consumed_step", [])
-queue_to_generation = []
-generation_to_consume = []
-for queue_step, generation_step, consume_step in zip(queue_steps, generation_steps, consumed_steps):
-    queue_to_generation.append(generation_step - queue_step)
-    generation_to_consume.append(consume_step - generation_step)
-staleness_metrics = {
-    "staleness/queue_to_generation": queue_to_generation,
-    "staleness/queue_to_generation_mean": np.mean(queue_to_generation),
-    "staleness/generation_to_consume": generation_to_consume,
-    "staleness/generation_to_consume_mean": np.mean(generation_to_consume),
-}
+# calculate staleness metrics
+queue_steps = batch_metrics.pop("staleness_queue_steps", [])
+generation_steps = batch_metrics.pop("staleness_generation_start_steps", [])
+consumed_step = batch_metrics.pop("staleness_consumed_step", None)
+queue_to_generation = []
+generation_to_consume = []
+if consumed_step is not None:
+    for q_step, g_step in zip(queue_steps, generation_steps):
+        if q_step is not None and g_step is not None:
+            queue_to_generation.append(g_step - q_step)
+            generation_to_consume.append(consumed_step - g_step)
+staleness_metrics = {
+    "staleness/queue_to_generation": queue_to_generation,
+    "staleness/queue_to_generation_mean": np.mean(queue_to_generation) if queue_to_generation else 0.0,
+    "staleness/generation_to_consume": generation_to_consume,
+    "staleness/generation_to_consume_mean": np.mean(generation_to_consume) if generation_to_consume else 0.0,
+}

Comment on lines +1931 to +1932
    staleness_queue_steps=staleness_queue_steps if staleness_queue_steps else [],
    staleness_generation_start_steps=staleness_generation_start_steps if staleness_generation_start_steps else [],

medium

The conditional assignments here are redundant. The variables staleness_queue_steps and staleness_generation_start_steps are initialized as empty lists. If they remain empty, the expression ... if [] else [] evaluates to []. If they are populated, the expression ... if [1, 2] else [] evaluates to [1, 2]. In both cases, the result is the same as the original variable. You can simplify this by directly assigning the variables.

Suggested change
-staleness_queue_steps=staleness_queue_steps if staleness_queue_steps else [],
-staleness_generation_start_steps=staleness_generation_start_steps if staleness_generation_start_steps else [],
+staleness_queue_steps=staleness_queue_steps,
+staleness_generation_start_steps=staleness_generation_start_steps,

generation_to_consume = []
for queue_step, generation_step, consume_step in zip(queue_steps, generation_steps, consumed_steps):
    queue_to_generation.append(generation_step - queue_step)
    generation_to_consume.append(consume_step - generation_step)

Bug: Zip Error: Single Value Not Iterable

The staleness_consumed_step field in BatchStatistics is a single int | None value, but the code tries to zip() it with two lists (queue_steps and generation_steps). This causes a runtime error because zip() expects an iterable, not a single integer. The consumed step should be repeated for each prompt or the code should be restructured to handle the single value properly.
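
One way to restructure the loop under that constraint, assuming staleness_consumed_step stays a scalar (a sketch, not the committed fix):

from itertools import repeat

queue_steps = [10, 11, 12]       # per-prompt enqueue steps
generation_steps = [12, 12, 13]  # per-prompt generation-start steps
consumed_step = 15               # one value for the whole training batch

queue_to_generation = []
generation_to_consume = []
# repeat() yields the scalar indefinitely, so zip() can pair it with each prompt.
for queue_step, generation_step, consume_step in zip(queue_steps, generation_steps, repeat(consumed_step)):
    queue_to_generation.append(generation_step - queue_step)
    generation_to_consume.append(consume_step - generation_step)

print(queue_to_generation)    # [2, 1, 1]
print(generation_to_consume)  # [3, 3, 2]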


generation_to_consume = []
for queue_step, generation_step, consume_step in zip(queue_steps, generation_steps, consumed_steps):
    queue_to_generation.append(generation_step - queue_step)
    generation_to_consume.append(consume_step - generation_step)

Bug: Staleness Calculation: Robustness Against Missing Data

The staleness calculation performs arithmetic on queue_step, generation_step, and consume_step values that can be None. When _get_current_training_step() fails or when metadata keys are missing, these values default to None, causing TypeError: unsupported operand type(s) for -: 'NoneType' and 'int' when attempting subtraction. The code needs to handle None values before performing arithmetic operations.
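
A defensive variant, assuming None marks a failed or missing step lookup (names are illustrative, not the committed fix):

def safe_delta(later, earlier):
    # Return later - earlier, or None when either training step is missing.
    if later is None or earlier is None:
        return None
    return later - earlier

queue_steps = [10, None, 12]
generation_steps = [12, 12, None]

# Drop unmeasurable pairs instead of letting the subtraction raise TypeError.
queue_to_generation = [
    d
    for d in (safe_delta(g, q) for q, g in zip(queue_steps, generation_steps))
    if d is not None
]
print(queue_to_generation)  # [2]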

