feat(core): record per-episode cost in EpisodeReport (#253) by larstalian · Pull Request #255 · vecna-labs/open-range

larstalian · 2026-06-09T23:26:26Z

Self-Review

PR title uses a conventional commit style summary
Scope is focused and matches the linked issue or change theme
I reviewed the diff for architectural drift and unintended public API changes
Tests and docs were updated where behavior or workflows changed

Closes #253

Summary

EpisodeReport carried the graded result but no notion of what the episode cost to run, so a "capability per unit cost" comparison had to be reconstructed with an external stopwatch.

Add EpisodeCost(wall_seconds, realize_seconds, turns), populated by the episode loop: wall_seconds is measured with perf_counter from realize start to grade end, realize_seconds covers the setup (realize + reset) portion, and turns is counted from record_turn.
It rides on EpisodeReport as a defaulted field — zero-cost when unset, so the existing direct constructions (SWE, trading, TRL, training tests) are unaffected — and serializes in as_dict for eval runs and the dashboard.
Backing-agnostic: the same field is populated whether the world ran as PROCESS or any future backing, so cost is comparable across the fidelity range.
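As a rough illustration of the shape described above, here is a minimal sketch, not the code in this PR. The field names on EpisodeCost, the "cost" key in as_dict, and the perf_counter placement come from the description; everything else (the `passed` field, the `run_episode` helper, its callback parameters, and the assumption that an unset cost defaults to all zeros) is hypothetical.

```python
from dataclasses import asdict, dataclass, field
from time import perf_counter
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class EpisodeCost:
    # Assumed defaults: an unset cost is all zeros.
    wall_seconds: float = 0.0
    realize_seconds: float = 0.0
    turns: int = 0


@dataclass
class EpisodeReport:
    passed: bool  # hypothetical stand-in for the graded result
    # Defaulted field, so callers that omit cost keep working.
    cost: EpisodeCost = field(default_factory=EpisodeCost)

    def as_dict(self) -> dict[str, Any]:
        return {"passed": self.passed, "cost": asdict(self.cost)}


def run_episode(
    realize: Callable[[], None],
    reset: Callable[[], None],
    solver_turns: Iterable[Any],
    grade: Callable[[], bool],
) -> EpisodeReport:
    """Hypothetical episode loop that fills in EpisodeCost."""
    turns = 0
    start = perf_counter()        # clock starts at realize
    realize()
    reset()
    setup_done = perf_counter()   # end of the setup (realize + reset) portion
    for _ in solver_turns:
        turns += 1                # stands in for counting record_turn calls
    passed = grade()
    end = perf_counter()          # clock stops at grade end
    cost = EpisodeCost(
        wall_seconds=end - start,
        realize_seconds=setup_done - start,
        turns=turns,
    )
    return EpisodeReport(passed=passed, cost=cost)
```

Because both timestamps come from the same monotonic clock, wall_seconds >= realize_seconds >= 0 holds by construction in this sketch, and a report built without a cost argument still serializes a "cost" entry.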

Token attribution is deliberately out of scope here — it needs per-backend usage reporting, a separate piece.

Testing

Verified through the real episode loop (scripted solvers, no LLM): turns matches the number of recorded turns (2 for a two-turn solver, 0 for a no-op); wall_seconds >= realize_seconds >= 0; as_dict()["cost"] carries the three keys. The five test files that construct EpisodeReport directly all still pass, confirming the defaulted field is backward-compatible.

Review Notes

EpisodeCost is exported alongside EpisodeReport at the package top level.
Wall-clock spans realize → grade (excludes teardown, which is cleanup, not episode work). Token usage is the natural follow-up once a backend reports it.
Pairs with the backing work (Manifest/RunConfig honor only seed — wire backing + scale selection #251, merged) so cost is measured at every rung of the fidelity range from the first training run.
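The realize → grade span noted above can be pictured with the following sketch, which stops the clock before cleanup runs. The `timed_episode` helper and its three callbacks are hypothetical names used only for illustration; the real episode loop may be structured differently.

```python
from time import perf_counter
from typing import Any, Callable, Tuple


def timed_episode(
    realize: Callable[[], None],
    grade: Callable[[], Any],
    teardown: Callable[[], None],
) -> Tuple[Any, float]:
    """Hypothetical helper: time realize -> grade, excluding teardown."""
    start = perf_counter()
    try:
        realize()
        result = grade()
        # Stop the clock here so cleanup time is not charged to the episode.
        wall_seconds = perf_counter() - start
    finally:
        teardown()  # always runs, but outside the measured span
    return result, wall_seconds
```

With this layout a slow teardown does not inflate the reported wall-clock figure, which keeps the number focused on episode work.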

EpisodeReport carried the graded result but no notion of what the episode cost to run, so any "capability per unit cost" comparison had to be reconstructed with an external stopwatch.

Add EpisodeCost (wall_seconds, realize_seconds, turns) populated by the episode loop: perf_counter at realize start and grade end, realize_seconds for the setup portion, turns counted from record_turn. It rides on EpisodeReport as a defaulted field (zero-cost when unset, so existing constructions are unaffected) and serializes in as_dict for eval runs and the dashboard.

Backing-agnostic: the same field is populated whether the world ran as PROCESS or any future backing, so cost is comparable across the fidelity range. Token attribution is deferred; it needs per-backend usage reporting, a separate piece.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

larstalian merged commit a1b48e8 into main Jun 10, 2026
2 checks passed

larstalian deleted the feat/253-episode-cost branch June 10, 2026 02:23

github-actions Bot locked and limited conversation to collaborators Jun 10, 2026
