feat(core): record per-episode cost in EpisodeReport (#253) #255

Merged: larstalian merged 1 commit into main from feat/253-episode-cost on Jun 10, 2026

Conversation

@larstalian

Collaborator

Self-Review

  • PR title uses a conventional commit style summary
  • Scope is focused and matches the linked issue or change theme
  • I reviewed the diff for architectural drift and unintended public API changes
  • Tests and docs were updated where behavior or workflows changed

Closes #253

Summary

EpisodeReport carried the graded result but no notion of what the episode cost to run, so a "capability per unit cost" comparison had to be reconstructed with an external stopwatch.

  • Add EpisodeCost(wall_seconds, realize_seconds, turns), populated by the episode loop: wall_seconds is measured with perf_counter from realize start to grade end, realize_seconds covers the setup (realize + reset) portion, and turns is counted from record_turn.
  • It rides on EpisodeReport as a defaulted field (an all-zero cost when unset), so the existing direct constructions (SWE, trading, TRL, training tests) are unaffected, and it serializes in as_dict for eval runs and the dashboard.
  • Backing-agnostic: the same field is populated whether the world ran as PROCESS or any future backing, so cost is comparable across the fidelity range.
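
To make the shape of the change concrete, here is a minimal sketch of how a defaulted cost field can ride on a report without breaking existing constructions. Only the names EpisodeCost, EpisodeReport, as_dict, and the three cost fields come from this PR; the `score` field, the field types, and the use of dataclasses are assumptions for illustration and may not match the actual implementation.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EpisodeCost:
    # Field names are from the PR; the types and zero defaults are assumed.
    wall_seconds: float = 0.0
    realize_seconds: float = 0.0
    turns: int = 0


@dataclass
class EpisodeReport:
    # `score` is a hypothetical stand-in for the graded result.
    score: float
    # Defaulted field: callers that never pass `cost` get an all-zero cost.
    cost: EpisodeCost = field(default_factory=EpisodeCost)

    def as_dict(self) -> dict:
        # The cost serializes under a "cost" key with its three fields.
        return {"score": self.score, "cost": asdict(self.cost)}


# An existing-style construction that knows nothing about cost still works.
legacy = EpisodeReport(score=1.0)
# A construction from the episode loop would pass a populated cost.
measured = EpisodeReport(score=1.0, cost=EpisodeCost(2.5, 0.5, 2))
```

Because the default is built by a factory and the cost object is frozen, a report that never sets `cost` cannot accidentally share or mutate another report's cost.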

Token attribution is deliberately out of scope here — it needs per-backend usage reporting, a separate piece.

Testing

Verified through the real episode loop (scripted solvers, no LLM): turns matches the number of recorded turns (2 for a two-turn solver, 0 for a no-op); wall_seconds >= realize_seconds >= 0; as_dict()["cost"] carries the three keys. The five test files that construct EpisodeReport directly all still pass, confirming the defaulted field is backward-compatible.
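
The checks described above can be sketched against a toy episode loop. Everything except the EpisodeCost field names, record_turn, and perf_counter is hypothetical here: `ScriptedWorld`, `run_episode`, and the list-of-actions solver are stand-ins, not the project's real loop or test helpers.

```python
from dataclasses import dataclass
from time import perf_counter


@dataclass(frozen=True)
class EpisodeCost:
    wall_seconds: float = 0.0
    realize_seconds: float = 0.0
    turns: int = 0


class ScriptedWorld:
    """Hypothetical stand-in world; no LLM is involved."""

    def realize(self) -> None:
        pass

    def reset(self) -> None:
        pass

    def grade(self) -> float:
        return 1.0


def run_episode(world: ScriptedWorld, scripted_actions: list) -> tuple:
    """Run a scripted solver and return (score, EpisodeCost)."""
    turns = 0

    def record_turn(action) -> None:
        # Each recorded turn increments the counter that ends up in the cost.
        nonlocal turns
        turns += 1

    start = perf_counter()          # realize start
    world.realize()
    world.reset()
    realized = perf_counter()       # end of setup (realize + reset)
    for action in scripted_actions:
        record_turn(action)
    score = world.grade()
    graded = perf_counter()         # grade end
    return score, EpisodeCost(graded - start, realized - start, turns)


_, two_turn_cost = run_episode(ScriptedWorld(), ["step-1", "step-2"])
_, noop_cost = run_episode(ScriptedWorld(), [])
```

Since perf_counter is monotonic, the ordering wall_seconds >= realize_seconds >= 0 holds by construction in this sketch.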

Review Notes

  • EpisodeCost is exported alongside EpisodeReport at the package top level.
  • Wall-clock spans realize → grade (excludes teardown, which is cleanup, not episode work). Token usage is the natural follow-up once a backend reports it.
  • Pairs with the backing work (Manifest/RunConfig honor only seed — wire backing + scale selection #251, merged) so cost is measured at every rung of the fidelity range from the first training run.
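
The realize → grade span, with teardown excluded, can be illustrated with an injectable clock so the numbers are deterministic. This is a sketch under assumptions: `timed_episode`, its callable parameters, and the fake clock are hypothetical and only mirror the measurement boundaries described in the notes above.

```python
from itertools import count


def timed_episode(realize, reset, run_and_grade, teardown, clock):
    """Measure setup and total episode time; teardown runs outside the span."""
    start = clock()                 # realize start
    realize()
    reset()
    realized = clock()              # setup (realize + reset) done
    try:
        result = run_and_grade()
        graded = clock()            # grade end: the wall-clock span stops here
    finally:
        teardown()                  # cleanup, deliberately not measured
    cost = {
        "wall_seconds": graded - start,
        "realize_seconds": realized - start,
    }
    return result, cost


# Fake clock: every call advances time by exactly one second.
_ticks = count()


def fake_clock() -> float:
    return float(next(_ticks))


def slow_teardown() -> None:
    # Burn ten fake seconds to show that cleanup time is not counted.
    for _ in range(10):
        fake_clock()


result, cost = timed_episode(
    realize=lambda: None,
    reset=lambda: None,
    run_and_grade=lambda: "graded",
    teardown=slow_teardown,
    clock=fake_clock,
)
```

With this fake clock the measured span is two seconds regardless of how long teardown takes, which is the property the note about excluding teardown describes.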

EpisodeReport carried the graded result but no notion of what the episode
cost to run, so any "capability per unit cost" comparison had to be
reconstructed with an external stopwatch.

Add EpisodeCost (wall_seconds, realize_seconds, turns) populated by the
episode loop: perf_counter at realize start and grade end, realize_seconds
for the setup portion, turns counted from record_turn. It rides on
EpisodeReport as a defaulted field (zero-cost when unset, so existing
constructions are unaffected) and serializes in as_dict for eval runs and
the dashboard.

Backing-agnostic: the same field is populated whether the world ran as
PROCESS or any future backing, so cost is comparable across the fidelity
range. Token attribution is deferred — it needs per-backend usage
reporting, a separate piece.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@larstalian merged commit a1b48e8 into main on Jun 10, 2026
2 checks passed
@larstalian deleted the feat/253-episode-cost branch on June 10, 2026 02:23
@github-actions (bot) locked and limited conversation to collaborators on Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

Record per-episode cost (wall-clock, tokens) in EpisodeReport