
Add cross-model research supervisor (bin/review-cycle) #63

Open
philoengineer wants to merge 2 commits into cybertronai:main from philoengineer:feat/review-cycle

Conversation

@philoengineer
Contributor

Summary

  • Adds bin/review-cycle, a standalone script that lets one AI model supervise experiments done by another through a multi-turn dialogue
  • Supervisor and researcher alternate turns via a shared review file — works with any CLI tool (Claude, Codex, Gemini, OpenCode)
  • Follows bin/run-agent patterns exactly: same dispatch, same tool flags, same resilience

How it works

# Codex reviews Claude's last 5 experiments
bin/review-cycle --tool codex --researcher-tool claude --last 5

Turn 1 — Supervisor reviews: reads log.jsonl + findings docs, checks classifications against integrity rules, assigns per-experiment verdicts (CONFIRMED / RECLASSIFY / REVISION_NEEDED / FLAG)

Turn 2 — Researcher responds: reads the review, fixes acknowledged errors in the actual files (findings docs, log entries), or disputes with evidence

Turn 3 — Supervisor verdict: evaluates responses, writes final status (APPROVED / CORRECTED / DISPUTED / REJECTED)

The dialogue accumulates in research/reviews/{cycle-id}.md — both the conversation medium and the permanent audit trail.
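
In sketch form, each turn is just one more CLI invocation pointed at the same file. The helper names, prompt text, and stubbed dispatch below are illustrative only, not the script's actual internals:

#!/usr/bin/env bash
# Illustrative sketch of the file-mediated turn loop; dispatch_agent is a
# stand-in for the real per-tool dispatch (which follows bin/run-agent).
set -euo pipefail

SUPERVISOR_TOOL="${1:-claude}"
RESEARCHER_TOOL="${2:-claude}"
CYCLE_ID="review-$(date +%Y%m%d-%H%M%S)"
REVIEW_FILE="research/reviews/${CYCLE_ID}.md"
mkdir -p research/reviews

dispatch_agent() {  # stub: the real script launches the named CLI tool
  local tool="$1" prompt="$2"
  echo "[would launch ${tool}: ${prompt}]"
}

turn() {  # every turn reads the accumulated dialogue, then appends to it
  local tool="$1" role="$2"
  dispatch_agent "$tool" \
    "You are the ${role}. Read ${REVIEW_FILE}, then append your turn."
}

turn "$SUPERVISOR_TOOL" "supervisor writing the initial review"   # turn 1
turn "$RESEARCHER_TOOL" "researcher responding to the review"     # turn 2
turn "$SUPERVISOR_TOOL" "supervisor issuing the final verdict"    # turn 3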

What the supervisor checks

  • Classification honesty: does delta_pct direction match the class? (sketched after this list)
  • Findings completeness: all 6 required sections present?
  • Reproducibility: does the "Can it be reproduced?" section include executable commands?
  • Sample size: claims qualified by data point count?
  • Prior work: DISCOVERIES.md checked for duplicates?
  • Locked files: harness.py and measurement code unmodified?
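
The classification-honesty check is mechanical enough to sketch as a filter over log.jsonl. The field names (classification, id), the class labels, and the sign convention below are all assumptions; only delta_pct appears in this PR:

# Hedged sketch: flag log entries whose delta_pct sign contradicts the
# recorded class. "classification", "id", and the label names are assumed.
jq -r 'select((.classification == "improvement" and .delta_pct <= 0) or
              (.classification == "regression"  and .delta_pct >= 0))
       | "\(.id): \(.classification) but delta_pct=\(.delta_pct)"' log.jsonl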

Options

Flag               Default  Description
--tool             claude   Supervisor CLI
--researcher-tool  claude   Researcher CLI for response turns
--last N           5        Review last N experiments
--turns N          3        Max dialogue turns
--dry-run          off      Print prompts without launching agents
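
These compose; for example, a cross-model dry run over the last three experiments prints every prompt without launching either agent:

# Inspect the prompts for a cross-model cycle without launching agents
bin/review-cycle --tool codex --researcher-tool claude --last 3 --dry-run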

Test plan

  • bin/review-cycle --help shows usage
  • bin/review-cycle --dry-run --last 3 prints 3 prompts (supervisor, researcher, verdict)
  • bin/review-cycle --tool claude --last 3 produces research/reviews/review-*.md with dialogue
  • bin/review-cycle --tool codex --researcher-tool claude — cross-model dialogue works
  • Review file has YAML front-matter with status: complete after run
  • Researcher response turn can modify findings docs (fix issues supervisor flagged)
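
The front-matter check in the second-to-last item can be scripted; a sketch, assuming status: complete sits inside the opening --- block (the exact front-matter layout is not specified in this PR):

for f in research/reviews/review-*.md; do
  # Pass only if "status: complete" appears inside the opening --- block.
  if awk '/^---$/ { fences++; next }
          fences == 1 && /^status: *complete *$/ { ok = 1 }
          END { exit !ok }' "$f"; then
    echo "complete:   $f"
  else
    echo "INCOMPLETE: $f"
  fi
done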

🤖 Generated with Claude Code

philoengineer and others added 2 commits March 27, 2026 22:25
Codex users now get the same auto-loaded project context, agent
instructions, and sync/writing-rules that Claude Code users get
from CLAUDE.md and .claude/skills/.

New files:
- CODEX.md: project context (mirrors CLAUDE.md), auto-loaded by Codex
- .codex/config.toml: sandbox, approval, model, OAuth auth settings
- .codex/AGENTS.md: agent instructions with context loading, sync
  routine, and anti-slop writing rules inlined

Updated files:
- AGENTS.md: references both CLAUDE.md and CODEX.md in key files table
- docs/tooling/agent-cli-guide.md: expanded Codex section with OAuth
  setup (ChatGPT subscription login vs API key), project config table
  showing Claude/Codex equivalents, and updated comparison table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A supervisor agent reviews recent experiments, the researcher responds,
and the supervisor writes a final verdict. The dialogue is file-mediated
(research/reviews/{cycle-id}.md) so any CLI tool can participate on
either side.

Key design choices:
- Multi-turn dialogue (default 3: review, response, verdict) with
  configurable --turns flag
- Cross-model: --tool codex --researcher-tool claude lets different
  models play different roles
- File-mediated turns: each turn is a separate CLI invocation, the
  review file accumulates the full conversation
- Follows bin/run-agent patterns: same dispatch, same tool support

The supervisor checks:
- Classification honesty (delta_pct direction matches class)
- Findings completeness (6 required sections)
- Reproducibility (executable commands present)
- Sample size adequacy
- DISCOVERIES.md cross-reference
- Locked file integrity

New files:
- bin/review-cycle: the main script (~300 lines bash)
- research/reviews/.gitkeep: directory for review dialogue files

Updated docs:
- AGENTS.md, CLAUDE.md, CODEX.md: added review-cycle to tables
- agent-cli-guide.md: added Cross-Model Supervision section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@0bserver07
Collaborator

Hi @philoengineer @zh4ngx — sorry for the long silence on this one. I'm catching up on the state of things and need help deciding the path forward.

The overlap question

This PR (bin/review-cycle, multi-turn supervisor across Claude / Codex / Gemini / OpenCode) covers very similar ground to what @zh4ngx described in the group chat on Apr 22:

"main agent can dispatch to other agents based on what it thinks is the best tool for the job (or if you are milking quotas) ... works across CLI + LLM providers ... I can hop into any chat under any project to take over / watch (only tail -f due to ownership)"

So we've got two parallel-but-similar efforts and one PR sitting with conflicts.

Three reasonable paths forward

  1. Resurrect this PR: @philoengineer rebases against current main, @zh4ngx reviews against his own design, and we merge if compatible. Best if the two designs complement each other (this one's review-cycle pattern is narrower / more focused than Andy's general dispatch).
  2. Adopt the design, rewrite under Andy's architecture: @zh4ngx pulls the bin/review-cycle pattern into his "main loop" agent as one of its built-in workflows. Closes this PR with credit.
  3. Close this PR as superseded — Andy's broader work covers the same need. @philoengineer's design influences the integration.

I lean toward (1) or (2) over (3) because the review-cycle pattern (researcher-supervisor multi-turn dialogue) is a clean, narrowly scoped primitive that's easy to test in isolation — exactly the kind of thing that's worth shipping standalone before it gets absorbed into a bigger framework.

@philoengineer — are you able to rebase? @zh4ngx — does this fit your "main loop" architecture or are they actually orthogonal?

Whichever path we pick, thanks for the patience on this one.


agent-0bserver07 (Claude Code) on behalf of Yad

@zh4ngx
Collaborator

zh4ngx commented Apr 30, 2026

Thanks for the clear framing here. I agree with Path 1: review-cycle is worth resurrecting as a standalone primitive rather than absorbing into session-query or the broader main-loop lifecycle.

The technical shape is good: supervisor review -> researcher response -> supervisor verdict gives the workflow a bounded audit loop instead of an open-ended agent conversation. I especially like that research/reviews/{cycle-id}.md is both transport and durable record; that keeps the tool CLI-agnostic and inspectable after agents exit.

A few implementation questions I would want answered during review/rebase:

  1. How does review-cycle represent model disagreement? If the researcher disputes a RECLASSIFY with evidence and the supervisor still disagrees, does the final state become DISPUTED, REJECTED, or something more granular?
  2. What happens when the researcher refuses, stalls, or edits only part of the requested fix? Does the supervisor verify filesystem/log changes directly, or only judge the written response?
  3. Are the integrity checks deterministic enough that we can compare supervisor models, or are they mostly prompt-level instructions today?
  4. Does --turns N generalize beyond the default 3-turn cycle cleanly, or is the third turn semantically special as the final verdict?

This maps directly onto our multi-model eval rotation: Kimi K2.6, DSv4 Pro, GPT-5.5, and GLM-5.1 are being compared as research/main-loop candidates, and review-cycle would let us run structured cross-review instead of ad hoc transcript reading.

I also think this composes naturally with the tooling Andy mentioned: zellij-mcp provides the pane/control fabric, metastack handles parallel dispatch/DAG orchestration, and review-cycle can remain the focused researcher-supervisor audit loop that those systems call into.

So my vote is: rebase, keep it standalone, and wire it into metastack where parallelism is needed.

— drafted by Kimi K2.6 (OpenCode Go) on behalf of Andy; human-reviewed before posting

