
Add cross-model research supervisor (bin/review-cycle) #63

Open
philoengineer wants to merge 2 commits into cybertronai:main from philoengineer:feat/review-cycle

Conversation

@philoengineer
Contributor

Summary

  • Adds bin/review-cycle, a standalone script that lets one AI model supervise experiments done by another through a multi-turn dialogue
  • Supervisor and researcher alternate turns via a shared review file — works with any CLI tool (Claude, Codex, Gemini, OpenCode)
  • Follows bin/run-agent patterns exactly: same dispatch, same tool flags, same resilience

How it works

# Codex reviews Claude's last 5 experiments
bin/review-cycle --tool codex --researcher-tool claude --last 5

Turn 1 — Supervisor reviews: reads log.jsonl + findings docs, checks classifications against integrity rules, assigns per-experiment verdicts (CONFIRMED / RECLASSIFY / REVISION_NEEDED / FLAG)

Turn 2 — Researcher responds: reads the review, fixes acknowledged errors in the actual files (findings docs, log entries), or disputes with evidence

Turn 3 — Supervisor verdict: evaluates responses, writes final status (APPROVED / CORRECTED / DISPUTED / REJECTED)

The dialogue accumulates in research/reviews/{cycle-id}.md — both the conversation medium and the permanent audit trail.
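
In sketch form, each turn is just one more CLI invocation pointed at the same file. The helper names, prompt text, and stubbed dispatch below are illustrative only, not the script's actual internals:

#!/usr/bin/env bash
# Illustrative sketch of the file-mediated turn loop; dispatch_agent is a
# stand-in for the real per-tool dispatch (which follows bin/run-agent).
set -euo pipefail

SUPERVISOR_TOOL="${1:-claude}"
RESEARCHER_TOOL="${2:-claude}"
CYCLE_ID="review-$(date +%Y%m%d-%H%M%S)"
REVIEW_FILE="research/reviews/${CYCLE_ID}.md"
mkdir -p research/reviews

dispatch_agent() {  # stub: the real script launches the named CLI tool
  local tool="$1" prompt="$2"
  echo "[would launch ${tool}: ${prompt}]"
}

turn() {  # every turn reads the accumulated dialogue, then appends to it
  local tool="$1" role="$2"
  dispatch_agent "$tool" \
    "You are the ${role}. Read ${REVIEW_FILE}, then append your turn."
}

turn "$SUPERVISOR_TOOL" "supervisor writing the initial review"   # turn 1
turn "$RESEARCHER_TOOL" "researcher responding to the review"     # turn 2
turn "$SUPERVISOR_TOOL" "supervisor issuing the final verdict"    # turn 3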

What the supervisor checks

  • Classification honesty: does delta_pct direction match the class? (sketched after this list)
  • Findings completeness: all 6 required sections present?
  • Reproducibility: does the "Can it be reproduced?" section include executable commands?
  • Sample size: claims qualified by data point count?
  • Prior work: DISCOVERIES.md checked for duplicates?
  • Locked files: harness.py and measurement code unmodified?
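
The classification-honesty check is mechanical enough to sketch as a filter over log.jsonl. The field names (classification, id), the class labels, and the sign convention below are all assumptions; only delta_pct appears in this PR:

# Hedged sketch: flag log entries whose delta_pct sign contradicts the
# recorded class. "classification", "id", and the label names are assumed.
jq -r 'select((.classification == "improvement" and .delta_pct <= 0) or
              (.classification == "regression"  and .delta_pct >= 0))
       | "\(.id): \(.classification) but delta_pct=\(.delta_pct)"' log.jsonl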

Options

Flag               Default  Description
--tool             claude   Supervisor CLI
--researcher-tool  claude   Researcher CLI for response turns
--last N           5        Review last N experiments
--turns N          3        Max dialogue turns
--dry-run          off      Print prompts without launching agents
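
These compose; for example, a cross-model dry run over the last three experiments prints every prompt without launching either agent:

# Inspect the prompts for a cross-model cycle without launching agents
bin/review-cycle --tool codex --researcher-tool claude --last 3 --dry-run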

Test plan

  • bin/review-cycle --help shows usage
  • bin/review-cycle --dry-run --last 3 prints 3 prompts (supervisor, researcher, verdict)
  • bin/review-cycle --tool claude --last 3 produces research/reviews/review-*.md with dialogue
  • bin/review-cycle --tool codex --researcher-tool claude — cross-model dialogue works
  • Review file has YAML front-matter with status: complete after run
  • Researcher response turn can modify findings docs (fix issues supervisor flagged)
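
The front-matter check in the second-to-last item can be scripted; a sketch, assuming status: complete sits inside the opening --- block (the exact front-matter layout is not specified in this PR):

for f in research/reviews/review-*.md; do
  # Pass only if "status: complete" appears inside the opening --- block.
  if awk '/^---$/ { fences++; next }
          fences == 1 && /^status: *complete *$/ { ok = 1 }
          END { exit !ok }' "$f"; then
    echo "complete:   $f"
  else
    echo "INCOMPLETE: $f"
  fi
done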

🤖 Generated with Claude Code

philoengineer and others added 2 commits March 27, 2026 22:25
Codex users now get the same auto-loaded project context, agent
instructions, and sync/writing-rules that Claude Code users get
from CLAUDE.md and .claude/skills/.

New files:
- CODEX.md: project context (mirrors CLAUDE.md), auto-loaded by Codex
- .codex/config.toml: sandbox, approval, model, OAuth auth settings
- .codex/AGENTS.md: agent instructions with context loading, sync
  routine, and anti-slop writing rules inlined

Updated files:
- AGENTS.md: references both CLAUDE.md and CODEX.md in key files table
- docs/tooling/agent-cli-guide.md: expanded Codex section with OAuth
  setup (ChatGPT subscription login vs API key), project config table
  showing Claude/Codex equivalents, and updated comparison table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A supervisor agent reviews recent experiments, the researcher responds,
and the supervisor writes a final verdict. The dialogue is file-mediated
(research/reviews/{cycle-id}.md) so any CLI tool can participate on
either side.

Key design choices:
- Multi-turn dialogue (default 3: review, response, verdict) with
  configurable --turns flag
- Cross-model: --tool codex --researcher-tool claude lets different
  models play different roles
- File-mediated turns: each turn is a separate CLI invocation, the
  review file accumulates the full conversation
- Follows bin/run-agent patterns: same dispatch, same tool support

The supervisor checks:
- Classification honesty (delta_pct direction matches class)
- Findings completeness (6 required sections)
- Reproducibility (executable commands present)
- Sample size adequacy
- DISCOVERIES.md cross-reference
- Locked file integrity

New files:
- bin/review-cycle: the main script (~300 lines bash)
- research/reviews/.gitkeep: directory for review dialogue files

Updated docs:
- AGENTS.md, CLAUDE.md, CODEX.md: added review-cycle to tables
- agent-cli-guide.md: added Cross-Model Supervision section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@0bserver07
Collaborator

Hi @philoengineer @zh4ngx — sorry for the long silence on this one. I'm catching up on the state of things and need help deciding the path forward.

The overlap question

This PR (bin/review-cycle, multi-turn supervisor across Claude / Codex / Gemini / OpenCode) covers very similar ground to what @zh4ngx described in the group chat on Apr 22:

"main agent can dispatch to other agents based on what it thinks is the best tool for the job (or if you are milking quotas) ... works across CLI + LLM providers ... I can hop into any chat under any project to take over / watch (only tail -f due to ownership)"

So we've got two parallel-but-similar efforts and one PR sitting with conflicts.

Three reasonable paths forward

  1. Resurrect this PR: @philoengineer rebases against current main, @zh4ngx reviews against his own design, and we merge if compatible. Best if the two designs complement each other (this one's review-cycle pattern is narrower / more focused than Andy's general dispatch).
  2. Adopt the design, rewrite under Andy's architecture: @zh4ngx pulls the bin/review-cycle pattern into his "main loop" agent as one of its built-in workflows. Closes this PR with credit.
  3. Close this PR as superseded — Andy's broader work covers the same need. @philoengineer's design influences the integration.

I lean toward (1) or (2) over (3) because the review-cycle pattern (researcher-supervisor multi-turn dialogue) is a clean, narrowly scoped primitive that's easy to test in isolation — exactly the kind of thing that's worth shipping standalone before it gets absorbed into a bigger framework.

@philoengineer — are you able to rebase? @zh4ngx — does this fit your "main loop" architecture or are they actually orthogonal?

Whichever path we pick, thanks for the patience on this one.


agent-0bserver07 (Claude Code) on behalf of Yad

@zh4ngx
Collaborator

zh4ngx commented Apr 30, 2026

Thanks for the clear framing here. I agree with Path 1: review-cycle is worth resurrecting as a standalone primitive rather than absorbing into session-query or the broader main-loop lifecycle.

The technical shape is good: supervisor review -> researcher response -> supervisor verdict gives the workflow a bounded audit loop instead of an open-ended agent conversation. I especially like that research/reviews/{cycle-id}.md is both transport and durable record; that keeps the tool CLI-agnostic and inspectable after agents exit.

A few implementation questions I would want answered during review/rebase:

  1. How does review-cycle represent model disagreement? If the researcher disputes a RECLASSIFY with evidence and the supervisor still disagrees, does the final state become DISPUTED, REJECTED, or something more granular?
  2. What happens when the researcher refuses, stalls, or edits only part of the requested fix? Does the supervisor verify filesystem/log changes directly, or only judge the written response?
  3. Are the integrity checks deterministic enough that we can compare supervisor models, or are they mostly prompt-level instructions today?
  4. Does --turns N generalize beyond the default 3-turn cycle cleanly, or is the third turn semantically special as the final verdict?

This maps directly onto our multi-model eval rotation: Kimi K2.6, DSv4 Pro, GPT-5.5, and GLM-5.1 are being compared as research/main-loop candidates, and review-cycle would let us run structured cross-review instead of ad hoc transcript reading.

I also think this composes naturally with the tooling Andy mentioned: zellij-mcp provides the pane/control fabric, metastack handles parallel dispatch/DAG orchestration, and review-cycle can remain the focused researcher-supervisor audit loop that those systems call into.

So my vote is: rebase, keep it standalone, and wire it into metastack where parallelism is needed.

— drafted by Kimi K2.6 (OpenCode Go) on behalf of Andy; human-reviewed before posting

