Add cross-model research supervisor (bin/review-cycle) #63
philoengineer wants to merge 2 commits into cybertronai:main
Conversation
Codex users now get the same auto-loaded project context, agent instructions, and sync/writing rules that Claude Code users get from CLAUDE.md and .claude/skills/.

New files:
- CODEX.md: project context (mirrors CLAUDE.md), auto-loaded by Codex
- .codex/config.toml: sandbox, approval, model, and OAuth auth settings
- .codex/AGENTS.md: agent instructions with context loading, the sync routine, and anti-slop writing rules inlined

Updated files:
- AGENTS.md: references both CLAUDE.md and CODEX.md in the key-files table
- docs/tooling/agent-cli-guide.md: expanded Codex section with OAuth setup (ChatGPT subscription login vs. API key), a project-config table showing Claude/Codex equivalents, and an updated comparison table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A supervisor agent reviews recent experiments, the researcher responds,
and the supervisor writes a final verdict. The dialogue is file-mediated
(research/reviews/{cycle-id}.md) so any CLI tool can participate on
either side.
Key design choices:
- Multi-turn dialogue (default 3: review, response, verdict) with a
  configurable --turns flag
- Cross-model: --tool codex --researcher-tool claude lets different
  models play different roles
- File-mediated turns: each turn is a separate CLI invocation, and the
  review file accumulates the full conversation (see the sketch after
  this list)
- Follows bin/run-agent patterns: same dispatch, same tool support
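A minimal sketch of the file-mediated turn pattern (helper names, prompts, and per-tool invocations are assumptions for illustration, not bin/review-cycle internals; the real script follows bin/run-agent's dispatch):

```bash
#!/usr/bin/env bash
# Illustrative only: one way to express "each turn is a separate CLI
# invocation that appends to a shared review file".
set -euo pipefail

supervisor_tool=${1:-codex}
researcher_tool=${2:-claude}
cycle_id="review-$(date +%Y%m%d-%H%M%S)"
review_file="research/reviews/${cycle_id}.md"
mkdir -p research/reviews

run_tool() {  # hypothetical dispatch to each CLI's non-interactive mode
  case "$1" in
    claude) claude -p "$2" ;;
    codex)  codex exec "$2" ;;
    *) echo "unsupported tool: $1" >&2; return 1 ;;
  esac
}

turn() {  # append a heading plus the model's reply to the dialogue so far
  local tool=$1 heading=$2 task=$3
  {
    printf '\n## %s\n\n' "$heading"
    run_tool "$tool" "$task
Dialogue so far:
$(cat "$review_file")"
  } >>"$review_file"
}

: >"$review_file"
turn "$supervisor_tool" "Turn 1 — Supervisor review"   "Review the recent experiments..."
turn "$researcher_tool" "Turn 2 — Researcher response" "Respond to the review above..."
turn "$supervisor_tool" "Turn 3 — Supervisor verdict"  "Write the final verdict..."
```

Because every turn re-reads the whole file before writing, any CLI tool can be dropped into either role without shared state beyond the file itself.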
The supervisor checks:
- Classification honesty (delta_pct direction matches class; see the jq sketch after this list)
- Findings completeness (6 required sections)
- Reproducibility (executable commands present)
- Sample size adequacy
- DISCOVERIES.md cross-reference
- Locked file integrity
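As a concrete instance of the first check, a jq filter can flag mismatches (the delta_pct and class field names come from the list above; the log path and the class labels are assumptions):

```bash
# Flag log entries whose delta_pct sign contradicts their class.
# "improvement"/"regression" and research/log.jsonl are illustrative.
jq -c 'select((.class == "improvement" and .delta_pct <= 0)
           or (.class == "regression"  and .delta_pct >= 0))' \
  research/log.jsonl
```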
New files:
- bin/review-cycle: the main script (~300 lines bash)
- research/reviews/.gitkeep: directory for review dialogue files
Updated docs:
- AGENTS.md, CLAUDE.md, CODEX.md: added review-cycle to tables
- agent-cli-guide.md: added Cross-Model Supervision section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi @philoengineer @zh4ngx — sorry for the long silence on this one. Catching up with the state of things, and I need help deciding the path forward.

**The overlap question.** This PR (…). So we've got two parallel-but-similar efforts and one PR sitting with conflicts.

**Three reasonable paths forward.** (…)

I lean toward (1) or (2) over (3) because the review-cycle pattern (researcher-supervisor multi-turn dialogue) is a clean, narrowly scoped primitive that's easy to test in isolation — exactly the kind of thing that's worth shipping standalone before it gets absorbed into a bigger framework.

@philoengineer — are you able to rebase?
@zh4ngx — does this fit your "main loop" architecture, or are the two efforts actually orthogonal?

Whichever path we pick, thanks for the patience on this one.

agent-0bserver07 (Claude Code) on behalf of Yad
Thanks for the clear framing here. I agree with Path 1.

**The technical shape is good.** Supervisor review -> researcher response -> supervisor verdict gives the workflow a bounded audit loop instead of an open-ended agent conversation. I especially like that (…)

**A few implementation questions** I would want answered during review/rebase: (…)

**This maps directly onto our multi-model eval rotation.** Kimi K2.6, DSv4 Pro, GPT-5.5, and GLM-5.1 are being compared as research/main-loop candidates. I also think this composes naturally with the tooling Andy mentioned (…).

So my vote is: rebase, keep it standalone, and wire it into metastack where parallelism is needed.

— drafted by Kimi K2.6 (OpenCode Go) on behalf of Andy; human-reviewed before posting
Summary
- bin/review-cycle, a standalone script that lets one AI model supervise experiments done by another through a multi-turn dialogue
- Follows bin/run-agent patterns exactly: same dispatch, same tool flags, same resilience

How it works
```bash
# Codex reviews Claude's last 5 experiments
bin/review-cycle --tool codex --researcher-tool claude --last 5
```

Turn 1 — Supervisor reviews: reads log.jsonl + findings docs, checks classifications against integrity rules, assigns per-experiment verdicts (CONFIRMED / RECLASSIFY / REVISION_NEEDED / FLAG)

Turn 2 — Researcher responds: reads the review, fixes acknowledged errors in the actual files (findings docs, log entries), or disputes with evidence

Turn 3 — Supervisor verdict: evaluates responses, writes final status (APPROVED / CORRECTED / DISPUTED / REJECTED)
The dialogue accumulates in research/reviews/{cycle-id}.md — both the conversation medium and the permanent audit trail.
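For concreteness, one plausible shape of a finished review file (the cycle id, experiment id, and per-line formatting are hypothetical; the verdict labels and status line come from the turn descriptions and test plan):

```
# research/reviews/review-20250101-120000.md (hypothetical excerpt)

## Turn 1 — Supervisor review
exp-042: RECLASSIFY (delta_pct is negative but class says improvement)

## Turn 2 — Researcher response
exp-042: acknowledged; findings doc and log entry corrected

## Turn 3 — Supervisor verdict
exp-042: CORRECTED

status: complete
```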
What the supervisor checks

- Does the delta_pct direction match the class?
- Are all 6 required findings sections present?
- Are reproduction commands present and executable?
- Is the sample size adequate?
- Is the experiment cross-referenced in DISCOVERIES.md?
- Are locked files intact?
Options

- --tool: CLI that plays the supervisor
- --researcher-tool: CLI that plays the researcher
- --last N: review the last N experiments
- --turns N: number of dialogue turns (default 3)
- --dry-run: print the prompts without invoking any tool
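A few invocations composed from these flags (combinations mirror the test plan below):

```bash
# Prompts only, no model calls
bin/review-cycle --dry-run --last 3

# Same model on both sides of the dialogue
bin/review-cycle --tool claude --last 3

# Cross-model with an explicit turn count (3 is the default)
bin/review-cycle --tool codex --researcher-tool claude --last 5 --turns 3
```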
Test plan

- bin/review-cycle --help shows usage
- bin/review-cycle --dry-run --last 3 prints 3 prompts (supervisor, researcher, verdict)
- bin/review-cycle --tool claude --last 3 produces research/reviews/review-*.md with the dialogue
- bin/review-cycle --tool codex --researcher-tool claude — cross-model dialogue works
- Review file shows status: complete after the run

🤖 Generated with Claude Code