cybertronai · philoengineer · Mar 28, 2026
diff --git a/.codex/AGENTS.md b/.codex/AGENTS.md
@@ -0,0 +1,92 @@
+# Codex Project Instructions
+
+These instructions apply to all Codex CLI sessions in this repo.
+
+## Before Any Research Task
+
+Load current project state before running experiments, reviewing PRs, or writing findings.
+
+1. Read these files in order:
+   - **CODEX.md** -- project context, current best methods, constraints
+   - **DISCOVERIES.md** -- what's proven, what failed, open questions (bottom of file)
+   - **AGENT.md** -- machine-executable experiment loop (if running autonomously)
+   - **LAB.md** -- experiment protocol, rules (especially rule #9: metric isolation)
+
+2. Check recent Telegram activity (if synced):
+
+```python
+import json
+for f in ['chat-yad.json', 'chat-yaroslav.json', 'challenge-1-sparse-parity.json']:
+    path = f'src/sparse_parity/telegram_sync/{f}'
+    try:
+        msgs = json.load(open(path))
+        print(f'\n=== {f} (last 3) ===')
+        for m in msgs[:3]:
+            print(f"  [{m['date'][:10]}] {m['sender']}: {m['text'][:150]}")
+    except FileNotFoundError:
+        print(f'{f} not found -- run: bun run sync_telegram.ts')
+```
+
+3. Check GitHub for open work:
+
+```bash
+gh pr list --repo cybertronai/SutroYaro --state open
+gh issue list --repo cybertronai/SutroYaro --state open
+```
+
+4. Before writing code, check:
+   - `research/search_space.yaml` for allowed parameter ranges
+   - `research/questions.yaml` for the dependency graph of open questions
+
+## Current State
+
+| Fact | Value |
+|------|-------|
+| Best method | GF(2) Gaussian elimination, 509us, ARD ~500 |
+| Best energy proxy | DMC (Data Movement Complexity, Ding et al.) |
+| Experiments done | 33+ (see `research/log.jsonl`) |
+| Open questions | Bottom of DISCOVERIES.md (Q7, Q11-Q13 still open) |
+| Next milestone | Energy-efficient nanoGPT training ("final exam") |
+| Meeting cadence | Mondays 18:00 at South Park Commons |
+
+## Sync Routine
+
+Run at session start and before any push:
+
+```bash
+# Telegram (daily)
+bun run sync_telegram.ts
+
+# Or use the targeted read/send scripts:
+bun telegram/tg-read.ts --topic "General" --limit 10
+bun telegram/tg-send.ts --topic "agents" --message "Status update"
+
+# Google Docs (weekly, after Monday meetings)
+python3 src/sync_google_docs.py
+
+# GitHub
+gh pr list --repo cybertronai/SutroYaro --state open
+gh issue list --repo cybertronai/SutroYaro --state open
+```
+
+Before pushing:
+1. Update `docs/changelog.md` (bump version)
+2. `python3 -m mkdocs build` to verify no broken links
+3. Show the diff and wait for approval before `git push`
+
+## Writing Rules (Anti-Slop)
+
+Apply these to all prose (findings docs, DISCOVERIES.md updates, PR descriptions):
+
+1. Cut filler phrases. Say the thing directly.
+2. Break formulaic structures. No binary contrasts, no dramatic fragmentation.
+3. Vary rhythm. Mix sentence lengths. Two items beat three.
+4. Trust readers. State facts directly.
+5. Prefer plain verbs. "used" not "leveraged," "showed" not "showcased."
+6. Use simple copulatives. Write "X is Y" not "X serves as Y."
+7. Kill em dashes. Use commas or periods.
+8. Never triple. Two items in a list, not three.
+9. Be specific. Replace generic statements with concrete details.
+10. No AI vocabulary: delve, tapestry, landscape, pivotal, showcase, testament, underscore, foster, garner, interplay, intricate, vibrant, robust, seamless, paramount, multifaceted, nuanced, groundbreaking, cornerstone, transformative, synergy.
+
+Full guide: `.claude/skills/anti-slop-guide/SKILL.md` (plain markdown, readable by any tool).
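Rules 7 and 10 in the writing rules above are mechanical enough to check with a script. The following is a minimal sketch, not part of this PR: `find_slop` is a hypothetical helper, the word list is copied from rule 10, and matching is exact, so inflected forms are not caught.

```python
import re

# Rule 10 vocabulary, copied from the list in .codex/AGENTS.md.
BANNED_WORDS = {
    "delve", "tapestry", "landscape", "pivotal", "showcase", "testament",
    "underscore", "foster", "garner", "interplay", "intricate", "vibrant",
    "robust", "seamless", "paramount", "multifaceted", "nuanced",
    "groundbreaking", "cornerstone", "transformative", "synergy",
}
EM_DASH = "\u2014"


def find_slop(text):
    """Return (line_number, problem) pairs for rules 7 and 10 (hypothetical helper)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Rule 7: flag any line that contains an em dash.
        if EM_DASH in line:
            problems.append((lineno, "em dash"))
        # Rule 10: exact word match only, so "showcased" or "delves" pass unnoticed.
        for word in re.findall(r"[a-z]+", line.lower()):
            if word in BANNED_WORDS:
                problems.append((lineno, f"banned word: {word}"))
    return problems


if __name__ == "__main__":
    sample = "We used GF(2) elimination.\nThis robust method \u2014 a cornerstone."
    for lineno, problem in find_slop(sample):
        print(f"line {lineno}: {problem}")
```

A check like this could run on findings docs before a push. The other rules (rhythm, specificity) still need a human or model reviewer.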
diff --git a/.codex/config.toml b/.codex/config.toml
@@ -0,0 +1,16 @@
+# Codex CLI project config for SutroYaro
+# Docs: https://developers.openai.com/codex/config-reference
+
+model = "codex-1"
+
+# Workspace-write lets the agent edit files and run experiments
+sandbox_mode = "workspace-write"
+
+# Ask before destructive actions (git push, file deletion)
+approval_policy = "on-request"
+
+# Read CODEX.md for project context (in addition to AGENTS.md which is auto-read)
+project_doc_fallback_filenames = ["CODEX.md", "AGENTS.md"]
+
+# Auth: prefer ChatGPT OAuth (uses subscription, not API credits)
+forced_login_method = "chatgpt"
diff --git a/AGENTS.md b/AGENTS.md
@@ -1,6 +1,6 @@
 # AGENTS.md
 
-This project uses AI agents (Claude Code, Gemini, Replit, others) for research and accepts contributions from both humans and agents.
+This project uses AI agents (Claude Code, Codex CLI, Gemini CLI, Replit, others) for research and accepts contributions from both humans and agents.
 
 ## How AI agents were used
 
@@ -33,7 +33,9 @@ When reviewing a contributed experiment:
 
 | File | Purpose |
 |------|---------|
-| `CLAUDE.md` | Project instructions loaded at session start |
+| `CLAUDE.md` | Project context for Claude Code (auto-loaded at session start) |
+| `CODEX.md` | Project context for Codex CLI (auto-loaded via `.codex/config.toml`) |
+| `.codex/AGENTS.md` | Codex instructions: context loading, sync routine, writing rules |
 | `LAB.md` | Experiment protocol (one hypothesis, baseline, commit discipline) |
 | `DISCOVERIES.md` | Shared knowledge base, anyone can PR new findings |
 | `CONTRIBUTING.md` | How humans and agents contribute (three effort levels) |
@@ -43,6 +45,8 @@ When reviewing a contributed experiment:
 | `docs/research/survey.md` | Full methodology in Section 7 (agentic loop, parallel dispatch, prompting strategies) |
 | `docs/tooling/anti-slop-guide.md` | Writing rules applied to all agent-generated prose |
 | `docs/tooling/sync-runbook.md` | Weekly/daily/per-session sync checklists |
+| `bin/review-cycle` | Cross-model supervisor: reviews experiments, dialogues with researcher |
+| `research/reviews/` | Review dialogue files (one per review cycle) |
 
 ## What worked
 

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -104,6 +104,22 @@ See [docs/research/peer-research-protocol.md](docs/research/peer-research-protoc
 | `sync_telegram.ts` | Pulls Telegram group thread messages to JSON | [docs/tooling/automation.md](docs/tooling/automation.md) |
 | `src/sync_google_docs.py` | Pulls Google Docs to local markdown | [docs/tooling/automation.md](docs/tooling/automation.md) |
 | `.traces/export_sessions.py` | Exports Claude Code session traces | [docs/tooling/automation.md](docs/tooling/automation.md) |
+| `bin/review-cycle` | Cross-model experiment review (supervisor/researcher dialogue) | See below |
+
+### Review Cycle Quick Reference
+
+```bash
+# Codex supervises Claude's work (last 5 experiments, 3-turn dialogue)
+bin/review-cycle --tool codex --researcher-tool claude --last 5
+
+# Gemini supervises with more dialogue turns
+bin/review-cycle --tool gemini --researcher-tool claude --last 10 --turns 5
+
+# Preview prompts without launching agents
+bin/review-cycle --dry-run --last 3
+
+# Output: research/reviews/review-{timestamp}.md
+```
 
 ### Telegram Sync Quick Reference
 

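CODEX.md, added in this diff, defines DMC as the sum of sqrt(stack_distance) over all float accesses. The sketch below computes that quantity for a small address trace. It is an illustration under stated assumptions, not the repo's MemTracker: the function names are hypothetical, addresses stand in for individual floats, and a first-time access is charged a distance equal to the number of distinct addresses seen so far. The tracker in the repo may handle cold accesses differently.

```python
import math


def stack_distances(trace):
    """LRU stack distance per access; 1 means the address was the most recently used."""
    stack = []  # most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            # Depth counted from the top of the LRU stack (1-based).
            depth = len(stack) - stack.index(addr)
            stack.remove(addr)
        else:
            # ASSUMPTION: a cold access is charged as if it sat just below the current stack.
            depth = len(stack) + 1
        stack.append(addr)
        distances.append(depth)
    return distances


def dmc(trace):
    """Data Movement Complexity: sum of sqrt(stack distance) over all accesses."""
    return sum(math.sqrt(d) for d in stack_distances(trace))


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "b"]
    print(stack_distances(trace))  # [1, 2, 2, 3, 3]
    print(round(dmc(trace), 3))    # 1 + 2*sqrt(2) + 2*sqrt(3) = 7.293
```

The list-based stack costs O(n) per access, which is fine for a toy trace. A tracker that handles millions of accesses would need a tree-based structure instead.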
diff --git a/CODEX.md b/CODEX.md
@@ -0,0 +1,176 @@
+# CODEX.md - Sutro Group Research Workspace
+
+## Project Context
+
+This is a research workspace for the **Sutro Group**, a study group exploring energy-efficient AI training. The group meets weekly at South Park Commons in San Francisco.
+
+## Read These First
+
+- **LAB.md** -- Protocol for running experiments (templates, lifecycle, rules)
+- **AGENT.md** -- Machine-executable experiment loop for autonomous sessions
+- **DISCOVERIES.md** -- What's proven so far (read before every experiment)
+- **CONTRIBUTING.md** -- How external contributors submit experiments and findings
+- **TODO.md** -- Open research tasks
+- **docs/tasks/INDEX.md** -- Current task tracker with priorities
+- **docs/research/survey.md** -- Practitioner's Field Guide ranking all 33 experiments
+- **docs/research/peer-research-protocol.md** -- Full design doc for multi-researcher autonomous research
+
+## Core Concepts
+
+- **Sparse Parity**: The benchmark task. Learn XOR/parity from random {-1,+1} inputs with n=20 bits: k=3 secret bits and 17 noise bits. The "drosophila" of energy-efficient training.
+- **Average Reuse Distance (ARD)**: Proxy metric for energy efficiency. Small ARD = data stays in cache = cheap. Large ARD = expensive external memory access.
+- **Data Movement Complexity (DMC)**: Better proxy metric (Ding et al., arXiv:2312.14441). DMC = sum of sqrt(stack_distance) for all float accesses. Tracks alongside ARD in MemTracker. Baseline: ARD 4,104 / DMC 300,298.
+- **Cache Energy Model**: register 5pJ, L1 (64KB) 20pJ, L2 (256KB) 100pJ, HBM 640pJ per float access (Bill Dally numbers).
+- **CacheTracker**: Extended MemTracker with LRU cache simulation for realistic energy estimates.
+
+## Current Best Methods
+
+| Method | Time (n=20/k=3) | ARD | DMC | Notes |
+|--------|-----------------|-----|-----|-------|
+| KM-min (1 sample) | ~0.001s | 20 | 3,578 | New DMC leader. 1 influence sample suffices for parity. |
+| GF(2) Gaussian Elimination | 509 us | ~420 | 8,607 | 240x faster than SGD, k-independent. Harness under-counts; true DMC ~189K. |
+| KM Influence Estimation | 0.001-0.006s | 92 | 20,633 | ARD leader. 5 influence samples per bit. |
+| SMT Backtracking | 0.002s | 3,360 | 348,336 | Constraint satisfaction approach |
+| SGD (baseline) | 0.12s | 8,504 | 1,278,460 | LR=0.1, batch=32, hidden=200 |
+
+GF(2) solves n=100/k=10 in 703 microseconds. Parity is linear over the binary field -- the neural network was solving an easy problem the hard way.
+
+## SGD Config (when using neural nets)
+
+```python
+n_bits=20, k_sparse=3, hidden=200, lr=0.1, wd=0.01,
+batch_size=32, n_train=1000, max_epochs=200
+```
+
+Solves in ~40 epochs / 0.12s with numpy (`fast.py`).
+
+## Key Findings
+
+**Phase 1 (16 experiments, SGD optimization):**
+- LR=0.1 is critical (0.5 overshoots, never triggers phase transition)
+- W1 dominates 75% of all float reads -- limits ARD optimization to ~10%
+- L2 cache (256KB) eliminates ALL cache misses for both single-sample and batch
+- Curriculum learning (n=10 then expand to n=50) gives 14.6x speedup at scale
+- SGD breaks when n^k exceeds ~100,000 gradient steps
+
+**Phase 2 (17 experiments, broad search):**
+- Algebraic/exact methods (GF(2), KM, SMT) solve instantly -- they exploit that parity is linear over GF(2)
+- All 4 local learning rules (Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation) fail at chance level -- parity requires k-th order interaction detection
+- Information-theoretic methods (MI, LASSO, MDL, Random Projections) all solve it but none beats Fourier meaningfully
+- RL sequential Q-learning achieves ARD of 1 at inference (reads exactly k=3 bits per prediction)
+
+## Autonomous Research Infrastructure
+
+| File | Purpose |
+|------|---------|
+| `AGENT.md` | Agent-executable experiment loop (machine protocol) |
+| `src/harness.py` | Locked evaluation harness (DO NOT MODIFY in experiment PRs) |
+| `research/search_space.yaml` | Bounded mutation space per challenge |
+| `research/questions.yaml` | Dependency graph of open research questions |
+| `research/log.jsonl` | Append-only experiment log (machine-readable) |
+| `results/scoreboard.tsv` | Human-readable leaderboard (auto-generated) |
+| `checks/env_check.py` | Pre-flight environment check |
+| `checks/baseline_check.py` | Re-establish baselines on this machine |
+| `bin/run-agent` | Launch autonomous agent cycle |
+| `bin/merge-findings` | Import contributor log entries via PR |
+
+See [docs/research/peer-research-protocol.md](docs/research/peer-research-protocol.md) for the full design.
+
+## Automation
+
+| Script | What it does | Docs |
+|--------|-------------|------|
+| `sync_telegram.ts` | Bulk-syncs Telegram topics to JSON files | [docs/tooling/automation.md](docs/tooling/automation.md) |
+| `telegram/tg-topics.ts` | Lists forum topics (JSON) | See Telegram section below |
+| `telegram/tg-read.ts` | Reads messages from a topic (JSON) | See Telegram section below |
+| `telegram/tg-send.ts` | Sends a message to a topic | See Telegram section below |
+| `src/sync_google_docs.py` | Pulls Google Docs to local markdown | [docs/tooling/automation.md](docs/tooling/automation.md) |
+| `bin/review-cycle` | Cross-model experiment review (supervisor/researcher dialogue) | See below |
+
+### Review Cycle Quick Reference
+
+```bash
+# Codex supervises Claude's work (last 5 experiments, 3-turn dialogue)
+bin/review-cycle --tool codex --researcher-tool claude --last 5
+
+# Claude supervises Codex's work
+bin/review-cycle --tool claude --researcher-tool codex --last 5
+
+# Preview prompts without launching agents
+bin/review-cycle --dry-run --last 3
+
+# Output: research/reviews/review-{timestamp}.md
+```
+
+### Telegram Quick Reference
+
+```bash
+# First time: install deps and authenticate
+bun install
+cp .env.example .env  # fill in TELEGRAM_API_ID and TELEGRAM_API_HASH
+tg auth login
+
+# Bulk sync (existing)
+bun run sync_telegram.ts
+
+# List topics
+bun telegram/tg-topics.ts
+
+# Read last 20 messages from a topic
+bun telegram/tg-read.ts --topic "General" --limit 20
+
+# Read messages since a date
+bun telegram/tg-read.ts --topic "chat-yad" --since 2025-06-01
+
+# Send a message to a topic
+bun telegram/tg-send.ts --topic "agents" --message "Hello from agent"
+
+# Send multi-line via stdin
+echo "Summary of findings..." | bun telegram/tg-send.ts --topic "agents" --stdin
+
+# Send to default write topic (set TELEGRAM_WRITE_TOPIC in .env)
+bun telegram/tg-send.ts --message "Status update"
+```
+
+## Working Style
+
+- Iteration time must stay under 2 seconds (use `fast.py` for numpy speed)
+- Change one thing at a time (correctness, then speed, then energy)
+- Priority: correctness > wall-clock time > energy usage
+- One hypothesis per experiment, always compare against baseline
+- Record everything -- failed hypotheses are findings too
+- Apply anti-slop writing rules to all prose (no em dashes, no AI vocabulary)
+
+## Before Pushing
+
+- **Update `docs/changelog.md`** with what changed (bump version, add section)
+- **Sync Google Docs** if meeting notes may have changed: `python3 src/sync_google_docs.py`
+- **Sync Telegram** if group discussion may have new messages: `bun run sync_telegram.ts`
+- **Check `docs/index.md`** if findings or status changed -- homepage should reflect current state
+- **Check GitHub** for PRs/issues: `gh pr list --repo cybertronai/SutroYaro`
+
+Full sync workflow: [docs/tooling/sync-runbook.md](docs/tooling/sync-runbook.md)
+
+## People
+
+- **Yad** (repo creator, SutroYaro) -- Built the Claude Code autonomous research lab, parallel agent experiments
+- **Yaroslav** (Sutro Group founder) -- Technical sprints, algorithm work, cybertronai/sutro
+- **Emmett** -- Aster agentic loop framework, 2x energy improvement on microgpt
+- **G B** -- Architecture experiments (depth-1/hidden-64, ARD ~33-35)
+- **Germaine**, **Andy**, **Seth**, **Barak**, **Jamie Simon** -- Group members
+
+## Contributing
+
+Multiple people contribute via PRs (fork and branch). See [CONTRIBUTING.md](CONTRIBUTING.md) for the full guide and [docs/branch-workflow.md](docs/branch-workflow.md) for branch naming, locked files, and agent permissions.
+
+- **`contributions/`** -- Drop raw results here in any format. No template needed.
+- **`findings/_template.md`** -- Standalone findings template for structured reports.
+- **`DISCOVERIES.md`** -- Shared knowledge base. Anyone can PR new bullets.
+- **Metric isolation (LAB.md rule #9)** -- Never modify tracker.py, cache_tracker.py, data.py, config.py, harness.py in experiment PRs.
+
+When reviewing PRs: check that results are reproducible, findings follow the template, and DISCOVERIES.md is updated if the experiment answers an open question.
+
+## Related Repos
+
+- https://github.com/cybertronai/sutro -- Main code repo with sparse_parity_benchmark.py
+- https://github.com/cybertronai/SutroYaro -- This research workspace
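CODEX.md states that parity is linear over the binary field, which is why GF(2) Gaussian elimination solves the benchmark directly. The sketch below shows the idea in pure Python. It is not the repo's implementation: `solve_parity_gf2`, the sample count, and the secret indices are all hypothetical. It assumes noise-free labels and encodes -1 as bit 1, so that a product of signs becomes an XOR of bits.

```python
import random


def solve_parity_gf2(samples, labels, n_bits):
    """Recover the secret index set by Gauss-Jordan elimination over GF(2).

    samples: rows of +1/-1 values; labels: product of the secret coordinates.
    Returns a sorted list of indices, or None if the system is underdetermined.
    """
    # Pack each sample into one integer: bit i holds coordinate i
    # (1 means the value was -1) and bit n_bits holds the label.
    rows = []
    for x, y in zip(samples, labels):
        row = sum(1 << i for i, v in enumerate(x) if v == -1)
        if y == -1:
            row |= 1 << n_bits
        rows.append(row)

    for col in range(n_bits):
        # Find a row at or below position col with a 1 in this column.
        hit = next((r for r in range(col, len(rows)) if (rows[r] >> col) & 1), None)
        if hit is None:
            return None  # no pivot for this column: more samples are needed
        rows[col], rows[hit] = rows[hit], rows[col]
        # Clear this column from every other row (XOR is addition in GF(2)).
        for r in range(len(rows)):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]

    # After full reduction, row i reads "secret bit i = label bit".
    return [i for i in range(n_bits) if (rows[i] >> n_bits) & 1]


if __name__ == "__main__":
    rng = random.Random(0)
    n_bits, secret = 20, [2, 7, 11]  # hypothetical secret, k=3 as in the benchmark
    samples = [[rng.choice((-1, 1)) for _ in range(n_bits)] for _ in range(60)]
    labels = [x[2] * x[7] * x[11] for x in samples]
    # Prints the secret whenever the sampled matrix has full column rank.
    print(solve_parity_gf2(samples, labels, n_bits), "expected", secret)
```

The work depends on n and the sample count but not on k, which matches the "k-independent" note in the methods table. With 60 random samples for n=20, a rank-deficient matrix is very unlikely.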