Research notes

The RL roadmap (tiers and phases) is implemented in code as documented below.

Current status in code:

Tier 1: UCB bandit in src/sage/rl/ucb_bandit.py
Tier 2 (Phase 5): offline routing dataset export + BC + conservative (CQL-style) policy
- Export: src/sage/rl/export_dataset.py
- BC: src/sage/rl/train_bc.py
- Conservative policy: src/sage/rl/train_cql.py (contextual bandit variant)
Tier 3 (Phase 6): simulator tasks + Docker sandbox + minimal PPO
- Tasks: datasets/sim_tasks.jsonl (generated by sage sim generate)
- Docker: sim/Dockerfile, src/sage/sim/docker_runner.py
- PPO: src/sage/sim/ppo.py
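
For readers unfamiliar with the Tier 1 approach, the following is a minimal sketch of a textbook UCB1 bandit. It is not the contents of src/sage/rl/ucb_bandit.py; the class name, the exploration weight c, and the simulate helper are all hypothetical and exist only for illustration.

```python
import math
import random


class UCB1:
    """Textbook UCB1 over a fixed set of arms (illustrative, not the sage code)."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.c = c                    # exploration weight (assumed default)

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(probs, steps=2000, seed=0):
    """Run UCB1 against Bernoulli arms with the given success probabilities."""
    rng = random.Random(seed)
    bandit = UCB1(len(probs))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        bandit.update(arm, reward)
    return bandit
```

Calling simulate([0.2, 0.8]) and inspecting bandit.counts should show most pulls concentrated on the better arm, which is the behavior a routing bandit relies on.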

Notes:

For “real” Phase 5/6 research, replace synthetic trajectory collection with logged decisions from real runs and evaluate on an external benchmark suite.
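
One way to act on this note is to split exported rows by the data_source provenance label that the Phase 5 export attaches to each row. The sketch below assumes the export is JSON Lines and that the label takes values such as "real" and "synthetic"; the file format, those values, and the other field names (session_id, action, reward) are assumptions made for illustration, not confirmed details of export_dataset.py.

```python
import json


def load_rows(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def split_by_source(rows, field="data_source"):
    """Group rows by provenance label; rows without the label go to 'unknown'."""
    groups = {}
    for row in rows:
        groups.setdefault(row.get(field, "unknown"), []).append(row)
    return groups


# Hypothetical rows; only the data_source field name comes from these notes.
SAMPLE = "\n".join([
    json.dumps({"session_id": "s1", "action": 0, "reward": 1.0, "data_source": "real"}),
    json.dumps({"session_id": "s2", "action": 1, "reward": 0.0, "data_source": "synthetic"}),
    json.dumps({"session_id": "s3", "action": 1, "reward": 0.5, "data_source": "real"}),
    json.dumps({"session_id": "s4", "action": 0, "reward": 0.2}),
])

groups = split_by_source(load_rows(SAMPLE))
real_only = groups.get("real", [])
```

Training on real_only alone, and reporting the size of each group alongside any result, keeps synthetic bootstrap data from silently inflating the evidence.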

Implemented vs Deferred

Implemented:

Phase 5 evidence pipeline (mixed real/synthetic, labeled):
- Export + session filtering: src/sage/rl/export_dataset.py
- Synthetic collector: src/sage/rl/collect_synth.py
- Provenance label: data_source in exported rows
Training artifacts:
- BC: memory/rl/policy_bc.joblib
- Conservative policy: memory/rl/policy_cql.joblib
Evaluation artifacts:
- Reward report: datasets/reward_report.json
- Offline eval: datasets/offline_eval_cql.json
Benchmark artifacts:
- YAML task suite under src/sage/benchmarks/tasks/*.yaml (now 6 tasks)
- sage bench --out ... writes JSON artifacts
Phase 6 simulator maturity:
- Oracle suite: datasets/sim_tasks.jsonl (>=1000 tasks)
- Parallel runner + optional docker: src/sage/sim/parallel_runner.py, src/sage/sim/docker_runner.py
- PPO smoke: src/sage/sim/ppo.py (minimal PPO implementation)
Reproducibility docs:
- Verification matrix: docs/verification_matrix.md
- Verification script: scripts/verify_local.sh
- Artifact mapping checklist: docs/final_checklist.md
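
To make the "conservative (CQL-style) contextual bandit" idea concrete, here is a hedged sketch of a per-sample loss of that kind: squared error on the logged action's reward plus a penalty that pushes down value estimates for actions the log does not support. This is a generic illustration, not the objective in src/sage/rl/train_cql.py; the function names and the alpha weight are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def conservative_bandit_loss(q_values, action, reward, alpha=1.0):
    """CQL-style loss for one logged (context, action, reward) sample.

    q_values: estimated value of every action in this context.
    action:   index of the action that was actually logged.
    reward:   observed reward (one-step setting, so no bootstrapped target).
    alpha:    weight of the conservative penalty (assumed hyperparameter).
    """
    fit = (q_values[action] - reward) ** 2
    # Penalty is >= 0 and grows when unlogged actions are valued highly.
    penalty = logsumexp(q_values) - q_values[action]
    return fit + alpha * penalty
```

With q_values=[0.0, 0.0], action=0, and reward=1.0, the fit term is 1 and the penalty is log 2. Raising the estimate for the unlogged action to 5.0 increases the loss, which is the conservative behavior wanted from offline data.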

Deferred:
- Real-run-only Phase 5 dataset at scale (no synthetic bootstrap).
- Research-grade CQL / PPO training over multi-step simulator rollouts.
- Published benchmark protocol outputs (e.g., full ablation tables, external-run reproducibility).
- arXiv/paper/demo-video packaging for the full RL story.