The RL roadmap (tiers and phases) is implemented in code as documented below.
Current status in-code:
- Tier 1: UCB bandit in `src/sage/rl/ucb_bandit.py`
- Tier 2 (Phase 5): offline routing dataset export + BC + conservative (CQL-style) policy
  - Export: `src/sage/rl/export_dataset.py`
  - BC: `src/sage/rl/train_bc.py`
  - Conservative policy: `src/sage/rl/train_cql.py` (contextual bandit variant)
- Tier 3 (Phase 6): simulator tasks + Docker sandbox + minimal PPO
  - Tasks: `datasets/sim_tasks.jsonl` (generated by `sage sim generate`)
  - Docker: `sim/Dockerfile`, `src/sage/sim/docker_runner.py`
  - PPO: `src/sage/sim/ppo.py`
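For readers unfamiliar with the Tier 1 technique, the following is a minimal UCB1 sketch. It is not the code in `src/sage/rl/ucb_bandit.py`: the class name `UCB1Bandit`, the exploration weight `c`, and the simulated per-route success rates are all assumptions made for illustration.

```python
import math
import random


class UCB1Bandit:
    """Minimal UCB1 bandit over a fixed set of arms (illustrative only)."""

    def __init__(self, n_arms: int, c: float = 2.0) -> None:
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.c = c                    # exploration weight (2.0 gives classic UCB1)

    def select(self) -> int:
        # Pull every arm once before applying the UCB rule.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        # Score = mean reward + exploration bonus that shrinks with pulls.
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean update, so no reward history is stored.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.2, 0.5, 0.8]  # hypothetical per-route success rates
    bandit = UCB1Bandit(n_arms=len(true_means))
    for _ in range(2000):
        arm = bandit.select()
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        bandit.update(arm, reward)
    print("pulls per arm:", bandit.counts)
```

The bonus term shrinks as an arm accumulates pulls, so the bandit gradually shifts from exploring all routes to favoring the one with the highest observed mean reward.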
Notes:
- For “real” Phase 5/6 research, replace synthetic trajectory collection with logged decisions from real runs and evaluate on an external benchmark suite.
- Phase 5 evidence pipeline (mixed real/synthetic, labeled):
  - Export + session filtering: `src/sage/rl/export_dataset.py`
  - Synthetic collector: `src/sage/rl/collect_synth.py`
  - Provenance label: `data_source` in exported rows
- Training artifacts:
  - BC: `memory/rl/policy_bc.joblib`
  - Conservative policy: `memory/rl/policy_cql.joblib`
- Evaluation artifacts:
  - Reward report: `datasets/reward_report.json`
  - Offline eval: `datasets/offline_eval_cql.json`
- Benchmark artifacts:
  - YAML task suite under `src/sage/benchmarks/tasks/*.yaml` (now 6 tasks)
  - `sage bench --out ...` writes JSON artifacts
- Phase 6 simulator maturity:
  - Oracle suite: `datasets/sim_tasks.jsonl` (>=1000 tasks)
  - Parallel runner + optional docker: `src/sage/sim/parallel_runner.py`, `src/sage/sim/docker_runner.py`
  - PPO smoke: `src/sage/sim/ppo.py` (minimal PPO implementation)
- Reproducibility docs:
  - Verification matrix: `docs/verification_matrix.md`
  - Verification script: `scripts/verify_local.sh`
  - Artifact mapping checklist: `docs/final_checklist.md`
- Real-run-only Phase 5 dataset at scale (no synthetic bootstrap).
- Research-grade CQL / PPO training over multi-step simulator rollouts.
- Published benchmark protocol outputs (e.g., full ablation tables, external-run reproducibility).
- arXiv/paper/demo-video packaging for the full RL story.
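Because exported rows carry a `data_source` provenance label, a real-run-only subset can in principle be separated from synthetic rows before training. The sketch below shows one way this might look. It assumes the export is JSON Lines with one object per row and that real rows are labeled `"real"`; neither the file format nor the label values are confirmed here, and the `action` and `reward` fields are hypothetical.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Iterable, Iterator


def iter_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_by_source(
    rows: Iterable[dict], real_label: str = "real"
) -> tuple[list[dict], Counter]:
    """Return rows whose data_source equals real_label, plus counts per label.

    The label value "real" is an assumption; rows without a data_source
    field are counted under "unknown" and excluded from the real subset.
    """
    counts: Counter = Counter()
    real_rows: list[dict] = []
    for row in rows:
        source = row.get("data_source", "unknown")
        counts[source] += 1
        if source == real_label:
            real_rows.append(row)
    return real_rows, counts


# Hypothetical exported rows; only the data_source field name comes from the notes above.
sample = [
    '{"data_source": "real", "action": 1, "reward": 0.7}',
    '{"data_source": "synthetic", "action": 0, "reward": 0.2}',
    "",
    '{"action": 2, "reward": 0.0}',
]
real_rows, counts = split_by_source(iter_rows(sample))
print(len(real_rows), dict(counts))  # 1 {'real': 1, 'synthetic': 1, 'unknown': 1}
```

Reporting the per-label counts alongside the filtered rows makes it easy to see how much of a dataset is still synthetic bootstrap.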