Research notes

The RL roadmap (tiers and phases) is implemented in code as documented below.

Current status in code:

  • Tier 1: UCB bandit in src/sage/rl/ucb_bandit.py
  • Tier 2 (Phase 5): offline routing dataset export + BC + conservative (CQL-style) policy
    • Export: src/sage/rl/export_dataset.py
    • BC: src/sage/rl/train_bc.py
    • Conservative policy: src/sage/rl/train_cql.py (contextual bandit variant)
  • Tier 3 (Phase 6): simulator tasks + Docker sandbox + minimal PPO
    • Tasks: datasets/sim_tasks.jsonl (generated by sage sim generate)
    • Docker: sim/Dockerfile, src/sage/sim/docker_runner.py
    • PPO: src/sage/sim/ppo.py
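
For orientation, the Tier 1 idea can be sketched as a textbook UCB1 bandit. This is a minimal illustration, not the contents of src/sage/rl/ucb_bandit.py: the class name `UCB1Bandit`, the `exploration` parameter, and the `run` helper are all hypothetical, and the real module may score arms differently.

```python
import math


class UCB1Bandit:
    """Textbook UCB1 over a fixed set of arms (illustrative sketch only)."""

    def __init__(self, n_arms, exploration=2.0):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.exploration = exploration

    def select(self):
        # Pull every arm once before applying the UCB score.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / n)
            for mean, n in zip(self.values, self.counts)
        ]
        # Ties resolve to the lowest arm index.
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def run(bandit, reward_fn, rounds):
    """Drive the bandit for `rounds` steps; `reward_fn` maps arm -> reward."""
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, reward_fn(arm))
    return bandit.counts
```

In a routing setting, each arm would presumably correspond to one routing choice and the reward to a task-level success signal; that mapping is an assumption here, not something these notes specify.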

Notes:

  • For “real” Phase 5/6 research, replace synthetic trajectory collection with logged decisions from real runs and evaluate on an external benchmark suite.

Implemented vs Deferred

Implemented (local evidence in this repo)

  • Phase 5 evidence pipeline (mixed real/synthetic, labeled):
    • Export + session filtering: src/sage/rl/export_dataset.py
    • Synthetic collector: src/sage/rl/collect_synth.py
    • Provenance label: data_source in exported rows
  • Training artifacts:
    • BC: memory/rl/policy_bc.joblib
    • Conservative policy: memory/rl/policy_cql.joblib
  • Evaluation artifacts:
    • Reward report: datasets/reward_report.json
    • Offline eval: datasets/offline_eval_cql.json
  • Benchmark artifacts:
    • YAML task suite under src/sage/benchmarks/tasks/*.yaml (currently 6 tasks)
    • sage bench --out ... writes JSON artifacts
  • Phase 6 simulator maturity:
    • Oracle suite: datasets/sim_tasks.jsonl (>=1000 tasks)
    • Parallel runner + optional docker: src/sage/sim/parallel_runner.py, src/sage/sim/docker_runner.py
    • PPO smoke: src/sage/sim/ppo.py (minimal PPO implementation)
  • Reproducibility docs:
    • Verification matrix: docs/verification_matrix.md
    • Verification script: scripts/verify_local.sh
    • Artifact mapping checklist: docs/final_checklist.md
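
Because exported rows carry a data_source provenance label, real and synthetic data can be separated before training or evaluation. The sketch below assumes each exported row is available as a Python dict and that real-run rows are labeled "real"; the actual label values and on-disk format are defined by src/sage/rl/export_dataset.py and may differ. The function names are hypothetical.

```python
from collections import Counter


def split_by_provenance(rows, real_label="real"):
    """Split rows into (real, other) using the data_source field.

    `real_label` is an assumed value; check the exporter for the actual labels.
    Rows without a data_source field are treated as not real.
    """
    real, other = [], []
    for row in rows:
        if row.get("data_source") == real_label:
            real.append(row)
        else:
            other.append(row)
    return real, other


def provenance_counts(rows):
    """Count rows per data_source value; missing labels count as 'unknown'."""
    return Counter(row.get("data_source", "unknown") for row in rows)
```

Reporting `provenance_counts` alongside any offline evaluation makes it explicit how much of a result rests on synthetic bootstrap data.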

Deferred (requires stronger research-grade setup)

  • Real-run-only Phase 5 dataset at scale (no synthetic bootstrap).
  • Research-grade CQL / PPO training over multi-step simulator rollouts.
  • Published benchmark protocol outputs (e.g., full ablation tables, external-run reproducibility).
  • arXiv paper and demo-video packaging for the full RL story.