Final checklist (Phase 5–6 artifacts)

This doc maps the Phase 5–6 work (offline RL, simulator) to concrete artifacts in this workspace and explains how to reproduce each one.

Phase 5 — Offline RL (routing policy beats static YAML)

  1. 500+ diverse trajectories collected

    • Dataset export: datasets/routing_v1.jsonl
    • Row count observed in this workspace: 651 rows (wc -l datasets/routing_v1.jsonl)
    • How to reproduce (synthetic collector in-repo, then export):
      • sage rl collect-synth --rows 650
      • sage rl export --output datasets/routing_v1.jsonl
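
If you want to double-check the export without the CLI, the snippet below assumes nothing beyond the JSON-lines format (one object per line); the field names are whatever the exporter wrote:

```python
import json

# Load the exported trajectories; skip any blank lines.
with open("datasets/routing_v1.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} rows")           # 651 observed in this workspace
print("fields:", sorted(rows[0]))    # inspect the exported trajectory schema
```
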
  2. Reward function tuned

    • Implemented reward: src/sage/rl/reward.py (reward_v1 via composite_reward)
    • Analysis report: datasets/reward_report.json (generated by sage rl analyze-rewards)
    • How to reproduce:
      • sage rl analyze-rewards --data datasets/routing_v1.jsonl --out datasets/reward_report.json
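
For orientation, a composite reward of this kind is typically a weighted success signal minus latency and cost penalties. The sketch below is illustrative only; the actual terms and weights are whatever src/sage/rl/reward.py defines, and success/latency_s/cost_usd are assumed signal names, not the repo's signature:

```python
# Illustrative shape only -- the real reward_v1 / composite_reward live
# in src/sage/rl/reward.py. All names and weights here are assumptions.
def composite_reward(success: bool, latency_s: float, cost_usd: float,
                     w_success: float = 1.0, w_latency: float = 0.1,
                     w_cost: float = 0.5) -> float:
    # Reward task success; penalize wall-clock time and dollar cost.
    return w_success * float(success) - w_latency * latency_s - w_cost * cost_usd
```
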
  3. Behavior Cloning baseline trained

    • Checkpoint: memory/rl/policy_bc.joblib
    • Training command:
      • sage rl train-bc --data datasets/routing_v1.jsonl --out memory/rl/policy_bc.joblib
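
Behavior cloning here is plain supervised learning: fit a classifier that maps logged trajectory features to the routing action actually taken. The .joblib checkpoint suggests a scikit-learn-style estimator; the sketch below assumes that, and its feature/label values are stand-ins:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: state features and the routing action that was logged.
X = np.array([[0.2, 1.0], [0.9, 0.0], [0.1, 0.8], [0.8, 0.1]])
y = np.array(["small-model", "big-model", "small-model", "big-model"])

clf = LogisticRegression().fit(X, y)       # behavior cloning = supervised fit
joblib.dump(clf, "policy_bc_demo.joblib")  # real artifact: memory/rl/policy_bc.joblib
```
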
  4. CQL policy trained + integrated with fallback

    • Conservative policy checkpoint:
      • memory/rl/policy_cql.joblib
    • Router integration is via src/sage/orchestrator/model_router.py:
      • set SAGE_RL_POLICY=1 and SAGE_RL_CHECKPOINT=memory/rl/policy_cql.joblib
      • the router uses the conservative policy's choice only when its confidence clears a floor; below the floor it falls back to the static table (sketch after this item)
    • How to reproduce:
      • sage rl train-cql --data datasets/routing_v1.jsonl --out memory/rl/policy_cql.joblib
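
A minimal sketch of the confidence-floor fallback, assuming a scikit-learn-style checkpoint with predict_proba; the env-var names are the ones documented above, while the floor value and the STATIC_TABLE stand-in are illustrative:

```python
import os
import joblib

STATIC_TABLE = {"default": "small-model"}  # stand-in for the static YAML table
CONFIDENCE_FLOOR = 0.6                     # hypothetical threshold

def route(features, task_kind="default"):
    if os.environ.get("SAGE_RL_POLICY") == "1":
        policy = joblib.load(os.environ["SAGE_RL_CHECKPOINT"])
        probs = policy.predict_proba([features])[0]
        if probs.max() >= CONFIDENCE_FLOOR:        # confident enough: trust the policy
            return policy.classes_[probs.argmax()]
    return STATIC_TABLE[task_kind]                 # fallback: static routing
```
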
  5. Benchmark: policy vs static table comparison

    • Two evaluation paths exist:
      • Offline eval (available immediately, no LLM dependency):
        • report: datasets/offline_eval_cql.json
        • command:
          • sage rl eval-offline --data datasets/routing_v1.jsonl --checkpoint memory/rl/policy_cql.joblib --out datasets/offline_eval_cql.json
      • sage bench --compare-policy (LLM-dependent):
        • runs the bench suite twice with SAGE_RL_POLICY=0 vs 1
        • to persist a proof artifact:
          • sage bench --compare-policy --out memory/benchmarks/bench.json
        • standardized run-pack (artifact + manifest):
          • sage bench --compare-policy --run-pack-dir memory/benchmarks/run_pack_compare
    • Recommendation: treat eval-offline as the deterministic Phase-5 “evidence” and sage bench as the best-effort LLM-dependent check.
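
As a rough illustration of what an offline comparison can compute, the sketch below measures how often the trained policy agrees with the logged action and compares mean logged reward on agreed rows against the overall average. The real eval-offline metrics may differ, and the row fields (features, action, reward) are assumptions:

```python
import json
import joblib

policy = joblib.load("memory/rl/policy_cql.joblib")
with open("datasets/routing_v1.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

agreed = [r["reward"] for r in rows
          if policy.predict([r["features"]])[0] == r["action"]]
overall = [r["reward"] for r in rows]

print("agreement rate:", len(agreed) / len(overall))
print("mean reward (policy-agreed rows):", sum(agreed) / max(len(agreed), 1))
print("mean reward (all logged rows):  ", sum(overall) / len(overall))
```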

Phase 6 — Simulator RL (PPO against simulator benchmark)

  1. 1000+ benchmark tasks with known solutions

    • Task suite: datasets/sim_tasks.jsonl
    • Count observed in this workspace: 1200 tasks (above the 1000-task target; the reproduce command below uses --count 1000, so raise it to match)
    • How to reproduce:
      • sage sim generate --count 1000 --out datasets/sim_tasks.jsonl
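
A quick way to confirm the "known solutions" property without the CLI; the solution key is a guess at the row schema:

```python
import json

with open("datasets/sim_tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"{len(tasks)} tasks")                         # 1200 observed here
missing = [t for t in tasks if "solution" not in t]  # "solution" key is assumed
print(f"{len(missing)} tasks without a known solution")
```
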
  2. Parallel Docker sandbox infrastructure

    • Implemented:
      • Docker tool runner: src/sage/sim/docker_runner.py
      • Parallel runner: src/sage/sim/parallel_runner.py with --docker wiring
      • CLI:
        • sage sim run --tasks datasets/sim_tasks.jsonl --workers 4 --docker
    • Local Docker image:
      • built from sim/Dockerfile, tagged sage-sim:latest
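
Conceptually, the parallel sandbox is a worker pool that launches one throwaway container per task. A toy version, with the image tag from above but a placeholder in-container command:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id: str) -> int:
    # --rm tears the sandbox down after each task; networking is disabled
    # for isolation. "echo" stands in for the real task command.
    proc = subprocess.run(
        ["docker", "run", "--rm", "--network", "none",
         "sage-sim:latest", "echo", task_id],
        capture_output=True, text=True, timeout=120,
    )
    return proc.returncode

with ThreadPoolExecutor(max_workers=4) as pool:   # mirrors --workers 4
    results = list(pool.map(run_task, ["t1", "t2", "t3", "t4"]))
print(results)
```
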
  3. PPO fine-tuning pipeline

    • Implemented as a minimal working PPO on a toy contextual-bandit simulator:
      • src/sage/sim/ppo.py (train_ppo)
    • This is the Phase-6 wiring and a correctness baseline; the “research-grade PPO on software rollouts” goal still needs the richer simulator environment (toy sketch below).
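
For reference, PPO on a contextual bandit reduces to the clipped policy-gradient update on one-step episodes. A self-contained toy version (the sizes, reward table, and hyperparameters are illustrative, not the repo's):

```python
import torch

torch.manual_seed(0)
REWARD = torch.tensor([[1.0, 0.0],       # context 0: arm 0 pays
                       [0.0, 1.0]])      # context 1: arm 1 pays
policy = torch.nn.Linear(2, 2)           # one-hot context -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=0.05)
CLIP = 0.2

for _ in range(200):
    ctx = torch.eye(2)[torch.randint(0, 2, (64,))]   # batch of one-hot contexts
    with torch.no_grad():                            # rollout under the "old" policy
        logits = policy(ctx)
        acts = torch.distributions.Categorical(logits=logits).sample()
        old_logp = torch.log_softmax(logits, -1).gather(1, acts[:, None]).squeeze(1)
    rew = REWARD[ctx.argmax(1), acts]
    adv = rew - rew.mean()                           # mean-baseline advantage
    for _ in range(4):                               # a few PPO epochs per batch
        logp = torch.log_softmax(policy(ctx), -1).gather(1, acts[:, None]).squeeze(1)
        ratio = (logp - old_logp).exp()
        loss = -torch.min(ratio * adv,               # clipped surrogate objective
                          ratio.clamp(1 - CLIP, 1 + CLIP) * adv).mean()
        opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(policy(torch.eye(2)), -1))       # rows concentrate on the paying arm
```

The clip term bounds how far a single update can move the policy from the one that collected the batch, which is the core PPO stabilizer.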

Notes / what’s intentionally not “final proof” yet

  • The Phase-5 dataset in this workspace was generated with sage rl collect-synth (synthetic trajectories); it only includes real trajectories if you run additional real sage run sessions and export them.
  • For strict provenance separation, use:
    • sage rl export --output datasets/routing_real_v1.jsonl --data-source real
    • sage rl export --output datasets/routing_synth_v1.jsonl --data-source synthetic
  • sage bench --compare-policy depends on local model availability/latency; eval-offline is deterministic given exported rows.