This doc maps the Phase 5–6 (offline RL, simulator) work to concrete artifacts in this workspace and shows how to reproduce each one.
- 500+ diverse trajectories collected
  - Dataset export: `datasets/routing_v1.jsonl`
  - Row count observed in this workspace: 651 rows (`wc -l datasets/routing_v1.jsonl`)
  - How to reproduce (synthetic collector in-repo, then export): `sage rl collect-synth --rows 650`, then `sage rl export --output datasets/routing_v1.jsonl` (a row-level sanity check is sketched below)
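The export is one JSON object per line. A minimal sanity check of the file, assuming rows carry an `action` field (the export schema is not confirmed here):

```python
import json
from collections import Counter

# Count rows and tally actions in the exported trajectory dataset.
# The "action" field name is an assumption about the export schema.
with open("datasets/routing_v1.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} rows")                      # 651 observed in this workspace
print(Counter(r.get("action") for r in rows))   # rough diversity check
```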
- Reward function tuned
  - Implemented reward: `src/sage/rl/reward.py` (`reward_v1` via `composite_reward`)
  - Analysis report: `datasets/reward_report.json` (generated by `sage rl analyze-rewards`)
  - How to reproduce: `sage rl analyze-rewards --data datasets/routing_v1.jsonl --out datasets/reward_report.json` (an illustrative reward sketch follows below)
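For orientation, a hypothetical sketch of the shape a composite reward takes: task outcome dominates, with secondary penalties. The terms and weights below are illustrative assumptions, not the tuned values in `src/sage/rl/reward.py`:

```python
def composite_reward(success: bool, latency_s: float, cost_usd: float) -> float:
    """Illustrative composite reward. The terms and weights here are
    placeholders, not the tuned values from src/sage/rl/reward.py."""
    reward = 1.0 if success else -1.0  # task outcome dominates
    reward -= 0.01 * latency_s         # mild latency penalty (assumed)
    reward -= 0.10 * cost_usd          # mild cost penalty (assumed)
    return reward
```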
- Behavior Cloning baseline trained
  - Checkpoint: `memory/rl/policy_bc.joblib`
  - Training command: `sage rl train-bc --data datasets/routing_v1.jsonl --out memory/rl/policy_bc.joblib` (see the training sketch below)
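The `.joblib` checkpoint suggests a scikit-learn-style estimator. A minimal behavior-cloning sketch under that assumption; the `state`/`action` field names and numeric-vector features are assumptions, not the repo's actual schema:

```python
import json

import joblib
from sklearn.linear_model import LogisticRegression

# Behavior cloning as supervised learning: predict the logged action
# from the logged state. Field names and feature encoding are assumed.
X, y = [], []
with open("datasets/routing_v1.jsonl") as f:
    for line in f:
        row = json.loads(line)
        X.append(row["state"])   # assumed: numeric feature vector
        y.append(row["action"])  # assumed: discrete routing choice

clf = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(clf, "memory/rl/policy_bc.joblib")
```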
- CQL policy trained + integrated with fallback
  - Conservative policy checkpoint: `memory/rl/policy_cql.joblib`
  - Router integration is via `src/sage/orchestrator/model_router.py`:
    - set `SAGE_RL_POLICY=1` and `SAGE_RL_CHECKPOINT=memory/rl/policy_cql.joblib`
    - the router uses the conservative policy output with a confidence floor (sketched below)
  - How to reproduce: `sage rl train-cql --data datasets/routing_v1.jsonl --out memory/rl/policy_cql.joblib`
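The fallback pattern referenced above boils down to: trust the learned policy only when its confidence clears a floor, otherwise use the static table. A sketch of that logic; the names and threshold are illustrative, not the actual `model_router.py` code:

```python
import numpy as np

CONFIDENCE_FLOOR = 0.6  # illustrative threshold, not the repo's value

def route(state_features, policy, static_table, task_kind):
    """Prefer the learned policy when confident; otherwise fall back
    to the static routing table. Sketch only."""
    probs = policy.predict_proba([state_features])[0]
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_FLOOR:
        return policy.classes_[best]
    return static_table[task_kind]  # conservative fallback
```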
- Benchmark: policy vs static table comparison
  - Two evaluation paths exist:
    - Offline eval (available immediately, no LLM dependency):
      - report: `datasets/offline_eval_cql.json`
      - command: `sage rl eval-offline --data datasets/routing_v1.jsonl --checkpoint memory/rl/policy_cql.joblib --out datasets/offline_eval_cql.json`
    - `sage bench --compare-policy` (LLM-dependent):
      - runs the bench suite twice, with `SAGE_RL_POLICY=0` vs `SAGE_RL_POLICY=1`
      - to persist a proof artifact: `sage bench --compare-policy --out memory/benchmarks/bench.json`
      - standardized run-pack (artifact + manifest): `sage bench --compare-policy --run-pack-dir memory/benchmarks/run_pack_compare`
  - Recommendation: treat `eval-offline` as the deterministic Phase-5 "evidence" and `sage bench` as the best-effort LLM-dependent check (a crude version of the offline estimator is sketched below).
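One crude way to compare routers offline is to score each against the logged rewards on rows where its choice matches the logged action (importance weighting would be the fuller treatment). A sketch with assumed field names; the real `eval-offline` may use a different estimator:

```python
import json

import joblib

def mean_matched_reward(rows, choose):
    """Average logged reward over rows where choose(state) reproduces
    the logged action. Crude on-policy slice, for illustration only."""
    hits = [r["reward"] for r in rows if choose(r["state"]) == r["action"]]
    if not hits:
        return float("nan"), 0
    return sum(hits) / len(hits), len(hits)

rows = [json.loads(l) for l in open("datasets/routing_v1.jsonl") if l.strip()]
policy = joblib.load("memory/rl/policy_cql.joblib")

score, n = mean_matched_reward(rows, lambda s: policy.predict([s])[0])
print(f"policy: mean reward {score:.3f} over {n} matched rows")
# Scoring the static table is the same call with a table lookup as `choose`.
```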
- 1000+ benchmark tasks with known solutions
  - Task suite: `datasets/sim_tasks.jsonl`
  - Count observed in this workspace: 1200 tasks
  - How to reproduce: `sage sim generate --count 1000 --out datasets/sim_tasks.jsonl` (a hypothetical task row is shown below)
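A "known solution" is what makes each task auto-gradable. A purely hypothetical example of one task row; the real schema produced by `sage sim generate` is not confirmed here:

```python
import json

# Hypothetical shape of one row in datasets/sim_tasks.jsonl; the actual
# schema from `sage sim generate` may differ.
task = {
    "id": "sim-000001",
    "prompt": "Write a function fib(n) returning the nth Fibonacci number.",
    "known_solution": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
    "check": "assert fib(10) == 55",
}
print(json.dumps(task))  # one JSONL line
```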
- Parallel Docker sandbox infrastructure
  - Implemented:
    - Docker tool runner: `src/sage/sim/docker_runner.py`
    - Parallel runner: `src/sage/sim/parallel_runner.py` with `--docker` wiring
    - CLI: `sage sim run --tasks datasets/sim_tasks.jsonl --workers 4 --docker`
  - Local Docker image: `sim/Dockerfile` → `sage-sim:latest` (the fan-out pattern is sketched below)
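The parallel sandbox pattern is: one throwaway container per task, fanned out over a worker pool. A sketch of that idea, not the actual `parallel_runner.py`/`docker_runner.py` code:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_in_docker(task_cmd: str, timeout: int = 60):
    """Run one task command in a throwaway sage-sim:latest container.
    Sketch only; the repo's runner wiring may differ."""
    return subprocess.run(
        ["docker", "run", "--rm", "--network", "none",
         "sage-sim:latest", "sh", "-c", task_cmd],
        capture_output=True, text=True, timeout=timeout,
    )

commands = ["python -c 'print(1 + 1)'", "python -c 'print(2 * 3)'"]
with ThreadPoolExecutor(max_workers=4) as pool:  # mirrors --workers 4
    for result in pool.map(run_in_docker, commands):
        print(result.returncode, result.stdout.strip())
```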
- PPO fine-tuning pipeline
  - Implemented as a minimal working PPO on a toy contextual-bandit simulator: `src/sage/sim/ppo.py` (`train_ppo`)
  - This is Phase-6 wiring plus a correctness baseline; the "research-grade PPO on software rollouts" still needs the richer simulator environment (an illustrative toy of the same shape follows below).
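For intuition about what "minimal PPO on a toy contextual bandit" means, here is a self-contained reimplementation of the same flavor: a tabular softmax policy trained with PPO's clipped surrogate on a one-step bandit. Illustrative only, not the code in `src/sage/sim/ppo.py`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_act, lr, clip = 4, 3, 0.05, 0.2
logits = np.zeros((n_ctx, n_act))  # tabular softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    # Sample a batch from the bandit: reward 1 iff action == ctx % n_act.
    ctxs = rng.integers(n_ctx, size=32)
    old_p = np.array([softmax(logits[c]) for c in ctxs])
    acts = np.array([rng.choice(n_act, p=p) for p in old_p])
    rews = (acts == ctxs % n_act).astype(float)
    adv = rews - rews.mean()  # batch-mean baseline

    # One clipped-surrogate step: skip samples where the ratio is clipped
    # and the update would push it further out of the trust region.
    for c, a, A, p_old in zip(ctxs, acts, adv, old_p[np.arange(32), acts]):
        p_new = softmax(logits[c])
        ratio = p_new[a] / p_old
        if (A >= 0 and ratio < 1 + clip) or (A < 0 and ratio > 1 - clip):
            grad = -p_new            # d log pi(a|c) / d logits
            grad[a] += 1.0
            logits[c] += lr * A * ratio * grad

print("greedy policy correct:",
      all(np.argmax(logits[c]) == c % n_act for c in range(n_ctx)))
```

The clipped ratio is the whole point of PPO here: once the new policy's probability for a sampled action drifts more than `clip` away from the sampling policy's, that sample contributes no further gradient.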
- The Phase-5 dataset in this workspace is generated using `sage rl collect-synth` (synthetic trajectories) unless you run additional real `sage run` sessions and export them.
- For strict provenance separation, use `sage rl export --output datasets/routing_real_v1.jsonl --data-source real` and `sage rl export --output datasets/routing_synth_v1.jsonl --data-source synthetic` (a provenance spot-check is sketched below).
- `sage bench --compare-policy` depends on local model availability/latency; `eval-offline` is deterministic given exported rows.
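Assuming the exporter stamps each row with its `data_source` (a field name inferred from the CLI flag above, not confirmed), provenance can be spot-checked directly:

```python
import json
from collections import Counter

# Tally provenance tags; "data_source" is an assumed field name
# inferred from the --data-source flag above.
with open("datasets/routing_v1.jsonl") as f:
    sources = Counter(json.loads(line).get("data_source", "untagged")
                      for line in f if line.strip())
print(sources)
```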