This doc maps the Phase 5–6 (offline RL, simulator) work to concrete artifacts in this workspace and shows how to reproduce each one.
- 500+ diverse trajectories collected
  - Dataset export: `datasets/routing_v1.jsonl`
  - Row count observed in this workspace: 651 rows (`wc -l datasets/routing_v1.jsonl`)
  - How to reproduce (synthetic collector in-repo, then export): `sage rl collect-synth --rows 650`, then `sage rl export --output datasets/routing_v1.jsonl` (a row-level sanity check is sketched below)
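The export is one JSON object per line. A minimal sanity check of the file, assuming rows carry an `action` field (the export schema is not confirmed here):

```python
import json
from collections import Counter

# Count rows and tally actions in the exported trajectory dataset.
# The "action" field name is an assumption about the export schema.
with open("datasets/routing_v1.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} rows")                      # 651 observed in this workspace
print(Counter(r.get("action") for r in rows))   # rough diversity check
```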
- Reward function tuned
  - Implemented reward: `src/sage/rl/reward.py` (`reward_v1` via `composite_reward`)
  - Analysis report: `datasets/reward_report.json` (generated by `sage rl analyze-rewards`)
  - How to reproduce: `sage rl analyze-rewards --data datasets/routing_v1.jsonl --out datasets/reward_report.json` (an illustrative reward sketch follows below)
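For orientation, a hypothetical sketch of the shape a composite reward takes: task outcome dominates, with secondary penalties. The terms and weights below are illustrative assumptions, not the tuned values in `src/sage/rl/reward.py`:

```python
def composite_reward(success: bool, latency_s: float, cost_usd: float) -> float:
    """Illustrative composite reward. The terms and weights here are
    placeholders, not the tuned values from src/sage/rl/reward.py."""
    reward = 1.0 if success else -1.0  # task outcome dominates
    reward -= 0.01 * latency_s         # mild latency penalty (assumed)
    reward -= 0.10 * cost_usd          # mild cost penalty (assumed)
    return reward
```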
- Behavior Cloning baseline trained
  - Checkpoint: `memory/rl/policy_bc.joblib`
  - Training command: `sage rl train-bc --data datasets/routing_v1.jsonl --out memory/rl/policy_bc.joblib` (see the training sketch below)
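The `.joblib` checkpoint suggests a scikit-learn-style estimator. A minimal behavior-cloning sketch under that assumption; the `state`/`action` field names and numeric-vector features are assumptions, not the repo's actual schema:

```python
import json

import joblib
from sklearn.linear_model import LogisticRegression

# Behavior cloning as supervised learning: predict the logged action
# from the logged state. Field names and feature encoding are assumed.
X, y = [], []
with open("datasets/routing_v1.jsonl") as f:
    for line in f:
        row = json.loads(line)
        X.append(row["state"])   # assumed: numeric feature vector
        y.append(row["action"])  # assumed: discrete routing choice

clf = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(clf, "memory/rl/policy_bc.joblib")
```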
- CQL policy trained + integrated with fallback
  - Conservative policy checkpoint: `memory/rl/policy_cql.joblib`
  - Router integration is via `src/sage/orchestrator/model_router.py`:
    - set `SAGE_RL_POLICY=1` and `SAGE_RL_CHECKPOINT=memory/rl/policy_cql.joblib`
    - the router uses the conservative policy output with a confidence floor (sketched below)
  - How to reproduce: `sage rl train-cql --data datasets/routing_v1.jsonl --out memory/rl/policy_cql.joblib`
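The fallback pattern referenced above boils down to: trust the learned policy only when its confidence clears a floor, otherwise use the static table. A sketch of that logic; the names and threshold are illustrative, not the actual `model_router.py` code:

```python
import numpy as np

CONFIDENCE_FLOOR = 0.6  # illustrative threshold, not the repo's value

def route(state_features, policy, static_table, task_kind):
    """Prefer the learned policy when confident; otherwise fall back
    to the static routing table. Sketch only."""
    probs = policy.predict_proba([state_features])[0]
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_FLOOR:
        return policy.classes_[best]
    return static_table[task_kind]  # conservative fallback
```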
- Benchmark: policy vs static table comparison
  - Two evaluation paths exist:
    - Offline eval (available immediately, no LLM dependency):
      - report: `datasets/offline_eval_cql.json`
      - command: `sage rl eval-offline --data datasets/routing_v1.jsonl --checkpoint memory/rl/policy_cql.joblib --out datasets/offline_eval_cql.json`
    - `sage bench --compare-policy` (LLM-dependent):
      - runs the bench suite twice, with `SAGE_RL_POLICY=0` vs `SAGE_RL_POLICY=1`
      - to persist a proof artifact: `sage bench --compare-policy --out memory/benchmarks/bench.json`
      - standardized run-pack (artifact + manifest): `sage bench --compare-policy --run-pack-dir memory/benchmarks/run_pack_compare`
  - Recommendation: treat `eval-offline` as the deterministic Phase-5 "evidence" and `sage bench` as the best-effort LLM-dependent check (a crude version of the offline estimator is sketched below).
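One crude way to compare routers offline is to score each against the logged rewards on rows where its choice matches the logged action (importance weighting would be the fuller treatment). A sketch with assumed field names; the real `eval-offline` may use a different estimator:

```python
import json

import joblib

def mean_matched_reward(rows, choose):
    """Average logged reward over rows where choose(state) reproduces
    the logged action. Crude on-policy slice, for illustration only."""
    hits = [r["reward"] for r in rows if choose(r["state"]) == r["action"]]
    if not hits:
        return float("nan"), 0
    return sum(hits) / len(hits), len(hits)

rows = [json.loads(l) for l in open("datasets/routing_v1.jsonl") if l.strip()]
policy = joblib.load("memory/rl/policy_cql.joblib")

score, n = mean_matched_reward(rows, lambda s: policy.predict([s])[0])
print(f"policy: mean reward {score:.3f} over {n} matched rows")
# Scoring the static table is the same call with a table lookup as `choose`.
```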
- 1000+ benchmark tasks with known solutions
  - Task suite: `datasets/sim_tasks.jsonl`
  - Count observed in this workspace: 1200 tasks
  - How to reproduce: `sage sim generate --count 1000 --out datasets/sim_tasks.jsonl` (a hypothetical task row is shown below)
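A "known solution" is what makes each task auto-gradable. A purely hypothetical example of one task row; the real schema produced by `sage sim generate` is not confirmed here:

```python
import json

# Hypothetical shape of one row in datasets/sim_tasks.jsonl; the actual
# schema from `sage sim generate` may differ.
task = {
    "id": "sim-000001",
    "prompt": "Write a function fib(n) returning the nth Fibonacci number.",
    "known_solution": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
    "check": "assert fib(10) == 55",
}
print(json.dumps(task))  # one JSONL line
```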
- Parallel Docker sandbox infrastructure
  - Implemented:
    - Docker tool runner: `src/sage/sim/docker_runner.py`
    - Parallel runner: `src/sage/sim/parallel_runner.py` with `--docker` wiring
    - CLI: `sage sim run --tasks datasets/sim_tasks.jsonl --workers 4 --docker`
  - Local Docker image: `sim/Dockerfile` → `sage-sim:latest` (the fan-out pattern is sketched below)
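The parallel sandbox pattern is: one throwaway container per task, fanned out over a worker pool. A sketch of that idea, not the actual `parallel_runner.py`/`docker_runner.py` code:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_in_docker(task_cmd: str, timeout: int = 60):
    """Run one task command in a throwaway sage-sim:latest container.
    Sketch only; the repo's runner wiring may differ."""
    return subprocess.run(
        ["docker", "run", "--rm", "--network", "none",
         "sage-sim:latest", "sh", "-c", task_cmd],
        capture_output=True, text=True, timeout=timeout,
    )

commands = ["python -c 'print(1 + 1)'", "python -c 'print(2 * 3)'"]
with ThreadPoolExecutor(max_workers=4) as pool:  # mirrors --workers 4
    for result in pool.map(run_in_docker, commands):
        print(result.returncode, result.stdout.strip())
```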
- PPO fine-tuning pipeline
  - Implemented as a minimal working PPO on a toy contextual-bandit simulator: `src/sage/sim/ppo.py` (`train_ppo`)
  - This is Phase-6 wiring plus a correctness baseline; the "research-grade PPO on software rollouts" still needs the richer simulator environment (an illustrative toy of the same shape follows below).
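For intuition about what "minimal PPO on a toy contextual bandit" means, here is a self-contained reimplementation of the same flavor: a tabular softmax policy trained with PPO's clipped surrogate on a one-step bandit. Illustrative only, not the code in `src/sage/sim/ppo.py`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_act, lr, clip = 4, 3, 0.05, 0.2
logits = np.zeros((n_ctx, n_act))  # tabular softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    # Sample a batch from the bandit: reward 1 iff action == ctx % n_act.
    ctxs = rng.integers(n_ctx, size=32)
    old_p = np.array([softmax(logits[c]) for c in ctxs])
    acts = np.array([rng.choice(n_act, p=p) for p in old_p])
    rews = (acts == ctxs % n_act).astype(float)
    adv = rews - rews.mean()  # batch-mean baseline

    # One clipped-surrogate step: skip samples where the ratio is clipped
    # and the update would push it further out of the trust region.
    for c, a, A, p_old in zip(ctxs, acts, adv, old_p[np.arange(32), acts]):
        p_new = softmax(logits[c])
        ratio = p_new[a] / p_old
        if (A >= 0 and ratio < 1 + clip) or (A < 0 and ratio > 1 - clip):
            grad = -p_new            # d log pi(a|c) / d logits
            grad[a] += 1.0
            logits[c] += lr * A * ratio * grad

print("greedy policy correct:",
      all(np.argmax(logits[c]) == c % n_act for c in range(n_ctx)))
```

The clipped ratio is the whole point of PPO here: once the new policy's probability for a sampled action drifts more than `clip` away from the sampling policy's, that sample contributes no further gradient.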
- The Phase-5 dataset in this workspace is generated using `sage rl collect-synth` (synthetic trajectories) unless you run additional real `sage run` sessions and export them.
- For strict provenance separation, use `sage rl export --output datasets/routing_real_v1.jsonl --data-source real` and `sage rl export --output datasets/routing_synth_v1.jsonl --data-source synthetic` (a provenance spot-check is sketched below).
- `sage bench --compare-policy` depends on local model availability/latency; `eval-offline` is deterministic given exported rows.
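Assuming the exporter stamps each row with its `data_source` (a field name inferred from the CLI flag above, not confirmed), provenance can be spot-checked directly:

```python
import json
from collections import Counter

# Tally provenance tags; "data_source" is an assumed field name
# inferred from the --data-source flag above.
with open("datasets/routing_v1.jsonl") as f:
    sources = Counter(json.loads(line).get("data_source", "untagged")
                      for line in f if line.strip())
print(sources)
```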