[EXAMPLE] [WIP] Add Terminus Pi TRL training #675
dcbf977 to af16a1c
af16a1c to cb65ba7
Four-node Qwen3-4B async GRPO run
Pushed the latest example fixes to
Completed a four-node SLURM run:
Observed metrics:
Validation:
Rollout worker simplification
Pushed Changes are intentionally scoped to
Kept the proof-of-concept pieces in the example for now: the interception server, vLLM weight sync, Qwen/PI message normalization, dataset task coercion, and the Terminus text-to-tool-call shim.
Validation:
Darktex left a comment
Note: This is an automated review by Claude Code, not a human review.
Tier 1: Bugs & Issues
- Bug (Critical): logprobs=0 silently zeroes all log-probs. pi_rollout_worker.py line 763 passes "logprobs": 0 to vLLM /v1/completions, which returns zero log-prob tokens. The fallback on lines 773-774 then fills all log-probs with 0.0. GRPO computes exp(log_pi_new - log_pi_old); with log_pi_old = 0.0 this collapses to exp(log_pi_new), corrupting the policy ratio. Should be "logprobs": 1.
- Credential safety: pi_cli.py includes "stderr": stderr in the HarnessRolloutResult metrics dict. If the pi subprocess emits API keys or tokens in stderr (error paths, debug output), those propagate through the metrics reporting pipeline, violating the "No credential exposure" invariant. Drop stderr from metrics or sanitize it.
- assert in production code: pi_rollout_worker.py lines 458-459 use assert to guard NCCLTrainerSendWeightsArgs and NCCLWeightTransferEngine. These are silently no-ops with python -O. Replace with explicit if ... is None: raise RuntimeError(...).
- Internal vLLM state mutation: pi_rollout_worker.py lines 586-588 directly set group.group.store = None and group.group.socket = None on undocumented vLLM NCCL group internals. Will break silently on vLLM version bumps.
- Short join timeout: pi_rollout_worker.py line 726 pi_thread.join(timeout=1.0) in the finally block may leave the Pi subprocess thread running. Should log a warning if thread is still alive after join.
- Rollout failures swallowed: pi_rollout_worker.py lines 606-608 catch all exceptions and continue. _failed event is never set from _loop, so check_health() won't raise until the heartbeat goes stale. Consider setting _failed after N consecutive failures.
- Late import: src/openenv/core/harness/__init__.py line 703 imports PiCLIHarnessAdapter at the bottom of a 700-line module, making import errors hard to trace.
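To make the critical item above concrete, here is a minimal numeric sketch of how a zero-filled old log-prob distorts the ratio exp(log_pi_new - log_pi_old). It is an illustration only: the helper `policy_ratio` and the example probability 0.25 are made up for this sketch and are not taken from pi_rollout_worker.py or from TRL.

```python
import math


def policy_ratio(log_pi_new: float, log_pi_old: float) -> float:
    """Per-token importance ratio pi_new / pi_old, computed in log space."""
    return math.exp(log_pi_new - log_pi_old)


# Hypothetical sampled token whose probability is 0.25 under both the
# rollout (old) policy and the current (new) policy.
true_logprob = math.log(0.25)

# With the real old log-prob the ratio is 1: the policy has not moved.
correct = policy_ratio(true_logprob, true_logprob)

# With the old log-prob zero-filled, the "ratio" is just pi_new itself,
# so an unchanged policy looks like it has shrunk this token's probability.
collapsed = policy_ratio(true_logprob, 0.0)

print(f"correct ratio:   {correct:.4f}")
print(f"collapsed ratio: {collapsed:.4f}")
```

In this sketch the collapsed value equals the token probability rather than a ratio near 1, which is the distortion the review describes.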
Tier 2: Alignment
ALIGNMENT FLAG: No authentication on MCP-to-Pi bridge server
- Invariant at risk: Agent isolation (INVARIANTS.md). pi_cli.py's BridgeHandler.do_POST has no auth check. On a shared Slurm node, another job could enumerate the bridge port and call tools/call to invoke environment tools. The bridge should require a per-run secret.
- Suggested reviewer: @Darktex
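One way the per-run secret could look is sketched below, using only the Python standard library. The header name `X-Bridge-Token` and the helpers `new_run_token` and `is_authorized` are hypothetical names for illustration; nothing here reflects the actual BridgeHandler code.

```python
import hmac
import secrets
from typing import Mapping

# Hypothetical header name; the real bridge may prefer a different transport.
TOKEN_HEADER = "X-Bridge-Token"


def new_run_token() -> str:
    """Generate a fresh random secret once per training run."""
    return secrets.token_urlsafe(32)


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Return True only if the request presents the per-run token."""
    presented = headers.get(TOKEN_HEADER)
    if presented is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Inside an http.server handler, a check like this would presumably run at the top of do_POST against self.headers, responding 401 before any tool dispatch when it fails. The token would need to reach the Pi subprocess out of band, for example through its environment, which is also an assumption of this sketch.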
ALIGNMENT FLAG: unused_reward stub bypasses TRL reward_funcs mechanism
- Invariant at risk: "Rewards inside environment" (RFC 002). All reward signal comes from env_reward via the rollout worker, while a zero-returning unused_reward is passed to AsyncGRPOTrainer. This inverts TRL's assumptions and needs clear documentation.
- Suggested reviewer: @Darktex
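The documentation asked for above could live directly on the stub. The sketch below assumes the trainer calls reward functions with a `completions` list plus extra keyword arguments and expects one float per completion, as TRL documents for custom GRPO reward functions; whether AsyncGRPOTrainer uses exactly this calling convention is an assumption, and the body is illustrative rather than the PR's actual code.

```python
from typing import Any, List


def unused_reward(completions: List[Any], **kwargs: Any) -> List[float]:
    """Placeholder reward function that always returns zeros.

    The real reward is computed inside the environment and delivered as
    env_reward by the rollout worker (see "Rewards inside environment",
    RFC 002). This stub exists only because the trainer requires at least
    one reward function; its output must not be used as a training signal.
    """
    return [0.0] * len(completions)
```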
Summary
Critical training bug: logprobs=0 zeros all log-probs, corrupting GRPO's policy ratio (should be logprobs=1). One credential safety issue (stderr in metrics). Several hardening gaps appropriate for WIP but blocking for merge. The Pi bridge lacks authentication, which matters on shared compute nodes.
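As an illustration of one hardening gap, the swallowed rollout failures, a consecutive-failure counter could trip the failure event roughly as follows. The class `FailureTracker` and its method names are hypothetical and only mirror the `_failed` and `check_health()` names mentioned in the review; the default threshold of 3 is arbitrary.

```python
import threading


class FailureTracker:
    """Trips a failure event after N consecutive rollout failures."""

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self.failed = threading.Event()
        self._consecutive = 0

    def record_success(self) -> None:
        # Any successful rollout resets the streak.
        self._consecutive = 0

    def record_failure(self) -> None:
        self._consecutive += 1
        if self._consecutive >= self.max_consecutive:
            self.failed.set()

    def check_health(self) -> None:
        # Raise promptly instead of waiting for a stale heartbeat.
        if self.failed.is_set():
            raise RuntimeError(
                f"{self._consecutive} consecutive rollout failures"
            )
```

The worker loop would call record_failure() in its except branch and record_success() after each completed rollout. The counter is only touched from that one loop thread in this sketch, so it carries no lock of its own.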
Automated review by Claude Code

This PR contains a minimal Terminus + Pi + TRL async GRPO example under examples/end_to_end/tbench2_pi_trl. It depends on #729 for the Terminus env harness and sandbox backend changes used by the training script.
Runs
Qwen/Qwen3.5-4B
- 22182836, COMPLETED, exit code 0:0, elapsed 00:14:47
- 590.107s
- 0.038534664
- rewards/terminus_reward reached 1.0 at steps 10, 11, and 12
- a79cf0c20afc7b937df8d015bef2093a3ac21a25
Qwen/Qwen3-0.6B training sanity run
- 22179048, COMPLETED, exit code 0:0, elapsed 00:06:28
- rewards/terminus_reward=1.0 through 12 steps
Qwen/Qwen3-4B PR head
- 22183082, COMPLETED, exit code 0:0, elapsed 00:12:15
- hopper-prod; sandbox, vLLM, trainer, plus one reserve node in the simple cluster wrapper
- 402.4s for 200/200 optimizer steps
- -0.03814
- 0.93, min 0.5, max 1.0, final reward 1.0
- terminus-qwen3-4b-22183082-terminus
- f9351f6ce92de5c6559ef46de03a0750f2fd29df
- save_strategy="no" and kept the existing final save_model()/push_to_hub() path, avoiding a 16 GB checkpoint write and push every optimizer step.