Skip to content

[EXAMPLE] [WIP] Add Terminus Pi TRL training#675

Draft
burtenshaw wants to merge 10 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-pi-trl-space
Draft

[EXAMPLE] [WIP] Add Terminus Pi TRL training#675
burtenshaw wants to merge 10 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-pi-trl-space

Conversation

@burtenshaw
Copy link
Copy Markdown
Collaborator

@burtenshaw burtenshaw commented May 12, 2026

This PR contains a minimal Terminus + Pi + TRL async GRPO example under examples/end_to_end/tbench2_pi_trl.

It depends on #729 for the Terminus env harness and sandbox backend changes used by the training script.

Runs

Qwen/Qwen3.5-4B

Qwen/Qwen3-0.6B training sanity run

Qwen/Qwen3-4B PR head

  • Slurm job: 22183082, COMPLETED, exit code 0:0, elapsed 00:12:15
  • Allocation: 4 H100 nodes on hopper-prod; sandbox, vLLM, trainer, plus one reserve node in the simple cluster wrapper
  • Trainer runtime: 402.4s for 200/200 optimizer steps
  • Train loss: -0.03814
  • Reward metrics: 200 logged rows, mean reward 0.93, min 0.5, max 1.0, final reward 1.0
  • Trackio run: terminus-qwen3-4b-22183082-terminus
  • Hub model: https://huggingface.co/burtenshaw/terminus-pi-trl-async-grpo-qwen3-4b
  • Hub model commit: f9351f6ce92de5c6559ef46de03a0750f2fd29df
  • Trackio Space: https://huggingface.co/spaces/burtenshaw/terminus-pi-trl-static-12edaf
  • Run fix made during validation: changed the trainer config to save_strategy="no" and kept the existing final save_model() / push_to_hub() path, avoiding a 16 GB checkpoint write and push every optimizer step.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 12, 2026
Darktex

This comment was marked as outdated.

@burtenshaw burtenshaw force-pushed the codex/terminus-pi-trl-space branch 6 times, most recently from dcbf977 to af16a1c Compare May 29, 2026 18:43
@burtenshaw burtenshaw force-pushed the codex/terminus-pi-trl-space branch from af16a1c to cb65ba7 Compare May 29, 2026 18:49
@burtenshaw
Copy link
Copy Markdown
Collaborator Author

burtenshaw commented May 29, 2026

Cluster training run: Terminus async GRPO, Qwen/Qwen3.5-4B

Completed a 3-node Slurm run from worktree /fsx/benjamin_burtenshaw/OpenEnv-codex-terminus-pi-trl-space-run2.

  • Slurm job: 22183037, state COMPLETED, exit 0:0, elapsed 00:49:14
  • Nodes: sandbox ip-26-0-170-160, vLLM ip-26-0-171-56, trainer ip-26-0-171-62
  • Run commit used: 3ae1cb23 (examples: configure terminus grpo training run)
  • Note: the PR branch was updated afterward to 56fdccbf; I did not overwrite that newer branch state.
  • Trackio Space: burtenshaw/terminus-pi-trl-trackio
  • Task dataset: burtenshaw/terminus-pi-trl-tasks, updated to max_steps=[200]

Model repos pushed:

  • burtenshaw/terminus-pi-trl-qwen35-4b-200-alpha
  • burtenshaw/terminus-pi-trl-qwen35-4b-200-beta

Reward validation from trainer logs:

run steps first 20 mean reward last 20 mean reward final reward train runtime train loss
alpha 200 0.25 0.85 1.0 934.6s -0.02174
beta 200 0.975 1.0 1.0 799.0s -0.00917
train_reward

https://burtenshaw-terminus-pi-trl-trackio.hf.space/?project=terminus-pi-trl&run_ids=5f7015b368544ece97c09de6ea4c47b1%2Caddd05850af0465cb040df647e600cd7&sidebar=hidden&navbar=hidden

@burtenshaw burtenshaw changed the title Add Terminus Pi TRL training [EXAMPLE] Add Terminus Pi TRL training May 29, 2026
@burtenshaw burtenshaw marked this pull request as ready for review May 29, 2026 19:58
Darktex

This comment was marked as duplicate.

@burtenshaw burtenshaw marked this pull request as draft May 30, 2026 10:08
@burtenshaw burtenshaw changed the title [EXAMPLE] Add Terminus Pi TRL training [EXAMPLE] [WIP] Add Terminus Pi TRL training May 30, 2026
Darktex

This comment was marked as outdated.

Darktex

This comment was marked as duplicate.

@burtenshaw
Copy link
Copy Markdown
Collaborator Author

Four-node Qwen3-4B async GRPO run

Pushed the latest example fixes to codex/terminus-pi-trl-space:

  • 0e322847 normalizes Terminus-style text terminal calls from PI/model output into OpenAI tool-call JSON.
  • 7e662a83 passes dataset chat prompts to Terminus as plain instructions instead of stringified dataset rows.
  • 70a6bb27 normalizes PI OpenAI content parts to strings before Qwen chat template rendering. This was required because Qwen rendered content: [{"type":"text", ...}] as an empty user message.

Completed a four-node SLURM run:

Observed metrics:

  • Final summary: train_runtime=1272s, train_steps_per_second=0.157, train_loss=-0.9997, epoch=1
  • Reward and verification stayed healthy after the content-normalization fix: reward=1, verify/done=1, verify/submitted_answer=1, verify/commands=3, turns=4
  • Model repo verified on Hub at commit f7de6ea5bd058b44b91bf69de2ae4a20171520fe.

Validation:

  • uv run --with ruff ruff check examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py
  • uv run python -m py_compile examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py

@burtenshaw
Copy link
Copy Markdown
Collaborator Author

Rollout worker simplification

Pushed 4ae641b0 (Simplify Terminus PI rollout worker) to keep the proof-of-concept example self-contained but smaller.

Changes are intentionally scoped to examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py:

  • Removed unused _failed / _exception health state.
  • Removed streaming/SSE support from the local PI interception server; the example now only supports the JSON response path it uses.
  • Removed _stream_chunks.
  • Removed _assistant_message_from_text and folded that fallback into _parse_assistant_message.
  • Removed _current_model_version and read the value inline when building RolloutSample.

Kept the proof-of-concept pieces in the example for now: the interception server, vLLM weight sync, Qwen/PI message normalization, dataset task coercion, and the Terminus text-to-tool-call shim.

Validation:

  • uv run --with ruff ruff check examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py
  • uv run python -m py_compile examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py

Copy link
Copy Markdown
Contributor

@Darktex Darktex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Tier 1: Bugs & Issues

  • Bug (Critical): logprobs=0 silently zeroes all log-probs. pi_rollout_worker.py line 763 passes "logprobs": 0 to vLLM /v1/completions, which returns zero log-prob tokens. The fallback on lines 773-774 then fills all log-probs with 0.0. GRPO computes exp(log_pi_new - log_pi_old) — with log_pi_old = 0.0 this collapses to exp(log_pi_new), corrupting the policy ratio. Should be "logprobs": 1.

  • Credential safety: pi_cli.py includes "stderr": stderr in the HarnessRolloutResult metrics dict. If the pi subprocess emits API keys or tokens in stderr (error paths, debug output), those propagate through the metrics reporting pipeline, violating the "No credential exposure" invariant. Drop stderr from metrics or sanitize it.

  • assert in production code: pi_rollout_worker.py lines 458-459 use assert to guard NCCLTrainerSendWeightsArgs and NCCLWeightTransferEngine. These are silently no-ops with python -O. Replace with explicit if ... is None: raise RuntimeError(...).

  • Internal vLLM state mutation: pi_rollout_worker.py lines 586-588 directly set group.group.store = None and group.group.socket = None on undocumented vLLM NCCL group internals. Will break silently on vLLM version bumps.

  • Short join timeout: pi_rollout_worker.py line 726 pi_thread.join(timeout=1.0) in the finally block may leave the Pi subprocess thread running. Should log a warning if thread is still alive after join.

  • Rollout failures swallowed: pi_rollout_worker.py lines 606-608 catch all exceptions and continue. _failed event is never set from _loop, so check_health() won't raise until the heartbeat goes stale. Consider setting _failed after N consecutive failures.

  • Late import: src/openenv/core/harness/__init__.py line 703 imports PiCLIHarnessAdapter at the bottom of a 700-line module, making import errors hard to trace.

Tier 2: Alignment

ALIGNMENT FLAG: No authentication on MCP-to-Pi bridge server

  • Invariant at risk: Agent isolation (INVARIANTS.md). pi_cli.py's BridgeHandler.do_POST has no auth check. On a shared Slurm node, another job could enumerate the bridge port and call tools/call to invoke environment tools. The bridge should require a per-run secret.
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: unused_reward stub bypasses TRL reward_funcs mechanism

  • Invariant at risk: "Rewards inside environment" (RFC 002). All reward signal comes from env_reward via the rollout worker, while a zero-returning unused_reward is passed to AsyncGRPOTrainer. This inverts TRL's assumptions and needs clear documentation.
  • Suggested reviewer: @Darktex

Summary

Critical training bug: logprobs=0 zeros all log-probs, corrupting GRPO's policy ratio (should be logprobs=1). One credential safety issue (stderr in metrics). Several hardening gaps appropriate for WIP but blocking for merge. The Pi bridge lacks authentication, which matters on shared compute nodes.


Automated review by Claude Code | Learn more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants