[EXAMPLE] [WIP] Add Terminus Pi TRL training by burtenshaw · Pull Request #675 · huggingface/OpenEnv

burtenshaw · 2026-05-12T09:03:30Z

This PR contains a minimal Terminus + Pi + TRL async GRPO example under examples/end_to_end/tbench2_pi_trl.

It depends on #729 for the Terminus env harness and sandbox backend changes used by the training script.

Runs

Qwen/Qwen3.5-4B

Slurm job: 22182836, COMPLETED, exit code 0:0, elapsed 00:14:47
Trainer runtime: 590.107s
Train loss: 0.038534664
Reward: rewards/terminus_reward reached 1.0 at steps 10, 11, and 12
Hub model: https://huggingface.co/burtenshaw/terminus-pi-trl-async-grpo-qwen35-4b
Hub commit: a79cf0c20afc7b937df8d015bef2093a3ac21a25
Trackio Space: https://huggingface.co/spaces/burtenshaw/terminus-pi-trl-static-becc55

Qwen/Qwen3-0.6B training sanity run

Slurm job: 22179048, COMPLETED, exit code 0:0, elapsed 00:06:28
Reward: rewards/terminus_reward=1.0 through 12 steps
Hub model: https://huggingface.co/burtenshaw/terminus-pi-trl-async-grpo
Trackio Space: https://huggingface.co/spaces/burtenshaw/terminus-pi-trl-static-a22f63

Qwen/Qwen3-4B PR head

Slurm job: 22183082, COMPLETED, exit code 0:0, elapsed 00:12:15
Allocation: 4 H100 nodes on hopper-prod; sandbox, vLLM, trainer, plus one reserve node in the simple cluster wrapper
Trainer runtime: 402.4s for 200/200 optimizer steps
Train loss: -0.03814
Reward metrics: 200 logged rows, mean reward 0.93, min 0.5, max 1.0, final reward 1.0
Trackio run: terminus-qwen3-4b-22183082-terminus
Hub model: https://huggingface.co/burtenshaw/terminus-pi-trl-async-grpo-qwen3-4b
Hub model commit: f9351f6ce92de5c6559ef46de03a0750f2fd29df
Trackio Space: https://huggingface.co/spaces/burtenshaw/terminus-pi-trl-static-12edaf
Run fix made during validation: changed the trainer config to save_strategy="no" and kept the existing final save_model() / push_to_hub() path, avoiding a 16 GB checkpoint write and push every optimizer step.

burtenshaw · 2026-05-29T19:12:07Z

Cluster training run: Terminus async GRPO, Qwen/Qwen3.5-4B

Completed a 3-node Slurm run from worktree /fsx/benjamin_burtenshaw/OpenEnv-codex-terminus-pi-trl-space-run2.

Slurm job: 22183037, state COMPLETED, exit 0:0, elapsed 00:49:14
Nodes: sandbox ip-26-0-170-160, vLLM ip-26-0-171-56, trainer ip-26-0-171-62
Run commit used: 3ae1cb23 (examples: configure terminus grpo training run)
Note: the PR branch was updated afterward to 56fdccbf; I did not overwrite that newer branch state.
Trackio Space: burtenshaw/terminus-pi-trl-trackio
Task dataset: burtenshaw/terminus-pi-trl-tasks, updated to max_steps=[200]

Model repos pushed:

burtenshaw/terminus-pi-trl-qwen35-4b-200-alpha
burtenshaw/terminus-pi-trl-qwen35-4b-200-beta

Reward validation from trainer logs:

run	steps	first 20 mean reward	last 20 mean reward	final reward	train runtime	train loss
alpha	200	0.25	0.85	1.0	934.6s	-0.02174
beta	200	0.975	1.0	1.0	799.0s	-0.00917

https://burtenshaw-terminus-pi-trl-trackio.hf.space/?project=terminus-pi-trl&run_ids=5f7015b368544ece97c09de6ea4c47b1%2Caddd05850af0465cb040df647e600cd7&sidebar=hidden&navbar=hidden

burtenshaw · 2026-05-30T19:59:05Z

Four-node Qwen3-4B async GRPO run

Pushed the latest example fixes to codex/terminus-pi-trl-space:

0e322847 normalizes Terminus-style text terminal calls from PI/model output into OpenAI tool-call JSON.
7e662a83 passes dataset chat prompts to Terminus as plain instructions instead of stringified dataset rows.
70a6bb27 normalizes PI OpenAI content parts to strings before Qwen chat template rendering. This was required because Qwen rendered content: [{"type":"text", ...}] as an empty user message.

Completed a four-node SLURM run:

Job: 22189657
Nodes: sandbox ip-26-0-167-177, vLLM ip-26-0-167-217, trainer ip-26-0-167-245, reserve ip-26-0-168-30
Run dir: examples/end_to_end/tbench2_pi_trl/logs/20260530T193520Z-qwen3-4b-rollout-qwen-content
Timing: sandbox ready 2026-05-30T19:28:28Z, vLLM ready 2026-05-30T19:30:55Z, trainer done/job done 2026-05-30T19:56:16Z
Trackio: https://huggingface.co/spaces/burtenshaw/terminus-pi-trl-trackio (qwen3-4b-rollout-22189657)
Model: https://huggingface.co/burtenshaw/terminus-pi-trl-qwen3-4b-rollout-22189657

Observed metrics:

Final summary: train_runtime=1272s, train_steps_per_second=0.157, train_loss=-0.9997, epoch=1
Reward and verification stayed healthy after the content-normalization fix: reward=1, verify/done=1, verify/submitted_answer=1, verify/commands=3, turns=4
Model repo verified on Hub at commit f7de6ea5bd058b44b91bf69de2ae4a20171520fe.

Validation:

uv run --with ruff ruff check examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py
uv run python -m py_compile examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py

burtenshaw · 2026-06-01T09:48:50Z

Rollout worker simplification

Pushed 4ae641b0 (Simplify Terminus PI rollout worker) to keep the proof-of-concept example self-contained but smaller.

Changes are intentionally scoped to examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py:

Removed unused _failed / _exception health state.
Removed streaming/SSE support from the local PI interception server; the example now only supports the JSON response path it uses.
Removed _stream_chunks.
Removed _assistant_message_from_text and folded that fallback into _parse_assistant_message.
Removed _current_model_version and read the value inline when building RolloutSample.

Kept the proof-of-concept pieces in the example for now: the interception server, vLLM weight sync, Qwen/PI message normalization, dataset task coercion, and the Terminus text-to-tool-call shim.

Validation:

uv run --with ruff ruff check examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py
uv run python -m py_compile examples/end_to_end/tbench2_pi_trl/pi_rollout_worker.py examples/end_to_end/tbench2_pi_trl/train_terminus_grpo.py

Darktex

Note: This is an automated review by Claude Code, not a human review.

Tier 1: Bugs & Issues

Bug (Critical): logprobs=0 silently zeroes all log-probs. pi_rollout_worker.py line 763 passes "logprobs": 0 to vLLM /v1/completions, which returns zero log-prob tokens. The fallback on lines 773-774 then fills all log-probs with 0.0. GRPO computes exp(log_pi_new - log_pi_old) — with log_pi_old = 0.0 this collapses to exp(log_pi_new), corrupting the policy ratio. Should be "logprobs": 1.
Credential safety: pi_cli.py includes "stderr": stderr in the HarnessRolloutResult metrics dict. If the pi subprocess emits API keys or tokens in stderr (error paths, debug output), those propagate through the metrics reporting pipeline, violating the "No credential exposure" invariant. Drop stderr from metrics or sanitize it.
assert in production code: pi_rollout_worker.py lines 458-459 use assert to guard NCCLTrainerSendWeightsArgs and NCCLWeightTransferEngine. These are silently no-ops with python -O. Replace with explicit if ... is None: raise RuntimeError(...).
Internal vLLM state mutation: pi_rollout_worker.py lines 586-588 directly set group.group.store = None and group.group.socket = None on undocumented vLLM NCCL group internals. Will break silently on vLLM version bumps.
Short join timeout: pi_rollout_worker.py line 726 pi_thread.join(timeout=1.0) in the finally block may leave the Pi subprocess thread running. Should log a warning if thread is still alive after join.
Rollout failures swallowed: pi_rollout_worker.py lines 606-608 catch all exceptions and continue. _failed event is never set from _loop, so check_health() won't raise until the heartbeat goes stale. Consider setting _failed after N consecutive failures.
Late import: src/openenv/core/harness/__init__.py line 703 imports PiCLIHarnessAdapter at the bottom of a 700-line module, making import errors hard to trace.

Tier 2: Alignment

ALIGNMENT FLAG: No authentication on MCP-to-Pi bridge server

Invariant at risk: Agent isolation (INVARIANTS.md). pi_cli.py's BridgeHandler.do_POST has no auth check. On a shared Slurm node, another job could enumerate the bridge port and call tools/call to invoke environment tools. The bridge should require a per-run secret.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: unused_reward stub bypasses TRL reward_funcs mechanism

Invariant at risk: "Rewards inside environment" (RFC 002). All reward signal comes from env_reward via the rollout worker, while a zero-returning unused_reward is passed to AsyncGRPOTrainer. This inverts TRL's assumptions and needs clear documentation.
Suggested reviewer: @Darktex

Summary

Critical training bug: logprobs=0 zeros all log-probs, corrupting GRPO's policy ratio (should be logprobs=1). One credential safety issue (stderr in metrics). Several hardening gaps appropriate for WIP but blocking for merge. The Pi bridge lacks authentication, which matters on shared compute nodes.

Automated review by Claude Code | Learn more

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 12, 2026

sergiopaniego mentioned this pull request May 19, 2026

feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment #695

Open

9 tasks

This comment was marked as outdated.

Sign in to view

burtenshaw force-pushed the codex/terminus-pi-trl-space branch 6 times, most recently from dcbf977 to af16a1c Compare May 29, 2026 18:43

examples: add terminus async grpo training

cb65ba7

burtenshaw force-pushed the codex/terminus-pi-trl-space branch from af16a1c to cb65ba7 Compare May 29, 2026 18:49

Use Qwen3 4B in Terminus async GRPO example

56fdccb

Avoid periodic checkpoint saves in Terminus example

9a433be

burtenshaw changed the title ~~Add Terminus Pi TRL training~~ [EXAMPLE] Add Terminus Pi TRL training May 29, 2026

burtenshaw marked this pull request as ready for review May 29, 2026 19:58

This comment was marked as duplicate.

Sign in to view

burtenshaw added 5 commits May 30, 2026 09:00

Move Pi context into OpenEnv core

6836803

Expose Pi context from core harness

804a919

Add Pi CLI harness adapter

90763f5

Simplify Pi CLI harness integration

1a8dfb8

Point Pi harness at training vLLM model

6db1d80

burtenshaw marked this pull request as draft May 30, 2026 10:08

burtenshaw changed the title ~~[EXAMPLE] Add Terminus Pi TRL training~~ [EXAMPLE] [WIP] Add Terminus Pi TRL training May 30, 2026

Add PI interception rollout worker

b4abfc7

This comment was marked as outdated.

Sign in to view

Allow run-specific Terminus tracking ids

140f272

This comment was marked as duplicate.

Sign in to view

Darktex suggested changes Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EXAMPLE] [WIP] Add Terminus Pi TRL training#675

[EXAMPLE] [WIP] Add Terminus Pi TRL training#675
burtenshaw wants to merge 10 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-pi-trl-space

burtenshaw commented May 12, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

burtenshaw commented May 29, 2026 •

edited

Loading

Uh oh!

This comment was marked as duplicate.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as duplicate.

Uh oh!

burtenshaw commented May 30, 2026

Uh oh!

burtenshaw commented Jun 1, 2026

Uh oh!

Darktex left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

burtenshaw commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Runs

Qwen/Qwen3.5-4B

Qwen/Qwen3-0.6B training sanity run

Qwen/Qwen3-4B PR head

Uh oh!

This comment was marked as outdated.

Uh oh!

burtenshaw commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Cluster training run: Terminus async GRPO, Qwen/Qwen3.5-4B

Uh oh!

This comment was marked as duplicate.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as duplicate.

Uh oh!

burtenshaw commented May 30, 2026

Four-node Qwen3-4B async GRPO run

Uh oh!

burtenshaw commented Jun 1, 2026

Rollout worker simplification

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Tier 1: Bugs & Issues

Tier 2: Alignment

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

burtenshaw commented May 12, 2026 •

edited

Loading

burtenshaw commented May 29, 2026 •

edited

Loading