feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment#695
feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment#695rycerzes wants to merge 67 commits into
Conversation
- related tests
- fix agent handling
- tests
…roxy the transparent proxy was a passive forwarder that captured logprobs by injecting logprobs=true into upstream requests. It is replaced by the interception_gate mode where the trainer owns the forward pass entirely — no proxy needed inside the sandbox.
…eneration
InterceptionServer (aiohttp) runs on the trainer host. Each rollout
registers a queue. The agent's OPENAI_BASE_URL points at
`{base_url}/rollout/{id}/v1`. When the agent makes an LLM call it
blocks at the server. The training loop dequeues the request, calls
vLLM with logprobs=True and return_token_ids=True, and delivers the
response back via deliver_response().
…andling - soak test
Greptile SummaryThis PR introduces a new
Confidence Score: 3/5The deploy script has an exploitable shell injection path, and the rollout worker has two defects in the weight-sync path that can silently pass None into the NCCL engine; three alignment questions also need human sign-off. Shell injection in deploy_hf_space.sh is a present, exploitable code path via CLI argument interpolation with no escaping. The TOCTOU race in send_weights and bare assert removal under -O are both present defects in the critical weight-sync path. Three alignment questions (host-side grading, missing generics, core changes without RFC) each require explicit human decisions before these patterns become load-bearing in downstream training code. deploy_hf_space.sh (shell injection), rollout_worker.py (weight-sync race and assert removal), swe_environment.py and client.py (missing generics), harness.py (host-side grading alignment)
|
| Filename | Overview |
|---|---|
| envs/mini_swe_env/async_grpo/rollout_worker.py | New 736-line rollout worker; TOCTOU race on _model_update_group and bare asserts stripped by -O in the weight-sync path |
| envs/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh | Deploy script with shell injection via $SPACE_ID and $HARDWARE interpolated into python3 -c strings |
| envs/mini_swe_env/server/swe_environment.py | SWEEnvironment extends MCPEnvironment without generic type parameters, violating the Environment[ActT, ObsT, StateT] invariant |
| envs/mini_swe_env/harness.py | Host-side grading via answer() conflicts with the Rewards-in-environment invariant; references missing CORE_CHANGES.md |
| src/openenv/core/harness/agents/interception_server.py | New aiohttp interception server; hmac auth correct; significant core addition without RFC |
| src/openenv/core/harness/agents/cli_driver.py | New CLI agent driver for black_box/interception_gate modes; well-structured; core addition without RFC |
| envs/mini_swe_env/grading.py | Clean binary grading module; correctly implements FAIL_TO_PASS/PASS_TO_PASS semantics |
| envs/mini_swe_env/models.py | Well-designed frozen dataclasses and Pydantic wire types; SWEState correctly extends State |
| envs/mini_swe_env/async_grpo/control_plane.py | Clean control plane with rollout registration and context-manager support; no issues found |
| envs/mini_swe_env/client.py | Typed MCP client; no server imports; missing EnvClient generic parameterization |
Sequence Diagram
sequenceDiagram
participant SB as Pi Agent Sandbox
participant IS as InterceptionServer host:7860
participant RW as SWERolloutWorker
participant vLLM as vLLM /v1/completions
participant TR as AsyncGRPOTrainer
SB->>IS: POST /rollout/id/v1/chat/completions
IS-->>RW: enqueue request_id
RW->>IS: get_intercept(request_id)
RW->>RW: apply_chat_template to prompt_ids
RW->>vLLM: POST /v1/completions token IDs
vLLM-->>RW: completion_ids and logprobs
RW->>IS: deliver_response intercept chat_resp
IS-->>SB: streaming SSE response
SB->>IS: POST /rollout/id/v1/tools/answer
IS->>RW: answer_handler grade_from_case_results
RW-->>IS: reward result
IS-->>SB: tool result
RW->>RW: assemble RolloutSample
RW->>TR: rollout_buffer.put sample
TR->>RW: send_weights via NCCL
RW->>vLLM: POST /update_weights
Prompt To Fix All With AI
Fix the following 7 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 7
envs/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh:83-90
**Shell injection via CLI argument interpolation**
`$SPACE_ID` and `$HARDWARE` originate from `--space-id` / `--hardware` command-line arguments (attacker-controlled), and they are interpolated directly into a Python string passed to `python3 -c`. A crafted `--space-id` value can execute arbitrary Python. The heredoc section at line 125 correctly uses `os.environ` — the inline `-c` blocks (lines 83–90, 94–100, 181–187) should follow the same safe pattern.
### Issue 2 of 7
envs/mini_swe_env/async_grpo/rollout_worker.py:239-243
**Bare `assert` stripped by `-O` flag**
`assert NCCLWeightTransferEngine is not None` and `assert NCCLTrainerSendWeightsArgs is not None` are silently removed when Python is run with the `-O` flag. The code then calls `None.trainer_send_weights(...)`, raising a confusing `AttributeError`. Replace with explicit `if ... raise RuntimeError(...)` guards.
### Issue 3 of 7
envs/mini_swe_env/async_grpo/rollout_worker.py:202-225
**TOCTOU race between `send_weights` and `stop`**
The null-check on `_model_update_group` happens outside `_weight_sync_lock`, while `_destroy_model_update_group()` clears the field without acquiring that lock. A concurrent `stop()` can race to pass `None` into the NCCL engine. Move the null-check inside `_weight_sync_lock`, or have `_destroy_model_update_group` acquire it before clearing the field.
### Issue 4 of 7
envs/mini_swe_env/harness.py:41-44
**ALIGNMENT FLAG: Reward computation outside environment boundary**
- **Principle at stake**: "Rewards in environment" (INVARIANTS.md §Architectural Invariants 3)
- **The concern**: In `interception_gate` mode, `answer()` runs `grade_from_case_results` on the trainer host, not inside the containerized `SWEEnvironment`. The PR notes this is intentional, but it directly conflicts with the invariant.
- **Suggested reviewer**: `@darktex`
### Issue 5 of 7
envs/mini_swe_env/server/swe_environment.py:97
**ALIGNMENT FLAG: Missing `Environment[ActT, ObsT, StateT]` generics**
- **Principle at stake**: INVARIANTS.md §API Invariants 2
- **The concern**: `SWEEnvironment(MCPEnvironment)` carries no generic parameters; `MiniSWEEnv(MCPToolClient)` in `client.py` similarly omits `EnvClient[ActT, ObsT, StateT]`. This violates the type-safety invariant.
- **Suggested reviewer**: `@darktex`
### Issue 6 of 7
src/openenv/core/harness/agents/interception_server.py:73-75
**ALIGNMENT FLAG: Core changes without RFC**
- **Principle at stake**: INVARIANTS.md §Architectural Invariants
- **The concern**: `InterceptionServer`, `CLIAgentDriver`, two new sandbox backends, and a new `agents/` subpackage are added to `src/openenv/core/` without a covering RFC. The interception-gate pattern is a new architectural layer not discussed in RFC 001 or RFC 002.
- **Suggested reviewer**: `@darktex`
### Issue 7 of 7
envs/mini_swe_env/harness.py:31-44
**Missing `CORE_CHANGES.md` referenced in docstring**
The module docstring says "Requires core changes — see `CORE_CHANGES.md`" but this file is not included in the PR. Update the docstring to point to actual documentation.
Reviews (1): Last reviewed commit: "feat: deploy script for SWE Async GRPO t..." | Re-trigger Greptile
| echo "⏸ Pausing Space $SPACE_ID..." | ||
| python3 -c " | ||
| from huggingface_hub import HfApi | ||
| api = HfApi(token='$HF_TOKEN') | ||
| api.pause_space('$SPACE_ID') | ||
| print('Space paused.') | ||
| " | ||
| exit 0 |
There was a problem hiding this comment.
Shell injection via CLI argument interpolation
$SPACE_ID and $HARDWARE originate from --space-id / --hardware command-line arguments (attacker-controlled), and they are interpolated directly into a Python string passed to python3 -c. A crafted --space-id value can execute arbitrary Python. The heredoc section at line 125 correctly uses os.environ — the inline -c blocks (lines 83–90, 94–100, 181–187) should follow the same safe pattern.
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh
Line: 83-90
Comment:
**Shell injection via CLI argument interpolation**
`$SPACE_ID` and `$HARDWARE` originate from `--space-id` / `--hardware` command-line arguments (attacker-controlled), and they are interpolated directly into a Python string passed to `python3 -c`. A crafted `--space-id` value can execute arbitrary Python. The heredoc section at line 125 correctly uses `os.environ` — the inline `-c` blocks (lines 83–90, 94–100, 181–187) should follow the same safe pattern.
How can I resolve this? If you propose a fix, please make it concise.| packed=True, | ||
| ), | ||
| ) | ||
|
|
||
| t_update.join(timeout=1800) |
There was a problem hiding this comment.
Bare
assert stripped by -O flag
assert NCCLWeightTransferEngine is not None and assert NCCLTrainerSendWeightsArgs is not None are silently removed when Python is run with the -O flag. The code then calls None.trainer_send_weights(...), raising a confusing AttributeError. Replace with explicit if ... raise RuntimeError(...) guards.
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/async_grpo/rollout_worker.py
Line: 239-243
Comment:
**Bare `assert` stripped by `-O` flag**
`assert NCCLWeightTransferEngine is not None` and `assert NCCLTrainerSendWeightsArgs is not None` are silently removed when Python is run with the `-O` flag. The code then calls `None.trainer_send_weights(...)`, raising a confusing `AttributeError`. Replace with explicit `if ... raise RuntimeError(...)` guards.
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| names = [name for name, _ in items] | ||
| dtype_names = [ | ||
| str(getattr(tensor, "dtype", "float32")).split(".")[-1] | ||
| for _, tensor in items | ||
| ] | ||
| shapes = [list(getattr(tensor, "shape", [])) for _, tensor in items] | ||
| update_info = { | ||
| "names": names, | ||
| "dtype_names": dtype_names, | ||
| "shapes": shapes, | ||
| "packed": True, | ||
| "is_checkpoint_format": True, | ||
| } | ||
|
|
||
| with self._weight_sync_lock: | ||
| post_error: list[Exception] = [] | ||
|
|
||
| def _post_update() -> None: | ||
| try: | ||
| self._post_json( | ||
| "/update_weights", | ||
| timeout=1800, | ||
| json_body={"update_info": update_info}, |
There was a problem hiding this comment.
TOCTOU race between
send_weights and stop
The null-check on _model_update_group happens outside _weight_sync_lock, while _destroy_model_update_group() clears the field without acquiring that lock. A concurrent stop() can race to pass None into the NCCL engine. Move the null-check inside _weight_sync_lock, or have _destroy_model_update_group acquire it before clearing the field.
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/async_grpo/rollout_worker.py
Line: 202-225
Comment:
**TOCTOU race between `send_weights` and `stop`**
The null-check on `_model_update_group` happens outside `_weight_sync_lock`, while `_destroy_model_update_group()` clears the field without acquiring that lock. A concurrent `stop()` can race to pass `None` into the NCCL engine. Move the null-check inside `_weight_sync_lock`, or have `_destroy_model_update_group` acquire it before clearing the field.
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| **Requires core changes** — see ``CORE_CHANGES.md`` for the | ||
| InterceptionServer tool routing, models.json, workdir, and Docker | ||
| host-IP changes needed in ``openenv.core``. |
There was a problem hiding this comment.
ALIGNMENT FLAG: Reward computation outside environment boundary
- Principle at stake: "Rewards in environment" (INVARIANTS.md §Architectural Invariants 3)
- The concern: In
interception_gatemode,answer()runsgrade_from_case_resultson the trainer host, not inside the containerizedSWEEnvironment. The PR notes this is intentional, but it directly conflicts with the invariant. - Suggested reviewer:
@darktex
Context Used: .claude/docs/INVARIANTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/harness.py
Line: 41-44
Comment:
**ALIGNMENT FLAG: Reward computation outside environment boundary**
- **Principle at stake**: "Rewards in environment" (INVARIANTS.md §Architectural Invariants 3)
- **The concern**: In `interception_gate` mode, `answer()` runs `grade_from_case_results` on the trainer host, not inside the containerized `SWEEnvironment`. The PR notes this is intentional, but it directly conflicts with the invariant.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.| } | ||
|
|
||
|
|
||
| class SWEEnvironment(MCPEnvironment): |
There was a problem hiding this comment.
ALIGNMENT FLAG: Missing
Environment[ActT, ObsT, StateT] generics
- Principle at stake: INVARIANTS.md §API Invariants 2
- The concern:
SWEEnvironment(MCPEnvironment)carries no generic parameters;MiniSWEEnv(MCPToolClient)inclient.pysimilarly omitsEnvClient[ActT, ObsT, StateT]. This violates the type-safety invariant. - Suggested reviewer:
@darktex
Context Used: .claude/docs/INVARIANTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/server/swe_environment.py
Line: 97
Comment:
**ALIGNMENT FLAG: Missing `Environment[ActT, ObsT, StateT]` generics**
- **Principle at stake**: INVARIANTS.md §API Invariants 2
- **The concern**: `SWEEnvironment(MCPEnvironment)` carries no generic parameters; `MiniSWEEnv(MCPToolClient)` in `client.py` similarly omits `EnvClient[ActT, ObsT, StateT]`. This violates the type-safety invariant.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.| class InterceptionServer: | ||
| """Async HTTP server that gates every LLM call from sandboxed agents. | ||
|
|
There was a problem hiding this comment.
ALIGNMENT FLAG: Core changes without RFC
- Principle at stake: INVARIANTS.md §Architectural Invariants
- The concern:
InterceptionServer,CLIAgentDriver, two new sandbox backends, and a newagents/subpackage are added tosrc/openenv/core/without a covering RFC. The interception-gate pattern is a new architectural layer not discussed in RFC 001 or RFC 002. - Suggested reviewer:
@darktex
Context Used: .claude/docs/INVARIANTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/openenv/core/harness/agents/interception_server.py
Line: 73-75
Comment:
**ALIGNMENT FLAG: Core changes without RFC**
- **Principle at stake**: INVARIANTS.md §Architectural Invariants
- **The concern**: `InterceptionServer`, `CLIAgentDriver`, two new sandbox backends, and a new `agents/` subpackage are added to `src/openenv/core/` without a covering RFC. The interception-gate pattern is a new architectural layer not discussed in RFC 001 or RFC 002.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| routed through the InterceptionServer's tool routing layer (``/vf/tools``). | ||
| When the agent calls ``answer()``, the request goes to the host, which | ||
| runs SWE-Gym-native grading (revert test files → apply test_patch → run | ||
| explicit FAIL_TO_PASS/PASS_TO_PASS tests), and returns the result to the | ||
| agent. | ||
| This is the same result ``verify()`` returns — one grading path, no | ||
| in-sandbox grading infrastructure. | ||
|
|
||
| This matches SWE-Gym's architecture where ``answer`` is server-side code | ||
| on the OpenReward platform. | ||
|
|
||
| **Requires core changes** — see ``CORE_CHANGES.md`` for the | ||
| InterceptionServer tool routing, models.json, workdir, and Docker | ||
| host-IP changes needed in ``openenv.core``. |
There was a problem hiding this comment.
Missing
CORE_CHANGES.md referenced in docstring
The module docstring says "Requires core changes — see CORE_CHANGES.md" but this file is not included in the PR. Update the docstring to point to actual documentation.
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/mini_swe_env/harness.py
Line: 31-44
Comment:
**Missing `CORE_CHANGES.md` referenced in docstring**
The module docstring says "Requires core changes — see `CORE_CHANGES.md`" but this file is not included in the PR. Update the docstring to point to actual documentation.
How can I resolve this? If you propose a fix, please make it concise.
rycerzes
left a comment
There was a problem hiding this comment.
pi now supports only v22.x and above, and make deployment scripts usable with existing .venv dir and no dependency on rsync
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review
This is a substantial, well-structured PR. The overall interception-gate + host-side grading architecture is sound. The following items need attention before merge.
Tier 1: Fix Before Merge
[F1] Ruff format failures — 4 files fail ruff format --check:
envs/mini_swe_env/grading.pyenvs/mini_swe_env/harness.pyenvs/mini_swe_env/server/swe_environment.pyexamples/mini_swe_env/run_swe_sample.py
Run ruff format on these files.
[F2] Silent suffix-ID corruption in rollout_worker.py (~line 443)
suffix_ids = current_prompt_ids[prev_len:] silently returns an empty list if the tokenizer truncates context. This produces incorrect completion masks and corrupts GRPO advantages without any log warning. Add a _log.warning when len(current_prompt_ids) < prev_len.
[F3] KeyError-prone logprob extraction in rollout_worker.py (~line 600)
choice["logprobs"]["token_logprobs"] will raise KeyError on vLLM versions where the logprobs key structure differs. Add a null-check with a descriptive error message.
[F4] assert used for runtime control flow in harness.py (~lines 3406–3408)
assert self._interception_server is not None and assert self._interception_base_url is not None in SWESessionFactory.create() are stripped under python -O. Replace with explicit if ... raise RuntimeError(...).
[F5] Non-strict revert failures are silently swallowed in swe_environment.py (~line 599)
The _revert_test_files(..., strict=False) path in swe_environment.py produces no log on failure. The harness.py copy correctly calls _log.warning(msg). Align them.
Tier 2: Alignment Discussion Points
🔴 ALIGNMENT FLAG: In-sandbox reward computation (sandbox_mcp_server.py final_answer path)
- Principle at stake: "Rewards inside environment" (RFC 002) + agent isolation invariant
- The concern:
sandbox_mcp_server.pyruns verify commands and writesreward.txtinside the sandbox when the agent callsterminal(final_answer=...). The agent process can read or write the same reward file path (/home/user/logs/verifier/reward.txt). Meanwhile the host-side answer tool handler inharness.pycomputes reward out-of-band and stores it insession._answer_reward. It is unclear which reward source is authoritative when both paths run — the two paths are not documented as mutually exclusive. The host-side path is safer; the sandbox-side path should either be removed or reduced to a no-op acknowledgment with grading deferred to the host. - Suggested reviewer: @Darktex
🔴 ALIGNMENT FLAG: sandbox_backend and sandbox_image exposed as MCP tool parameters
- Principle at stake: Dual API boundary (INVARIANTS.md) — simulation controls belong on the orchestration side, not in agent-callable MCP tools.
- The concern:
run_swe_rolloutinswe_environment.pyacceptssandbox_backend: strandsandbox_image: stras direct MCP tool call parameters. These are infrastructure/simulation-control parameters. They should be server-side configuration (environment variables), not parameters in the agent-facing MCP API. - Suggested reviewer: @Darktex
🟡 ALIGNMENT FLAG: Use of private _bootstrap_sandbox / _start_agent in swe_environment.py
- Principle at stake: "One canonical way to build environments" (PRINCIPLES.md)
- The concern:
swe_environment.pydirectly callsdriver._bootstrap_sandbox(...)anddriver._start_agent(...)(private APIs) and manually constructsCLIAgentSession. This bypassesCLIAgentSessionFactory.create(). Theharness.pySWESessionFactorycorrectly uses the factory abstraction. The environment should use the same public path. - Suggested reviewer: @Darktex
🟡 ALIGNMENT FLAG: CodingAgentSessionFactory.create() starts the agent process as a side effect
- Principle at stake: Gymnasium API clarity + "Minimize lifecycle deltas"
- The concern:
create()immediately callssession.start_agent()before returning, meaning the agent process starts before the caller can set up error handling. This is not the expected Gymnasium idiom wherereset()is the side-effecting call. The pattern is consistent across the PR but should be discussed as an intentional design choice. - Suggested reviewer: @Darktex
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: PARTIAL — Formatting could not be auto-verified in CI-less local env. Manual inspection found no obvious formatting violations in the new files, though
envs/mini_swe_env/harness.pyimports are not in usort canonical order. - Debug code: CLEAN — No
breakpoint(),pdb, orTODO/FIXMEmarkers found in new files.
Tier 1: Fixes Required
-
envs/mini_swe_env/— Missingopenenv.yamlmanifest. Every OpenEnv environment is required to have anopenenv.yaml(per PATTERNS.md). This file is absent entirely frommini_swe_env/. Without it,openenv validateandopenenv buildcommands will fail. -
envs/mini_swe_env/server/swe_environment.py:313-315—wall_snot set on early-return setup failure. When a setup command fails, the code returns from inside thetryblock.result.wall_sis set after thefinallyblock and is not reached on this early-return path. The returnedSWERolloutResultwill havewall_s=0.0even for tasks that ran for several minutes of setup before failing. Fix: setresult.wall_s = round(time.time() - t0, 3)before the early return. -
envs/mini_swe_env/server/swe_environment.py:708-725—_build_agent_configdefines a local dataclass on every call. The pattern of defining a@dataclassclass inside a method body and using outer-scope variables as class-level defaults is fragile.harness.pydefines the equivalent_SWEAgentTaskat module scope — the environment should follow the same pattern. -
examples/mini_swe_env/async_grpo/rollout_worker.py:600-603—choice["token_ids"]andchoice["logprobs"]["token_logprobs"]accessed without key guards. If vLLM returns a response withoutreturn_token_ids=Truesupport, the code will raise aKeyErrorinside_generate(), crashing the entire rollout thread. -
envs/mini_swe_env/server/swe_environment.py:690-695— MCP server health-check timeout message is misleading. Error says "5s" but the actual polling window is10 * 0.5ssleep +10 * 5scurl timeout = ~55s. -
envs/mini_swe_env/server/swe_environment.py:479-492—FAIL_TO_PASSnot type-checked. If present but not alist,list(metadata["FAIL_TO_PASS"])will silently convert a string to a list of characters. -
tests/envs/test_swe_async_rollout_worker.py:1-14—sys.pathmanipulation in test file. Manually inserting repo root intosys.pathbypasses normal package discovery and breaks if run from a different working directory. Useconftest.pywithpythonpathconfig instead. Also, test coverage only covers 3 pure-function helpers with no coverage ofSWERolloutWorkeritself.
Tier 2: Alignment Discussion
ALIGNMENT FLAG: In-sandbox grading returns reward directly to agent
- Principle: "Rewards inside environment" (RFC 002) / "Agents cannot reset" (RFC 001)
- Concern:
sandbox_mcp_server.pycomputes a reward when the agent callsfinal_answerand returns it in the tool response ("reward": reward). The agent observes its own grading signal during the episode. This departs from the OpenEnv model where reward flows only from server to training orchestrator. Additionally, there are now two grading code paths (in-sandbox and host-side) that must stay in sync. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: swe_environment.py calls private _ methods on CLIAgentDriver
- Principle: Dual API boundary (RFC 001/004), Client-server separation (INVARIANTS.md)
- Concern: Lines 331-338 call
driver._bootstrap_sandbox(...)anddriver._start_agent(...)— private methods on a core class.harness.pyin this same PR uses the publicSWESessionFactoryabstraction correctly. Why doesswe_environment.pybypass the factory? - Suggested reviewer: @Darktex
ALIGNMENT FLAG: No server/Dockerfile for the environment server
- Principle: Container isolation (INVARIANTS.md §2)
- Concern: The env relies on Docker images for sandboxing but has no specification for how the environment server itself should be containerized. PATTERNS.md shows
server/Dockerfileas a required component. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: Async GRPO rollout worker in examples/ but imported by tests/
- Principle: Minimize lifecycle deltas (PRINCIPLES.md)
- Concern:
examples/mini_swe_env/async_grpo/rollout_worker.py(~736 lines) is a substantial module imported by tests. If production code, it should be a proper package; if example code, it shouldn't be under test coverage. Ambiguous maintenance contract. - Suggested reviewer: @Darktex
Summary
- 7 mechanical issues (missing required manifest, real bug with
wall_s, fragile dataclass pattern, 2 defensive coding gaps, test infrastructure issue, misleading error message) - 4 alignment flags for human review (in-sandbox reward to agent, private API calls, missing Dockerfile, example-vs-package ambiguity)
The core SWE-Gym task loading, host-side grading logic (grading.py), and model design are well-structured. The alignment flags concern the black_box path specifics, not the overall architecture.
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Automated Alignment Review — PR #695
Note: This is an automated review by Claude Code, not a human review.
This PR was reviewed in four parallel passes covering: (1) core harness infrastructure, (2) mini_swe_env, (3) async GRPO training stack, (4) coding_agent_env refactor. The review applies the two-tier model: Tier 1 = bugs/correctness, Tier 2 = alignment with OpenEnv principles and invariants.
Tier 1: Fixes Required
Core Harness Infrastructure
T1-1: Thread-safety bug in _put_queue_threadsafe — src/openenv/core/harness/agents/interception_server.py
asyncio.Queue._loop was removed in Python 3.10. On 3.10+, getattr(q, "_loop", None) returns None and the code falls through to put_nowait() directly — which is not thread-safe from another thread. Fix: store the event loop in InterceptionServer.__init__ during start() and use loop.call_soon_threadsafe(q.put_nowait, item).
T1-2: Race window in _handle_chat_completions — src/openenv/core/harness/agents/interception_server.py
Between the first active_rollouts check and the second lock acquisition, unregister_rollout can run and cancel futures. The response_future is created before the second lock, so if cleanup runs in the gap, the intercept is stored after cleanup — leaking a never-resolved future.
T1-3: DockerBgJob.wait() returns 0 for killed processes — src/openenv/core/harness/sandbox/docker_backend.py
When kill() is called, _exit_code remains None, so wait() returns 0 — masking killed jobs as successes. HFBgJob correctly sets _exit_code = 137 on kill; DockerBgJob.kill() should do the same.
T1-4: Hardcoded spec.name == "pi" in shared driver — src/openenv/core/harness/agents/cli_driver.py:872-893
Pi-specific logic (models.json writing, PI_CODING_AGENT_DIR) in the generic CLIAgentDriver._start_agent breaks the declarative design of CLIAgentSpec. Should be moved to Pi's build_env_vars/build_command or a new pre_start hook.
mini_swe_env
T1-5: Missing _wrap_instruction in MCP server path (behavior bug) — envs/mini_swe_env/server/swe_environment.py:734
_build_agent_task passes task.instruction directly without wrapping. The harness path (harness.py:1100) calls _wrap_instruction() which injects answer/final_answer tool instructions. Without this, agents on the MCP server path won't know to call final_answer, breaking reward collection entirely for that code path.
T1-6: In-sandbox MCP server returns reward to the agent — envs/mini_swe_env/server/sandbox_mcp_server.py:202-205
terminal(final_answer="...") returns "reward": reward directly in the tool response. The agent can observe this during the episode, breaking the training abstraction. The host-side re-grading is authoritative, but the in-band reward is still visible.
T1-7: api_key passed as MCP tool parameter — envs/mini_swe_env/client.py:112, swe_environment.py:132
The LLM provider API key travels as a first-class string argument over MCP transport. Per INVARIANTS.md §Security: "Use environment variables for sensitive configuration." Should use server-side env var lookup instead.
T1-8: Potential credential leak in error path — envs/mini_swe_env/server/swe_environment.py:374-375
Broad except Exception stores f"{type(exc).__name__}: {exc}" into result.error. If the LLM client throws an AuthenticationError containing the API key, it leaks through the serialized result.
T1-9: Wrong docstring — envs/mini_swe_env/task_loader_swegym.py:178
Says __ is replaced with "_1776_" but the code uses "_s_".
Async GRPO Training
T1-10: Race condition — coroutine scheduled before loop starts — examples/mini_swe_env/train_swe_async_grpo.py:83
asyncio.run_coroutine_threadsafe(control_plane.start(), loop) is called before the thread running the loop starts (line 88). Fix: start the thread first, then schedule the coroutine.
T1-11: _generate crashes on null logprobs — examples/mini_swe_env/async_grpo/rollout_worker.py:601
choice["logprobs"]["token_logprobs"] raises KeyError when vLLM returns logprobs: null (truncated generation, error paths). Guard with .get().
T1-12: Hardcoded personal TRACKIO_SPACE_ID — examples/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh:163
"TRACKIO_SPACE_ID": "rycerzes/swe-grpo-dashboard" routes all users' metrics to the author's personal Space. Should be parameterized.
T1-13: HF_TOKEN shell-interpolated into Python -c string — examples/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh:86-88
Token interpolated directly into python3 -c "api = HfApi(token='$HF_TOKEN')". If the token contains single quotes, this causes syntax error or injection. Use the heredoc pattern used elsewhere in the same file.
T1-14: Insecure default vllm_key = "token" with no warning — examples/mini_swe_env/train_swe_async_grpo.py:235
Silently falls back to a guessable API key when VLLM_API_KEY is unset. Should emit a warning.
coding_agent_env Refactor
T1-15: Unused parameters in _build_session_factory — envs/coding_agent_env/server/coding_environment.py
disable_thinking, top_logprobs, max_tokens_cap are declared as parameters but never read. These are already baked into config via _build_agent_config. Remove or document.
T1-16: Import order will fail usort check — envs/coding_agent_env/server/gradio_ui.py
catalog_summary sorts after ENDPOINT_KINDS (uppercase first). Run usort format.
T1-17: Format violations — Multiple files need ruff format:
envs/mini_swe_env/grading.pyenvs/mini_swe_env/harness.pyenvs/mini_swe_env/server/swe_environment.pyexamples/mini_swe_env/run_swe_sample.py
Tier 2: Alignment Flags for Human Review
🔴 Critical Alignment Questions
-
New
openenv.core.harness(singular) vs existingopenenv.core.harnesses(plural)
Two parallel harness abstraction layers now exist. Is.harnessthe RFC 005 implementation, a replacement for.harnesses, or something alongside it? This needs explicit documentation. -
CLIAgentSession.verify()computes reward on the trainer host, outside the environment boundary
Theverifier: Callableruns host-side with(sandbox, task)→VerifyResult(env_reward=...). Per INVARIANTS.md/RFC 002, reward computation must stay inside the environment boundary. If this is an intentional exception for CLI-harness scenarios, it needs an RFC reference. -
InterceptionServer tool endpoint creates agent→infrastructure channel
POST /rollout/{id}/v1/tools/{tool_name}is reachable from the sandbox. WhileRESERVED_TOOL_NAMESblocksreset/step/state, this is a third implicit channel from agent-controlled code to arbitrary trainer-host Python callables. Does this violate the dual-API boundary? -
Two divergent reward paths:
swe_environment.pyvsharness.py
These implement structurally different code paths for the same SWE rollout. Divergence already exists (_wrap_instructionin one but not the other). Should one be canonical? -
CLIAgentDriverbypasses the Gymnasium API
UsesSandboxHandle.exec()/start_bg()/write_text()instead ofreset/step/state.CLIAgentSessiondoesn't implementreset(). Is this the RFC 005 implementation or a departure from it?
🟡 Moderate Concerns
-
advantage = raw_rewardwithout baseline subtraction —rollout_worker.py:528
In GRPO, advantage should bereward - baseline. IfAsyncGRPOTrainerdoesn't apply its own normalization on theadvantagefield, training will be biased. -
_noop_rewardbypasses TRL's reward pipeline —train_swe_async_grpo.py
reward_funcs=_noop_reward(always returns zeros) with comment "rewards come from rollout_worker.advantage." Non-standard TRL usage that may break across versions. -
aiohttp>=3.13.5added to rootpyproject.toml
Not imported in any file changed by this PR. Should live in the env/example-specificpyproject.toml, not the root. -
CommandResult.exit_codechanged frominttoint | None
Breaking wire-type change. Existing clients doingexit_code != 0checks will silently getNone. -
test_swe_async_rollout_worker.pybundled in rename PR
Testsmini_swe_envrollout worker but is part of thecoding_agent_envrename changes. Per PR discipline: "Each PR should cover ONE meaningful semantic unit." -
InterceptionServer on public port 7860 with auto-generated token
The interception server is the public-facing HF Space endpoint. If the token leaks, an adversary can inject arbitrary LLM responses mid-rollout, poisoning training data. -
Docker-in-Docker capability in Dockerfile
docker.ioinstalled + user added todockergroup. If DinD is enabled, agents could escape the inner sandbox. Document the intended threat model.
Verdict: REQUEST CHANGES
Blocking issues:
- T1-1 (thread-safety bug in streaming path)
- T1-5 (missing
_wrap_instructionbreaks reward collection on MCP path) - T1-7 (API key as MCP tool parameter violates security invariant)
- T1-10 (race condition in training entrypoint)
- T1-11 (crash on null logprobs)
Must-resolve alignment questions:
- #1 (harness vs harnesses namespace)
- #2 (reward outside environment boundary)
- #3 (agent→infrastructure channel via tool endpoint)
- #4 (dual reward path divergence)
The architecture is ambitious and well-tested (1300+ lines of test coverage for the core harness). The interception server design is clever. But the invariant questions are load-bearing enough to warrant human sign-off before merge.
Automated review by Claude Code | Learn more
There was a problem hiding this comment.
Hey, thanks for the contribution! Amine from the TRL's team here 🤗 .
I have focused on the mini_swe_env example as it uses TRL's RolloutWorkerProtocol. As a bit of architectural primer, the_AsyncRolloutLoop in the worker is what owns the generation loop. This decision was made on purpose so we can centralize generation, staleness management, context clamping, scoring, and GIL isolation (ongoing work)
Your interception model inverts the dependency (which is ok): the sandbox agent (Pi) owns the loop, the worker reacts to the message. That's a legitimate divergence and is worth keeping in your example; it's also a great point for improving our current TRL design.
Below is about parts that diverged from TRL without IMHO a valid reason, and can be simplified greatly
1. The weight-transfer
All send_weights/pause/resume/_init_weight_transfer NCCL setups are basically unreachable. Injecting a worker sets self.weight_transfer = None and skips every sync path so vLLM never receives updated weights! I think you can reuse WeightTransferClient in the trainer, or intentionally extend the protocol as a separate TRL change.
2. The loss in REINFORCE, not GRPO??
From what I can see, the advantage = reward = 0/1 and you have no baseline, and no group-relative normalization. It's ok if that's the target loss, but I just want to point that out.
3. Import helpers from TRL
There is subtle logic in tool-suffix tokenization/ multiturn handling/ token length clamping that is tricky. I would advise reusing the logic
Some other general divergence :
- Staleness: You stamp
model_versionbut never cancel in-flight stale rollouts, so minutes-long SWE rollouts get discarded post-hoc. - Defensive scaffolding, a lot of gettattr and try/except block.
- Duplicated
RolloutSample: You redefine the dataclass instead of importing it, so it will monitor the trainer's contract changes. Import fromasync_rollout_worker. - Hardcoded weight remapping: poking into NCCL internals is bespoke and fragile. I really think you can just use TRL weight sync
- Reinvent turn-termination heuristics. TRL ends a turn on the absence of
tool_calls. I might be missing something, is that something special for this env? If not, I'd lean back on the TRL implementation for turn termination
Summary
This PR adds a new
mini_swe_envand a complete async RL training path for SWE tasks: Pi agent in sandbox, host-side interception, vLLM generation, and Async GRPO updates. It is self-contained for review againstfeat/multi-harnessand supports both direct black-box rollouts and interception-driven async training. The goal is to make SWE-Gym-style training runnable end-to-end on local Docker and HF Space + HF Sandbox.Type of Change
Alignment Checklist
Before submitting, verify:
.claude/docs/PRINCIPLES.mdand this PR aligns with our principles.claude/docs/INVARIANTS.mdand no invariants are violated/pre-submit-pr(orbash .claude/hooks/lint.shand tests) and addressed all issuesWhat this PR does (self-contained)
1) Adds a new SWE environment package
New package:
envs/mini_swe_env/models.py,task_loader_swegym.py)grading.py) with binary resolved/not-resolved semanticsharness.py) for Pi/OpenCode sessionsserver/swe_environment.py,server/app.py)client.py) and env package metadata (pyproject.toml,uv.lock)2) Supports both rollout modes needed by SWE workflows
black_boxmode: agent calls upstream LLM directlyinterception_gatemode: trainer owns generation + logprob capture3) Adds async GRPO control plane + rollout worker
New modules in
envs/mini_swe_env/async_grpo/:control_plane.py— starts/owns interception server lifecyclerollout_worker.py— custom worker implementing TRL rollout protocolapply_chat_template/v1/completionswithreturn_token_ids=True,logprobs=0input_ids,completion_mask,old_log_probsfor trainer4) Adds runnable training/examples and HF Space deploy path
examples/mini_swe_env/train_swe_async_grpo.pyexamples/mini_swe_env/run_swe_sample.pyenvs/mini_swe_env/async_grpo/space_app/Dockerfileenvs/mini_swe_env/async_grpo/space_app/start.shenvs/mini_swe_env/async_grpo/space_app/deploy_hf_space.shenvs/mini_swe_env/async_grpo/space_app/README.md5) Core harness compatibility updates used by this env
src/openenv/core/harness/agents/cli_driver.pyupdated for interception + queue handling needed by async rollouts.Execution modes
black_boxmodeinterception_gatemodeanswerhost-side)/rollout/{id}/v1/chat/completions,/rollout/{id}/v1/tools/answer, internal vLLM/v1/completionsNetwork diagram (async training path)
flowchart LR subgraph SPACE[HF Space / Trainer Host] IS["InterceptionServer<br/>0.0.0.0:7860<br/>/rollout/:rollout_id/v1/chat/completions<br/>/rollout/:rollout_id/v1/tools/answer"] RW["SWERolloutWorker<br/>threads + queues"] TR["AsyncGRPOTrainer"] VLLM["vLLM<br/>127.0.0.1:8000<br/>/v1/completions"] IS <--> RW RW --> TR RW <--> VLLM TR -. weight sync .-> VLLM end SB["Sandbox (HF or Docker)<br/>Pi agent + /testbed"] -->|chat completion request| IS SB -->|answer tool call| IS RW -->|create/exec/kill sandboxes| HFAPI["HF Sandbox API"]black_boxmode at a glanceReward and grading integrity
answer()is routed to host tool handler (not sandbox-owned grader).FAIL_TO_PASS,PASS_TO_PASS) with binary reward1.0/0.0.Note
This PR depends and should be merged after #694
CC: @burtenshaw