feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment by rycerzes · Pull Request #695 · huggingface/OpenEnv

rycerzes · 2026-05-18T20:13:45Z

Summary

This PR adds a new mini_swe_env and a complete async RL training path for SWE tasks: Pi agent in sandbox, host-side interception, vLLM generation, and Async GRPO updates. It is self-contained for review against feat/multi-harness and supports both direct black-box rollouts and interception-driven async training. The goal is to make SWE-Gym-style training runnable end-to-end on local Docker and HF Space + HF Sandbox.

Type of Change

Alignment Checklist

Before submitting, verify:

I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
I have checked .claude/docs/INVARIANTS.md and no invariants are violated
I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

What this PR does (self-contained)

1) Adds a new SWE environment package

New package: envs/mini_swe_env/

SWE task models + SWE-Gym loader (models.py, task_loader_swegym.py)
host-side grading (grading.py) with binary resolved/not-resolved semantics
harness integration (harness.py) for Pi/OpenCode sessions
server + MCP environment (server/swe_environment.py, server/app.py)
client (client.py) and env package metadata (pyproject.toml, uv.lock)

2) Supports both rollout modes needed by SWE workflows

black_box mode: agent calls upstream LLM directly
interception_gate mode: trainer owns generation + logprob capture

3) Adds async GRPO control plane + rollout worker

New modules in envs/mini_swe_env/async_grpo/:

control_plane.py — starts/owns interception server lifecycle
rollout_worker.py — custom worker implementing TRL rollout protocol
- consumes intercepted chat requests
- tokenizes with apply_chat_template
- calls vLLM /v1/completions with return_token_ids=True, logprobs=0
- delivers chat responses back to Pi
- assembles input_ids, completion_mask, old_log_probs for trainer

4) Adds runnable training/examples and HF Space deploy path

examples/mini_swe_env/train_swe_async_grpo.py
examples/mini_swe_env/run_swe_sample.py
HF Space app assets:
- envs/mini_swe_env/async_grpo/space_app/Dockerfile
- envs/mini_swe_env/async_grpo/space_app/start.sh
- envs/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh
- envs/mini_swe_env/async_grpo/space_app/README.md

5) Core harness compatibility updates used by this env

src/openenv/core/harness/agents/cli_driver.py updated for interception + queue handling needed by async rollouts.

Execution modes

Area	`black_box` mode	`interception_gate` mode
Agent LLM calls	Direct to upstream provider	Routed through host InterceptionServer
Training signal	No token/logprob capture	Exact token IDs + logprobs captured via vLLM
Main use	Eval / smoke / demos	Async GRPO training
Host tools	Optional	Required (`answer` host-side)
Endpoints	provider-specific	`/rollout/{id}/v1/chat/completions`, `/rollout/{id}/v1/tools/answer`, internal vLLM `/v1/completions`

Network diagram (async training path)

flowchart LR
    subgraph SPACE[HF Space / Trainer Host]
        IS["InterceptionServer<br/>0.0.0.0:7860<br/>/rollout/:rollout_id/v1/chat/completions<br/>/rollout/:rollout_id/v1/tools/answer"]
        RW["SWERolloutWorker<br/>threads + queues"]
        TR["AsyncGRPOTrainer"]
        VLLM["vLLM<br/>127.0.0.1:8000<br/>/v1/completions"]

        IS <--> RW
        RW --> TR
        RW <--> VLLM
        TR -. weight sync .-> VLLM
    end

    SB["Sandbox (HF or Docker)<br/>Pi agent + /testbed"] -->|chat completion request| IS
    SB -->|answer tool call| IS
    RW -->|create/exec/kill sandboxes| HFAPI["HF Sandbox API"]

`black_box` mode at a glance

Pi agent (sandbox) -> upstream LLM directly
verify() runs at end; no host-side token/logprob interception loop

Reward and grading integrity

Reward source is host-side grading path.
answer() is routed to host tool handler (not sandbox-owned grader).
Grading uses SWE-Gym-style case outcomes (FAIL_TO_PASS, PASS_TO_PASS) with binary reward 1.0/0.0.
This keeps reward authority on host and aligns with OpenEnv invariants.

Note

This PR depends and should be merged after #694

CC: @burtenshaw

- related tests

- fix agent handling

- tests

…roxy the transparent proxy was a passive forwarder that captured logprobs by injecting logprobs=true into upstream requests. It is replaced by the interception_gate mode where the trainer owns the forward pass entirely — no proxy needed inside the sandbox.

…eneration InterceptionServer (aiohttp) runs on the trainer host. Each rollout registers a queue. The agent's OPENAI_BASE_URL points at `{base_url}/rollout/{id}/v1`. When the agent makes an LLM call it blocks at the server. The training loop dequeues the request, calls vLLM with logprobs=True and return_token_ids=True, and delivers the response back via deliver_response().

…_swe_env

to examples

rycerzes

pi now supports only v22.x and above, and make deployment scripts usable with existing .venv dir and no dependency on rsync

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review

This is a substantial, well-structured PR. The overall interception-gate + host-side grading architecture is sound. The following items need attention before merge.

Tier 1: Fix Before Merge

[F1] Ruff format failures — 4 files fail ruff format --check:

envs/mini_swe_env/grading.py
envs/mini_swe_env/harness.py
envs/mini_swe_env/server/swe_environment.py
examples/mini_swe_env/run_swe_sample.py

Run ruff format on these files.

[F2] Silent suffix-ID corruption in rollout_worker.py (~line 443)
suffix_ids = current_prompt_ids[prev_len:] silently returns an empty list if the tokenizer truncates context. This produces incorrect completion masks and corrupts GRPO advantages without any log warning. Add a _log.warning when len(current_prompt_ids) < prev_len.

[F3] KeyError-prone logprob extraction in rollout_worker.py (~line 600)
choice["logprobs"]["token_logprobs"] will raise KeyError on vLLM versions where the logprobs key structure differs. Add a null-check with a descriptive error message.

[F4] assert used for runtime control flow in harness.py (~lines 3406–3408)
assert self._interception_server is not None and assert self._interception_base_url is not None in SWESessionFactory.create() are stripped under python -O. Replace with explicit if ... raise RuntimeError(...).

[F5] Non-strict revert failures are silently swallowed in swe_environment.py (~line 599)
The _revert_test_files(..., strict=False) path in swe_environment.py produces no log on failure. The harness.py copy correctly calls _log.warning(msg). Align them.

Tier 2: Alignment Discussion Points

🔴 ALIGNMENT FLAG: In-sandbox reward computation (sandbox_mcp_server.py final_answer path)

Principle at stake: "Rewards inside environment" (RFC 002) + agent isolation invariant
The concern: sandbox_mcp_server.py runs verify commands and writes reward.txt inside the sandbox when the agent calls terminal(final_answer=...). The agent process can read or write the same reward file path (/home/user/logs/verifier/reward.txt). Meanwhile the host-side answer tool handler in harness.py computes reward out-of-band and stores it in session._answer_reward. It is unclear which reward source is authoritative when both paths run — the two paths are not documented as mutually exclusive. The host-side path is safer; the sandbox-side path should either be removed or reduced to a no-op acknowledgment with grading deferred to the host.
Suggested reviewer: @Darktex

🔴 ALIGNMENT FLAG: sandbox_backend and sandbox_image exposed as MCP tool parameters

Principle at stake: Dual API boundary (INVARIANTS.md) — simulation controls belong on the orchestration side, not in agent-callable MCP tools.
The concern: run_swe_rollout in swe_environment.py accepts sandbox_backend: str and sandbox_image: str as direct MCP tool call parameters. These are infrastructure/simulation-control parameters. They should be server-side configuration (environment variables), not parameters in the agent-facing MCP API.
Suggested reviewer: @Darktex

🟡 ALIGNMENT FLAG: Use of private _bootstrap_sandbox / _start_agent in swe_environment.py

Principle at stake: "One canonical way to build environments" (PRINCIPLES.md)
The concern: swe_environment.py directly calls driver._bootstrap_sandbox(...) and driver._start_agent(...) (private APIs) and manually constructs CLIAgentSession. This bypasses CLIAgentSessionFactory.create(). The harness.py SWESessionFactory correctly uses the factory abstraction. The environment should use the same public path.
Suggested reviewer: @Darktex

🟡 ALIGNMENT FLAG: CodingAgentSessionFactory.create() starts the agent process as a side effect

Principle at stake: Gymnasium API clarity + "Minimize lifecycle deltas"
The concern: create() immediately calls session.start_agent() before returning, meaning the agent process starts before the caller can set up error handling. This is not the expected Gymnasium idiom where reset() is the side-effecting call. The pattern is consistent across the PR but should be discussed as an intentional design choice.
Suggested reviewer: @Darktex

Automated review by Claude Code | Learn more

…gement

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review Report

Automated Checks

Lint: PARTIAL — Formatting could not be auto-verified in CI-less local env. Manual inspection found no obvious formatting violations in the new files, though envs/mini_swe_env/harness.py imports are not in usort canonical order.
Debug code: CLEAN — No breakpoint(), pdb, or TODO/FIXME markers found in new files.

Tier 1: Fixes Required

envs/mini_swe_env/ — Missing openenv.yaml manifest. Every OpenEnv environment is required to have an openenv.yaml (per PATTERNS.md). This file is absent entirely from mini_swe_env/. Without it, openenv validate and openenv build commands will fail.
envs/mini_swe_env/server/swe_environment.py:313-315 — wall_s not set on early-return setup failure. When a setup command fails, the code returns from inside the try block. result.wall_s is set after the finally block and is not reached on this early-return path. The returned SWERolloutResult will have wall_s=0.0 even for tasks that ran for several minutes of setup before failing. Fix: set result.wall_s = round(time.time() - t0, 3) before the early return.
envs/mini_swe_env/server/swe_environment.py:708-725 — _build_agent_config defines a local dataclass on every call. The pattern of defining a @dataclass class inside a method body and using outer-scope variables as class-level defaults is fragile. harness.py defines the equivalent _SWEAgentTask at module scope — the environment should follow the same pattern.
examples/mini_swe_env/async_grpo/rollout_worker.py:600-603 — choice["token_ids"] and choice["logprobs"]["token_logprobs"] accessed without key guards. If vLLM returns a response without return_token_ids=True support, the code will raise a KeyError inside _generate(), crashing the entire rollout thread.
envs/mini_swe_env/server/swe_environment.py:690-695 — MCP server health-check timeout message is misleading. Error says "5s" but the actual polling window is 10 * 0.5s sleep + 10 * 5s curl timeout = ~55s.
envs/mini_swe_env/server/swe_environment.py:479-492 — FAIL_TO_PASS not type-checked. If present but not a list, list(metadata["FAIL_TO_PASS"]) will silently convert a string to a list of characters.
tests/envs/test_swe_async_rollout_worker.py:1-14 — sys.path manipulation in test file. Manually inserting repo root into sys.path bypasses normal package discovery and breaks if run from a different working directory. Use conftest.py with pythonpath config instead. Also, test coverage only covers 3 pure-function helpers with no coverage of SWERolloutWorker itself.

Tier 2: Alignment Discussion

ALIGNMENT FLAG: In-sandbox grading returns reward directly to agent

Principle: "Rewards inside environment" (RFC 002) / "Agents cannot reset" (RFC 001)
Concern: sandbox_mcp_server.py computes a reward when the agent calls final_answer and returns it in the tool response ("reward": reward). The agent observes its own grading signal during the episode. This departs from the OpenEnv model where reward flows only from server to training orchestrator. Additionally, there are now two grading code paths (in-sandbox and host-side) that must stay in sync.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: swe_environment.py calls private _ methods on CLIAgentDriver

Principle: Dual API boundary (RFC 001/004), Client-server separation (INVARIANTS.md)
Concern: Lines 331-338 call driver._bootstrap_sandbox(...) and driver._start_agent(...) — private methods on a core class. harness.py in this same PR uses the public SWESessionFactory abstraction correctly. Why does swe_environment.py bypass the factory?
Suggested reviewer: @Darktex

ALIGNMENT FLAG: No server/Dockerfile for the environment server

Principle: Container isolation (INVARIANTS.md §2)
Concern: The env relies on Docker images for sandboxing but has no specification for how the environment server itself should be containerized. PATTERNS.md shows server/Dockerfile as a required component.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: Async GRPO rollout worker in examples/ but imported by tests/

Principle: Minimize lifecycle deltas (PRINCIPLES.md)
Concern: examples/mini_swe_env/async_grpo/rollout_worker.py (~736 lines) is a substantial module imported by tests. If production code, it should be a proper package; if example code, it shouldn't be under test coverage. Ambiguous maintenance contract.
Suggested reviewer: @Darktex

Summary

7 mechanical issues (missing required manifest, real bug with wall_s, fragile dataclass pattern, 2 defensive coding gaps, test infrastructure issue, misleading error message)
4 alignment flags for human review (in-sandbox reward to agent, private API calls, missing Dockerfile, example-vs-package ambiguity)

The core SWE-Gym task loading, host-side grading logic (grading.py), and model design are well-structured. The alignment flags concern the black_box path specifics, not the overall architecture.

Automated review by Claude Code | Learn more

Darktex

Automated Alignment Review — PR #695

Note: This is an automated review by Claude Code, not a human review.

This PR was reviewed in four parallel passes covering: (1) core harness infrastructure, (2) mini_swe_env, (3) async GRPO training stack, (4) coding_agent_env refactor. The review applies the two-tier model: Tier 1 = bugs/correctness, Tier 2 = alignment with OpenEnv principles and invariants.

Tier 1: Fixes Required

Core Harness Infrastructure

T1-1: Thread-safety bug in _put_queue_threadsafe — src/openenv/core/harness/agents/interception_server.py
asyncio.Queue._loop was removed in Python 3.10. On 3.10+, getattr(q, "_loop", None) returns None and the code falls through to put_nowait() directly — which is not thread-safe from another thread. Fix: store the event loop in InterceptionServer.__init__ during start() and use loop.call_soon_threadsafe(q.put_nowait, item).

T1-2: Race window in _handle_chat_completions — src/openenv/core/harness/agents/interception_server.py
Between the first active_rollouts check and the second lock acquisition, unregister_rollout can run and cancel futures. The response_future is created before the second lock, so if cleanup runs in the gap, the intercept is stored after cleanup — leaking a never-resolved future.

T1-3: DockerBgJob.wait() returns 0 for killed processes — src/openenv/core/harness/sandbox/docker_backend.py
When kill() is called, _exit_code remains None, so wait() returns 0 — masking killed jobs as successes. HFBgJob correctly sets _exit_code = 137 on kill; DockerBgJob.kill() should do the same.

T1-4: Hardcoded spec.name == "pi" in shared driver — src/openenv/core/harness/agents/cli_driver.py:872-893
Pi-specific logic (models.json writing, PI_CODING_AGENT_DIR) in the generic CLIAgentDriver._start_agent breaks the declarative design of CLIAgentSpec. Should be moved to Pi's build_env_vars/build_command or a new pre_start hook.

mini_swe_env

T1-5: Missing _wrap_instruction in MCP server path (behavior bug) — envs/mini_swe_env/server/swe_environment.py:734
_build_agent_task passes task.instruction directly without wrapping. The harness path (harness.py:1100) calls _wrap_instruction() which injects answer/final_answer tool instructions. Without this, agents on the MCP server path won't know to call final_answer, breaking reward collection entirely for that code path.

T1-6: In-sandbox MCP server returns reward to the agent — envs/mini_swe_env/server/sandbox_mcp_server.py:202-205
terminal(final_answer="...") returns "reward": reward directly in the tool response. The agent can observe this during the episode, breaking the training abstraction. The host-side re-grading is authoritative, but the in-band reward is still visible.

T1-7: api_key passed as MCP tool parameter — envs/mini_swe_env/client.py:112, swe_environment.py:132
The LLM provider API key travels as a first-class string argument over MCP transport. Per INVARIANTS.md §Security: "Use environment variables for sensitive configuration." Should use server-side env var lookup instead.

T1-8: Potential credential leak in error path — envs/mini_swe_env/server/swe_environment.py:374-375
Broad except Exception stores f"{type(exc).__name__}: {exc}" into result.error. If the LLM client throws an AuthenticationError containing the API key, it leaks through the serialized result.

T1-9: Wrong docstring — envs/mini_swe_env/task_loader_swegym.py:178
Says __ is replaced with "_1776_" but the code uses "_s_".

Async GRPO Training

T1-10: Race condition — coroutine scheduled before loop starts — examples/mini_swe_env/train_swe_async_grpo.py:83
asyncio.run_coroutine_threadsafe(control_plane.start(), loop) is called before the thread running the loop starts (line 88). Fix: start the thread first, then schedule the coroutine.

T1-11: _generate crashes on null logprobs — examples/mini_swe_env/async_grpo/rollout_worker.py:601
choice["logprobs"]["token_logprobs"] raises KeyError when vLLM returns logprobs: null (truncated generation, error paths). Guard with .get().

T1-12: Hardcoded personal TRACKIO_SPACE_ID — examples/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh:163
"TRACKIO_SPACE_ID": "rycerzes/swe-grpo-dashboard" routes all users' metrics to the author's personal Space. Should be parameterized.

T1-13: HF_TOKEN shell-interpolated into Python -c string — examples/mini_swe_env/async_grpo/space_app/deploy_hf_space.sh:86-88
Token interpolated directly into python3 -c "api = HfApi(token='$HF_TOKEN')". If the token contains single quotes, this causes syntax error or injection. Use the heredoc pattern used elsewhere in the same file.

T1-14: Insecure default vllm_key = "token" with no warning — examples/mini_swe_env/train_swe_async_grpo.py:235
Silently falls back to a guessable API key when VLLM_API_KEY is unset. Should emit a warning.

coding_agent_env Refactor

T1-15: Unused parameters in _build_session_factory — envs/coding_agent_env/server/coding_environment.py
disable_thinking, top_logprobs, max_tokens_cap are declared as parameters but never read. These are already baked into config via _build_agent_config. Remove or document.

T1-16: Import order will fail usort check — envs/coding_agent_env/server/gradio_ui.py
catalog_summary sorts after ENDPOINT_KINDS (uppercase first). Run usort format.

T1-17: Format violations — Multiple files need ruff format:

envs/mini_swe_env/grading.py
envs/mini_swe_env/harness.py
envs/mini_swe_env/server/swe_environment.py
examples/mini_swe_env/run_swe_sample.py

Tier 2: Alignment Flags for Human Review

🔴 Critical Alignment Questions

New openenv.core.harness (singular) vs existing openenv.core.harnesses (plural)
Two parallel harness abstraction layers now exist. Is .harness the RFC 005 implementation, a replacement for .harnesses, or something alongside it? This needs explicit documentation.
CLIAgentSession.verify() computes reward on the trainer host, outside the environment boundary
The verifier: Callable runs host-side with (sandbox, task) → VerifyResult(env_reward=...). Per INVARIANTS.md/RFC 002, reward computation must stay inside the environment boundary. If this is an intentional exception for CLI-harness scenarios, it needs an RFC reference.
InterceptionServer tool endpoint creates agent→infrastructure channel
POST /rollout/{id}/v1/tools/{tool_name} is reachable from the sandbox. While RESERVED_TOOL_NAMES blocks reset/step/state, this is a third implicit channel from agent-controlled code to arbitrary trainer-host Python callables. Does this violate the dual-API boundary?
Two divergent reward paths: swe_environment.py vs harness.py
These implement structurally different code paths for the same SWE rollout. Divergence already exists (_wrap_instruction in one but not the other). Should one be canonical?
CLIAgentDriver bypasses the Gymnasium API
Uses SandboxHandle.exec()/start_bg()/write_text() instead of reset/step/state. CLIAgentSession doesn't implement reset(). Is this the RFC 005 implementation or a departure from it?

🟡 Moderate Concerns

advantage = raw_reward without baseline subtraction — rollout_worker.py:528
In GRPO, advantage should be reward - baseline. If AsyncGRPOTrainer doesn't apply its own normalization on the advantage field, training will be biased.
_noop_reward bypasses TRL's reward pipeline — train_swe_async_grpo.py
reward_funcs=_noop_reward (always returns zeros) with comment "rewards come from rollout_worker.advantage." Non-standard TRL usage that may break across versions.
aiohttp>=3.13.5 added to root pyproject.toml
Not imported in any file changed by this PR. Should live in the env/example-specific pyproject.toml, not the root.
CommandResult.exit_code changed from int to int | None
Breaking wire-type change. Existing clients doing exit_code != 0 checks will silently get None.
test_swe_async_rollout_worker.py bundled in rename PR
Tests mini_swe_env rollout worker but is part of the coding_agent_env rename changes. Per PR discipline: "Each PR should cover ONE meaningful semantic unit."
InterceptionServer on public port 7860 with auto-generated token
The interception server is the public-facing HF Space endpoint. If the token leaks, an adversary can inject arbitrary LLM responses mid-rollout, poisoning training data.
Docker-in-Docker capability in Dockerfile
docker.io installed + user added to docker group. If DinD is enabled, agents could escape the inner sandbox. Document the intended threat model.

Verdict: REQUEST CHANGES

Blocking issues:

T1-1 (thread-safety bug in streaming path)
T1-5 (missing _wrap_instruction breaks reward collection on MCP path)
T1-7 (API key as MCP tool parameter violates security invariant)
T1-10 (race condition in training entrypoint)
T1-11 (crash on null logprobs)

Must-resolve alignment questions:

#1 (harness vs harnesses namespace)
#2 (reward outside environment boundary)
#3 (agent→infrastructure channel via tool endpoint)
#4 (dual reward path divergence)

The architecture is ambitious and well-tested (1300+ lines of test coverage for the core harness). The interception server design is clever. But the invariant questions are load-bearing enough to warrant human sign-off before merge.

Automated review by Claude Code | Learn more

AmineDiro

Hey, thanks for the contribution! Amine from the TRL's team here 🤗 .

I have focused on the mini_swe_env example as it uses TRL's RolloutWorkerProtocol. As a bit of architectural primer, the_AsyncRolloutLoop in the worker is what owns the generation loop. This decision was made on purpose so we can centralize generation, staleness management, context clamping, scoring, and GIL isolation (ongoing work)
Your interception model inverts the dependency (which is ok): the sandbox agent (Pi) owns the loop, the worker reacts to the message. That's a legitimate divergence and is worth keeping in your example; it's also a great point for improving our current TRL design.
Below is about parts that diverged from TRL without IMHO a valid reason, and can be simplified greatly

1. The weight-transfer

All send_weights/pause/resume/_init_weight_transfer NCCL setups are basically unreachable. Injecting a worker sets self.weight_transfer = None and skips every sync path so vLLM never receives updated weights! I think you can reuse WeightTransferClient in the trainer, or intentionally extend the protocol as a separate TRL change.

2. The loss in REINFORCE, not GRPO??

From what I can see, the advantage = reward = 0/1 and you have no baseline, and no group-relative normalization. It's ok if that's the target loss, but I just want to point that out.

3. Import helpers from TRL

There is subtle logic in tool-suffix tokenization/ multiturn handling/ token length clamping that is tricky. I would advise reusing the logic

Some other general divergence :

Staleness: You stamp model_version but never cancel in-flight stale rollouts, so minutes-long SWE rollouts get discarded post-hoc.
Defensive scaffolding, a lot of gettattr and try/except block.
Duplicated RolloutSample: You redefine the dataclass instead of importing it, so it will monitor the trainer's contract changes. Import from async_rollout_worker.
Hardcoded weight remapping: poking into NCCL internals is bespoke and fragile. I really think you can just use TRL weight sync
Reinvent turn-termination heuristics. TRL ends a turn on the absence of tool_calls. I might be missing something, is that something special for this env? If not, I'd lean back on the TRL implementation for turn termination

- update deps and envs for named-tunnels

- use trl prefix-preserving training template for reliable suffix computation - replace prev_prompt_ids with prev_base_ids (add_generation_prompt=false) - add fallback interstitial extraction on prefix mismatch - parse tool_call arguments from json strings to dicts for qwen3.5 - strip trailing eos tokens from vllm output

…rker

rycerzes · 2026-05-31T10:29:55Z

Thanks @AmineDiro for the detailed review. It helped me catch a few core issues. Addressing each point:

1. The weight-transfer

All send_weights/pause/resume/_init_weight_transfer NCCL setups are basically unreachable. Injecting a worker sets self.weight_transfer = None and skips every sync path so vLLM never receives updated weights.

I verified this, weight sync does happen. No code change needed, but I added a clarifying comment. The trainer's _sync_weight() calls self.rollout_worker.send_weights() regardless of injection, so my SWERolloutWorker.send_weights() is always reached. The hardcoded language_model. prefix remap is necessary due to a VLM param-name mismatch that TRL's default sync doesn't handle. I'll propose this fix upstream.

2. The loss is REINFORCE, not GRPO

advantage = reward = 0/1 and you have no baseline, and no group-relative normalization.

Thank you for catching this, I already had this fixed but havent pushed yet. I implemented proper group-relative normalization: accumulate N rollouts per task, compute advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8), exclude overflow/timeout samples. Bumped gradient_accumulation_steps to 4 and weight_decay to 0.1 to match Polar's hyperparameters.

3. Import helpers from TRL

Subtle logic in tool-suffix tokenization / multiturn handling / token length clamping is tricky. I would advise reusing the logic.

Fixed for suffix/template. I now import TRL's get_training_chat_template and is_chat_template_prefix_preserving, which resolved a silent bug where Qwen3.5's <think> tokens misaligned. For length clamping, I use progressive truncation before tokenization rather than post-hoc clamping.

Staleness: staleness handling matches Polar's approach for now but this results in wasted sandbox compute. I will visit this at a later revision.

Duplicated RolloutSample: TRL's version includes fields my pipeline doesn't use; my local version covers what RolloutQueueDataset actually reads, but I can update it as well if needed

Defensive scaffolding: Fair point. The getattr and try/except is higher than it should be — a lot of it accrued while debugging Qwen3.5-specific crashes. I will do a pass to tighten these up.

Reinvent turn-termination heuristics: The post_response_grace_s idle-stop is needed for remote sandboxes where process exit isn't synchronously observable. Documented inline.

Co-authored-by: Ben Burtenshaw <ben.burtenshaw@gmail.com>

rycerzes

quick summary of what changed after the GRPO normalization work:

moved to proper group-normalized GRPO with configurable num_generations (default 16)
fixed Qwen3.5/vLLM compatibility issues (weight sync + chat-template/tool-call tokenization edge cases)
hardened long-run stability with better context-window handling (trimming + overflow guards) so rollouts degrade gracefully instead of looping/crashing
integrated your key harness stabilizations: reward mode toggle, stronger prompt/instructions, anti-test-edit protections, and configurable git checkout timeout
improved control-plane reliability with explicit agent exit signaling so workers don’t sit on long idle timeouts
stabilized training stack for Spaces with multi-GPU/LoRA path, FSDP2-safe optimizer settings, and kernel/dependency fixes

rycerzes added 30 commits May 12, 2026 22:13

refactor: core sandbox infra to use openenv.core.harness.sandbox

57f9b70

feat: CLIAgentDriver Abstraction

024e904

feat: add tests

455b0e9

feat: impl Docker sandbox backend

e97fda0

- related tests

feat: pi agent adapter

9a35006

- fix agent handling

chore: agent specifications and improve interception module ref

06df791

feat: add tests for opencode + pi harness adapters

a3c4a3d

chore: migrate opencode_env to coding_agent_env

81e37a2

feat: hf sandbox backend

ddf1313

- tests

chore: ruff + usort format pass

9d85640

refactor: wire coding_agent_env with interception_gate

171a3ea

chore: update tests for interception_gate, remove proxy test cases

52a024e

chore: address greptile review comments

4b1b707

feat: task adapter + mini subset builder

8830b72

feat: mini_swe_env impl

6c4caa4

refactor: task adapter

4b115a8

feat: SWE-Gym task model and loader

afb9bd9

feat: integrate pi agent harness

8d3b79f

refactor: extract sandbox bootstrap to driver and fix interception races

a478fa8

chore: remove unsupported mode checks from CodingAgentEnvironment

b18caf2

chore: revert linting for out of scope files

bfc7305

feat: host-side tool routing and Pi gate models bootstrap

c2ca0c8

fix: support configurable Pi workdir for command and MCP config

1b1d9fb

fix: Docker host gateway mapping and per-create image override

2f52e48

feat: configurable extension directory support for CLI agents

f3fede2

refactor: harness and models for host-side grading

9bdee17

Merge remote-tracking branch 'fork/feat/multi-harness' into feat/mini…

d14cac4

…_swe_env

refactor: SWE-Gym grading to use native test outcomes

341c387

rycerzes added 3 commits May 20, 2026 12:20

refactor(mini_swe): move async_grpo training stack from env package

cf15c8b

to examples

fix(pi): use Node.js 22.x

42419d6

fix: env handling in deployment scripts

9e61e76

rycerzes commented May 23, 2026

View reviewed changes

Darktex suggested changes May 23, 2026

View reviewed changes

chore: update vLLM version to 0.21.0 and adjust weight lifecycle mana…

c3228f7

…gement

Darktex suggested changes May 24, 2026

View reviewed changes

feat: traj collection

9396bf6

Darktex suggested changes May 24, 2026

View reviewed changes

rycerzes added 6 commits May 25, 2026 11:22

chore: rm mention of cf

cc6dacd

fix: trajectory persistence and periodic uploads

4e8c4b0

fix: rollout handling

131edf3

feat: per-turn completion records for RL reconstruction

afc6c79

feat: post-hoc grading & evaluation

d0659e2

fix: allow unlimited agent turns

b9ca3b1

AmineDiro reviewed May 29, 2026

View reviewed changes

fix: group-relative advantage with num_generations parameter

212dd47

burtenshaw mentioned this pull request May 29, 2026

mini_swe: stabilize async GRPO validation runs rycerzes/OpenEnv#2

Open

rycerzes added 6 commits May 30, 2026 01:50

fix: update vLLM weight prefix handling for Qwen3.5 models

9e3c3a7

fix: add num_generations parameter to deploy_hf_space.sh

49765f7

- update deps and envs for named-tunnels

fix: strip cf from the name

a1a764d

chore: clarify RolloutSample docstring regarding TRL's reward functions

6c9e121

refactor: ctx overflow handling and model length config in rollout wo…

74acb44

…rker

rycerzes and others added 3 commits May 31, 2026 17:12

fix: grading, context management and message truncation

2a90f47

Co-authored-by: Ben Burtenshaw <ben.burtenshaw@gmail.com>

feat: LoRA config with multi GPU

57e89ea

fix: fsdp2, adamw_torch and flash-linear-attn

a0ab2e8

rycerzes force-pushed the feat/mini_swe_env branch from f9aa2eb to a0ab2e8 Compare May 31, 2026 11:50

rycerzes commented May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment#695

feat(mini_swe_env): add SWE-Gym async GRPO environment with Pi interception and HF Space deployment#695
rycerzes wants to merge 80 commits into
huggingface:mainfrom
rycerzes:feat/mini_swe_env

rycerzes commented May 18, 2026

Uh oh!

rycerzes left a comment

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

AmineDiro left a comment •

edited

Loading

Uh oh!

rycerzes commented May 31, 2026

Uh oh!

rycerzes left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

rycerzes commented May 18, 2026

Summary

Type of Change

Alignment Checklist

What this PR does (self-contained)

1) Adds a new SWE environment package

2) Supports both rollout modes needed by SWE workflows

3) Adds async GRPO control plane + rollout worker

4) Adds runnable training/examples and HF Space deploy path

5) Core harness compatibility updates used by this env

Execution modes

Network diagram (async training path)

black_box mode at a glance

Reward and grading integrity

Uh oh!

rycerzes left a comment

Choose a reason for hiding this comment

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review

Tier 1: Fix Before Merge

Tier 2: Alignment Discussion Points

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review Report

Automated Checks

Tier 1: Fixes Required

Tier 2: Alignment Discussion

Summary

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Automated Alignment Review — PR #695

Tier 1: Fixes Required

Core Harness Infrastructure

mini_swe_env

Async GRPO Training

coding_agent_env Refactor

Tier 2: Alignment Flags for Human Review

🔴 Critical Alignment Questions

🟡 Moderate Concerns

Verdict: REQUEST CHANGES

Uh oh!

AmineDiro left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

1. The weight-transfer

2. The loss in REINFORCE, not GRPO??

3. Import helpers from TRL

Uh oh!

rycerzes commented May 31, 2026

1. The weight-transfer

2. The loss is REINFORCE, not GRPO

3. Import helpers from TRL

Uh oh!

rycerzes left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

`black_box` mode at a glance

AmineDiro left a comment •

edited

Loading