Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission) by larstalian · Pull Request #266 · vecna-labs/open-range

larstalian · 2026-06-12T16:56:13Z

What this does

Makes the cyber gym self-verifying and transfer-real, in layers that build on each other. (The leak/consequence oracle from #259 is already on main; this PR builds on it.)

LLM realization behind a dynamic admission gate (#260). The LLM can write a vuln handler; we don't trust it. We render it into a procedurally-built world, run the intended exploit and a benign request, and let the consequence oracle decide: the exploit must leak the flag, the benign request must not. Accept iff solvable-and-not-trivial. Driven by a new ClaudeBackend for the claude CLI, since codex declines the cyber task.
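The accept rule above can be sketched as a small pure function. This is a minimal illustration, not the pack's code: the `Verdict` enum and the `classify` signature are hypothetical, and it assumes the two leak observations (exploit run, benign run) have already been produced by the consequence oracle.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical verdict names for illustration only."""
    ACCEPT = "accept"
    TRIVIAL = "trivial"
    NOT_SOLVABLE = "not_solvable"


def classify(exploit_leaked: bool, benign_leaked: bool) -> Verdict:
    """Accept iff the exploit leaks the flag and the benign request does not."""
    if benign_leaked:
        # The flag comes back without any exploit: the world is trivial.
        return Verdict.TRIVIAL
    if not exploit_leaked:
        # The intended exploit does not recover the flag: not solvable.
        return Verdict.NOT_SOLVABLE
    return Verdict.ACCEPT
```

Checking the benign request first means a handler that leaks on every request is reported as trivial rather than accepted.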

A real-container backing, wired as a runtime (#252). The same generated app the in-memory PROCESS backing runs, but as a real container that episodes actually use (ContainerWebappRuntime, selected by Backing.CONTAINER), with OPENRANGE_REALFS set so surfaces go real:

file-read (path_traversal, xxe) does a real open() against a real filesystem — a traversal escape is real OS path resolution, not a dict lookup;
command_injection runs a real sh -c, with the three mutually-exclusive injection contexts preserved;
world images stay per-vuln lean — only the OS tool a world's own command_injection runs server-side is installed; a file-read/SQLi-only world installs nothing;
the container that now runs attacker code is contained: all Linux capabilities dropped, no privilege escalation, memory/cpu/pid caps.

It reuses the subprocess runtime (docker run is the supervised child), resolves the published host port with docker port, and reads the leak signal out of the running container. It's all additive — the PROCESS backing stays byte-for-byte the same.

Why

The cyber gym's value is bounded by its verifier; a self-verifying loop will ship trivial or unfaithful worlds unless an independent consequence-verifier rejects them. This builds the realization gate and the container backing it runs on — moving generation toward the LLM (variety, scale) while keeping correctness with procedural + admission. Design in packs/cyber_webapp/DESIGN.md §8 (the verifier) and §9 (the staged plan: process → container → networked → cluster, each stage tracked by its own issue).

Testing

Full suite: 796 passed, 4 skipped (env-gated: 2 live-GRPO, 2 strands extra).
One real trl.GRPOTrainer GRPO step over a live SWE and cyber world (HTTP tools) — both pass (OPENRANGE_LIVE_TRL=1).
Cross-backing parity (the load-bearing check): the same snapshot + same exploit grades identically on PROCESS and CONTAINER — only fidelity changes, not the task surface.
Real container integration: docker-gated tests build the image, run the container, recover the flag by exploiting over HTTP (across injection/confinement contexts), and verify the hardening is real (CapEff all-zero inside, still exploitable under the flags).
Reward rungs intact (test_trl_cyber); new modules at 100% branch coverage; no mocks — real subprocesses (docker, fake-CLI scripts), real HTTP, real episodes.

Scope / deferred (tracked)

Implement a plain Backing.CONTAINER realizer for cyber_webapp #252 — remaining: one container per service node on a real network (multi-service); this PR does one container for the whole world.
Harden the cyber CONTAINER world: resource/privilege limits, egress policy, flag-out-of-image #265 — world-container hardening (read-only rootfs, egress policy, flag-out-of-image) + ssti unsandboxed.
Epic: scale up — LLM-realized services on a procedural graph #261 (the epic); LLM node-realization behind a dynamic admission gate (cyber pack) #260 (the admission-gate issue).

Notes

origin/main (the #259 leak-oracle squash) is merged in — this branch already contained that work (consequence.py is byte-identical) plus everything built on top, so the diff collapses to the net-new M0/M1/container/runtime work. Follows .rules (integration tests, no mocks, comments WHY-only, no roadmap/phase tags in code). The stray open-range.zip is untracked and not part of this PR.

…ation indictment
Gym change — the §8.3 "any HIDDEN value leaked" consequence oracle, wired live:
- consequence.py: detect_leak / guarded_values over HIDDEN value_ref nodes — the independent leak verifier, length-floored so a short value can't false-positive.
- codegen bakes guarded_values(graph) into seed.json; the rendered app scans each response boundary and logs leaked node IDs only (never the secret value); realize surfaces final_state["leaked_secret_ids"]; check_success consumes it in `reason`.
- Boundary held: success/subgoals — and thus the trainer's averaged reward rungs — are unchanged. Rewarding on the leak is a trainer-coordinated follow-up (#198).
Research tooling — experiments/indictment:
- An independent harness running LLM-generated worlds to measure the admit-gap: worlds a self-verifying loop ships that independent probes reject as trivial or unfaithful. 89 worlds across 4 classes; gap is a small, consistent ~2-4%. The harder finding: a reliable independent verifier is itself hard to build — false negatives AND positives, each found by hand-auditing. See RESULTS.md.
DESIGN.md §8 documents the verification ladder, the spine, and the honest results (incl. the raw-substring oracle's known encoded-leak limitation). Live-validated end to end (a real SQLi episode records the leak by node id).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
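A rough sketch of a length-floored substring oracle that reports node ids only, as the commit above describes. The function name, the `guarded` mapping shape (node id to secret value), and the floor of 8 characters are all assumptions made for illustration; the real `detect_leak` signature is not shown in this PR.

```python
MIN_GUARDED_LEN = 8  # assumed floor; the actual threshold is not stated in the PR


def leaked_node_ids(response_body: str, guarded: dict[str, str]) -> list[str]:
    """Return ids of guarded values found in the response, never the values."""
    return sorted(
        node_id
        for node_id, value in guarded.items()
        # Skip short values: "42" would match unrelated text and false-positive.
        if len(value) >= MIN_GUARDED_LEN and value in response_body
    )
```

Returning ids rather than values keeps the secret out of the request log even when a leak is recorded.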

#259 review)
The adversarial review of the §8.3 spine flagged two real (latent) limitations in the raw-substring oracle; close them:
- Encoded exfil: detect_leak and the rendered app scanner now search for each guarded value AND its cheap reversible encodings (base64 / hex / percent) by encoding the needle, so an encoded leak is caught, not only the literal form. gzip / binary / multibyte splits remain out (documented).
- Containment: detect_leak drops a guarded value that is a proper substring of another leaked value (offline / grader path; the live per-response signal stays raw, since the scanner logs ids not values).
An agreement test pins the rendered app scanner to consequence.value_variants so live and offline verdicts can't drift.
DESIGN.md §8.4 corrected (no longer "deferred").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
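The two fixes in this commit can be illustrated roughly as follows. This is a sketch under assumptions: `variants` and `drop_contained` are hypothetical names (the PR only names `consequence.value_variants` without showing it), and the percent form here uses `urllib.parse.quote`, which encodes reserved characters only.

```python
import base64
from urllib.parse import quote


def variants(value: str) -> set[str]:
    """The literal value plus cheap reversible encodings of the whole needle."""
    raw = value.encode("utf-8")
    return {
        value,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(value, safe=""),
    }


def is_leaked(value: str, body: str) -> bool:
    """True if the value or any encoded variant appears in the body."""
    return any(v in body for v in variants(value))


def drop_contained(leaked: dict[str, str]) -> dict[str, str]:
    """Drop a leaked value that is a proper substring of another leaked value."""
    return {
        node_id: value
        for node_id, value in leaked.items()
        if not any(value != other and value in other for other in leaked.values())
    }
```

Because the needle is encoded as a unit, a value embedded mid-stream inside a larger base64 blob would not necessarily match; that is consistent with the limits the commit documents.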

experiments/ (the indictment harness + 89 generated test worlds + writeup) was a one-off validation run, not gym/pack code — it doesn't belong in the repo. Its findings stay documented in DESIGN.md §8.10 as prose; the pack verifier (consequence.py) is unaffected — it lives in the pack, where it belongs. The coupled tests/test_indictment_harness.py goes with it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop references that .rules forbids in code comments: a section pointer ("§8.3 spine"), a forward/"future" reference ("emergent mode"), and a docstring naming a specific test (rot-prone). The remaining comments are non-obvious WHY. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

) The realization primitive of the emergent-mode ladder (DESIGN.md §9). Today's admission is structural (a graph-path check); an LLM-realized handler can be wrong, so it is admitted DYNAMICALLY — run the intended exploit + a benign request and let the consequence verifier decide: the exploit must leak the flag, the benign request must not. Accept iff solvable and not trivial.
- realize_admit.py (pack): the pure pieces — classify_admission (the verdict, over consequence.detect_leak) + cmdi_exploit_and_benign (the per-class exploit oracle). Running an episode is a host concern (packs must not import openrange), so the orchestration lives with the caller, not the pack.
- codegen: a vuln node's realized_handler stands in for the template — the hook the LLM realization writes through.
- DESIGN.md §9: the M0->M3 realization ladder (procedural architects, LLM realizes, admission verifies, freeze), mapping M1/M2/M3 to #252 / #212 / #235 / #189.
Validated end to end: a faithful command_injection world is accepted; trivial and not-solvable verdicts are exercised. 100% branch coverage; import-boundary clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI runs mypy over tests/; the rendered-app exec helper retrieved values from a dict[str, object] namespace, so calling them tripped "object not callable". Type the namespace dict[str, Any] so the pulled-out functions are callable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
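A minimal reproduction of the typing fix, assuming a helper shaped like the one described (the name `load_rendered` is hypothetical). With the namespace annotated `dict[str, object]`, mypy rejects the call at the bottom with `"object" not callable`; annotating it `dict[str, Any]` lets the pulled-out function be called.

```python
from typing import Any


def load_rendered(source: str) -> dict[str, Any]:
    """Exec rendered source and return its namespace for tests to call into."""
    namespace: dict[str, Any] = {}  # Any, not object, so values stay callable
    exec(source, namespace)
    return namespace


ns = load_rendered("def double(x):\n    return 2 * x\n")
result = ns["double"](21)
```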

Inject a stand-in "realized" handler into a command-injection world and run it through codegen -> runtime -> the admission gate:
- a different-but-real handler (splits on ';', cats the file) is ACCEPTED — the gate lets in varied implementations, which is the diversity M0 is for;
- a handler that returns the flag on any request is REJECTED as trivial;
- a handler that never leaks is REJECTED as not solvable.
Confirms the codegen hook and the gate work together on a live episode. The live LLM-writes-the-handler step plugs in on top (non-deterministic, so a demo not a test).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…dex harness) Drives the codex LLM backend to write a command-injection handler, injects it, and runs it through the dynamic admission gate (accept iff the exploit leaks the flag and a benign request does not). Accepted handlers are the model's own varied implementations; trivial or broken ones are rejected — the reusable entry point for autonomous LLM realization, the live step on top of M0's already-tested gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A second LLMBackend alongside CodexBackend, driving `claude -p --output-format json`. Claude has no output-schema flag, so a structured request asks for a JSON object in the prompt and parses it out of the reply (bare or ```-fenced). Useful where codex is unavailable (quota-limited) or declines a task it flags as risky — claude authors the cyber gym's handlers that codex won't. examples/cyber_realize now selects the backend (--backend claude|codex, default claude). Ran live: claude wrote 5 distinct command-injection handlers, all 5 ACCEPTED by the dynamic admission gate (exploit leaks the flag, a benign request does not) — the M0 realization loop closed end to end with a real model. Fake-CLI tests (no mocks) cover parsing, fenced JSON, flag passing, failures, timeouts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
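Parsing a JSON object out of a free-text reply, bare or fenced, might look roughly like this. It is a sketch, not ClaudeBackend's code: `extract_json_object` is a hypothetical name, and the fence marker is built from single backticks only so the example can sit inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick code fence marker
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json_object(reply: str) -> dict:
    """Parse the JSON object in a reply, whether bare or inside a code fence."""
    match = _FENCED.search(reply)
    candidate = match.group(1) if match else reply
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    return json.loads(candidate[start : end + 1])
```

Slicing from the first `{` to the last `}` tolerates prose before or after the object, at the cost of failing loudly when the reply contains more than one object.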

container.image_files packages a world's rendered app into a Docker build context (Dockerfile + app.py + seed.json). A docker-gated test proves the real path end to end: build the image, run the container, and recover the flag by exploiting the world over HTTP. This containerizes the existing in-memory app — the runtime foundation for Backing.CONTAINER. Making the exploits hit the container's real fs/shell, and wiring this in as the Backing.CONTAINER runtime, are the next M1 increments. Caveat: the seed (with the flag) is COPYed into the image for now (an image layer until the app unlinks it at startup); mounting it at run time is the follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…er (#252) At CONTAINER backing, container.realfs_cmdi_app builds a stdlib real-shell app: the injected input runs through a real `sh -c` against the real filesystem, and the flag is a real file (written from the OPENRANGE_FLAG env var at startup, never in an image layer). So `; cat <path>` actually executes — genuine RCE/file-read, not the in-memory emulation. A docker-gated test proves it: build, run with the flag env, and a real `cat` recovers the real file's flag over HTTP; a benign request does not. Scope: command_injection, plain `; cat` injection. Re-applying the mutually-exclusive contexts of §6 over the real shell, and a ContainerRuntime that selects this per class via Backing.CONTAINER, are the next increments. Real RCE runs inside the container sandbox; hardened isolation is #202. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The container backing's real-shell handler now applies the same naive, context-specific filter the in-memory emulation uses, so the three mutually-exclusive injection contexts (separator / substitution / quoted) survive the move from emulation to a REAL `sh -c`:
- separator strips $()/backticks, keeps ; | &
- substitution strips ; | & newline, keeps $()/backticks
- quoted wraps the arg in QUOTE + strips $(); the exploit must break the quote
A docker-gated, parametrized test proves it end to end: a world built for one context is exploited by THAT context's payload and is NOT exploited by another context's payload (the wrong vectors are filtered before sh). This carries the §6 validity work from the in-memory path to Backing.CONTAINER.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
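One way to picture the three filters is as character stripping per context. This is an illustrative sketch only: `naive_filter` is a hypothetical name, reading "strips $()" as removing the characters `$`, `(` and `)` is an assumption, and so is the choice of a single quote for the quoted context.

```python
def _strip(text: str, chars: str) -> str:
    """Remove every occurrence of the given characters."""
    return "".join(c for c in text if c not in chars)


def naive_filter(context: str, arg: str) -> str:
    """Deliberately naive per-context filter applied before the shell call."""
    if context == "separator":
        # Command substitution is removed; ; | & still chain commands.
        return _strip(arg, "$()`")
    if context == "substitution":
        # Separators are removed; $() and backticks still substitute.
        return _strip(arg, ";|&\n")
    if context == "quoted":
        # The argument is quoted; an exploit has to break out of the quote.
        return "'" + _strip(arg, "$()") + "'"
    raise ValueError(f"unknown injection context: {context}")
```

Each filter leaves exactly one family of vectors intact, which is what makes a payload written for one context fail against a world built for another.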

Generalize the CONTAINER backing past command_injection by putting the "real" mode into the ONE generated multi-service app — not another bespoke per-class app. The container sets OPENRANGE_REALFS; the rendered app then serves its `files` surface from a real filesystem (`_RealFiles`, a real open() per path) instead of the in-memory dict. The PROCESS backing never sets the env and stays byte-for-byte the in-memory emulation.
This makes the whole file-read shape genuinely real with ZERO handler changes: a path-traversal escape is real OS path resolution against the container fs, and the cmdi readers `cat` real files. Proven by a docker-gated, parametrized test across the three confinement contexts (absolute_only / relative / dotdot_filter): each world is read only by its own technique's payload and neutralizes the others — the same mutually-exclusive-contexts guarantee the in-memory emulation makes, now holding over a real fs.
The generated app is also the surface the next milestone containerizes and reads the request-log leak signal from, so this is foundation, not a throwaway. The stdlib image_files_realfs variant remains the standalone real-`sh -c` proof for code_exec until that folds into the generated app too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
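The switch between an in-memory dict and a real filesystem behind one `files` surface could be sketched like this. The class and function names are hypothetical (the PR names `_RealFiles` but does not show it), and the `read` interface is an assumption; only the `OPENRANGE_REALFS` variable comes from the PR.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class MemoryFiles:
    """In-memory emulation: a path is just a dict key."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    def read(self, path: str) -> Optional[str]:
        return self._files.get(path)


class RealFiles:
    """Real mode: every read is a real open(), so the OS resolves the path."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def read(self, path: str) -> Optional[str]:
        try:
            with open(self._root / path, encoding="utf-8") as handle:
                return handle.read()
        except OSError:
            return None


def make_files(seed_files: dict[str, str], root: str):
    """Pick the backing from the environment; unset means in-memory."""
    if os.environ.get("OPENRANGE_REALFS"):
        return RealFiles(root)
    return MemoryFiles(seed_files)


# Demo: "../" escapes the web root on a real fs, but is only a missing dict key.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "webroot"))
    Path(tmp, "secret.txt").write_text("top", encoding="utf-8")
    escaped = RealFiles(os.path.join(tmp, "webroot")).read("../secret.txt")
in_memory = MemoryFiles({"index.html": "hi"}).read("../secret.txt")
```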

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…oke variant (M1)
Complete the CONTAINER unification: command_injection now runs a real `sh -c` inside the ONE generated multi-service app under the same OPENRANGE_REALFS gate, so the container is a single app with every shape real — no parallel per-class app. The bespoke `image_files_realfs` / `realfs_cmdi_app` stepping stone is removed.
- command_injection.py.j2 gains a real-shell branch (gated by OPENRANGE_REALFS): the same naive, context-specific §6 filter, then a real `sh -c`. PROCESS leaves the env unset and stays the in-memory emulation byte-for-byte.
- The image installs the diagnostic tools base_command samples from (ping/nslookup/dig/host/traceroute) so the real shell behaves like a real vulnerable endpoint: a chained `; cat` reads the flag, and `$(cat flag)` leaks it too because each tool echoes the flag-as-hostname in its resolver error. This is the faithfulness the bespoke app hid by hardcoding `echo`.
- container.py drops the bespoke variant + its now-unused imports; the docker-gated cmdi tests retarget onto `image_files` (the generated app), still proving the three §6 contexts mutually exclusive over a real shell.
- DESIGN.md §9 M1-status updated: file_read + code_exec both real on the one app; remaining M1 is ssti-unsandboxed then isolation (#202).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A world is the target the agent attacks over HTTP — not the agent's toolbox. So its image should carry only what its OWN vulns run server-side, not a diagnostic toolkit every world drags along. Replace the unconditional 5-tool apt-install with `required_apt_packages(graph)`: it returns only the apt packages the world's command_injection vulns need (base_command → package, union across vulns), and `_dockerfile()` skips the apt layer entirely when the set is empty. A path-traversal / SQLi-only world now builds a lean image with no OS tools; a cmdi world installs only its own base_command's tool (e.g. `dnsutils` for nslookup/dig/host). The base_command tool belongs in the TARGET container because the server runs it as the vulnerability — confirmed against the codebase's world/agent split (`base_url` = the world, `solver_root` = the agent's own workspace; "bring your own agent harness"). The agent's recon/exploit tooling lives in its separate sandbox, not the world image. DESIGN.md §9 gains a plain-language "Two environments, not one" note. Tests: required_apt_packages scopes to the world's tool (empty for file-read); the Dockerfile installs OS tools only when needed; a lockstep guard asserts every sampled base_command maps to a package (else a cmdi world would silently ship without its tool). Verified empirically that all five diagnostic tools reflect their argument on python:3.13-slim, so scoping never breaks the substitution exploit. Docker-gated cmdi + path_traversal tests still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
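Scoping the apt layer to a world's own tools might look roughly like the following. The vuln record shape (`kind`, `base_command`), the function names, and the Dockerfile layout are assumptions for illustration; `dnsutils` for nslookup/dig/host comes from the commit, while `iputils-ping` and `traceroute` are the usual Debian package names and are assumed here.

```python
TOOL_PACKAGES = {
    "nslookup": "dnsutils",
    "dig": "dnsutils",
    "host": "dnsutils",
    "ping": "iputils-ping",      # assumed Debian package name
    "traceroute": "traceroute",  # assumed Debian package name
}


def required_packages(vulns: list[dict]) -> set[str]:
    """Union of apt packages needed by the world's command_injection vulns."""
    packages: set[str] = set()
    for vuln in vulns:
        if vuln.get("kind") != "command_injection":
            continue
        package = TOOL_PACKAGES.get(vuln.get("base_command", ""))
        if package:
            packages.add(package)
    return packages


def dockerfile(packages: set[str]) -> str:
    """Render a Dockerfile, skipping the apt layer when nothing is needed."""
    lines = ["FROM python:3.13-slim"]
    if packages:
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            + " ".join(sorted(packages))
            + " && rm -rf /var/lib/apt/lists/*"
        )
    lines += ["COPY app.py seed.json /app/", 'CMD ["python", "/app/app.py"]']
    return "\n".join(lines) + "\n"
```

A lockstep check in the spirit of the commit's guard would assert that every command the generator can sample has an entry in the mapping, so a missing tool fails a test instead of a build.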

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The CONTAINER backing runs attacker-controlled code (a real `sh -c`, a real filesystem), so contain it. `hardening_run_args()` returns the `docker run` flags — `--cap-drop=ALL`, `--security-opt=no-new-privileges`, and memory / cpu / pids caps — so an exploit can't escalate, fork-bomb, or exhaust the host. It's a reusable building block the #252 CONTAINER runtime will run with; the docker tests now run every world through it. A docker-gated test proves the containment is real, not just configured: `docker inspect` shows the caps dropped + limits set, and `cat /proc/self/status` inside the container shows CapEff all-zero. Crucially the world stays exploitable over HTTP under the flags (the DNS-resolution leak and the `cat` chain need no capabilities), so containment doesn't break the vuln. This is task 1 of #265. Read-only-rootfs + egress-blocking + flag-out-of-image remain there (read-only needs the app's writable-path rework; egress rides the M2 network rung). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
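A sketch of what such a flag builder and the CapEff check could look like. The flags themselves are standard `docker run` options; the function names, the default limit values, and the helper that parses `/proc/self/status` are assumptions, not the PR's `hardening_run_args()`.

```python
def hardening_args(memory: str = "256m", cpus: str = "0.5", pids: int = 128) -> list[str]:
    """docker run flags that drop privileges and cap resources (values assumed)."""
    return [
        "--cap-drop=ALL",                     # no Linux capabilities
        "--security-opt=no-new-privileges",   # setuid binaries cannot escalate
        f"--memory={memory}",                 # bound memory use
        f"--cpus={cpus}",                     # bound CPU use
        f"--pids-limit={pids}",               # blunt fork bombs
    ]


def cap_eff_is_zero(proc_status: str) -> bool:
    """True if the CapEff line of /proc/<pid>/status is an all-zero mask."""
    for line in proc_status.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16) == 0
    return False
```

The args would be spliced into the command line ahead of the image name, for example `["docker", "run", *hardening_args(), image]`.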

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Code comments/docstrings shouldn't carry roadmap-phase or task tags (M0/M1, DESIGN.md §-refs, issue numbers) — they rot, and commits/PRs are the place for that context. Strip them from the new code, keeping the WHY. Design-doc references stay in DESIGN.md, not in the code. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pin hardening_run_args' contract and required_apt_packages' defensive branches (non-mapping params, unmapped base_command) with non-docker unit tests, so container.py hits 100% branch coverage without needing docker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The M0/M1/M2/M3 labels were invented phase tags, not grounded in anything. Name each stage by what it does and anchor it to its tracking issue (#260/#252/#212/#235/#189) instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The container code was behind a NotImplementedError — episodes couldn't reach it. Wire it: ContainerWebappRuntime runs the world as a real Docker container, and WebappPack.realize() routes CONTAINER to it. It reuses SubprocessRuntime by treating `docker run` (foreground) as the supervised child — the container's app prints the same startup line a local subprocess would, the published host port is resolved with `docker port`, and the request log is read out of the running container (`docker exec cat`). A `_read_log_bytes()` seam shares all the existing log/surface/collect logic between the two backings; PROCESS stays byte-for-byte the same. Load-bearing parity test: the SAME snapshot + SAME exploit grades identically on PROCESS (in-memory emulation) and CONTAINER (a real shell in a real container) — only fidelity changes, not the task surface. Plus unit tests for the backing routing, the construct-without-docker path, and image reuse across resets. New code is at 100% branch coverage (one container-gone guard pragma'd). Scope: one container for the whole world. Multiple per-service containers on a real network is the networked-services work (#212 / #235), not this. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
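Resolving the published host port from `docker port` output can be sketched as below. The function names are hypothetical and this is not ContainerWebappRuntime's code; it assumes the usual output shape of one `address:port` mapping per line (for example an IPv4 and an IPv6 line).

```python
import subprocess


def parse_host_port(output: str) -> int:
    """Take the port from the first 'address:port' line of docker port output."""
    for line in output.splitlines():
        _, sep, port = line.strip().rpartition(":")
        if sep and port.isdigit():
            return int(port)
    raise ValueError("no published port in docker port output")


def published_port(container: str, container_port: int) -> int:
    """Ask docker which host port maps to the container's TCP port."""
    completed = subprocess.run(
        ["docker", "port", container, f"{container_port}/tcp"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_host_port(completed.stdout)
```

Splitting on the last colon keeps the parser indifferent to whether the address part is `0.0.0.0`, `[::]`, or prefixed with a `8000/tcp ->` mapping.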

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CONTAINER is now wired, so it no longer surfaces NotImplementedError. Prove the backing selector reaches pack.realize with a still-unwired backing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ling branch #259 landed on main as a squash of this branch's early leak-oracle + admission work. This branch already contains all of it (consequence.py is byte-identical) plus the M0/M1/container/runtime work built on top, so -s ours keeps the superset tree and records main as merged. Verified main has no unique content: every line it has that this branch lacks is a superseded older version. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

larstalian and others added 21 commits June 11, 2026 22:00

docs(cyber): record M1 status in the §9 realization ladder

e124794

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(cyber): point M1 remaining work at #265 (world-container hardening)

7472bc9

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(cyber): record world-container hardening landed (#265 task 1)

5009f14

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

larstalian added the roadmap, pack-cyber, and research labels Jun 12, 2026

larstalian changed the title ~~Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission, M1)~~ Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission) Jun 12, 2026

larstalian and others added 2 commits June 12, 2026 13:13

docs(cyber): record the CONTAINER backing is wired as a runtime (#252)

d3f3872

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

larstalian mentioned this pull request Jun 12, 2026

Implement a plain Backing.CONTAINER realizer for cyber_webapp #252

Closed

5 tasks

test(core): unwired-backing selection test uses SIMULATOR, not CONTAINER

6ccf507

CONTAINER is now wired, so it no longer surfaces NotImplementedError. Prove the backing selector reaches pack.realize with a still-unwired backing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

larstalian merged commit bf5e69b into main Jun 12, 2026
2 checks passed

larstalian deleted the feat/cyber-verification-ceiling branch June 12, 2026 19:30

github-actions Bot locked and limited conversation to collaborators Jun 12, 2026

larstalian merged 26 commits into main from feat/cyber-verification-ceiling
