Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission)#266

Merged
larstalian merged 26 commits into main from feat/cyber-verification-ceiling
Jun 12, 2026

@larstalian larstalian commented Jun 12, 2026

Collaborator

What this does

Makes the cyber gym self-verifying and transfer-real, in layers that build on each other. (The leak/consequence oracle from #259 is already on main; this PR builds on it.)

LLM realization behind a dynamic admission gate (#260). The LLM can write a vuln handler; we don't trust it. We render it into a procedurally-built world, run the intended exploit and a benign request, and let the consequence oracle decide: the exploit must leak the flag, the benign request must not. Accept iff solvable-and-not-trivial. Driven by a new ClaudeBackend for the claude CLI, since codex declines the cyber task.
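
The accept/reject rule above can be sketched as a small pure function. This is a minimal illustration, not the pack's `classify_admission`: the names `Verdict`, `leaks`, and `classify` are hypothetical, the leak check is reduced to a literal substring test, and checking triviality before solvability is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    TRIVIAL = "trivial"            # the benign request already leaks the flag
    NOT_SOLVABLE = "not_solvable"  # the intended exploit does not leak it


def leaks(response_body: str, flag: str) -> bool:
    """Stand-in for the consequence oracle: a literal substring check."""
    return flag in response_body


def classify(exploit_body: str, benign_body: str, flag: str) -> Verdict:
    # Triviality is checked first (an assumption): a handler that leaks on
    # every request is rejected even though the exploit also succeeds.
    if leaks(benign_body, flag):
        return Verdict.TRIVIAL
    if not leaks(exploit_body, flag):
        return Verdict.NOT_SOLVABLE
    return Verdict.ACCEPT
```

Only a handler whose exploit response leaks the flag while its benign response stays clean reaches `Verdict.ACCEPT`; the other two outcomes are the two rejection reasons the description names.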

A real-container backing, wired as a runtime (#252). The same generated app the in-memory PROCESS backing runs, but as a real container that episodes actually use (ContainerWebappRuntime, selected by Backing.CONTAINER), with OPENRANGE_REALFS set so surfaces go real:

  • file-read (path_traversal, xxe) does a real open() against a real filesystem — a traversal escape is real OS path resolution, not a dict lookup;
  • command_injection runs a real sh -c, with the three mutually-exclusive injection contexts preserved;
  • world images stay per-vuln lean — only the OS tool a world's own command_injection runs server-side is installed; a file-read/SQLi-only world installs nothing;
  • the container that now runs attacker code is contained: all Linux capabilities dropped, no privilege escalation, memory/cpu/pid caps.

It reuses the subprocess runtime (docker run is the supervised child), resolves the published host port with docker port, and reads the leak signal out of the running container. It's all additive — the PROCESS backing stays byte-for-byte the same.

Why

The cyber gym's value is bounded by its verifier; a self-verifying loop will ship trivial or unfaithful worlds unless an independent consequence-verifier rejects them. This builds the realization gate and the container backing it runs on — moving generation toward the LLM (variety, scale) while keeping correctness with procedural + admission. Design in packs/cyber_webapp/DESIGN.md §8 (the verifier) and §9 (the staged plan: process → container → networked → cluster, each stage tracked by its own issue).

Testing

  • Full suite: 796 passed, 4 skipped (env-gated: 2 live-GRPO, 2 strands extra).
  • One real trl.GRPOTrainer GRPO step over a live SWE and cyber world (HTTP tools) — both pass (OPENRANGE_LIVE_TRL=1).
  • Cross-backing parity (the load-bearing check): the same snapshot + same exploit grades identically on PROCESS and CONTAINER — only fidelity changes, not the task surface.
  • Real container integration: docker-gated tests build the image, run the container, recover the flag by exploiting over HTTP (across injection/confinement contexts), and verify the hardening is real (CapEff all-zero inside, still exploitable under the flags).
  • Reward rungs intact (test_trl_cyber); new modules at 100% branch coverage; no mocks — real subprocesses (docker, fake-CLI scripts), real HTTP, real episodes.

Scope / deferred (tracked)

Notes

origin/main (the #259 leak-oracle squash) is merged in — this branch already contained that work (consequence.py is byte-identical) plus everything built on top, so the diff collapses to the net-new M0/M1/container/runtime work. Follows .rules (integration tests, no mocks, comments WHY-only, no roadmap/phase tags in code). The stray open-range.zip is untracked and not part of this PR.

larstalian and others added 21 commits June 11, 2026 22:00
…ation indictment

Gym change — the §8.3 "any HIDDEN value leaked" consequence oracle, wired live:
- consequence.py: detect_leak / guarded_values over HIDDEN value_ref nodes — the
  independent leak verifier, length-floored so a short value can't false-positive.
- codegen bakes guarded_values(graph) into seed.json; the rendered app scans each
  response boundary and logs leaked node IDs only (never the secret value); realize
  surfaces final_state["leaked_secret_ids"]; check_success consumes it in `reason`.
- Boundary held: success/subgoals — and thus the trainer's averaged reward rungs —
  are unchanged. Rewarding on the leak is a trainer-coordinated follow-up (#198).

Research tooling — experiments/indictment:
- An independent harness running LLM-generated worlds to measure the admit-gap:
  worlds a self-verifying loop ships that independent probes reject as trivial or
  unfaithful. 89 worlds across 4 classes; gap is a small, consistent ~2-4%. The
  harder finding: a reliable independent verifier is itself hard to build — false
  negatives AND positives, each found by hand-auditing. See RESULTS.md.

DESIGN.md §8 documents the verification ladder, the spine, and the honest results
(incl. the raw-substring oracle's known encoded-leak limitation).

Live-validated end to end (a real SQLi episode records the leak by node id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
 #259 review)

The adversarial review of the §8.3 spine flagged two real (latent) limitations in
the raw-substring oracle; close them:
- Encoded exfil: detect_leak and the rendered app scanner now search for each guarded
  value AND its cheap reversible encodings (base64 / hex / percent) by encoding the
  needle, so an encoded leak is caught, not only the literal form. gzip / binary /
  multibyte splits remain out (documented).
- Containment: detect_leak drops a guarded value that is a proper substring of another
  leaked value (offline / grader path; the live per-response signal stays raw, since
  the scanner logs ids not values).

An agreement test pins the rendered app scanner to consequence.value_variants so live
and offline verdicts can't drift. DESIGN.md §8.4 corrected (no longer "deferred").
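
A rough sketch of the two fixes described in this commit, under stated assumptions: `variants`, `detect`, and `MIN_LEN` are hypothetical names, the length floor of 8 is invented for illustration, and the real `detect_leak` / `value_variants` signatures are not shown in this PR.

```python
import base64
from urllib.parse import quote

MIN_LEN = 8  # hypothetical length floor; the real threshold is not stated here


def variants(value: str) -> set[str]:
    """The literal value plus its cheap reversible encodings."""
    raw = value.encode("utf-8")
    return {
        value,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(value, safe=""),  # percent-encoding of every reserved character
    }


def detect(body: str, guarded: dict[str, str]) -> set[str]:
    """Return ids of guarded values found in body, literal or encoded."""
    hits = {
        node_id
        for node_id, value in guarded.items()
        if len(value) >= MIN_LEN and any(v in body for v in variants(value))
    }
    # Containment: drop a value that is a proper substring of another hit,
    # so one leaked string is not reported under two ids.
    return {
        i
        for i in hits
        if not any(
            j != i and guarded[i] != guarded[j] and guarded[i] in guarded[j]
            for j in hits
        )
    }
```

Because the needle is encoded rather than the response decoded, the base64 variant only matches when the value was encoded on its own; a value embedded in a larger base64 payload at a different byte offset would not be caught by this approach.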

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
experiments/ (the indictment harness + 89 generated test worlds + writeup) was a
one-off validation run, not gym/pack code — it doesn't belong in the repo. Its
findings stay documented in DESIGN.md §8.10 as prose; the pack verifier
(consequence.py) is unaffected — it lives in the pack, where it belongs. The
coupled tests/test_indictment_harness.py goes with it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop references that .rules forbids in code comments: a section pointer
("§8.3 spine"), a forward/"future" reference ("emergent mode"), and a docstring
naming a specific test (rot-prone). The remaining comments are non-obvious WHY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
)

The realization primitive of the emergent-mode ladder (DESIGN.md §9). Today's
admission is structural (a graph-path check); an LLM-realized handler can be wrong, so
it is admitted DYNAMICALLY — run the intended exploit + a benign request and let the
consequence verifier decide: the exploit must leak the flag, the benign request must
not. Accept iff solvable and not trivial.

- realize_admit.py (pack): the pure pieces — classify_admission (the verdict, over
  consequence.detect_leak) + cmdi_exploit_and_benign (the per-class exploit oracle).
  Running an episode is a host concern (packs must not import openrange), so the
  orchestration lives with the caller, not the pack.
- codegen: a vuln node's realized_handler stands in for the template — the hook the
  LLM realization writes through.
- DESIGN.md §9: the M0->M3 realization ladder (procedural architects, LLM realizes,
  admission verifies, freeze), mapping M1/M2/M3 to #252 / #212 / #235 / #189.

Validated end to end: a faithful command_injection world is accepted; trivial and
not-solvable verdicts are exercised. 100% branch coverage; import-boundary clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI runs mypy over tests/; the rendered-app exec helper retrieved values from a
dict[str, object] namespace, so calling them tripped "object not callable".
Type the namespace dict[str, Any] so the pulled-out functions are callable.
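
A small illustration of the typing fix, assuming a helper that `exec`s rendered source and pulls a function back out; `SOURCE` and `double` are made-up stand-ins for the rendered app code.

```python
from typing import Any

SOURCE = "def double(x):\n    return 2 * x\n"

# Annotated as dict[str, object], mypy would reject the call below because
# a value of type object is not callable. dict[str, Any] lets the call through.
namespace: dict[str, Any] = {}
exec(SOURCE, namespace)

double = namespace["double"]
result = double(21)
```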

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Inject a stand-in "realized" handler into a command-injection world and run it through
codegen -> runtime -> the admission gate:
- a different-but-real handler (splits on ';', cats the file) is ACCEPTED — the gate
  lets in varied implementations, which is the diversity M0 is for;
- a handler that returns the flag on any request is REJECTED as trivial;
- a handler that never leaks is REJECTED as not solvable.

Confirms the codegen hook and the gate work together on a live episode. The live
LLM-writes-the-handler step plugs in on top (non-deterministic, so a demo not a test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dex harness)

Drives the codex LLM backend to write a command-injection handler, injects it, and runs
it through the dynamic admission gate (accept iff the exploit leaks the flag and a
benign request does not). Accepted handlers are the model's own varied implementations;
trivial or broken ones are rejected — the reusable entry point for autonomous LLM
realization, the live step on top of M0's already-tested gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A second LLMBackend alongside CodexBackend, driving `claude -p --output-format json`.
Claude has no output-schema flag, so a structured request asks for a JSON object in the
prompt and parses it out of the reply (bare or ```-fenced). Useful where codex is
unavailable (quota-limited) or declines a task it flags as risky — claude authors the
cyber gym's handlers that codex won't.
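
The "bare or fenced" parsing step could look roughly like this. It is a sketch, not the ClaudeBackend code: `parse_structured_reply` is a hypothetical name, and the input is assumed to be the model's reply text after it has been taken out of whatever envelope the CLI prints.

```python
import json
import re
from typing import Any

# Matches a three-backtick fence with an optional "json" tag. The quantifier
# form avoids writing a literal fence inside this example.
_FENCED = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_structured_reply(reply: str) -> dict[str, Any]:
    """Parse a JSON object from a reply that is either bare or fenced."""
    match = _FENCED.search(reply)
    candidate = match.group(1) if match else reply
    parsed = json.loads(candidate.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed
```

Malformed JSON surfaces as `json.JSONDecodeError`, which is a `ValueError` subclass, so a caller can treat both failure modes the same way.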

examples/cyber_realize now selects the backend (--backend claude|codex, default
claude). Ran live: claude wrote 5 distinct command-injection handlers, all 5 ACCEPTED
by the dynamic admission gate (exploit leaks the flag, a benign request does not) — the
M0 realization loop closed end to end with a real model.

Fake-CLI tests (no mocks) cover parsing, fenced JSON, flag passing, failures, timeouts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
container.image_files packages a world's rendered app into a Docker build context
(Dockerfile + app.py + seed.json). A docker-gated test proves the real path end to end:
build the image, run the container, and recover the flag by exploiting the world over
HTTP. This containerizes the existing in-memory app — the runtime foundation for
Backing.CONTAINER. Making the exploits hit the container's real fs/shell, and wiring
this in as the Backing.CONTAINER runtime, are the next M1 increments.

Caveat: the seed (with the flag) is COPYed into the image for now (an image layer until
the app unlinks it at startup); mounting it at run time is the follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er (#252)

At CONTAINER backing, container.realfs_cmdi_app builds a stdlib real-shell app: the
injected input runs through a real `sh -c` against the real filesystem, and the flag is
a real file (written from the OPENRANGE_FLAG env var at startup, never in an image
layer). So `; cat <path>` actually executes — genuine RCE/file-read, not the in-memory
emulation.

A docker-gated test proves it: build, run with the flag env, and a real `cat` recovers
the real file's flag over HTTP; a benign request does not.

Scope: command_injection, plain `; cat` injection. Re-applying the mutually-exclusive
contexts of §6 over the real shell, and a ContainerRuntime that selects this per class
via Backing.CONTAINER, are the next increments. Real RCE runs inside the container
sandbox; hardened isolation is #202.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The container backing's real-shell handler now applies the same naive,
context-specific filter the in-memory emulation uses, so the three
mutually-exclusive injection contexts (separator / substitution / quoted)
survive the move from emulation to a REAL `sh -c`:

  - separator    strips $()/backticks, keeps ; | &
  - substitution strips ; | & newline, keeps $()/backticks
  - quoted       wraps the arg in QUOTE + strips $(); the exploit must
                 break the quote

A docker-gated, parametrized test proves it end to end: a world built for
one context is exploited by THAT context's payload and is NOT exploited by
another context's payload (the wrong vectors are filtered before sh). This
carries the §6 validity work from the in-memory path to Backing.CONTAINER.
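
A sketch of the three filters listed above, to show why each context admits only its own vector. `naive_filter` is a hypothetical name; treating "strips" as per-character removal and using a single quote for QUOTE are both assumptions, since the template itself is not shown here.

```python
def naive_filter(context: str, arg: str) -> str:
    """Apply the deliberately naive, context-specific filter to one argument."""
    if context == "separator":
        drop = "$()`"      # substitution is removed; ; | & survive
    elif context == "substitution":
        drop = ";|&\n"     # chaining is removed; $() and backticks survive
    elif context == "quoted":
        drop = "$()"       # the payload has to break out of the quote
    else:
        raise ValueError(f"unknown context: {context}")
    cleaned = "".join(ch for ch in arg if ch not in drop)
    # Assumption: QUOTE is a single quote wrapped around the argument.
    return f"'{cleaned}'" if context == "quoted" else cleaned
```

A separator-style payload passes through the separator filter unchanged but loses its `;` under the substitution filter, which is the mutual exclusivity the parametrized test checks against a real shell.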

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generalize the CONTAINER backing past command_injection by putting the
"real" mode into the ONE generated multi-service app — not another bespoke
per-class app. The container sets OPENRANGE_REALFS; the rendered app then
serves its `files` surface from a real filesystem (`_RealFiles`, a real
open() per path) instead of the in-memory dict. The PROCESS backing never
sets the env and stays byte-for-byte the in-memory emulation.
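
The env-gated switch might be pictured like this. Only `OPENRANGE_REALFS` comes from the commit; `RealFiles`, `files_surface`, and the `.get(path)` call shape are assumptions made for the sketch, and the real `_RealFiles` may differ.

```python
import os
import tempfile
from collections.abc import Mapping
from typing import Optional, Union


class RealFiles:
    """Serves reads with a real open() per path instead of a dict lookup."""

    def __init__(self, root: str) -> None:
        self.root = root

    def get(self, path: str) -> Optional[str]:
        try:
            # os.path.join leaves resolution of ".." and absolute paths to
            # the OS, which is what makes a traversal escape real.
            with open(os.path.join(self.root, path), encoding="utf-8") as fh:
                return fh.read()
        except OSError:
            return None


def files_surface(
    seed_files: dict[str, str],
    root: str,
    env: Mapping[str, str] = os.environ,
) -> Union[dict[str, str], RealFiles]:
    # Unset env (the PROCESS backing): keep the in-memory dict untouched.
    return RealFiles(root) if env.get("OPENRANGE_REALFS") else seed_files


# Usage: the same .get() call site reads from disk or from the dict.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("hello")
    from_disk = files_surface({}, root, env={"OPENRANGE_REALFS": "1"}).get("notes.txt")
in_memory = files_surface({"notes.txt": "hi"}, "/unused", env={}).get("notes.txt")
```

Since both objects answer `.get(path)`, handlers written against the dict need no changes, which matches the "zero handler changes" claim in the next paragraph.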

This makes the whole file-read shape genuinely real with ZERO handler
changes: a path-traversal escape is real OS path resolution against the
container fs, and the cmdi readers `cat` real files. Proven by a docker-
gated, parametrized test across the three confinement contexts
(absolute_only / relative / dotdot_filter): each world is read only by its
own technique's payload and neutralizes the others — the same
mutually-exclusive-contexts guarantee the in-memory emulation makes, now
holding over a real fs.

The generated app is also the surface the next milestone containerizes and
reads the request-log leak signal from, so this is foundation, not a
throwaway. The stdlib image_files_realfs variant remains the standalone
real-`sh -c` proof for code_exec until that folds into the generated app
too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oke variant (M1)

Complete the CONTAINER unification: command_injection now runs a real `sh -c`
inside the ONE generated multi-service app under the same OPENRANGE_REALFS
gate, so the container is a single app with every shape real — no parallel
per-class app. The bespoke `image_files_realfs` / `realfs_cmdi_app` stepping
stone is removed.

- command_injection.py.j2 gains a real-shell branch (gated by OPENRANGE_REALFS):
  the same naive, context-specific §6 filter, then a real `sh -c`. PROCESS
  leaves the env unset and stays the in-memory emulation byte-for-byte.
- The image installs the diagnostic tools base_command samples from
  (ping/nslookup/dig/host/traceroute) so the real shell behaves like a real
  vulnerable endpoint: a chained `; cat` reads the flag, and `$(cat flag)`
  leaks it too because each tool echoes the flag-as-hostname in its resolver
  error. This is the faithfulness the bespoke app hid by hardcoding `echo`.
- container.py drops the bespoke variant + its now-unused imports; the
  docker-gated cmdi tests retarget onto `image_files` (the generated app),
  still proving the three §6 contexts mutually exclusive over a real shell.
- DESIGN.md §9 M1-status updated: file_read + code_exec both real on the one
  app; remaining M1 is ssti-unsandboxed then isolation (#202).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A world is the target the agent attacks over HTTP — not the agent's toolbox.
So its image should carry only what its OWN vulns run server-side, not a
diagnostic toolkit every world drags along.

Replace the unconditional 5-tool apt-install with `required_apt_packages(graph)`:
it returns only the apt packages the world's command_injection vulns need
(base_command → package, union across vulns), and `_dockerfile()` skips the apt
layer entirely when the set is empty. A path-traversal / SQLi-only world now
builds a lean image with no OS tools; a cmdi world installs only its own
base_command's tool (e.g. `dnsutils` for nslookup/dig/host).
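
A simplified sketch of the scoping logic. The real `required_apt_packages` takes the world graph; here it is reduced to a list of base commands, and `TOOL_PACKAGE`, `required_packages`, and `dockerfile` are hypothetical names. The `dnsutils` mapping and the `python:3.13-slim` base come from this commit message; the other package names and the Dockerfile layout are assumptions.

```python
TOOL_PACKAGE = {
    "nslookup": "dnsutils",
    "dig": "dnsutils",
    "host": "dnsutils",
    "ping": "iputils-ping",      # assumed Debian package name
    "traceroute": "traceroute",  # assumed Debian package name
}


def required_packages(base_commands: list[str]) -> set[str]:
    """Union of the packages the world's own command_injection vulns need."""
    # Unmapped commands are skipped here; the real code guards this case
    # with a lockstep test rather than relying on silence.
    return {TOOL_PACKAGE[c] for c in base_commands if c in TOOL_PACKAGE}


def dockerfile(packages: set[str]) -> str:
    lines = ["FROM python:3.13-slim"]
    if packages:  # no apt layer at all for a file-read or SQLi-only world
        pkgs = " ".join(sorted(packages))
        lines.append(
            "RUN apt-get update"
            f" && apt-get install -y --no-install-recommends {pkgs}"
            " && rm -rf /var/lib/apt/lists/*"
        )
    lines += ["COPY app.py seed.json /app/", 'CMD ["python", "/app/app.py"]']
    return "\n".join(lines) + "\n"
```

A world with no command_injection vulns yields an empty set and therefore a Dockerfile with no `RUN apt-get` line.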

The base_command tool belongs in the TARGET container because the server runs
it as the vulnerability — confirmed against the codebase's world/agent split
(`base_url` = the world, `solver_root` = the agent's own workspace; "bring your
own agent harness"). The agent's recon/exploit tooling lives in its separate
sandbox, not the world image. DESIGN.md §9 gains a plain-language "Two
environments, not one" note.

Tests: required_apt_packages scopes to the world's tool (empty for file-read);
the Dockerfile installs OS tools only when needed; a lockstep guard asserts
every sampled base_command maps to a package (else a cmdi world would silently
ship without its tool). Verified empirically that all five diagnostic tools
reflect their argument on python:3.13-slim, so scoping never breaks the
substitution exploit. Docker-gated cmdi + path_traversal tests still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The CONTAINER backing runs attacker-controlled code (a real `sh -c`, a real
filesystem), so contain it. `hardening_run_args()` returns the `docker run`
flags — `--cap-drop=ALL`, `--security-opt=no-new-privileges`, and memory /
cpu / pids caps — so an exploit can't escalate, fork-bomb, or exhaust the
host. It's a reusable building block the #252 CONTAINER runtime will run with;
the docker tests now run every world through it.
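
For illustration, the flag set could be assembled along these lines. The flag names are standard `docker run` options; the function names and the default limit values are placeholders, not the values `hardening_run_args()` actually uses.

```python
def hardening_args(memory: str = "256m", cpus: str = "0.5", pids: int = 128) -> list[str]:
    """docker run flags that contain attacker-controlled code (limits are placeholders)."""
    return [
        "--cap-drop=ALL",                    # drop every Linux capability
        "--security-opt=no-new-privileges",  # block setuid-style escalation
        f"--memory={memory}",                # cap memory use
        f"--cpus={cpus}",                    # cap CPU use
        f"--pids-limit={pids}",              # blunt fork bombs
    ]


def docker_run_command(image: str) -> list[str]:
    # Built as an argv list so it can be handed to subprocess without a shell.
    return ["docker", "run", "--rm", *hardening_args(), image]
```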

A docker-gated test proves the containment is real, not just configured:
`docker inspect` shows the caps dropped + limits set, and `cat
/proc/self/status` inside the container shows CapEff all-zero. Crucially the
world stays exploitable over HTTP under the flags (the DNS-resolution leak and
the `cat` chain need no capabilities), so containment doesn't break the vuln.
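
The CapEff check reduces to parsing one hex field from `/proc/self/status`. A minimal sketch, assuming the file contents have already been read out of the container; `effective_caps` is a hypothetical helper name.

```python
def effective_caps(proc_status: str) -> int:
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in proc_status.splitlines():
        if line.startswith("CapEff:"):
            # The field is a hexadecimal bitmask; all zeros means no capabilities.
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")
```

A return value of `0` corresponds to the "CapEff all-zero" condition the docker-gated test asserts.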

This is task 1 of #265. Read-only-rootfs + egress-blocking + flag-out-of-image
remain there (read-only needs the app's writable-path rework; egress rides the
M2 network rung).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code comments/docstrings shouldn't carry roadmap-phase or task tags (M0/M1,
DESIGN.md §-refs, issue numbers) — they rot, and commits/PRs are the place
for that context. Strip them from the new code, keeping the WHY. Design-doc
references stay in DESIGN.md, not in the code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pin hardening_run_args' contract and required_apt_packages' defensive
branches (non-mapping params, unmapped base_command) with non-docker unit
tests, so container.py hits 100% branch coverage without needing docker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian added the roadmap, pack-cyber, and research labels Jun 12, 2026
The M0/M1/M2/M3 labels were invented phase tags, not grounded in anything.
Name each stage by what it does and anchor it to its tracking issue
(#260/#252/#212/#235/#189) instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian changed the title from "Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission, M1)" to "Cyber: self-verifying generation + real-container backing (leak oracle, LLM admission)" Jun 12, 2026
larstalian and others added 2 commits June 12, 2026 13:13
The container code was behind a NotImplementedError — episodes couldn't reach
it. Wire it: ContainerWebappRuntime runs the world as a real Docker container,
and WebappPack.realize() routes CONTAINER to it.

It reuses SubprocessRuntime by treating `docker run` (foreground) as the
supervised child — the container's app prints the same startup line a local
subprocess would, the published host port is resolved with `docker port`, and
the request log is read out of the running container (`docker exec cat`). A
`_read_log_bytes()` seam shares all the existing log/surface/collect logic
between the two backings; PROCESS stays byte-for-byte the same.
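
Resolving the published port comes down to parsing `docker port` output. The sketch below only handles the text; `parse_host_port` is a hypothetical name, and how the runtime invokes the command is not shown in this PR.

```python
def parse_host_port(docker_port_output: str) -> int:
    """Extract the host port from `docker port` output.

    Accepts the short form ("0.0.0.0:32768") and the full form
    ("8000/tcp -> 0.0.0.0:32768"), and returns the first binding found.
    """
    for line in docker_port_output.splitlines():
        binding = line.split("->")[-1].strip()
        # rpartition keeps IPv6 hosts such as "[::]" intact.
        _host, _sep, port = binding.rpartition(":")
        if port.isdigit():
            return int(port)
    raise ValueError("no published port in docker port output")
```

When a port is published on both IPv4 and IPv6 the command prints one line per binding; taking the first is a simplification that assumes both map to the same host port.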

Load-bearing parity test: the SAME snapshot + SAME exploit grades identically
on PROCESS (in-memory emulation) and CONTAINER (a real shell in a real
container) — only fidelity changes, not the task surface. Plus unit tests for
the backing routing, the construct-without-docker path, and image reuse across
resets. New code is at 100% branch coverage (one container-gone guard pragma'd).

Scope: one container for the whole world. Multiple per-service containers on a
real network is the networked-services work (#212 / #235), not this.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CONTAINER is now wired, so it no longer surfaces NotImplementedError. Prove
the backing selector reaches pack.realize with a still-unwired backing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ling branch

#259 landed on main as a squash of this branch's early leak-oracle + admission
work. This branch already contains all of it (consequence.py is byte-identical)
plus the M0/M1/container/runtime work built on top, so `git merge -s ours` keeps the
superset tree and records main as merged. Verified main has no unique content:
every line it has that this branch lacks is a superseded older version.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian merged commit bf5e69b into main Jun 12, 2026
2 checks passed
@larstalian larstalian deleted the feat/cyber-verification-ceiling branch June 12, 2026 19:30
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 12, 2026
Labels

  • pack-cyber: Cyber pack work
  • research: Exploratory / no near-term plan
  • roadmap: Tracked on the public roadmap

1 participant