feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes by larstalian · Pull Request #257 · vecna-labs/open-range

larstalian · 2026-06-10T04:11:10Z

Self-Review

Scope focused on the cyber gym's per-class transfer-validity + trainability
Reviewed the diff for architectural drift and unintended public API changes
Tests and docs updated; live-agent validated, not just scripted oracles

Toward #190; foundation for #212. Design: packs/cyber_webapp/DESIGN.md. Follow-ups: #258.

Summary

The cyber gym shipped 3 vuln classes, one exploit shape, with memorizable templates — too narrow and too easy-to-overfit for a per-vulnerability-class sim-to-real transfer study (H2). This generalizes it to 9 classes across 3 shapes, hardens each into a faithful, replay-resistant, discoverable exploit, and — critically — makes it real-agent solvable, verified by driving a live LLM agent through the actual episode harness (not scripted oracles).

Validity (the H2 measurement target).

Faithful engines — command_injection/ssti/xxe run real shlex/Jinja/xml.sax, not string-matchers ({{7*7}}→49; a bare SYSTEM "file://" no longer leaks).
Discoverable flags — a read-config→pivot recon chain; the flag path is randomized so brute force doesn't pay.
Mutually-exclusive payload contexts — a live 3×3 replay matrix is fully diagonal for all 9 classes; single-payload replay floor 67%→33%, so an agent must learn all three techniques, not one string.

Trainability (the part scripted tests hid). Driving a real agent through the harness showed the validity-hardened "standard" tier is too hard for a fresh agent — it solved ~2 of 9 classes, because the thin instruction blocked vuln classification and the recon chain made file-loot a two-stage exploit it couldn't walk. So the gym adds a difficulty knob:

standard (default): blind, recon-required — the H2 transfer-measurement target.
easy/guided: names the vuln class, the flag's location, the sampled context, and a one-step payload recipe — the agent still crafts and executes the real exploit; only recon/classification is removed.

A live-agent matrix (9 classes × 2 contexts, a real claude agent through the real harness) solves 18/18 at easy vs ~3/22 at standard. The gym is real-agent-trainable via the easy tier and a manifest-driven easy→standard curriculum.

Testing

tests/test_cyber_staged_generation.py: the real pipeline end to end (no mocks) — every class forced as the oracle and solved by its own context-appropriate HTTP exploit; mutual-exclusivity, discoverability, guided-instruction (every per-class hint branch), and degenerate-graph guards. Full gauntlet green (ruff/mypy/boundary/pytest/coverage, 738 passed, 87%).
Live-agent eval: a real agent solves 18/18 at easy tier; the standard-tier ~14% is the (intended) hard H2 baseline. This is what made "actually works" defensible.

Review Notes & honest residuals

PROCESS rung emulates faithfully (real engines/parsers in-process); a real shell/filesystem with RCE is the container backing (Implement a plain Backing.CONTAINER realizer for cyber_webapp #252).
sql_injection/idor/weak_credentials contexts are disjoint serializations of one skill, not three competencies — documented in DESIGN.md.
Deferred and tracked in cyber gym: POST/body channel + difficulty-curriculum knobs (deferred from #257) #258: POST/body channel (body-shaped exploits are GET-query today), a richer default instruction, and an auto-difficulty curriculum.

… goal The cyber pack's missing design counterpart to its README. Captures: procedural owns correctness / LLM owns variety (behind admission); staged constraint- propagating generation (the builder already does this via oracle_service_id, but hardcodes one loot shape); organize by exploit SHAPE not CWE; reuse data_store.engine for file/exec loot (no new ontology kind); and the goal — 3 shapes / ~8 vuln classes, solvable by construction. The spec the staged- generation work builds against. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… shape Generalize the builder's hardcoded loot placement into a staged choice that emits the loot shape as the constraint the vuln stage consumes — the staged, constraint-propagating generation in DESIGN.md. A "db" loot keys the flag by a record (response-leak exploits read it); a "file" loot keys it by an absolute path in an in-memory file map (a file-read exploit reads it). The oracle vuln is forced to a kind whose exploit shape matches the loot, so every world is solvable by construction — no extra reject-and-repair. Adds the path_traversal class (file_read shape): a Jinja template whose handler joins a client path onto a base dir without confinement, so '../' or an absolute path escapes to any file in the store. The flag lives only in the in-memory file map (never on disk, never in the db/secrets) so a stray response-leak vuln can't shortcut the challenge. Reuses data_store kind=file (already in the ontology; no ontology change). Loot shape and vuln-class mix are manifest-configurable (loot_shapes / vuln_kinds fold into the prior, like scale), so the study can target a shape or class. Proven end to end: a file-loot world admits, realizes, and is solved by a real path-traversal HTTP exploit that recovers the flag (tests/ test_cyber_staged_generation.py). 720 passed, 86% coverage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…peline Third exploit shape on the staged pipeline. A file loot now serves both file_read and code_exec: command_injection concatenates a client parameter into a diagnostic command, and an in-process interpreter resolves an injected `cat <path>` segment against the in-memory file store — the PROCESS-backing emulation of a shell (a container backing makes it real). Same flag, same store, no new realizer plumbing. The gym now spans 3 exploit shapes (response-leak, file-read, code-exec) across 5 classes. Proven end to end: a forced command_injection oracle solved by a real `; cat` injection that recovers the flag. 721 passed, 86% coverage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

….rules Close the new-code coverage gaps: edge tests for non-mapping and degenerate loot_shapes/vuln_kinds manifest values (both fall back to db), pragma the _forced_oracle None return (every loot shape has an eligible oracle vuln, so admission never sees it), and demote its docstring to an inline comment (underscore helper). Remaining uncovered lines in sampling.py are pre-existing helper guards (#201), not introduced here. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add four classes on the existing shape pipelines: xxe (file-read) and ssti (code-exec) on the file store; idor and weak_credentials (response-leak) on the db store. The gym now spans 3 exploit shapes across 9 classes — sql_injection, ssrf, broken_authz, idor, weak_credentials / path_traversal, xxe / command_injection, ssti — each proven end to end by its own real HTTP exploit (XXE external entity, SSTI expression, IDOR id, default credentials). Decoy files now sample into the content-addressed graph (a sampler stage adds benign file records to the loot store) instead of being hardcoded at realize time, so they vary by seed; the flag-path lookups target the flag's record, not a decoy. DESIGN.md updated: 9-class status table, the default loot mix (db:7, file:3) rationale, and that PROCESS emulates the fs/shell while a container backing (#252) makes them real with exec-sandbox hardening (#202). 100% branch coverage on the new classes; 727 passed, 86% coverage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The audit found command_injection / ssti / xxe were string-matchers, not the technique — `{{7*7}}` did nothing, a bare `SYSTEM "file://"` substring leaked, only `; cat <bareword>` worked. An agent trained on those learns a magic string that transfers to nothing, invalidating per-class transfer (H2). Now each runs a real engine in-process: SSTI a sandboxed Jinja env (`{{7*7}}`->49, `{{ config }}` dumps the store), XXE a real SAX parser with external-entity resolution over the in-memory store (well-formed DOCTYPE/ENTITY/reference required; a substring no longer leaks), command_injection a real `shlex` tokenizer honoring `;|&` separators, `$()`/backtick substitution, quoting, basename, and a broad reader set. weak_credentials is already real equality auth. The agent must produce the real technique, so it transfers; the only thing still emulated is an OS shell with RCE escalation, which the container backing (#252) provides. DESIGN.md updated. 727 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…orce pool) The audit showed flag discovery for file/exec shapes was a blind guess from a 20-element hardcoded path pool (404 on miss, decoys deliberately disjoint from the loot dir) — the agent learns OpenRange's dictionary, not "find the file." Now each file-loot world plants a config at a conventional path (/etc/app/ settings.conf, …) disclosing the flag's directory and backup_file path. The exploit chain becomes real recon: read a guessable config via the vuln → pivot to the path it names → read the flag. Verified end to end for path traversal; the same store is read by xxe/cmdi, and ssti's context dump already exposes it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The re-audit caught that the discoverability fix was additive, not substitutive: the flag still sat at one of 20 enumerable dir/name combos, so brute-forcing the pool (3-9 requests) was strictly cheaper than reading the config — an RL agent had no pressure to learn recon, and the degenerate "memorize the dictionary" signal the audit flagged was still sitting next to it. Add a high-entropy directory segment to the flag path (16^8 space), so the absolute path is unenumerable and brute-forcing the dir/name pools no longer finds it. The planted config still discloses the full path, so reading it is now the only tractable route — discovery becomes a genuine recon capability. Verified: brute-force of the 20-combo pool no longer hits; the read-config -> pivot -> flag chain still works. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…-setter) The re-audit's #1 remaining transfer-validity blocker: each class was one replayable payload, so an agent memorizes the string instead of learning the technique — confounding per-class transfer (H2). Fix: sample an injection *context* per build that forces the agent to adapt the exploit. command_injection now samples a quoting context (unquoted / single / double) and a real quote-aware shlex tokenizer (punctuation_chars) splits on UNQUOTED separators while command substitution fires except inside single quotes — real shell semantics. Verified each context requires a different correct break-out (`; cat` unquoted, `$(cat …)` in double quotes, `'…; …; echo '` in single) and a mismatched-context payload fails. Sets the pattern for the other 8 classes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The re-audit's last open H2 blocker: each class was one replayable payload, so an agent memorizes the string, not the technique — a per-class transfer confound. Since the agent only sees the HTTP surface (never server code), the fix is to sample an injection CONTEXT per build where the correct exploit genuinely differs and a mismatched-context payload fails: - sql_injection: single / numeric / double quoting (real sqlite) - command_injection: unquoted / single / double (quote-aware shlex tokenizer) - path_traversal: absolute / ../ / ....//-past-a-naive-filter (real posixpath) - ssti: raw / comment / expr render sink (real sandboxed Jinja) - xxe: element-content / wrapped-root / scheme-prefix entity (real xml.sax) - ssrf: no-filter / scheme-block / host-allowlist-bypass — and rewired from a dead decoy into a live oracle (resolves to an internal host -> leaks secret) - idor: direct / base64 / prefixed reference encoding - broken_authz: single-token / dual-factor / encoded-token forge - weak_credentials: pair / combined / basic submission Context params are `default()`-safe so mutation.py and bare callers still render. The episode test fans out over all 9 classes, each solved by its own context-appropriate exploit through the live harness; pure-function tests cover every payload-builder branch and broken_authz's dual_factor. 733 passed, 86%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…7%->33%) The re-audit measured a ~67% single-payload replay floor: the 3 contexts per class formed a permissiveness order, so one "strict" payload also solved the more-permissive builds (e.g. `$(cat)` worked unquoted AND double-quoted). An agent could memorize one string per class and pass ~2/3 of builds without adapting — a residual H2 confound. Each leaky class's handler now ENFORCES its context so the 3 are mutually exclusive (a payload for one build fails the other two): - command_injection separator/substitution/quoted — each strips the others' vectors ($()/backticks vs `;|&` separators vs quote-wrapping) - path_traversal absolute_only/relative/dotdot_filter — strip-to-convergence vs no-strip+re-anchor vs strip-once+re-anchor - ssti attribute/comment/expr — distinct break-outs, each inert in the others - broken_authz single/dual/encoded — single & encoded reject a foreign confirm param; dual requires it; encoded requires the hashed value - ssrf scheme_block/host_allowlist/decimal_ip — three disjoint evasions of the same internal host (also retires the permissive no_filter) - xxe (already done) element/wrapped-root/scheme-prefix A live 3x3 replay matrix per class confirms it: 53/54 cells correct — 5 classes perfectly diagonal, xxe with one inherent residual (reflect-any accepts the specific-root payload; left distinct rather than collapsed). Floor ~33%. 733 passed, 86%, ruff/mypy clean. DESIGN.md updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… floor The final re-audit confirmed 8/9 classes at the 33% replay floor but found xxe still at 66.7%: the wrapped_root payload solved element_content builds 300/300 seeds, because element_content reflected ANY root and so was a strict superset of wrapped_root. Close it without collapsing the two into one technique: element_content now reflects only the document root's DIRECT (depth-1) text, while the wrapped_root payload nests the entity a level deeper — distinct injection positions (top-level vs nested), not a root-name swap. The live 3x3 matrix is now fully diagonal (0 off-diagonal leaks), so all 9 classes sit at the ~33% single-payload floor. DESIGN.md records the one remaining threat-to-validity (sql_injection / idor / weak_credentials contexts are disjoint serializations of one skill). 733 passed, 86%, ruff/mypy clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The re-audit's secondary residual: a wrong-technique attempt was indistinguishable from a benign miss (path traversal both 404; ssti both empty 200), so the agent got no signal it was hitting the right vuln class with the wrong technique. - path_traversal: a neutralized traversal attempt now returns 403 ("path not permitted") vs 404 for a benign filename miss; base dirs sampled at varied DEPTH (2-5) so the relative payload's "../" count is build-specific structure. - command_injection: a stripped injection (shell metacharacters) returns "input rejected" vs the benign diagnostic echo. - ssti: a swallowed template injection returns "template directive ignored" vs a plain render. All three reshape only the NON-leak responses, so the mutual-exclusivity matrix is unchanged (re-verified: cmdi/path/ssti still 0 off-diagonal). Tests for the path and cmdi feedback signals. DESIGN.md documents the feedback + the honest structural-variety asymmetry (SQLi embeds table+column in the payload; file-read/ cmd-exec carry their diversity in three distinct techniques, not payload shape). 734 passed, 86%, ruff/mypy clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…nable) Live-agent eval (a real claude agent driven through the actual episode harness, not scripted oracles) showed the gym was NOT trainable as-is: at standard tier a strong agent solved only ~1-2 of 9 classes — the thin instruction left it unable to classify the vuln, and the discovery recon chain made file-loot a 2-stage exploit it couldn't walk (command_injection failed even with rich hints and a 20-minute budget). Add a `difficulty` manifest knob: - `easy`/guided: the pentest instruction names the vuln class, the flag's exact location, and the sampled context, plus a concrete one-step payload recipe — so the world is a single exploit a real agent can actually solve (bootstrapping / curriculum floor). The core skill (craft + execute the exploit) remains. - `standard` (default, unchanged): the blind, recon-required, validity-hardened world used for the H2 transfer measurement. Live-validated: the exact command_injection world that failed thin, rich, AND at 20 minutes is solved at easy tier in under 500s. Tests cover every per-class hint branch, the tier aliases, and the degenerate-graph guards. 738 passed, 87% cov. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…lity tier broken_authz was the lone easy-tier failure (0/2): the trusted value is a query param named like a header (X-User-Role), and the dual_factor/encoded_token hints omitted 'query parameter', so the agent tried it as an HTTP header. Clarify all three hints; both contexts now solve live. Easy-tier matrix is 18/18 across all 9 classes (vs ~3/22 standard). DESIGN.md documents the live-agent finding, the validity-vs-trainability tradeoff, and the standard/easy difficulty tiers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Audited every comment and docstring added by the staged-generation / difficulty work against .rules: dropped references to the development process and research framing (the audit, H2 / transfer confound, replay floor, "the agent must adapt/replay", validity-hardened, "was X -> now Y"), removed BUG: tags and name-restating docstrings on underscore helpers, and deleted comments that only restated the code. Kept the load-bearing WHY (hidden constraints, invariants like SQLite double-quote-as-string and the secret-never-on-disk rule, and the terse deferred container-backing note). Comments/docstrings only — 738 passed, no behavior change; all templates still render. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

larstalian and others added 5 commits June 9, 2026 22:28

larstalian changed the title ~~feat(cyber): staged loot→vuln generation + file-read & code-exec shapes~~ feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes Jun 10, 2026

larstalian and others added 8 commits June 10, 2026 09:26

test(cyber): fix mypy no-any-return in discovery test

1661a0f

larstalian mentioned this pull request Jun 10, 2026

cyber gym: POST/body channel + difficulty-curriculum knobs (deferred from #257) #258

Open

larstalian and others added 4 commits June 10, 2026 12:55

larstalian merged commit f499c5b into main Jun 11, 2026
2 checks passed

larstalian deleted the feat/cyber-staged-generation branch June 11, 2026 15:09

github-actions Bot locked and limited conversation to collaborators Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes#257

feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes#257
larstalian merged 17 commits into
mainfrom
feat/cyber-staged-generation

larstalian commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

larstalian commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Self-Review

Summary

Testing

Review Notes & honest residuals

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larstalian commented Jun 10, 2026 •

edited

Loading