
feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes#257

Merged
larstalian merged 17 commits into main from feat/cyber-staged-generation on Jun 11, 2026

@larstalian larstalian commented Jun 10, 2026

Collaborator

Self-Review

  • Scope focused on the cyber gym's per-class transfer-validity + trainability
  • Reviewed the diff for architectural drift and unintended public API changes
  • Tests and docs updated; live-agent validated, not just scripted oracles

Toward #190; foundation for #212. Design: packs/cyber_webapp/DESIGN.md. Follow-ups: #258.

Summary

The cyber gym shipped 3 vuln classes in one exploit shape with memorizable templates, which is too narrow and too easy to overfit for a per-vulnerability-class sim-to-real transfer study (H2). This PR generalizes it to 9 classes across 3 shapes, hardens each into a faithful, replay-resistant, discoverable exploit, and, critically, makes it solvable by a real agent, verified by driving a live LLM agent through the actual episode harness (not scripted oracles).

Validity (the H2 measurement target).

  • Faithful engines — command_injection/ssti/xxe run real shlex/Jinja/xml.sax, not string-matchers ({{7*7}}→49; a bare SYSTEM "file://" no longer leaks).
  • Discoverable flags — a read-config→pivot recon chain; the flag path is randomized so brute force doesn't pay.
  • Mutually-exclusive payload contexts — a live 3×3 replay matrix is fully diagonal for all 9 classes; single-payload replay floor 67%→33%, so an agent must learn all three techniques, not one string.

Trainability (the part scripted tests hid). Driving a real agent through the harness showed the validity-hardened "standard" tier is too hard for a fresh agent — it solved ~2 of 9 classes, because the thin instruction blocked vuln classification and the recon chain made file-loot a two-stage exploit it couldn't walk. So the gym adds a difficulty knob:

  • standard (default): blind, recon-required — the H2 transfer-measurement target.
  • easy/guided: names the vuln class, the flag's location, the sampled context, and a one-step payload recipe — the agent still crafts and executes the real exploit; only recon/classification is removed.

A live-agent matrix (9 classes × 2 contexts, a real claude agent through the real harness) solves 18/18 at easy vs ~3/22 at standard. The gym is real-agent-trainable via the easy tier and a manifest-driven easy→standard curriculum.

Testing

  • tests/test_cyber_staged_generation.py: the real pipeline end to end (no mocks) — every class forced as the oracle and solved by its own context-appropriate HTTP exploit; mutual-exclusivity, discoverability, guided-instruction (every per-class hint branch), and degenerate-graph guards. Full gauntlet green (ruff/mypy/boundary/pytest/coverage, 738 passed, 87%).
  • Live-agent eval: a real agent solves 18/18 at easy tier; the standard-tier ~14% is the (intended) hard H2 baseline. This is what made "actually works" defensible.

Review Notes & honest residuals

larstalian and others added 5 commits June 9, 2026 22:28
… goal

The cyber pack's missing design counterpart to its README. Captures: procedural
owns correctness / LLM owns variety (behind admission); staged constraint-
propagating generation (the builder already does this via oracle_service_id, but
hardcodes one loot shape); organize by exploit SHAPE not CWE; reuse
data_store.engine for file/exec loot (no new ontology kind); and the goal —
3 shapes / ~8 vuln classes, solvable by construction. The spec the staged-
generation work builds against.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… shape

Generalize the builder's hardcoded loot placement into a staged choice that
emits the loot shape as the constraint the vuln stage consumes — the staged,
constraint-propagating generation in DESIGN.md. A "db" loot keys the flag by a
record (response-leak exploits read it); a "file" loot keys it by an absolute
path in an in-memory file map (a file-read exploit reads it). The oracle vuln is
forced to a kind whose exploit shape matches the loot, so every world is
solvable by construction — no extra reject-and-repair.
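
A minimal sketch of the staged choice, not the builder's actual code: the `ELIGIBLE` table and `sample_world` helper are hypothetical names, and the class names are the ones this PR ends up with. It shows the loot shape being sampled first and then constraining which vuln may serve as the oracle.

```python
import random

# Hypothetical table: loot shape -> vuln kinds whose exploit shape can read that loot.
ELIGIBLE = {
    "db": ["sql_injection", "ssrf", "broken_authz", "idor", "weak_credentials"],
    "file": ["path_traversal", "xxe", "command_injection", "ssti"],
}


def sample_world(seed: int, loot_weights: dict[str, int]) -> dict[str, str]:
    rng = random.Random(seed)
    shapes = sorted(loot_weights)
    # Stage 1: choose where the flag lives.
    loot = rng.choices(shapes, weights=[loot_weights[s] for s in shapes], k=1)[0]
    # Stage 2: the loot shape constrains the oracle vuln, so the pair always matches.
    oracle = rng.choice(ELIGIBLE[loot])
    return {"loot_shape": loot, "oracle_vuln": oracle}
```

Because stage 2 only draws from kinds that match the loot, a mismatched world cannot be produced in the first place, which is why no repair pass is needed.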

Adds the path_traversal class (file_read shape): a Jinja template whose handler
joins a client path onto a base dir without confinement, so '../' or an absolute
path escapes to any file in the store. The flag lives only in the in-memory file
map (never on disk, never in the db/secrets) so a stray response-leak vuln can't
shortcut the challenge. Reuses data_store kind=file (already in the ontology;
no ontology change).
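
To illustrate the unconfined join, here is a sketch under assumed names: `FILES`, `BASE_DIR`, and `read_template` are hypothetical, and the paths and contents are made up. It relies only on documented `posixpath` behavior.

```python
import posixpath
from typing import Optional

# In-memory file map standing in for the store; paths and contents are invented.
FILES = {
    "/srv/app/templates/home.html": "<h1>home</h1>",
    "/srv/loot/flag.txt": "FLAG{example}",
}
BASE_DIR = "/srv/app/templates"


def read_template(client_path: str) -> Optional[str]:
    # posixpath.join discards BASE_DIR when client_path is absolute, and
    # normpath collapses "../" segments; nothing confines the result to BASE_DIR.
    resolved = posixpath.normpath(posixpath.join(BASE_DIR, client_path))
    return FILES.get(resolved)
```

Both `read_template("../../loot/flag.txt")` and `read_template("/srv/loot/flag.txt")` reach the flag, while a benign `read_template("home.html")` behaves as intended.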

Loot shape and vuln-class mix are manifest-configurable (loot_shapes /
vuln_kinds fold into the prior, like scale), so the study can target a shape or
class. Proven end to end: a file-loot world admits, realizes, and is solved by a
real path-traversal HTTP exploit that recovers the flag (tests/
test_cyber_staged_generation.py). 720 passed, 86% coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…peline

Third exploit shape on the staged pipeline. A file loot now serves both
file_read and code_exec: command_injection concatenates a client parameter into
a diagnostic command, and an in-process interpreter resolves an injected
`cat <path>` segment against the in-memory file store — the PROCESS-backing
emulation of a shell (a container backing makes it real). Same flag, same store,
no new realizer plumbing.

The gym now spans 3 exploit shapes (response-leak, file-read, code-exec) across 5
classes. Proven end to end: a forced command_injection oracle solved by a real
`; cat` injection that recovers the flag. 721 passed, 86% coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
….rules

Close the new-code coverage gaps: edge tests for non-mapping and degenerate
loot_shapes/vuln_kinds manifest values (both fall back to db), pragma the
_forced_oracle None return (every loot shape has an eligible oracle vuln, so
admission never sees it), and demote its docstring to an inline comment
(underscore helper). Remaining uncovered lines in sampling.py are pre-existing
helper guards (#201), not introduced here.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add four classes on the existing shape pipelines: xxe (file-read) and ssti
(code-exec) on the file store; idor and weak_credentials (response-leak) on the
db store. The gym now spans 3 exploit shapes across 9 classes — sql_injection,
ssrf, broken_authz, idor, weak_credentials / path_traversal, xxe /
command_injection, ssti — each proven end to end by its own real HTTP exploit
(XXE external entity, SSTI expression, IDOR id, default credentials).

Decoy files now sample into the content-addressed graph (a sampler stage adds
benign file records to the loot store) instead of being hardcoded at realize
time, so they vary by seed; the flag-path lookups target the flag's record, not
a decoy. DESIGN.md updated: 9-class status table, the default loot mix (db:7,
file:3) rationale, and that PROCESS emulates the fs/shell while a container
backing (#252) makes them real with exec-sandbox hardening (#202).

100% branch coverage on the new classes; 727 passed, 86% coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@larstalian larstalian changed the title from "feat(cyber): staged loot→vuln generation + file-read & code-exec shapes" to "feat(cyber): staged loot→vuln generation — 9 classes across 3 exploit shapes" Jun 10, 2026
larstalian and others added 8 commits June 10, 2026 09:26
The audit found command_injection / ssti / xxe were string-matchers, not the
technique — `{{7*7}}` did nothing, a bare `SYSTEM "file://"` substring leaked,
only `; cat <bareword>` worked. An agent trained on those learns a magic string
that transfers to nothing, invalidating per-class transfer (H2).

Now each runs a real engine in-process: SSTI a sandboxed Jinja env (`{{7*7}}`->49,
`{{ config }}` dumps the store), XXE a real SAX parser with external-entity
resolution over the in-memory store (well-formed DOCTYPE/ENTITY/reference
required; a substring no longer leaks), command_injection a real `shlex`
tokenizer honoring `;|&` separators, `$()`/backtick substitution, quoting,
basename, and a broad reader set. weak_credentials is already real equality auth.
The agent must produce the real technique, so it transfers; the only thing still
emulated is an OS shell with RCE escalation, which the container backing (#252)
provides. DESIGN.md updated. 727 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…orce pool)

The audit showed flag discovery for file/exec shapes was a blind guess from a
20-element hardcoded path pool (404 on miss, decoys deliberately disjoint from
the loot dir) — the agent learns OpenRange's dictionary, not "find the file."

Now each file-loot world plants a config at a conventional path (/etc/app/
settings.conf, …) disclosing the flag's directory and backup_file path. The
exploit chain becomes real recon: read a guessable config via the vuln → pivot
to the path it names → read the flag. Verified end to end for path traversal;
the same store is read by xxe/cmdi, and ssti's context dump already exposes it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The re-audit caught that the discoverability fix was additive, not
substitutive: the flag still sat at one of 20 enumerable dir/name combos, so
brute-forcing the pool (3-9 requests) was strictly cheaper than reading the
config — an RL agent had no pressure to learn recon, and the degenerate
"memorize the dictionary" signal the audit flagged was still sitting next to it.

Add a high-entropy directory segment to the flag path (16^8 space), so the
absolute path is unenumerable and brute-forcing the dir/name pools no longer
finds it. The planted config still discloses the full path, so reading it is now
the only tractable route — discovery becomes a genuine recon capability.
Verified: brute-force of the 20-combo pool no longer hits; the read-config ->
pivot -> flag chain still works.
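
A sketch of the idea, with assumptions flagged: the `/etc/app/settings.conf` path and the `backup_file` key come from the earlier discoverability commit, while `plant_flag`, `recon`, the `/var/backups` layout, and the `backup_dir` key are hypothetical.

```python
import random

CONFIG_PATH = "/etc/app/settings.conf"


def plant_flag(seed: int, flag: str) -> dict[str, str]:
    rng = random.Random(seed)
    # 8 hex characters give 16**8 (about 4.3 billion) possible directory names.
    segment = "".join(rng.choice("0123456789abcdef") for _ in range(8))
    flag_path = f"/var/backups/{segment}/flag.txt"  # hypothetical layout
    config = f"backup_dir=/var/backups/{segment}\nbackup_file={flag_path}\n"
    return {CONFIG_PATH: config, flag_path: flag}


def recon(files: dict[str, str]) -> str:
    # Step 1: read the guessable config. Step 2: pivot to the path it names.
    for line in files[CONFIG_PATH].splitlines():
        key, _, value = line.partition("=")
        if key == "backup_file":
            return files[value]
    raise LookupError("config does not name a backup_file")
```

Guessing the directory means searching the full 16**8 space, whereas the two-read chain in `recon` always succeeds.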

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-setter)

The re-audit's #1 remaining transfer-validity blocker: each class was one
replayable payload, so an agent memorizes the string instead of learning the
technique — confounding per-class transfer (H2). Fix: sample an injection
*context* per build that forces the agent to adapt the exploit.

command_injection now samples a quoting context (unquoted / single / double) and
a real quote-aware shlex tokenizer (punctuation_chars) splits on UNQUOTED
separators while command substitution fires except inside single quotes — real
shell semantics. Verified each context requires a different correct break-out
(`; cat` unquoted, `$(cat …)` in double quotes, `'…; …; echo '` in single) and a
mismatched-context payload fails. Sets the pattern for the other 8 classes.
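
The quote-aware split can be sketched with the standard library alone. This is an illustration, not the handler's code: `split_segments` is a hypothetical helper, and it does not model command substitution.

```python
import shlex


def split_segments(command: str) -> list[list[str]]:
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    segments: list[list[str]] = [[]]
    for token in lexer:
        # Unquoted runs of ; | & arrive as their own tokens. Limitation of this
        # sketch: an argument that is nothing but a quoted separator (for
        # example ';') is not told apart from a real separator.
        if token and set(token) <= set(";|&"):
            segments.append([])
        else:
            segments[-1].append(token)
    return [seg for seg in segments if seg]
```

An unquoted `ping -c 1 example.com; cat /srv/flag.txt` yields two command segments, while wrapping the argument in single quotes keeps the `;` inside one token and yields a single segment.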

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The re-audit's last open H2 blocker: each class was one replayable payload, so an
agent memorizes the string, not the technique — a per-class transfer confound.
Since the agent only sees the HTTP surface (never server code), the fix is to
sample an injection CONTEXT per build where the correct exploit genuinely differs
and a mismatched-context payload fails:

- sql_injection: single / numeric / double quoting (real sqlite)
- command_injection: unquoted / single / double (quote-aware shlex tokenizer)
- path_traversal: absolute / ../ / ....//-past-a-naive-filter (real posixpath)
- ssti: raw / comment / expr render sink (real sandboxed Jinja)
- xxe: element-content / wrapped-root / scheme-prefix entity (real xml.sax)
- ssrf: no-filter / scheme-block / host-allowlist-bypass — and rewired from a
  dead decoy into a live oracle (resolves to an internal host -> leaks secret)
- idor: direct / base64 / prefixed reference encoding
- broken_authz: single-token / dual-factor / encoded-token forge
- weak_credentials: pair / combined / basic submission

Context params are `default()`-safe so mutation.py and bare callers still render.
The episode test fans out over all 9 classes, each solved by its own
context-appropriate exploit through the live harness; pure-function tests cover
every payload-builder branch and broken_authz's dual_factor. 733 passed, 86%.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…7%->33%)

The re-audit measured a ~67% single-payload replay floor: the 3 contexts per
class formed a permissiveness order, so one "strict" payload also solved the
more-permissive builds (e.g. `$(cat)` worked unquoted AND double-quoted). An
agent could memorize one string per class and pass ~2/3 of builds without
adapting — a residual H2 confound.
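
The arithmetic behind the floor can be sketched as follows, assuming the floor is the solve rate of the best single memorized payload over equally likely contexts. `replay_floor` and both example matrices are illustrative, not measured data.

```python
def replay_floor(matrix: list[list[bool]]) -> float:
    """matrix[p][c] is True when the payload written for context p also solves context c."""
    contexts = len(matrix[0])
    return max(sum(row) for row in matrix) / contexts


# Illustrative leaky matrix: the payload for context 1 also solves context 0.
leaky = [
    [True, False, False],
    [True, True, False],
    [False, False, True],
]

# Mutually exclusive contexts: each payload solves only its own build.
diagonal = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
]
```

`replay_floor(leaky)` is 2/3 and `replay_floor(diagonal)` is 1/3, the two ends of the 67%->33% change.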

Each leaky class's handler now ENFORCES its context so the 3 are mutually
exclusive (a payload for one build fails the other two):
- command_injection separator/substitution/quoted — each strips the others'
  vectors ($()/backticks vs `;|&` separators vs quote-wrapping)
- path_traversal absolute_only/relative/dotdot_filter — strip-to-convergence vs
  no-strip+re-anchor vs strip-once+re-anchor
- ssti attribute/comment/expr — distinct break-outs, each inert in the others
- broken_authz single/dual/encoded — single & encoded reject a foreign confirm
  param; dual requires it; encoded requires the hashed value
- ssrf scheme_block/host_allowlist/decimal_ip — three disjoint evasions of the
  same internal host (also retires the permissive no_filter)
- xxe (already done) element/wrapped-root/scheme-prefix
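
For the path_traversal contexts, the difference between stripping once and stripping to convergence can be shown in a few lines. The helper names are hypothetical and re-anchoring onto the base dir is not modeled here.

```python
def strip_once(path: str) -> str:
    # A naive filter: remove "../" in a single pass.
    return path.replace("../", "")


def strip_to_convergence(path: str) -> str:
    # Repeat the pass until nothing changes, so nested sequences cannot reassemble.
    while "../" in path:
        path = path.replace("../", "")
    return path
```

A `....//` prefix survives the single pass as `../` but is removed entirely by the converging filter, so a payload built for one handler fails against the other.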

A live 3x3 replay matrix per class confirms it: 53/54 cells correct — 5 classes
perfectly diagonal, xxe with one inherent residual (reflect-any accepts the
specific-root payload; left distinct rather than collapsed). Floor ~33%.
733 passed, 86%, ruff/mypy clean. DESIGN.md updated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… floor

The final re-audit confirmed 8/9 classes at the 33% replay floor but found xxe
still at 66.7%: the wrapped_root payload solved element_content builds on
300/300 seeds, because element_content reflected ANY root and so accepted a
strict superset of what wrapped_root accepts.

Close it without collapsing the two into one technique: element_content now
reflects only the document root's DIRECT (depth-1) text, while the wrapped_root
payload nests the entity a level deeper — distinct injection positions
(top-level vs nested), not a root-name swap. The live 3x3 matrix is now fully
diagonal (0 off-diagonal leaks), so all 9 classes sit at the ~33% single-payload
floor. DESIGN.md records the one remaining threat-to-validity (sql_injection /
idor / weak_credentials contexts are disjoint serializations of one skill).

733 passed, 86%, ruff/mypy clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
larstalian and others added 4 commits June 10, 2026 12:55
The re-audit's secondary residual: a wrong-technique attempt was indistinguishable
from a benign miss (path traversal both 404; ssti both empty 200), so the agent
got no signal it was hitting the right vuln class with the wrong technique.

- path_traversal: a neutralized traversal attempt now returns 403 ("path not
  permitted") vs 404 for a benign filename miss; base dirs sampled at varied
  DEPTH (2-5) so the relative payload's "../" count is build-specific structure.
- command_injection: a stripped injection (shell metacharacters) returns
  "input rejected" vs the benign diagnostic echo.
- ssti: a swallowed template injection returns "template directive ignored" vs a
  plain render.
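
A sketch of the path_traversal signal, under loud assumptions: `miss_response`, the traversal heuristic, and the "not found" body are hypothetical; only the 403/404 split and the "path not permitted" text come from this commit.

```python
def miss_response(client_path: str) -> tuple[int, str]:
    # Called only when the request did not resolve to a readable file.
    # Assumed heuristic: dot-dot segments or a leading slash mark a traversal attempt.
    looks_like_traversal = ".." in client_path or client_path.startswith("/")
    if looks_like_traversal:
        return 403, "path not permitted"
    return 404, "not found"
```

Only non-leak responses pass through this function, so which payloads succeed is unaffected.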

All three reshape only the NON-leak responses, so the mutual-exclusivity matrix
is unchanged (re-verified: cmdi/path/ssti still 0 off-diagonal). Tests for the
path and cmdi feedback signals. DESIGN.md documents the feedback + the honest
structural-variety asymmetry (SQLi embeds table+column in the payload; file-read/
cmd-exec carry their diversity in three distinct techniques, not payload shape).

734 passed, 86%, ruff/mypy clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nable)

Live-agent eval (a real claude agent driven through the actual episode harness,
not scripted oracles) showed the gym was NOT trainable as-is: at standard tier a
strong agent solved only ~1-2 of 9 classes — the thin instruction left it unable
to classify the vuln, and the discovery recon chain made file-loot a 2-stage
exploit it couldn't walk (command_injection failed even with rich hints and a
20-minute budget).

Add a `difficulty` manifest knob:
- `easy`/guided: the pentest instruction names the vuln class, the flag's exact
  location, and the sampled context, plus a concrete one-step payload recipe —
  so the world is a single exploit a real agent can actually solve (bootstrapping
  / curriculum floor). The core skill (craft + execute the exploit) remains.
- `standard` (default, unchanged): the blind, recon-required, validity-hardened
  world used for the H2 transfer measurement.
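
How the knob might shape the instruction, as a sketch only: `TIER_ALIASES`, `build_instruction`, the instruction wording, and the fallback to `standard` for unknown values are all assumptions, not the pack's actual API.

```python
# Hypothetical alias table: "guided" is treated as another name for "easy".
TIER_ALIASES = {"easy": "easy", "guided": "easy", "standard": "standard"}


def build_instruction(
    difficulty: str, vuln_class: str, flag_path: str, context: str, recipe: str
) -> str:
    # Assumption: unknown values fall back to the default tier.
    tier = TIER_ALIASES.get(difficulty, "standard")
    base = "Retrieve the flag from the target web application."
    if tier == "standard":
        return base  # blind: no class, location, or context is disclosed
    return (
        f"{base} Vulnerability class: {vuln_class}. Flag location: {flag_path}. "
        f"Injection context: {context}. Payload recipe: {recipe}"
    )
```

In both tiers the agent still has to send the working request; the easy tier only removes the classification and recon steps.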

Live-validated: the exact command_injection world that failed with the thin
instruction, with rich hints, AND with a 20-minute budget is solved at easy tier
in under 500s. Tests cover every per-class hint branch, the tier aliases, and
the degenerate-graph guards. 738 passed, 87% cov.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lity tier

broken_authz was the lone easy-tier failure (0/2): the trusted value is a query
param named like a header (X-User-Role), and the dual_factor/encoded_token hints
omitted 'query parameter', so the agent tried it as an HTTP header. Clarify all
three hints; both contexts now solve live. Easy-tier matrix is 18/18 across all 9
classes (vs ~3/22 standard). DESIGN.md documents the live-agent finding, the
validity-vs-trainability tradeoff, and the standard/easy difficulty tiers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Audited every comment and docstring added by the staged-generation / difficulty
work against .rules: dropped references to the development process and research
framing (the audit, H2 / transfer confound, replay floor, "the agent must
adapt/replay", validity-hardened, "was X -> now Y"), removed BUG: tags and
name-restating docstrings on underscore helpers, and deleted comments that only
restated the code. Kept the load-bearing WHY (hidden constraints, invariants like
SQLite double-quote-as-string and the secret-never-on-disk rule, and the terse
deferred container-backing note). Comments/docstrings only — 738 passed, no
behavior change; all templates still render.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@larstalian larstalian merged commit f499c5b into main Jun 11, 2026
2 checks passed
@larstalian larstalian deleted the feat/cyber-staged-generation branch June 11, 2026 15:09
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 11, 2026

1 participant