feat(cyber): detect leaked secrets in responses, not just submitted flags #259

Merged
larstalian merged 7 commits into main from feat/cyber-verification-ceiling
Jun 12, 2026
@larstalian larstalian commented Jun 12, 2026


What this does

The gym builds a vulnerable web app with a hidden secret (the "flag"), and an agent tries to extract it by exploiting the app. Until now, the gym graded the agent only by checking whether it submitted the correct flag.

This adds a second check: the gym now also detects when the secret actually leaks into a response the app sends back, whether or not the agent submits it. It records which secret leaked by internal id, never by value, so the log can't be read to cheat. It also catches secrets returned in common encodings (base64, hex, URL percent-encoding).

Why

Eventually an LLM will generate the worlds, and those worlds will have no single planted flag to submit, so "did sensitive data leave the box?" becomes the natural way to grade a breach automatically. This is the groundwork for that.

Honest scope: today there is only one secret (the flag), so this mostly overlaps with the existing submit check, and it does not change scoring or reward; the signal is only made available. The real payoff comes with the next step (LLM-generated and bigger worlds).

Follow-ups (scaling to big multi-service worlds)

Testing

  • A real exploit episode confirms the leak is recorded end to end.
  • Tests cover the detector, the integrity property (ids logged, not values), and that scoring is unchanged.
  • ruff + mypy clean.

…ation indictment

Gym change — the §8.3 "any HIDDEN value leaked" consequence oracle, wired live:
- consequence.py: detect_leak / guarded_values over HIDDEN value_ref nodes — the
  independent leak verifier, length-floored so a short value can't false-positive.
- codegen bakes guarded_values(graph) into seed.json; the rendered app scans each
  response boundary and logs leaked node IDS only (never the secret value); realize
  surfaces final_state["leaked_secret_ids"]; check_success consumes it in `reason`.
- Boundary held: success/subgoals — and thus the trainer's averaged reward rungs —
  are unchanged. Rewarding on the leak is a trainer-coordinated follow-up (#198).
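
A minimal sketch of the raw-substring check described in the bullets above, not the actual consequence.py. The `Node` shape, the `MIN_GUARDED_LENGTH` floor of 8, and both function signatures are assumptions made for illustration; only the names `guarded_values`, `detect_leak`, and `leaked_secret_ids` come from this commit message.

```python
from dataclasses import dataclass

# Hypothetical floor: values shorter than this are never guarded, so a
# short value such as "42" cannot match by accident in ordinary output.
MIN_GUARDED_LENGTH = 8


@dataclass(frozen=True)
class Node:
    """Assumed stand-in for a world-graph node that carries a value."""
    node_id: str
    visibility: str  # "HIDDEN" or "PUBLIC" in this sketch
    value: str


def guarded_values(nodes: list[Node]) -> dict[str, str]:
    """Map node id -> secret value for HIDDEN nodes that pass the length floor."""
    return {
        n.node_id: n.value
        for n in nodes
        if n.visibility == "HIDDEN" and len(n.value) >= MIN_GUARDED_LENGTH
    }


def detect_leak(response_body: str, guarded: dict[str, str]) -> list[str]:
    """Return the ids (never the values) of guarded secrets found in the body."""
    return sorted(
        node_id for node_id, value in guarded.items() if value in response_body
    )


nodes = [
    Node("flag", "HIDDEN", "FLAG{s3cr3t_value}"),
    Node("pin", "HIDDEN", "42"),  # below the floor, so not guarded
    Node("banner", "PUBLIC", "welcome to the shop"),
]
guarded = guarded_values(nodes)

# Only ids reach the recorded state, so the log cannot be read to recover the flag.
final_state = {
    "leaked_secret_ids": detect_leak("id=42 FLAG{s3cr3t_value}", guarded)
}
```

In this sketch the secret value only ever appears on the scanning side; whatever is logged or surfaced downstream carries node ids alone.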

Research tooling — experiments/indictment:
- An independent harness running LLM-generated worlds to measure the admit-gap:
  worlds a self-verifying loop ships that independent probes reject as trivial or
  unfaithful. 89 worlds across 4 classes; gap is a small, consistent ~2-4%. The
  harder finding: a reliable independent verifier is itself hard to build — false
  negatives AND positives, each found by hand-auditing. See RESULTS.md.

DESIGN.md §8 documents the verification ladder, the spine, and the honest results
(incl. the raw-substring oracle's known encoded-leak limitation).

Live-validated end to end (a real SQLi episode records the leak by node id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian marked this pull request as draft June 12, 2026 03:05
larstalian and others added 3 commits June 11, 2026 22:12
 #259 review)

The adversarial review of the §8.3 spine flagged two real (latent) limitations in
the raw-substring oracle; close them:
- Encoded exfil: detect_leak and the rendered app scanner now search for each guarded
  value AND its cheap reversible encodings (base64 / hex / percent) by encoding the
  needle, so an encoded leak is caught, not only the literal form. gzip / binary /
  multibyte splits remain out (documented).
- Containment: detect_leak drops a guarded value that is a proper substring of another
  leaked value (offline / grader path; the live per-response signal stays raw, since
  the scanner logs ids not values).
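
The two fixes above can be sketched as follows. This is an illustration under stated assumptions, not the pack's code: `value_variants` and `detect_leak` are named in this PR, but their signatures, the exact variant set, and the containment rule as written here are guesses at the described behaviour.

```python
import base64
from urllib.parse import quote


def value_variants(value: str) -> set[str]:
    """The literal value plus cheap reversible encodings of it (assumed set)."""
    raw = value.encode("utf-8")
    return {
        value,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(value, safe=""),  # percent-encoding, reserved characters included
    }


def detect_leak(body: str, guarded: dict[str, str]) -> list[str]:
    """Return ids of guarded values found in `body`, literally or encoded."""
    leaked = {
        node_id: value
        for node_id, value in guarded.items()
        if any(variant in body for variant in value_variants(value))
    }
    # Containment: drop a value that is a proper substring of another leaked
    # value, so one leak is not reported under two ids.
    return sorted(
        node_id
        for node_id, value in leaked.items()
        if not any(value != other and value in other for other in leaked.values())
    )


guarded = {"flag": "FLAG{s3cr3t_value}", "fragment": "s3cr3t_value"}
encoded_body = "token=" + base64.b64encode(b"FLAG{s3cr3t_value}").decode("ascii")
hex_body = "dump: " + b"FLAG{s3cr3t_value}".hex()
```

Encoding the needle keeps the scan a plain substring search. One caveat of this sketch: a base64 match requires the value to have been encoded on its own or on a 3-byte boundary, since base64 output depends on the alignment of the input.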

An agreement test pins the rendered app scanner to consequence.value_variants so live
and offline verdicts can't drift. DESIGN.md §8.4 corrected (no longer "deferred").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
experiments/ (the indictment harness + 89 generated test worlds + writeup) was a
one-off validation run, not gym/pack code — it doesn't belong in the repo. Its
findings stay documented in DESIGN.md §8.10 as prose; the pack verifier
(consequence.py) is unaffected — it lives in the pack, where it belongs. The
coupled tests/test_indictment_harness.py goes with it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop references that .rules forbids in code comments: a section pointer
("§8.3 spine"), a forward/"future" reference ("emergent mode"), and a docstring
naming a specific test (rot-prone). The remaining comments are non-obvious WHY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
)

The realization primitive of the emergent-mode ladder (DESIGN.md §9). Today's
admission is structural (a graph-path check); an LLM-realized handler can be wrong, so
it is admitted DYNAMICALLY — run the intended exploit + a benign request and let the
consequence verifier decide: the exploit must leak the flag, the benign request must
not. Accept iff solvable and not trivial.
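
The accept rule above can be sketched as a pure function over two leak verdicts. The `Admission` enum, the parameter names, and the signature are hypothetical; only the name `classify_admission` and the rule "accept iff solvable and not trivial" come from this commit message.

```python
from enum import Enum


class Admission(Enum):
    """Possible verdicts of the dynamic admission check (names assumed)."""
    ACCEPT = "accept"
    TRIVIAL = "trivial"
    NOT_SOLVABLE = "not_solvable"


def classify_admission(
    exploit_leaked_ids: list[str],
    benign_leaked_ids: list[str],
    flag_id: str,
) -> Admission:
    """Decide admission from the ids leaked by the exploit and benign runs."""
    if flag_id in benign_leaked_ids:
        # The flag comes back without any exploit: nothing to solve.
        return Admission.TRIVIAL
    if flag_id not in exploit_leaked_ids:
        # The intended exploit does not reach the flag.
        return Admission.NOT_SOLVABLE
    return Admission.ACCEPT


verdict = classify_admission(["flag"], [], flag_id="flag")
```

Keeping the verdict free of I/O means it can be tested without running an episode; the caller supplies the two leak lists after executing the exploit and the benign request.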

- realize_admit.py (pack): the pure pieces — classify_admission (the verdict, over
  consequence.detect_leak) + cmdi_exploit_and_benign (the per-class exploit oracle).
  Running an episode is a host concern (packs must not import openrange), so the
  orchestration lives with the caller, not the pack.
- codegen: a vuln node's realized_handler stands in for the template — the hook the
  LLM realization writes through.
- DESIGN.md §9: the M0->M3 realization ladder (procedural architects, LLM realizes,
  admission verifies, freeze), mapping M1/M2/M3 to #252 / #212 / #235 / #189.

Validated end to end: a faithful command_injection world is accepted; trivial and
not-solvable verdicts are exercised. 100% branch coverage; import-boundary clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian changed the title feat(cyber): live leak/consequence oracle (§8.3 spine) + self-verification indictment feat(cyber): detect leaked secrets in responses, not just submitted flags Jun 12, 2026
larstalian and others added 2 commits June 11, 2026 22:53
CI runs mypy over tests/; the rendered-app exec helper retrieved values from a
dict[str, object] namespace, so calling them tripped "object not callable".
Type the namespace dict[str, Any] so the pulled-out functions are callable.
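
A small reproduction of the typing issue, assuming a helper shaped roughly like the one described; `load_namespace` and `SOURCE` are made-up names for illustration.

```python
from typing import Any

# Stand-in for rendered app source; the real helper execs generated code.
SOURCE = "def double(x):\n    return 2 * x\n"


def load_namespace(source: str) -> dict[str, Any]:
    """Exec `source` and return the resulting namespace.

    With the annotation dict[str, object], mypy rejects calling any value
    pulled out of the dict, because `object` is not callable. Any opts the
    values out of that check.
    """
    namespace: dict[str, Any] = {}
    exec(source, namespace)
    return namespace


double = load_namespace(SOURCE)["double"]
result = double(21)
```

The runtime behaviour is identical under either annotation; only the static check differs.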

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Inject a stand-in "realized" handler into a command-injection world and run it through
codegen -> runtime -> the admission gate:
- a different-but-real handler (splits on ';', cats the file) is ACCEPTED — the gate
  lets in varied implementations, which is the diversity M0 is for;
- a handler that returns the flag on any request is REJECTED as trivial;
- a handler that never leaks is REJECTED as not solvable.

Confirms the codegen hook and the gate work together on a live episode. The live
LLM-writes-the-handler step plugs in on top (non-deterministic, so a demo not a test).
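
The three cases above can be illustrated with toy handlers. Everything here is a stand-in: the in-memory `FILES` table, the handler signatures, the request strings, and the inlined `gate` are assumptions, and no real codegen or runtime is involved.

```python
FLAG = "FLAG{example_only}"
FILES = {"/flag.txt": FLAG, "/motd.txt": "welcome"}  # fake filesystem


def varied_handler(host: str) -> str:
    """Toy command injection: text after ';' is treated as a second command."""
    _target, _, injected = host.partition(";")
    output = "pong"
    injected = injected.strip()
    if injected.startswith("cat "):
        output += "\n" + FILES.get(injected[4:].strip(), "")
    return output


def trivial_handler(host: str) -> str:
    """Returns the flag on every request."""
    return FLAG


def silent_handler(host: str) -> str:
    """Never leaks, whatever the input."""
    return "pong"


def gate(handler) -> str:
    """Run one exploit and one benign request, then classify the handler."""
    exploit_leaks = FLAG in handler("127.0.0.1; cat /flag.txt")
    benign_leaks = FLAG in handler("127.0.0.1")
    if benign_leaks:
        return "trivial"
    if not exploit_leaks:
        return "not solvable"
    return "accepted"


verdicts = {
    h.__name__: gate(h) for h in (varied_handler, trivial_handler, silent_handler)
}
```

The gate inspects behaviour only, so any handler that leaks under the exploit and stays quiet under the benign request passes, regardless of how it is written.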

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian marked this pull request as ready for review June 12, 2026 04:03
@larstalian larstalian merged commit ea9dc8c into main Jun 12, 2026
2 checks passed
@larstalian larstalian deleted the feat/cyber-verification-ceiling branch June 12, 2026 04:03
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 12, 2026
@larstalian larstalian restored the feat/cyber-verification-ceiling branch June 12, 2026 04:03