feat(cyber): detect leaked secrets in responses, not just submitted flags by larstalian · Pull Request #259 · vecna-labs/open-range

larstalian · 2026-06-12T03:02:13Z

What this does

The gym builds a vulnerable web app with a hidden secret (the "flag") and an agent tries to hack it out. To grade the agent, the gym used to only check whether the agent submitted the right flag.

This adds a second check: the gym now also notices when the secret actually leaks into a response the app sends back — whether or not the agent submits it. It records which secret leaked by an internal id, never the secret value itself, so the log can't be read to cheat. It also catches secrets returned in common encodings (base64 / hex / url).
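As a rough illustration of the check described above, here is a minimal sketch of scanning a response for a secret in its literal, base64, hex, and URL-encoded forms and reporting only its id. The names `encoded_variants` and `leaked_ids`, the `guarded` mapping, and the sample flag are hypothetical and are not taken from the PR's code.

```python
import base64
from urllib.parse import quote


def encoded_variants(secret: str) -> set[str]:
    """Return the literal secret plus a few cheap reversible encodings of it."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(secret, safe=""),
    }


def leaked_ids(response_body: str, guarded: dict[str, str]) -> list[str]:
    """Return the ids (never the values) of guarded secrets found in a response."""
    return sorted(
        secret_id
        for secret_id, secret in guarded.items()
        if any(variant in response_body for variant in encoded_variants(secret))
    )


# Hypothetical data: one guarded secret, keyed by an internal id.
guarded = {"node-7": "FLAG{s3cret-value}"}
body = "debug: " + base64.b64encode(b"FLAG{s3cret-value}").decode("ascii")
print(leaked_ids(body, guarded))
```

Encoding the needle like this only matches when the secret is encoded on its own; a secret sitting at an unaligned offset inside a larger base64 payload would not be caught by this simple version.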

Why

Eventually an LLM will generate the worlds, and those worlds will have no single planted flag to submit, so "did sensitive data leave the box?" becomes the natural way to grade a breach automatically. This is the groundwork for that.

Honest scope: today there is only one secret (the flag), so this mostly overlaps with the existing submit check, and it does not change scoring or reward — the signal is just made available. Its real payoff is the next step (LLM-generated and bigger worlds).

Follow-ups (scaling to big multi-service worlds)

Cyber gym: speed up leak detection when there are many secrets #262 — speed up the scan when there are many secrets
Cyber gym: define what counts as a leak in multi-service worlds #263 — define what counts as a leak across services
Cyber gym: scope leak detection per service for big lazily-built worlds #264 — scope the scan per service for lazily-built worlds

Testing

A real exploit episode confirms the leak is recorded end to end.
Tests cover the detector, the integrity property (ids logged, not values), and that scoring is unchanged.
ruff + mypy clean.

…ation indictment

Gym change: the §8.3 "any HIDDEN value leaked" consequence oracle, wired live:
- consequence.py: detect_leak / guarded_values over HIDDEN value_ref nodes. This is the independent leak verifier, length-floored so a short value can't false-positive.
- codegen bakes guarded_values(graph) into seed.json; the rendered app scans each response boundary and logs leaked node IDs only (never the secret value); realize surfaces final_state["leaked_secret_ids"]; check_success consumes it in `reason`.
- Boundary held: success/subgoals, and thus the trainer's averaged reward rungs, are unchanged. Rewarding on the leak is a trainer-coordinated follow-up (#198).

Research tooling (experiments/indictment):
- An independent harness running LLM-generated worlds to measure the admit-gap: worlds a self-verifying loop ships that independent probes reject as trivial or unfaithful. 89 worlds across 4 classes; the gap is a small, consistent ~2-4%. The harder finding: a reliable independent verifier is itself hard to build, with both false negatives and false positives, each found by hand-auditing. See RESULTS.md.

DESIGN.md §8 documents the verification ladder, the spine, and the honest results (incl. the raw-substring oracle's known encoded-leak limitation).

Live-validated end to end (a real SQLi episode records the leak by node id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
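The length floor and the ids-only log from this commit can be sketched roughly as follows. The node layout, the `MIN_GUARDED_LENGTH` value, and the function names are assumptions for illustration. Only the `final_state["leaked_secret_ids"]` key comes from the commit message.

```python
MIN_GUARDED_LENGTH = 8  # hypothetical floor; the real threshold is not stated in the PR


def guarded_values_sketch(nodes: list[dict[str, str]]) -> dict[str, str]:
    """Map node id -> secret for HIDDEN nodes whose value is long enough to scan for."""
    return {
        node["id"]: node["value"]
        for node in nodes
        if node["visibility"] == "HIDDEN" and len(node["value"]) >= MIN_GUARDED_LENGTH
    }


def record_leaks(body: str, guarded: dict[str, str], leaked: set[str]) -> None:
    """Add the id of every guarded value found in `body`; the value is never stored."""
    for node_id, value in guarded.items():
        if value in body:
            leaked.add(node_id)


# Hypothetical graph nodes (the real graph schema is not shown in the PR).
nodes = [
    {"id": "n1", "visibility": "HIDDEN", "value": "FLAG{example-only}"},
    {"id": "n2", "visibility": "HIDDEN", "value": "ok"},  # below the floor, not guarded
    {"id": "n3", "visibility": "PUBLIC", "value": "welcome-banner"},
]
guarded = guarded_values_sketch(nodes)

seen: set[str] = set()
for body in ["status: ok", "row: FLAG{example-only}"]:
    record_leaks(body, guarded, seen)

final_state = {"leaked_secret_ids": sorted(seen)}
print(final_state)
```

Without the floor, the two-character value `ok` would match the harmless `status: ok` response and be reported as a leak.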

#259 review)

The adversarial review of the §8.3 spine flagged two real (latent) limitations in the raw-substring oracle; close them:
- Encoded exfil: detect_leak and the rendered app scanner now search for each guarded value AND its cheap reversible encodings (base64 / hex / percent) by encoding the needle, so an encoded leak is caught, not only the literal form. gzip / binary / multibyte splits remain out (documented).
- Containment: detect_leak drops a guarded value that is a proper substring of another leaked value (offline / grader path; the live per-response signal stays raw, since the scanner logs ids not values).

An agreement test pins the rendered app scanner to consequence.value_variants so live and offline verdicts can't drift. DESIGN.md §8.4 corrected (no longer "deferred").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
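The containment rule can be pictured with a small sketch. The `drop_contained` helper and the sample values are hypothetical. The actual detect_leak signature is not shown in this PR.

```python
def drop_contained(leaked: dict[str, str]) -> dict[str, str]:
    """Drop any leaked value that is a proper substring of another leaked value."""
    return {
        secret_id: value
        for secret_id, value in leaked.items()
        if not any(value != other and value in other for other in leaked.values())
    }


# Hypothetical leak set: "short" only matched because it sits inside "long".
leaked = {"short": "hunter2-key", "long": "hunter2-key-admin-token", "other": "zzz-unrelated"}
print(sorted(drop_contained(leaked)))
```

This needs the secret values themselves, which fits the commit's note that containment runs only on the offline / grader path while the live scanner, which logs ids only, stays raw.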

experiments/ (the indictment harness + 89 generated test worlds + writeup) was a one-off validation run, not gym/pack code; it doesn't belong in the repo. Its findings stay documented in DESIGN.md §8.10 as prose; the pack verifier (consequence.py) is unaffected and lives in the pack, where it belongs. The coupled tests/test_indictment_harness.py goes with it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop references that .rules forbids in code comments: a section pointer ("§8.3 spine"), a forward/"future" reference ("emergent mode"), and a docstring naming a specific test (rot-prone). The remaining comments are non-obvious WHY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

) The realization primitive of the emergent-mode ladder (DESIGN.md §9).

Today's admission is structural (a graph-path check); an LLM-realized handler can be wrong, so it is admitted DYNAMICALLY: run the intended exploit + a benign request and let the consequence verifier decide. The exploit must leak the flag; the benign request must not. Accept iff solvable and not trivial.
- realize_admit.py (pack): the pure pieces, classify_admission (the verdict, over consequence.detect_leak) + cmdi_exploit_and_benign (the per-class exploit oracle). Running an episode is a host concern (packs must not import openrange), so the orchestration lives with the caller, not the pack.
- codegen: a vuln node's realized_handler stands in for the template. This is the hook the LLM realization writes through.
- DESIGN.md §9: the M0->M3 realization ladder (procedural architects, LLM realizes, admission verifies, freeze), mapping M1/M2/M3 to #252 / #212 / #235 / #189.

Validated end to end: a faithful command_injection world is accepted; trivial and not-solvable verdicts are exercised. 100% branch coverage; import-boundary clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
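The "accept iff solvable and not trivial" rule can be sketched as a pure function over the leak results of the two requests. The `Verdict` enum, the `classify_sketch` name, and its parameters are assumptions; the real classify_admission signature is not given in the commit. The order in which the two rejections are checked is also a guess.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    TRIVIAL = "trivial"
    NOT_SOLVABLE = "not_solvable"


def classify_sketch(flag_id: str, exploit_leaked: set[str], benign_leaked: set[str]) -> Verdict:
    """Decide admission from the ids leaked by the exploit and by a benign request."""
    if flag_id in benign_leaked:
        # The flag comes out without any attack, so the world is trivial.
        return Verdict.TRIVIAL
    if flag_id not in exploit_leaked:
        # The intended exploit does not reach the flag, so the world is not solvable.
        return Verdict.NOT_SOLVABLE
    return Verdict.ACCEPT


print(classify_sketch("flag", exploit_leaked={"flag"}, benign_leaked=set()))
```

Keeping the verdict free of I/O matches the split described above: running the episode stays with the host, and the pack only judges the outcome.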

CI runs mypy over tests/; the rendered-app exec helper retrieved values from a dict[str, object] namespace, so calling them tripped "object not callable". Type the namespace as dict[str, Any] so the pulled-out functions are callable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
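A minimal reproduction of this kind of typing fix, assuming a helper that `exec`s generated source and pulls a function out of the namespace. The source string and the `double` function are made up for illustration.

```python
from typing import Any

# Hypothetical generated source, standing in for the rendered app module.
source = "def double(x):\n    return x * 2\n"

# Annotated as dict[str, object], mypy rejects the call below because a value
# of type `object` is not callable. dict[str, Any] opts out of that check.
namespace: dict[str, Any] = {}
exec(source, namespace)

double = namespace["double"]
print(double(21))
```

The trade-off is that `Any` switches off type checking for everything pulled from the namespace, which is usually acceptable in test helpers that exercise generated code.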

Inject a stand-in "realized" handler into a command-injection world and run it through codegen -> runtime -> the admission gate:
- a different-but-real handler (splits on ';', cats the file) is ACCEPTED: the gate lets in varied implementations, which is the diversity M0 is for;
- a handler that returns the flag on any request is REJECTED as trivial;
- a handler that never leaks is REJECTED as not solvable.

Confirms the codegen hook and the gate work together on a live episode. The live LLM-writes-the-handler step plugs in on top (non-deterministic, so a demo not a test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
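The three cases in this commit can be imitated with simulated handlers and an in-memory file table, with no real shell or codegen involved. Every name here (`FILES`, `faithful`, `trivial`, `never_leaks`, `gate`), the flag value, and the request strings are hypothetical stand-ins.

```python
from typing import Callable

FLAG = "FLAG{example-only}"  # made-up flag value
FILES = {"/flag.txt": FLAG}  # in-memory stand-in for the app's filesystem

Handler = Callable[[str], str]


def faithful(arg: str) -> str:
    """Simulated injectable handler: splits on ';' and honors `cat <path>`."""
    out = []
    for part in arg.split(";"):
        part = part.strip()
        if part.startswith("cat "):
            out.append(FILES.get(part[4:].strip(), ""))
        else:
            out.append(f"pong {part}")
    return "\n".join(out)


def trivial(arg: str) -> str:
    """Returns the flag on any request."""
    return FLAG


def never_leaks(arg: str) -> str:
    """Ignores injected commands entirely."""
    return "pong"


def gate(handler: Handler) -> str:
    """Run one exploit and one benign request, then judge by what leaked."""
    exploit_leaks = FLAG in handler("127.0.0.1; cat /flag.txt")
    benign_leaks = FLAG in handler("127.0.0.1")
    if benign_leaks:
        return "rejected: trivial"
    if not exploit_leaks:
        return "rejected: not solvable"
    return "accepted"


for handler in (faithful, trivial, never_leaks):
    print(handler.__name__, "->", gate(handler))
```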

larstalian marked this pull request as draft June 12, 2026 03:05

larstalian and others added 3 commits June 11, 2026 22:12

This was referenced Jun 12, 2026

M0: LLM node-realization behind a dynamic admission gate (cyber pack) #260

Open

Epic: emergent mode at scale — LLM-realized services on a procedural graph (the realization ladder) #261

Open

This was referenced Jun 12, 2026

Cyber gym: speed up leak detection when there are many secrets #262

Open

Cyber gym: define what counts as a leak in multi-service worlds #263

Open

Cyber gym: scope leak detection per service for big lazily-built worlds #264

Open

larstalian changed the title ~~feat(cyber): live leak/consequence oracle (§8.3 spine) + self-verification indictment~~ feat(cyber): detect leaked secrets in responses, not just submitted flags Jun 12, 2026

larstalian and others added 2 commits June 11, 2026 22:53

larstalian marked this pull request as ready for review June 12, 2026 04:03

larstalian merged commit ea9dc8c into main Jun 12, 2026
2 checks passed

larstalian deleted the feat/cyber-verification-ceiling branch June 12, 2026 04:03

github-actions Bot locked and limited conversation to collaborators Jun 12, 2026

larstalian restored the feat/cyber-verification-ceiling branch June 12, 2026 04:03

larstalian merged 7 commits into main from feat/cyber-verification-ceiling
