feat(cyber): detect leaked secrets in responses, not just submitted flags #259
Merged
…ation indictment

Gym change: the §8.3 "any HIDDEN value leaked" consequence oracle, wired live:
- consequence.py: detect_leak / guarded_values over HIDDEN value_ref nodes. This is the independent leak verifier, length-floored so a short value can't false-positive.
- codegen bakes guarded_values(graph) into seed.json; the rendered app scans each response boundary and logs leaked node IDs only (never the secret value); realize surfaces final_state["leaked_secret_ids"]; check_success consumes it in `reason`.
- Boundary held: success/subgoals, and thus the trainer's averaged reward rungs, are unchanged. Rewarding on the leak is a trainer-coordinated follow-up (#198).

Research tooling (experiments/indictment):
- An independent harness running LLM-generated worlds to measure the admit-gap: worlds a self-verifying loop ships that independent probes reject as trivial or unfaithful. 89 worlds across 4 classes; gap is a small, consistent ~2-4%. The harder finding: a reliable independent verifier is itself hard to build, with false negatives AND positives, each found by hand-auditing. See RESULTS.md.

DESIGN.md §8 documents the verification ladder, the spine, and the honest results (incl. the raw-substring oracle's known encoded-leak limitation).

Live-validated end to end (a real SQLi episode records the leak by node id).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
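The length floor and the ids-only logging described in this commit can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `_sketch` function names, their signatures, the floor of 8 characters, and the logger name are hypothetical and are not taken from consequence.py or the rendered app.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("leak_scanner_sketch")  # hypothetical logger name

ASSUMED_MIN_LEN = 8  # assumed floor; the PR does not state the real threshold


def guarded_values_sketch(hidden: Dict[str, str], floor: int = ASSUMED_MIN_LEN) -> Dict[str, str]:
    """Keep only hidden values long enough that a substring match is meaningful."""
    return {node_id: value for node_id, value in hidden.items() if len(value) >= floor}


def scan_response_sketch(body: str, guarded: Dict[str, str]) -> List[str]:
    """Return the ids of guarded values found verbatim in one response body."""
    leaked = sorted(node_id for node_id, value in guarded.items() if value in body)
    if leaked:
        # Only node ids are logged, never the secret values themselves.
        logger.warning("leaked_secret_ids=%s", leaked)
    return leaked


hidden = {"flag_node": "FLAG{s3cr3t_value}", "pin_node": "42"}
guarded = guarded_values_sketch(hidden)
benign = scan_response_sketch("order 42 shipped", guarded)
exploited = scan_response_sketch("row: FLAG{s3cr3t_value}", guarded)
```

Without the floor, the short value `42` would match the benign response above and be reported as a leak.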
#259 review)

The adversarial review of the §8.3 spine flagged two real (latent) limitations in the raw-substring oracle; close them:
- Encoded exfil: detect_leak and the rendered app scanner now search for each guarded value AND its cheap reversible encodings (base64 / hex / percent) by encoding the needle, so an encoded leak is caught, not only the literal form. gzip / binary / multibyte splits remain out (documented).
- Containment: detect_leak drops a guarded value that is a proper substring of another leaked value (offline / grader path; the live per-response signal stays raw, since the scanner logs ids not values).

An agreement test pins the rendered app scanner to consequence.value_variants so live and offline verdicts can't drift. DESIGN.md §8.4 corrected (no longer "deferred").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
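A rough sketch of the two fixes above, encoding the needle and dropping contained values. The `_sketch` helpers and their signatures are hypothetical; the actual consequence.value_variants and detect_leak may differ in both shape and detail.

```python
import base64
from typing import Dict, List, Set
from urllib.parse import quote


def value_variants_sketch(value: str) -> Set[str]:
    """The literal value plus its base64, hex, and percent-encoded forms."""
    raw = value.encode("utf-8")
    return {
        value,
        base64.b64encode(raw).decode("ascii"),
        raw.hex(),
        quote(value, safe=""),
    }


def leaked_ids_sketch(body: str, guarded: Dict[str, str]) -> List[str]:
    """Ids whose value, in any variant, appears in the response body."""
    return sorted(
        node_id
        for node_id, value in guarded.items()
        if any(variant in body for variant in value_variants_sketch(value))
    )


def drop_contained_sketch(leaked: Dict[str, str]) -> Dict[str, str]:
    """Drop a leaked value that is a proper substring of another leaked value."""
    return {
        node_id: value
        for node_id, value in leaked.items()
        if not any(value != other and value in other for other in leaked.values())
    }


guarded = {"flag_node": "FLAG{a b}"}
encoded_body = "next=" + quote("FLAG{a b}", safe="")
found = leaked_ids_sketch(encoded_body, guarded)
```

Here `encoded_body` contains only the percent-encoded form, which a raw-substring check on the literal value would miss.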
experiments/ (the indictment harness + 89 generated test worlds + writeup) was a one-off validation run, not gym/pack code, so it doesn't belong in the repo. Its findings stay documented in DESIGN.md §8.10 as prose; the pack verifier (consequence.py) is unaffected and stays in the pack, where it belongs. The coupled tests/test_indictment_harness.py goes with it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop references that .rules forbids in code comments: a section pointer
("§8.3 spine"), a forward/"future" reference ("emergent mode"), and a docstring
naming a specific test (rot-prone). The remaining comments are non-obvious WHY.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
)

The realization primitive of the emergent-mode ladder (DESIGN.md §9). Today's admission is structural (a graph-path check); an LLM-realized handler can be wrong, so it is admitted DYNAMICALLY: run the intended exploit + a benign request and let the consequence verifier decide. The exploit must leak the flag, the benign request must not. Accept iff solvable and not trivial.
- realize_admit.py (pack): the pure pieces, classify_admission (the verdict, over consequence.detect_leak) + cmdi_exploit_and_benign (the per-class exploit oracle). Running an episode is a host concern (packs must not import openrange), so the orchestration lives with the caller, not the pack.
- codegen: a vuln node's realized_handler stands in for the template. This is the hook the LLM realization writes through.
- DESIGN.md §9: the M0->M3 realization ladder (procedural architects, LLM realizes, admission verifies, freeze), mapping M1/M2/M3 to #252 / #212 / #235 / #189.

Validated end to end: a faithful command_injection world is accepted; trivial and not-solvable verdicts are exercised. 100% branch coverage; import-boundary clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
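The "accept iff solvable and not trivial" rule can be written as a small pure function. This is a sketch only: the `Verdict` enum, the boolean inputs, and the choice to report "trivial" before "not solvable" when both checks fail are assumptions, not the signature or behaviour of classify_admission in realize_admit.py.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    TRIVIAL = "trivial"
    NOT_SOLVABLE = "not_solvable"


def classify_admission_sketch(exploit_leaked: bool, benign_leaked: bool) -> Verdict:
    """Decide admission from two observed outcomes of one candidate world."""
    if benign_leaked:
        # The flag escapes without any attack, so the world teaches nothing.
        return Verdict.TRIVIAL
    if not exploit_leaked:
        # The intended exploit does not reach the flag.
        return Verdict.NOT_SOLVABLE
    return Verdict.ACCEPT


faithful = classify_admission_sketch(exploit_leaked=True, benign_leaked=False)
```

Keeping the verdict pure (two booleans in, one verdict out) matches the split described above, where running the episode is left to the host.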
CI runs mypy over tests/; the rendered-app exec helper retrieved values from a dict[str, object] namespace, so calling them tripped "object not callable". Type the namespace dict[str, Any] so the pulled-out functions are callable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
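The typing issue is easy to reproduce in isolation. The snippet below is a generic illustration, not the repository's test helper; the `source` string and the `double` function are made up for the example.

```python
from typing import Any, Dict

# Stand-in for rendered app source; the real helper execs generated code.
source = "def double(x):\n    return 2 * x\n"

# With Dict[str, object], mypy rejects the call below with
# '"object" not callable'. Dict[str, Any] lets the lookup result be called.
namespace: Dict[str, Any] = {}
exec(source, namespace)

double = namespace["double"]
result = double(21)
```

The runtime behaviour is identical either way; only the static type of the looked-up value changes.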
Inject a stand-in "realized" handler into a command-injection world and run it through codegen -> runtime -> the admission gate:
- a different-but-real handler (splits on ';', cats the file) is ACCEPTED: the gate lets in varied implementations, which is the diversity M0 is for;
- a handler that returns the flag on any request is REJECTED as trivial;
- a handler that never leaks is REJECTED as not solvable.

Confirms the codegen hook and the gate work together on a live episode. The live LLM-writes-the-handler step plugs in on top (non-deterministic, so a demo not a test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
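The three cases in this commit can be mimicked with toy handlers and an in-memory file table. All of it is hypothetical: the handler bodies, the `FILES` dict, the payload strings, and the `admit_sketch` gate are stand-ins, and no real shell, codegen, or runtime is involved.

```python
from typing import Callable

FLAG = "FLAG{example_only}"
FILES = {"/flag.txt": FLAG}  # stand-in filesystem, not a real one


def faithful_handler(arg: str) -> str:
    """Splits on ';' and 'cats' files: injectable, quiet on benign input."""
    out = []
    for part in arg.split(";"):
        part = part.strip()
        if part.startswith("cat "):
            out.append(FILES.get(part[4:].strip(), ""))
        else:
            out.append("pong " + part)
    return "\n".join(out)


def trivial_handler(arg: str) -> str:
    """Returns the flag on any request."""
    return FLAG


def dead_handler(arg: str) -> str:
    """Never leaks, whatever the input."""
    return "pong"


def admit_sketch(handler: Callable[[str], str]) -> str:
    """Run an exploit and a benign request, then decide from what leaked."""
    exploit_leaks = FLAG in handler("127.0.0.1; cat /flag.txt")
    benign_leaks = FLAG in handler("127.0.0.1")
    if benign_leaks:
        return "trivial"
    if not exploit_leaks:
        return "not_solvable"
    return "accept"


verdicts = {h.__name__: admit_sketch(h) for h in (faithful_handler, trivial_handler, dead_handler)}
```

The gate never inspects how a handler is written, only what it leaks, which is why a differently implemented but genuinely injectable handler is accepted.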
What this does
The gym builds a vulnerable web app with a hidden secret (the "flag"), and an agent tries to extract it. To grade the agent, the gym previously checked only whether the agent submitted the right flag.
This adds a second check: the gym now also notices when the secret actually leaks into a response the app sends back, whether or not the agent submits it. It records which secret leaked by an internal id, never the secret value itself, so the log can't be read to cheat. It also catches secrets returned in common encodings (base64 / hex / URL percent-encoding).
Why
Eventually an LLM will generate the worlds, where there is no single planted flag to submit — so "did sensitive data leave the box?" becomes the natural way to grade a breach automatically. This is the groundwork for that.
Honest scope: today there is only one secret (the flag), so this mostly overlaps with the existing submit check, and it does not change scoring or reward — the signal is just made available. Its real payoff is the next step (LLM-generated and bigger worlds).
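One way to picture "the signal is just made available" is a success check whose verdict depends only on the submitted flag, with leaked ids appearing in the reason text. This is a hedged sketch: the function name, the `submitted_flag` key, and the reason wording are hypothetical; only the `leaked_secret_ids` key comes from the PR.

```python
from typing import Any, Dict, List, Tuple


def check_success_sketch(final_state: Dict[str, Any], expected_flag: str) -> Tuple[bool, str]:
    """Success depends on the submitted flag only; leaks are reported, not scored."""
    submitted = final_state.get("submitted_flag")  # hypothetical key
    leaked: List[str] = final_state.get("leaked_secret_ids", [])

    success = submitted == expected_flag
    reason = "flag submitted" if success else "flag not submitted"
    if leaked:
        reason += "; leaked ids: " + ", ".join(sorted(leaked))
    return success, reason


state = {"submitted_flag": None, "leaked_secret_ids": ["n7"]}
ok, reason = check_success_sketch(state, "FLAG{example_only}")
```

In this sketch an episode that leaks the secret without submitting it still fails, and the leak shows up only in `reason`.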
Follow-ups (scaling to big multi-service worlds)
Testing