Vendor-neutral reference implementation for agent runtime evidence: release manifests, runtime events, permissioned tool calls, traces, and replay state for agentic workflows.
Operational Evidence Plane is a small, runnable reference implementation for making agent runtime behavior reconstructable after release. It binds release-time intent (model, prompt, tool schema, policy, workflow, rollout, eval, and data-state references) to runtime evidence (events, OPA-backed permission decisions, traces, replay state, eval results, and reconstruction packets).
A quick scan should make three things clear: the repository is code, not a concept note (make verify rebuilds replay state and checks joins, including counterfactual replay determinism); it is vendor-neutral, not a replacement for Bedrock, LangSmith, OPA, OTel, MCP, A2A, cloud release tools, Styra DAS, or Permit.io; and it is intentionally bounded to a deterministic code-review demo using mocked LLM behavior, SQLite state, and real OPA decisions.
Use it to inspect whether a runtime action can be joined back to a release manifest, verify policy decisions against trace and replay evidence, replay a stored decision under a substituted policy bundle, and package a minimal evidence chain that humans and CI can review without a vendor-specific control plane.
This repository is an open, vendor-neutral reference implementation. It extends patterns already visible in vendor-native agent versions, prompt registries, policy engines, and telemetry specs, but does not replace them and does not claim standardization or production readiness. The first demo target is a deterministic code-review agent using Python, SQLite, real OPA decisions, scenario-agnostic schemas, and mocked LLM behavior.
The inspectable path is intentionally narrow: a release manifest names the shipped configuration, runtime events and OPA-backed permission packets record what happened, trace and replay state preserve the joins, eval/reconstruction outputs show what can and cannot be reconstructed, and the v0.3 counterfactual branch composes policy, cost, drift, cache, and identity metadata under a stored decision id.
The v0.3 branch adds counterfactual replay primitives over the stored decision record. Given a stored decision_id, the policy replay path reconstructs the recorded OPA input context from SQLite, substitutes a different policy bundle, re-runs OPA deterministically, and emits an original-vs-counterfactual decision diff that validates against replay/counterfactual_replay.v0.schema.json. Additional v0.3 paths record cost, reserve, five-surface drift, cache, and identity metadata under the same decision id.
oep replay pder_code_review_read_diff_0001 \
--counterfactual \
--policy-bundle permissions/policy/counterfactual/compound_reliability_step_bound.rego \
--output-format json \
--replay-timestamp-utc 2026-05-23T00:00:00Z
Policy, budget, reserve accounting, cache staleness, and config-surface
diffs are deterministic replays over recorded fields. Cross-provider model
substitution, cache substitution that implies a fresh model call, and
pre-session projection are labelled replay_class: evaluative and should
be read as counterfactual estimates. The three counterfactual demos, the
composed CLI paths, the per-surface validation gates, and the commercial
and academic precedents are documented in
docs/counterfactual_replay.md.
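As a rough illustration of the original-vs-counterfactual decision diff described above, the sketch below compares two decision records field by field. The record shape (`allow`, `reasons`) and the helper name `diff_decisions` are assumptions made for this example; the actual output contract is defined by replay/counterfactual_replay.v0.schema.json.

```python
from typing import Any


def diff_decisions(original: dict[str, Any], counterfactual: dict[str, Any]) -> dict[str, Any]:
    """Return the fields whose values differ between two decision records."""
    changed = {}
    for key in sorted(set(original) | set(counterfactual)):
        before = original.get(key)
        after = counterfactual.get(key)
        if before != after:
            changed[key] = {"original": before, "counterfactual": after}
    return {"decision_changed": bool(changed), "changed_fields": changed}


# Hypothetical decision records; field names are illustrative only.
recorded = {"allow": True, "reasons": []}
replayed = {"allow": False, "reasons": ["step_bound_exceeded"]}
result = diff_decisions(recorded, replayed)
```

Because the comparison only reads recorded fields and sorts keys before iterating, repeated runs over the same inputs produce the same diff.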
Boundary: this is not a production-grade replay engine, not a compliance certification, not a substitute for vendor authorization-replay products, and does not constitute legal or regulatory adequacy by itself.
Prerequisites:
- Python 3.11-3.14, matching the CI matrix
- OPA CLI 1.x, tested with 1.7.1
OPA install examples:
# macOS, Homebrew
brew install opa
# Linux, x86_64
curl -L -o opa https://openpolicyagent.org/downloads/v1.7.1/opa_linux_amd64_static
chmod +x opa
sudo mv opa /usr/local/bin/opa
If OPA is available outside PATH, set OEP_OPA_BIN_PATH=/path/to/opa.
OPA_PATH is also accepted as a fallback override.
Set OEP_OPA_EVAL_TIMEOUT_SECONDS to tune the counterfactual OPA
subprocess timeout in seconds; the default is 30 and the minimum is
0.001. OPA stdin payloads are capped at 8 MiB; split larger replay
batches before evaluation.
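The timeout and payload limits above could be enforced along these lines. This is a minimal sketch, not the repository's code: the helper names are hypothetical, and it assumes an out-of-range timeout is rejected rather than clamped, which the text does not specify.

```python
import math
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 30.0
MIN_TIMEOUT_SECONDS = 0.001
MAX_STDIN_BYTES = 8 * 1024 * 1024  # 8 MiB cap on OPA stdin payloads


def resolve_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read OEP_OPA_EVAL_TIMEOUT_SECONDS, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("OEP_OPA_EVAL_TIMEOUT_SECONDS")
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT_SECONDS
    value = float(raw)  # raises ValueError on non-numeric input
    # Assumption: values below the minimum are rejected, not clamped.
    if not math.isfinite(value) or value < MIN_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be a finite number >= {MIN_TIMEOUT_SECONDS}")
    return value


def check_payload(payload: bytes) -> bytes:
    """Reject stdin payloads above the documented cap."""
    if len(payload) > MAX_STDIN_BYTES:
        raise ValueError("payload exceeds 8 MiB; split the replay batch")
    return payload
```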
Set OEP_OPA_COMMAND_WRAPPER to prepend a local containment command
to OPA invocations, for example prlimit --as=100000000 in CI
environments that evaluate substituted policy bundles. The wrapper
executable is restricted to docker, nice, prlimit, or sudo.
The executable must resolve from PATH to a trusted system or local tool
directory such as /usr/bin, /bin, /usr/sbin, /sbin,
/usr/local/bin, or /opt/homebrew/bin.
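A check of the wrapper executable against both allow-lists might look like the following sketch. The function name `validate_wrapper` and the injectable `which` parameter are assumptions for illustration, and the sketch covers only the executable, not the per-wrapper argument rules.

```python
import os
import shlex
import shutil
from typing import Callable, Optional

ALLOWED_WRAPPERS = {"docker", "nice", "prlimit", "sudo"}
TRUSTED_DIRS = {
    "/usr/bin", "/bin", "/usr/sbin", "/sbin", "/usr/local/bin", "/opt/homebrew/bin",
}


def validate_wrapper(
    command: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Split a wrapper command and check its executable against both allow-lists."""
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty wrapper command")
    name = argv[0]
    # A bare name is required; explicit paths such as ./prlimit fail this check.
    if name not in ALLOWED_WRAPPERS:
        raise ValueError(f"wrapper {name!r} is not allow-listed")
    resolved = which(name)
    if resolved is None:
        raise ValueError(f"wrapper {name!r} not found on PATH")
    # This sketch does not follow symlinks; a stricter check might.
    if os.path.dirname(os.path.abspath(resolved)) not in TRUSTED_DIRS:
        raise ValueError(f"wrapper resolves outside trusted directories: {resolved}")
    return argv
```

Passing `which` as a parameter keeps the check testable without depending on what is installed on the host.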
Wrapper arguments are restricted to allow-listed options and strict
values for the selected wrapper; positional alternate binary targets are
rejected. Docker wrappers must use docker run, must include --init,
and may only use a constrained option set such as --rm,
--network none, --user, --cpus, --memory, --pids-limit,
--read-only, and read-only -v / --volume bind mounts in
host_path:container_path:ro form. When a
read-only bind mount contains the policy bundle path, OEP rewrites the
OPA --data argument to the corresponding container path.
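The path rewrite for read-only bind mounts can be pictured as below. This is a sketch under stated assumptions: POSIX-style absolute paths, and hypothetical helper names `parse_ro_mount` and `rewrite_for_container`; it is not the repository's implementation.

```python
import posixpath
from typing import Optional


def parse_ro_mount(spec: str) -> tuple[str, str]:
    """Parse host_path:container_path:ro, rejecting any other form."""
    parts = spec.split(":")
    if len(parts) != 3 or parts[2] != "ro":
        raise ValueError(f"bind mount must be host:container:ro, got {spec!r}")
    host, container = parts[0], parts[1]
    if not (posixpath.isabs(host) and posixpath.isabs(container)):
        raise ValueError("bind mount paths must be absolute")
    return posixpath.normpath(host), posixpath.normpath(container)


def rewrite_for_container(bundle_path: str, mounts: list[str]) -> Optional[str]:
    """Map a host bundle path to its container path, or None if no mount contains it."""
    bundle = posixpath.normpath(bundle_path)
    for spec in mounts:
        host, container = parse_ro_mount(spec)
        # Match on whole path components so /repo does not match /repo-other.
        if bundle == host or bundle.startswith(host.rstrip("/") + "/"):
            relative = posixpath.relpath(bundle, host)
            return posixpath.normpath(posixpath.join(container, relative))
    return None
```

For example, with the mount `/repo:/work:ro`, a bundle at `/repo/permissions/policy/x.rego` maps to `/work/permissions/policy/x.rego`.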
Wrappers must keep the OPA child in the spawned process group
or forward termination signals so timeout cleanup can stop the full
evaluation tree.
Set OEP_SQLITE_BATCH_VARIABLE_LIMIT to raise the replay reader batch
limit above the default 900 on modern SQLite builds; values must stay
between 1 and 32766.
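The batch limit exists because SQLite bounds the number of bound parameters in a single statement. A reader can stay under the bound by chunking its IN-list, as in this sketch. The `permissions` table name appears in this README, but the `decision_id` column and both helper names are assumptions for illustration.

```python
import sqlite3
from typing import Iterator, Sequence

DEFAULT_BATCH_LIMIT = 900
MAX_BATCH_LIMIT = 32766


def chunked(ids: Sequence[str], limit: int = DEFAULT_BATCH_LIMIT) -> Iterator[Sequence[str]]:
    """Yield slices of ids no longer than the bound-variable limit."""
    if not 1 <= limit <= MAX_BATCH_LIMIT:
        raise ValueError("batch limit must be between 1 and 32766")
    for start in range(0, len(ids), limit):
        yield ids[start:start + limit]


def fetch_by_ids(conn: sqlite3.Connection, ids: Sequence[str],
                 limit: int = DEFAULT_BATCH_LIMIT) -> list[tuple]:
    """Fetch rows for many ids using one parameterized query per batch."""
    rows: list[tuple] = []
    for batch in chunked(ids, limit):
        placeholders = ",".join("?" for _ in batch)
        # Hypothetical column name; only placeholders are interpolated.
        query = f"SELECT decision_id FROM permissions WHERE decision_id IN ({placeholders})"
        rows.extend(conn.execute(query, list(batch)).fetchall())
    return rows


# Small in-memory demonstration with five synthetic ids.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE permissions (decision_id TEXT PRIMARY KEY)")
demo_ids = [f"pder_example_{i:04d}" for i in range(5)]
demo_conn.executemany("INSERT INTO permissions VALUES (?)", [(i,) for i in demo_ids])
demo_rows = fetch_by_ids(demo_conn, demo_ids, limit=2)
```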
The reference implementation invokes OPA through the CLI for each replay
batch. Higher-volume deployments can preserve the same deterministic
input/output contract while routing evaluation through a local OPA server
or a WASM runtime.
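A per-batch CLI invocation can be sketched as follows. The `opa eval` flags used here (`--format`, `--data`, `--stdin-input`) are standard OPA CLI options; the function names and any query string passed in are assumptions for illustration rather than this repository's API.

```python
import json
import subprocess
from typing import Any


def build_opa_command(opa_bin: str, bundle_path: str, query: str) -> list[str]:
    """Build an `opa eval` argv that reads the input document from stdin."""
    return [
        opa_bin, "eval",
        "--format", "json",
        "--data", bundle_path,
        "--stdin-input",
        query,
    ]


def evaluate(opa_bin: str, bundle_path: str, query: str,
             opa_input: dict[str, Any], timeout: float = 30.0) -> Any:
    """Run one OPA evaluation and return the parsed JSON result."""
    completed = subprocess.run(
        build_opa_command(opa_bin, bundle_path, query),
        # Sorted keys keep the serialized input stable across runs.
        input=json.dumps(opa_input, sort_keys=True).encode("utf-8"),
        capture_output=True,
        timeout=timeout,
        check=True,  # a non-zero OPA exit raises CalledProcessError
    )
    return json.loads(completed.stdout)
```

Keeping the serialized input and parsed output as the only contract is what would allow `evaluate` to be swapped for a local OPA server or WASM call without changing callers.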
python3 -m venv .venv
. .venv/bin/activate
python -m pip install uv==0.11.10
uv sync --extra dev --locked
make verify
This compiles the packages, tests and evaluates the OPA policy, validates the counterfactual replay schema, regenerates demo/state/code_review_agent.sqlite, checks every cross-artifact join, validates the deterministic eval, checks the reconstruction packet, verifies the committed DTR JSONL projection, checks the MCP -> OEP permission packet projection, exercises the oep replay reader against generated replay state, and checks counterfactual replay determinism across generated SQLite, JSON/JSONL, and DTR outputs.
It also builds the root wheel/sdist, installs the wheel in a temporary virtual environment, and checks that the installed packages keep their typing markers.
Smoke tests run through pytest:
make test
Coverage runs the full verify chain plus pytest and fails below 95%:
make coverage
Linting, type checking, policy tests, and artifact maintenance:
make lint
make typecheck
make test-policy
make sync-resources
make build-check
make check-digests
make check-dtr-jsonl
make validate-counterfactual-replay
make check-replay-determinism
make update-digests
Installed package entry points:
oep-verify-manifest
oep-run-demo --state-path /tmp/oep-code-review.sqlite
oep-check-reconstruction
oep replay pder_code_review_read_diff_0001
OEP_REPLAY_MODE=counterfactual oep replay pder_code_review_read_diff_0001 --policy-bundle permissions/policy/counterfactual/compound_reliability_step_bound.rego
oep replay <decision_id> is a read-only reader over the local SQLite
replay store. It joins the recorded permission packet, agent-step event,
trace bundle, and release-manifest summary for a recorded decision id.
It does not make live model or vendor calls.
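The join that the reader performs can be pictured with a small in-memory example. The table names `permissions`, `events`, and `traces` appear in this README, but every column name, the sample rows, and the `replay_record` helper are assumptions for illustration; the generated SQLite file defines the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema and one synthetic decision.
conn.executescript("""
CREATE TABLE permissions (decision_id TEXT PRIMARY KEY, tool_call_id TEXT, allow INTEGER);
CREATE TABLE events (tool_call_id TEXT, trace_id TEXT, release_manifest_id TEXT);
CREATE TABLE traces (trace_id TEXT PRIMARY KEY, span_count INTEGER);
INSERT INTO permissions VALUES ('pder_example_0001', 'call_1', 1);
INSERT INTO events VALUES ('call_1', 'trace_1', 'manifest_1');
INSERT INTO traces VALUES ('trace_1', 3);
""")
conn.row_factory = sqlite3.Row


def replay_record(conn: sqlite3.Connection, decision_id: str):
    """Join the recorded artifacts for one decision id; None if it is unknown."""
    row = conn.execute(
        """
        SELECT p.decision_id, p.allow, e.trace_id, e.release_manifest_id, t.span_count
        FROM permissions AS p
        JOIN events AS e ON e.tool_call_id = p.tool_call_id
        JOIN traces AS t ON t.trace_id = e.trace_id
        WHERE p.decision_id = ?
        """,
        (decision_id,),
    ).fetchone()
    return dict(row) if row is not None else None


record = replay_record(conn, "pder_example_0001")
```

A reader over a file on disk could additionally open it with `sqlite3.connect("file:<path>?mode=ro", uri=True)` so that the connection itself is read-only.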
To inspect generated replay state:
sqlite3 demo/state/code_review_agent.sqlite \
"select 'events', count(*) from events union all select 'permissions', count(*) from permissions union all select 'traces', count(*) from traces union all select 'evals', count(*) from evals union all select 'findings', count(*) from findings;"
To reset generated state:
make clean-state
To isolate generated state for a test run:
OEP_DEMO_STATE_PATH=/tmp/oep-code-review.sqlite make verify
python demo/scripts/run_code_review_demo.py --state-path /tmp/oep-code-review.sqlite
For the fastest read, open these in order:
- Architecture walkthrough
- Release manifest example
- Agent-step event example
- Tool permission packet
- Operational trace bundle
- Deterministic eval result
- Reconstruction packet
- Quickstart walkthrough
- Architecture walkthrough
- Counterfactual replay guide
- Schema reference
- Schema versioning policy
- Schema migration v0.3
- Record-keeping reference
- Landscape and prior art
- Decision log
- Public claims guide
- Release checklist
- Contributing guide
- Bedrock translation
- Decision Trace Reconstructor integration
- Model Context Protocol (MCP) adapter
- LangGraph adapter mapping: projection demo for LangGraph checkpoint events; the production wrapper ships separately as the langgraph-oep pip package (pip install langgraph-oep)
The release manifest is the first inspectable release-time layer:
- Release manifest schema
- Code-review-agent release example
- Deterministic model-behavior contract
- Code-review prompt contract
- Rollback and reconstruction rules
The manifest schema binds eight release-time field groups: model, prompt, tool schema, policy, workflow, rollout, eval, and data state.
The event profile is the first runtime join layer:
The event schema carries release_manifest_id, trace_id, span_id, checkpoint, entity_ref, action_type, tool_call_id, permission_packet_ref, replay_handle, and evidence-loss notes.
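A quick way to see whether an event can still be joined is to list which of these keys are absent or empty. The sketch below assumes every listed key is needed for a full join, which the event schema may relax, and the sample event values are invented.

```python
from typing import Any

# Join keys named in the event profile; evidence-loss notes are descriptive
# rather than a join key, so they are left out of this check.
JOIN_KEYS = (
    "release_manifest_id", "trace_id", "span_id", "checkpoint", "entity_ref",
    "action_type", "tool_call_id", "permission_packet_ref", "replay_handle",
)


def missing_join_keys(event: dict[str, Any]) -> list[str]:
    """Return the join keys that are absent or empty in an event record."""
    return [key for key in JOIN_KEYS if event.get(key) in (None, "")]


# Hypothetical event with two join keys lost.
sample_event = {
    "release_manifest_id": "manifest_1",
    "trace_id": "trace_1",
    "span_id": "span_1",
    "checkpoint": "pre_tool_call",
    "entity_ref": "diff_1",
    "action_type": "tool_call",
    "tool_call_id": "call_1",
    "permission_packet_ref": "",
}
gaps = missing_join_keys(sample_event)
```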
The permission packet is the first OPA-backed runtime evidence layer:
- Tool permission packet schema
- Code-review tool permission example
- Denied write permission example
- OPA policy
- Counterfactual replay output schema
- Compound reliability counterfactual policy
- Budget-per-run counterfactual policy
- Approval-per-step counterfactual policy
- OPA input
- Denied OPA input
The trace bundle is the first stitched reconstruction view:
- Operational trace schema
- Code-review-agent trace bundle
- Denied trace bundle
- Eval result schema
- Code-review-agent eval result
- Denied replay-readiness eval result
The demo materializes local replay state:
- Synthetic diff fixture
- Replay-state recipe
- Deterministic demo runner
- Counterfactual demo runner
- Run script
- Replay-state checker
- Counterfactual replay checker
make verify regenerates demo/state/code_review_agent.sqlite from the committed artifacts, checks that the committed DTR JSONL projection is up to date, and runs the counterfactual replay determinism checks. SQLite files and generated counterfactual JSON/JSONL outputs under demo/counterfactual/ are intentionally ignored; they are reproducible local state, not source.
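One common way to check that regenerated outputs are reproducible is to compare digests of a canonical serialization across runs. The sketch below shows that general technique with hypothetical helper names; it is not a description of what the repository's determinism checks do internally.

```python
import hashlib
import json
from typing import Any, Callable


def canonical_digest(record: Any) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    encoded = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_deterministic(produce: Callable[[], Any], runs: int = 2) -> bool:
    """Call produce() several times and report whether all digests agree."""
    digests = {canonical_digest(produce()) for _ in range(runs)}
    return len(digests) == 1
```

Key order does not affect the digest, so two records that differ only in dictionary ordering compare equal, while any changed value, such as a fresh timestamp, changes the digest.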
The playbook packet is the first reconstruction output:
- Reconstruction packet schema
- Code-review reconstruction packet
- Denied blocked reconstruction packet
- Scenario reconstruction checker
- Incident reconstruction case study
The current inspectable chain is:
release manifest -> agent-step event -> OPA-backed permission packet -> trace bundle -> SQLite replay state -> deterministic eval result -> reconstruction packet
The v0.3 counterfactual branch starts from the stored permission decision and replay state, substitutes policy, budget, model, cache, or config-surface inputs, and emits schema-validated attribution output. Deterministic surfaces replay over recorded fields; model and cache-fresh-call substitutions are labelled evaluative estimates. The primary eval is a deterministic smoke check over one synthetic fixture. The denied path demonstrates blocked replay readiness when OPA denies a tool call and no SQLite replay state is generated. Neither is a benchmark, model-quality claim, safety certification, or production monitoring result.
docs/record_keeping_reference.md documents two things: the replayable permission trace fields introduced in the v0.2 release line and still current (credential lifetime, approval capture, policy and manifest digests, resolved model identity, and the non-deterministic builtin cache), and the education-only mapping of OEP record fields to EU AI Act and NIST AI RMF record-keeping language. The mapping is reference material, not a compliance or audit claim.
oep replay <decision_id>
oep replay <decision_id> --counterfactual --policy-bundle <path-to-rego-bundle>
oep replay is a thin read-only reader over the local SQLite replay
store generated by oep-run-demo or make verify. It reconstructs the
recorded permission trace for a decision id (the pder_* packet
identifier) by joining the recorded permission packet, agent-step
event, trace bundle, and release-manifest summary.
The demo runner materializes SQLite state at a temporary path and publishes the completed database with an atomic replace. Existing replay readers keep their current file handle; new readers open the completed replacement.
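The write-then-replace pattern described above can be sketched with the standard library. The `publish_state` helper, the single-column `events` table, and the temporary-file suffix are assumptions for illustration; the point is that the database is completed at a temporary path in the same directory and then swapped in with `os.replace`.

```python
import os
import sqlite3
import tempfile


def publish_state(final_path: str, rows: list[tuple[str]]) -> None:
    """Build a SQLite file at a temporary path, then atomically replace final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Same directory as the target so the replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".sqlite.tmp")
    os.close(fd)
    try:
        conn = sqlite3.connect(tmp_path)
        try:
            # Hypothetical minimal table standing in for the real replay schema.
            conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY)")
            conn.executemany("INSERT INTO events VALUES (?)", rows)
            conn.commit()
        finally:
            conn.close()
        os.replace(tmp_path, final_path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```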
- The CLI does not make live model or vendor API calls.
- It does not introduce new persistence; it only reads existing rows.
- Pass --state-path to read from an alternate SQLite path, or set OEP_DEMO_STATE_PATH before running the demo.
- Pass --field <name> to print a specific record field instead of the full JSON record.
- Pass --counterfactual --policy-bundle <path> to re-derive the decision under a substituted policy bundle. OEP_REPLAY_MODE accepts read-only (default) or counterfactual.
- Pass --output-format json, jsonl, or human. Read-only replay defaults to JSON; counterfactual replay defaults to human output.
- Pass --replay-timestamp-utc <date-time> in counterfactual mode when CLI JSON must be compared byte-for-byte.
- Pass --strip-exclusions in counterfactual JSON/JSONL mode to remove fields listed in replay_metadata.determinism_exclusions before output.
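To make the exclusion stripping concrete, the sketch below removes excluded fields from a record before it is compared or printed. It assumes that entries in replay_metadata.determinism_exclusions are dotted paths into the record, which this README does not state; the `strip_exclusions` helper and the sample field names are likewise hypothetical.

```python
import copy
from typing import Any


def strip_exclusions(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of record without the fields named in its exclusion list."""
    result = copy.deepcopy(record)
    exclusions = result.get("replay_metadata", {}).get("determinism_exclusions", [])
    for path in exclusions:
        *parents, leaf = path.split(".")
        node: Any = result
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)  # missing fields are ignored
    return result


# Hypothetical counterfactual output with one non-deterministic field.
sample = {
    "decision": {"allow": False},
    "replay_metadata": {
        "replay_timestamp_utc": "2026-05-23T00:00:00Z",
        "determinism_exclusions": ["replay_metadata.replay_timestamp_utc"],
    },
}
stripped = strip_exclusions(sample)
```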
The integrations/mcp/ directory ships an
illustrative adapter that translates one Model Context Protocol
(MCP) tools/call envelope into an OEP permission packet, including
the replayable permission trace fields added in v0.2. It is documentation and
mapping data with a standalone script — it does not call MCP servers
or vendor APIs.
The adapter is an illustration, not a replacement for MCP, LangSmith,
Bedrock, OTel, A2A, or OPA. A LangGraph adapter ships alongside in
integrations/langgraph/ as a projection mapping, and as the separate
langgraph-oep pip package for runtime use. Other framework adapters
(OpenAI Assistants, Bedrock) remain post-core translation material.
The public anchors for the permission-evidence framing this adapter
exposes are the Model Context Protocol authorization specification
(2025-03-26)
and the NSA Cybersecurity Information sheet on MCP security design
considerations.
Both are cited as third-party public guidance, not as endorsements of
this repository. The integration-side reference list lives in
integrations/mcp/README.md.
- reference implementation, not framework
- not a vendor replacement
- not ready for production use
- not a production-grade replay engine
- not a standardization proposal
- not a compliance certification
- not a substitute for vendor authorization-replay products
- does not create compliance, audit readiness, or legal sufficiency by itself
- does not constitute legal or regulatory adequacy by itself
- demonstrates one wiring pattern among several plausible ones
- designed for inspectability and education first
The public Python distribution is the root package, operational-evidence-plane. The workspace directories below are source and development boundaries for the reference implementation; they are not independently published packages for this release line.
| Package | Intended scope |
|---|---|
| manifest/ | Release-time binding records for model, prompt, tool schema, policy, workflow, rollout, eval, and data-state references. |
| events/ | Scenario-agnostic runtime event profile and replay join keys. |
| permissions/ | Tool-call permission decision records backed by real OPA decisions. |
| traces/ | Trace and span examples that connect release manifests, events, permissions, evals, and replay handles. |
| playbooks/ | Rollback and incident reconstruction rules that explain what evidence is sufficient, missing, or stale. |
| demo/ | Deterministic code-review-agent scenario using mocked LLM behavior and local SQLite state. |
How this repository relates to Bedrock, Azure AI Projects, Vertex, LangSmith, Styra DAS / Permit.io, OTel GenAI, MCP, and A2A, plus the SAL and ACP prior art and the Bedrock translation, is collected in docs/landscape.md.
This is not an agent framework, model gateway, tracing backend, policy language, compliance product, legal-audit package, vendor replacement, production platform, production-grade replay engine, compliance certification, or substitute for vendor authorization-replay products. It does not constitute legal or regulatory adequacy by itself. It is also not a claim that adjacent vendor and open-source tools are absent. The safer claim is narrower: public artifacts mostly expose adjacent slices, and this repository demonstrates one inspectable way to stitch release-time and runtime evidence together.
- Language: Python.
- Demo model behavior: deterministic mocked LLM.
- Local state: SQLite.
- Policy decisions: real OPA.
- Demo scenario: code-review agent.
- Core schemas: scenario-agnostic first.
- Post-core translation: optional Bedrock-specific examples only after the vendor-neutral core exists.
Apache-2.0