Attribute every changed line to its authoring agent, evaluate risk with Semgrep, and deliver governance-grade analytics.
Modern teams mix human and AI-generated code. This service makes that blend observable:
- Attribution – Trace each changed line to the producing agent/session via provenance markers supplied at ingestion.
- Detection – Run Semgrep rule packs on the diff to flag risky patterns (SQLi, eval, etc.).
- Analytics – Produce KPIs such as agent-level risk rate, provenance coverage, volume/churn, and trend deltas.
- Governance – Expose policy decisions (allow/block/warn) with evidence that maps back to repos, PRs, files, and lines.
```
┌────────────────┐      ┌────────────────────┐      ┌─────────────────────┐
│ Git Provider / │      │ Ingestion &        │      │ Governance Engine   │
│ CI Integrations│ ───▶ │ Attribution Engine │ ───▶ │ (policy + evidence) │
└────────────────┘      └─────────┬──────────┘      └──────────┬──────────┘
                                  │                            │
                                  ▼                            ▼
                        ┌────────────────────┐      ┌────────────────────┐
                        │ Detection via      │      │ Warehouse &        │
                        │ Semgrep            │      │ Analytics (Redis)  │
                        └────────────────────┘      └────────────────────┘
```
- FastAPI wires the ingestion (`/v1/analysis`), analytics, and governance APIs.
- Redis persists analyses, changed lines, findings, and decisions using sorted sets, lists, and hashes.
- Semgrep runs as the detection engine. Rules live in `app/detection_rules/semgrep_rules.yml`; extend or replace them to suit your policy.
```bash
# 1. Install dependencies into a managed virtualenv
make setup

# 2. Ensure Redis is running (default: redis://localhost:6379/0)
#    You can override via PROVENANCE_REDIS_URL

# 3. Launch the API with hot reload
make run
```

Visit http://localhost:8000/docs for interactive OpenAPI documentation.
All settings can be driven by environment variables prefixed with `PROVENANCE_`.
Copy `.env.example` to `.env` and adjust values locally if you prefer dotenv-style configuration.
| Variable | Purpose | Default |
|---|---|---|
| `PROVENANCE_SERVICE_BASE_URL` | Base URL used in status links | `http://localhost:8000` |
| `PROVENANCE_REDIS_URL` | Redis connection string | `redis://localhost:6379/0` |
| `PROVENANCE_PROVENANCE_BLOCK_ON_UNKNOWN` | Block when changed lines lack agent attribution | `false` |
| `PROVENANCE_RISK_HIGH_SEVERITY_THRESHOLD` | Number of high-severity findings before issuing a warn | `1` |
| `PROVENANCE_ANALYTICS_DEFAULT_WINDOW` | Default lookback window for analytics | `7d` |
| `PROVENANCE_SEMGREP_CONFIG_PATH` | Override path/URL for Semgrep configuration | (bundled rules) |
| `PROVENANCE_TIMESERIES_BACKEND` | Backend for analytics event export (`file`, `bigquery`, `snowflake`, `clickhouse`, `off`) | `file` |
| `PROVENANCE_TIMESERIES_PATH` | File path for JSONL events when using the `file` backend | `data/timeseries_events.jsonl` |
| `PROVENANCE_TIMESERIES_PROJECT` | Cloud project/account for warehouse exports | (unset) |
| `PROVENANCE_TIMESERIES_DATABASE` | Warehouse database name (Snowflake only) | (unset) |
| `PROVENANCE_TIMESERIES_DATASET` | Dataset/schema for warehouse exports | (unset) |
| `PROVENANCE_TIMESERIES_TABLE` | Destination table for warehouse exports | (unset) |
| `PROVENANCE_TIMESERIES_CREDENTIALS_PATH` | Service account JSON for BigQuery (optional) | (unset) |
| `PROVENANCE_TIMESERIES_USER` | Warehouse service user (Snowflake) | (unset) |
| `PROVENANCE_TIMESERIES_PASSWORD` | Warehouse password/secret (Snowflake) | (unset) |
| `PROVENANCE_TIMESERIES_WAREHOUSE` | Snowflake warehouse to use | (unset) |
| `PROVENANCE_TIMESERIES_ROLE` | Snowflake role to assume | (unset) |
| `PROVENANCE_TIMESERIES_BATCH_SIZE` | Buffer size before flushing warehouse writes | `25` |
| `PROVENANCE_OTEL_ENABLED` | Enable OpenTelemetry metrics export | `false` |
| `PROVENANCE_OTEL_EXPORTER` | Metrics exporter target (`console`, `otlp`, `prometheus`) | `console` |
| `PROVENANCE_OTEL_OTLP_ENDPOINT` | OTLP metrics endpoint (when `exporter=otlp`) | (unset) |
| `PROVENANCE_OTEL_PROMETHEUS_PORT` | Prometheus exporter port (when `exporter=prometheus`) | `9464` |
| `PROVENANCE_POLICY_WARN_THRESHOLDS` | JSON map of category warn thresholds | `{}` |
| `PROVENANCE_POLICY_BLOCK_THRESHOLDS` | JSON map of category block thresholds | `{}` |
| `PROVENANCE_DETECTOR_MODULE_PATHS` | JSON array of detector module paths to auto-load | `[]` |
| `PROVENANCE_GITHUB_TOKEN` | Personal access token for GitHub API enrichment | (unset) |
| `PROVENANCE_GITHUB_BASE_URL` | GitHub Enterprise base URL (optional) | (unset) |
| `PROVENANCE_GITHUB_AGENT_LABEL_PREFIX` | PR label prefix used to infer agent IDs | `agent:` |
| `PROVENANCE_GITHUB_CACHE_TTL_SECONDS` | Cache TTL (seconds) for GitHub metadata lookups | `300` |
| `PROVENANCE_GITHUB_AGENT_MAP` | JSON map of GitHub logins/keywords to agent IDs | `{}` |
| `PROVENANCE_GITHUB_REVIEWER_TEAM_MAP` | JSON map of reviewer logins to team names for cohort reporting | `{}` |
| `PROVENANCE_TEAM_REVIEW_BUDGETS` | JSON map of team names to max expected human review counts per window | `{}` |
| `PROVENANCE_AGENT_PUBLIC_KEYS` | JSON map of agent IDs to base64 Ed25519 public keys for attestation verification | `{}` |
| `PROVENANCE_CLICKHOUSE_URL` | ClickHouse HTTP endpoint for analytics export (when `backend=clickhouse`) | (unset) |
| `PROVENANCE_CLICKHOUSE_DATABASE` | ClickHouse database name (optional if table already qualified) | `provenance` |
| `PROVENANCE_CLICKHOUSE_USER` | ClickHouse user for authenticated writes | (unset) |
| `PROVENANCE_CLICKHOUSE_PASSWORD` | ClickHouse password for authenticated writes | (unset) |
- Rules are bundled in `app/detection_rules/semgrep_rules.yml`.
- Override with your own rule pack by setting `PROVENANCE_SEMGREP_CONFIG_PATH` (file path, directory, or remote Semgrep registry URL).
- During analysis execution we materialize the changed lines to a temporary workspace and run:

  ```bash
  semgrep --config app/detection_rules/semgrep_rules.yml --json
  ```

- The JSON results are mapped back to the originating changed lines so findings retain repo/PR/file/line attribution.
- Extend the rule pack or point the detector at your organization-wide Semgrep registry by updating `SemgrepDetector` in `app/services/detection.py`.
- Register additional detectors by providing module paths in `PROVENANCE_DETECTOR_MODULE_PATHS`; each module should expose `register_detectors()` returning `BaseDetector` instances (see the sketch after this list).
- Built-in heuristics include a Python import detector that flags risky modules (e.g., `subprocess`, `pickle`); extend this pattern with your own detectors via modular hooks.
- Introspect active detectors via `/v1/detectors/capabilities` to confirm bundled Semgrep packs, digests, and custom modules loaded at runtime.
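A minimal custom detector module might look like the following sketch. The `detect` method name, its argument shape, and the finding dict layout are assumptions; mirror the real `BaseDetector` contract in `app/services/detection.py` when writing your own.

```python
# my_detectors.py -- register via PROVENANCE_DETECTOR_MODULE_PATHS='["my_detectors"]'
# NOTE: the detect() signature and the finding fields below are illustrative
# assumptions; align them with the actual BaseDetector interface.
from app.services.detection import BaseDetector  # assumed location of BaseDetector


class TodoDetector(BaseDetector):
    """Flags TODO markers left in newly added lines."""

    name = "todo-detector"

    def detect(self, changed_lines):
        findings = []
        for line in changed_lines:
            if line.change_type == "added" and "TODO" in line.content:
                findings.append(
                    {
                        "rule_id": "custom.todo",
                        "severity": "low",
                        "message": "TODO left in changed code",
                        "file_path": line.file_path,
                        "line_number": line.line_number,
                    }
                )
        return findings


def register_detectors():
    # Hook named in this README: must return BaseDetector instances.
    return [TodoDetector()]
```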
- When GitHub credentials are configured, the service inspects commit trailers, PR labels, review comments, reviewer identities/teams, and PR timelines to fill missing agent attribution (see `app/provenance/github_resolver.py`).
- The resolver persists review conversations (thread counts, team participation, bot override details, classification breakdowns, response latency), CI outcomes (time-to-green, failing check taxonomy), and commit/timeline summaries (force pushes, human follow-ups, rewrite loops) so analytics/governance can act without re-crawling GitHub.
- Governance automatically raises alerts when bot change requests are bypassed or force pushes land after approval; `/v1/analytics/review-alerts` and `/v1/analytics/review-load` expose the same signals for monitoring.
- Agents can optionally attach Ed25519 signatures for each changed line. Supply public keys via `PROVENANCE_AGENT_PUBLIC_KEYS`; verified signatures boost provenance confidence and surface cryptographic evidence alongside heuristic attribution. A key-generation and signing sketch follows this list.
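As an illustration, an agent-side signing flow with the `cryptography` package might look like this. The signed payload layout and the way the signature attaches to a changed line are assumptions, not the service's documented canonical form.

```python
# Sketch: generate an Ed25519 keypair, export the base64 public key for
# PROVENANCE_AGENT_PUBLIC_KEYS, and sign one changed line. The
# "file_path:line_number:content" payload layout is an illustrative assumption.
import base64

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_b64 = base64.b64encode(
    private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
).decode()
print(f'PROVENANCE_AGENT_PUBLIC_KEYS={{"claude-3-opus": "{public_b64}"}}')

payload = b'api/orders.py:128:query = f"SELECT * FROM orders WHERE id = {order_id}"'
signature_b64 = base64.b64encode(private_key.sign(payload)).decode()
# Attach signature_b64 to the changed line's attribution block at ingestion
# (the exact field name is not documented here).
```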
| Endpoint | Method | Description |
|---|---|---|
| `/healthz` | GET | Liveness probe |
| `/v1/analysis` | POST | Submit a pull request (diff + provenance) for asynchronous analysis |
| `/v1/analysis/{id}` | GET | Poll analysis status, findings count, and risk summary snapshot |
| `/v1/analysis/{id}/decision` | GET | Fetch the governance decision (allow/block/warn) with evidence |
| `/v1/analysis/{id}/bundle` | GET | Retrieve the signed DSSE decision bundle |
| `/v1/analysis/{id}/sarif` | GET | Retrieve the SARIF 2.1.0 findings report for the analysis |
| `/v1/analytics/summary` | GET | Retrieve aggregated KPIs (risk rate, provenance, volume, churn, complexity, etc.) |
| `/v1/analytics/agents/behavior` | GET | Retrieve composite behavioral snapshots for each agent |
| `/v1/detectors/capabilities` | GET | Enumerate active detectors (Semgrep configs, versions, metadata) |
| `/metrics` | GET | Prometheus scrape endpoint (when the exporter is `prometheus`) |
Example ingestion payload:
```json
{
  "repo": "acme/shop",
  "pr_number": "42",
  "base_sha": "abc123",
  "head_sha": "def456",
  "branch": "feature/checkout-hardening",
  "provenance_data": {
    "metadata": {
      "attestation_url": "https://evidence.example.com/claims/123"
    },
    "changed_lines": [
      {
        "file_path": "api/orders.py",
        "line_number": 128,
        "change_type": "added",
        "content": "query = f\"SELECT * FROM orders WHERE id = {order_id}\"",
        "language": "python",
        "attribution": {
          "agent_id": "claude-3-opus",
          "agent_session_id": "sess-789",
          "provenance_marker": "Agent-ID: claude-3-opus-20240229"
        }
      }
    ]
  }
}
```
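To submit this payload and retrieve the verdict, a minimal client sketch follows. The endpoints come from the API table above; the `id` and `status` response field names are assumptions about the async API's response shape, so verify against `/docs`.

```python
# Minimal submission sketch using requests against a local instance.
import time

import requests

payload = {
    "repo": "acme/shop",
    "pr_number": "42",
    "base_sha": "abc123",
    "head_sha": "def456",
    "provenance_data": {"changed_lines": []},  # use the full payload above in practice
}

resp = requests.post("http://localhost:8000/v1/analysis", json=payload, timeout=30)
resp.raise_for_status()
analysis_id = resp.json()["id"]  # assumed response field name

# Poll until the asynchronous analysis finishes, then fetch the decision.
while True:
    status = requests.get(f"http://localhost:8000/v1/analysis/{analysis_id}", timeout=30).json()
    if status.get("status") in {"completed", "failed"}:  # assumed status values
        break
    time.sleep(2)

print(requests.get(f"http://localhost:8000/v1/analysis/{analysis_id}/decision", timeout=30).json())
```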
- `/v1/analytics/summary` now surfaces GitHub-aware metrics alongside the existing risk/volume suite: `code_volume`, `code_churn_rate`, `avg_line_complexity`, `agent_response_rate`, `agent_response_p50_hours`, `agent_response_p90_hours`, `reopened_threads`, `force_push_events`, `rewrite_loops`, `human_followup_commits`, `human_followup_fast`, `ci_time_to_green_hours`, `ci_failed_checks`, `agent_commit_ratio`, `commit_lead_time_hours`, `force_push_after_approval`, `human_reviewer_count`, `avg_human_reviewers`, `avg_unique_reviewers`, `bot_review_events`, `bot_block_events`, `bot_block_overrides`, `bot_block_resolved`, `bot_reviewer_count`, `bot_informational_only_reviewer_count`, `bot_comment_count`, and `classification_<label>_count` (e.g., `classification_security_count`).
- `/v1/analytics/agents/behavior` returns composite snapshots that now blend code/finding metrics with review conversation health (thread counts, response latency, classification breakdowns), CI friction (failures, time-to-green), commit dynamics (force pushes, rewrite loops, human follow-ups), and attention heatmaps (top paths + hot files) per agent.
- Snapshots also include reviewer cohort context (`human_reviewer_count`, association breakdowns), bot review behavior (`bot_block_events`, `bot_block_overrides`), provenance anomalies (`force_push_after_approval_count`), and CI failure taxonomy (failing check names and contexts) to highlight operational hotspots.
- `/v1/analytics/review-alerts` highlights agents/analyses where bot change requests were overridden or force pushes occurred post-approval.
- `/v1/analytics/review-load` reports human vs. bot review load per agent, while `/v1/analytics/review-load/teams` aggregates human reviewer effort by the configured team map.
- Review-focused metrics (`review_comments`, `unique_reviewers`, `review_events`, `agent_comment_mentions`) continue to leverage GitHub PR data when credentials are supplied; classification metrics reflect the resolver's heuristic labeling of each conversation snippet.
- Use `PROVENANCE_ANALYTICS_DEFAULT_WINDOW` or query parameters such as `?time_window=14d` to track longer horizons and compare agents (see the query sketch below).
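For instance, a quick comparison over a longer window might look like this sketch. The metric keys are those listed above; the flat top-level response shape is an assumption, so verify against `/docs`.

```python
# Sketch: fetch a 14-day analytics summary and print selected metrics.
import requests

summary = requests.get(
    "http://localhost:8000/v1/analytics/summary",
    params={"time_window": "14d"},
    timeout=30,
).json()

for key in ("code_volume", "code_churn_rate", "ci_time_to_green_hours", "bot_block_overrides"):
    print(key, summary.get(key))
```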
To smoke-test the GitHub resolver end-to-end:
- Export `PROVENANCE_GITHUB_TOKEN` with repo-scoped access.
- Create a throwaway branch and open a PR against `main`; include a commit trailer such as `Agent-ID: test-agent`.
- POST a synthetic analysis payload pointing at that PR (`repo`, `pr_number`, `head_sha`) so the resolver can hydrate metadata.
- Inspect the resulting `analysis` record and `/v1/analytics/summary` output to confirm review/CI signals flowed through.
The same process works against forks or sandboxes—helpful when validating new heuristics without polluting production repositories.
- A composite GitHub Action is bundled at `clients/github-action/`. Reference it from `.github/workflows/provenance.yml` and pass `api_url` + `api_token` secrets to submit each pull request diff. The action fails automatically when the governance outcome is `block`.
- The workflow helper collects the PR diff (`base_sha..head_sha`), submits it to `/v1/analysis`, polls `/v1/analysis/{id}`, and prints the enriched decision payload so reviewers can inspect risk summaries inline.
- Consume `/v1/analysis/{id}/sarif` when you need static-analysis interoperability (e.g., uploading to GitHub code scanning or aggregating findings in other dashboards).
- Surface decision bundles in CI by hitting `/v1/analysis/{id}/bundle` (e.g., attach the DSSE envelope as a build artifact) to preserve signed provenance for downstream policy checks.
- See docs/ci-integration.md for a comprehensive workflow guide, SARIF upload recipe, artifact archiving, and non-GitHub CI examples (a minimal gate script sketch follows this list).
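For non-GitHub CI systems, a small gate script can replicate the action's block behavior. This is a sketch: `PROVENANCE_API_URL`, `ANALYSIS_ID`, and the `outcome` field name are hypothetical placeholders, not documented names.

```python
# Sketch: fail the build when the governance decision is "block".
# PROVENANCE_API_URL / ANALYSIS_ID are hypothetical env var names, and
# "outcome" is an assumed field for the allow/block/warn verdict.
import os
import sys

import requests

api_url = os.environ["PROVENANCE_API_URL"]
analysis_id = os.environ["ANALYSIS_ID"]

decision = requests.get(f"{api_url}/v1/analysis/{analysis_id}/decision", timeout=30).json()
if decision.get("outcome") == "block":
    print("Provenance governance blocked this change:", decision.get("evidence"))
    sys.exit(1)
```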
- Each analysis generates an `analysis_metrics` event written to `data/timeseries_events.jsonl` by default.
- Switch `PROVENANCE_TIMESERIES_BACKEND` to `bigquery` or `snowflake` and provide the project/database/dataset/table knobs to buffer events for warehouse loaders.
- Set the backend to `off` to disable exports entirely.
- Install warehouse dependencies when needed: `uv sync --group warehouse` (installs `snowflake-connector-python`).
- Install observability exporters via `uv sync --group observability` when enabling OTLP (installs `opentelemetry-exporter-otlp`). For Prometheus support, install `opentelemetry-exporter-prometheus` manually.
- Set `PROVENANCE_OTEL_ENABLED=true` to emit OpenTelemetry metrics (the console exporter is used by default).
- Event payloads include per-agent code volume, churn rates, complexity heuristics, and counts by finding category/severity (see the consumption sketch after this list).
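Because the `file` backend writes plain JSONL, downstream tooling can consume it directly. A small aggregation sketch follows; the `agent_id` and `code_volume` event fields are assumptions, so inspect a real event line to confirm the schema.

```python
# Sketch: tally exported analysis_metrics events per agent from the JSONL file.
import json
from collections import Counter
from pathlib import Path

volume = Counter()
for line in Path("data/timeseries_events.jsonl").read_text().splitlines():
    event = json.loads(line)
    volume[event.get("agent_id", "unknown")] += event.get("code_volume", 0)

print(volume.most_common(5))
```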
- Run `make clickhouse-up` to launch a local ClickHouse instance with the starter schema from `infrastructure/clickhouse/schema.sql`.
- Configure `PROVENANCE_TIMESERIES_BACKEND=clickhouse`, `PROVENANCE_CLICKHOUSE_URL=http://localhost:8123`, and `PROVENANCE_TIMESERIES_TABLE=analysis_events` (or point at your own table) to mirror analytics events into ClickHouse.
- Downstream jobs can query the `provenance` database (tables: `analysis_events`, `findings`, `review_events`) for long-horizon reporting while Redis continues to serve hot state (see the query sketch after this list).
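For example, a long-horizon rollup over the mirrored events through ClickHouse's HTTP interface could look like this sketch. The `agent_id` column is an assumption; check `infrastructure/clickhouse/schema.sql` for the real column names.

```python
# Sketch: query mirrored analytics events via ClickHouse's HTTP API.
import requests

query = """
SELECT agent_id, count() AS analyses
FROM provenance.analysis_events
GROUP BY agent_id
ORDER BY analyses DESC
LIMIT 10
FORMAT JSON
"""

resp = requests.post("http://localhost:8123", data=query, timeout=30)
resp.raise_for_status()
print(resp.json()["data"])
```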
- Install dashboard dependencies: `uv sync --group dashboard`
- Launch the Streamlit UI: `uv run --group dashboard -- streamlit run dashboards/agent_dashboard.py`
- Set `PROVENANCE_DASHBOARD_API` to point at your deployed API when running remotely.
- To enable trend charts, set `PROVENANCE_DASHBOARD_EVENTS` to a path containing the exported JSONL events (defaults to `data/timeseries_events.jsonl`).
- If the Prometheus exporter is enabled (`PROVENANCE_OTEL_EXPORTER=prometheus`), scrape metrics from `/metrics`.
- Generate an OpenAPI schema with `make docs` (writes `openapi.json`).
- A lightweight synchronous client lives in `clients/python`; use `ProvenanceClient` for basic ingestion/status/analytics calls.
- Async support is available via `AsyncProvenanceClient`. Install the client SDK with `pip install provenance[client]` and import from `clients.python`. A hypothetical usage sketch follows this list.
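Usage might look like the following sketch; the method names are hypothetical placeholders, so check `clients/python` for the real surface.

```python
# Hypothetical usage sketch -- the method names below are illustrative only;
# consult clients/python for the actual ProvenanceClient API.
from clients.python import ProvenanceClient

client = ProvenanceClient(base_url="http://localhost:8000")
analysis = client.submit_analysis({"repo": "acme/shop", "pr_number": "42"})  # hypothetical
status = client.get_analysis(analysis["id"])                                 # hypothetical
summary = client.analytics_summary(time_window="14d")                        # hypothetical
```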
- CI Integration Guide – Configure GitHub Actions, upload SARIF, archive decision bundles, and adapt the workflow to other CI systems.
- API Cookbook – Use curl/Python snippets to ingest analyses, poll decisions, and pull evidence or analytics.
- Governance & Risk Model – Understand decision flow, thresholds, and tuning guidance.
- Configuration Reference – Environment variables grouped by subsystem with defaults and usage tips.
- Detector Authoring Guide – Build custom detectors, register modules, and manage rule packs.
- Deployment & Operations Guide – Deploy with Docker/Kubernetes, scale detectors, and instrument observability.
- SARIF Reporting – Understand the SARIF 2.1.0 output, severity mapping, and customization hooks.
- DSSE Decision Bundles – Inspect the envelope schema, verify signatures, and integrate with transparency logs.
- Release Process – Publish application releases and Helm charts with the bundled workflows.
- Analyses – Stored as JSON blobs keyed by `analysis:{analysis_id}` with a sorted set index for time-window queries.
- Changed Lines – Backed by Redis lists (`analysis:{id}:lines`) to retain insertion order and provenance data.
- Findings – Redis hashes per analysis hold Semgrep findings for quick lookup.
- Decisions – Single JSON document per analysis storing policy outcome and evidence (see the inspection sketch after this list).
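These keys can be inspected directly with `redis-py`; a short sketch under the key layout above (substitute a real analysis id, and note the `repo` field read is an assumption about the blob's contents):

```python
# Sketch: read one analysis and its changed lines using the key layout above.
import json

import redis

r = redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)

analysis_id = "some-analysis-id"  # substitute a real id
analysis = json.loads(r.get(f"analysis:{analysis_id}"))
changed_lines = r.lrange(f"analysis:{analysis_id}:lines", 0, -1)
print(analysis.get("repo"), "-", len(changed_lines), "changed lines")
```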
```bash
# Static checks (Python bytecode compilation)
make compile

# Run the test suite (uses fakeredis for analytics tests)
make test
```

Use the provided Dockerfile to build and run the API locally:

```bash
docker build -t provenance-api .
docker run --rm -p 8000:8000 provenance-api
```

Or bring up the service plus Redis via docker compose:

```bash
docker compose up --build
```

The API will be available at http://localhost:8000 with Prometheus metrics (if enabled) exposed at `/metrics`.
- Wire Semgrep to enterprise rule registries or rule bundles.
- Stream analysis events to your data warehouse instead of relying solely on Redis.
- Enrich analytics endpoints with MTTR, suppression rates, and trend deltas.