Recover an image's signed origin after the web strips its metadata — and verify it offline.
Provenance metadata embedded in image files (C2PA / Content Credentials) is stripped by the open web within seconds — screenshots, recompression, resizing, and CDN transcodes all destroy it. veriroot stores a robust perceptual fingerprint of an image at registration time plus a tamper-evident, signed record of its origin. Later, even after every embedded byte of provenance is gone, the origin is recovered by resemblance (think Shazam, for images) and the recovered record is independently verifiable.
Attribution, not authenticity. A record proves who registered what, and when, never that the pixels are true. Output is always recovered/no_match/uncertain, never "real" / "fake". This is both the honest framing and the liability firewall.
This repository is a v1 proof-of-concept: fully working end to end, and benchmarked honestly so it quantifies exactly where the approach holds and where it breaks.
The proof of origin normally rides inside the file, so it falls off the moment the file travels. Detecting fakes is a losing arms race, and running a provenance network is a chicken-and-egg problem. veriroot takes a third path: let accountable parties prove they stood behind a specific image at a specific time, in a way that survives the web mangling it — and make that proof valuable the instant you register, with no network required.
Registering an image issues a portable proof-of-existence certificate. Anyone can verify it with the service switched off, against a published signed root:
$ veriroot-verify photo.png.cert.json --image photo.png
[PASS] record_hash matches canonical record
[PASS] registrant signature
[PASS] merkle inclusion proof
[PASS] signed tree head signature
[PASS] image asset.sha256 (exact bytes)
RESULT: VERIFIED
Alter the image, the record, or the proof and it flips to FAILED. This is the court-usable / takedown-usable proof of priority that makes registration valuable even with an empty registry.
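The first and last checks in that output come down to recomputing two SHA-256 digests. The following is a minimal sketch, not veriroot's actual verifier: the certificate fields `record`, `record_hash`, and `record["asset"]["sha256"]` are assumed for illustration, and `json.dumps` with sorted keys only approximates JCS (RFC 8785) for simple ASCII records without floats.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    # Approximation of JCS: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def check_hashes(cert: dict, image_bytes: bytes) -> dict:
    # Hypothetical certificate layout; the real schema may differ.
    record = cert["record"]
    record_ok = (
        hashlib.sha256(canonical_bytes(record)).hexdigest() == cert["record_hash"]
    )
    asset_ok = hashlib.sha256(image_bytes).hexdigest() == record["asset"]["sha256"]
    return {"record_hash": record_ok, "asset_sha256": asset_ok}


# Build a toy certificate, then check it against the original and altered bytes.
image = b"example image bytes"
record = {"asset": {"sha256": hashlib.sha256(image).hexdigest()}, "registrant": "key-1"}
cert = {
    "record": record,
    "record_hash": hashlib.sha256(canonical_bytes(record)).hexdigest(),
}
result_ok = check_hashes(cert, image)
result_tampered = check_hashes(cert, image + b"!")
```

Changing a single byte of the image leaves the record hash intact but fails the asset check, which is the behaviour the `--image` flag is described as enforcing.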
register → SSCD + PDQ fingerprint → canonical OriginRecord (JCS) → Ed25519-sign
→ append to an RFC 6962 Merkle log → sign + publish the root → pgvector upsert
→ issue a CERTIFICATE
recover → fingerprint the query → exact cosine + PDQ Hamming → fuse → 3-state result + inclusion proof
verify → (offline) recompute hash + root, check signatures, match the published root
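The "recompute hash + root" step of verify can be sketched with the RFC 6962 hashing rules (leaf prefix `0x00`, interior-node prefix `0x01`). This is an illustrative sketch using only the standard library; the function names are hypothetical and veriroot's own log code may be organised differently.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def split_point(n: int) -> int:
    # Largest power of two strictly less than n (requires n >= 2).
    return 1 << ((n - 1).bit_length() - 1)


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        return hashlib.sha256(b"").digest()
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = split_point(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_path(index: int, leaves: list[bytes]) -> list[bytes]:
    # Sibling hashes from the leaf up to the root (the RFC 6962 audit path).
    if len(leaves) == 1:
        return []
    k = split_point(len(leaves))
    if index < k:
        return inclusion_path(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return inclusion_path(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def root_from_path(index: int, size: int, leaf: bytes, path: list[bytes]) -> bytes:
    # Mirror of inclusion_path: fold the siblings back into a root hash.
    if size == 1:
        if path:
            raise ValueError("path too long")
        return leaf
    if not path:
        raise ValueError("path too short")
    k = split_point(size)
    if index < k:
        return node_hash(root_from_path(index, k, leaf, path[:-1]), path[-1])
    return node_hash(path[-1], root_from_path(index - k, size - k, leaf, path[:-1]))


def verify_inclusion(index: int, size: int, record: bytes,
                     path: list[bytes], root: bytes) -> bool:
    if not 0 <= index < size:
        return False
    try:
        return root_from_path(index, size, leaf_hash(record), path) == root
    except ValueError:
        return False


records = [f"record-{i}".encode() for i in range(7)]
root = merkle_root(records)
proof = inclusion_path(4, records)
ok = verify_inclusion(4, len(records), records[4], proof, root)
forged = verify_inclusion(4, len(records), b"forged", proof, root)
```

The verifier needs only the record, its index, the tree size, the sibling hashes, and the root, which is why the check can run with the service switched off. Checking the signature over the root is a separate step not shown here.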
Two fingerprints are stored per image: SSCD (a 512-d self-supervised copy-detection embedding, robust to compression / resize / screenshots / moderate crops) and PDQ (a 256-bit DCT hash, cheap and near-perfect on format and compression). They cover each other's blind spots, and the benchmark reports both plus their fusion.
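As a rough illustration of how two such scores might be fused into the three-state result, the sketch below compares an embedding by cosine similarity and a hash by Hamming distance, then accepts if either engine is confident. The thresholds and the OR-style fusion rule are assumptions made for this example; the benchmark's frozen thresholds and veriroot's actual fusion logic are not reproduced here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hamming_distance(h1: int, h2: int) -> int:
    # Hashes are held as Python ints; XOR marks the differing bits.
    return bin(h1 ^ h2).count("1")


# Hypothetical thresholds, for illustration only.
COS_ACCEPT, COS_MAYBE = 0.75, 0.50   # embedding: higher means more similar
HAM_ACCEPT, HAM_MAYBE = 31, 90       # 256-bit hash: lower means more similar


def fuse(cos_sim: float, ham: int) -> str:
    # Either engine may carry the match, so each covers the other's blind spots.
    if cos_sim >= COS_ACCEPT or ham <= HAM_ACCEPT:
        return "recovered"
    if cos_sim >= COS_MAYBE or ham <= HAM_MAYBE:
        return "uncertain"
    return "no_match"


query_vec, stored_vec = [1.0, 0.0, 1.0], [1.0, 0.1, 0.9]
query_hash, stored_hash = 0b1011 << 200, 0b1001 << 200
state = fuse(
    cosine_similarity(query_vec, stored_vec),
    hamming_distance(query_hash, stored_hash),
)
```

With this rule a query whose embedding score has collapsed can still be recovered when the hash distance is small, and vice versa. The cost is that the fused false-match rate can be no lower than that of either engine alone.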
The local "spine" (API, Merkle log, certificate, offline verifier, tests) has a tiny footprint — no GPU, no torch required.
make install # core + dev deps (no torch)
make db # Postgres + pgvector via Docker Compose
make migrate # create the schema
make serve # FastAPI on :8000 — open http://localhost:8000 for the web UI
make register IMG=photo.png # writes photo.png.cert.json
make recover IMG=transformed.png # three-state recovery result
make verify CERT=photo.png.cert.json IMG=photo.png # offline certificate check
make test # 23 tests: Merkle (property-based), certificate, fingerprint, API
make demo # Phase 0 acceptance: rank-1 over 100 distractors
Without the SSCD model installed, veriroot uses a deterministic DCT-descriptor fallback (labelled dct-fallback-v1 in every record and in RESULTS.md) so the whole pipeline runs offline. Install the real engine with:
make install-sscd # adds torch / torchvision
make model # download + SHA256-verify the pinned sscd_disc_mixup
make bench runs the full harness on a seeded synthetic corpus and never spends money. Because real image corpora and the SSCD model are large, the heavy benchmark is designed to run on Kaggle (the corpus stays in the cloud, embeddings happen there, and only the results come back):
make bench # synthetic corpus, offline → benchmark/out/ + RESULTS.md
python scripts/run_bench_kaggle.py # run on Kaggle against a real corpus; pull results back
The corpus is pluggable: synthetic (default, zero-download), folder:PATH (a DISC2021 or CC0 subset), or kaggle:owner/slug. A PDQ-based dedup / leakage guard keeps the registered and distractor sets disjoint.
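One way such a leakage guard could work is to drop any distractor whose hash lies within a small Hamming distance of a registered image. The sketch below is an assumption about the general idea, not the benchmark's code: the function name and the 31-bit cutoff are hypothetical, and hashes are represented as plain Python integers.

```python
def hamming(h1: int, h2: int) -> int:
    return bin(h1 ^ h2).count("1")


def drop_leaky_distractors(registered: list[int], distractors: list[int],
                           max_distance: int = 31) -> list[int]:
    """Keep only distractors farther than max_distance bits from every registered hash."""
    return [
        d for d in distractors
        if all(hamming(d, r) > max_distance for r in registered)
    ]


registered = [0x0F0F << 64, 0xFFFF]
distractors = [0xFFFE, (1 << 256) - 1, 0x0F0F << 64]
clean = drop_leaky_distractors(registered, distractors)
```

Here the near-duplicate `0xFFFE` and the exact duplicate are removed and only the unrelated hash survives. The pairwise scan is quadratic, which is fine for a sketch but would need an index on a large corpus.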
From a real run with the sscd_disc_mixup model on 2,000 natural photos. Thresholds were selected on DEV_SUITE only, frozen, then applied unchanged to the held-out suites. Full numbers and figures live in RESULTS.md.
| Suite | Transforms | Recovery (fused) | µAP |
|---|---|---|---|
| DEV | jpeg70 + resize½ + crop5% | 0.86 | 0.93 |
| HELDOUT_A | jpeg50 + resize¾ + crop10% + resave | 0.44 | 0.53 |
| HELDOUT_B | screenshot + jpeg40 + crop15% | 0.61 | 0.76 |
| HELDOUT_SCREENSHOT | resize + jpeg60 + gamma + pad | 0.90 | 0.95 |
The two engines genuinely complement each other: on the screenshot suite SSCD alone collapses but PDQ carries it to 0.90, while on the crop suites PDQ collapses and SSCD holds. False-match rates stay at 0.001–0.006, and the DEV → held-out generalization gap (0.21 fused) is reported, not hidden.
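The select-on-DEV, freeze, then apply protocol can be illustrated with a toy example. The scores below are made up for illustration and are not benchmark data; the helper names, the candidate grid, and the target false-match rate are all hypothetical.

```python
def false_match_rate(negatives: list[float], threshold: float) -> float:
    # Fraction of non-matching pairs that would be wrongly accepted.
    return sum(s >= threshold for s in negatives) / len(negatives)


def recovery_rate(positives: list[float], threshold: float) -> float:
    # Fraction of true pairs that would be accepted.
    return sum(s >= threshold for s in positives) / len(positives)


def select_threshold(dev_negatives: list[float], target_fmr: float,
                     grid: list[float]) -> float:
    # Lowest candidate threshold that meets the false-match target on DEV.
    for t in sorted(grid):
        if false_match_rate(dev_negatives, t) <= target_fmr:
            return t
    raise ValueError("no threshold on the grid meets the target")


# Toy similarity scores (illustrative only).
dev_negatives = [0.10, 0.20, 0.25, 0.30, 0.55]
dev_positives = [0.50, 0.70, 0.80, 0.90]
heldout_positives = [0.35, 0.45, 0.65, 0.85]

grid = [i / 20 for i in range(21)]
threshold = select_threshold(dev_negatives, target_fmr=0.0, grid=grid)

# The threshold is now frozen: held-out data never influences it.
dev_recovery = recovery_rate(dev_positives, threshold)
heldout_recovery = recovery_rate(heldout_positives, threshold)
gap = dev_recovery - heldout_recovery
```

Because the threshold is fixed before the held-out scores are seen, the difference between `dev_recovery` and `heldout_recovery` is a direct measurement of the generalization gap and is not tuned away.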
veriroot measures its failure modes rather than pretending to beat them:
- Heavy cropping degrades resemblance — the hardest case, quantified per transform.
- Generative regeneration (diffusion img2img) can repaint a look-alike with a new fingerprint — an open research frontier; the probe is flag-gated and spend-capped.
- Deliberate collisions are reported with the exact attacker budget assumed (the black-box probe found 0% success under its stated budget).
- Identity is out of scope for v1: records bind to a key, not a verified real-world identity. That key→identity layer is the genuinely hard trust problem and is deferred to v2.
See docs/THREAT_MODEL.md for the full picture.
src/veriroot/ the service spine (fingerprint, log, store, services, api)
src/veriroot_verify/ standalone offline certificate verifier (no service dependency)
benchmark/ corpus, transforms, metrics, threshold selection, adversarial probes
tests/ property-based + integration tests
scripts/ CLI, Phase-0 demo, model fetcher, Kaggle orchestration
docs/ threat model
- docs/THREAT_MODEL.md: what v1 proves, what it does not, and the measured failure modes.
- RESULTS.md: the latest benchmark dashboard (regenerated by make bench).
v1 is complete: end-to-end register / recover / verify, an offline certificate verifier, a tamper-evident append-only log with consistency proofs, an honest benchmark with held-out suites, and measured adversarial failure modes. Next up: batched embedding for throughput, a full-accuracy run on a larger real corpus, the key→identity binding layer, and a hosted demo.