DocExtract AI

Document-extraction RAG system: turns messy PDFs into structured data with eval-gated CI, cost-aware model routing, citation grounding, and a live demo.

Metric	Value	Basis
Extraction accuracy (F1)	95.5%	CI-replayed, 28-case deterministic baseline; independent Gemini judge eliminates self-grading bias (ADR-0018)
Test suite	1,280 tests, 81% coverage	80% CI gate enforced
Eval corpus	72 cases (51 golden + 21 adversarial)	28 deterministic-replay in CI + 44 live-metered when API budget attached; adversarial set covers injection, PII leak, hallucination bait
Modeled estimates	cost ~$0.03/doc · latency ~4.1s p95	Pricing table × call distribution (modeled, not metered); reproduce metered numbers with `scripts/benchmark.py` (see docs/cost-model.md)

What this does

FastAPI service that extracts structured data from PDFs and other documents via a two-pass Claude pipeline (draft + verify). Extracted records are embedded in pgvector for semantic search. Quality is measured continuously by an LLM-as-judge (Gemini 2.5 Flash, 10% sampling) and enforced in CI via an eval gate that fails on >3-point F1 regression.
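
The regression rule above (fail when F1 drops by more than 3 points) can be sketched as follows. This is a minimal illustration, not the project's `eval-gate.yml` logic: the function names `f1_score` and `gate_passes` and the field-level counting are assumptions made for the example.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from true/false positive counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def gate_passes(baseline_f1: float, candidate_f1: float,
                max_drop_points: float = 3.0) -> bool:
    """Both scores are in percentage points; a drop of exactly 3.0 still passes."""
    return (baseline_f1 - candidate_f1) <= max_drop_points


if __name__ == "__main__":
    # Hypothetical candidate score compared against the README's 95.5% baseline.
    print(gate_passes(95.5, 93.0))
```

A CI job built on this idea would exit non-zero when `gate_passes` returns `False`, which blocks the merge.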

Why this is interesting (engineering)

Eval-gated CI: eval-gate.yml replays a 28-case deterministic baseline at zero API cost via scripts/eval_offline_replay.py; PRs touching prompts or extraction services must pass before merge
Cost-aware model routing: Claude Haiku for classification (saves ~67% vs Sonnet, <2% quality loss confirmed by A/B z-test), Sonnet for extraction; prompt caching cuts repeat-call cost ~60%
Independent judge: Gemini 2.5 Flash grades extractions to eliminate self-grading bias; documented in ADR-0018
Circuit breaker fallback: Sonnet → Haiku with dead-letter queue, idempotent retries, and HMAC-signed webhooks
OpenTelemetry cost attribution: per-request USD cost computed from token counts via app/services/cost_tracker.py; exported as OTel metrics to Grafana
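
To make the cost-attribution idea concrete, here is a rough sketch of turning token counts into a per-request USD figure. It is not the contents of `app/services/cost_tracker.py`: the `Usage` type, the `request_cost_usd` function, the model names, and every price in the table are placeholders chosen for illustration, not real vendor pricing.

```python
from dataclasses import dataclass

# Placeholder USD prices per million tokens. These are NOT real vendor prices;
# a real table would be loaded from configuration and kept up to date.
PRICES_PER_MTOK = {
    "small-model": {"input": 1.00, "output": 5.00},
    "large-model": {"input": 3.00, "output": 15.00},
}


@dataclass(frozen=True)
class Usage:
    """Token counts reported for a single model call (hypothetical shape)."""
    model: str
    input_tokens: int
    output_tokens: int


def request_cost_usd(usage: Usage) -> float:
    """Multiply token counts by the per-million-token price for the model."""
    price = PRICES_PER_MTOK[usage.model]
    total = (usage.input_tokens * price["input"]
             + usage.output_tokens * price["output"])
    return total / 1_000_000


if __name__ == "__main__":
    print(request_cost_usd(Usage("small-model", 2_000, 1_000)))
```

The resulting number could then be recorded on whatever metrics instrument the service uses, tagged by model, so dashboards can break cost down per route.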

Architecture

graph LR
  A[Browser / API Client] -->|POST /documents| B[FastAPI]
  B -->|enqueue| C[ARQ Worker]
  C -->|classify| D{Model Router}
  D -->|primary| E[Claude Sonnet]
  D -->|fallback| F[Claude Haiku]
  E -->|Pass 2: extract + correct| G[pgvector HNSW]
  B -->|SSE stream stages| A
  G -->|semantic search| B
  B -->|/metrics| H[Prometheus]
  D --- I[Circuit Breaker]
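
The primary/fallback edges and the circuit breaker in the diagram can be sketched roughly as below. This is a generic circuit-breaker pattern under stated assumptions, not the project's implementation: `CircuitBreaker`, `call_with_fallback`, the failure threshold, and the cool-down period are all hypothetical, and `primary` / `fallback` stand in for the Sonnet and Haiku calls.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures and closes again after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker and allow a trial call.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(breaker: CircuitBreaker,
                       primary: Callable[[str], str],
                       fallback: Callable[[str], str],
                       payload: str) -> str:
    """Try the primary model unless the breaker is open; otherwise use the fallback."""
    if not breaker.is_open():
        try:
            result = primary(payload)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(payload)
```

Injecting the clock keeps the breaker testable without sleeping; a production version would also need to decide which exception types count as failures.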

Demo

A live demo runs on Streamlit Cloud (allow ~60s for a cold start), or run it locally with no API key:

DEMO_MODE=true streamlit run frontend/app.py

Progress streams over two real Server-Sent Events endpoints: /jobs/{id}/events (extraction stages) and /agent-search/stream (agentic retrieval reasoning).
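
For readers unfamiliar with Server-Sent Events, the sketch below shows how a client might turn the raw line stream into events. It follows the general `event:` / `data:` framing of the SSE format; the function name `iter_sse_events`, the event name `stage`, and the JSON payload are assumptions for illustration, since the README does not document what these endpoints actually emit.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event = "message"          # default event type when no `event:` field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue           # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


if __name__ == "__main__":
    # Hypothetical stream content; real stage names may differ.
    sample = ['event: stage\n', 'data: {"stage": "classify"}\n', '\n']
    for name, payload in iter_sse_events(sample):
        print(name, payload)
```

In practice the lines would come from a streaming HTTP response against `/jobs/{id}/events` rather than from a list.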

Install

git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env  # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501  # Streamlit UI

Services: API :8000 (/docs for Swagger) | Frontend :8501 | PostgreSQL :5432 | Redis :6379

Tests

pytest tests/ -v                      # 1,280 collected tests
python scripts/run_eval_ci.py --ci    # Deterministic eval (no API key)
make eval                             # Full eval suite (~$0.44, ~4 min)

Architecture Decisions

19 ADRs at docs/adr/. Key decisions:

ADR	Decision
ADR-0003	Two-pass Claude extraction with confidence gating
ADR-0006	Circuit breaker model fallback chain
ADR-0015	Anthropic prompt caching, 60%+ eval cost reduction
ADR-0017	Two-layer semantic cache (L1 exact hash + L2 embedding similarity)
ADR-0018	Gemini 2.5 as independent judge (eliminates self-grading bias)
ADR-0019	TF-IDF reranker + agentic self-reflection loop
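
As an illustration of the two-layer cache named in ADR-0017, the sketch below checks an exact hash first (L1) and falls back to embedding similarity (L2). It is a toy in-memory version under stated assumptions: the class `TwoLayerCache`, the 0.95 threshold, the text normalization, and the injected `embed` function are all hypothetical, and the ADR itself may specify different choices.

```python
import hashlib
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.95) -> None:
        self._embed = embed                  # hypothetical embedding function
        self._threshold = threshold          # assumed similarity cut-off
        self._exact: Dict[str, str] = {}
        self._vectors: List[Tuple[Sequence[float], str]] = []

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def put(self, text: str, value: str) -> None:
        self._exact[self._key(text)] = value
        self._vectors.append((self._embed(text), value))

    def get(self, text: str) -> Optional[str]:
        # L1: exact match on the normalized text hash.
        hit = self._exact.get(self._key(text))
        if hit is not None:
            return hit
        # L2: best stored entry whose similarity meets the threshold.
        query = self._embed(text)
        best_value, best_score = None, self._threshold
        for vector, value in self._vectors:
            score = cosine(query, vector)
            if score >= best_score:
                best_value, best_score = value, score
        return best_value
```

The linear scan in L2 is only for clarity; a service that already stores embeddings in pgvector would more plausibly run the similarity lookup there.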

More: CASE_STUDY.md | DEMO.md | docs/eval-methodology.md | docs/cost-model.md

License

MIT
