ChunkyTortoise/docextract

DocExtract AI live demo: document extraction with evaluation scores, agent trace, and cost analysis

DocExtract AI

Document-extraction RAG system: turns messy PDFs into structured data with eval-gated CI, cost-aware model routing, citation grounding, and a live demo.

Tests | Eval Gate | Python 3.10+

| Metric | Value | Basis |
|---|---|---|
| Extraction accuracy (F1) | 95.5% | CI-replayed, 28-case deterministic baseline; independent Gemini judge eliminates self-grading bias (ADR-0018) |
| Test suite | 1,280 tests, 81% coverage | 80% CI gate enforced |
| Eval corpus | 72 cases (51 golden + 21 adversarial) | 28 deterministic-replay in CI + 44 live-metered when API budget attached; adversarial set covers injection, PII leak, hallucination bait |
| Modeled estimates | cost ~$0.03/doc · latency ~4.1s p95 | Pricing table × call distribution; reproduce metered numbers with scripts/benchmark.py (cost-model.md) |

What this does

FastAPI service that extracts structured data from PDFs and other documents via a two-pass Claude pipeline (draft + verify). Extracted records are embedded in pgvector for semantic search. Quality is measured continuously by an LLM-as-judge (Gemini 2.5 Flash, 10% sampling) and enforced in CI via an eval gate that fails on >3-point F1 regression.
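The eval gate described above can be pictured as a simple threshold check on F1. The following is a minimal sketch, not the repository's implementation: the names `f1_score`, `check_eval_gate`, and `GateResult` are hypothetical, and it assumes F1 is computed from field-level true/false positive counts and that "3 points" means 3 percentage points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    baseline_f1: float
    candidate_f1: float
    passed: bool


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 0.0 if denominator == 0 else 2 * true_positives / denominator


def check_eval_gate(
    baseline_f1: float, candidate_f1: float, max_drop_points: float = 3.0
) -> GateResult:
    """Fail when the candidate falls more than `max_drop_points` percentage points below baseline."""
    drop_points = (baseline_f1 - candidate_f1) * 100
    return GateResult(baseline_f1, candidate_f1, passed=drop_points <= max_drop_points)
```

In a CI job, a failing `GateResult` would typically translate into a non-zero exit code so the PR cannot merge.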

Why this is interesting (engineering)

  • Eval-gated CI: eval-gate.yml replays a 28-case deterministic baseline at zero API cost via scripts/eval_offline_replay.py; PRs touching prompts or extraction services must pass before merge
  • Cost-aware model routing: Claude Haiku for classification (saves ~67% vs Sonnet, <2% quality loss confirmed by A/B z-test), Sonnet for extraction; prompt caching cuts repeat-call cost ~60%
  • Independent judge: Gemini 2.5 Flash grades extractions to eliminate self-grading bias; documented in ADR-0018
  • Circuit breaker fallback: Sonnet → Haiku with dead-letter queue, idempotent retries, and HMAC-signed webhooks
  • OpenTelemetry cost attribution: per-request USD cost computed from token counts via app/services/cost_tracker.py; exported as OTel metrics to Grafana
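To illustrate the cost-attribution bullet, here is a minimal sketch of turning token counts into a per-request USD figure. It is not the code in app/services/cost_tracker.py: the function name `request_cost_usd`, the model keys, and every price in the table are hypothetical placeholders chosen for the arithmetic, not real pricing.

```python
from decimal import Decimal

# Hypothetical USD prices per million tokens; placeholders, not real pricing.
PRICES_PER_MTOK = {
    "small-model": {"input": Decimal("1.00"), "output": Decimal("5.00")},
    "large-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request: tokens times the per-million-token price for each direction."""
    price = PRICES_PER_MTOK[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)
```

`Decimal` avoids float rounding drift when many small per-request costs are summed. A value like this could then be recorded on whatever metric instrument the service exports.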

Architecture

graph LR
  A[Browser / API Client] -->|POST /documents| B[FastAPI]
  B -->|enqueue| C[ARQ Worker]
  C -->|classify| D{Model Router}
  D -->|primary| E[Claude Sonnet]
  D -->|fallback| F[Claude Haiku]
  E -->|Pass 2: extract + correct| G[pgvector HNSW]
  B -->|SSE stream stages| A
  G -->|semantic search| B
  B -->|/metrics| H[Prometheus]
  D --- I[Circuit Breaker]
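The primary/fallback edge guarded by the circuit breaker in the diagram can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `CircuitBreaker`, `call_with_fallback`, the thresholds, and the callables standing in for the two models are all hypothetical, and the half-open state is simplified to a full reset after the cool-down.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; closes again after `reset_after_s`."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker and reset the failure count.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(
    breaker: CircuitBreaker,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    document: str,
) -> str:
    """Try the primary unless the breaker is open; on failure or open breaker, use the fallback."""
    if not breaker.is_open():
        try:
            result = primary(document)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(document)
```

While the breaker is open the primary is not called at all, which is what keeps a failing upstream from adding latency to every request.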

Demo

A live demo runs on Streamlit Cloud (allow ~60s for a cold start). You can also run it locally with no API key:

DEMO_MODE=true streamlit run frontend/app.py

Progress streams over two real Server-Sent Events endpoints: /jobs/{id}/events (extraction stages) and /agent-search/stream (agentic retrieval reasoning).
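For readers consuming these streams outside the bundled frontend, the sketch below parses the Server-Sent Events wire format (`event:` and `data:` fields, blank line as event terminator, `:` lines as comments). It is a simplified parser that ignores the `id` and `retry` fields; `parse_sse` and `SSEEvent` are hypothetical names, and the payload shapes the endpoints actually emit are not specified here.

```python
from typing import Iterable, Iterator, NamedTuple


class SSEEvent(NamedTuple):
    event: str
    data: str


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield one SSEEvent per blank-line-terminated block of `event:` / `data:` fields."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carried any data.
            if data_lines:
                yield SSEEvent(event_type, "\n".join(data_lines))
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format allows one optional space after the colon
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
```

Any iterable of text lines works as input, for example the decoded line iterator of a streaming HTTP response.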

Install

git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env  # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501  # Streamlit UI

Services: API :8000 (/docs for Swagger) | Frontend :8501 | PostgreSQL :5432 | Redis :6379

Tests

pytest tests/ -v                      # 1,280 collected tests
python scripts/run_eval_ci.py --ci    # Deterministic eval (no API key)
make eval                             # Full eval suite (~$0.44, ~4 min)

Architecture Decisions

19 ADRs at docs/adr/. Key decisions:

| ADR | Decision |
|---|---|
| ADR-0003 | Two-pass Claude extraction with confidence gating |
| ADR-0006 | Circuit breaker model fallback chain |
| ADR-0015 | Anthropic prompt caching, 60%+ eval cost reduction |
| ADR-0017 | Two-layer semantic cache (L1 exact hash + L2 embedding similarity) |
| ADR-0018 | Gemini 2.5 as independent judge (eliminates self-grading bias) |
| ADR-0019 | TF-IDF reranker + agentic self-reflection loop |
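The two-layer cache idea from ADR-0017 can be sketched as an exact-hash lookup followed by a nearest-neighbor check on embeddings. This is an illustration only: `TwoLayerCache`, the 0.95 similarity threshold, the key normalization, and the injected `embed` function are assumptions, and a linear scan stands in for whatever vector index the project uses.

```python
import hashlib
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


class TwoLayerCache:
    """L1: exact match on a hash of the normalized query. L2: embedding similarity."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self._embed = embed  # hypothetical embedding function supplied by the caller
        self._threshold = threshold
        self._exact: dict[str, str] = {}
        self._semantic: list[tuple[Vector, str]] = []

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        hit = self._exact.get(self._key(query))
        if hit is not None:
            return hit  # L1 hit: no embedding call needed
        vector = self._embed(query)
        best_score, best_value = 0.0, None
        for stored, value in self._semantic:
            score = cosine_similarity(vector, stored)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self._threshold else None

    def put(self, query: str, value: str) -> None:
        self._exact[self._key(query)] = value
        self._semantic.append((self._embed(query), value))
```

Checking L1 first means repeated identical queries never pay for an embedding call; L2 only runs on an exact-match miss.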

More: CASE_STUDY.md | DEMO.md | docs/eval-methodology.md | docs/cost-model.md

License

MIT

About

Production document AI with hybrid retrieval, eval-gated CI, and 95.5% accepted extraction F1. FastAPI + pgvector + Claude API.
