Prediction cache generation and evaluation pipeline for idp-leaderboard.org.
Three benchmarks, each testing a different axis of document understanding:
| Benchmark | What it tests | Scale |
|---|---|---|
| OlmOCR Bench | Math, tables, reading order, text presence/absence | 1,403 pages · 7,010 tests |
| OmniDocBench | Text extraction, formula recognition, table structure, reading order | 1,355 pages · 18K+ samples |
| IDP Core | Key info extraction, OCR, table parsing, visual QA | 6,406 samples across 4 tasks |
All benchmark data is fetched automatically — no manual downloads needed for most workflows.
| Benchmark | Ground truth | Images / PDFs |
|---|---|---|
| OlmOCR Bench | JSONL from allenai/olmOCR-bench (auto-cached to ground_truth/) |
PDFs + pre-rendered PNGs from HuggingFace |
| OmniDocBench | OmniDocBench.json (local, from opendatalab/OmniDocBench) |
Local if available, falls back to HuggingFace URLs |
| IDP Core | Loaded via docext from HuggingFace datasets |
Embedded in dataset (base64) |
OmniDocBench is the only benchmark that needs a local file (OmniDocBench.json). Clone the dataset repo or point --omnidoc-root at a directory containing it.
Requires Python 3.10+.
Automated setup (recommended for first-time users):
./setup.shThis creates a virtualenv, installs all dependencies (including Playwright for olmOCR evaluation), sets up a .env template, and validates the install.
Manual setup:
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m playwright install chromium # needed for olmOCR evaluationFor IDP benchmark, also install docext:
git clone https://github.com/NanoNets/docext.git ../docext
pip install -e ../docextFor OmniDocBench, clone the dataset:
git clone https://huggingface.co/datasets/opendatalab/OmniDocBench ../OmniDocBenchSet your API keys (create a .env file or export directly):
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AIza..."
export ANTHROPIC_API_KEY="sk-ant-..."All three benchmarks support --provider litellm --model-id <litellm_model_id> for any model that LiteLLM can route to. This is the primary way to benchmark third-party models.
LiteLLM model IDs follow the provider/model-name format, e.g.:
openai/gpt-5.4-2026-03-05gemini/gemini-3.1-pro-previewanthropic/claude-sonnet-4-6
python benchmarks/olmocr/run.py \
--model gpt-5.4 \
--provider litellm \
--model-id openai/gpt-5.4-2026-03-05 \
--workers 25python benchmarks/omnidocbench/run.py \
--model gpt-5.4 \
--provider litellm \
--model-id openai/gpt-5.4-2026-03-05 \
--omnidoc-root ~/Nanobench/data/omnidocbench \
--workers 25python benchmarks/idp/run.py \
--model gpt-5.4 \
--provider litellm \
--model-id openai/gpt-5.4-2026-03-05 \
--workers 25For the Nanonets model, use the default provider (no --provider flag needed):
python benchmarks/olmocr/run.py --model nanonets --workers 10
python benchmarks/omnidocbench/run.py --model nanonets --workers 10
python benchmarks/idp/run.py --model nanonets --workers 5After prediction caches are populated, run evaluation:
# OlmOCR
python benchmarks/olmocr/evaluate.py --model gpt-5.4
python benchmarks/olmocr/evaluate.py --model gpt-5.4 --postprocess
# OmniDocBench (Docker recommended for full CDM metric)
python benchmarks/omnidocbench/evaluate.py --model gpt-5.4 --docker
python benchmarks/omnidocbench/evaluate.py --model gpt-5.4 --host # partial, no CDM
# IDP
python benchmarks/idp/evaluate.py --model gpt-5.4After generating caches and evaluations:
python scripts/consolidate_results.pysetup.sh Automated first-time setup
models/
litellm_model.py LiteLLM adapter (OpenAI, Anthropic, Google, etc.)
nanonets.py Nanonets OCR2+ API client
benchmarks/
olmocr/ OlmOCR Bench — run, evaluate, postprocess
omnidocbench/ OmniDocBench — run, evaluate, postprocess
idp/ IDP Core — run, evaluate
scripts/
consolidate_results.py Merge best runs into canonical results/
validate_caches.py Sanity-check prediction caches
migrate_caches.py Restructure legacy cache layouts
rerun_empty.py Re-run empty/failed cache entries
caches/{model}/ Prediction caches (gitignored, auto-created)
ground_truth/ Auto-downloaded ground truth (gitignored)
results/{model}/ Evaluation output JSON files
| Variable | Required for | Description |
|---|---|---|
OPENAI_API_KEY |
GPT models | OpenAI API key (auto-read by LiteLLM) |
GOOGLE_API_KEY |
Gemini models | Google AI API key (auto-read by LiteLLM) |
ANTHROPIC_API_KEY |
Claude models | Anthropic API key (auto-read by LiteLLM) |
NANONETS_API_KEY |
Nanonets model | API key for extraction-api.nanonets.com |