Gemini-powered document classifier that analyzes a complete PDF in a single API call, detects its UBL 2.1 document type, identifies relevant pages, and builds a page-to-document-ID map for downstream extractors.
Given raw PDF bytes, pageclassifier makes a single call to Google Gemini and returns:
- The UBL 2.1 document type (e.g. `380` = Invoice, `381` = Credit Note) with confidence
- Per-page relevance: which pages contain line items or document identifiers
- A page map: each document ID mapped to its pages, including continuation pages where the ID does not repeat
- Irrelevant pages: with confidence and reason, so downstream knows exactly why each page was excluded
- Warnings: if the model returned an incomplete response, missing pages are filled synthetically and flagged
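The synthetic fill described in the last bullet can be pictured with a small stand-alone sketch. It does not reproduce the package's internals: `fill_missing_pages` and the dictionary layout are hypothetical, and it assumes that pages the model omitted are added to the irrelevant list with confidence 0.0, as the result field reference in this README suggests.

```python
def fill_missing_pages(total_pages, relevant, irrelevant):
    """Return (irrelevant_with_fills, warnings) so every page index is covered.

    Hypothetical helper for illustration only; not part of pageclassifier.
    Pages are plain dicts here, not the package's result models.
    """
    seen = {p["page"] for p in relevant} | {p["page"] for p in irrelevant}
    filled = list(irrelevant)
    warnings = []
    for page in range(total_pages):
        if page not in seen:
            # Assumption: a missing page is treated as irrelevant with confidence 0.0.
            filled.append({
                "page": page,
                "confidence": 0.0,
                "reason": "synthetic fill: page missing from model response",
            })
            warnings.append(f"page {page} missing from model response; filled synthetically")
    filled.sort(key=lambda p: p["page"])
    return filled, warnings


relevant = [{"page": 0, "confidence": 0.97, "reason": "invoice header", "document_id": "INV-001"}]
irrelevant = [{"page": 2, "confidence": 0.93, "reason": "terms and conditions"}]

filled, warnings = fill_missing_pages(3, relevant, irrelevant)
print([p["page"] for p in filled])  # [1, 2]
print(len(warnings))                # 1
```

A consumer can then treat any entry with `confidence == 0.0` as "the model said nothing about this page" rather than "the model judged this page irrelevant".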
The intended deployment is as a gate in front of a more expensive extraction step. Downstream extractors consume page_map directly and never touch pages that were not classified as relevant.
PDF bytes
  -> pageclassifier.classify(pdf_bytes)   (single Gemini call)
  -> DocumentClassificationResult
       .page_map          -> downstream extractor (runs only on relevant pages)
       .irrelevant_pages  -> skip
       .warnings          -> alert if model was incomplete
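As a rough illustration of the gate pattern above, the sketch below feeds a `page_map` into a downstream extractor one document at a time. `run_gated_extraction` and `fake_extract` are hypothetical names, and the extractor signature is an assumption; only the `page_map` shape (document ID to list of 0-indexed pages) comes from this README.

```python
from typing import Callable


def run_gated_extraction(
    page_map: dict[str, list[int]],
    extract: Callable[[str, list[int]], dict],
) -> dict[str, dict]:
    """Call the extractor once per document, passing only its relevant pages."""
    results = {}
    for doc_id, pages in page_map.items():
        # Pages absent from page_map are never handed to the extractor.
        results[doc_id] = extract(doc_id, sorted(pages))
    return results


def fake_extract(doc_id: str, pages: list[int]) -> dict:
    """Stand-in for an expensive extraction step (hypothetical)."""
    return {"document_id": doc_id, "page_count": len(pages)}


page_map = {"INV-001": [0, 1], "INV-002": [3]}
extracted = run_gated_extraction(page_map, fake_extract)
print(extracted["INV-001"])  # {'document_id': 'INV-001', 'page_count': 2}
```

Passing the extractor in as a callable keeps the gate independent of whatever extraction backend sits behind it.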
| Code | Type |
|---|---|
| 380 | Commercial Invoice |
| 381 | Credit Note |
| 383 | Debit Note |
| 384 | Corrected Invoice |
| 385 | Freight Invoice |
| 386 | Prepayment Invoice |
| 389 | Self-Billing Invoice |
| 261 | Self-Billing Credit Note |
| 480 | Invoice not subject to VAT |
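Downstream code often needs a readable label for the returned code. A minimal sketch, assuming the code arrives as a string (as `result.ubl_code` does): the dictionary simply mirrors the table above, and `UBL_DOCUMENT_TYPES` and `describe_ubl_code` are illustrative names, not part of the package.

```python
# Mirrors the table of supported codes in this README.
UBL_DOCUMENT_TYPES = {
    "380": "Commercial Invoice",
    "381": "Credit Note",
    "383": "Debit Note",
    "384": "Corrected Invoice",
    "385": "Freight Invoice",
    "386": "Prepayment Invoice",
    "389": "Self-Billing Invoice",
    "261": "Self-Billing Credit Note",
    "480": "Invoice not subject to VAT",
}


def describe_ubl_code(code: str) -> str:
    """Return a human-readable label, falling back for codes outside the table."""
    return UBL_DOCUMENT_TYPES.get(code, f"Unknown document type ({code})")


print(describe_ubl_code("381"))  # Credit Note
print(describe_ubl_code("999"))  # Unknown document type (999)
```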
| Layer | Technology |
|---|---|
| LLM | google-genai (Gemini 3.1 Flash Lite Preview) |
| Validation | pydantic 2.x |
| Logging | loguru (opt-in) |
| Retry | tenacity (opt-in) |
| Env | python-dotenv |
| Build / packaging | hatchling |
| Tests | pytest |
Runtime dependencies: 5. No native binaries. Pure-Python wheel.
Requires Python 3.10 or newer.
pip install https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl

Pin in `requirements.txt` for reproducible installs:
pageclassifier @ https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl
git clone https://github.com/sherozshaikh/pageclassifier.git
cd pageclassifier
make setup
make install

Requires a Google Gemini API key. Provide it in one of three ways:
- Set `GEMINI_API_KEY` in the process environment, or
- Create a `.env` file and load it via `python-dotenv`, or
- Pass `api_key=...` to `DocumentClassifierConfig`
Get a key: https://aistudio.google.com/apikey
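The three options reduce to two lookups at runtime, because `load_dotenv()` copies values from `.env` into the process environment. The sketch below shows one plausible resolution order; `resolve_api_key` is a hypothetical helper, and the precedence (explicit key first, environment second) is an assumption, since this README does not state how the package orders them.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Pick an API key: explicit argument first, then GEMINI_API_KEY.

    Hypothetical helper; the precedence shown is an assumption, not
    documented pageclassifier behaviour.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("No Gemini API key found: set GEMINI_API_KEY or pass api_key=...")
    return key


print(resolve_api_key("from-config", env={"GEMINI_API_KEY": "from-env"}))  # from-config
print(resolve_api_key(env={"GEMINI_API_KEY": "from-env"}))                 # from-env
```

Failing early with a clear message is usually preferable to letting the first API call fail with an authentication error.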
from dotenv import load_dotenv
from pathlib import Path
from pageclassifier import DocumentClassifier
load_dotenv()
pdf_bytes = Path("invoice.pdf").read_bytes()
classifier = DocumentClassifier()
result = classifier.classify(pdf_bytes=pdf_bytes)
print(result.ubl_code) # "380"
print(result.ubl_confidence) # 0.95
print(result.page_map) # {"INV-001": [0, 1], "INV-002": [3]}
print(result.irrelevant_pages) # [IrrelevantPageResult(page=2, confidence=0.93, reason="...")]
print(result.warnings) # [] when the model response was complete

from pageclassifier import DocumentClassifier, DocumentClassifierConfig
classifier = DocumentClassifier(config=DocumentClassifierConfig())
result = classifier.classify(pdf_bytes=b"...")

from pageclassifier import classify_document
result = classify_document(pdf_bytes=b"...")

from pageclassifier import classify_document_async
result = await classify_document_async(pdf_bytes=b"...")

# DocumentClassificationResult fields
result.ubl_code            # str: UBL 2.1 code, e.g. "380"
result.ubl_confidence      # float (X.XX): confidence in document type
result.total_pages         # int: total pages the model analyzed
result.relevant_pages      # list[PageResult]
result.irrelevant_pages    # list[IrrelevantPageResult]
result.page_map            # dict[str, list[int]]: doc_id -> page indices
result.warnings            # list[str]: non-empty if model response was incomplete
result.model_used          # str
result.classifier_version  # str
result.latency_ms          # float
result.input_tokens        # int
result.output_tokens       # int
result.request_id          # str

# PageResult fields (each entry p of result.relevant_pages)
p.page         # int: 0-indexed page number
p.confidence   # float (X.XX)
p.reason       # str
p.document_id  # str | None: extracted verbatim, None if not on this page

# IrrelevantPageResult fields (each entry p of result.irrelevant_pages)
p.page         # int: 0-indexed page number
p.confidence   # float (X.XX): close to 1.0 = confident it is irrelevant; 0.0 = synthetic fill
p.reason       # str

import pageclassifier
from pathlib import Path
from pageclassifier import DocumentClassifier, DocumentClassifierConfig

# See default prompts without instantiating anything
pageclassifier.default_system_prompt()  # returns str
pageclassifier.default_user_prompt()    # returns str

# Active prompts on a live instance
classifier.system_prompt  # str: default or custom
classifier.user_prompt    # str: default or custom

# Override at config time (your prompts ship with your code, not the wheel)
config = DocumentClassifierConfig(
    api_key="...",
    system_prompt=Path("my_system_prompt.txt").read_text(),
    user_prompt=Path("my_user_prompt.txt").read_text(),
)
classifier = DocumentClassifier(config=config)

import pageclassifier
print(pageclassifier.__version__) # "2.0.0"

- `len(relevant_pages) + len(irrelevant_pages) == total_pages`, always
- All confidence scores are `X.XX` (rounded to 2 decimal places)
- `page_map` keys are document IDs; values are sorted 0-indexed page lists including continuation pages
- `warnings == []` means the model returned a complete response; non-empty means at least one page was filled synthetically (identifiable by `confidence == 0.0`)
- Every page 0 through `total_pages - 1` appears in exactly one of the two page lists
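These guarantees are easy to assert in a consumer's own tests. The following is a sketch under stated assumptions: `check_invariants` is a hypothetical helper, it relies only on the attribute names listed in this README, and it uses `SimpleNamespace` stand-ins instead of real result objects so it runs without an API key.

```python
from types import SimpleNamespace


def check_invariants(result):
    """Return a list of violated guarantees (empty when all of them hold)."""
    problems = []
    relevant = [p.page for p in result.relevant_pages]
    irrelevant = [p.page for p in result.irrelevant_pages]

    if len(relevant) + len(irrelevant) != result.total_pages:
        problems.append("page lists do not add up to total_pages")
    if sorted(relevant + irrelevant) != list(range(result.total_pages)):
        problems.append("a page is missing or appears in both lists")
    for doc_id, pages in result.page_map.items():
        if list(pages) != sorted(pages):
            problems.append(f"page_map[{doc_id!r}] is not sorted")

    # Synthetic fills are identified by confidence == 0.0 and must come with a warning.
    synthetic = [p.page for p in result.irrelevant_pages if p.confidence == 0.0]
    if bool(synthetic) != bool(result.warnings):
        problems.append("warnings and synthetic fills disagree")
    return problems


demo = SimpleNamespace(
    total_pages=3,
    relevant_pages=[SimpleNamespace(page=0, confidence=0.97),
                    SimpleNamespace(page=1, confidence=0.91)],
    irrelevant_pages=[SimpleNamespace(page=2, confidence=0.93)],
    page_map={"INV-001": [0, 1]},
    warnings=[],
)
print(check_invariants(demo))  # []
```

Because the helper only reads attributes, the same function should also accept a real classification result, provided the attribute names match those documented here.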
pageclassifier/
├── src/
│   └── pageclassifier/
│       ├── __init__.py
│       ├── classifier/      # DocumentClassifier, config, Gemini client, result models
│       ├── prompts/         # system_prompt.txt + user_prompt.txt
│       ├── exceptions.py    # exception hierarchy
│       └── logger.py        # opt-in loguru wiring
├── tests/
│   ├── conftest.py
│   ├── test_document_classifier.py
│   ├── test_client.py
│   ├── test_models.py
│   ├── test_exceptions.py
│   └── test_logger.py
├── docs/
│   ├── api-reference.md
│   ├── prompts.md
│   ├── observability.md
│   └── production.md
├── experiments/             # Docker + Prometheus + Grafana demo (not in wheel)
├── example.py
├── Makefile
├── pyproject.toml
├── LICENSE
└── README.md
Setup:
  make setup            Create venv with uv (python 3.10, run once)
  make install          Install package + dev deps in editable mode
  make install-runtime  Install runtime deps only
Testing:
  make test             Run the full test suite
  make test-smoke       Run smoke tests only
  make test-edge        Run edge case tests only
  make test-fast        Run tests in parallel across CPU cores
Code Quality:
  make format           isort + black + ruff fix + ruff format
Build:
  make build            Build the wheel and sdist
  make verify-wheel     Build and install the wheel into a fresh venv
Cleanup:
  make clean-build      Remove dist/ and build artifacts
  make clean            Remove caches and build artifacts
make test

77 tests covering smoke and edge cases across classifier, client, models, exceptions, and logger. No real API calls. Runs in under 1 second.
MIT License — see LICENSE file for details.