pageclassifier

Gemini-powered document classifier that analyzes a complete PDF in a single API call, detects its UBL 2.1 document type, identifies relevant pages, and builds a page-to-document-ID map for downstream extractors.

What This Does

Given raw PDF bytes, pageclassifier makes a single call to Google Gemini and returns:

The UBL 2.1 document type (e.g. 380 = Invoice, 381 = Credit Note) with confidence
Per-page relevance — which pages contain line items or document identifiers
A page map — each document ID mapped to its pages, including continuation pages where the ID does not repeat
Irrelevant pages — with confidence and reason, so downstream knows exactly why each page was excluded
Warnings — if the model returned an incomplete response, missing pages are filled synthetically and flagged

The intended deployment is as a gate in front of a more expensive extraction step. Downstream extractors consume page_map directly and never touch pages that were not classified as relevant.

Pipeline Position

PDF bytes
  -> pageclassifier.classify(pdf_bytes)    (single Gemini call)
  -> DocumentClassificationResult
       .page_map        -> downstream extractor (runs only on relevant pages)
       .irrelevant_pages -> skip
       .warnings        -> alert if model was incomplete
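
The gating step in the diagram above can be sketched roughly as follows. This is an illustration only: `FakeResult` and `plan_extraction` are hypothetical names, and the stand-in object mirrors just the documented `page_map` and `warnings` fields so the snippet runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing only the fields this sketch needs."""

    page_map: dict[str, list[int]]
    warnings: list[str] = field(default_factory=list)


def plan_extraction(result) -> list[tuple[str, list[int]]]:
    """Build (document_id, pages) jobs for a downstream extractor.

    Only pages listed in page_map are scheduled; everything else is skipped.
    """
    if result.warnings:
        # Assumption: an incomplete model response should be surfaced to operators.
        print(f"classifier warnings: {result.warnings}")
    return [(doc_id, sorted(pages)) for doc_id, pages in result.page_map.items()]


jobs = plan_extraction(FakeResult(page_map={"INV-001": [0, 1], "INV-002": [3]}))
print(jobs)
```

With a real result, `plan_extraction(classifier.classify(pdf_bytes=...))` would be the equivalent call, assuming the result exposes the fields as documented below.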

Supported Document Types (UBL 2.1)

Code	Type
380	Commercial Invoice
381	Credit Note
383	Debit Note
384	Corrected Invoice
385	Freight Invoice
386	Prepayment Invoice
389	Self-Billing Invoice
261	Self-Billing Credit Note
480	Invoice not subject to VAT
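
For display or logging, the table above can be mirrored as a plain lookup. This is a convenience sketch rather than part of the package: `UBL_DOCUMENT_TYPES` and `describe_ubl_code` are hypothetical names, and the fallback text for unknown codes is an assumption.

```python
# Codes and names copied from the table above; keys are strings because
# result.ubl_code is documented as a str.
UBL_DOCUMENT_TYPES: dict[str, str] = {
    "380": "Commercial Invoice",
    "381": "Credit Note",
    "383": "Debit Note",
    "384": "Corrected Invoice",
    "385": "Freight Invoice",
    "386": "Prepayment Invoice",
    "389": "Self-Billing Invoice",
    "261": "Self-Billing Credit Note",
    "480": "Invoice not subject to VAT",
}


def describe_ubl_code(code: str) -> str:
    """Return the human-readable name for a code, or a fallback label."""
    return UBL_DOCUMENT_TYPES.get(code, f"Unsupported code {code}")


print(describe_ubl_code("381"))
```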

Tech Stack

Layer	Technology
LLM	google-genai (Gemini 3.1 Flash Lite Preview)
Validation	pydantic 2.x
Logging	loguru (opt-in)
Retry	tenacity (opt-in)
Env	python-dotenv
Build / packaging	hatchling
Tests	pytest

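Runtime dependencies: 5. The package itself is a pure-Python wheel and ships no native binaries of its own (dependencies such as pydantic 2.x bring their own compiled components).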

Install

Requires Python 3.10 or newer.

From the GitHub Release (recommended)

pip install https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl

Pin in requirements.txt for reproducible installs:

pageclassifier @ https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl

From source (development)

git clone https://github.com/sherozshaikh/pageclassifier.git
cd pageclassifier
make setup
make install

Environment

Requires a Google Gemini API key. Provide it in one of three ways:

Set GEMINI_API_KEY in the process environment, or
Create a .env file and load it via python-dotenv, or
Pass api_key=... to DocumentClassifierConfig

Get a key: https://aistudio.google.com/apikey
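
If you want to fail fast before constructing the classifier, the lookup can be sketched like this. The precedence shown (explicit argument first, then `GEMINI_API_KEY`) is an assumption, not confirmed library behaviour, and `resolve_api_key` is a hypothetical helper. A `.env` file loaded with `load_dotenv()` ends up in the process environment, so the second branch covers it.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick an API key from an explicit value or the environment.

    Assumption: an explicit key wins over GEMINI_API_KEY.
    """
    key = explicit or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "No Gemini API key found: set GEMINI_API_KEY or pass api_key=..."
        )
    return key


# Demo with an injected mapping so no real secret is needed.
print(resolve_api_key(env={"GEMINI_API_KEY": "demo-key"}))
```

The returned value could then be passed as `api_key=...` to `DocumentClassifierConfig`.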

Quick Start

from dotenv import load_dotenv
from pathlib import Path
from pageclassifier import DocumentClassifier

load_dotenv()
pdf_bytes = Path("invoice.pdf").read_bytes()

classifier = DocumentClassifier()
result = classifier.classify(pdf_bytes=pdf_bytes)

print(result.ubl_code)          # "380"
print(result.ubl_confidence)    # 0.95
print(result.page_map)          # {"INV-001": [0, 1], "INV-002": [3]}
print(result.irrelevant_pages)  # [IrrelevantPageResult(page=2, confidence=0.93, reason="...")]
print(result.warnings)          # [] — empty when model response was complete

API Contract

`DocumentClassifier`

from pageclassifier import DocumentClassifier, DocumentClassifierConfig

classifier = DocumentClassifier(config=DocumentClassifierConfig())
result = classifier.classify(pdf_bytes=b"...")

`classify_document` (sync singleton)

from pageclassifier import classify_document
result = classify_document(pdf_bytes=b"...")

`classify_document_async` (async singleton)

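from pageclassifier import classify_document_async
# Call from inside a coroutine (async def); from synchronous code, wrap the call in asyncio.run(...)
result = await classify_document_async(pdf_bytes=b"...")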
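
A rough sketch of classifying several PDFs concurrently with a bounded number of in-flight calls. `fake_classify_async` is a hypothetical stand-in so the snippet runs offline; swapping in `classify_document_async` assumes the shared singleton tolerates concurrent calls, which this README does not state. `classify_many` and the `limit` default are also illustrative.

```python
import asyncio


async def fake_classify_async(pdf_bytes: bytes) -> dict:
    """Hypothetical stand-in for classify_document_async; makes no network call."""
    await asyncio.sleep(0)
    return {"size": len(pdf_bytes)}


async def classify_many(pdfs, classify=fake_classify_async, limit: int = 4):
    """Run classify over every PDF, at most `limit` at a time, preserving order."""
    semaphore = asyncio.Semaphore(limit)

    async def classify_one(pdf: bytes):
        async with semaphore:
            return await classify(pdf_bytes=pdf)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(classify_one(pdf) for pdf in pdfs))


results = asyncio.run(classify_many([b"a", b"bb", b"ccc"]))
print(results)
```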

`DocumentClassificationResult`

result.ubl_code          # str  — UBL 2.1 code, e.g. "380"
result.ubl_confidence    # float (X.XX) — confidence in document type
result.total_pages       # int  — total pages the model analyzed
result.relevant_pages    # list[PageResult]
result.irrelevant_pages  # list[IrrelevantPageResult]
result.page_map          # dict[str, list[int]] — doc_id -> page indices
result.warnings          # list[str] — non-empty if model response was incomplete
result.model_used        # str
result.classifier_version # str
result.latency_ms        # float
result.input_tokens      # int
result.output_tokens     # int
result.request_id        # str

`PageResult`

p.page           # int  — 0-indexed page number
p.confidence     # float (X.XX)
p.reason         # str
p.document_id    # str | None — extracted verbatim, null if not on this page
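
Because `document_id` is empty on continuation pages, `page_map` is the more reliable place to look up which document a page belongs to. A minimal sketch of inverting it, where `invert_page_map` is a hypothetical helper; values are lists so that no assumption is made about a page belonging to only one document.

```python
def invert_page_map(page_map: dict[str, list[int]]) -> dict[int, list[str]]:
    """Turn doc_id -> pages into page -> doc_ids."""
    page_to_docs: dict[int, list[str]] = {}
    for doc_id, pages in page_map.items():
        for page in pages:
            page_to_docs.setdefault(page, []).append(doc_id)
    return page_to_docs


lookup = invert_page_map({"INV-001": [0, 1], "INV-002": [3]})
print(lookup)
```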

`IrrelevantPageResult`

p.page           # int  — 0-indexed page number
p.confidence     # float (X.XX) — close to 1.0 = confident it is irrelevant; 0.0 = synthetic fill
p.reason         # str
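
Since a confidence of 0.0 marks a synthetic fill, downstream code can separate pages the model actually judged irrelevant from pages it never reported. A small sketch, with `FakeIrrelevantPage` and `split_synthetic` as hypothetical names standing in for the real model class:

```python
from dataclasses import dataclass


@dataclass
class FakeIrrelevantPage:
    """Hypothetical stand-in mirroring the documented IrrelevantPageResult fields."""

    page: int
    confidence: float
    reason: str


def split_synthetic(pages) -> tuple[list[int], list[int]]:
    """Return (model_classified_pages, synthetic_fill_pages)."""
    classified = [p.page for p in pages if p.confidence != 0.0]
    synthetic = [p.page for p in pages if p.confidence == 0.0]
    return classified, synthetic


sample_pages = [
    FakeIrrelevantPage(page=2, confidence=0.93, reason="cover letter"),
    FakeIrrelevantPage(page=4, confidence=0.0, reason="missing from model response"),
]
classified, synthetic = split_synthetic(sample_pages)
print(classified, synthetic)
```

Pages in the second list are good candidates for a manual check or a retry, since the model did not report on them.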

Prompt access and customisation

from pathlib import Path

import pageclassifier
from pageclassifier import DocumentClassifier, DocumentClassifierConfig

# See default prompts without instantiating anything
pageclassifier.default_system_prompt()   # returns str
pageclassifier.default_user_prompt()     # returns str

# Active prompts on a live instance (for example, classifier = DocumentClassifier())
classifier.system_prompt   # str — default or custom
classifier.user_prompt     # str — default or custom

# Override at config time (your prompts ship with your code, not the wheel)
config = DocumentClassifierConfig(
    api_key="...",
    system_prompt=Path("my_system_prompt.txt").read_text(),
    user_prompt=Path("my_user_prompt.txt").read_text(),
)
classifier = DocumentClassifier(config=config)

Version

import pageclassifier
print(pageclassifier.__version__)   # "2.0.0"

Output Guarantees

len(relevant_pages) + len(irrelevant_pages) == total_pages — always
All confidence scores are X.XX (rounded to 2 decimal places)
page_map keys are document IDs; values are sorted 0-indexed page lists including continuation pages
warnings == [] means the model returned a complete response; non-empty means at least one page was filled synthetically (identifiable by confidence == 0.0)
Every page 0 through total_pages - 1 appears in exactly one of the two page lists
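
These guarantees are easy to assert in a consumer's own test suite. The checker below is a sketch under the field names documented in this README; `check_guarantees` and the `SimpleNamespace` sample are hypothetical and stand in for a real `DocumentClassificationResult`.

```python
from types import SimpleNamespace


def check_guarantees(result) -> list[str]:
    """Return a list of violated guarantees (empty when all hold)."""
    problems = []
    relevant = [p.page for p in result.relevant_pages]
    irrelevant = [p.page for p in result.irrelevant_pages]

    if len(relevant) + len(irrelevant) != result.total_pages:
        problems.append("page count does not match total_pages")
    if sorted(relevant + irrelevant) != list(range(result.total_pages)):
        problems.append("pages do not cover 0..total_pages-1 exactly once")
    for p in list(result.relevant_pages) + list(result.irrelevant_pages):
        if round(p.confidence, 2) != p.confidence:
            problems.append(f"confidence on page {p.page} is not rounded to 2 places")
    for doc_id, pages in result.page_map.items():
        if list(pages) != sorted(pages):
            problems.append(f"page_map[{doc_id!r}] is not sorted")
    return problems


def make_page(number: int, confidence: float) -> SimpleNamespace:
    return SimpleNamespace(page=number, confidence=confidence)


sample = SimpleNamespace(
    total_pages=4,
    relevant_pages=[make_page(0, 0.95), make_page(1, 0.9), make_page(3, 0.88)],
    irrelevant_pages=[make_page(2, 0.93)],
    page_map={"INV-001": [0, 1], "INV-002": [3]},
)
print(check_guarantees(sample))
```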

Project Structure

pageclassifier/
├── src/
│   └── pageclassifier/
│       ├── __init__.py
│       ├── classifier/         # DocumentClassifier, config, Gemini client, result models
│       ├── prompts/            # system_prompt.txt + user_prompt.txt
│       ├── exceptions.py       # exception hierarchy
│       └── logger.py           # opt-in loguru wiring
├── tests/
│   ├── conftest.py
│   ├── test_document_classifier.py
│   ├── test_client.py
│   ├── test_models.py
│   ├── test_exceptions.py
│   └── test_logger.py
├── docs/
│   ├── api-reference.md
│   ├── prompts.md
│   ├── observability.md
│   └── production.md
├── experiments/                # Docker + Prometheus + Grafana demo (not in wheel)
├── example.py
├── Makefile
├── pyproject.toml
├── LICENSE
└── README.md

Makefile Targets

Setup:
  make setup              Create venv with uv (python 3.10, run once)
  make install            Install package + dev deps in editable mode
  make install-runtime    Install runtime deps only

Testing:
  make test               Run the full test suite
  make test-smoke         Run smoke tests only
  make test-edge          Run edge case tests only
  make test-fast          Run tests in parallel across CPU cores

Code Quality:
  make format             isort + black + ruff fix + ruff format

Build:
  make build              Build the wheel and sdist
  make verify-wheel       Build and install the wheel into a fresh venv

Cleanup:
  make clean-build        Remove dist/ and build artifacts
  make clean              Remove caches and build artifacts

Tests

make test

77 tests covering smoke and edge cases across classifier, client, models, exceptions, and logger. No real API calls. Runs in under 1 second.

License

MIT License — see LICENSE file for details.

Author

Sheroz Shaikh — GitHub | LinkedIn
