pageclassifier

Gemini-powered document classifier that analyzes a complete PDF in a single API call, detects its UBL 2.1 document type, identifies the relevant pages, and maps each document ID to its pages for downstream extractors.

Python 3.10+ | License: MIT


What This Does

Given raw PDF bytes, pageclassifier makes a single call to Google Gemini and returns:

  • The UBL 2.1 document type (e.g. 380 = Invoice, 381 = Credit Note) with confidence
  • Per-page relevance — which pages contain line items or document identifiers
  • A page map — each document ID mapped to its pages, including continuation pages where the ID does not repeat
  • Irrelevant pages — with confidence and reason, so downstream knows exactly why each page was excluded
  • Warnings — if the model returned an incomplete response, missing pages are filled synthetically and flagged

The intended deployment is as a gate in front of a more expensive extraction step. Downstream extractors consume page_map directly and never touch pages that were not classified as relevant.


Pipeline Position

PDF bytes
  -> pageclassifier.classify(pdf_bytes)    (single Gemini call)
  -> DocumentClassificationResult
       .page_map         -> downstream extractor (runs only on relevant pages)
       .irrelevant_pages -> skip
       .warnings         -> alert if model was incomplete
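
The gating step in the diagram above might look roughly like the sketch below. It is an illustration only, not code from the package: FakeResult is a stand-in carrying just the two fields used here, and extract_pages is a hypothetical downstream extractor.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Stand-in for DocumentClassificationResult (assumption: only these fields are needed)."""
    page_map: dict
    warnings: list = field(default_factory=list)


def extract_pages(doc_id, pages):
    """Hypothetical downstream extractor; the expensive step would run here."""
    return {"document_id": doc_id, "pages": pages}


def run_gate(result):
    """Run the extractor once per document ID, only on the pages mapped to it."""
    if result.warnings:
        # Incomplete model response: surface it instead of ignoring it.
        print(f"classifier warnings: {result.warnings}")
    return [extract_pages(doc_id, pages) for doc_id, pages in result.page_map.items()]


jobs = run_gate(FakeResult(page_map={"INV-001": [0, 1], "INV-002": [3]}))
print(jobs)
```

Pages that are absent from page_map never reach the extractor, which is the point of placing the classifier in front of it.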

Supported Document Types (UBL 2.1)

Code  Type
380   Commercial Invoice
381   Credit Note
383   Debit Note
384   Corrected Invoice
385   Freight Invoice
386   Prepayment Invoice
389   Self-Billing Invoice
261   Self-Billing Credit Note
480   Invoice not subject to VAT
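
Because ubl_code is returned as a string, downstream code may want a human-readable label. The lookup below is a small sketch that copies the codes and names from the table above; UBL_TYPE_NAMES and describe_ubl_code are hypothetical names, not part of the package.

```python
# Codes and names copied from the table above (hypothetical helper, not shipped in the wheel).
UBL_TYPE_NAMES = {
    "380": "Commercial Invoice",
    "381": "Credit Note",
    "383": "Debit Note",
    "384": "Corrected Invoice",
    "385": "Freight Invoice",
    "386": "Prepayment Invoice",
    "389": "Self-Billing Invoice",
    "261": "Self-Billing Credit Note",
    "480": "Invoice not subject to VAT",
}


def describe_ubl_code(code):
    """Return the display name for a code, or a placeholder for codes outside the table."""
    return UBL_TYPE_NAMES.get(code, f"Unknown type ({code})")


print(describe_ubl_code("381"))
```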

Tech Stack

Layer              Technology
LLM                google-genai (Gemini 3.1 Flash Lite Preview)
Validation         pydantic 2.x
Logging            loguru (opt-in)
Retry              tenacity (opt-in)
Env                python-dotenv
Build / packaging  hatchling
Tests              pytest

Runtime dependencies: 5. No native binaries. Pure-Python wheel.


Install

Requires Python 3.10 or newer.

From the GitHub Release (recommended)

pip install https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl

Pin in requirements.txt for reproducible installs:

pageclassifier @ https://github.com/sherozshaikh/pageclassifier/releases/download/v2.0.0/pageclassifier-2.0.0-py3-none-any.whl

From source (development)

git clone https://github.com/sherozshaikh/pageclassifier.git
cd pageclassifier
make setup
make install

Environment

Requires a Google Gemini API key. Provide it in one of three ways:

  • Set GEMINI_API_KEY in the process environment, or
  • Create a .env file and load it via python-dotenv, or
  • Pass api_key=... to DocumentClassifierConfig

Get a key: https://aistudio.google.com/apikey
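
If you want to fail early when no key is configured, a small helper can resolve it before the classifier is built. This is a sketch under assumptions: resolve_api_key is a hypothetical name, and the precedence shown (explicit argument first, then GEMINI_API_KEY) is a choice made here, not a documented behavior of the package.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else GEMINI_API_KEY from the environment."""
    key = explicit_key or os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("No Gemini API key: set GEMINI_API_KEY or pass api_key=...")
    return key


# Possible usage (requires the package):
#   config = DocumentClassifierConfig(api_key=resolve_api_key())
```

When using a .env file, call load_dotenv() before this helper so the variable is already in the process environment.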


Quick Start

from dotenv import load_dotenv
from pathlib import Path
from pageclassifier import DocumentClassifier

load_dotenv()
pdf_bytes = Path("invoice.pdf").read_bytes()

classifier = DocumentClassifier()
result = classifier.classify(pdf_bytes=pdf_bytes)

print(result.ubl_code)          # "380"
print(result.ubl_confidence)    # 0.95
print(result.page_map)          # {"INV-001": [0, 1], "INV-002": [3]}
print(result.irrelevant_pages)  # [IrrelevantPageResult(page=2, confidence=0.93, reason="...")]
print(result.warnings)          # [] — empty when model response was complete

API Contract

DocumentClassifier

from pageclassifier import DocumentClassifier, DocumentClassifierConfig

classifier = DocumentClassifier(config=DocumentClassifierConfig())
result = classifier.classify(pdf_bytes=b"...")

classify_document (sync singleton)

from pageclassifier import classify_document
result = classify_document(pdf_bytes=b"...")

classify_document_async (async singleton)

from pageclassifier import classify_document_async
# Call from inside a coroutine; top-level await only works in an async REPL or notebook
result = await classify_document_async(pdf_bytes=b"...")
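
An async entry point makes it natural to classify several PDFs concurrently. The sketch below shows one way to bound the number of in-flight calls with a semaphore. classify_many and fake_classify are hypothetical; fake_classify stands in for classify_document_async so the example runs without an API key, and whether the package's singleton is suited to concurrent use is not stated in this README.

```python
import asyncio


async def classify_many(classify, pdfs, max_concurrency=4):
    """Await classify(pdf) for each PDF, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(pdf):
        async with semaphore:
            return await classify(pdf)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(pdf) for pdf in pdfs))


async def fake_classify(pdf):
    """Stand-in for classify_document_async; returns the byte length instead of a result."""
    await asyncio.sleep(0)
    return len(pdf)


results = asyncio.run(classify_many(fake_classify, [b"a", b"bb", b"ccc"]))
print(results)
```

With the real package, the first argument could be a small wrapper such as `lambda pdf: classify_document_async(pdf_bytes=pdf)`.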

DocumentClassificationResult

result.ubl_code          # str  — UBL 2.1 code, e.g. "380"
result.ubl_confidence    # float (X.XX) — confidence in document type
result.total_pages       # int  — total pages the model analyzed
result.relevant_pages    # list[PageResult]
result.irrelevant_pages  # list[IrrelevantPageResult]
result.page_map          # dict[str, list[int]] — doc_id -> page indices
result.warnings          # list[str] — non-empty if model response was incomplete
result.model_used        # str
result.classifier_version # str
result.latency_ms        # float
result.input_tokens      # int
result.output_tokens     # int
result.request_id        # str

PageResult

p.page           # int  — 0-indexed page number
p.confidence     # float (X.XX)
p.reason         # str
p.document_id    # str | None — extracted verbatim, null if not on this page

IrrelevantPageResult

p.page           # int  — 0-indexed page number
p.confidence     # float (X.XX) — close to 1.0 = confident it is irrelevant; 0.0 = synthetic fill
p.reason         # str
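
Since a confidence of 0.0 marks a synthetic fill, excluded pages can be triaged before they are discarded. The sketch below is illustrative: FakeIrrelevantPage mirrors the three documented fields, triage is a hypothetical helper, and the 0.60 review threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class FakeIrrelevantPage:
    """Stand-in for IrrelevantPageResult with the three documented fields."""
    page: int
    confidence: float
    reason: str


def triage(pages, review_below=0.60):
    """Split excluded pages into synthetic fills, low-confidence, and confident exclusions."""
    return {
        # 0.0 is the documented marker for a page the model did not report.
        "synthetic": [p.page for p in pages if p.confidence == 0.0],
        "review": [p.page for p in pages if 0.0 < p.confidence < review_below],
        "confident": [p.page for p in pages if p.confidence >= review_below],
    }


buckets = triage([
    FakeIrrelevantPage(page=2, confidence=0.93, reason="terms and conditions"),
    FakeIrrelevantPage(page=5, confidence=0.0, reason="missing from model response"),
    FakeIrrelevantPage(page=6, confidence=0.41, reason="mostly blank"),
])
print(buckets)
```

Pages in the synthetic and review buckets are candidates for a manual check or a second pass, since the model either did not report them or was unsure.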

Prompt access and customisation

from pathlib import Path

import pageclassifier
from pageclassifier import DocumentClassifier, DocumentClassifierConfig

# See default prompts without instantiating anything
pageclassifier.default_system_prompt()   # returns str
pageclassifier.default_user_prompt()     # returns str

# Active prompts on a live instance
classifier.system_prompt   # str — default or custom
classifier.user_prompt     # str — default or custom

# Override at config time (your prompts ship with your code, not the wheel)
config = DocumentClassifierConfig(
    api_key="...",
    system_prompt=Path("my_system_prompt.txt").read_text(),
    user_prompt=Path("my_user_prompt.txt").read_text(),
)
classifier = DocumentClassifier(config=config)

Version

import pageclassifier
print(pageclassifier.__version__)   # "2.0.0"

Output Guarantees

  • len(relevant_pages) + len(irrelevant_pages) == total_pages — always
  • All confidence scores are X.XX (rounded to 2 decimal places)
  • page_map keys are document IDs; values are sorted 0-indexed page lists including continuation pages
  • warnings == [] means the model returned a complete response; non-empty means at least one page was filled synthetically (identifiable by confidence == 0.0)
  • Every page 0 through total_pages - 1 appears in exactly one of the two page lists
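
A consumer that wants to assert these guarantees at a service boundary could use a check like the one sketched below. check_guarantees is a hypothetical helper, and the SimpleNamespace objects are stand-ins exposing only the attributes the check reads.

```python
from types import SimpleNamespace


def check_guarantees(result):
    """Return a list of violated guarantees; an empty list means all checks passed."""
    problems = []
    pages = [p.page for p in result.relevant_pages]
    pages += [p.page for p in result.irrelevant_pages]
    # Every page 0..total_pages-1 must appear in exactly one of the two lists.
    if sorted(pages) != list(range(result.total_pages)):
        problems.append("pages do not partition 0..total_pages-1")
    # page_map values must be sorted page lists.
    for doc_id, mapped in result.page_map.items():
        if mapped != sorted(mapped):
            problems.append(f"page_map[{doc_id!r}] is not sorted")
    return problems


example = SimpleNamespace(
    total_pages=4,
    relevant_pages=[SimpleNamespace(page=0), SimpleNamespace(page=1), SimpleNamespace(page=3)],
    irrelevant_pages=[SimpleNamespace(page=2)],
    page_map={"INV-001": [0, 1], "INV-002": [3]},
)
print(check_guarantees(example))
```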

Project Structure

pageclassifier/
├── src/
│   └── pageclassifier/
│       ├── __init__.py
│       ├── classifier/         # DocumentClassifier, config, Gemini client, result models
│       ├── prompts/            # system_prompt.txt + user_prompt.txt
│       ├── exceptions.py       # exception hierarchy
│       └── logger.py           # opt-in loguru wiring
├── tests/
│   ├── conftest.py
│   ├── test_document_classifier.py
│   ├── test_client.py
│   ├── test_models.py
│   ├── test_exceptions.py
│   └── test_logger.py
├── docs/
│   ├── api-reference.md
│   ├── prompts.md
│   ├── observability.md
│   └── production.md
├── experiments/                # Docker + Prometheus + Grafana demo (not in wheel)
├── example.py
├── Makefile
├── pyproject.toml
├── LICENSE
└── README.md

Makefile Targets

Setup:
  make setup              Create venv with uv (python 3.10, run once)
  make install            Install package + dev deps in editable mode
  make install-runtime    Install runtime deps only

Testing:
  make test               Run the full test suite
  make test-smoke         Run smoke tests only
  make test-edge          Run edge case tests only
  make test-fast          Run tests in parallel across CPU cores

Code Quality:
  make format             isort + black + ruff fix + ruff format

Build:
  make build              Build the wheel and sdist
  make verify-wheel       Build and install the wheel into a fresh venv

Cleanup:
  make clean-build        Remove dist/ and build artifacts
  make clean              Remove caches and build artifacts

Tests

make test

The suite contains 77 tests covering smoke and edge cases across the classifier, client, models, exceptions, and logger. It makes no real API calls and runs in under 1 second.


License

MIT License — see LICENSE file for details.


Author

Sheroz Shaikh (GitHub | LinkedIn)

About

Gemini-powered page classifier that decides whether a document page image contains invoice line items. Designed to sit between paperflight and any downstream extractor to reduce expensive LLM calls on irrelevant pages.
