Releases: ma-wi-lo/pubs
v1.1.0
v1.0.2 - Initial Public Release
OCR Pipeline for Historical Print Periodicals using Mistral AI
Publication-ready OCR pipeline for batch processing scanned historical periodicals.
Features
- Structured OCR with Markdown formatting preserving document layout
- Automatic PDF splitting for large files (>50 MB or >1000 pages)
- SQLite checkpoint system for resumable processing after interruptions
- Multiple output formats: Markdown (.md), Plain Text (.txt), JSON metadata
- Robust error handling with exponential backoff retry logic
- Batch processing with progress tracking and cost estimation
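The checkpoint and retry features listed above could be combined roughly as follows. This is a minimal sketch, not the pipeline's actual code: the `Checkpoint` class, the `processed` table layout, `with_backoff`, `run_batch`, and the `ocr_one` callback are all hypothetical names introduced for illustration.

```python
import random
import sqlite3
import time


class Checkpoint:
    """Records finished files in SQLite so a restarted batch can skip them."""

    def __init__(self, path=":memory:"):
        # Hypothetical schema: one row per finished file.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed (name TEXT PRIMARY KEY, pages INTEGER)"
        )
        self.conn.commit()

    def is_done(self, name):
        row = self.conn.execute(
            "SELECT 1 FROM processed WHERE name = ?", (name,)
        ).fetchone()
        return row is not None

    def mark_done(self, name, pages):
        self.conn.execute(
            "INSERT OR REPLACE INTO processed (name, pages) VALUES (?, ?)",
            (name, pages),
        )
        self.conn.commit()


def with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


def run_batch(names, ocr_one, checkpoint, **retry_kwargs):
    """Process each name once; names already in the checkpoint are skipped."""
    processed = []
    for name in names:
        if checkpoint.is_done(name):
            continue
        pages = with_backoff(lambda: ocr_one(name), **retry_kwargs)
        checkpoint.mark_done(name, pages)
        processed.append(name)
    return processed
```

Passing a file path instead of `":memory:"` to `Checkpoint` would let a second run of `run_batch` skip everything the first run completed, which is the behaviour the resumable-processing feature describes.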
Technical Details
- Language: Python 3.8+
- Framework: Jupyter Notebook
- OCR Engine: Mistral AI Vision API (mistral-ocr-latest)
- License: MIT
- Institution: BBF | Research Library for the History of Education in Berlin
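A call to the OCR engine named above might look like the sketch below. It assumes the `ocr.process` method of the official `mistralai` Python SDK and a response whose pages each expose a `markdown` attribute. The helper name `ocr_to_markdown` is hypothetical, and the client is passed in as a parameter so the function does not depend on how the pipeline itself constructs it.

```python
def ocr_to_markdown(client, document_url, model="mistral-ocr-latest"):
    """Run OCR on one PDF URL and return the pages joined as one Markdown string."""
    response = client.ocr.process(
        model=model,
        document={"type": "document_url", "document_url": document_url},
    )
    # Each page of the response is assumed to carry its text as Markdown.
    return "\n\n".join(page.markdown for page in response.pages)
```

With the real SDK, the client would typically be created as `Mistral(api_key=...)` after `from mistralai import Mistral`. Any object with the same `ocr.process` shape works for local testing.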
Use Cases
Designed for researchers and digital humanities practitioners working with:
- Historical periodicals and journals
- Scanned newspaper archives
- Magazine collections
- Academic journal archives
Documentation
Complete documentation included:
- Installation and setup guide
- System architecture documentation
- Prompt engineering details
- Zenodo publication metadata
Requirements
- Mistral AI API key (registration at console.mistral.ai)
- Python 3.8 or higher
- See requirements.txt for the complete dependency list
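One common way to supply the API key is through an environment variable rather than hard-coding it in the notebook. The sketch below assumes the variable is called `MISTRAL_API_KEY`. Both that name and the `load_api_key` helper are illustrative assumptions, not taken from the release notes.

```python
import os


def load_api_key(env=os.environ, var="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise with a pointer to the console."""
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(
            f"{var} is not set; create a key at console.mistral.ai and export it first."
        )
    return key
```

Failing early with a clear message avoids starting a long batch run that would only error out on the first API request.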
Citation
See CITATION.cff for structured citation metadata or use the Zenodo DOI (assigned upon publication).
Author: Marco Lorenz (ORCID: 0000-0002-0903-2100)
v1.0.1 - Initial Public Release
v1.0.0 - Initial Public Release