Releases: ma-wi-lo/pubs

v1.1.0

20 Dec 12:57
1f36d3f

New documentation and improved logging following the release of a new OCR model.

v1.0.2 - Initial Public Release

17 Nov 14:55

OCR Pipeline for Historical Print Periodicals using Mistral AI

Publication-ready OCR pipeline for batch processing scanned historical periodicals.

Features

  • Structured OCR with Markdown formatting preserving document layout
  • Automatic PDF splitting for large files (>50 MB or >1000 pages)
  • SQLite checkpoint system for resumable processing after interruptions
  • Multiple output formats: Markdown (.md), Plain Text (.txt), JSON metadata
  • Robust error handling with exponential backoff retry logic
  • Batch processing with progress tracking and cost estimation
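
The splitting, checkpointing, and retry features listed above could fit together roughly as in the following sketch. It is not the notebook's actual code: the function names, the `done` table layout, and the `ocr_chunk` callback are hypothetical, and only the 1000-page threshold is taken from the feature list (the 50 MB size check is left out).

```python
import random
import sqlite3
import time


def plan_chunks(total_pages, max_pages=1000):
    """Return 1-indexed, inclusive (first, last) page ranges of at most max_pages pages."""
    return [(first, min(first + max_pages - 1, total_pages))
            for first in range(1, total_pages + 1, max_pages)]


def open_checkpoint(path):
    """Open (or create) the SQLite checkpoint database. Table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done ("
        "file TEXT, first_page INTEGER, last_page INTEGER, "
        "PRIMARY KEY (file, first_page, last_page))"
    )
    return conn


def is_done(conn, file, chunk):
    """True if this chunk of this file was already recorded as processed."""
    row = conn.execute(
        "SELECT 1 FROM done WHERE file = ? AND first_page = ? AND last_page = ?",
        (file, *chunk),
    ).fetchone()
    return row is not None


def mark_done(conn, file, chunk):
    """Record a finished chunk; the 'with' block commits the insert."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO done VALUES (?, ?, ?)", (file, *chunk))


def with_backoff(call, retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on any exception with exponentially growing, jittered delays."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def process_file(conn, file, total_pages, ocr_chunk, max_pages=1000):
    """OCR every chunk not yet in the checkpoint; return the chunks processed in this run."""
    processed = []
    for chunk in plan_chunks(total_pages, max_pages):
        if is_done(conn, file, chunk):
            continue  # finished in an earlier run, skip on resume
        with_backoff(lambda: ocr_chunk(file, chunk))
        mark_done(conn, file, chunk)
        processed.append(chunk)
    return processed
```

Because a chunk is marked as done only after its OCR call succeeds, rerunning `process_file` after an interruption repeats at most the chunk that was in flight.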

Technical Details

  • Language: Python 3.8+
  • Framework: Jupyter Notebook
  • OCR Engine: Mistral AI Vision API (mistral-ocr-latest)
  • License: MIT
  • Institution: BBF | Research Library for the History of Education in Berlin

Use Cases

Designed for researchers and digital humanities practitioners working with:

  • Historical periodicals and journals
  • Scanned newspaper archives
  • Magazine collections
  • Academic journal archives

Documentation

Complete documentation included:

  • Installation and setup guide
  • System architecture documentation
  • Prompt engineering details
  • Zenodo publication metadata

Requirements

  • Mistral AI API key (registration at console.mistral.ai)
  • Python 3.8 or higher
  • See requirements.txt for complete dependency list
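
One common way to supply the API key is through an environment variable, so that it never ends up in a saved notebook. The sketch below assumes a variable named `MISTRAL_API_KEY` and a hypothetical helper `load_api_key`; the pipeline itself may read the key differently, so check the setup guide.

```python
import os


def load_api_key(env=os.environ, var="MISTRAL_API_KEY"):
    """Return the API key from the environment, or fail early with a clear message.

    The variable name is an assumption made for this example.
    """
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var} to your Mistral AI API key before running the pipeline."
        )
    return key
```

Failing before any PDF is read avoids starting a long batch run that would only stop at the first API call.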

Citation

See CITATION.cff for structured citation metadata or use the Zenodo DOI (assigned upon publication).

Author: Marco Lorenz (ORCID: 0000-0002-0903-2100)

v1.0.1 - Initial Public Release

17 Nov 14:51


v1.0.0 - Initial Public Release

17 Nov 14:07
037321c
