Modernize OCR by Switching to an AI‑Based, Native‑PDF OCR Engine #89

@rffrasca

Summary
PDFKeeper’s current OCR workflow converts each image‑based PDF page to TIFF before processing it. This extra conversion step adds processing time and makes the workflow more resource‑intensive than one built on a modern AI‑based OCR engine that operates directly on PDF files.

Proposed Solution
Adopt an AI‑based OCR engine that can:

  • Process image‑based PDF pages directly, without a separate conversion to TIFF
  • Support multiple languages
  • Provide higher accuracy on low‑quality scans
  • Offer a clean API suitable for integration into PDFKeeper’s existing architecture
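
To make the "clean API" requirement more concrete, here is a minimal sketch in Python of what an engine‑agnostic interface could look like. Everything in it is an assumption for illustration only: `OcrEngine`, `PageText`, `StubEngine`, and `extract_text` are hypothetical names, not existing PDFKeeper types and not the API of any particular OCR product. The stub performs no real OCR; it only shows the shape of the calls.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, Sequence


@dataclass(frozen=True)
class PageText:
    """Recognized text for one PDF page (hypothetical result type)."""
    page_number: int  # 1-based page index
    text: str


class OcrEngine(Protocol):
    """Hypothetical engine-agnostic interface; not an existing PDFKeeper type."""

    def recognize(self, pdf_path: Path, languages: Sequence[str]) -> list[PageText]:
        """Return recognized text per page, reading the PDF file directly."""
        ...


class StubEngine:
    """Stand-in engine used only to exercise the interface; does no real OCR."""

    def recognize(self, pdf_path: Path, languages: Sequence[str]) -> list[PageText]:
        return [PageText(1, f"stub text for {pdf_path.name} [{'+'.join(languages)}]")]


def extract_text(engine: OcrEngine, pdf_path: Path,
                 languages: Sequence[str] = ("eng",)) -> str:
    """Join per-page results, in page order, into one searchable string."""
    pages = engine.recognize(pdf_path, languages)
    return "\n".join(p.text for p in sorted(pages, key=lambda p: p.page_number))


# Usage: the caller passes a PDF path and never handles TIFF files.
print(extract_text(StubEngine(), Path("scan.pdf"), ("eng", "deu")))
```

With an interface along these lines, the calling code would depend only on `recognize`, so a candidate engine could be swapped in or compared against the current one without changing the rest of the workflow.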

Benefits

  • Faster OCR processing
  • Higher accuracy, especially for complex or low‑quality documents
  • Reduced CPU and memory usage
