Improving performance for OCRing files with headers/footers #1600

@user1823

Description

Describe the proposed feature

I have several scanned documents that have digital headers and footers. Header.pdf is a test file that I made using Word.

OCRmyPDF can't OCR them directly; it fails with "PriorOcrFoundError: page already has text!"

It can OCR them using --redo-ocr, but this takes significantly longer than stripping the header text first and running a normal OCR pass on the result.

Example:

```
time ocrmypdf --output-type pdf --redo-ocr Header.pdf Header-redo.pdf
real    0m9.251s
user    0m9.446s
sys     0m1.471s

time gs -o Noheader.pdf -sDEVICE=pdfwrite -dFILTERTEXT Header.pdf
real    0m0.426s
user    0m0.326s
sys     0m0.040s

time ocrmypdf --output-type pdf Noheader.pdf Noheader-ocr.pdf
real    0m6.906s
user    0m7.481s
sys     0m0.960s
```

The difference is modest for this small file, but I have several similar PDFs with thousands of pages, where it becomes quite significant.

So, is it possible to optimize the OCR of such documents while preserving the headers and footers?

If identifying existing OCR text is the source of the overhead, perhaps a new option flag (e.g. --keep-text) could preserve digital text when there is no prior OCR text.
With this option, OCRmyPDF would strip all text (as gs -sDEVICE=pdfwrite -dFILTERTEXT does), perform OCR on the stripped PDF, and then add the resulting OCR text layer back to the original PDF.
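The proposed flow could be sketched roughly as follows. This is only an illustration of the workaround described above, not anything OCRmyPDF exposes today: `keep_text_pipeline` is a hypothetical helper that builds the two external commands (Ghostscript text strip, then a plain OCR pass), and the final step of merging the OCR text layer back onto the original PDF is deliberately left out, since it would need a PDF library and is the part a real --keep-text option would have to implement.

```python
# Sketch of the proposed --keep-text workaround (hypothetical, not an
# OCRmyPDF feature): strip digital text with Ghostscript, then OCR the
# stripped file. Merging the OCR text layer back onto the original PDF
# is intentionally omitted here.
from pathlib import Path


def keep_text_pipeline(src: Path, out: Path, tmpdir: Path) -> list[list[str]]:
    """Return the external commands for the strip-then-OCR workaround."""
    stripped = tmpdir / f"{src.stem}-notext.pdf"
    return [
        # 1. Drop every text object; the page images survive, but the
        #    digital headers/footers are lost in this intermediate file.
        ["gs", "-o", str(stripped), "-sDEVICE=pdfwrite", "-dFILTERTEXT", str(src)],
        # 2. A plain OCR pass now succeeds: no prior text remains to
        #    trigger PriorOcrFoundError, and --redo-ocr is not needed.
        ["ocrmypdf", "--output-type", "pdf", str(stripped), str(out)],
    ]
```

A caller would run these commands in order (e.g. with subprocess.run) and then graft the resulting OCR text back onto the original file to keep the headers and footers.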
