Description
Describe the proposed feature
I have several scanned documents that have digital headers and footers. Header.pdf is a test file that I made using Word.
OCRmyPDF can't OCR them directly and fails with the error "PriorOcrFoundError: page already has text!"
It can OCR them using --redo-ocr, but this consumes significantly more time than removing the header and performing a normal OCR on the output file.
Example:
time ocrmypdf --output-type pdf --redo-ocr Header.pdf Header-redo.pdf
real 0m9.251s
user 0m9.446s
sys 0m1.471s

time gs -o Noheader.pdf -sDEVICE=pdfwrite -dFILTERTEXT Header.pdf
real 0m0.426s
user 0m0.326s
sys 0m0.040s

time ocrmypdf --output-type pdf Noheader.pdf Noheader-ocr.pdf
real 0m6.906s
user 0m7.481s
sys 0m0.960s

The difference is probably not that significant for this small file, but I have several similar PDFs with thousands of pages, where the difference becomes quite significant.
So, is it possible to optimize the OCR of such documents while preserving the headers and footers?
Maybe a new option flag (e.g. --keep-text) could preserve digital text on pages that have no prior OCR text, if identifying prior OCR text is the cause of the overhead?
With this option, OCRmyPDF would strip all text using gs -sDEVICE=pdfwrite -dFILTERTEXT, perform OCR on the stripped PDF, and then add the resulting OCR text layer back to the original PDF.
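The proposed flow can be sketched as a small external shell wrapper. This is only a sketch: the function name keep_text_ocr, the DRY_RUN switch, and the temp-file naming are my own inventions, and step 3 (merging the OCR layer back onto the original) is left as a comment because it would need an extra tool.

```shell
#!/bin/sh
# Sketch of the proposed --keep-text behaviour as an external wrapper.
# Assumes Ghostscript (gs) and ocrmypdf are on PATH.
keep_text_ocr() {
    in=$1
    out=$2
    tmp="${in%.pdf}.notext.pdf"
    # With DRY_RUN=1 the commands are only printed, not executed.
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

    # 1. Strip all digital text (headers/footers) so OCRmyPDF sees no prior text.
    run gs -o "$tmp" -sDEVICE=pdfwrite -dFILTERTEXT "$in"
    # 2. Plain OCR on the stripped file -- no --redo-ocr needed.
    run ocrmypdf --output-type pdf "$tmp" "$out"
    # 3. (Proposed, not shown) overlay the new OCR text layer back onto the
    #    original PDF so the digital headers/footers are preserved, e.g. with
    #    a PDF library such as pikepdf.
}
```

Usage: `keep_text_ocr Header.pdf Header-ocr.pdf`; set `DRY_RUN=1` to preview the commands without running them.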