Description
Describe the proposed feature
I have several scanned documents that have digital headers and footers. Header.pdf is a test file that I made using Word.
OCRmyPDF can't OCR them directly and fails with the error "PriorOcrFoundError: page already has text!"
It can OCR them using --redo-ocr, but this consumes significantly more time than removing the header and performing a normal OCR on the output file.
Example:
time ocrmypdf --output-type pdf --redo-ocr Header.pdf Header-redo.pdf
real 0m9.251s
user 0m9.446s
sys 0m1.471s

time gs -o Noheader.pdf -sDEVICE=pdfwrite -dFILTERTEXT Header.pdf
real 0m0.426s
user 0m0.326s
sys 0m0.040s

time ocrmypdf --output-type pdf Noheader.pdf Noheader-ocr.pdf
real 0m6.906s
user 0m7.481s
sys 0m0.960s

The difference is probably not that significant for this small file, but I have several similar PDFs with thousands of pages, where the difference becomes quite significant.
So, is it possible to optimize the OCR of such documents while preserving the headers and footers?
Maybe a new option flag (e.g. --keep-text) could preserve digital text on pages that have no prior OCR text, if identifying prior OCR text is the cause of the overhead?
With this option, OCRmyPDF would strip all text using gs -sDEVICE=pdfwrite -dFILTERTEXT, perform OCR on the stripped PDF, and then add the resulting OCR text layer back to the original PDF.
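The proposed flow can be sketched as a small external shell wrapper. This is only a sketch: the function name keep_text_ocr, the DRY_RUN switch, and the temp-file naming are my own inventions, and step 3 (merging the OCR layer back onto the original) is left as a comment because it would need an extra tool.

```shell
#!/bin/sh
# Sketch of the proposed --keep-text behaviour as an external wrapper.
# Assumes Ghostscript (gs) and ocrmypdf are on PATH.
keep_text_ocr() {
    in=$1
    out=$2
    tmp="${in%.pdf}.notext.pdf"
    # With DRY_RUN=1 the commands are only printed, not executed.
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

    # 1. Strip all digital text (headers/footers) so OCRmyPDF sees no prior text.
    run gs -o "$tmp" -sDEVICE=pdfwrite -dFILTERTEXT "$in"
    # 2. Plain OCR on the stripped file -- no --redo-ocr needed.
    run ocrmypdf --output-type pdf "$tmp" "$out"
    # 3. (Proposed, not shown) overlay the new OCR text layer back onto the
    #    original PDF so the digital headers/footers are preserved, e.g. with
    #    a PDF library such as pikepdf.
}
```

Usage: `keep_text_ocr Header.pdf Header-ocr.pdf`; set `DRY_RUN=1` to preview the commands without running them.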