Can I use hOCR files as source for OCR, thereby skipping Tesseract in the process flow? #1602

bluebox-steven · 2025-12-05T10:03:41Z

Hi all. I want to use another OCR engine to generate the text output, but without having to create a plugin for it. Is it possible to use hOCR files as the source and have OCRmyPDF skip Tesseract (i.e. OCRmyPDF assumes the OCR is already done and just uses the hOCR file)? If so, what is the best way to achieve that, either from the command line or programmatically? Thanks!

Answered by jbarlow83

jbarlow83 (Maintainer) · 2025-12-06T20:34:56Z

Yes, there are APIs in ocrmypdf._api that provide "PDF to hOCR" and "hOCR to OCR PDF" as separate steps. An open source PDF app asked for this feature so they could use it to implement intermediate hOCR editing. I don't know if that ever got implemented, which is why it remains a private API: I wanted to make sure the intended consumer could use it in its current form. For the same reason, there's no command line interface; you have to script it. So if you obtain hOCR from somewhere, you can get it rendered and applied.

There are plugins like OCRmyPDF-PaddleOCR if you just want to use a plugin.

OCRmyPDF is pretty tightly coupled to Tesseract even if you bypass it for OCR - it won't really work properly unless Tesseract is installed.
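
If the hOCR comes from another engine, it may be worth sanity-checking it before handing it over. hOCR is HTML in which each word is typically an element with class `ocrx_word` and a `title` attribute carrying `bbox x0 y0 x1 y1`. The following is a minimal sketch using only the Python standard library; `HocrWordParser`, `extract_words`, and the `SAMPLE` markup are illustrative names and data, not part of OCRmyPDF.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) for every element whose class includes ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1) or None)
        self._depth = 0    # > 0 while inside a word element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:    # nested markup inside a word, e.g. <strong>
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._chunks).strip(), self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr_text):
    """Return the words and bounding boxes found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample markup for illustration only.
SAMPLE = """<html><body>
<div class='ocr_page' title='bbox 0 0 1000 800'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 120 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 130 10 300 40; x_wconf 91'>world</span>
 </span>
</div></body></html>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

A word that comes back with `None` as its bounding box is a hint that the other engine's hOCR is missing position data, which the rendering step would presumably need in order to place the text layer.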

1 reply

bluebox-steven Dec 8, 2025
Author

Thanks for the reply. I found the discussions in #453 and API code over here: https://github.com/ocrmypdf/OCRmyPDF/blob/main/src/ocrmypdf/api.py

I'll test it out and see whether it will work for my implementation.
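
For anyone landing here later, the two-step flow described above might be scripted roughly like this. It is a sketch under assumptions, not a confirmed recipe: it assumes the private helpers are importable as `ocrmypdf.api._pdf_to_hocr` and `ocrmypdf.api._hocr_to_ocr_pdf` and that each takes an input path and an output path as its first two positional arguments. Because the API is private, check the names and signatures against the `api.py` of your installed version. `ocr_pdf_with_external_hocr` and the `replace_hocr` callback are hypothetical names introduced here.

```python
from pathlib import Path
from typing import Callable


def ocr_pdf_with_external_hocr(
    input_pdf: Path,
    work_dir: Path,
    output_pdf: Path,
    replace_hocr: Callable[[Path], None],
) -> None:
    """Export hOCR, let the caller overwrite it, then render the final PDF."""
    if not input_pdf.is_file():
        raise FileNotFoundError(input_pdf)

    # ASSUMPTION: private helper names and positional arguments as recalled
    # from api.py; they may change or disappear between OCRmyPDF releases.
    from ocrmypdf.api import _hocr_to_ocr_pdf, _pdf_to_hocr

    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: "pdf to hocr" writes per-page hOCR and supporting files
    # into work_dir.
    _pdf_to_hocr(input_pdf, work_dir)

    # Step 2 (hypothetical hook): the caller overwrites the hOCR files in
    # work_dir with the other engine's output. The file names are not given
    # in this thread, so inspect the folder after step 1.
    replace_hocr(work_dir)

    # Step 3: "hocr to ocr pdf" renders the hOCR from work_dir into the PDF.
    _hocr_to_ocr_pdf(work_dir, output_pdf)
```

As far as I can tell, step 1 still runs the built-in OCR before the hOCR is replaced, which fits the note above that Tesseract needs to be installed even when its output is not used.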
