Can I use hOCR files as source for OCR, thereby skipping Tesseract in the process flow? #1602
Hi all. I want to use another OCR engine to generate the text output, but without having to create a plugin for it. Is it possible to use hOCR files as the source and get OCRmyPDF to skip Tesseract (i.e. assume the OCR is already done and just use the hOCR file)? If so, what is the best way to achieve that, either from the command line or programmatically? Thanks!
Replies: 1 comment 1 reply
Yes, there are APIs in ocrmypdf._api that provide "PDF to hOCR" and "hOCR to OCR PDF" as separate steps. An open source PDF app asked for this feature so they could use it to implement intermediate hOCR editing. I don't know if that ever got implemented, which is why it remains a private API: I wanted to make sure the intended consumer could use it in its current form. For the same reason, there's no command line interface; you have to script it. So if you obtain hOCR from somewhere, you can get it rendered and applied.

There are plugins like OCRmyPDF-PaddleOCR if you just want to use a plugin.

OCRmyPDF is pretty tightly coupled to Tesseract even if you bypass it for OCR - it won't really work properly unless it's installed.
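For anyone scripting this, here is a rough sketch of how the two steps might be chained. Treat it as a set of assumptions and not as documented usage: the function names `_pdf_to_hocr` and `_hocr_to_ocr_pdf`, the candidate module paths, and the two-positional-argument call shape are guesses about a private API that may differ between versions, so check the source of your installed copy. `replace_hocr` is a hypothetical callback you would write yourself to overwrite the generated hOCR files in the work folder with the ones from your own engine.

```python
import importlib
from pathlib import Path


def find_two_step_api():
    """Look up the private two-step functions; return None if they are absent.

    ASSUMPTION: the module paths and function names below are guesses about
    a private API and may not match your OCRmyPDF version.
    """
    for module_name in ("ocrmypdf._api", "ocrmypdf.api"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        pdf_to_hocr = getattr(module, "_pdf_to_hocr", None)
        hocr_to_ocr_pdf = getattr(module, "_hocr_to_ocr_pdf", None)
        if pdf_to_hocr is not None and hocr_to_ocr_pdf is not None:
            return pdf_to_hocr, hocr_to_ocr_pdf
    return None


def ocr_with_external_hocr(input_pdf, output_pdf, work_dir, replace_hocr, api=None):
    """Run step 1, swap in external hOCR, then run step 2.

    replace_hocr is a hypothetical, user-supplied callable that receives the
    work folder and overwrites the hOCR files there. api lets you pass the
    two step functions explicitly instead of relying on the lookup above.
    """
    steps = api if api is not None else find_two_step_api()
    if steps is None:
        raise RuntimeError("two-step hOCR functions not found in this install")
    pdf_to_hocr, hocr_to_ocr_pdf = steps

    work_dir = Path(work_dir)
    # Step 1: PDF to hOCR (ASSUMED call shape: input PDF, then work folder).
    pdf_to_hocr(Path(input_pdf), work_dir)
    # Between the steps: replace the generated hOCR with your own.
    replace_hocr(work_dir)
    # Step 2: hOCR to OCR PDF (ASSUMED call shape: work folder, then output).
    hocr_to_ocr_pdf(work_dir, Path(output_pdf))
```

Since the first step presumably still produces its own hOCR before you replace it, this does not remove the need to have Tesseract installed.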