Closed
Description
Simple sanity checks
- This is an issue with an app that uses OCRmyPDF for OCR
- I am using a recent version of the third party app
- I will include a file that reproduces the issue
Third party app name and version
Paperless-ngx v2.19.5
Describe the bug
Hello there,
I am using OCRmyPDF from within paperless-ngx and am having serious issues with German documents despite excellent input quality.
Config is set to German+English language ("deu+eng").
When using only "deu", the issue does not occur with this specific document. It seems like the language detection fails in this case.
Here's the affected page from the PDF.
In "Hängeschränke" and "Unterschränke", each ä is recognized as a plain a, resulting in "Hangeschranke" and "Unterschranke".
It works in other parts of the document, so it's not broken in general. For example, "Geschirrspüler" is recognized with the ü intact.
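To make the symptom easier to check on other documents, here is a minimal Python sketch that flags words whose diacritics were dropped in the OCR text. The function names `strip_diacritics` and `find_lost_diacritics` are hypothetical helpers, not part of OCRmyPDF or Paperless-ngx, and the word list is simply the three words mentioned above.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ä' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_lost_diacritics(expected_words, ocr_text):
    """List (expected, found) pairs where only the stripped form is in ocr_text.

    Hypothetical helper: a word counts as "lost" when the correct spelling is
    absent from the OCR text but its diacritic-free spelling is present.
    """
    ocr_text = unicodedata.normalize("NFC", ocr_text)
    lost = []
    for word in expected_words:
        word = unicodedata.normalize("NFC", word)
        stripped = strip_diacritics(word)
        if stripped == word:
            continue  # the word has no diacritics, nothing to lose
        if word not in ocr_text and stripped in ocr_text:
            lost.append((word, stripped))
    return lost


if __name__ == "__main__":
    words = ["Hängeschränke", "Unterschränke", "Geschirrspüler"]
    sample = "Hangeschranke Unterschranke Geschirrspüler"
    for expected, found in find_lost_diacritics(words, sample):
        print(f"expected {expected!r}, OCR produced {found!r}")
```

Pasting the plain content from Paperless-ngx into `sample` should report the two "schränke" words and leave "Geschirrspüler" alone, matching what is described here.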
Steps to reproduce
1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check plain content / ocr result
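The same comparison could probably be reproduced outside Paperless-ngx by calling the `ocrmypdf` command line directly, once with "deu" and once with "deu+eng". The sketch below is an assumption about how one might do that, not how Paperless-ngx invokes OCRmyPDF: the helper names and output file names are made up for illustration, and `--force-ocr` is added so that both runs actually perform OCR, which may differ from the Paperless-ngx configuration.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_cmd(input_pdf, output_pdf, sidecar_txt, languages):
    """Build an ocrmypdf command line (hypothetical helper).

    -l takes Tesseract language codes joined by '+', and --sidecar writes the
    recognized text to a plain-text file. --force-ocr is an assumption here.
    """
    return [
        "ocrmypdf",
        "-l", "+".join(languages),
        "--force-ocr",
        "--sidecar", str(sidecar_txt),
        str(input_pdf),
        str(output_pdf),
    ]


def run_comparison(input_pdf, workdir):
    """Run OCR with 'deu' and 'deu+eng' and return both sidecar texts."""
    workdir = Path(workdir)
    results = {}
    for langs in (["deu"], ["deu", "eng"]):
        tag = "_".join(langs)
        sidecar = workdir / f"{tag}.txt"
        cmd = build_ocrmypdf_cmd(input_pdf, workdir / f"{tag}.pdf", sidecar, langs)
        subprocess.run(cmd, check=True)  # requires ocrmypdf and Tesseract installed
        results[tag] = sidecar.read_text(encoding="utf-8")
    return results
```

Calling `run_comparison("OCR-Problem.pdf", ".")` and searching both texts for "Hängeschränke" would show whether the difference comes from the language setting alone.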
Files
OCRmyPDF version
16.11.0
Relevant log output
There's nothing obvious besides tesseract complaining about diacritics but when does it not...
[2025-11-08 12:10:37,797] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[84b68a89-4453-45c6-b090-33d6c15a5aae] received
[2025-11-08 12:10:37,838] [INFO] [paperless.tasks] ConsumerPreflightPlugin completed with no message
[2025-11-08 12:10:37,842] [INFO] [paperless.tasks] WorkflowTriggerPlugin completed with:
[2025-11-08 12:10:37,842] [INFO] [paperless.consumer] Consuming OCR-Problem.pdf
[2025-11-08 12:10:37,869] [INFO] [paperless.parsing.tesseract] pdftotext exited 0
[2025-11-08 12:10:39,615] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 13.99 - rotation appears correct
[2025-11-08 12:10:42,740] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2025-11-08 12:10:42,797] [INFO] [ocrmypdf._pipelines.ocr] Postprocessing...
[2025-11-08 12:10:42,826] [INFO] [ocrmypdf._pipeline] Image optimization ratio: 1.00 savings: 0.0%
[2025-11-08 12:10:42,826] [INFO] [ocrmypdf._pipeline] Total file size ratio: 0.91 savings: -9.7%
[2025-11-08 12:10:43,793] [INFO] [paperless.parsing] convert exited 0