[3rdparty]: Wrong results in German despite excellent input quality #1595

@DReffects

Simple sanity checks

  • This is an issue with an app that uses OCRmyPDF for OCR
  • I am using a recent version of the third party app
  • I will include a file that reproduces the issue

Third party app name and version

Paperless-ngx v2.19.5

Describe the bug

Hello there,

I am using OCRmyPDF from within paperless-ngx and am having serious issues with German documents despite excellent input quality.

The OCR language is configured as German+English ("deu+eng").
When only "deu" is configured, the issue does not occur with this specific document. It seems the language detection fails in this case.

Here's the affected page from the PDF.

OCR-Problem.pdf

In both "Hängeschränke" and "Unterschränke", the ä is recognized as a plain a, resulting in "Hangeschranke" and "Unterschranke".
Umlauts are recognized correctly in other parts of the document, so it's not broken in general. For example, "Geschirrspüler" is recognized with the ü intact.
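
To check the OCR text for this kind of error automatically, something like the following could work. It is only a rough sketch: `strip_diacritics` and `find_lost_umlauts` are hypothetical helper names, not part of OCRmyPDF or Paperless-ngx, and the list of expected words has to be supplied by hand.

```python
import re
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks removed (ä -> a, ü -> u)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_lost_umlauts(ocr_text: str, expected_words: list[str]) -> list[str]:
    """List expected words that only appear in the text without their diacritics.

    Hypothetical helper: a word is reported when its correct spelling is
    missing from the OCR text but its diacritic-free spelling is present.
    """
    # Normalize to NFC so that precomposed and decomposed umlauts compare equal.
    normalized = unicodedata.normalize("NFC", ocr_text)
    tokens = set(re.findall(r"\w+", normalized))
    lost = []
    for word in expected_words:
        word = unicodedata.normalize("NFC", word)
        folded = strip_diacritics(word)
        if folded != word and word not in tokens and folded in tokens:
            lost.append(word)
    return lost
```

Called with the words from this report, for example `find_lost_umlauts(text, ["Hängeschränke", "Unterschränke", "Geschirrspüler"])`, it would flag the first two words if the text contains "Hangeschranke" and "Unterschranke" but an intact "Geschirrspüler".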

Steps to reproduce

1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check plain content / ocr result
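
To rule out Paperless-ngx itself, the same comparison could probably be made by calling the `ocrmypdf` command line directly, once per language setting. The following is a minimal sketch, assuming `ocrmypdf` is on the PATH and the "deu" and "eng" Tesseract language data are installed. `build_ocr_command` and `run_comparison` are hypothetical helpers, and `--force-ocr` is an assumption about how the file needs to be processed; it may not match the mode Paperless-ngx uses.

```python
import subprocess
from pathlib import Path


def build_ocr_command(input_pdf: str, language: str, out_dir: str) -> list[str]:
    """Build an ocrmypdf command that also writes the OCR text to a sidecar file."""
    out = Path(out_dir)
    tag = language.replace("+", "_")  # "deu+eng" -> "deu_eng" for file names
    return [
        "ocrmypdf",
        "-l", language,                          # Tesseract language(s)
        "--force-ocr",                           # assumption: OCR even if text exists
        "--sidecar", str(out / f"{tag}.txt"),    # plain-text OCR result
        str(input_pdf),
        str(out / f"{tag}.pdf"),
    ]


def run_comparison(input_pdf: str, out_dir: str,
                   languages: tuple[str, ...] = ("deu", "deu+eng")) -> dict[str, str]:
    """Run OCR once per language setting and return the sidecar text for each."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results = {}
    for language in languages:
        command = build_ocr_command(input_pdf, language, out_dir)
        subprocess.run(command, check=True)
        sidecar = Path(command[command.index("--sidecar") + 1])
        results[language] = sidecar.read_text(encoding="utf-8")
    return results
```

Running `run_comparison("OCR-Problem.pdf", "ocr-out")` and comparing the two returned texts should show whether "Hängeschränke" survives with "deu" but not with "deu+eng" outside of Paperless-ngx as well.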

Files

OCR-Problem.pdf

OCRmyPDF version

16.11.0

Relevant log output

There's nothing obvious besides Tesseract complaining about diacritics, but when does it not...

[2025-11-08 12:10:37,797] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[84b68a89-4453-45c6-b090-33d6c15a5aae] received
[2025-11-08 12:10:37,838] [INFO] [paperless.tasks] ConsumerPreflightPlugin completed with no message
[2025-11-08 12:10:37,842] [INFO] [paperless.tasks] WorkflowTriggerPlugin completed with: 
[2025-11-08 12:10:37,842] [INFO] [paperless.consumer] Consuming OCR-Problem.pdf
[2025-11-08 12:10:37,869] [INFO] [paperless.parsing.tesseract] pdftotext exited 0
[2025-11-08 12:10:39,615] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 13.99 - rotation appears correct
[2025-11-08 12:10:42,740] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2025-11-08 12:10:42,797] [INFO] [ocrmypdf._pipelines.ocr] Postprocessing...
[2025-11-08 12:10:42,826] [INFO] [ocrmypdf._pipeline] Image optimization ratio: 1.00 savings: 0.0%
[2025-11-08 12:10:42,826] [INFO] [ocrmypdf._pipeline] Total file size ratio: 0.91 savings: -9.7%
[2025-11-08 12:10:43,793] [INFO] [paperless.parsing] convert exited 0
