Fix spurious interword spaces in CJK/non-space-delimited text output #4524

Open
XHXIAIEIN wants to merge 1 commit into tesseract-ocr:main from XHXIAIEIN:fix/cjk-interword-spaces

XHXIAIEIN commented Mar 12, 2026

Problem

When GetUTF8Text() is called on CJK (Chinese, Japanese, Korean) or Thai content, Tesseract inserts a space between every pair of adjacent recognized words. Because the LSTM engine segments non-space-delimited text into individual characters or short runs, each stored as a separate WERD_RES object, the output contains a spurious space after each character:

Input image: 床前明月光          (Li Bai, 静夜思)
Old output:  床 前 明 月 光
New output:  床前明月光

Root cause: IterateAndAppendUTF8TextlineText() in resultiterator.cpp unconditionally inserts one space between every pair of words (numSpaces = (words_appended > 0)), regardless of whether the surrounding scripts use spaces as word delimiters.

Fix

Add a static helper IsWordSpaceDelimited() that returns false when a word's first character:

  • Belongs to a non-space-delimited script (Han, Hangul, Hiragana, Katakana, Thai) as reported by UNICHARSET::IsSpaceDelimited(), or
  • Falls in a Unicode block where some characters have Script=Common but contextually belong to a non-space-delimited writing system:
    • U+3000–U+303F CJK Symbols and Punctuation
    • U+3040–U+309F Hiragana block (covers U+3099/U+309A Inherited, U+309B/U+309C Common)
    • U+30A0–U+30FF Katakana block (covers U+30FC PROLONGED SOUND MARK, Script=Common)
    • U+FE30–U+FE4F CJK Compatibility Forms
    • U+FF00–U+FF60 Halfwidth and Fullwidth Forms
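
The block list above can be expressed as a small range table. The following is a minimal sketch only, not the code from this PR: the names CodepointRange, kCjkContextRanges and InCjkContextBlock are hypothetical, and the bounds are copied from the list as written.

```cpp
#include <cstddef>

// Hypothetical range table mirroring the bullet list above (not the PR's code).
struct CodepointRange {
  char32_t first;
  char32_t last;
};

static const CodepointRange kCjkContextRanges[] = {
    {0x3000, 0x303F},  // CJK Symbols and Punctuation
    {0x3040, 0x309F},  // Hiragana block (incl. U+3099..U+309C marks)
    {0x30A0, 0x30FF},  // Katakana block (incl. U+30FC)
    {0xFE30, 0xFE4F},  // CJK Compatibility Forms
    {0xFF00, 0xFF60},  // Halfwidth and Fullwidth Forms (upper bound as listed above)
};

// Returns true when cp lies inside one of the listed blocks.
inline bool InCjkContextBlock(char32_t cp) {
  const std::size_t n = sizeof(kCjkContextRanges) / sizeof(kCjkContextRanges[0]);
  for (std::size_t i = 0; i < n; ++i) {
    if (cp >= kCjkContextRanges[i].first && cp <= kCjkContextRanges[i].last) {
      return true;
    }
  }
  return false;
}
```

With this table, U+30FC and U+3002 are matched even though their script property is Common, while Han ideographs such as U+4E00 are not matched here and are left to the script check.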

In IterateAndAppendUTF8TextlineText(), a space is suppressed only when both the previous and the current word are non-space-delimited. Spaces at script boundaries (CJK↔Latin, CJK↔digits) are intentionally preserved.
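
The suppression rule can be illustrated in isolation. This is a hedged sketch under simplifying assumptions: Word and JoinWords are hypothetical stand-ins (not WERD_RES or the actual iterator code), and the space_delimited flag stands for whatever IsWordSpaceDelimited() would report for that word.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a recognized word (not Tesseract's WERD_RES).
struct Word {
  std::string utf8;
  bool space_delimited;  // assumed result of IsWordSpaceDelimited() for this word
};

// Joins words with a single space, omitting the space only when BOTH the
// previous and the current word are non-space-delimited.
inline std::string JoinWords(const std::vector<Word> &words) {
  std::string out;
  bool prev_space_delimited = true;
  for (std::size_t i = 0; i < words.size(); ++i) {
    const bool cur_space_delimited = words[i].space_delimited;
    if (i > 0 && (prev_space_delimited || cur_space_delimited)) {
      out += ' ';  // keep the space at Latin-Latin and CJK-Latin boundaries
    }
    out += words[i].utf8;
    prev_space_delimited = cur_space_delimited;
  }
  return out;
}
```

Fed the words 开, 源, OCR, 引, 擎 (with only OCR marked space-delimited), this sketch yields "开源 OCR 引擎", matching the script-boundary behavior described above.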

The preserve_interword_spaces code path is unaffected.

Performance

IsWordSpaceDelimited() is called once per word in the post-processing / output formatting phase (GetUTF8Text), which runs after recognition is complete. It has no effect on the recognition pipeline.

The function is designed with two fast exits:

  1. Script check (UNICHARSET::IsSpaceDelimited): O(1) array lookup that returns immediately for Han/Hiragana/Katakana/Thai/Hangul without further work.
  2. ASCII fast-path: if the UTF-8 first byte is < 0x80, the character is ASCII (U+0000–U+007F), which is below all checked ranges (≥ U+3000). The UNICHAR decode is skipped entirely, keeping Latin/English text at near-zero overhead.

Only characters that fall through both exits (multi-byte UTF-8 whose script is not one of the five above, such as Script=Common punctuation) reach the full Unicode range comparison.
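
The ordering of the two exits might look roughly like the sketch below. It is illustrative only: FirstCodepoint, RangeCheck and IsWordSpaceDelimitedSketch are hypothetical names, the script result is passed in as a plain bool instead of being looked up in a unicharset, and a hand-written decoder stands in for Tesseract's UNICHAR decode.

```cpp
#include <cstddef>
#include <string>

// Decodes the first code point of a UTF-8 string.
// Returns 0 for empty or malformed input (a simplification for this sketch).
inline char32_t FirstCodepoint(const std::string &s) {
  if (s.empty()) return 0;
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  if (b0 < 0x80) return b0;
  std::size_t extra = 0;
  char32_t cp = 0;
  if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
  else return 0;
  if (s.size() < extra + 1) return 0;
  for (std::size_t i = 1; i <= extra; ++i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}

// Hypothetical callback type for the Unicode block comparison.
typedef bool (*RangeCheck)(char32_t);

inline bool IsWordSpaceDelimitedSketch(const std::string &first_char_utf8,
                                       bool script_space_delimited,
                                       RangeCheck in_cjk_block) {
  // Exit 1: the script lookup already says "non-space-delimited".
  if (!script_space_delimited) return false;
  // Exit 2: an ASCII lead byte is below every checked range; skip decoding.
  if (first_char_utf8.empty() ||
      static_cast<unsigned char>(first_char_utf8[0]) < 0x80) {
    return true;
  }
  // Slow path: decode once and compare against the block ranges.
  return !in_cjk_block(FirstCodepoint(first_char_utf8));
}
```

Passing the block comparison as a callback keeps the sketch independent of any particular range table; in this arrangement only multi-byte characters from space-delimited scripts pay for the decode.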

Scope

This fix targets GetUTF8Text() plain-text output only. hOCR, TSV, and PDF output are handled by separate code paths and are left for follow-up.

Testing

Both the original and patched binaries were built from the same source tree (v5.5.2) and run against identical input images.

| Script | Input | Before (original) | After (patched) | Result |
|---|---|---|---|---|
| Chinese | 床前明月光 (Li Bai, 静夜思) | 床 前 明 月 光 | 床前明月光 | ✓ spaces removed |
| Japanese hiragana | はるはあけぼの (Sei Shōnagon, 枕草子) | は る は あ け ぼ の | はるはあけぼの | ✓ spaces removed |
| Japanese katakana | アイスクリーム (two U+30FC, Script=Common) | ア イ ス ク リ ー ム | アイスクリーム | ✓ spaces removed |
| Thai | สวัสดีครับ (common greeting) | ส ว ั ส ด ี ค ร ั บ | สวัสดีครับ | ✓ spaces removed |
| Chinese + Latin (script boundary) | 开源OCR引擎 | 开 源 OCR 引 擎 | 开源 OCR 引擎 | ✓ boundary space preserved |
| English phrase | optical character recognition | optical character recognition | optical character recognition | ✓ unchanged |
| English sentence | The quick brown fox jumps over the lazy dog | The quick brown fox jumps over the lazy dog | The quick brown fox jumps over the lazy dog | ✓ unchanged |

XHXIAIEIN force-pushed the fix/cjk-interword-spaces branch from 573e24d to f77875a on March 12, 2026, 11:50

When Tesseract segments CJK text into individual WERD_RES words, the
output layer in IterateAndAppendUTF8TextlineText() was unconditionally
inserting a space between every pair of words.  For languages like
Chinese, Japanese, and Korean that do not use spaces as word delimiters
this produces output such as "床 前 明 月 光" instead of "床前明月光".

Fix: add a static helper IsWordSpaceDelimited() that returns false when
a word's leading character belongs to a non-space-delimited script
(Han, Hangul, Hiragana, Katakana, Thai, via UNICHARSET::IsSpaceDelimited)
or falls in a Unicode block containing Script=Common characters that are
used exclusively in CJK contexts (e.g. U+30FC KATAKANA-HIRAGANA
PROLONGED SOUND MARK, U+3002 IDEOGRAPHIC FULL STOP):
  U+3000-U+303F  CJK Symbols and Punctuation
  U+3040-U+309F  Hiragana block (U+3099/U+309A Inherited, U+309B/U+309C Common)
  U+30A0-U+30FF  Katakana block (Script=Common marks incl. U+30FC)
  U+FE30-U+FE4F  CJK Compatibility Forms
  U+FF00-U+FF9F  Halfwidth and Fullwidth Forms (incl. halfwidth Katakana)

id_to_unichar_ext() is used instead of id_to_unichar() so that any
private Tesseract encodings are resolved to their canonical Unicode form
before the codepoint check.

In IterateAndAppendUTF8TextlineText(), a space is suppressed when both
the previous and the current word are non-space-delimited, leaving
inter-script boundaries (CJK-Latin, CJK-digits) unchanged.

Performance: two fast exits keep non-CJK text at near-zero overhead.
(1) Script check via IsSpaceDelimited(): O(1) array lookup, returns
    immediately for Han/Hiragana/Katakana/Hangul/Thai.
(2) ASCII fast-path: first UTF-8 byte < 0x80 means codepoint < U+0080,
    below all checked ranges (>= U+3000), so UNICHAR decode is skipped
    entirely for Latin/ASCII words.

The preserve_interword_spaces path is unaffected.

XHXIAIEIN force-pushed the fix/cjk-interword-spaces branch from f77875a to f50f7bf on March 12, 2026, 11:56