Fix spurious interword spaces in CJK/non-space-delimited text output by XHXIAIEIN · Pull Request #4524 · tesseract-ocr/tesseract

XHXIAIEIN · 2026-03-12T11:14:57Z

Problem

When using GetUTF8Text() on CJK (Chinese, Japanese, Korean) or Thai content, Tesseract inserts a space between every recognized word. Since the LSTM engine segments non-space-delimited text into individual characters or short runs as separate WERD_RES objects, the output contains a spurious space after each character:

Input image: 床前明月光          (Li Bai, 静夜思)
Old output:  床 前 明 月 光
New output:  床前明月光

Root cause: IterateAndAppendUTF8TextlineText() in resultiterator.cpp unconditionally inserted one space between every pair of words (numSpaces = (words_appended > 0)), regardless of whether the surrounding scripts use spaces as word delimiters.

Fix

Add a static helper IsWordSpaceDelimited() that returns false when a word's first character:

- Belongs to a non-space-delimited script (Han, Hangul, Hiragana, Katakana, Thai) as reported by UNICHARSET::IsSpaceDelimited(), or
- Falls in a Unicode block where some characters have Script=Common or Script=Inherited but contextually belong to a non-space-delimited writing system:
  - U+3000–U+303F CJK Symbols and Punctuation
  - U+3040–U+309F Hiragana block (covers U+3099/U+309A Inherited, U+309B/U+309C Common)
  - U+30A0–U+30FF Katakana block (covers U+30FC PROLONGED SOUND MARK, Script=Common)
  - U+FE30–U+FE4F CJK Compatibility Forms
  - U+FF00–U+FF60 Halfwidth and Fullwidth Forms
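The block test on its own can be sketched as follows. This is an illustration of the ranges listed above, not the code from the patch: the name InNonSpaceDelimitedBlock and its signature are hypothetical, and the real helper works on the word's leading unichar rather than on a bare code point.

```cpp
// Hypothetical helper, for illustration only (not taken from the patch).
// Returns true when a code point lies in one of the Unicode blocks that the
// description treats as belonging to non-space-delimited writing systems.
constexpr bool InNonSpaceDelimitedBlock(char32_t cp) {
  return (cp >= 0x3000 && cp <= 0x303F) ||  // CJK Symbols and Punctuation
         (cp >= 0x3040 && cp <= 0x309F) ||  // Hiragana block
         (cp >= 0x30A0 && cp <= 0x30FF) ||  // Katakana block
         (cp >= 0xFE30 && cp <= 0xFE4F) ||  // CJK Compatibility Forms
         (cp >= 0xFF00 && cp <= 0xFF60);    // Halfwidth and Fullwidth Forms,
                                            // upper bound as given in the list
}
```

With these bounds, U+30FC (prolonged sound mark) and U+3002 (ideographic full stop) are matched, while ASCII letters and digits are not.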

In IterateAndAppendUTF8TextlineText(), a space is suppressed only when both the previous and the current word are non-space-delimited. Spaces at script boundaries (CJK↔Latin, CJK↔digits) are intentionally preserved.
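A minimal sketch of that rule, assuming each word already carries the answer IsWordSpaceDelimited() would give. The Word struct and JoinWords function are hypothetical stand-ins; the actual code iterates over WERD_RES objects inside the result iterator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a recognized word (not a Tesseract type).
struct Word {
  std::string text;      // UTF-8 text of the word
  bool space_delimited;  // what IsWordSpaceDelimited() would report
};

// Joins words into one line. The separating space is dropped only when
// both the previous and the current word are non-space-delimited.
std::string JoinWords(const std::vector<Word>& words) {
  std::string line;
  bool prev_space_delimited = true;
  for (std::size_t i = 0; i < words.size(); ++i) {
    const Word& w = words[i];
    const bool suppress = !prev_space_delimited && !w.space_delimited;
    if (i > 0 && !suppress) {
      line += ' ';
    }
    line += w.text;
    prev_space_delimited = w.space_delimited;
  }
  return line;
}
```

Under this rule three Han words are concatenated directly, while a Latin word between two Han words keeps a space on both sides, which matches the script-boundary behaviour described above.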

The preserve_interword_spaces code path is unaffected.

Performance

IsWordSpaceDelimited() is called once per word in the post-processing / output formatting phase (GetUTF8Text), which runs after recognition is complete. It has no effect on the recognition pipeline.

The function is designed with two fast exits:

- Script check (UNICHARSET::IsSpaceDelimited): an O(1) array lookup that returns immediately for Han/Hiragana/Katakana/Thai/Hangul without further work.
- ASCII fast-path: if the first UTF-8 byte is < 0x80, the character is ASCII (U+0000–U+007F), which is below all checked ranges (≥ U+3000). The UNICHAR decode is skipped entirely, keeping Latin/English text at near-zero overhead.

Only characters that pass both of the above checks (i.e., multi-byte UTF-8, Script=Common) reach the full Unicode range comparison.
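The ASCII fast path can be illustrated with a hand-rolled decoder. This is a sketch under stated assumptions: the patch is described as using Tesseract's UNICHAR decode, whereas the function below (LeadingNonAsciiCodepoint, a hypothetical name) decodes UTF-8 manually and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Returns 0 when no range comparison is needed (empty, ASCII, or malformed
// input); otherwise returns the first code point of the UTF-8 string.
char32_t LeadingNonAsciiCodepoint(const std::string& utf8) {
  if (utf8.empty()) return 0;
  const unsigned char b0 = static_cast<unsigned char>(utf8[0]);
  if (b0 < 0x80) return 0;  // ASCII fast path: skip the decode entirely

  std::size_t len = 0;
  char32_t cp = 0;
  if ((b0 & 0xE0) == 0xC0) {
    len = 2;
    cp = b0 & 0x1F;
  } else if ((b0 & 0xF0) == 0xE0) {
    len = 3;
    cp = b0 & 0x0F;
  } else if ((b0 & 0xF8) == 0xF0) {
    len = 4;
    cp = b0 & 0x07;
  } else {
    return 0;  // stray continuation byte or invalid lead byte
  }
  if (utf8.size() < len) return 0;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = static_cast<unsigned char>(utf8[i]);
    if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}
```

For a Latin word such as "OCR" the function returns after inspecting a single byte; only multi-byte leading characters are decoded and passed on to the range comparison.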

Scope

This fix targets GetUTF8Text() plain-text output only. hOCR, TSV, and PDF output are handled by separate code paths and are left for follow-up.

Testing

Both the original and patched binaries were built from the same source tree (v5.5.2) and run against identical input images.

Script	Input	Before (original)	After (patched)	Result
Chinese	`床前明月光` (Li Bai, 静夜思)	`床 前 明 月 光`	`床前明月光`	✓ spaces removed
Japanese hiragana	`はるはあけぼの` (Sei Shōnagon, 枕草子)	`は る は あ け ぼ の`	`はるはあけぼの`	✓ spaces removed
Japanese katakana	`アイスクリーム` (contains `ー` U+30FC, Script=Common)	`ア イ ス ク リ ー ム`	`アイスクリーム`	✓ spaces removed
Thai	`สวัสดีครับ` (common greeting)	`ส ว ั ส ด ี ค ร ั บ`	`สวัสดีครับ`	✓ spaces removed
Chinese + Latin (script boundary)	`开源OCR引擎`	`开源 OCR 引擎`	`开源 OCR 引擎`	✓ boundary space preserved
English phrase	`optical character recognition`	`optical character recognition`	`optical character recognition`	✓ unchanged
English sentence	`The quick brown fox jumps over the lazy dog`	`The quick brown fox jumps over the lazy dog`	`The quick brown fox jumps over the lazy dog`	✓ unchanged

When Tesseract segments CJK text into individual WERD_RES words, the output layer in IterateAndAppendUTF8TextlineText() unconditionally inserted a space between every pair of words. For languages such as Chinese, Japanese, and Korean, which do not use spaces as word delimiters, this produces output such as "床 前 明 月 光" instead of "床前明月光".

Fix: add a static helper IsWordSpaceDelimited() that returns false when a word's leading character belongs to a non-space-delimited script (Han, Hangul, Hiragana, Katakana, Thai, as reported by UNICHARSET::IsSpaceDelimited) or falls in a Unicode block containing Script=Common characters that are used exclusively in CJK contexts (e.g. U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, U+3002 IDEOGRAPHIC FULL STOP):

- U+3000-U+303F CJK Symbols and Punctuation
- U+3040-U+309F Hiragana block (Script=Common combining marks)
- U+30A0-U+30FF Katakana block (Script=Common marks incl. U+30FC)
- U+FE30-U+FE4F CJK Compatibility Forms
- U+FF00-U+FF9F Halfwidth and Fullwidth Forms (incl. halfwidth Katakana)

id_to_unichar_ext() is used instead of id_to_unichar() so that any private Tesseract encodings are resolved to their canonical Unicode form before the codepoint check.

In IterateAndAppendUTF8TextlineText(), a space is suppressed when both the previous and the current word are non-space-delimited, leaving inter-script boundaries (CJK-Latin, CJK-digits) unchanged.

Performance: two fast exits keep non-CJK text at near-zero overhead.
(1) Script check via IsSpaceDelimited(): O(1) array lookup, returns immediately for Han/Hiragana/Katakana/Hangul/Thai.
(2) ASCII fast-path: first UTF-8 byte < 0x80 means codepoint < U+0080, below all checked ranges (>= U+3000), so UNICHAR decode is skipped entirely for Latin/ASCII words.

The preserve_interword_spaces path is unaffected.

XHXIAIEIN force-pushed the fix/cjk-interword-spaces branch from 573e24d to f77875a on March 12, 2026 11:50

XHXIAIEIN force-pushed the fix/cjk-interword-spaces branch from f77875a to f50f7bf on March 12, 2026 11:56

XHXIAIEIN wants to merge 1 commit into tesseract-ocr:main from XHXIAIEIN:fix/cjk-interword-spaces
