Fix spurious interword spaces in CJK/non-space-delimited text output #4524
Open
XHXIAIEIN wants to merge 1 commit into tesseract-ocr:main
When Tesseract segments CJK text into individual WERD_RES words, the
output layer in IterateAndAppendUTF8TextlineText() was unconditionally
inserting a space between every pair of words. For languages like
Chinese, Japanese, and Korean that do not use spaces as word delimiters
this produces output such as "床 前 明 月 光" instead of "床前明月光".
Fix: add a static helper IsWordSpaceDelimited() that returns false when
a word's leading character belongs to a non-space-delimited script
(Han, Hangul, Hiragana, Katakana, Thai — via unicharset::IsSpaceDelimited)
or falls in a Unicode block containing Script=Common characters that are
used exclusively in CJK contexts (e.g. U+30FC KATAKANA-HIRAGANA
PROLONGED SOUND MARK, U+3002 IDEOGRAPHIC FULL STOP):
U+3000-U+303F CJK Symbols and Punctuation
U+3040-U+309F Hiragana block (incl. voicing marks U+3099/U+309A Inherited, U+309B/U+309C Common)
U+30A0-U+30FF Katakana block (Script=Common marks incl. U+30FC)
U+FE30-U+FE4F CJK Compatibility Forms
U+FF00-U+FF9F Halfwidth and Fullwidth Forms (incl. halfwidth Katakana)
id_to_unichar_ext() is used instead of id_to_unichar() so that any
private Tesseract encodings are resolved to their canonical Unicode form
before the codepoint check.
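The block check above can be sketched as a small range table. This is a minimal illustration, not the PR's code: the names `CodepointRange`, `kCjkCommonBlocks`, and `InCjkCommonBlock` are hypothetical, and the ranges are copied from the list in this commit message.

```cpp
#include <array>

// Hypothetical helper type; not part of Tesseract.
struct CodepointRange {
  char32_t first;
  char32_t last;
};

// Ranges as listed in the commit message above.
static const std::array<CodepointRange, 5> kCjkCommonBlocks = {{
    {0x3000, 0x303F},  // CJK Symbols and Punctuation
    {0x3040, 0x309F},  // Hiragana block
    {0x30A0, 0x30FF},  // Katakana block (incl. U+30FC)
    {0xFE30, 0xFE4F},  // CJK Compatibility Forms
    {0xFF00, 0xFF9F},  // Halfwidth and Fullwidth Forms (partial)
}};

// Returns true when cp falls inside one of the listed blocks.
inline bool InCjkCommonBlock(char32_t cp) {
  for (const CodepointRange &r : kCjkCommonBlocks) {
    if (cp >= r.first && cp <= r.last) {
      return true;
    }
  }
  return false;
}
```

Han ideographs such as U+4E00 are deliberately absent from this table, since the commit message says they are handled by the script check instead.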
In IterateAndAppendUTF8TextlineText(), a space is suppressed when both
the previous and the current word are non-space-delimited, leaving
inter-script boundaries (CJK-Latin, CJK-digits) unchanged.
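A minimal sketch of the joining rule described above, written against a hypothetical `Word` struct rather than Tesseract's iterator types. `space_delimited` stands in for whatever IsWordSpaceDelimited() would report, and `JoinWords` is an illustrative name, not a function in the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a recognized word; not a Tesseract type.
struct Word {
  std::string utf8;
  bool space_delimited;  // assumed result of IsWordSpaceDelimited()
};

// Joins words with a single space, except between two adjacent words
// that are both non-space-delimited.
inline std::string JoinWords(const std::vector<Word> &words) {
  std::string out;
  for (std::size_t i = 0; i < words.size(); ++i) {
    if (i > 0) {
      const bool both_unspaced =
          !words[i - 1].space_delimited && !words[i].space_delimited;
      if (!both_unspaced) {
        out += ' ';  // keep the space at CJK-Latin and Latin-Latin boundaries
      }
    }
    out += words[i].utf8;
  }
  return out;
}
```

Under this rule a CJK word next to a Latin word still gets a space, because only one side of the boundary is non-space-delimited.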
Performance: two fast exits keep non-CJK text at near-zero overhead.
(1) Script check via IsSpaceDelimited(): O(1) array lookup, returns
immediately for Han/Hiragana/Katakana/Hangul/Thai.
(2) ASCII fast-path: first UTF-8 byte < 0x80 means codepoint < U+0080,
below all checked ranges (>= U+3000), so UNICHAR decode is skipped
entirely for Latin/ASCII words.
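The ASCII fast path in (2) can be illustrated with a hand-rolled UTF-8 decode. This is a sketch under assumptions: the patch itself is described as using Tesseract's UNICHAR decode, and `FirstCodepointAtOrAbove3000` is a hypothetical name introduced only for this example.

```cpp
#include <cstddef>
#include <string>

// Returns true when the first code point of utf8 is >= U+3000.
// Hypothetical helper; a simplified stand-in for the UNICHAR decode.
inline bool FirstCodepointAtOrAbove3000(const std::string &utf8) {
  if (utf8.empty()) {
    return false;
  }
  const unsigned char b0 = static_cast<unsigned char>(utf8[0]);
  if (b0 < 0x80) {
    return false;  // ASCII fast path: code point < U+0080, no decode needed
  }
  char32_t cp = 0;
  std::size_t len = 0;
  if ((b0 & 0xE0) == 0xC0) {
    cp = b0 & 0x1F;
    len = 2;
  } else if ((b0 & 0xF0) == 0xE0) {
    cp = b0 & 0x0F;
    len = 3;
  } else if ((b0 & 0xF8) == 0xF0) {
    cp = b0 & 0x07;
    len = 4;
  } else {
    return false;  // stray continuation byte or invalid lead byte
  }
  if (utf8.size() < len) {
    return false;  // truncated sequence
  }
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = static_cast<unsigned char>(utf8[i]);
    if ((b & 0xC0) != 0x80) {
      return false;  // not a continuation byte
    }
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp >= 0x3000;
}
```

For plain Latin text the function returns after inspecting a single byte, which is the behavior the performance note relies on.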
The preserve_interword_spaces path is unaffected.
Problem
When using `GetUTF8Text()` on CJK (Chinese, Japanese, Korean) or Thai content, Tesseract inserts a space between every recognized word. Because the LSTM engine segments non-space-delimited text into individual characters or short runs, each a separate `WERD_RES` object, the output contains a spurious space after each character, for example "床 前 明 月 光" instead of "床前明月光".

Root cause: `IterateAndAppendUTF8TextlineText()` in `resultiterator.cpp` unconditionally inserted one space between every pair of words (`numSpaces = (words_appended > 0)`), regardless of whether the surrounding scripts use spaces as word delimiters.

Fix

Add a static helper `IsWordSpaceDelimited()` that returns `false` when a word's first character:

- belongs to a non-space-delimited script (Han, Hangul, Hiragana, Katakana, Thai), as reported by `unicharset::IsSpaceDelimited()`, or
- falls in a Unicode block containing characters whose script is `Common` but which contextually belong to a non-space-delimited writing system:
  - `U+3000–U+303F` CJK Symbols and Punctuation
  - `U+3040–U+309F` Hiragana block (covers `U+3099`/`U+309A` Inherited, `U+309B`/`U+309C` Common)
  - `U+30A0–U+30FF` Katakana block (covers `U+30FC` PROLONGED SOUND MARK, Script=Common)
  - `U+FE30–U+FE4F` CJK Compatibility Forms
  - `U+FF00–U+FF60` Halfwidth and Fullwidth Forms

In `IterateAndAppendUTF8TextlineText()`, a space is suppressed only when both the previous and the current word are non-space-delimited. Spaces at script boundaries (CJK↔Latin, CJK↔digits) are intentionally preserved.

The `preserve_interword_spaces` code path is unaffected.

Performance

`IsWordSpaceDelimited()` is called once per word in the post-processing / output formatting phase (`GetUTF8Text`), which runs after recognition is complete. It has no effect on the recognition pipeline.

The function is designed with two fast exits:

1. Script check (`unicharset::IsSpaceDelimited`): O(1) array lookup; returns immediately for Han/Hiragana/Katakana/Thai/Hangul without further work.
2. ASCII fast path: if the first UTF-8 byte is `< 0x80`, the character is ASCII (U+0000–U+007F), which is below all checked ranges (≥ U+3000). The `UNICHAR` decode is skipped entirely, keeping Latin/English text at near-zero overhead.

Only characters that pass both of the above checks (i.e., multi-byte UTF-8, Script=Common) reach the full Unicode range comparison.
Scope
This fix targets `GetUTF8Text()` plain-text output only. hOCR, TSV, and PDF output are handled by separate code paths and are left for follow-up.

Testing
Both the original and patched binaries were built from the same source tree (v5.5.2) and run against identical input images.

| Input | Original | Patched |
| --- | --- | --- |
| 床前明月光 (Li Bai, 静夜思) | 床 前 明 月 光 | 床前明月光 |
| はるはあけぼの (Sei Shōnagon, 枕草子) | は る は あ け ぼ の | はるはあけぼの |
| アイスクリーム (two ー U+30FC, Script=Common) | ア イ ス ク リ ー ム | アイスクリーム |
| สวัสดีครับ (common greeting) | ส ว ั ส ด ี ค ร ั บ | สวัสดีครับ |
| 开源OCR引擎 | 开 源 OCR 引 擎 | 开源 OCR 引擎 |
| optical character recognition | optical character recognition | optical character recognition |
| The quick brown fox jumps over the lazy dog | The quick brown fox jumps over the lazy dog | The quick brown fox jumps over the lazy dog |