doc: Add comprehensive PARAMETERS section to the tesseract man page by Copilot · Pull Request #4526 · tesseract-ocr/tesseract

Copilot · 2026-03-16T17:15:30Z

The --print-parameters output provides only terse one-liner descriptions with no indication of which parameters apply to the LSTM engine vs. the legacy engine.

Adds a new PARAMETERS section to doc/tesseract.1.asc covering the most commonly used parameters, each annotated with type, default value, and engine compatibility ([Both], [LSTM], or [Legacy]), organized into subsections:

Output Format — tessedit_create_*, textonly_pdf, hocr_*, page_xml_*, preserve_interword_spaces
Character Set — tessedit_char_whitelist/blacklist/unblacklist
Image Processing — thresholding_method (Otsu / LeptonicaOtsu / Sauvola and their tuning knobs), invert_threshold, user_defined_dpi
Dictionary — load_*_dawg, user_words/patterns_file/suffix
LSTM Engine — lstm_use_matrix, lstm_choice_mode/iterations, lstm_rating_coefficient
Legacy Engine — language_model_penalty_*, tessedit_enable_*_correction, wordrec_enable_assoc
Debug — debug_file, tessedit_write_params_to_file
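The annotation format described above (type, default value, engine label) is regular enough to be checked mechanically. The following is a minimal sketch, assuming every entry header follows the `*name* (type, default: value) [Engine]::` shape seen in the diff hunks of this PR; `parse_entry_header` is a hypothetical helper, not part of Tesseract or its build.

```python
import re

# Matches entry headers of the form used in this PR's diff hunks, e.g.
#   +*tessedit_create_hocr* (bool, default: 0) [Both]::
# The leading "+" (diff marker) is optional.
ENTRY_RE = re.compile(
    r"^\+?\*(?P<name>\w+)\*\s+"
    r"\((?P<type>\w+),\s*default:\s*(?P<default>.*?)\)\s+"
    r"\[(?P<engine>Both|LSTM|Legacy)\]::$"
)


def parse_entry_header(line):
    """Return a dict with name, type, default and engine, or None if no match."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_entry_header("+*tessedit_create_hocr* (bool, default: 0) [Both]::"))
    print(parse_entry_header("+*lstm_use_matrix* (bool, default: 1) [LSTM]::"))
```

A check like this could flag entries with a missing or misspelled engine label before the man page is built.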

Also clarifies that tessedit_do_invert is deprecated and that invert_threshold=0 is the correct way to disable automatic image inversion, and that --dpi is equivalent to setting user_defined_dpi directly.
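As an illustration of the two clarifications, here is a small sketch that builds a `tesseract` command line with `--dpi` and with `-c invert_threshold=0`. The helper `build_tesseract_args` and the file names are hypothetical; the sketch only constructs the argument list and does not run Tesseract.

```python
def build_tesseract_args(image, output_base, dpi=None, disable_inversion=False):
    """Build an argv list for the tesseract CLI (illustrative helper)."""
    args = ["tesseract", image, output_base]
    if dpi is not None:
        # Per the PR description, this is equivalent to "-c user_defined_dpi=<dpi>".
        args += ["--dpi", str(dpi)]
    if disable_inversion:
        # invert_threshold=0 disables automatic image inversion
        # (replaces the deprecated tessedit_do_invert).
        args += ["-c", "invert_threshold=0"]
    return args


if __name__ == "__main__":
    # "scan.png" and "out" are placeholder names.
    print(" ".join(build_tesseract_args("scan.png", "out", dpi=300, disable_inversion=True)))
```

The resulting list can be passed to `subprocess.run` on a system where Tesseract is installed.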

Original prompt

This section details the original issue you should resolve

<issue_title>Better documentation for --print-parameters.</issue_title>
<issue_description>### Your Feature Request

The documentation for each parameter is super short and hard for an end user to understand. It would be good to understand which parameters apply to the new LSTM and which are the legacy OCR.

Changing --print-parameters might be a breaking change but adding a manual file would be harmless and useful.</issue_description>

Comments on the Issue (you are @copilot in this section)
@stweil Do you want to use a chatbot to do this task? It might be a real challenge to create correct documentation for each parameter.

Fixes Better documentation for --print-parameters. #4525

Document the most commonly used Tesseract parameters with:
- Detailed descriptions explaining what each parameter does
- Engine compatibility labels ([Both], [LSTM], [Legacy])
- Organization into functional sections: Output Format, Character Set, Image Processing, Dictionary, LSTM Engine, Legacy Engine, Debug

All default values have been verified against the actual --print-parameters output.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>

idrise · 2026-03-18T01:25:21Z

doc/tesseract.1.asc

+*tessedit_create_hocr* (bool, default: 0) [Both]::
+  Write hOCR output to a `.hocr` file.  hOCR is an HTML-based format that
+  encodes the OCR results together with their bounding boxes and confidences.
+  Use the *hocr* config file to enable this format.


Having a link here to the specification may be useful
https://kba.github.io/hocr-spec/1.2/

idrise · 2026-03-18T01:26:51Z

doc/tesseract.1.asc

+  Use the *hocr* config file to enable this format.
+
+*hocr_font_info* (bool, default: 0) [Both]::
+  Include font information in hOCR output.


What information, and how reliable is it? I'd love to know that before I enable this.

idrise · 2026-03-18T01:44:18Z

doc/tesseract.1.asc

+  Include font information in hOCR output.
+
+*hocr_char_boxes* (bool, default: 0) [Both]::
+  Add per-character bounding-box coordinates to hOCR output.


My knowledge here might be out of date, but while using tesseract in the past I found that the char boxes only really worked on the legacy OCR.
So I created a quick Python script to extract the letter 'a' using LSTM and got this:

[Two result images, not preserved]

So I had 26 failures out of 131, roughly a 20% failure rate, using the LSTM. Maybe worth noting that this is still under development?
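The script mentioned in this comment is not preserved. The sketch below shows one way such an extraction could look. It assumes that `hocr_char_boxes` output wraps each character in a `span` with class `ocrx_cinfo` and an `x_bboxes` property in the `title` attribute. The sample markup is made up for illustration and should be checked against real Tesseract output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an 'x_bboxes ...' property, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "x_bboxes":
            return tuple(int(v) for v in parts[1:])
    return None


class CharBoxParser(HTMLParser):
    """Collect (character, bbox) pairs from assumed ocrx_cinfo spans."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self._pending = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_cinfo":
            self._pending = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.boxes.append((data.strip(), self._pending))
            self._pending = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None


def boxes_for(hocr, char):
    """Return the bounding boxes of every occurrence of char."""
    parser = CharBoxParser()
    parser.feed(hocr)
    return [box for text, box in parser.boxes if text == char]


# Hypothetical sample markup, not captured from a real run.
SAMPLE = (
    "<span class='ocrx_word' title='bbox 10 10 70 40'>"
    "<span class='ocrx_cinfo' title='x_bboxes 10 12 28 40; x_conf 98.1'>c</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 30 18 48 40; x_conf 97.4'>a</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 50 10 70 40; x_conf 99.0'>t</span>"
    "</span>"
)

if __name__ == "__main__":
    print(boxes_for(SAMPLE, "a"))
    # The rate quoted in the comment: 26 failures out of 131 samples.
    print(f"{26 / 131:.1%}")
```

Cropping each returned box from the page image and inspecting the result by eye would reproduce the kind of check described in the comment.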

idrise · 2026-03-18T01:48:43Z

doc/tesseract.1.asc

+
+*tessedit_create_page_xml* (bool, default: 0) [Both]::
+  Write PAGE XML output to a `.page.xml` file.  PAGE is a standard XML format
+  for ground truth and OCR results used in document image analysis competitions.


PAGE XML: the emphasis on competitions is strange. It is a general-purpose format used in:

Digital humanities projects
Libraries and archives
Annotation tools (e.g. Transkribus, eScriptorium)

https://github.com/PRImA-Research-Lab/PAGE-XML

idrise · 2026-03-18T01:53:02Z

doc/tesseract.1.asc

+
+*tessedit_create_boxfile* (bool, default: 0) [Both]::
+  Write a Tesseract box file (`.box`) that lists each recognized character with
+  its bounding box.  Used mainly for legacy engine training.


If it is used for legacy engine training should it be both? I'm not sure, but worth a thought.

idrise · 2026-03-18T02:09:49Z

doc/tesseract.1.asc

+  of additional valid words.  For example, setting this to `user-words` causes
+  Tesseract to look for `eng.user-words` when using the English model.
+
+*user_patterns_file* (string, default: "") [Both]::


Should you wish to beef this up a tiny bit...

Defined in dict/trie.h, but in simplified terms:

A becomes uppercase letter
a becomes lowercase letter
0 becomes digit

Other symbols match themselves.
This is a structure template and not a regex.

idrise · 2026-03-18T02:13:20Z

doc/tesseract.1.asc

+
+*lstm_use_matrix* (bool, default: 1) [LSTM]::
+  Use the ratings matrix and beam search during LSTM decoding.  Disabling this
+  reverts to a simpler greedy decoding strategy.


Might be worth mentioning that it's a little faster when turned off but requires very clean text? I can't imagine a use case for me disabling this one.

idrise · 2026-03-18T02:14:21Z

doc/tesseract.1.asc

+  Enables alternative character hypotheses in hOCR output:
+  0 = disabled (default);
+  1 = include per-timestep alternative choices;
+  2 = extract alternative choices from the CTC output mapped per character.


Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.

idrise · 2026-03-18T02:15:57Z

doc/tesseract.1.asc

+
+*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
+  Apply bigram-based correction to improve recognition of adjacent words that
+  form common pairs.


E.g. 'is a' or 'lived at' etc. (For those not familiar with bigram correction.)

idrise · 2026-03-18T02:21:40Z

doc/tesseract.1.asc

+[[PARAMETERS]]
+PARAMETERS
+----------
+


Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀

stweil · 2026-03-18T08:47:58Z

@copilot, please address the recent feedback.

stweil · 2026-03-18T08:52:38Z

doc/tesseract.1.asc

+  Blacklisted characters will never appear in the output.  The blacklist is
+  applied after the whitelist.
+
+*tessedit_char_unblacklist* (string, default: "") [Both]::


@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.

- Add hOCR spec link to tessedit_create_hocr
- Expand hocr_font_info to describe x_font/x_fsize metadata and LSTM caveat
- Add LSTM accuracy note and cross-reference to hocr_char_boxes description
- Fix tessedit_create_page_xml: broaden scope (digital humanities, libraries, annotation tools like Transkribus/eScriptorium) and link to PAGE-XML spec
- Clarify tessedit_create_boxfile works with both engines
- Clarify tessedit_char_unblacklist applies to both engines
- Expand user_patterns_file with accurate backslash character-class syntax (\d=digit, \c=letter, \a=lower, \A=upper, \n=alphanum, \p=punc) and example
- Note lstm_use_matrix disabling is faster but lower accuracy
- Cross-reference hocr_char_boxes from lstm_choice_mode
- Add word pair examples to tessedit_enable_bigram_correction description
- Add textord_heavy_nr parameter for aggressive noise removal on dirty scans

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>

Copilot · 2026-03-18T08:55:48Z

@copilot, please address the recent feedback.

Addressed all 11 review comments from @idrise in commit f270828:

Added hOCR spec link (https://kba.github.io/hocr-spec/1.2/) to tessedit_create_hocr
Expanded hocr_font_info to detail x_font/x_fsize metadata and note LSTM reliability caveat
Added note that hocr_char_boxes character positions are approximated under LSTM; cross-referenced lstm_choice_mode
Fixed tessedit_create_page_xml to reflect broad use in digital humanities, libraries, and annotation tools (Transkribus, eScriptorium), with link to PAGE-XML spec
Clarified tessedit_create_boxfile can be produced by either engine but is primarily for legacy training ground truth
Clarified tessedit_char_unblacklist applies to both engines
Corrected user_patterns_file to document the actual backslash character-class syntax (\d, \c, \a, \A, \n, \p) with an example from the existing man-page config-file section
Noted lstm_use_matrix is faster when disabled but lower accuracy; works best on very clean images
Cross-referenced hocr_char_boxes from lstm_choice_mode
Added word-pair examples ("is a", "in the", "New York") to tessedit_enable_bigram_correction
Added textord_heavy_nr for aggressive noise removal on heavily degraded scans
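To make the corrected `user_patterns_file` syntax concrete, here is a rough sketch of a matcher for the six backslash classes listed above (\d, \c, \a, \A, \n, \p). It is an approximation for illustration only: it uses Python's own character tests rather than Tesseract's unicharset, it treats every other character as a literal, and it does not handle any further syntax that dict/trie.h may define. The pattern in the usage example is illustrative.

```python
import string

# Character classes as listed in the review reply; the Python predicates
# are stand-ins for Tesseract's unicharset properties.
CLASSES = {
    "d": str.isdigit,                          # \d = digit
    "c": str.isalpha,                          # \c = letter
    "a": str.islower,                          # \a = lowercase letter
    "A": str.isupper,                          # \A = uppercase letter
    "n": str.isalnum,                          # \n = alphanumeric
    "p": lambda ch: ch in string.punctuation,  # \p = punctuation
}


def compile_pattern(pattern):
    """Turn a pattern into one predicate per expected character."""
    predicates = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] in CLASSES:
            predicates.append(CLASSES[pattern[i + 1]])
            i += 2
        else:
            # Any other symbol matches itself.
            predicates.append(lambda c, literal=ch: c == literal)
            i += 1
    return predicates


def matches(pattern, word):
    """True if word has the same length as the pattern and fits every slot."""
    predicates = compile_pattern(pattern)
    return len(predicates) == len(word) and all(
        pred(ch) for pred, ch in zip(predicates, word)
    )


if __name__ == "__main__":
    print(matches(r"1-\d\d\d-GOOG-411", "1-800-GOOG-411"))
    print(matches(r"1-\d\d\d-GOOG-411", "1-8x0-GOOG-411"))
```

As the earlier review comment notes, this is a structure template rather than a regular expression: each slot constrains exactly one character.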

@stweil

Replace "blacklist"/"whitelist" in descriptive text with "exclusion list", "allowlist", and "Characters in this list" formulations as requested by @stweil. Parameter names (tessedit_char_whitelist, tessedit_char_blacklist, tessedit_char_unblacklist) are unchanged.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
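A wording rule like this can be checked with a short script. The following is a sketch under the assumption that only the three parameter names mentioned in this commit are exempt; `find_flagged_terms` is a hypothetical helper and is not part of the repository.

```python
import re

# Parameter names that legitimately contain the terms and must stay unchanged.
PARAM_NAMES = {
    "tessedit_char_whitelist",
    "tessedit_char_blacklist",
    "tessedit_char_unblacklist",
}

# Any identifier-like token containing "blacklist" or "whitelist".
TERM_RE = re.compile(r"\w*(?:black|white)list\w*", re.IGNORECASE)


def find_flagged_terms(text):
    """Return (line_number, token) for each use outside a parameter name."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TERM_RE.finditer(line):
            if match.group(0) not in PARAM_NAMES:
                hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Lines taken from the diff hunk that prompted the request.
    sample = (
        "  Blacklisted characters will never appear in the output.  The blacklist is\n"
        "  applied after the whitelist.\n"
        '*tessedit_char_unblacklist* (string, default: "") [Both]::\n'
    )
    for lineno, token in find_flagged_terms(sample):
        print(f"line {lineno}: {token}")
```

Run against doc/tesseract.1.asc, a check along these lines would list the descriptive uses while leaving the parameter names alone.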

Copilot · 2026-03-18T08:59:05Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Initial plan

da5ea05

Copilot AI assigned Copilot and stweil Mar 16, 2026

Copilot started work on behalf of stweil March 16, 2026 17:15 View session

Copilot AI changed the title ~~[WIP] Add better documentation for --print-parameters~~ doc: Add comprehensive PARAMETERS section to the tesseract man page Mar 16, 2026

Copilot AI requested a review from stweil March 16, 2026 17:28

Copilot finished work on behalf of stweil March 16, 2026 17:28

stweil mentioned this pull request Mar 16, 2026

Better documentation for --print-parameters. #4525

Open

idrise reviewed Mar 18, 2026

Copilot started work on behalf of stweil March 18, 2026 08:48 View session

stweil requested changes Mar 18, 2026

Copilot AI requested a review from stweil March 18, 2026 08:56

Copilot finished work on behalf of stweil March 18, 2026 08:56

Copilot started work on behalf of stweil March 18, 2026 08:56 View session

Copilot finished work on behalf of stweil March 18, 2026 08:59
