
doc: Add comprehensive PARAMETERS section to the tesseract man page#4526

Draft
Copilot wants to merge 4 commits into main from copilot/better-documentation-print-parameters

Conversation

Contributor

Copilot AI commented Mar 16, 2026

The --print-parameters output provides only terse one-liner descriptions with no indication of which parameters apply to the LSTM engine vs. the legacy engine.

Adds a new PARAMETERS section to doc/tesseract.1.asc covering the most commonly used parameters, each annotated with type, default value, and engine compatibility ([Both], [LSTM], or [Legacy]), organized into subsections:

  • Output Format: tessedit_create_*, textonly_pdf, hocr_*, page_xml_*, preserve_interword_spaces
  • Character Set: tessedit_char_whitelist/blacklist/unblacklist
  • Image Processing: thresholding_method (Otsu / LeptonicaOtsu / Sauvola and their tuning knobs), invert_threshold, user_defined_dpi
  • Dictionary: load_*_dawg, user_words/patterns_file/suffix
  • LSTM Engine: lstm_use_matrix, lstm_choice_mode/iterations, lstm_rating_coefficient
  • Legacy Engine: language_model_penalty_*, tessedit_enable_*_correction, wordrec_enable_assoc
  • Debug: debug_file, tessedit_write_params_to_file

Also clarifies that tessedit_do_invert is deprecated and that invert_threshold=0 is the correct way to disable automatic image inversion, and that --dpi is equivalent to setting user_defined_dpi directly.
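A minimal sketch of the two equivalent invocations described above. It only assembles argument lists and does not run Tesseract. The helper name `build_tesseract_command` and the file names `scan.png` and `out` are hypothetical; `--dpi` and `-c name=value` are the existing command-line options.

```python
import shlex


def build_tesseract_command(image, output_base, dpi=None, params=None, configs=()):
    """Assemble an argument list for the tesseract CLI without running it."""
    cmd = ["tesseract", image, output_base]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    # Each parameter is passed as its own "-c name=value" pair.
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    # Config file names (for example "hocr") go last.
    cmd += list(configs)
    return cmd


# Disable automatic inversion and declare the resolution with --dpi.
with_flag = build_tesseract_command(
    "scan.png", "out", dpi=300, params={"invert_threshold": 0}
)

# Same intent, setting user_defined_dpi directly instead of using --dpi.
with_param = build_tesseract_command(
    "scan.png", "out", params={"user_defined_dpi": 300, "invert_threshold": 0}
)

print(shlex.join(with_flag))
print(shlex.join(with_param))
```

Either list can be handed to `subprocess.run` on a machine where Tesseract is installed.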

Original prompt

This section details the original issue you should resolve

<issue_title>Better documentation for --print-parameters.</issue_title>
<issue_description>### Your Feature Request

The documentation for each parameter is super short and hard for an end user to understand. It would be good to understand which parameters apply to the new LSTM and which are the legacy OCR.

Changing --print-parameters might be a breaking change but adding a manual file would be harmless and useful.</issue_description>

Comments on the Issue (you are @copilot in this section)

@stweil Do you want to use a chatbot to do this task? It might be a real challenge to create correct documentation for each parameter.


Document the most commonly used Tesseract parameters with:
- Detailed descriptions explaining what each parameter does
- Engine compatibility labels ([Both], [LSTM], [Legacy])
- Organization into functional sections: Output Format, Character
  Set, Image Processing, Dictionary, LSTM Engine, Legacy Engine, Debug

All default values have been verified against the actual
--print-parameters output.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
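The commit message above says the defaults were checked against `--print-parameters`. One way such a check could be scripted is sketched here. It assumes each line of that output has the form name, tab, value, tab, description, which is not shown in this conversation, and it assumes the entry header format visible in the review hunks. The function names and the sample strings are illustrative only.

```python
import re

# Header format used by the entries quoted in this PR, e.g.
# *lstm_use_matrix* (bool, default: 1) [LSTM]::
HEADER = re.compile(
    r"^\*(?P<name>\w+)\*\s+\((?P<type>\w+),\s*default:\s*(?P<default>.*?)\)"
    r"\s+\[(?P<engine>Both|LSTM|Legacy)\]::$"
)


def parse_documented(asciidoc_text):
    """Map parameter name -> (type, default, engine) from man-page headers."""
    docs = {}
    for line in asciidoc_text.splitlines():
        m = HEADER.match(line.strip())
        if m:
            docs[m["name"]] = (m["type"], m["default"].strip('"'), m["engine"])
    return docs


def parse_printed(print_parameters_text):
    """Map parameter name -> value, assuming tab-separated output lines."""
    values = {}
    for line in print_parameters_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:  # lines without a tab (e.g. a banner) are skipped
            values[parts[0]] = parts[1]
    return values


def mismatched_defaults(documented, printed):
    """Return {name: (documented_default, printed_value)} where they differ."""
    mismatches = {}
    for name, (_type, doc_default, _engine) in documented.items():
        if name in printed and printed[name] != doc_default:
            mismatches[name] = (doc_default, printed[name])
    return mismatches


# Headers copied from the review hunks; the printed lines are synthetic.
MAN_PAGE = """\
*tessedit_create_hocr* (bool, default: 0) [Both]::
*user_patterns_file* (string, default: "") [Both]::
*lstm_use_matrix* (bool, default: 1) [LSTM]::
"""
PRINTED = (
    "tessedit_create_hocr\t0\tdescription text\n"
    "user_patterns_file\t\tdescription text\n"
    "lstm_use_matrix\t1\tdescription text\n"
)

print(mismatched_defaults(parse_documented(MAN_PAGE), parse_printed(PRINTED)))
```

An empty result means every documented default agrees with the printed value for the parameters both sides know about.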
Copilot AI changed the title from "[WIP] Add better documentation for --print-parameters" to "doc: Add comprehensive PARAMETERS section to the tesseract man page" Mar 16, 2026
Copilot AI requested a review from stweil March 16, 2026 17:28
*tessedit_create_hocr* (bool, default: 0) [Both]::
Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that
encodes the OCR results together with their bounding boxes and confidences.
Use the *hocr* config file to enable this format.

Having a link here to the specification may be useful
https://kba.github.io/hocr-spec/1.2/

Use the *hocr* config file to enable this format.

*hocr_font_info* (bool, default: 0) [Both]::
Include font information in hOCR output.

What information, and how reliable is it? I'd love to know that before I enable this.

Include font information in hOCR output.

*hocr_char_boxes* (bool, default: 0) [Both]::
Add per-character bounding-box coordinates to hOCR output.

My knowledge here might be out of date, but while using tesseract in the past I found that the char boxes only really worked on the legacy OCR.
So I created a quick Python script to extract the letter 'a' using LSTM and got this...

[Image] and [Image]

So I had 26 failures out of 131 or a 20% failure rate using the LSTM. Worth noting that this is still under development maybe?
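The script mentioned above is not shown. A rough sketch of that kind of check follows. It assumes that with `hocr_char_boxes` enabled each character is written as a `span` of class `ocrx_cinfo` whose `title` holds `x_bboxes` and `x_conf`; that markup and the sample fragment are assumptions for illustration, not output quoted from this thread.

```python
import re

# Assumed per-character markup in hOCR output when hocr_char_boxes=1.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' "
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>"
)


def char_boxes(hocr, wanted=None):
    """Return (char, (x0, y0, x1, y1), confidence) for each matching glyph."""
    boxes = []
    for x0, y0, x1, y1, conf, ch in CINFO.findall(hocr):
        if wanted is None or ch == wanted:
            boxes.append((ch, (int(x0), int(y0), int(x1), int(y1)), float(conf)))
    return boxes


# Synthetic fragment for the word "cat".
SAMPLE = (
    "<span class='ocrx_cinfo' title='x_bboxes 10 20 18 34; x_conf 97.5'>c</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 19 20 27 34; x_conf 96.1'>a</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 28 18 33 34; x_conf 98.0'>t</span>"
)

a_boxes = char_boxes(SAMPLE, wanted="a")
print(a_boxes)

# The figures quoted in the comment: 26 bad crops out of 131.
failure_rate = 26 / 131
print(f"{failure_rate:.1%}")  # 19.8%
```

Each returned box can then be cropped from the page image and inspected by eye, which is how a count like 26 out of 131 would be obtained.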


*tessedit_create_page_xml* (bool, default: 0) [Both]::
Write PAGE XML output to a `.page.xml` file. PAGE is a standard XML format
for ground truth and OCR results used in document image analysis competitions.

PAGE XML: the emphasis on competitions is strange. It is a general-purpose format used in:

  • Digital humanities projects
  • Libraries and archives
  • Annotation tools (e.g. Transkribus, eScriptorium)

https://github.com/PRImA-Research-Lab/PAGE-XML


*tessedit_create_boxfile* (bool, default: 0) [Both]::
Write a Tesseract box file (`.box`) that lists each recognized character with
its bounding box. Used mainly for legacy engine training.

If it is used for legacy engine training should it be both? I'm not sure, but worth a thought.

of additional valid words. For example, setting this to `user-words` causes
Tesseract to look for `eng.user-words` when using the English model.

*user_patterns_file* (string, default: "") [Both]::

Should you wish to beef this up a tiny bit...

Defined in dict/trie.h, but in simplified terms:

A becomes uppercase letter
a becomes lowercase letter
0 becomes digit

Other symbols match themselves.
This is a structure template and not a regex.


*lstm_use_matrix* (bool, default: 1) [LSTM]::
Use the ratings matrix and beam search during LSTM decoding. Disabling this
reverts to a simpler greedy decoding strategy.

Might be worth mentioning that it's a little faster when turned off, but requires very clean text? I can't imagine a use case for disabling this one myself.

Enables alternative character hypotheses in hOCR output:
0 = disabled (default);
1 = include per-timestep alternative choices;
2 = extract alternative choices from the CTC output mapped per character.

Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.


*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
Apply bigram-based correction to improve recognition of adjacent words that
form common pairs.

E.g. 'is a' or 'lived at' (for those not familiar with bigram correction).

[[PARAMETERS]]
PARAMETERS
----------


Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀

@stweil
Member

stweil commented Mar 18, 2026

@copilot, please address the recent feedback.

Blacklisted characters will never appear in the output. The blacklist is
applied after the whitelist.

*tessedit_char_unblacklist* (string, default: "") [Both]::
Member


@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.

- Add hOCR spec link to tessedit_create_hocr
- Expand hocr_font_info to describe x_font/x_fsize metadata and LSTM caveat
- Add LSTM accuracy note and cross-reference to hocr_char_boxes description
- Fix tessedit_create_page_xml: broaden scope (digital humanities, libraries,
  annotation tools like Transkribus/eScriptorium) and link to PAGE-XML spec
- Clarify tessedit_create_boxfile works with both engines
- Clarify tessedit_char_unblacklist applies to both engines
- Expand user_patterns_file with accurate backslash character-class syntax
  (\d=digit, \c=letter, \a=lower, \A=upper, \n=alphanum, \p=punc) and example
- Note lstm_use_matrix disabling is faster but lower accuracy
- Cross-reference hocr_char_boxes from lstm_choice_mode
- Add word pair examples to tessedit_enable_bigram_correction description
- Add textord_heavy_nr parameter for aggressive noise removal on dirty scans

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 18, 2026

@copilot, please address the recent feedback.

Addressed all 11 review comments from @idrise in commit f270828:

  • Added hOCR spec link (https://kba.github.io/hocr-spec/1.2/) to tessedit_create_hocr
  • Expanded hocr_font_info to detail x_font/x_fsize metadata and note LSTM reliability caveat
  • Added note that hocr_char_boxes character positions are approximated under LSTM; cross-referenced lstm_choice_mode
  • Fixed tessedit_create_page_xml to reflect broad use in digital humanities, libraries, and annotation tools (Transkribus, eScriptorium), with link to PAGE-XML spec
  • Clarified tessedit_create_boxfile can be produced by either engine but is primarily for legacy training ground truth
  • Clarified tessedit_char_unblacklist applies to both engines
  • Corrected user_patterns_file to document the actual backslash character-class syntax (\d, \c, \a, \A, \n, \p) with an example from the existing man-page config-file section
  • Noted lstm_use_matrix is faster when disabled but lower accuracy; works best on very clean images
  • Cross-referenced hocr_char_boxes from lstm_choice_mode
  • Added word-pair examples ("is a", "in the", "New York") to tessedit_enable_bigram_correction
  • Added textord_heavy_nr for aggressive noise removal on heavily degraded scans
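To illustrate the backslash character classes listed for user_patterns_file, here is a simplified sketch that translates such a pattern into a Python regular expression. It covers only the six classes named above, treats them as ASCII ranges, and ignores anything else the real syntax may support. Tesseract itself decides class membership from the model's character set, so this is an approximation, and the pattern `\A\A-\d\d\d\d` is a made-up example.

```python
import re
import string

# The six classes named in this PR, approximated with ASCII ranges.
CLASSES = {
    "d": "[0-9]",          # digit
    "c": "[A-Za-z]",       # letter
    "a": "[a-z]",          # lowercase letter
    "A": "[A-Z]",          # uppercase letter
    "n": "[A-Za-z0-9]",    # alphanumeric
    "p": "[" + re.escape(string.punctuation) + "]",  # punctuation
}


def pattern_to_regex(pattern):
    """Translate a user pattern into a compiled regex (simplified)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] in CLASSES:
            out.append(CLASSES[pattern[i + 1]])
            i += 2
        else:
            # Any other symbol matches itself.
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


# Hypothetical pattern: two uppercase letters, a hyphen, four digits.
invoice_id = pattern_to_regex(r"\A\A-\d\d\d\d")
print(bool(invoice_id.fullmatch("AB-1234")))
print(bool(invoice_id.fullmatch("ab-1234")))
```

The point of the sketch is that a pattern describes the shape of a token position by position; it is a template, not a regular expression in its own right.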

Replace "blacklist"/"whitelist" in descriptive text with "exclusion list",
"allowlist", and "Characters in this list" formulations as requested by
@stweil. Parameter names (tessedit_char_whitelist, tessedit_char_blacklist,
tessedit_char_unblacklist) are unchanged.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
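The reviewed text states that the exclusion list is applied after the allowlist. A small model of how the three character-list parameters could combine is sketched below. The step where tessedit_char_unblacklist re-enables characters is an assumption about the intended behaviour, and `allowed_characters` is a hypothetical helper, not part of Tesseract.

```python
def allowed_characters(charset, whitelist="", blacklist="", unblacklist=""):
    """Model which characters may appear in the output.

    charset     -- every character the model can produce
    whitelist   -- tessedit_char_whitelist (empty means "no restriction")
    blacklist   -- tessedit_char_blacklist, applied after the whitelist
    unblacklist -- tessedit_char_unblacklist, assumed to re-enable characters
    """
    known = set(charset)
    allowed = set(whitelist) & known if whitelist else set(known)
    allowed -= set(blacklist)
    allowed |= set(unblacklist) & known
    return allowed


charset = "0123456789abc"

# Digits only, but never "0".
print(sorted(allowed_characters(charset, whitelist="0123456789", blacklist="0")))
```

Because the exclusion step runs second, a character present in both lists ends up excluded in this model.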
Contributor Author

Copilot AI commented Mar 18, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.


Development

Successfully merging this pull request may close these issues.

Better documentation for --print-parameters.

3 participants