doc: Add comprehensive PARAMETERS section to the tesseract man page #4526
Document the most commonly used Tesseract parameters with:
- Detailed descriptions explaining what each parameter does
- Engine compatibility labels ([Both], [LSTM], [Legacy])
- Organization into functional sections: Output Format, Character Set, Image Processing, Dictionary, LSTM Engine, Legacy Engine, Debug

All default values have been verified against the actual --print-parameters output.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
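One way to repeat the default-value check mentioned in the commit message is to parse the `--print-parameters` output and compare it with the documented defaults. The sketch below assumes each parameter line is tab-separated as name, value, description; that layout, the helper name `parse_print_parameters`, and the sample text are assumptions for illustration, not captured output.

```python
def parse_print_parameters(text):
    """Map parameter name -> (default value as string, description)."""
    params = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip headers and anything that is not name<TAB>value<TAB>description.
        if len(fields) < 3:
            continue
        name, value = fields[0].strip(), fields[1].strip()
        description = "\t".join(fields[2:]).strip()
        params[name] = (value, description)
    return params


# Illustrative sample only (assumed layout); the descriptions are made up.
sample = (
    "Tesseract parameters:\n"
    "tessedit_create_hocr\t0\tsample description\n"
    "lstm_use_matrix\t1\tsample description\n"
)
defaults = parse_print_parameters(sample)
```

In practice the text would come from running `tesseract --print-parameters` and capturing its standard output; the defaults documented in the man page (for example `tessedit_create_hocr` = 0 and `lstm_use_matrix` = 1) can then be compared against `defaults`.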
*tessedit_create_hocr* (bool, default: 0) [Both]::
Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that
encodes the OCR results together with their bounding boxes and confidences.
Use the *hocr* config file to enable this format.
Having a link here to the specification may be useful
https://kba.github.io/hocr-spec/1.2/
doc/tesseract.1.asc (outdated)
Use the *hocr* config file to enable this format.

*hocr_font_info* (bool, default: 0) [Both]::
Include font information in hOCR output.
What information, and how reliable is it? I'd love to know that before I enable this.
doc/tesseract.1.asc (outdated)
Include font information in hOCR output.

*hocr_char_boxes* (bool, default: 0) [Both]::
Add per-character bounding-box coordinates to hOCR output.
My knowledge here might be out of date, but while using tesseract in the past I found that the char boxes only really worked on the legacy OCR.
So I created a quick Python script to extract the letter 'a' using LSTM and got this... [two images, not preserved]
So I had 26 failures out of 131, roughly a 20% failure rate, using LSTM. Maybe worth noting that this is still under development?
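The script itself is not shown in the comment. A minimal sketch of that kind of check follows, assuming that `hocr_char_boxes=1` emits one `ocrx_cinfo` span per character with an `x_bboxes` / `x_conf` title; that markup, the helper names, and the "degenerate box" test are assumptions, since the comment does not say how a failure was judged.

```python
import re

# Assumed markup:
# <span class='ocrx_cinfo' title='x_bboxes x0 y0 x1 y1; x_conf c'>ch</span>
CINFO = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] "
    r"title=['\"]x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)['\"]>"
    r"(.*?)</span>"
)


def char_boxes(hocr, wanted=None):
    """Return (char, (x0, y0, x1, y1), confidence) for each character span."""
    boxes = []
    for x0, y0, x1, y1, conf, ch in CINFO.findall(hocr):
        if wanted is None or ch == wanted:
            boxes.append((ch, (int(x0), int(y0), int(x1), int(y1)), float(conf)))
    return boxes


def is_degenerate(box):
    """Hypothetical failure test: the box has no width or no height."""
    x0, y0, x1, y1 = box
    return x1 <= x0 or y1 <= y0


# Hand-written sample, not real Tesseract output.
sample = (
    "<span class='ocrx_cinfo' title='x_bboxes 10 20 18 32; x_conf 98.5'>a</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 18 20 26 32; x_conf 97.0'>b</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 30 20 30 32; x_conf 60.2'>a</span>"
)
letter_a = char_boxes(sample, wanted="a")
failures = [entry for entry in letter_a if is_degenerate(entry[1])]
```

Cropping each box from the page image and inspecting it by eye, as the comment describes, would catch misplaced boxes that a purely geometric test like this one misses.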
doc/tesseract.1.asc (outdated)
*tessedit_create_page_xml* (bool, default: 0) [Both]::
Write PAGE XML output to a `.page.xml` file. PAGE is a standard XML format
for ground truth and OCR results used in document image analysis competitions.
PAGE XML: strange emphasis on competitions. It is a general-purpose format used in:
- Digital humanities projects
- Libraries and archives
- Annotation tools (e.g. Transkribus, eScriptorium)
doc/tesseract.1.asc (outdated)
*tessedit_create_boxfile* (bool, default: 0) [Both]::
Write a Tesseract box file (`.box`) that lists each recognized character with
its bounding box. Used mainly for legacy engine training.
If it is used for legacy engine training should it be both? I'm not sure, but worth a thought.
of additional valid words. For example, setting this to `user-words` causes
Tesseract to look for `eng.user-words` when using the English model.

*user_patterns_file* (string, default: "") [Both]::
Should you wish to beef this up a tiny bit...
Defined in dict/trie.h, but in simplified terms:
A becomes uppercase letter
a becomes lowercase letter
0 becomes digit
Other symbols match themselves.
This is a structure template and not a regex.
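To make the template idea concrete, here is a small sketch that expands a pattern into a Python regular expression so you can see which strings it would accept. It uses the backslash classes listed in the follow-up commit on this PR (`\d` digit, `\c` letter, `\a` lowercase, `\A` uppercase, `\n` alphanumeric, `\p` punctuation) rather than the simplified letters above. The ASCII-only character ranges and the helper name `pattern_to_regex` are assumptions for illustration; Tesseract does not process patterns as regexes.

```python
import re
import string

# ASCII approximations of the character classes (an assumption of this sketch).
CLASSES = {
    "d": "[0-9]",
    "c": "[A-Za-z]",
    "a": "[a-z]",
    "A": "[A-Z]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
}


def pattern_to_regex(pattern):
    """Expand backslash classes; every other symbol matches itself."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] in CLASSES:
            parts.append(CLASSES[pattern[i + 1]])
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


# One uppercase letter, two digits, a literal hyphen, one lowercase letter.
rx = pattern_to_regex(r"\A\d\d-\a")
```

With this pattern, `rx.fullmatch("B42-x")` succeeds, while `"b42-x"` (lowercase first letter) and `"B4-x"` (one digit short) do not match.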
doc/tesseract.1.asc (outdated)
*lstm_use_matrix* (bool, default: 1) [LSTM]::
Use the ratings matrix and beam search during LSTM decoding. Disabling this
reverts to a simpler greedy decoding strategy.
Might be worth mentioning it's a little faster turned off but requires very clean text? Can't imagine a use case for me disabling this one.
Enables alternative character hypotheses in hOCR output:
0 = disabled (default);
1 = include per-timestep alternative choices;
2 = extract alternative choices from the CTC output mapped per character.
Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.
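Since the two options tend to be set together, a short sketch of building the command line may help. It assumes the usual `tesseract IMAGE OUTPUTBASE -c NAME=VALUE ... CONFIGFILE` argument order; the helper name `tesseract_command` and the file names are hypothetical.

```python
def tesseract_command(image, output_base, params=None, configs=()):
    """Build an argument list: tesseract IMAGE OUTPUTBASE -c k=v ... CONFIG..."""
    cmd = ["tesseract", image, output_base]
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd += list(configs)
    return cmd


# hOCR output with per-character boxes and CTC-based alternative choices.
cmd = tesseract_command(
    "page.png",  # hypothetical input image
    "page",      # hypothetical output base; hOCR would land in page.hocr
    params={"hocr_char_boxes": 1, "lstm_choice_mode": 2},
    configs=["hocr"],
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed.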
doc/tesseract.1.asc (outdated)
*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
Apply bigram-based correction to improve recognition of adjacent words that
form common pairs.
E.g. 'is a' or 'lived at' (for those not familiar with bigram correction).
[[PARAMETERS]]
PARAMETERS
----------
Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀
@copilot, please address the recent feedback.
Blacklisted characters will never appear in the output. The blacklist is
applied after the whitelist.

*tessedit_char_unblacklist* (string, default: "") [Both]::
@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.
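The ordering described in the hunk above (allowlist first, exclusion list second) can be sketched as plain set operations. The third step, where `tessedit_char_unblacklist` re-enables characters after the exclusion list has been applied, is an assumption of this sketch; the function and argument names are hypothetical.

```python
def effective_charset(all_chars, allow="", exclude="", reinclude=""):
    """Characters that may still appear in the output.

    allow     -> tessedit_char_whitelist (empty means "everything")
    exclude   -> tessedit_char_blacklist, applied after the allowlist
    reinclude -> tessedit_char_unblacklist (assumed to be applied last)
    """
    known = set(all_chars)
    enabled = (set(allow) & known) if allow else set(known)
    enabled -= set(exclude)
    enabled |= set(reinclude) & known
    return enabled


model_chars = "0123456789abc"  # stand-in for the model's character set
digits_only = effective_charset(model_chars, allow="0123456789", exclude="01")
with_one = effective_charset(
    model_chars, allow="0123456789", exclude="01", reinclude="1"
)
```

Here `digits_only` contains the digits 2 through 9, and `with_one` additionally contains `1`.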
- Add hOCR spec link to tessedit_create_hocr
- Expand hocr_font_info to describe x_font/x_fsize metadata and LSTM caveat
- Add LSTM accuracy note and cross-reference to hocr_char_boxes description
- Fix tessedit_create_page_xml: broaden scope (digital humanities, libraries, annotation tools like Transkribus/eScriptorium) and link to PAGE-XML spec
- Clarify tessedit_create_boxfile works with both engines
- Clarify tessedit_char_unblacklist applies to both engines
- Expand user_patterns_file with accurate backslash character-class syntax (\d=digit, \c=letter, \a=lower, \A=upper, \n=alphanum, \p=punc) and example
- Note lstm_use_matrix disabling is faster but lower accuracy
- Cross-reference hocr_char_boxes from lstm_choice_mode
- Add word pair examples to tessedit_enable_bigram_correction description
- Add textord_heavy_nr parameter for aggressive noise removal on dirty scans

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
Addressed all 11 review comments from @idrise in commit f270828:
Replace "blacklist"/"whitelist" in descriptive text with "exclusion list", "allowlist", and "Characters in this list" formulations as requested by @stweil. Parameter names (tessedit_char_whitelist, tessedit_char_blacklist, tessedit_char_unblacklist) are unchanged.

Co-authored-by: stweil <6734573+stweil@users.noreply.github.com>
Just as a heads up, I was blocked by some firewall rules while working on your feedback.
The `--print-parameters` output provides only terse one-liner descriptions with no indication of which parameters apply to the LSTM engine vs. the legacy engine.

Adds a new `PARAMETERS` section to `doc/tesseract.1.asc` covering the most commonly used parameters, each annotated with type, default value, and engine compatibility (`[Both]`, `[LSTM]`, or `[Legacy]`), organized into subsections:

- Output Format: `tessedit_create_*`, `textonly_pdf`, `hocr_*`, `page_xml_*`, `preserve_interword_spaces`
- Character Set: `tessedit_char_whitelist/blacklist/unblacklist`
- Image Processing: `thresholding_method` (Otsu / LeptonicaOtsu / Sauvola and their tuning knobs), `invert_threshold`, `user_defined_dpi`
- Dictionary: `load_*_dawg`, `user_words/patterns_file/suffix`
- LSTM Engine: `lstm_use_matrix`, `lstm_choice_mode/iterations`, `lstm_rating_coefficient`
- Legacy Engine: `language_model_penalty_*`, `tessedit_enable_*_correction`, `wordrec_enable_assoc`
- Debug: `debug_file`, `tessedit_write_params_to_file`

Also clarifies that `tessedit_do_invert` is deprecated and that `invert_threshold=0` is the correct way to disable automatic image inversion, and that `--dpi` is equivalent to setting `user_defined_dpi` directly.