A synthetic line generator for OCR applications, based on Belval's TextRecognitionDataGenerator (https://github.com/Belval/TextRecognitionDataGenerator) and NVlabs' ocrodeg (https://github.com/NVlabs/ocrodeg).
Clone this repository and install the dependencies with pip install -r requirements.txt.
Tested on Windows 10 and Ubuntu 16.04. Requires Python 3.x.
In TextRecognitionDataGenerator, run python run.py -c 1000 -i <path>/text/all_strings_and_web.txt to get 1000 randomly generated images with a font of your choice. Fonts are stored in TextRecognitionDataGenerator/fonts, and historic fonts should be placed in the historic folder. If the historic folder contains more than one font, the generator switches between them randomly. Currently, the generator only uses characters and ligatures that were present in the original 1557 dataset (1557-true_character_occurence.ttf).
In FontForge, you will find the .sfd file (FontForge project file) and the .ttf files for a historic font generated from the 1557-Methodus-Clenardus dataset. Note that 1557-artifically_enhanced_all_chars contains characters that were not present in the original dataset but have been "composed" from others, e.g. W is composed of two V glyphs. As of now, only TrueType fonts are supported.
In text, you will find .txt files with all words from the dataset. All words from the 1557 dataset can be found in TextRecognitionDataGenerator/dicts/hist.txt. For best performance, use all_strings_and_web.txt as input: it contains randomly shuffled lines of 5 words each, made up of words from the 1557 dataset and enriched with Latin text from the internet.
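The structure of such an input file can be sketched as follows. This is only a minimal illustration, not the script that produced all_strings_and_web.txt: the function name make_lines and the sample words are hypothetical, and shuffling the words before grouping them is an assumption about how the lines were built.

```python
import random


def make_lines(words, words_per_line=5, seed=None):
    """Shuffle `words` and group them into lines of `words_per_line` words.

    Hypothetical helper for illustration; not part of this repository.
    """
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    # Drop the incomplete remainder so every line has the same word count.
    usable = len(shuffled) - len(shuffled) % words_per_line
    return [
        " ".join(shuffled[i:i + words_per_line])
        for i in range(0, usable, words_per_line)
    ]


if __name__ == "__main__":
    # Placeholder vocabulary; in practice the words would come from a word list
    # such as dicts/hist.txt.
    sample = ["amo", "amas", "amat", "rosa", "rosae", "rosam",
              "bellum", "belli", "bello", "dominus", "domini", "domino"]
    for line in make_lines(sample, seed=0):
        print(line)
```

Writing one such line per row to a .txt file gives an input that could then be passed to run.py with -i.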
The default parameters will make the LineGenerator output all the generated lines in a format with height and margin that fit Calamari (https://github.com/Calamari-OCR/calamari) and 5 words per line. The files will be written to /out.
The parameters most relevant for OCR applications are listed below. Many more are available; run python run.py -h to see them all.
- -b: specify the background. Defaults to white. Might be a feature for future extension (e.g. with old vocal pages).
- -c: specify the amount of images that are to be generated. Defaults to 1000.
- -f: specify the format (== height, if text is horizontal) of the generated lines. Defaults to 65 px.
- -e: specify the extension for the produced images. Defaults to .png.
- -i: specify the input file. If none is used, words from the hist-dict will be used.
- -m: specify the margins for the text with respect to the border. The format is (upper, left, lower, right). Defaults to a format that is well suited for the 1557 dataset.
- -w: specify the word count of the generated lines. Defaults to 5 words per line.
- -z: toggle for the creation of a zip file at the end, for easier handling and upload of the generated lines.
- -tc: specify the text color. Defaults to #000000 (black).
- -sw: specify the spacing between words. Defaults to 0.5.
- -sf: toggle for the show-font prompt to see the current font in matplotlib. Only supported for historic fonts.
- -ro: toggle for rename-output: when set, the output files will be given unique hex filenames instead of incremental filenames. Useful when data from several runs will be merged later.
- -rm: toggle for deleting old files in the /out folder before generating new ones. Use with care.
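The flags listed above can be combined in a single call. The following is a minimal sketch of a wrapper that assembles such a call from Python; the helper build_run_command is hypothetical and not part of this repository, and it only uses flags documented above (-c, -w, -i, -z, -ro). That -c and -w take plain integers is an assumption.

```python
import shlex


def build_run_command(count=1000, input_file=None, words=5,
                      zip_output=False, rename_output=False):
    """Return an argument list for run.py built from documented flags.

    Hypothetical helper for illustration; value formats are assumptions.
    """
    cmd = ["python", "run.py", "-c", str(count), "-w", str(words)]
    if input_file is not None:
        # Without -i, the generator falls back to words from the hist-dict.
        cmd += ["-i", input_file]
    if zip_output:
        cmd.append("-z")   # zip the generated lines at the end
    if rename_output:
        cmd.append("-ro")  # unique hex filenames instead of incremental ones
    return cmd


if __name__ == "__main__":
    cmd = build_run_command(count=200,
                            input_file="text/all_strings_and_web.txt",
                            zip_output=True)
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run, with the working directory set to TextRecognitionDataGenerator, to start the generator from a script.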
Some minor augmentation features:
- -rk: toggle for random skewing of the images, using an angle in the interval [-x,+x], where x is specified with -k
- -rbl: toggle for a random blurring with intensity in the interval [-x,+x], where x is specified with -bl
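To illustrate what drawing from the interval [-x,+x] means for the random skew, here is a minimal sketch. The function random_skew_angle is hypothetical, and the uniform distribution is an assumption; the generator itself may sample the angle differently (for example as an integer).

```python
import random


def random_skew_angle(max_angle, rng=random):
    """Draw a skew angle uniformly from [-max_angle, +max_angle].

    Hypothetical helper; `max_angle` plays the role of the value given with -k.
    """
    if max_angle < 0:
        raise ValueError("max_angle must be non-negative")
    return rng.uniform(-max_angle, max_angle)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducible output
    print([round(random_skew_angle(3, rng), 2) for _ in range(5)])
```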
The script augment_images.py will apply image augmentation to the given input. The following augmentations will be used:
Use -f to control the intensity of the augmentation in the range ]0, 10.0]. Note that higher factors make the rescaling process slow and of little use, as the image will be cropped at the border. This can be avoided with the -r toggle, which excludes the rotation from the augmentation.
All augmented images will be written to the folder specified by -o (if none is given, /augmentations will be used) and can be zipped by using -z. Use the -s toggle if you want the augmented images to be written to their respective separate folders. To apply more than one augmentation run per file, simply run augment_images.py again with the former output as input.
- make character spacing variable (not trivial)
- use background and textcolor from real data instead of plain white and black
- use handwritten model for more variance in generated data



