Training with Tesseract
For the eMOP project we are attempting to train Tesseract to OCR early-modern (15th-18th century) documents. To do that, our aim is to train Tesseract to recognize specific fonts or font families taken directly from early-modern documents. First we acquire high-quality images of documents printed with representative fonts. Then we use Aletheia to identify the character glyphs in these documents. That output is then used with Franken+ to create the kinds of training files that Tesseract expects, so that we can train Tesseract on these fonts. The results are packaged into a "language" and used to OCR documents that we identify as using that specific font.
box/tiff File Pairs:
The typical Tesseract training procedure is to use Tesseract to create box files for each tiff page image you have. Tesseract attempts to identify each glyph on the page and its corresponding Unicode value. This information is used to create a “box file” identifying the coordinates of each box, the image of the glyph contained in the box, and the Unicode value of the corresponding character for each boxed glyph.
The coordinate system used by Tesseract has (0,0) at the bottom-left.
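For reference, each line of a box file has the form below (per the Tesseract 3.x training documentation; the sample values here are invented):

```
<character> <left> <bottom> <right> <top> <page>
s 734 494 751 519 0
```

All coordinates are in pixels, measured from the bottom-left of the page image.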
You could then optionally use something like the jTessBoxEditor tool to verify the accuracy of the box files Tesseract created and correct any errors.
However, the eMOP project is NOT using this method to create box files. We have pursued two separate methods, both of which can be used to produce box files, which Tesseract then can use in subsequent training steps.
We have done all of our font training using Prima Research’s Aletheia tool. It does much the same thing as Tesseract’s box-file generator; however, we believe Aletheia is more accurate at identifying each glyph, and its output can be used to train multiple OCR engines. Furthermore, Aletheia allows us to identify and block off regions on a page other than individual characters, including text blocks, columns, paragraphs, and lines. We can also use Aletheia to mark regions on a page that should be excluded from OCRing, for example images.
The only issue is that Aletheia’s output uses a different coordinate system than Tesseract’s. (Aletheia appears to put its (0,0) point at the upper-left, while Tesseract’s is at the bottom-left.) So we have created an XSLT stylesheet to convert Aletheia’s output into Tesseract’s box-file format. The file is called xml_to_box.xsl and is available at the eMOP github page.
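The conversion itself is essentially a vertical flip of the y-axis. Here is a small illustrative sketch of the arithmetic in awk (not the actual XSLT; the input field layout and the 3000 px page height are invented for the example):

```shell
# Input lines: <char> <x> <y> <width> <height>, with (0,0) at the top-left.
# Output lines: Tesseract box format, with (0,0) at the bottom-left.
printf '%s\n' 'a 100 200 30 40' 'b 140 200 28 40' |
awk -v H=3000 '{
  left   = $2
  bottom = H - ($3 + $5)   # page height minus (top-left y + glyph height)
  right  = $2 + $4
  top    = H - $3
  print $1, left, bottom, right, top, 0   # trailing 0 is the page index
}'
```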
We soon realized, however, that using early-modern page images (even newly produced, high-quality photos) did not provide training data that met Tesseract’s requirements. What we needed was a way to produce training pages in the way that Tesseract required but with early-modern fonts (i.e. fonts that aren’t available as modern TTF sets).
In response to this need Bryan Tarpley created the tool that we are calling Franken+ (source code available via github soon). Starting with a font training set created using Aletheia, a user can select one or more representative, good quality character forms for each glyph identified. These character forms can then be applied to any text file to generate a new set of page images suitable for training Tesseract.
The output of Franken+ is as many box/tiff file pairs as you want for training Tesseract.
One issue that came up with this process involved ligatures. Both our training documents and the documents we want to OCR contain a variety of ligatures and the long-s character. However, the TCP transcriptions of EEBO and ECCO docs, which we were using as our text input to Franken+, contained no ligatures. This meant that the training we were creating for Tesseract lacked any ligatures, because Franken+ had no ligatures in the base text files with which to match corresponding ligature glyph images.
I created a shell script that turns certain character combinations in a text document into ligatures, and turns half of the document’s s glyphs into long-s’s.
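A simplified sketch of that idea (not the actual eMOP script): sed maps common ligature pairs to their Unicode ligature code points and converts word-internal s to long-s (U+017F). The real script converts only half of the s glyphs; this version converts all of them, and assumes a UTF-8 text file.

```shell
echo 'the staff office is first' |
sed -e 's/ffi/ﬃ/g' -e 's/ffl/ﬄ/g' \
    -e 's/ff/ﬀ/g' -e 's/fi/ﬁ/g' -e 's/fl/ﬂ/g' \
    -e 's/s\([a-z]\)/ſ\1/g'
# -> the ſtaﬀ oﬃce is ﬁrſt
```

Note that the three-character ligatures (ffi, ffl) must be replaced before the two-character ones, or "ffi" would be consumed by the "ff" rule first.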
Once the necessary files have been put into the appropriate attempt<#>/ folder, we’re ready to train Tesseract. The end result of issuing all the training commands is a file called <lang>.traineddata. I have created another shell script, tess-script.sh, a bash script that will run all the commands needed to create the .traineddata file.
The only parameters to this script are the names of one or more file prefixes corresponding to the box/tiff file pairs being used for training.
For example: if you want to train Tesseract on a page represented in your attempt<#>/ folder by emop.mfle.exp17.box and emop.mfle.exp17.tif, then run the script with “emop.mfle.exp17” as the only parameter.
If you want to create Tesseract training using multiple box/tiff pairs then use something like “emop.mfle.exp17 emop.mfle.exp18 emop.mfle.exp19 emop.mfle.exp20” as the parameters.
In these examples emop is the name of our “language” and so a completed training process will produce a file called emop.traineddata.
From within the attempt<#>/ folder, issue:
sh ./tess-script.sh <inputfile(s)>
box/tiff File Pairs:
These files must be named according to the convention specified in the Tesseract training document<link>:
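The convention, per the Tesseract 3.x training documentation, is:

```
<lang>.<fontname>.exp<num>.box
<lang>.<fontname>.exp<num>.tif
```

In the examples above, emop is the <lang> and mfle is the <fontname>, giving emop.mfle.exp17.box and emop.mfle.exp17.tif.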
.font_properties File:
This file must have a prefix that matches the <lang> value used in the box/tiff file pairs.
The .font_properties file contains one line of the form:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
for each <fontname> used to train Tesseract, i.e. for each box/tiff file pair. <italic>, <bold>, <fixed>, <serif> and <fraktur> are simple 0 or 1 flags indicating whether the font has the named property.
The <fontname> value used in this file must match the <fontname> used in the specified naming convention for the box/tiff file pairs shown above.
Be sure to give .font_properties the correct <lang> prefix of the language you’re creating before running the shell script. Just having “.font_properties” in your training folder won’t work.
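For example, if the mfle font from the examples above were an upright serif face (an assumption made here just for illustration), emop.font_properties would contain the single line:

```
mfle 0 0 0 1 0
```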
The script only has lines for producing .word-dawg and .freq-dawg files. In addition, the word-list text files used to produce these DAWG files must be named:
- frequent-words-list.txt --> emop.freq-dawg
- words-list.txt --> emop.word-dawg
If you want to produce other DAWG files or you want to use word-list text files with a different naming convention, then you can easily edit the shell script to do this.
Once the shell script has completed, the end result is a file named <lang>.traineddata. (In our example it’s emop.traineddata.) This file then has to be moved into the tessdata/ folder that was created when Tesseract was installed. On my Mac, with a Homebrew installation, that’s:
With the full SVN installation, it's wherever the tessdata/ folder identified by your $TESSDATA_PREFIX environment variable lives.
Actually, you can put your .traineddata file anywhere on your system with the following caveats:
- The .traineddata must reside in a folder named tessdata/.
- The $TESSDATA_PREFIX environment variable must be set to the parent directory of this tessdata/ folder. NOTE: don't point to the tessdata/ folder itself; point to its parent.
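A concrete sketch of that layout (the ~/emop-training/ directory is a hypothetical location, not part of the eMOP setup):

```shell
# Any directory works, as long as the .traineddata sits inside a folder
# literally named tessdata/.
mkdir -p ~/emop-training/tessdata
# (copy emop.traineddata into ~/emop-training/tessdata/ here)

# Point TESSDATA_PREFIX at the PARENT of tessdata/, not at tessdata/ itself.
export TESSDATA_PREFIX=~/emop-training/

# Tesseract can now find the "emop" language, e.g.:
#   tesseract page.tif output -l emop
```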
If that’s not correct for your installation of Tesseract then simply change the last (cp) command of tess-script.sh to use the correct folder path.