Tesseract Training
Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP)
Dictionaries:
- Raw: This folder contains raw word-list files gathered from a variety of sources.
  - The following files were collected from various sources by Ted Underwood and passed on to us. Most are not simple word lists; they also contain word counts, alternate spellings, etc.
    - addToDictionary.txt
    - gazetteer.txt
    - maindict.txt
    - romannumerals.txt
    - SyncopeRules.txt
    - VariantSpellings.txt
  - The following files were generated by eMOP using an R program to count word frequency in the Text Creation Partnership (TCP) corpus of hand-transcribed texts.
    - ecco-TCP-word-freq.txt
    - eebo-TCP-word-freq.txt
  - The following file was gathered from the VARD tool for early modern variant spellings.
    - variants.txt
  - The following file was sent to us by Martin Muller, who extracted this list of alternate spellings using MorphAdorner. It is broken up by time period.
    - emopspellings.txt
- FullDict-184k.txt: A combination of all cleaned files received from Ted Underwood (addToDictionary, gazetteer, maindict, romannumerals, SyncopeRules, and VariantSpellings).
- FullDict-wVariants-CSU.txt: A combination of FullDict-184k.txt and the cleaned variants.txt culled from VARD (variants-CSU.txt). Cleaned, sorted, and unique.
- TCP-words-btwn100-499.txt, TCP-words-gt100.txt, TCP-words-gt500.txt: Combinations of the word lists we compiled from the TCP for both EEBO and ECCO documents, split into various word-frequency ranges.
- TCP-words-GT-100-wVariants-CSU.txt: A combination of EEBO and ECCO TCP words, with variants-CSU.txt. Cleaned, sorted and unique.
After extensive testing of these files, and of various combinations of them, with Tesseract, we settled on:
- FullDict-wVariants-CSU.txt as our main word list (word-dawg), and
- TCP-words-gt500.txt as our frequently used word list (freq-dawg).
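As a rough illustration of how combined lists like the ones above could be assembled, the following Python sketch merges word lists into a cleaned, sorted, unique list and selects words by frequency range. It is not the eMOP R program or the project's actual cleaning procedure: the function names, the toy counts, and the treatment of the range boundaries (inclusive bounds, with "gt500" read as 501 and up) are all assumptions made for the example.

```python
from collections import Counter


def clean_sort_unique(*word_lists):
    """Merge word lists, drop blank entries and duplicates, return a sorted list."""
    merged = set()
    for words in word_lists:
        for word in words:
            word = word.strip()
            if word:
                merged.add(word)
    return sorted(merged)


def words_in_range(freqs, low, high=None):
    """Return sorted words whose count is >= low and, if high is given, <= high."""
    return sorted(
        word
        for word, count in freqs.items()
        if count >= low and (high is None or count <= high)
    )


# Hypothetical counts standing in for the EEBO and ECCO frequency files.
eebo = Counter({"the": 900, "vnto": 250, "haue": 120})
ecco = Counter({"the": 700, "have": 480, "vnto": 20})
combined = eebo + ecco  # Counter addition sums the per-word counts

frequent = words_in_range(combined, 501)       # analogous to a "gt500" list
mid_band = words_in_range(combined, 100, 499)  # analogous to a "btwn100-499" list
main_list = clean_sort_unique(["haue ", "the", ""], ["the", "vnto"])
```

The resulting plain-text lists would still need to be converted into Tesseract's DAWG format before they could be used as word-dawg or freq-dawg files.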
FontTraining:
- forEMOP: This folder contains .traineddata, .box, and .tif files used to create the font training for Tesseract on the eMOP project. There are also sample images of each typeface used to create the training as binarized .tif images.
- notUsed: This folder contains .traineddata, .box, and .tif files used to create the font training for Tesseract, which, after extensive testing, we did not end up using on the eMOP project for various reasons. There are also sample images of each typeface used to create the training as binarized .tif images.
- combos: This folder contains .traineddata files of every Blackletter typeface we created training for (BL5-All-D2.traineddata) and every Roman & Italic typeface we created training for (RI5-All-D2.traineddata). Both files were created using training from the forEMOP and notUsed folders.
The forEMOP and notUsed folders both contain multiple folders covering typefaces created by/for specific individuals. The naming convention for the folders is
<name>-<font type>-<year(s) of creation>
where
- <font type> can be any combination of
  - R: Roman
  - I: Italic
  - B: Blackletter
- <year(s) of creation> is one or more years, separated by a '-'.
Each of these folders contains
- A '.traineddata' file, which can be placed in any local tessdata folder and used to OCR page images using that typeface or a similar one.
- A '.tar.gz' file, which contains the box/tif file pairs we created with Franken+ (https://github.com/Early-Modern-OCR/FrankenPlus) for each of these typefaces. For typefaces with multiple styles (roman, italic, blackletter, alternate roman, etc.), the differences can be seen in the names of the box/tif file pairs.
- A 'sample/' folder of binarized tifs of some of the page images used to create the training for that typeface.
Inside the forEMOP folder is a 'combos/' folder containing SC8b-R7-D2b.traineddata, which is the training file we used to OCR the eMOP corpus. It contains:
- Typeface training for all of the typefaces listed in the 'forEMOP' folder.
- Dictionary files created using https://github.com/Early-Modern-OCR/TesseractTraining/blob/master/Dictio... to generate the word-dawg file and https://github.com/Early-Modern-OCR/TesseractTraining/blob/master/Dictio... to generate freq-dawg.
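The <name>-<font type>-<year(s) of creation> folder naming convention used in forEMOP and notUsed can be read programmatically. The Python sketch below is one possible parser, not a tool shipped with this repo. It assumes that the name part contains no hyphen and that each year is written with four digits; the folder name "Example-RI-1650-1652" is hypothetical.

```python
import re

# Font type codes as described in the naming convention.
FONT_TYPES = {"R": "Roman", "I": "Italic", "B": "Blackletter"}

# Assumptions: <name> has no '-', and every year is exactly four digits.
FOLDER_PATTERN = re.compile(
    r"^(?P<name>[^-]+)-(?P<types>[RIB]+)-(?P<years>\d{4}(?:-\d{4})*)$"
)


def parse_folder_name(folder):
    """Split a training folder name into name, font types, and years."""
    match = FOLDER_PATTERN.match(folder)
    if match is None:
        raise ValueError(f"folder name does not follow the convention: {folder!r}")
    return {
        "name": match.group("name"),
        "font_types": [FONT_TYPES[code] for code in match.group("types")],
        "years": [int(year) for year in match.group("years").split("-")],
    }


# Hypothetical folder name, used only to exercise the parser.
info = parse_folder_name("Example-RI-1650-1652")
```

Folder names that do not match the assumed pattern raise a ValueError rather than being guessed at.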
MiscFiles:
This folder contains two other files that we used to create the Tesseract typeface training in this repo:
- F+TrainingText-4.txt: This is a text file used with Franken+ to create tif/box file pairs for a typeface. Once Franken+ has been used to identify the glyphs to be used to create training for Tesseract, this text file is used to create a set of "Franken-tifs", which are this text "printed" with the glyphs identified in Franken+. The corresponding box files are then created from those tifs and are used to create the training (.tr files) for Tesseract. This text file contains multiple uses of every character glyph that we identified in our work, including special characters and ligatures. Most of these are identified and given Unicode values in the MUFI set.
- emop.unicharambigs: This is a file used when creating the final .traineddata file of Tesseract training for a typeface or set of typefaces. It is used by eMOP to convert all special characters and ligatures into standard, modern equivalent characters (i.e. those found on a keyboard).
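To make the idea of converting special characters and ligatures into modern equivalents concrete, here is a small Python sketch of the same kind of substitution. It does not read emop.unicharambigs or reproduce Tesseract's unicharambigs mechanism, and the mapping entries are illustrative assumptions rather than rules taken from that file.

```python
# Illustrative mapping only; these entries are assumptions, not the
# contents of emop.unicharambigs.
MODERN_EQUIVALENTS = {
    "\u017f": "s",   # long s
    "\ufb00": "ff",  # ff ligature
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
}

# str.maketrans accepts a dict of single characters to replacement strings.
TRANSLATION_TABLE = str.maketrans(MODERN_EQUIVALENTS)


def modernize(text):
    """Replace mapped special characters and ligatures with keyboard characters."""
    return text.translate(TRANSLATION_TABLE)


# A long s followed by "ta" and an ff ligature.
sample = modernize("\u017fta\ufb00")
```

Characters that are not in the mapping pass through unchanged, so ordinary text is left as it is.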