Tesseract Training

TesseractTraining

Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP)

Dictionaries:

  • Raw: This folder contains raw word-list files gathered from a variety of sources.
    • The following files were collected from various sources by Ted Underwood and passed on to us. Most are not simple word lists and contain word counts, alternate spellings, etc.
    • addToDictionary.txt
    • gazetteer.txt
    • maindict.txt
    • romannumerals.txt
    • SyncopeRules.txt
    • VariantSpellings.txt

    • The following files were generated by eMOP using an R program to count word frequency in the Text Creation Partnership (TCP) corpus of hand-transcribed texts.
    • ecco-TCP-word-freq.txt
    • eebo-TCP-word-freq.txt
    • The following file was gathered from the VARD tool for early modern variant spellings.
    • variants.txt
    • The following file was sent to us by Martin Muller, who extracted this file of alternate spellings using MorphAdorner. It is broken up by time period.
    • emopspellings.txt
  • Cleaned: This folder contains cleaned-up versions of the Raw files, with only the words left.
  • Combos: This folder contains alphabetically sorted word lists that are combinations of the cleaned lists with all duplicates removed. Files with a suffix of "-CSU" have been cleaned and sorted, and all entries are unique (no duplicates). The files that are underlined and bold are the ones we used in eMOP as the basis for our freq-dawg and word-dawg word lists when training Tesseract to run on the eMOP data set of 45 million page images.
    • FullDict-184k.txt: A combination of all cleaned files received from Ted Underwood (addToDictionary, gazetteer, maindict, romannumerals, SyncopeRules, and VariantSpellings).
    • FullDict-wVariants-CSU.txt: A combination of FullDict-184k.txt and the cleaned variants.txt culled from VARD (variants-CSU.txt). Cleaned, sorted, and unique.
    • TCP-words-btwn100-499.txt, TCP-words-gt100.txt, TCP-words-gt500.txt: Combinations of the word lists we compiled from the TCP for both EEBO and ECCO documents, split into various word-frequency ranges.
    • TCP-words-GT-100-wVariants-CSU.txt: A combination of EEBO and ECCO TCP words, with variants-CSU.txt. Cleaned, sorted and unique.
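As a rough illustration of what "cleaned, sorted, and unique" means for the -CSU files, the following Python sketch merges several already-cleaned word lists into one sorted list without duplicates. It is not the script eMOP used; the function names are hypothetical, and it assumes one word per line in UTF-8 and sorts by Unicode code point, which may differ from the ordering in the published files.

```python
from pathlib import Path


def build_csu(lines_by_source):
    """Merge several word lists into one sorted list with no duplicates."""
    words = set()
    for lines in lines_by_source:
        for line in lines:
            word = line.strip()
            if word:  # skip blank lines
                words.add(word)
    # Assumption: plain code-point ordering is acceptable for the output.
    return sorted(words)


def write_csu(input_paths, output_path):
    """Read cleaned word lists (one word per line) and write the merged list."""
    lists = [Path(p).read_text(encoding="utf-8").splitlines() for p in input_paths]
    merged = build_csu(lists)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
```

For example, write_csu(["FullDict-184k.txt", "variants-CSU.txt"], "combined.txt") would produce a combined list in the spirit of FullDict-wVariants-CSU.txt, with "combined.txt" being a placeholder output name.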

    After extensive testing in Tesseract of these files and various combinations of them, we settled on:

    • FullDict-wVariants-CSU.txt as our main word list (word-dawg), and
    • TCP-words-gt500.txt as our frequently used word list (freq-dawg).
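The frequency-range files (TCP-words-gt100, TCP-words-btwn100-499, TCP-words-gt500) could be derived from word-frequency counts along the lines of the Python sketch below. This is only an illustration under stated assumptions: it assumes each line of a frequency file is "word count" separated by whitespace, that EEBO and ECCO counts are summed before thresholding, and that range boundaries are inclusive. None of these details are confirmed by this README, and the function names are hypothetical.

```python
def parse_counts(lines):
    """Parse 'word count' lines into a dict (assumed file layout)."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip headers or malformed lines
        word, count = parts[0], int(parts[1])
        counts[word] = counts.get(word, 0) + count
    return counts


def merge_counts(*tables):
    """Sum counts for the same word across several tables (an assumption)."""
    merged = {}
    for table in tables:
        for word, count in table.items():
            merged[word] = merged.get(word, 0) + count
    return merged


def words_in_range(counts, low, high=None):
    """Return sorted words whose count is >= low and, if given, <= high."""
    return sorted(
        word
        for word, count in counts.items()
        if count >= low and (high is None or count <= high)
    )
```

With merged counts in hand, words_in_range(merged, 500) would give a "gt500"-style list and words_in_range(merged, 100, 499) a "btwn100-499"-style list.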



FontTraining:

    • forEMOP: This folder contains .traineddata, .box, and .tif files used to create the font training for Tesseract on the eMOP project. There are also sample images of each typeface used to create the training as binarized .tif images.
    • notUsed: This folder contains .traineddata, .box, and .tif files used to create the font training for Tesseract, which, after extensive testing, we did not end up using on the eMOP project for various reasons. There are also sample images of each typeface used to create the training as binarized .tif images.
    • combos: This folder contains .traineddata files of every Blackletter typeface we created training for (BL5-All-D2.traineddata) and every Roman & Italic typeface we created training for (RI5-All-D2.traineddata). Both files were created using training from the forEMOP and notUsed folders.
    • The forEMOP and notUsed folders both contain multiple folders covering typefaces created by or for specific individuals. The naming convention for these folders is

      <name>-<font type>-<year(s) of creation>

      where

      • <font type> can be any combination of
        • R: Roman
        • I: Italic
        • B: Blackletter
      • <year(s) of creation> is one or more years, separated by a '-'.
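The naming convention above can be parsed mechanically. The Python sketch below is one way to do it, under assumptions that go beyond this README: that <name> contains no hyphen and that each year is written with four digits. The folder name in the usage note is made up for illustration.

```python
import re
from typing import NamedTuple

# Letter codes taken from the convention described above.
FONT_TYPES = {"R": "Roman", "I": "Italic", "B": "Blackletter"}

# Assumptions: <name> has no '-', and each year is four digits.
FOLDER_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<types>[RIB]+)-(?P<years>\d{4}(?:-\d{4})*)$"
)


class TypefaceFolder(NamedTuple):
    name: str
    font_types: list
    years: list


def parse_folder_name(folder):
    """Split '<name>-<font type>-<year(s)>' into its three parts."""
    match = FOLDER_RE.match(folder)
    if match is None:
        raise ValueError(f"does not follow the naming convention: {folder!r}")
    return TypefaceFolder(
        name=match["name"],
        font_types=[FONT_TYPES[code] for code in match["types"]],
        years=[int(year) for year in match["years"].split("-")],
    )
```

For a hypothetical folder called "Example-RI-1650-1655", parse_folder_name returns the name "Example", the font types Roman and Italic, and the years 1650 and 1655.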

      Each of these folders contains

      • A '.traineddata' file, which can be placed in any local tessdata folder and used to OCR page images using that typeface or a similar one.
      • A '.tar.gz' file, which contains the box/tif file pairs we created with Franken+ (https://github.com/Early-Modern-OCR/FrankenPlus) for each of these typefaces. For typefaces with multiple styles (roman, italic, blackletter, alternate roman, etc.), the differences can be seen in the names of the box/tif file pairs.
      • A 'sample/' folder of binarized tifs of some of the page images used to create the training for that typeface.
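To show how one of these '.traineddata' files might be used, the Python sketch below builds and runs a Tesseract command line. Tesseract selects a training file by the part of its filename before '.traineddata', passed with -l. The --tessdata-dir option is assumed to be available in the installed Tesseract release; older releases rely on the TESSDATA_PREFIX environment variable instead. The helper names and the file paths in the usage note are placeholders.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, traineddata_path):
    """Build a tesseract command that uses the given .traineddata file."""
    traineddata = Path(traineddata_path)
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        # Assumption: this Tesseract build supports --tessdata-dir.
        "--tessdata-dir",
        str(traineddata.parent),
        # The language name is the filename without '.traineddata'.
        "-l",
        traineddata.stem,
    ]


def run_ocr(image_path, output_base, traineddata_path):
    """Run tesseract; raises CalledProcessError if the command fails."""
    command = build_tesseract_command(image_path, output_base, traineddata_path)
    subprocess.run(command, check=True)
```

Calling run_ocr("page.tif", "page", "tessdata/SC8b-R7-D2b.traineddata") would write the recognized text to page.txt, provided tesseract is installed and on the PATH.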

      Inside the forEMOP folder is a 'combos/' folder containing SC8b-R7-D2b.traineddata, which is the training file we used to OCR the eMOP corpus. It contains:



MiscFiles:

    This folder contains two other files that we used to create the Tesseract typeface training in this repo:

    • F+TrainingText-4.txt: This is a text file used with Franken+ to create tif/box file pairs for a typeface. Once Franken+ has been used to identify the glyphs to be used to create training for Tesseract, this text file is used to create a set of "Franken-tifs" that are this text "printed" with the glyphs identified in Franken+. The corresponding box files are then created from those tifs and are used to create the training (.tr files) for Tesseract. This text file contains multiple uses of every character glyph that we identified in our work, including special characters and ligatures. Most of these are identified and given Unicode values in the MUFI set.
    • emop.unicharambigs: This is a file used when creating the final .traineddata file of Tesseract training for a typeface or set of typefaces. It is used by eMOP to convert all special characters and ligatures into standard, modern equivalent characters (i.e., those found on a keyboard).
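The kind of substitution described for emop.unicharambigs can be illustrated with a small Python sketch. This is a conceptual example only: it does not reproduce the unicharambigs file format or the way Tesseract applies it, and the mapping entries are illustrative assumptions rather than entries taken from emop.unicharambigs.

```python
# Illustrative mapping only; not copied from emop.unicharambigs.
MODERN_EQUIVALENTS = {
    "\u017f": "s",   # long s
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
    "\ufb06": "st",  # st ligature
}


def modernize(text):
    """Replace special characters and ligatures with keyboard equivalents."""
    return "".join(MODERN_EQUIVALENTS.get(char, char) for char in text)
```

Characters that are not in the mapping pass through unchanged, so ordinary text is unaffected.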