eMOP Directory/File Structure and Naming Conventions

This post covers the layout of the eMOP testing structure for Tesseract: where to put the output files, how to name them, and how to document each attempt folder.


The data for each font trained in Aletheia resides on dh-portal in /data/shared/TesseractTrainingData. The training attempts for each font reside within the specific font folder in this directory.

The directory structure is:

/data/shared/TesseractTrainingData/<font folder>/Training/attempt<#>
*where <#> is an integer.
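
The path layout above can be sketched in Python as follows. This is only an illustration: the helper names and the font folder name "guyot" are hypothetical, and starting the numbering at 1 when no attempt folder exists yet is an assumption, since the convention only says that <#> is an integer.

```python
from pathlib import Path, PurePosixPath
import re

# Base directory on dh-portal, as given above.
BASE = PurePosixPath("/data/shared/TesseractTrainingData")


def attempt_dir(font_folder, attempt, base=BASE):
    """Build <base>/<font folder>/Training/attempt<#> (hypothetical helper)."""
    if not isinstance(attempt, int) or attempt < 0:
        raise ValueError("attempt must be a non-negative integer")
    return base / font_folder / "Training" / f"attempt{attempt}"


def next_attempt_number(training_dir):
    """Return one more than the highest existing attempt<#>/ folder number.

    Assumption: numbering starts at 1 when no attempt folder exists yet.
    """
    pattern = re.compile(r"attempt(\d+)")
    highest = 0
    for entry in Path(training_dir).iterdir():
        match = pattern.fullmatch(entry.name)
        # Only directories named exactly attempt<#> count.
        if entry.is_dir() and match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


if __name__ == "__main__":
    # "guyot" is a made-up font folder name used only for illustration.
    print(attempt_dir("guyot", 4))
```

Calling next_attempt_number on a Training/ folder before creating a new attempt avoids reusing a number that is already recorded in the readme.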

Each attempt<#>/ folder holds all the training data generated for and by Tesseract for a specific font in that attempt to train Tesseract on the font:

  • .box files
  • .tif files
  • .font_properties file
  • .unicharambigs file
  • word lists
  • tess_script.sh


Readme File:

In each Training/ folder create an attempts_README.txt file to keep track of the work done in each attempt<#>/ folder.

For each attempt<#>/ be sure to record:

  • What are you testing with this attempt? How is it different from other attempts?
  • For files that exist in multiple attempt<#>/ folders, note any changes to files that you make in this attempt.
  • What output files are produced? How are the results?
  • What conclusions, if any, do you come to looking at the results?
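
One way to keep readme entries consistent is to generate them from a small template. The sketch below is an assumption about what such an entry could look like; this post does not prescribe a format for attempts_README.txt, and the function names and field labels are hypothetical.

```python
from pathlib import Path


def format_readme_entry(attempt, purpose, file_changes, outputs, conclusions):
    """Format one entry covering the four questions listed above.

    The layout and labels are illustrative, not a required format.
    """
    lines = [
        f"attempt{attempt}/",
        f"  Purpose: {purpose}",
        f"  File changes: {file_changes}",
        f"  Output and results: {outputs}",
        f"  Conclusions: {conclusions}",
        "",  # blank line separates entries
    ]
    return "\n".join(lines) + "\n"


def append_readme_entry(training_dir, entry):
    """Append an entry to attempts_README.txt inside a Training/ folder."""
    readme = Path(training_dir) / "attempts_README.txt"
    # Mode "a" creates the file if it is missing and never overwrites
    # earlier entries.
    with readme.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return readme
```

Appending rather than rewriting keeps the record of earlier attempts intact.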


Output Files:

When running the Tesseract command to generate text output, use the following naming convention:

out.a<#>.<image file>.<training file(s)>.txt

where:

  • <#> matches attempt<#>/
  • <image file> is an indicator of the name of the image file being OCRed
  • <training file(s)> is an indicator of the training file or files used in this test

examples:

  • out.a2.m94-17.mfle17.txt
  • out.a6.m94-17.mfle17-20.txt
    (This is the output of the 6th training attempt for this font: Tesseract was trained on MFLE94 pages 17-20 and then used to OCR MFLE94 page 17.)
  • out.a4.f83_06.guyot_Fplus1-37.txt
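
The naming convention above can be built and checked with a few lines of Python. This is a minimal sketch with hypothetical helper names; it assumes that neither indicator contains a dot, which holds for the examples above but is not stated as a rule. The last helper relies on the fact that the tesseract command takes an output base and appends .txt itself; the language name passed to -l is whatever the trained data is called, so it is left as a parameter.

```python
import re

# out.a<#>.<image file>.<training file(s)>.txt
# Assumption: the two indicators never contain a dot.
_OUTPUT_NAME = re.compile(
    r"out\.a(?P<attempt>\d+)\.(?P<image>[^.]+)\.(?P<training>[^.]+)\.txt"
)


def output_name(attempt, image, training):
    """Build an output file name following the convention above."""
    if "." in image or "." in training:
        raise ValueError("indicators must not contain dots")
    return f"out.a{attempt}.{image}.{training}.txt"


def parse_output_name(name):
    """Split an output file name into (attempt, image, training)."""
    match = _OUTPUT_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid output file name: {name!r}")
    return int(match["attempt"]), match["image"], match["training"]


def tesseract_command(image_path, name, lang):
    """Return the argument list for a Tesseract run that writes `name`.

    Tesseract appends ".txt" to the output base, so the suffix is
    stripped here before the base is passed on.
    """
    if not name.endswith(".txt"):
        raise ValueError("output name must end in .txt")
    output_base = name[: -len(".txt")]
    return ["tesseract", image_path, output_base, "-l", lang]
```

For example, output_name(6, "m94-17", "mfle17-20") gives out.a6.m94-17.mfle17-20.txt, and parse_output_name recovers the attempt number 6 from it, which can then be checked against the attempt<#>/ folder the file sits in.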