eMOP Releases its Full Set of Early Modern Typeface Training for Tesseract

Submitted by mchristy on Mon, 02/16/2015 - 12:23

In accordance with Andrew W. Mellon Foundation grant requirements and IDHMC guiding principles, the Early Modern OCR Project has released all of the Early Modern Typeface Training we created for use with the Tesseract OCR engine.

As described in an earlier post, eMOP created its own early modern typeface training for use with Google's Tesseract OCR engine. Tesseract's native training method did not work for early modern typefaces, which are available through high-quality page images of period printed works. We developed a training system that uses PRImA Lab's ground-truth creation tool Aletheia together with a tool we created called Franken+.

We created Tesseract training for 47 different typefaces (Roman, Italic, and Blackletter) ranging from 1559 to 1769. The collection includes typefaces created by John Baskerville, William Caslon, Johannes Enschede, and more. Please check out our GitHub repo of Tesseract Training at https://github.com/Early-Modern-OCR/TesseractTraining to find early modern dictionaries, Tesseract training files (.traineddata files and .tif/.box file pairs), sample typeface tif images, and more. Also included are several combination training files:

all of our Blackletter typefaces in one training file,
all of our Roman & Italic typefaces in one training file, and
"Super Combo 7", which is the final, thoroughly tested combination of Roman, Italic, and Blackletter typefaces that we are using to OCR the ECCO and EEBO collections.

The GitHub README should explain what's there, but if you have questions, please let us know.
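For readers who have not used a custom training file before, here is a minimal sketch of how one of the downloaded .traineddata files might be passed to the Tesseract command-line tool. It is an illustration under stated assumptions, not eMOP's own workflow: the file and image names are hypothetical, the helper function is ours, and it assumes a Tesseract version that supports the `--tessdata-dir` option. Whether a given training file loads may also depend on your Tesseract version.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, traineddata):
    """Build the argument list for running Tesseract with a custom training file.

    All three arguments are paths chosen by the caller; none of the names
    used in the example below come from the eMOP repository.
    """
    traineddata = Path(traineddata)
    if traineddata.suffix != ".traineddata":
        raise ValueError("expected a .traineddata file, got: %s" % traineddata)

    # Tesseract refers to a training file by its name without the extension
    # and looks for it in the tessdata directory.
    lang = traineddata.stem

    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir", str(traineddata.parent),
        "-l", lang,
    ]


def run_ocr(image, output_base, traineddata):
    """Run Tesseract; requires the tesseract executable on the PATH."""
    subprocess.run(build_tesseract_command(image, output_base, traineddata), check=True)


# Hypothetical file names, for illustration only.
command = build_tesseract_command("page.tif", "page_out", "training/example.traineddata")
print(" ".join(command))
```

Building the command separately from running it makes it easy to inspect what will be executed before calling `run_ocr`, which writes the recognized text to a file named after the output base.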

The IDHMC and eMOP would like to thank eMOP Graduate Research Assistant Kathy Torabi for all her work in managing the creation of Tesseract training. We'd also like to thank Texas A&M undergraduate students Laura Matas, Gabriella Pallares, Stephany Lara Guzman, Taylor Phillips, Alyssa Rivers, and Shelly Hubertus, and Texas A&M graduate student Tess Habbestad, for all their hard work on the project. Also, thank you to IDHMC & eMOP Graduate Research Assistant Bryan Tarpley, who created Franken+, which was so instrumental in our training creation process.
