TxDHC Presentation of eMOP Workflows (April 11, 2014)

A text outline of a presentation given at the Texas Digital Humanities Consortium's (TxDHC) Conference at the University of Houston, April 10-12, 2014. Through a series of workflow diagrams created over the life of the project, this presentation gives an overview of eMOP's OCR training, OCRing, and post-processing analysis and correction processes.


Follow eMOP at the 1st annual Texas Digital Humanities Consortium Conference via Twitter @ Storify:

Historical Typemaking and its Artifacts

In late 2013, Todd Samuelson traveled to Europe in search of typographical specimens for the eMOP initiative. In a series of dispatches, he will highlight his findings and discuss the significance of historical research in the development of the project.

eMOP Mellon Interim Report

Prepared by PI and IDHMC Director, Dr. Laura Mandell, and eMOP Co-Project Managers for year two, Matthew Christy and Elizabeth Grumbach, the following post contains the Mellon Interim Report for the Early Modern OCR Project.

Special Characters, Unicode, and Early Modern English

With a dataset of 45 million page images, the eMOP team is dealing with a lot of text output, and that means dealing with Unicode. As an early modern English project, we're also working with ligatures and other special characters specific to the period, and that means considering MUFI (the Medieval Unicode Font Initiative).

October OCR Testing & Training

eMOP progress continues as our team experiments to find the best method for training Tesseract to recognize various early modern fonts. The new Franken+ tool, developed by eMOP graduate student Bryan Tarpley, has passed through the alpha testing phase and dramatically improves our ability to create a variety of training sets for Tesseract. Now we're hard at work investigating various methods for creating these “training sets” to see what will give us the best OCR results.

eMOP's Zotero Page of OCR Readings

An eMOP library exists under the IDHMC Group in Zotero. It contains a variety of readings related to OCR in general and Tesseract in particular. Come check it out (at eMOP Zotero Library) and peruse our collection of OCR-related readings. You'll never want to know more than this about OCR.

This Fall on eMOP: Post Processing


In the near future, we intend to write up a post detailing our successes and goals for this fall, but we'd like to immediately share an interesting development at the beginning of Year Two. As our team and collaborators begin thinking towards the post-processing and triage stage of this project, we've been having a series of meetings here to rethink the granularity of our diagnostics and triage approach.

KB National Library of the Netherlands posts on eMOP

KB National Library of the Netherlands has recently given the Early Modern OCR Project some publicity on the other side of the Atlantic. Koninklijke Bibliotheek (KB) coordinates one of our international partner projects, IMPACT: Improving Access to Text.

eMOP Featured in Library Journal

Matt Enis, Associate Editor of Technology for the Library Journal, asks "OCR [optical character recognition] works great for paperbacks—but what about 15th Century texts set by hand?"

ProQuest Joins Forces with TAMU Scholars to Make 15th Century Books Behave Like Born-Digital Text

ANN ARBOR, Mich., November 6, 2012 - Information powerhouse ProQuest is participating in a project that will vastly accelerate research of 15th through 17th Century cultural history. The company will provide access to page images from the veritable Early English Books Online and newcomer Early European Books to the Early Modern OCR Project (eMOP) at Texas A&M. eMOP will use the content to create a database of typefaces used in the early modern era, train OCR software to read them, and then apply crowd-sourcing for editing. The project will turn the rich corpus of works from this pivotal historical period into fully searchable digital documents.