The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University received a $734,000 grant from the Andrew W. Mellon Foundation in 2012 to make 45 million pages of data machine readable. In partnership with Gale and ProQuest, the Early Modern OCR Project (eMOP) combines open source OCR (Optical Character Recognition) software and book history in order to improve the accuracy of OCR for early modern (1473-1800) texts. eMOP aims to publish an open source OCR workflow, improve the visibility of early modern texts by making them fully searchable, and form a community of scholars and institutions interested in the digital preservation of these texts. Our goal is to foster collaboration among various disciplines and, in doing so, cultivate inter-institutional and international relationships that make possible new kinds of humanities research.
Our workflow (see image below) blends the disciplines of book history, digital humanities, textual analysis, and machine learning in order to create a corpus of keyed texts that are far more accurate than current tools can produce. These keyed texts will improve access to early modern texts that are currently only searchable through “dirty” OCR or metadata alone. The open source OCR workflow will contain, among other things, access to an early modern font database, customization guidelines for the Tesseract OCR engine, post-processing and diagnostic algorithms, and crowdsourcing correction tools.
In addition to the published workflow, all tools produced by the eMOP project will be released as open source on our eMOP GitHub page. Many are now available for use in alpha or beta form.
- In conjunction with members of the KB National Library of the Netherlands, the published workflow will be made available in an open source, electronic, flexible format via the Taverna Workflow Management System.
- The Franken+ tool, developed by Bryan Tarpley at Texas A&M, enables the creation of an “ideal” typeface using glyphs identified in scanned images of documents from the early modern period. Franken+ also exports these typefaces to a training library for the open-source OCR engine Tesseract.
- Aletheia Web Layout (AWL) Editor, developed by PRImA at the University of Salford, is a crowd-sourced correction tool for re-drawing regions on problematic OCR’d pages, such as title pages, multi-columned texts, image-heavy documents, and more.
- The TypeWright software, developed by Performant Software Solutions and 18thConnect, enables users to correct the “dirty” OCR of an entire early modern document, and our partnership with ECCO allows 18thConnect to release fully corrected documents to their scholar-editors in plain text and TEI-A formats.
- The Cobre tool, developed by Dr. Anton DuPlessis and Cushing Memorial Library and Archives at Texas A&M, enables scholar-experts to compare, re-order pages, and annotate the metadata for multiple printings of documents in the eMOP dataset.
- Working with members of SEASR at the University of Illinois and the Perception, Sensing, and Instrumentation (PSI) Lab at Texas A&M University, eMOP is developing a sophisticated post-processing diagnosis and treatment triage system for our OCR output. Utilizing machine learning techniques and the latest text-correction algorithms, this triage system will allow eMOP to examine the output of its OCR process, correct errors made by the OCR engine, and, in the case of page images that are too degraded to OCR well, determine what is wrong with the image for further pre-processing and re-OCR'ing later. The code for this triage system will be available as open source on the eMOP GitHub page by the end of the grant period.
- The Anachronaut tool (prototype - please use Chrome to view demo), developed by a team of undergraduates and Dr. Ricardo Gutierrez-Osuna at Texas A&M, is a Facebook game that uses the power of Facebook (and many layers of user confidence testing) to correct single words and phrases.
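The triage idea described above for OCR output (examine a page's text, then either correct it or send the image back for pre-processing and re-OCR) can be illustrated with a very small sketch. This is not eMOP's triage code: the word list, the `lexicon_hit_rate` and `triage` names, and the 0.5 threshold are all assumptions made for illustration, and a dictionary hit rate stands in for the machine learning techniques the project describes.

```python
import re

# Hypothetical miniature word list; a real system would need a
# period-appropriate lexicon covering early modern spellings.
LEXICON = {"the", "of", "and", "to", "in", "a", "that", "is",
           "was", "for", "with", "his", "king"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_hit_rate(text, lexicon=LEXICON):
    """Return the fraction of tokens found in the lexicon (0.0 if no tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def triage(text, threshold=0.5):
    """Route a page's OCR text (threshold is an assumed cutoff).

    "correct"   -> mostly recognizable words; worth automated correction
    "reprocess" -> too garbled; the page image likely needs pre-processing
                   and another OCR pass
    """
    if lexicon_hit_rate(text) >= threshold:
        return "correct"
    return "reprocess"


print(triage("the king of the land"))
print(triage("tbe k1ng 0f tlie l@nd"))
```

A single hit-rate score is only a rough proxy for page quality. Its appeal is that it needs nothing but the OCR text itself, so it can run before any more expensive analysis of the page image.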
Why eMOP Matters
The Early Modern OCR Project is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO) creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archivable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.
Linked here is a presentation given by Dr. Mandell that explains how eMOP and ARC further the move toward a sustainable cyberinfrastructure.
In high-bandwidth environments, you can find the whole video here.
Mellon Grant Info Repository
The Mellon Foundation has given us permission to publish the grant narrative and appendix, with financial details removed. You are welcome to peruse these documents, and we would appreciate any comments. Preliminary documents are also available on the IDHMC's CommentPress. Send thoughts to mandell – at – tamu – dot – edu.