TxDHC Presentation of eMOP Workflows (April 11, 2014)
A text outline of a presentation given at the Texas Digital Humanities Consortium's (TxDHC) Conference at the University of Houston, April 10-12, 2014. This presentation provides an overview of the OCR training, OCRing, and post-processing analysis and correction processes being done by eMOP through a series of workflow diagrams created over the life of the project.
The Early Modern OCR Project (eMOP)
The Early Modern OCR Project (eMOP) is a grant project funded by the Andrew W. Mellon Foundation and run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University. It develops and tests tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand-press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the machine readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). More generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
In addition, all tools used and produced by eMOP must be open source.
We have had to deal with a number of problems inherent in this project.
Early Modern Printing
Documents printed in this era present a number of challenges in and of themselves:
- Individual, hand-made typefaces can be unique and distinctive.
- Worn and broken type: expense means type is used even as age and damage change its appearance.
- Inconsistent inking: too much and too little applied ink can change glyph shapes.
- Bleed-through from too much ink or thin paper creates noise on page reverses.
- Inconsistent baselines are difficult for OCR engines to handle.
- A mix of typefaces requires very large, diverse training sets.
- Decorative drop caps are not recognized by OCR engines and can confuse them.
- Unusual page layouts, music, astronomical figures, decorative page elements, all can create confusion for OCR engines.
- Special characters & ligatures have to be in the training set for the OCR to recognize them properly.
- Spelling variations require period word lists that include variant spellings in order to post-correct text.
- Non-English & mixed language documents can't be corrected with the English word lists above.
These documents were, for the most part, photographed, then turned into microfilm, then digitized, all over the span of 30-40 years. These third-generation copies do not meet the ideal standards set for OCR engines and, in addition, contain problems related both to their original forms and to their current digitizations.
- Torn and damaged pages
- Noise introduced to images of pages
- Skewed pages
- Warped pages
- Missing pages
- Inverted pages
- Incorrect metadata
- Extremely low quality TIFFs (~50 KB)
Gathering & Creating Data
The eMOP database is the heart of the Early Modern OCR Project. It is made up of metadata describing millions of pages of images, OCR results, and groundtruth compiled from 3 different sources. It required months of normalization and ingestion to set up and initialize.
- EEBO: ~125,000 documents, ~13 million page images, circa 1475-1700
- ECCO: ~182,000 documents, ~32 million page images, circa 1700-1800
- TCP: ~46,000 double-keyed hand transcriptions (44,000 EEBO, 2,200 ECCO)
- Total: >300,000 documents & 45 million page images.
- ECCO page images. Typically 1 document page per image.
- ECCO original OCR results in document-level XML files. These required conversion to page-level text files.
- ECCO TCP transcriptions in document-level XML and text files. These required conversion to page-level text files.
- EEBO page images. Typically 2 document pages per image. This required an updated page numbering system to allow us to keep document pages, page images, and groundtruth pages in sync.
- EEBO TCP transcriptions in document-level XML and text files. These required conversion to page-level text files.
ECCO, EEBO, and the TCP each used several different numbering schemes for these documents, which required additional metadata documents to tie page images to their corresponding OCR result and transcription files.
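The conversion from document-level XML to page-level text files might be sketched as follows. This is only an illustration, not eMOP's ingestion code: the `<page n="...">` element, the helper names, and the output file naming are all assumptions made for the example, and the real ECCO and TCP schemas differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_into_pages(xml_text):
    """Return {page_number: text} for a document-level XML string.

    Assumes a hypothetical schema where each page is a <page n="..."> element.
    """
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        number = int(page.get("n"))
        # itertext() gathers text from the page element and all its children;
        # split/join collapses runs of whitespace.
        pages[number] = " ".join("".join(page.itertext()).split())
    return pages


def write_page_files(doc_id, pages, out_dir):
    """Write one text file per page, named <doc_id>_<page>.txt (hypothetical naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number in sorted(pages):
        path = out / f"{doc_id}_{number:04d}.txt"
        path.write_text(pages[number], encoding="utf-8")
        paths.append(path)
    return paths
```

Keying every page file on a document identifier plus a page number is one simple way to let images, OCR results, and transcriptions be matched up later.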
Typeface Tools & Training
Tesseract must first be trained to recognize the typeface used in the document(s) it is OCRing. Tesseract has a native mechanism to create that training, but it relies on having training pages with characteristics that can typically only be created with modern word processors, which don't have early modern fonts. To create early modern font training for Tesseract we had to adapt and create our own tools.
- We start with high-quality images of texts printed in a desired typeface, created by eMOP collaborators at The Cushing Library and Archives.
- Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
- Franken+: Created by eMOP graduate student Bryan Tarpley. Used by the same team of undergraduates led by eMOP graduate student Kathy Torabi.
- Takes Aletheia's output files as input.
- Groups all glyphs with the same Unicode values into one window for comparison.
- Mistakenly coded glyphs are easily identified and re-coded.
- A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.
- Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
- Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
- Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.
- Outputs a .traineddata file used by Tesseract when OCRing page images.
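Two of the ideas above, grouping glyph exemplars by Unicode value and writing Tesseract box-file lines, can be sketched in a few lines. This is not Franken+ code and it does not parse Aletheia's XML: the `Glyph` record and both helpers are hypothetical, and glyph coordinates are assumed to be given from the top-left corner of the image. The box-line layout (symbol, left, bottom, right, top, page, with a bottom-left origin) follows Tesseract's box file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One glyph exemplar; x/y are the top-left corner in image pixels (assumed)."""
    char: str
    x: int
    y: int
    width: int
    height: int


def group_by_char(glyphs):
    """Group exemplars by Unicode value so miscoded ones stand out on review."""
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph.char].append(glyph)
    return dict(groups)


def to_box_line(glyph, image_height, page=0):
    """Format a glyph as a Tesseract box-file line.

    Box files use a bottom-left origin, so the y values are flipped.
    """
    left = glyph.x
    right = glyph.x + glyph.width
    top = image_height - glyph.y
    bottom = image_height - (glyph.y + glyph.height)
    return f"{glyph.char} {left} {bottom} {right} {top} {page}"
```

For example, a long s at x=10, y=20 that is 8 pixels wide and 30 pixels tall, on an image 100 pixels high, becomes the line `ſ 10 50 18 80 0`.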
eMOP Control Tools
- Query Builder: Users can build queries to find specific documents, or sets of documents, in the eMOP DB. Document sets can be labeled or grouped with an identifier.
- Data Downloader: Available via the QB. Allows identified items to be downloaded: page images, OCR results, associated groundtruth.
- eMOP Dashboard: Users select documents from the database, individually or with the identifier created in the QB, for OCRing with specific typeface training. The Dashboard displays results with Juxta/RETAS scores when groundtruth is available.
Due to the proprietary nature of the data contained in the eMOP DB, these tools all require authentication to access. However, the Dashboard code will be released by the end of the grant via the IDHMC GitHub page.
The Dashboard utilizes the emop_controller to schedule OCRing, remove noise from the results, score result accuracy for documents with groundtruth, and begin the post-processing triage workflow.
The emop_controller is a Java program that runs on the Brazos High Performance Computing Cluster and ensures maximum utilization of the 128 processors available for our use by scheduling various functions and processes of the controller separately.
- The selected documents are marked in DB as scheduled and the documents are queued for processing.
- A cron job periodically checks the queue and OCRs any scheduled pages that have not yet been processed.
- Each page is OCRd with Tesseract.
- After OCRing, each page's hOCR output file is de-noised.
- hOCR pages that have matching Groundtruth are scored using Juxta and RETAS algorithms.
- File paths and scores are written to the eMOP DB.
- hOCR documents are examined again and given an estimated correctability (ECORR) score.
- hOCR page results are sent to post-processing triage.
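Most of the steps above work on hOCR, which stores each recognized word in a span of class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` entry. The sketch below shows one way such words and boxes could be read with Python's standard library. It is an illustration only, not part of the emop_controller (which is written in Java), and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._depth = 0    # nesting depth inside that word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 0
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself: store the finished word.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) tuples in document order."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The resulting list of words and boxes is the kind of input that de-noising, scoring, and triage steps could then operate on.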
Our triage system examines hOCR page results and sends each page to be further corrected, diagnosed, or both. An estimated correctability (ECORR) score determines which path(s) the page follows, and is based on looking at token lengths and composition. Correction of "good" OCR-created text is done with code created by SEASR, and utilizes period-specific word lists with alternate spellings, 2- and 3-gram data from Google, and a series of algorithms. Diagnosis of "bad" OCR results uses machine-learning algorithms to identify characteristics of the hOCR output to determine whether the page is skewed, warped, noisy, etc.
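To make the idea of a score based on token lengths and composition concrete, here is a deliberately simple sketch. It is not the eMOP ECORR algorithm: the rule in `looks_correctable`, its `max_len` and `min_alpha` parameters, and the path names are all assumptions made for illustration. Only the 25% and 75% cut-offs are taken from this outline.

```python
def looks_correctable(token, max_len=20, min_alpha=0.8):
    """Hypothetical rule: the token is not overly long and is mostly letters."""
    core = token.strip(".,;:!?()[]'\"")
    if not core or len(core) > max_len:
        return False
    letters = sum(ch.isalpha() for ch in core)
    return letters / len(core) >= min_alpha


def ecorr(text):
    """Share of whitespace-separated tokens that look correctable (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_correctable(t) for t in tokens) / len(tokens)


def triage_paths(score):
    """Pick the path(s) for a page using the thresholds given in this outline."""
    paths = []
    if score > 0.25:
        paths.append("correct")   # further correction (SEASR)
    if score < 0.75:
        paths.append("diagnose")  # page image diagnosis (PSI)
    return paths
```

Because the two thresholds overlap, a page scoring between 25% and 75% follows both paths, which matches the "path(s)" wording above.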
- De-noise hOCR results
- Juxta/RETAS scores for docs w/Groundtruth
- Calculate ECORR
ECORR >25%: Attempt to further correct text (SEASR)
- Check bbox coords of consecutive lines/words to identify if any are out of order.
  - If so, check to see if the page is too badly skewed.
    - If so, tag page as skewed in the eMOP DB.
    - Otherwise, use bbox coords to make consecutive boxes adjacent.
  - Otherwise, send page for correction.
- Correct page's text as much as possible.
- Count corrected words.
- Compute ratio of corrected words to total words on page.
- Send all processed pages to TypeWright for crowd-sourced hand correction.
- If ratio is less than 40%, send to PSI queue for analysis.
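The ordering check and the correction ratio in the steps above could look roughly like this. It is a sketch under simple assumptions (a single-column page, boxes given as hOCR-style `(x0, y0, x1, y1)` tuples with y increasing down the page); the function names are hypothetical and this is not the SEASR code.

```python
def out_of_order_lines(line_boxes):
    """Return indexes of lines whose top edge is above the previous line's top edge.

    Each box is (x0, y0, x1, y1) with y increasing down the page, as in hOCR.
    """
    return [
        i
        for i in range(1, len(line_boxes))
        if line_boxes[i][1] < line_boxes[i - 1][1]
    ]


def correction_ratio(corrected_words, total_words):
    """Ratio of corrected words to all words on the page."""
    return corrected_words / total_words if total_words else 0.0


def needs_psi(corrected_words, total_words, threshold=0.40):
    """True when the ratio falls below the 40% cut-off mentioned above."""
    return correction_ratio(corrected_words, total_words) < threshold
```

A page where 30 of 100 words were corrected has a ratio of 0.30 and would be queued for PSI analysis, while a page at exactly 0.40 would not.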
ECORR <75%: Analyze hOCR elements to diagnose page image problems (PSI)
- Machine-learning algorithms examine OCR results to determine what is wrong with the page image: skewed, warped, noisy, wrong font, etc.
- Create a character frequency distribution for pages diagnosed with wrong font.
- Does the character frequency distribution indicate that the wrong typeface family (blackletter, roman, or italic) was used?
  - If so, send back for re-OCRing with a different font family.
  - Otherwise, or if this page has already been OCRd with multiple typeface families, send to Cobre for typeface identification by an expert.
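A character frequency distribution is straightforward to compute. The sketch below also shows one possible way to compare a page's distribution with a reference distribution, using total variation distance. How eMOP actually decides that the wrong typeface family was used is not described here, so the comparison step, the reference distribution, and the function names are assumptions for illustration only.

```python
from collections import Counter


def char_distribution(text):
    """Relative frequency of each non-whitespace character in the text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def distribution_distance(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    chars = set(p) | set(q)
    return 0.5 * sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in chars)
```

A page whose distribution sits far from a reference built from known-good text could then be flagged as a candidate for re-OCRing with a different typeface family; any cut-off for "far" would have to be chosen empirically.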
eMOP is a large project with several other efforts and goals not shown here. Please browse this site for more information. The final outcomes related to these workflows are:
- TypeWright: Any document that ends up with >50% of its pages having an ECORR score of <25% will end up being ingested into TypeWright for crowd-sourced corrections.
- This will be the first time that any EEBO OCR text will be made available. Its import into TypeWright also means that corrected documents can be released to scholars for use in digital editions or other scholarship.
- Tags DB: The following pages will be tagged in the eMOP DB with tags that indicate what problems the pages have which are preventing or degrading OCR output. The tags will allow us to apply the appropriate pre-processing algorithms at a later stage prior to re-OCRing, which should improve the output.
- Pages with an ECORR of <75%.
- Pages with an ECORR of >25% that had <40% of their words corrected by SEASR.
This will be the first time a comprehensive analysis of page images will be conducted for the EEBO and ECCO collections.
- Taverna: Versions of these workflows will be ported into Taverna for use by anyone wishing to set up a similar, open-source OCR system.
- GitHub: All code created by and for eMOP will be released as open source, under an Apache Foundation license, at the IDHMC GitHub page.