This Fall on eMOP: Post-Processing

We plan to write a fuller post detailing our successes and goals for this fall soon, but we'd like to share an interesting development at the start of Year Two right away. As our team and collaborators begin thinking ahead to the post-processing and triage stage of this project, we've been holding a series of meetings to rethink the granularity of our diagnostics and triage approach.

Some of you may know that we worked with Performant Software to design an "eMOP Dashboard" tool, which allows us to assign fonts to batches of documents, schedule those documents to run through Tesseract on the Brazos High Performance Computing Cluster, and then measure their performance using two OCR diagnostic algorithms: RETAS, from our collaborators at the University of Massachusetts-Amherst, and Juxta, from Performant Software. These algorithms compare our OCR output to ground-truth documents supplied by the Text Creation Partnership (TCP).
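As a rough illustration of the kind of scoring these diagnostics perform, here is a minimal sketch that measures character accuracy against a ground-truth transcription using edit distance. This is a stand-in for the general idea, not the actual RETAS or Juxta algorithm, and the function names are ours.

```python
# Minimal sketch of scoring OCR output against a ground-truth transcription.
# Illustrative only -- not the actual RETAS or Juxta algorithms.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Score an OCR page against its TCP transcription, from 0.0 to 1.0."""
    if not ground_truth:
        return 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))

print(character_accuracy("Tne quick brovvn fox", "The quick brown fox"))
```

In practice, both texts would presumably be normalized first (whitespace, long s, ligatures, and the like) so that early modern typography isn't counted against the OCR engine.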

While we perfect our font-training workflow for Tesseract, parse imprint lines for our font history database, and plan the wide release of our Franken+ tool (which allows for easier creation of font libraries), we're also meeting about what happens after we OCR all 300,000 of our documents from EEBO and ECCO. Specifically, SEASR (the Software Environment for the Advancement of Scholarly Research) at the University of Illinois and Ricardo Gutierrez-Osuna at Texas A&M University have been meeting with us to discuss next steps.

What we've realized is that the grant document didn't specify the level of granularity we need to meet our goals. One of our grant tasks is to build a triage system that routes documents needing human attention to our three crowdsourced correction tools: TypeWright, Cobre, and Aletheia Web. Over the past few weeks, we've met to add nuance to this goal. For instance, in order to triage documents that fall short of our OCR accuracy goal of 95-97%, we first need a diagnostics system in place. Both of our post-processing collaborators have interesting approaches, and we're confident we're converging on a method that meets our needs.
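To make the idea concrete, here is a hypothetical sketch of such a triage rule. The bands around our 95-97% goal and the routing choices below are placeholders for illustration, not settled eMOP policy.

```python
# Hypothetical triage rule, sketched for illustration only: the thresholds
# and routing destinations are placeholders, not eMOP's settled policy.

def triage(page_accuracy: float) -> str:
    if page_accuracy >= 0.97:
        return "accept"                  # meets the OCR goal outright
    if page_accuracy >= 0.95:
        return "spot-check"              # borderline; light human review
    if page_accuracy >= 0.70:
        return "crowdsource-correction"  # e.g. TypeWright / Cobre / Aletheia Web
    return "re-process"                  # diagnose, pre-process, and re-run OCR

for score in (0.98, 0.96, 0.83, 0.41):
    print(f"{score:.2f} -> {triage(score)}")
```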

In addition, we hope to algorithmically identify common issues in documents that fall short of our accuracy goal. Our IDHMC lead programmer and my co-project manager for Year Two, Matthew Christy, has been analyzing OCR'd documents to find "cues" for common issues. We've been trying to answer questions such as: What kinds of patterns arise in the OCR output of documents with bleedthrough? What do the hOCR bounding-box values look like for a page that is skewed or warped? Can we tell, from the prevalence of certain letters, that a document is printed in a language other than English?
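As one concrete example of what such a cue might look like, here is a rough sketch that estimates page skew from hOCR bounding boxes: if the bottom edges of the words on each line drift consistently up or down across the page, the page image is probably rotated. The regex parsing assumes Tesseract-style hOCR and is simplified for illustration; a production version would use a real HTML parser, and the threshold for flagging a page is left to the caller.

```python
# Sketch of a skew "cue" computed from hOCR bounding boxes.
# Assumes Tesseract-style hOCR; regex parsing is simplified for illustration.
import math
import re

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_TITLE = re.compile(r"class=['\"]ocrx_word['\"][^>]*title=['\"]([^'\"]+)")

def estimate_skew_degrees(hocr: str) -> float:
    """Median per-line slope of word bottom edges, in degrees.
    Close to 0 on a straight page; consistently nonzero suggests skew."""
    slopes = []
    # Split the page into per-line chunks on the ocr_line class marker.
    for chunk in re.split(r"class=['\"]ocr_line['\"]", hocr)[1:]:
        pts = []
        for title in WORD_TITLE.findall(chunk):
            m = BBOX.search(title)
            if m:
                x0, _, x1, y1 = map(int, m.groups())
                pts.append(((x0 + x1) / 2, y1))  # (word center x, bottom y)
        if len(pts) < 3:
            continue
        # Least-squares slope of bottom-edge y against horizontal position.
        n = len(pts)
        mx = sum(x for x, _ in pts) / n
        my = sum(y for _, y in pts) / n
        den = sum((x - mx) ** 2 for x, _ in pts)
        if den:
            num = sum((x - mx) * (y - my) for x, y in pts)
            slopes.append(num / den)
    if not slopes:
        return 0.0
    slopes.sort()
    return math.degrees(math.atan(slopes[len(slopes) // 2]))
```

Similar lightweight heuristics could be sketched for the other questions, such as comparing letter frequencies against an English baseline to flag pages likely printed in another language.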

Both Ricardo Gutierrez-Osuna and our SEASR collaborators, Loretta Auvil and Boris Capitanu, will be helping us answer these questions over the coming weeks. We're looking forward to developing a system that lets scholars and projects diagnose, from the OCR output alone, why a document failed OCR, and then identify which pre-processing algorithms should be applied to get better results. We think this can have wide application beyond the IDHMC and eMOP, especially for institutions engaged in large-scale OCR projects.
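To sketch how diagnosis might feed into treatment, here is a hypothetical mapping from detected cues to candidate pre-processing steps. The cue names and remedies are illustrative assumptions, not eMOP's actual recommendations.

```python
# Hypothetical mapping from detected cues to candidate pre-processing steps.
# Cue names and remedies are illustrative assumptions, not eMOP policy.
PREPROCESS_FOR_ISSUE = {
    "skew":         ["deskew by the estimated rotation angle"],
    "bleedthrough": ["adaptive-threshold binarization", "background subtraction"],
    "warp":         ["dewarp via baseline straightening"],
    "non-english":  ["re-run OCR with matching language/font training"],
}

def recommend(issues: list[str]) -> list[str]:
    """Collect the pre-processing steps suggested by the detected cues."""
    steps = []
    for issue in issues:
        steps.extend(PREPROCESS_FOR_ISSUE.get(issue, ["flag for human review"]))
    return steps

print(recommend(["skew", "bleedthrough"]))
```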