Aletheia and Franken+ Demonstration Videos

We've added a couple of videos that we produced showing how eMOP team members used Aletheia and Franken+ to create training for Tesseract. Check them out.

Aletheia Demo

Pre-processing Page Images for OCR'ing

While we were not able to apply pre-processing to the eMOP corpus, it is nevertheless an important step in any OCR workflow. At the very least, page images need to be binarized, that is, turned into black-and-white images. There are also options for removing noise, fixing skew, etc. Pre-processing can help to mitigate or remove problems that could affect the quality of your OCR output. However you receive your page images, you should spend some time examining them and, if necessary, pre-processing them for an improved outcome.
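To make the binarization step concrete, here is a minimal sketch in plain Python. It assumes a grayscale page is given as rows of 0-255 integers and picks the cutoff with Otsu's method, which is one common choice and not necessarily what any particular eMOP tool uses. The function names `otsu_threshold` and `binarize` are hypothetical, chosen only for this illustration.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cutoff that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate cutoff
    sum_bg = 0    # sum of their gray levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map each pixel to 0 (black) or 255 (white) using an Otsu cutoff.

    `image` is assumed to be a list of rows, each a list of ints in 0..255.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


# Tiny made-up "page": dark ink values near 10, light paper values near 200.
page = [[10, 12, 200],
        [11, 210, 205]]
print(binarize(page))
```

Real page images would normally be loaded and cleaned with an imaging library. The point here is only to show what "turned into black and white" means at the pixel level.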

Juxta-CL Text Comparison Tool is Available

Juxta-CL is a command-line text comparison tool based on the online JuxtaCommons tool. Performant Software Solutions created it for eMOP to do ground-truth comparison for testing the accuracy of our OCR processes. It is now available as open source through our GitHub page. Have fun.
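For readers new to ground-truth comparison, the sketch below shows the general idea: count the character edits needed to turn the OCR output into the hand-corrected transcription. This is not Juxta-CL's algorithm or interface. It is a generic Levenshtein-distance illustration, and the names `levenshtein` and `char_accuracy` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth, ocr_text):
    """Character accuracy in 0.0..1.0, relative to the ground-truth length."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    dist = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - dist / len(ground_truth))


# Made-up example: the OCR engine reads a long s as "f".
print(char_accuracy("the long s", "the long f"))
```

A score like this only makes sense when the ground truth and the OCR output cover the same page and are normalized the same way (whitespace, line breaks, ligatures), so treat it as a starting point, not a full evaluation.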

eMOP Quick Start Guide for Aletheia and Franken+

Aletheia and Franken+ are key tools used by the eMOP team to create early-modern typeface training for Tesseract. Here are some quick tips on getting started with them. Both have much more functionality, though, so continue to explore these two great tools.

Installing/Building Tesseract for Windows 8

Installing the latest release of Tesseract (3.02.02) on Windows 8 is pretty simple, but you'll have more work to do if you want to get the latest "beta" version (3.03) working on Windows. Don't be daunted, however: we've found some easy-to-follow instructions to help you out.

Installing Tesseract

The Tesseract Windows Installer works pretty well and painlessly as long as you want to use v3.02.02, the latest official release.

Installing Franken+ on Windows 8

Franken+ is a pretty simple install on its own, but it does have some prerequisite software that, for me at least, posed some challenges.

The main sticking point in installing Franken+ is installing the prerequisite software and setting up Franken+ to use the database. Please read below for more detailed information on those processes.

Using Gamera as an OCR Engine

While the open-source OCR engine of choice on eMOP is Tesseract, we also want to keep our options open and are trying some other open-source OCR options. One of those is Gamera.

Testing with Tesseract

Once we had completed our training, we needed to do some testing before going into limited, and then full-scale, production mode. We have 45 million page images to scan.

Training with Tesseract

eMOP Directory/File Structure and Naming Conventions

This post covers the layout of the eMOP testing structure with Tesseract, including where to put the output files, what the naming convention is, and how to document each attempt folder.