Pre-processing Page Images for OCR'ing
While we were not able to apply pre-processing to the eMOP corpus, it is nevertheless an important step in any OCR workflow. At the very least page images need to be binarized—turned into black and white images. There are also options for removing noise, fixing skew, etc. Pre-processing can help to mitigate or remove problems that could affect the quality of your OCR output. However you receive your page images, you should spend some time examining them and, if necessary, pre-processing them for an improved outcome.
For eMOP, the page images we received for our corpus had already been binarized. We noticed during our work, however, that many pages contained noise, skewing, warping, and other issues that would affect Tesseract's ability to give us good OCR text. But with 45 million page images, we were unable to perform blind pre-processing in bulk on the collection. Instead, we developed a post-processing system that examines the output we got from Tesseract (in the form of hOCR files), returns measures for noisiness and skew, and identifies multiple columns on a page. Our intention is to then use that information, stored in the eMOP database, to apply the appropriate pre-processing before re-OCR'ing pages that had previously returned bad results.
In addition, the eMOP team has been busy helping other scholars and institutions apply the eMOP workflows to other OCR projects. For those projects, pre-processing page images has been a necessary and important step in the process. As such, we've created this post as a guide for the pre-processing tools and steps that we've been working out with various partners.
We have tried two different open-source image processing/editing software packages, ImageMagick and GIMP. There are many others. There are also many excellent proprietary image processing programs available for purchase, often with some kind of educational discount. You should use whatever software is available to you and that you feel comfortable using. Below are some recommendations that we've found to be helpful.
ImageMagick is an excellent open-source image processing program available for most computing platforms. It has a huge user base and is under continued development. The documentation on the website is not bad, but more importantly there is plenty of help available via Google searches and support groups (see the list at imagemagick.org/script/links). Graphical user interfaces are available for ImageMagick, but the guide below uses the command line on a Mac. Please make adjustments if you're using a Windows PC.
ImageMagick can actually be thought of as a suite of tools. There are many different commands available, all with multiple options. Because a single task can involve several steps or multiple commands, many people have written scripts that bundle the work. One excellent source of such scripts is Fred's ImageMagick Scripts. We have used several of Fred's scripts, and they are the basis of the guide below.
There are some good install instructions for both UNIX(/Mac) and Windows at the ImageMagick site (imagemagick.org/script/install-source).
- $> port search imagemagick
Here I get a list of different options available based on the version of PHP on my computer. To find that, I do:
- $> php -v, then
- $> port install
using the package name of the appropriate version of ImageMagick for my version of PHP.
Any required software dependencies should be handled by Homebrew and MacPorts, though the ImageMagick Install page does not mention any. However, if you will be working with tif files, you will want to be sure to install the libtiff library as well. This didn't seem to be necessary with MacPorts or the install from the ImageMagick site, but with Homebrew it is.
- $> brew install --with-libtiff --with-ghostscript imagemagick
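A quick way to confirm that the install can actually handle tif files is to look for a TIFF entry in ImageMagick's list of supported formats. The sketch below assumes the identify command from ImageMagick is on your PATH; the helper name has_tiff_support is made up for this example and is not part of ImageMagick.

```shell
# Hypothetical helper (not part of ImageMagick): reads a format listing
# on standard input and succeeds if any line mentions TIFF.
has_tiff_support() {
  grep -qi 'tiff'
}

# Typical use, once ImageMagick is installed:
#   identify -list format | has_tiff_support && echo "TIFF support found"
```

If the check fails, installing libtiff and then reinstalling ImageMagick is the first thing to try.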
You may also want to install Ghostscript, which, among other things, allows you to view PostScript-formatted files. You can do that with either MacPorts or Homebrew:
- $> brew/port install ghostscript
Then you may need to update your ImageMagick install again:
- Homebrew: $> brew update
- MacPorts: $> port upgrade imagemagick
Fred's ImageMagick Scripts are shell scripts that will work with most versions of UNIX or Mac OS X (perhaps even in a Windows environment with some kind of UNIX shell emulator installed). Installing the scripts is not technically necessary: because they are scripts, they only need to be downloaded, and can then be run from that folder or by giving the folder path when invoking the script. For example, I created a folder called ImageMagick-scripts in my home directory and downloaded all the scripts there. When I want to use one, I can call it from any folder by invoking:
- $> sh ~/ImageMagick-scripts/<script-name> <option> <filename(s)>
Fred has some good information about using his scripts on the homepage, including making sure the files are executable and adding your scripts folder to your PATH system variable.
In addition, each script includes a detailed explanation, a description of each option, and several examples of the script being used and its results on a sample image.
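The two setup steps mentioned above (making the scripts executable and adding the folder to PATH) can be sketched as a small shell function. The function name use_scripts_dir is hypothetical, and the PATH change only lasts for the current shell session; to make it permanent you would add the same line to your shell's startup file.

```shell
# Hypothetical helper: make every file in a scripts folder executable
# and put that folder at the front of PATH for this session.
use_scripts_dir() {
  usd_dir="$1"
  for usd_f in "$usd_dir"/*; do
    if [ -f "$usd_f" ]; then
      chmod u+x "$usd_f"
    fi
  done
  case ":$PATH:" in
    *":$usd_dir:"*) ;;                       # already on PATH, nothing to do
    *) PATH="$usd_dir:$PATH"; export PATH ;;
  esac
}

# Typical use, with the folder from the example above:
#   use_scripts_dir "$HOME/ImageMagick-scripts"
```

After this, a script can be called by name alone from any folder, without the sh prefix or the folder path.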
The GNU Image Manipulation Program (GIMP) is another good open-source tool for manipulating images. GIMP is available for Mac, UNIX and Windows. It comes with a graphical user interface for both Mac and UNIX as well. The documentation at the GIMP site is pretty good. Use it.
I've used GIMP a little and it's worked well for the most part. Having your documents open in a GUI while you work on them is certainly convenient and lets you see the results of what you've done right away.
There are a number of ways you can pre-process your documents to help an OCR engine "read" them well. Which ones you need will depend mostly on your documents and what is wrong with them. The number of documents you have is also a consideration and could limit your ability to do some or all of these steps. I'd recommend at least the steps below, where they are needed.
Binarizing an image (also referred to as thresholding an image to binary) is turning it from color or grayscale to black and white. This is a necessary step for most OCR engines. Even if your page images look black and white, you should check to see if they're grayscale. On a Mac select an image and hit <cmd+i>. Look at the Color space and/or Color profile fields about half-way down.
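The same check can be made from the command line, which is handy when there are many pages. This is a rough sketch that assumes ImageMagick's identify command is installed and that its %[type] format escape reports Bilevel for pure black and white images, as it does in the versions I am aware of. The function names needs_binarizing and check_page are made up for this example.

```shell
# Decide from ImageMagick's reported image type whether a page
# still needs binarizing.
needs_binarizing() {
  case "$1" in
    Bilevel) return 1 ;;   # already pure black and white
    *)       return 0 ;;   # Grayscale, TrueColor, Palette, ...
  esac
}

# Report the type of one page image. "[0]" restricts the check to the
# first frame of a multi-page file.
check_page() {
  cp_type=$(identify -format '%[type]' "$1[0]") || return 2
  if needs_binarizing "$cp_type"; then
    echo "$1: $cp_type (binarize before OCR)"
  else
    echo "$1: $cp_type (already black and white)"
  fi
}

# Typical use:
#   check_page page_0001.tif
```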
There are many different methods for binarizing an image, and Fred has created scripts for many of them. On Fred's main page scroll down to the Scripts By Category and look at the Threshold Segment column. Most have several options that can be tweaked to improve performance. I have not tried them all, but I've had success with these:
- otsuthresh uses the Otsu method of binarizing. The nice thing about this script is that it uses no options. It's worked pretty well on some pages. Just call it with an input file and give it the name of the output file:
$> otsuthresh <infile> <outfile>
- 2colorthresh is another with no options: $> 2colorthresh <infile> <outfile>
- localthresh is the one I've used the most. There are a lot of options, some of which I haven't even tried yet. I've had pretty good luck with these settings on a number of documents:
$> localthresh -m 1 -r 15 -b 8 -n yes <infile> <outfile>
Try different scripts and play around with the optional parameters until you find something that works well for your page images.
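One way to play with the parameters without retyping the command each time is a small sweep that runs localthresh with several radius and bias values on a single test page and saves each result under a descriptive name. This is only a sketch: it assumes Fred's localthresh script is on your PATH, the function name sweep_localthresh is made up, and the value ranges are arbitrary starting points around the settings quoted above rather than recommendations.

```shell
# Command to run; defaults to Fred's localthresh, assumed to be on PATH.
THRESH_CMD="${THRESH_CMD:-localthresh}"

# Try several radius/bias pairs on one page and keep every result,
# so the outputs can be compared side by side.
sweep_localthresh() {
  # args: infile outdir
  sweep_in="$1"
  sweep_outdir="$2"
  sweep_base=$(basename "${sweep_in%.*}")
  mkdir -p "$sweep_outdir" || return 1
  for sweep_r in 10 15 25; do          # radius values (illustrative)
    for sweep_b in 5 8 12; do          # bias values (illustrative)
      "$THRESH_CMD" -m 1 -r "$sweep_r" -b "$sweep_b" -n yes \
        "$sweep_in" \
        "$sweep_outdir/${sweep_base}_r${sweep_r}_b${sweep_b}.png" || return 1
    done
  done
}

# Typical use:
#   sweep_localthresh page_0001.tif sweep-out
```

Opening the nine output files next to each other makes it easier to see which settings keep the glyph edges intact.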
Aim: Your ultimate aim is to reduce the image to black and white without losing glyph integrity. You want the edges of your character glyphs to remain smooth. A good binarizing algorithm will also be able to reduce, if not eliminate, bleedthrough and noise. On pages that show under-inking of some of the glyphs you may not be able to do all of the above and still maintain the integrity of these under-inked glyphs. You may have to make some tradeoffs, but try to end up with the cleanest page image you can for as much of the page as you can.
Any noise or non-text page elements that you can eliminate from the image by cropping it will help. You can use ImageMagick, GIMP, or even native imaging programs like Paint (Win) or Preview (Mac) to do that.
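With ImageMagick, cropping comes down to passing a WIDTHxHEIGHT+X+Y geometry to the -crop operator. The sketch below works that geometry out from the page size and the margins you want to trim, all in pixels. The function names crop_geometry and crop_page are made up for this example, and it assumes the ImageMagick convert command is installed.

```shell
# Build a WIDTHxHEIGHT+X+Y geometry from the page size and the margins
# to trim, all in pixels.
crop_geometry() {
  # args: width height left top right bottom
  echo "$(( $1 - $3 - $5 ))x$(( $2 - $4 - $6 ))+$3+$4"
}

# Crop one page. +repage discards the old canvas offset so the output
# is just the cropped region.
crop_page() {
  # args: infile outfile width height left top right bottom
  crop_in="$1"
  crop_out="$2"
  shift 2
  convert "$crop_in" -crop "$(crop_geometry "$@")" +repage "$crop_out"
}

# Typical use: trim 100px left/right, 150px top and 200px bottom
# from a 2000x3000 page.
#   crop_page page.tif page-cropped.tif 2000 3000 100 150 100 200
```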
By noise, we mean the speckles visible on many page images, either introduced by binarizing or present because the originals are old or of poor quality. ImageMagick has two native operators for removing noise, enhance and despeckle, but I have not had much luck with either on its own. Denoising algorithms can often be applied iteratively to remove more and more noise, and Fred's noisecleaner script does just that with whichever of the two methods you choose. There are several denoising scripts on Fred's page, listed in the Scripts By Category table under the Noise Addition Removal column.
- denoise: I've had good luck with the following on several documents
$> denoise -f 6 -n 12 <infile> <outfile>
but again, you should play with these values to find what works best with your documents. There are additional options you can try as well, including an unsharp mask.
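To illustrate the idea of calling a denoising step iteratively, here is a minimal sketch that repeats ImageMagick's native -despeckle operator a chosen number of times in a single convert call. It is not Fred's noisecleaner or denoise script, just the plain operator applied repeatedly. The function name despeckle_n and the default of three passes are assumptions made for this example.

```shell
# Apply ImageMagick's -despeckle operator a fixed number of times.
despeckle_n() {
  # args: infile outfile [passes]
  dn_in="$1"
  dn_out="$2"
  dn_passes="${3:-3}"
  dn_ops=""
  dn_i=0
  while [ "$dn_i" -lt "$dn_passes" ]; do
    dn_ops="$dn_ops -despeckle"
    dn_i=$((dn_i + 1))
  done
  # $dn_ops is left unquoted on purpose so that it splits into
  # separate -despeckle options.
  convert "$dn_in" $dn_ops "$dn_out"
}

# Typical use: four passes over one page.
#   despeckle_n page.tif page-clean.tif 4
```

Each extra pass removes more speckles but also eats further into thin strokes, so it is worth checking the glyphs after every increase.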
Skewed page images are a fairly common problem with images photographed from books or scanned. Tesseract can handle a certain amount of skew without a problem, but too much can cause it to read parts of lines out of order. Fred's skew script is easy to use and pretty straightforward. The options you use will depend on how much skew your page exhibits, and will probably take some trial and error to get right.
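For comparison, ImageMagick also has a native -deskew operator, and the sketch below applies it to every tif in a folder. This is a different technique from Fred's script, shown only as a starting point. The function names deskew_page and deskew_dir are made up, and the 40% threshold is the value commonly suggested in the ImageMagick documentation, not a setting tested on the eMOP pages.

```shell
# Straighten one page with ImageMagick's native -deskew operator.
deskew_page() {
  # args: infile outfile [threshold]
  convert "$1" -deskew "${3:-40%}" +repage "$2"
}

# Straighten every .tif in a folder, writing results to another folder
# so the originals are left untouched.
deskew_dir() {
  # args: source-folder destination-folder
  mkdir -p "$2" || return 1
  for dk_f in "$1"/*.tif; do
    [ -e "$dk_f" ] || continue     # no matches: the pattern stays literal
    deskew_page "$dk_f" "$2/$(basename "$dk_f")" || return 1
  done
}

# Typical use:
#   deskew_dir pages pages-deskewed
```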