Using Gamera as an OCR Engine


While Tesseract is the open-source OCR engine of choice on eMOP, we also want to keep our options open and are trying some other open-source OCR tools. One of those is Gamera.

Gamera:

Gamera is not an OCR engine on its own. It is a glyph recognition system, and the glyphs can be musical notes, letters in a specific font, or anything else. Users input page images and then use Gamera to scan each page and identify the glyphs found on it. The end result is an XML file of training data that can be used to recognize each of the identified glyphs. Once Gamera is installed, its doc/ folder contains extensive documentation on how to train it for glyphs.


OCRing with Gamera

In order to use this training XML file to OCR documents, the Gamera OCR Toolkit (ocr4gamera) is needed. The OCR Toolkit takes the training XML file produced by Gamera and a page image as input and produces a text file of the OCR results as output. My experience so far is that Gamera takes significantly longer than Tesseract to OCR a page image, but that the results seem to be better. However, those results depend on extensive glyph training for the font in question. In this case the font is Baskerville and the training was done over several months by Michael Behrens [behrens4{at}illinois.edu]. (Thanks for the training data Mike!) A typical OCR time was slightly over 1 minute per page. We've also found that if a page takes more than 3 minutes to scan, it's probably going to fail. We're going to investigate setting up an internal time limit in the OCR Toolkit code to prevent runaway pages and, hopefully, reduce overall scan time on large documents with many pages.
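Until a limit like that exists inside the toolkit, the 3-minute rule can be approximated from the outside. The following is only a minimal sketch of an external wrapper, not the internal time limit described above. It assumes a separate, modern Python 3 interpreter is available to run it (not the Python that Gamera is installed under), and run_with_time_limit is a hypothetical helper name.

```python
import subprocess


def run_with_time_limit(command, limit_seconds=180):
    """Run a command and give up on it after limit_seconds.

    Returns (finished, return_code). When the limit is hit, finished is
    False and return_code is None.
    """
    try:
        completed = subprocess.run(command, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so a runaway page does not keep running in the background.
        return False, None
    return True, completed.returncode
```

A call would look something like run_with_time_limit(["ocr4gamera.py", "-x", "baskerville_library.xml", "-o", "page1.txt", "page1.tif"]), where the file names are placeholders. Pages that come back with finished set to False can be logged and skipped rather than holding up the rest of a large document.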

You can get a copy of our Baskerville font training XML file done with Gamera at the eMOP github page.


Installing Gamera and ocr4gamera in a $HOME Directory

If you're like me then you have access to a server, but not the permissions required to install applications on it. Instead, I can only install the apps in my $HOME directory. These are the steps I used to get the apps installed after several attempts:

Prerequisites:

Gamera and the OCR Toolkit are written in Python and so run as Python scripts (.py files). The setup files for both are also in Python, so make sure that Python is installed on your server first. Here's the list of required packages from the Gamera page:
g++, python, python-dev, python-wxgtk2.8, python-wxversion, libtiff4-dev, libpng12-dev, python-docutils, python-pygments
However, I am told that my server does not have the wxPython package, and I was still able to compile, install, and use Gamera and the OCR Toolkit.


Gamera:

  1. From your home directory, check out the latest version of Gamera from SVN into a folder named gamera/
    svn co svn://svn.code.sf.net/p/gamera/code/trunk/gamera gamera

  2. cd gamera/ and compile the code
    python setup.py build

  3. Create a new folder in your $HOME directory in which to install Gamera (in my case it's called tools/), then install Gamera into it
    mkdir ~/tools
    python setup.py install --prefix=~/tools

OCR Toolkit:

  1. Download the OCR Toolkit tarball and move it to your $HOME directory.
  2. Unzip the tarball in your $HOME directory. It will create a new folder called ocr-1.0.6/.
    tar -zxvf ocr-1.0.6.tar.gz

  3. Add some system variable information that will let ocr4gamera know where to find the files it needs that Gamera has installed
    export PYTHONPATH=[home-dir]/tools/lib64/python2.4/site-packages/
    export CFLAGS=-I[home-dir]/tools/include/python2.4/gamera
     where [home-dir] is the full path to your home directory. Don't use '~' as a shortcut. It didn't work for me.

  4. cd ocr-1.0.6/ to compile and install the toolkit
    python setup.py build
    python setup.py install --prefix=~/tools

  5. Add the folders that contain the Python executables to some PATH system variables. The best thing to do is to add the following to a .profile file in your $HOME directory, so you don't have to re-issue them every time you connect to the server.
    export PATH=~/tools/bin:$PATH
    export PYTHONPATH=~/tools/lib64/python2.4/site-packages/

    NOTE: ocr4gamera.py is in the bin/ folder and gamera.py is in the site-packages/ folder.
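After logging back in, it can be worth confirming that the two variables actually took effect. The sketch below is one hedged way to do that. It assumes a modern Python 3 interpreter is available for the check itself (separate from the Python 2.4 install above), and check_environment is a hypothetical helper, not part of Gamera or the OCR Toolkit.

```python
import os
import shutil


def check_environment(environ, script="ocr4gamera.py"):
    """Return a list of problems found; an empty list means the basics look fine."""
    problems = []
    # shutil.which only reports files that exist and are executable.
    if shutil.which(script, path=environ.get("PATH", "")) is None:
        problems.append(script + " not found on PATH")
    entries = [p for p in environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    if not entries:
        problems.append("PYTHONPATH is not set")
    for entry in entries:
        if not os.path.isdir(entry):
            problems.append("PYTHONPATH entry is not a directory: " + entry)
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print(problem)
```

Run as a script, it prints nothing when ocr4gamera.py is reachable through PATH and every PYTHONPATH entry is a real directory. It does not verify that Gamera itself imports correctly, only that the paths are in place.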


Including eMOP Modifications:

Gamera's default output is a text file. Dave Woods [woodsdm2{at}miamioh.edu] at Miami University (Ohio) modified the OCR Toolkit code to add an option ('-18') to the ocr4gamera.py module. Specifying this option causes Gamera to create XML-like output that includes coordinates for every word identified. I made a few small additions to include some metadata and make the output valid XML. The modified code is available in the eMOP github repository. Replace the current files in your OCR Toolkit install with these:

  • ocr4gamera.py -> ~/tools/bin/ocr4gamera.py
  • ocr_toolkit.py -> ~/tools/lib64/python2.4/site-packages/gamera/toolkits/ocr/ocr_toolkit.py


Using the OCR Toolkit

Using the OCR Toolkit with Gamera training is pretty straightforward, but there are a few things to know about first.

Non-standard Unicode Chars:

I didn't do the training on Gamera and I've only been introduced to it. But my understanding is that when you create classifiers in Gamera for your glyphs, you're supposed to use the same names for glyphs as are used in the standard Unicode set. For example, when you are telling Gamera to create a classifier for 'e', you'd create a classifier called latin.small.letter.e. Likewise, 'M' would be latin.capital.letter.m, and so on. You can, of course, create your own classifier names if you prefer, and you'll have to if you find that your character set includes non-standard Unicode characters like the c:t, long-s:h, and long-s:i ligatures. (NOTE: These are all defined in the extended MUFI Unicode set and map to Unicode values that are currently undefined in the standard set. See the MUFI Unicode set for more information.)

This is not a problem, but you will have to create an extra file, a text file with a .csv extension, that lets the OCR Toolkit know what Unicode values to use in its output for any non-standard classifier names. So, if you have something like italicized versions of your characters, you can create a classifier name like italics.small.letter.a, and then in your .csv file you'll have to include this classifier name and the Unicode character that should be displayed in the output for all matches of this classifier. The same goes for ligatures, or any non-standard classifier name. The file is a simple comma-separated list with one line per character. Each line has the classifier name you created followed by the Unicode character, separated by a comma.

If the character in question is not defined in standard Unicode then you may see a box, a dot, or some other indication that this is a character your editor doesn't understand or can't display. However, as long as you cut-and-paste the correct Unicode character in here, it will work. The best way I've found to do that is to use this Unicode Range Viewer. Enter your Unicode value in the search box at the top. Click on the character in the table. Scroll to the bottom and copy the character from the text box. Then just paste it into your .csv file.

You can download a copy of the .csv file I used for the Baskerville font training (also created by Dave Woods) from our eMOP github page. Notice that the list is mainly for italicized characters and ligatures.
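As an alternative to cutting and pasting, the file can be generated from code points. This is only a sketch under a few assumptions: that the OCR Toolkit reads the .csv file as UTF-8 (not confirmed here), that a modern Python 3 interpreter is used to create the file, and that the second classifier name below is made up purely for illustration. write_extra_chars is a hypothetical helper.

```python
import csv


def write_extra_chars(path, mapping):
    """Write one 'classifier.name,character' line per entry.

    mapping maps a classifier name to a Unicode code point (an int).
    """
    # Assumption: UTF-8 is the encoding the toolkit expects.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for classifier_name, code_point in mapping.items():
            writer.writerow([classifier_name, chr(code_point)])


# Illustrative entries only, not taken from the eMOP file.
example_mapping = {
    # Italic 'a' glyphs are output as a plain 'a' (U+0061).
    "italics.small.letter.a": 0x0061,
    # Hypothetical classifier name; U+017F is LATIN SMALL LETTER LONG S.
    "italics.small.letter.long.s": 0x017F,
}
```

Calling write_extra_chars("extra-chars.csv", example_mapping) produces a two-line file. Working from code points avoids depending on whether an editor can display the character, but note that csv.writer would quote a character such as a comma or a double quote, which the toolkit may not expect.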

With everything in place now, all you have to do is run the OCR Toolkit on some page images:

ocr4gamera.py -x [rpath]/baskerville_library.xml -c [rpath]/extra-chars.csv -18 -o [rpath]/[outfile] [rpath]/[page-image.tif]
where:

  • [rpath] is a relative path to the file from where ocr4gamera.py is being called.
  • [outfile] is the name of your output file.
  • [page-image.tif] is the name of the page image you want to scan with the OCR Toolkit.
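
For a document with many pages, the same command line can be assembled once per page image. The sketch below only builds the argument lists, using the options shown above. It assumes a modern Python 3 interpreter for the helper script, and the choice to name each output file after its page image with an .xml extension is my own convention, not something the OCR Toolkit requires. Both function names are hypothetical.

```python
from pathlib import Path


def build_ocr_command(training_xml, extra_chars_csv, page_image, output_dir):
    """Return the ocr4gamera.py argument list for a single page image."""
    page_image = Path(page_image)
    # Assumed naming convention: page-001.tif -> page-001.xml
    out_file = Path(output_dir) / (page_image.stem + ".xml")
    return [
        "ocr4gamera.py",
        "-x", str(training_xml),
        "-c", str(extra_chars_csv),
        "-18",
        "-o", str(out_file),
        str(page_image),
    ]


def commands_for_folder(training_xml, extra_chars_csv, image_dir, output_dir):
    """Return one command per .tif file in image_dir, in sorted order."""
    pages = sorted(Path(image_dir).glob("*.tif"))
    return [
        build_ocr_command(training_xml, extra_chars_csv, page, output_dir)
        for page in pages
    ]
```

Each returned list can be handed to subprocess.run as-is, which avoids shell quoting problems with unusual file names.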