Installing Tesseract on a Mac (OSX 10.8)

Despite finding several pages with instructions on how to install Tesseract, I found that I had to cobble together my own set of instructions using bits and pieces of information I gathered from all of them.
UPDATED - May, 2015: With the assistance of many fantastic participants in various OCR workshops we've held over the last year, these instructions have been updated. The following is what has worked best and most consistently for most people.


Please reference our handy UNIX command cheat sheet for some extra help with the Terminal commands.


Tesseract Setup:

MacPorts:

MacPorts is an open-source software package management tool that makes it relatively easy for Mac users to compile, install and upgrade open-source software and their dependencies. It's a great first step in installing Tesseract on a Mac.

  1. It will be helpful during this install process to be able to see your hidden files (those files and folders that start with a "." and which normally aren't displayed in the Finder or Terminal).
    1. Open a Terminal window
    2. Enter: defaults write com.apple.finder AppleShowAllFiles YES
    3. Relaunch the Finder (for example by entering killall Finder) and reopen any Terminal windows so that the change takes effect.

  2. Install Xcode from the App Store, or from the Mac Developer website if you need an older version.

    Xcode is a Mac Developer application. The version in the App Store (6.3.1) is only for Mac OSX Yosemite 10.10, or later. If you have an older version of the Mac OS then you'll need to create a Mac Developer ID at the link above and then find the appropriate version of Xcode for your OS:

    • OSX Mavericks 10.9: Xcode 6.2
    • OSX Mountain Lion 10.8: Xcode 5.1.1
    • Earlier versions are also available.

    Be sure to install the full Xcode package ("Xcode 6.2") rather than any of the smaller components like command line tools, etc.

    You'll need to accept the Xcode license agreement before you can use it or do some of the following steps:

    1. Open your Applications folder and find the new Xcode app
    2. Open Xcode.
    3. Accept the license agreement.
    4. Close Xcode.

  3. Install MacPorts.

  4. Install code and dependencies for Tesseract:
    1. sudo port install autoconf
    2. sudo port install automake
    3. sudo port install libtool
    4. sudo port install jpeg tiff libpng
    5. sudo port install leptonica

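  5. Finally, make sure MacPorts itself and its list of available ports are up to date: sudo port selfupdate
     To then upgrade any ports that are already installed, run: sudo port upgrade outdated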
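
The installs in steps 4 and 5 can also be scripted as a loop. The following is a minimal sketch rather than part of the original instructions: the run helper and the DRY_RUN variable are hypothetical names introduced here, and by default the script only prints the commands so that you can review them before anything is installed.

```shell
#!/bin/sh
# Ports named in step 4 above, in the same order.
PORTS="autoconf automake libtool jpeg tiff libpng leptonica"

# DRY_RUN is a hypothetical switch: 1 (the default) only prints each command,
# 0 actually runs it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

install_ports() {
    for p in $PORTS; do
        # Stop at the first port that fails to install.
        run sudo port install "$p" || return 1
    done
    # Step 5 above: refresh MacPorts itself and its ports tree.
    run sudo port selfupdate
}

install_ports
```

Saved under a name of your choice (install-ports.sh is only an example), it can be run for real with DRY_RUN=0 sh install-ports.sh.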



Installing Tesseract:

There are a couple of options at this point. Using MacPorts is the easiest and fastest way to install Tesseract. This will install the latest "released" version of Tesseract, which is version 3.02.02. That version works fine, but does not include the code that writes the confidence level of each word (x_wconf) to the hOCR output files. The x_wconf values are necessary for eMOP's post-processing algorithms to work. If you want to use eMOP's hOCR Denoising and/or eMOP's Page Corrector, then you will need to install Tesseract version 3.03. To do that, you will need to install Tesseract from source using SVN.
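
Once Tesseract is installed, one way to check whether your build writes word confidences is to produce an hOCR file and search it for x_wconf. This is a small sketch under that assumption; the function names has_wconf and report_wconf are placeholders introduced here, not part of Tesseract or eMOP.

```shell
#!/bin/sh
# has_wconf FILE
# Succeeds (exit status 0) if the hOCR file contains at least one x_wconf value.
has_wconf() {
    grep -q 'x_wconf' "$1"
}

# report_wconf FILE
# Prints a one-line summary for the given hOCR file.
report_wconf() {
    if has_wconf "$1"; then
        echo "$1: word confidences present"
    else
        echo "$1: no x_wconf values found"
    fi
}
```

To try it, create an hOCR file with something like tesseract page.tif page hocr (page.tif is a placeholder for one of your own images) and pass the resulting output file to report_wconf. The extension of that output file may differ between Tesseract versions, so check what was actually written.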

with MacPorts: [3.02.02]

  1. sudo port install tesseract
  2. You can also install Tesseract's default English language training set (or any other language training set already available here) by running sudo port install tesseract-eng
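
After the MacPorts install, a quick sanity check is to ask Tesseract for its version and for the language data it can find. The sketch below assumes your Tesseract build supports the -v and --list-langs options; tesseract_status is a hypothetical helper name.

```shell
#!/bin/sh
# tesseract_status
# Reports whether the tesseract command is on the PATH and, if so,
# prints its version and the languages it can find.
tesseract_status() {
    if command -v tesseract >/dev/null 2>&1; then
        tesseract -v
        # Assumption: this build supports --list-langs.
        tesseract --list-langs
    else
        echo "tesseract not found on PATH"
    fi
}
```

Call tesseract_status in a new Terminal window. If eng is missing from the list, the tesseract-eng port from step 2 has probably not been installed.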


from Source (SVN): [3.03]

These instructions will install Tesseract in a folder called tesseract-ocr/ in your home folder (/Users/[your-username]/ or "~" or "$HOME" for short).

  1. cd ~

  2. svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
    This will download the Tesseract files into a folder called tesseract-ocr in your home directory
  3. cd tesseract-ocr

  4. sh autogen.sh

    Warning: If the autogen shell script fails because it cannot find aclocal, you can fix it by adding the folder that contains aclocal to your $PATH system variable (adjust the folder below to match your system):
    PATH=$PATH:~/devtools/autotools-bin/bin/; export PATH

  5. ./configure

    If configure is successful you will see something like:

    Configuration is done.
    You can now build and install tesseract by running:
    ...

    If not, then you can scroll up to see where your failure is occurring.

    Warning: If configure fails because it can't find leptonica, then you can create a symlink that will tell the system where leptonica has been installed.
    ln -s /opt/local/include /usr/local/include

  6. make

  7. sudo make install

  8. Test to see if Tesseract installed properly by typing tesseract.

    Warning: If the command cannot be found, then you need to copy the tesseract executable into a folder that's part of the PATH system variable.
    sudo cp -R ./api/tesseract ./api/.libs /opt/local/bin/
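
Steps 3 to 7 above can be wrapped in a single function that stops at the first failing step, which makes it easier to see where a build breaks. This is a sketch only: build_tesseract is a hypothetical name, and the default folder is the checkout location used in these instructions.

```shell
#!/bin/sh
# build_tesseract [SRC_DIR]
# Runs steps 3 to 7 above in order and stops at the first one that fails.
# SRC_DIR defaults to the checkout location used in these instructions.
# The body runs in a subshell, so the caller's working directory is unchanged.
build_tesseract() (
    src="${1:-$HOME/tesseract-ocr}"
    cd "$src" || { echo "cannot enter $src" >&2; exit 1; }
    [ -f autogen.sh ] || { echo "autogen.sh not found in $src" >&2; exit 1; }
    sh autogen.sh    || { echo "autogen.sh failed" >&2; exit 1; }
    ./configure      || { echo "configure failed" >&2; exit 1; }
    make             || { echo "make failed" >&2; exit 1; }
    sudo make install
)
```

Run build_tesseract after the svn checkout in step 2. If it reports a failure, the warnings under the matching step above are the first things to try.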


NOTE: If you read the Tesseract install instructions or paid close attention to the messages displayed during the above steps, you will have seen mention of "make install-langs". I have not been able to get that command to work for quite some time, but it's not really something to be concerned about. All it does is download language training (i.e. typeface with language-specific dictionary) from the Google website and install it in the tessdata/ folder in tesseract-ocr/. We can do the same thing by hand by downloading any language training from various websites (Google Code or eMOP Github, for example) and putting it in the tessdata/ folder as needed.
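
Putting a downloaded training file in place by hand amounts to a copy into tessdata/. The helper below is a hypothetical sketch of that step. It assumes you have already downloaded the .traineddata file yourself and that tessdata/ lives in the source checkout described above.

```shell
#!/bin/sh
# install_traineddata FILE [TESSDATA_DIR]
# Copies an already-downloaded .traineddata file into the tessdata/ folder.
# TESSDATA_DIR defaults to the tessdata/ folder of the source checkout.
install_traineddata() {
    file="$1"
    dest="${2:-$HOME/tesseract-ocr/tessdata}"
    case "$file" in
        *.traineddata) ;;
        *) echo "not a .traineddata file: $file" >&2; return 1 ;;
    esac
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Create the folder if needed, then copy the file into it.
    mkdir -p "$dest" && cp "$file" "$dest/"
}
```

For example, install_traineddata "$HOME/Downloads/eng.traineddata" would copy a file from your Downloads folder (an example location) into ~/tesseract-ocr/tessdata/.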

Check your permissions
Some users may need to change the permissions of the downloaded .traineddata files in the tessdata/ folder in order to use them.

  1. cd ~/tesseract-ocr/tessdata
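  2. Enter ls -l to see the permissions for all files in the folder.
  3. If your .traineddata file has something like -rw-r----- to the left of it, then only its owner and group can read it, and other users and apps cannot.
  4. sudo chmod 777 *.traineddata will give every user and every app permission to do anything with all the .traineddata files in the folder, which will fix any permissions problems you might have. (Read access is all that is needed to use the files, so sudo chmod 644 *.traineddata is a more conservative alternative.)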
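
The permissions check can also be done with find, which lists only the files that need attention. This is a sketch with hypothetical function names. Note that it grants read access only, which is a narrower change than the chmod 777 described above.

```shell
#!/bin/sh
# unreadable_traineddata [DIR]
# Lists .traineddata files in DIR that are not readable by "other" users.
unreadable_traineddata() {
    dir="${1:-$HOME/tesseract-ocr/tessdata}"
    find "$dir" -name '*.traineddata' ! -perm -004
}

# fix_traineddata [DIR]
# Adds read permission for everyone, without granting write or execute.
fix_traineddata() {
    dir="${1:-$HOME/tesseract-ocr/tessdata}"
    find "$dir" -name '*.traineddata' -exec chmod a+r {} +
}
```

If unreadable_traineddata prints nothing, the files are already readable by everyone. If the files belong to another user, fix_traineddata will fail with a permissions error, and the sudo chmod command from step 4 is needed instead.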

TESSDATA_PREFIX

Finally, you have to set the $TESSDATA_PREFIX system variable so that the Tesseract command knows where to find the tessdata/ folder that contains the files it needs to run on the language training you create. Any Tesseract training that you create or download will include a .traineddata file which must be present in the tessdata/ folder, and the parent folder of tessdata/ must be identified by the $TESSDATA_PREFIX system variable.

  • To see the value of the $TESSDATA_PREFIX in your current Terminal session:
    echo $TESSDATA_PREFIX
    It should be blank at this point.

  • To set the value of the $TESSDATA_PREFIX in your current Terminal session:
    export TESSDATA_PREFIX="/Users/[your-username]/tesseract-ocr", or
    export TESSDATA_PREFIX="$HOME/tesseract-ocr"
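  • NOTE: DO NOT use the '~' character as a shortcut to your home directory in the TESSDATA_PREFIX. The shell does not expand '~' inside quotes, so Tesseract would be handed a literal '~' instead of your home folder. Use the whole filepath or $HOME.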

  • Setting the TESSDATA_PREFIX with the export command will only set the system variable for this session of your terminal. To make this a permanent assignment that will be applied every time you open a new terminal window, you can add the above export command to the .profile file in your home directory.
    1. Open your Finder, and go to your home directory (/Users/[your-username]/).
    2. Find the .profile file (which will be visible, but gray, if you did step 1 of the MacPorts section above), and double-click it.
    3. It should open in your default text editor. If not, then select a text editor to open the file with.
    4. Add the above export command to the end of the file and Save.
    5. Open another Terminal window and enter echo $TESSDATA_PREFIX. You should see the correct file path now.
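
The .profile edit can also be made from the Terminal. The sketch below appends the export line only if it is not already there, so it is safe to run more than once. The function names add_tessdata_prefix and check_tessdata_prefix are hypothetical and are not part of Tesseract.

```shell
#!/bin/sh
# add_tessdata_prefix PREFIX [PROFILE]
# Appends an export line for TESSDATA_PREFIX to PROFILE (default: ~/.profile)
# unless the same line is already there.
add_tessdata_prefix() {
    prefix="$1"
    profile="${2:-$HOME/.profile}"
    line="export TESSDATA_PREFIX=\"$prefix\""
    if [ -f "$profile" ] && grep -qxF "$line" "$profile"; then
        echo "already set in $profile"
    else
        printf '%s\n' "$line" >> "$profile"
        echo "added to $profile"
    fi
}

# check_tessdata_prefix
# Succeeds if $TESSDATA_PREFIX names a folder that contains tessdata/.
check_tessdata_prefix() {
    [ -n "${TESSDATA_PREFIX:-}" ] && [ -d "$TESSDATA_PREFIX/tessdata" ]
}
```

For example, add_tessdata_prefix "$HOME/tesseract-ocr" writes the full path to ~/.profile, because $HOME is expanded inside double quotes. In a new Terminal window, check_tessdata_prefix && echo ok then confirms that the variable points at a folder containing tessdata/.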

1 comment

Sarah Allen (@ultrasaurus) has done some excellent work and added some modifications to the installation workflow for those of you who still can't get this working. http://www.ultrasaurus.com/sarahblog/2013/07/building-tesseract-from-sou....

Thanks Sarah