October OCR Testing & Training
eMOP progress continues as our team experiments to find the best method for training Tesseract to recognize various early modern fonts. The new Franken+ tool, developed by eMOP graduate student Bryan Tarpley, has passed through the alpha testing phase and dramatically improves our ability to create a variety of training sets for Tesseract. Now we're hard at work investigating various methods for creating “training sets” for Tesseract to see which will give us the best OCR results.
Our workflow for training Tesseract has been established and is in full gear as we create training sets to test with Tesseract. The workflow, in brief:
- Identify documents that contain typefaces we want to train Tesseract to recognize.
eMOP collaborator Todd Samuelson and his team at Texas A&M University's Cushing Memorial Library create or purchase page images of 12-20 pages of identified documents and send them to us. Todd Samuelson will also be traveling to London, Amsterdam, and Antwerp next month to finalize our font history research and collect specimen sheets for identified typefaces.
- Process page images of those documents in Aletheia.
Aletheia attempts to auto-identify each glyph on the page images, and our crack team of undergraduate students, led by eMOP graduate student Kathy Torabi, double-checks the results. Aletheia outputs an XML file for each page that contains the Unicode value and a set of box coordinates for each glyph identified in the corresponding page image.
- Ingest the Aletheia box files into Franken+.
Using the metadata supplied by Aletheia, Franken+ cuts out the image of each glyph using its box coordinates and organizes the images by Unicode value. The font editor window of Franken+ then allows users to view all the glyphs identified as a particular character--so all the lower-case a's, or all the upper-case L's, etc. That makes it easy for our team to identify mislabeled characters and to select only the best-quality images to use for training. Previous testing revealed that Tesseract’s accuracy improves when it is trained using the platonic ideal of each glyph.
- Create tiff/box pairs for Tesseract training.
Using a sample text document as a base, Franken+ will create tiff images and corresponding box files (plain-text files that, like Aletheia's XML, record a character and box coordinates for each glyph, but in a different coordinate system) using only the glyph samples that were selected.
- Train Tesseract.
Franken+ can also automate the Tesseract training procedure so the user can create a Tesseract .traineddata file using the tiff/box file pairs it created. This saves the user from having to perform each of the training steps by hand, making it easy for our team to quickly create and test a variety of training scenarios for Tesseract.
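As a rough illustration of the grouping and box-file steps above, the sketch below reads per-glyph metadata, groups the boxes by character, and writes lines in Tesseract's box-file layout (character, left, bottom, right, top, page, with the origin at the bottom-left of the image). The XML here is a simplified, hypothetical stand-in, not Aletheia's actual schema, and the function names are our own; this is not how Franken+ itself is implemented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified glyph metadata (NOT Aletheia's real schema).
# x/y give the top-left corner of each glyph, measured from the top of the page.
SAMPLE_XML = """<Page height="100">
  <Glyph unicode="a" x="10" y="20" w="8" h="12"/>
  <Glyph unicode="L" x="30" y="15" w="10" h="18"/>
  <Glyph unicode="a" x="50" y="21" w="8" h="11"/>
</Page>"""


def group_glyphs(xml_text):
    """Return (page_height, {character: [(x, y, w, h), ...]})."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for glyph in root.iter("Glyph"):
        box = tuple(int(glyph.get(key)) for key in ("x", "y", "w", "h"))
        groups[glyph.get("unicode")].append(box)
    return int(root.get("height")), dict(groups)


def to_box_line(char, box, page_height, page=0):
    """Convert a top-left-origin box into a bottom-left-origin box-file line."""
    x, y, w, h = box
    left, right = x, x + w
    top = page_height - y            # flip the y axis
    bottom = page_height - (y + h)
    return f"{char} {left} {bottom} {right} {top} {page}"


height, groups = group_glyphs(SAMPLE_XML)
box_lines = [
    to_box_line(char, box, height)
    for char, boxes in groups.items()
    for box in boxes
]
print("\n".join(box_lines))
```

Grouping by character first mirrors the font editor view, where every sample of a given letter can be reviewed together before any of them is used for training.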
Using the above workflow, we have created training sets for Tesseract from several different typefaces, including both roman and italic variants. We have also been able to easily combine several typefaces into single Tesseract training sets using Franken+.
Kathy Torabi and her team are continuing to experiment with different variables in the training process--using different numbers of exemplar glyphs for each character, combining training for different typefaces, and removing ligatures--to see how each affects Tesseract's ability to recognize text while OCRing. Once Tesseract's character recognition is as accurate as we can make it, we will add dictionaries to Tesseract's training. Previous testing has shown that the incorporation of period-specific word lists (that include many variant spellings common to documents of this period) increases Tesseract's accuracy, as it can use the dictionaries to correct some of its own character recognition errors. Each percentage point closer to our OCR correctness goal is important, as it could be the difference between finding “cafe” in a search instead of “case.”
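To make the dictionary idea concrete, here is a minimal sketch of how a word list can rescue a misread character. It is a stand-alone post-correction illustration under our own assumptions, not a description of how Tesseract uses dictionaries internally; the confusion table and the tiny word list are hypothetical.

```python
from itertools import product

# Hypothetical confusion table: characters an OCR engine might swap,
# e.g. a long s misread as "f". Each key maps to the letters to try.
CONFUSABLE = {"f": "fs", "s": "sf"}


def correct(token, word_list):
    """Return a word-list match reachable by swapping confusable letters,
    or the token unchanged if there is none."""
    if token in word_list:
        return token
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    for candidate in map("".join, product(*options)):
        if candidate in word_list:
            return candidate
    return token


period_words = {"case", "house", "finde"}   # toy period word list
print(correct("cafe", period_words))
print(correct("houfe", period_words))
```

Because the number of candidates doubles with every confusable letter in a token, a real pipeline would need a smarter search than this brute-force loop.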
While we are still processing typefaces and investigating various training variables, our team is making progress. After this short period of intensive testing, we should have a number of our Tesseract training questions answered and our OCR workflow established.