More Early Modern Word Lists Released by eMOP on Github

Submitted by mchristy on Thu, 10/02/2014 - 12:46

The eMOP team is happy to announce the release of more early modern word lists, which we have compiled, cleaned, and combined over the last two years. Our sources include Ted Underwood, Martin Mueller, Loretta Auvil, the VARD project, and the TCP transcriptions of EEBO and ECCO. Please see our Github page for more information.

The eMOP team has collected early modern word lists over the last two years from a variety of sources for use in eMOP, specifically for creating the dictionaries that the Tesseract OCR engine uses to self-correct during the OCR process. The lists are released on our Github page. We have collected word lists from:

Ted Underwood (The Stone and the Shell), who collected his resource from various sources floating on the web, which he enlarged by "incorporating words common in 18-19c material that seemed to be missing."
Martin Mueller, who sent us a list of words with alternate spellings extracted from MorphAdorner and broken up into time periods.
The Text Creation Partnership (TCP, http://www.textcreationpartnership.org/), which supplied us with their entire corpus of hand-transcribed EEBO and ECCO documents. We were then able to extract the words used in these ~46,000 documents, along with word counts, to construct various lists based on frequency of use.
The VARD tool, from which we were able to extract a large alternate spelling word list from the early modern period.
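
To illustrate the kind of extraction described for the TCP corpus, here is a minimal Python sketch that counts words across a set of transcriptions and keeps those above a frequency cutoff. This is not eMOP's actual code: the tokenizer pattern, the lowercasing, the function names, and the sample texts are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by an
# internal apostrophe or hyphen. eMOP's real tokenization is not documented here.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def count_words(texts):
    """Count lowercased word occurrences across an iterable of document texts."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts

def frequent_words(counts, min_count):
    """Return words seen more than min_count times, most frequent first."""
    return [w for w, n in counts.most_common() if n > min_count]

# Tiny stand-in for the transcribed documents (invented sample text).
docs = ["Loue is loue, and vertue is vertue.", "Vertue and loue."]
counts = count_words(docs)
print(frequent_words(counts, 2))
```

With a real corpus, the texts would be read from the transcription files and the cutoff raised to whatever frequency the list calls for.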

We have tested these word lists, and combinations of them, extensively as dictionaries in our Tesseract training. Dictionaries can be incorporated into Tesseract training to help Tesseract self-correct its OCR transcriptions as it goes. Weighing both results and timing, we have settled on the following files, which produce the best results without increasing Tesseract run-time too much:

FullDict-wVariants-CSU.txt for the full word list: word-dawg.
TCP-words-gt500.txt for the frequently used word list: freq-dawg.
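
For readers who want to build a combined list of their own, the following Python sketch shows one simple way to merge several word lists into a single de-duplicated, sorted list with one word per line. It is only an illustration: the cleaning steps (trimming whitespace, dropping blank entries), the function name, and the sample words are assumptions, and the steps eMOP actually used to produce the files above may differ.

```python
def combine_word_lists(*word_lists):
    """Merge word lists: trim whitespace, drop blanks, remove duplicates, sort."""
    merged = set()
    for words in word_lists:
        for word in words:
            word = word.strip()
            if word:
                merged.add(word)
    return sorted(merged)

# Invented sample data: a base list and a list of variant spellings.
base = ["love", "virtue", ""]
variants = ["loue ", "vertue", "love"]

combined = combine_word_lists(base, variants)
print("\n".join(combined))  # one word per line
```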
