Brief intro to Tesseract

13.01.1844 Maamiehen Ystävä no 2 s. 4

13.01.1844 Maamiehen Ystävä no 2 s. 4

If you sometimes wonder all the tools that to relate to text recognition of e.g. digitized newspapers, you might have heard of a tool called Tesseract. Tesseract is a open-source tool for Optical Character Recognition (OCR).

In this post we go briefly through how to setup and use Tesseract.

Setup of Tesseract in unix-type system

Read below, or utilize the complete Tesseract instructions at Github.

  • Set-up Tesseract pre-requisite leptonica
    ./configure –with-jpeg –with-libpng –with-libtiff
  • make
  • make install

Then onwards to Tesseract itself.  With –prefix you can install tesseract to pre-defined location and if leptonica is in custom location add a path to it via the –with-extra-libraries. In normal case you can manage with just plain configure.

Obtain Tesseract source from github

  • /configure –prefix=$TESSERACTPATH –with-extra-libraries=$LEPTONICALIBS
  • make
  • make install

Then grab and extract the needed language files and ensure you have TESSDATA_PREFIX pointing to the tessdata directory where the language files reside.

Running tesseract

tesseract sortavala_10568340.jpg -l deu_frak stdout

This runs tesseract against an image, utilizing deu_frak  (German fraktur) as the language. As a result you get extracted text from the image to the standard output.


3 results of Tesseract

Because there is good quality digitalisation and the extract is just from one article in one column, the result is okish when just glancing the result. Even for this tiny example, it is possible to see in the image above some of the differences with different languages, which Tesseract supports. However, it depends of the usage purpose if the quality is good enough. There are also ways to do OCR evaluation like e.g. Succeed’s ocrevaluation tool , which could give clear numbers between alternatives — albeit requiring a clean sample for the basis.

For more comprehensive analysis of text recognition You can find a detailed analysis of the quality of the digitized collections from an earlier paper .