One of the tasks in the Digitalia project was to work with Named Entity Recognition and to see how that information could be extracted from the digitized text and then enriched further. At the moment the interesting step is the text recognition phase, as the project has created a novel way to process the XML files in order to improve the OCR quality. This step improves the chances for FiNER to pick out the interesting parts of the text, namely the names of persons and locations, which we will eventually integrate into digi.nationallibrary.fi.
Our dedicated machine is running hot, and we might be, at least for Finnish material, the record holder for the number of Tesseract runs.
Not every page is created equal, which is why it is hard to estimate the speed of this process, but the race is on. With enough CPU cores anything is possible!
For details of the process, do read the ICDAR 2018 short paper titled “Re-OCR in Action – Using Tesseract to Re-OCR Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals”.