OCR’ed text is often of poor quality because of errors during the optical recognition process, especially when the source material is old or otherwise in bad condition. These errors make it hard to rely on the text for building a corpus or word lists, and they make the source material less accessible for study or for incorporation into other tooling for language researchers. This is the problem that our OCR editor tries to eradicate, or at least to help solve.
This post gives a slightly more detailed, though still brief, overview of the technical architecture of Revizor (the OCR editor used in the Digitization Project of Kindred Languages), provides an update on its current status, and shares some challenges that we could use some help with in the future.
The editor takes in ALTO XML files produced by OCR software, together with high-quality images of the source material that aid the user during the manual correction process. The Python back-end of the software serves the data to anybody authorised to make edits. It is also responsible for exporting the end result, which might be the corrected ALTO XML, plain text versions, or word lists built from selected works. The exported product might be further processed by linguistic tools or imported into “corpus” sites specifically meant for facilitating searching and dissemination.
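To make the plain text export more concrete, here is a minimal sketch of reading recognised text out of an ALTO document. It is not Revizor’s actual code: the function names and the sample document are made up for illustration, and it relies only on the fact that ALTO stores each recognised word in the CONTENT attribute of a String element inside a TextLine.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO fragment, used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String ID="S1" CONTENT="Kindred"/>
            <SP/>
            <String ID="S2" CONTENT="languages"/>
          </TextLine>
          <TextLine ID="TL2">
            <String ID="S3" CONTENT="digitisation"/>
            <SP/>
            <String ID="S4" CONTENT="project"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text):
    """Return one plain text string per TextLine (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    print("\n".join(alto_to_lines(SAMPLE_ALTO)))
```

Matching on the local tag name rather than a fixed namespace is a deliberate simplification, since OCR packages emit different ALTO schema versions.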
The front-end of the editor is where manual changes are made by real human beings, as opposed to the automatic bulk processing in the back-end. In a two-pane window, the user can check the original work and make corrections to the text, as well as mark the language or relevance of words; this helps the back-end build accurate word lists per language. This process is especially important for languages with a small corpus, as researchers and OCR software might not have any word lists or dictionaries to work from. Through this feedback loop, the accuracy of the corrected text directly influences the quality of future digitisation.
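As a rough illustration of how such markings could feed the per-language word lists, consider the sketch below. The MarkedWord structure, its field names and the language tags are assumptions made for this example and do not reflect Revizor’s actual data model.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkedWord:
    """A corrected word as it might come out of the editor (hypothetical)."""
    text: str
    language: Optional[str] = None  # tag set by the user, e.g. "fin"
    relevant: bool = True           # False for page numbers, noise, etc.


def build_word_lists(words):
    """Group relevant words by language tag and count their occurrences."""
    lists = defaultdict(Counter)
    for word in words:
        # Skip anything the user marked as irrelevant or left untagged.
        if not word.relevant or word.language is None:
            continue
        lists[word.language][word.text.lower()] += 1
    # Return plain dicts so the result is easy to serialise or compare.
    return {language: dict(counter) for language, counter in lists.items()}


if __name__ == "__main__":
    sample = [
        MarkedWord("Talo", "fin"),
        MarkedWord("talo", "fin"),
        MarkedWord("дом", "rus"),
        MarkedWord("12", "fin", relevant=False),
    ]
    print(build_word_lists(sample))
```

Keeping counts rather than a bare set of words is one possible design choice: frequencies make it easier to spot rare forms that may still contain uncorrected OCR errors.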
Finally, as the allotted time and funds dwindle and the project draws to a close, we realise that although the editor works well, the code has perhaps not received all the features and polish that would be nice to have. We hope that by providing the source of Revizor online, other interested parties can help improve the editor and adapt it to their own specific needs. Apart from general code and layout polish, features that are still being implemented or might not make the deadline are automatic correction of common OCR mistakes, as well as stemming and lemmatisation, or at least smarter ways of narrowing down a large number of conjugated words to a more concise word list without necessarily knowing the intricate details of the source document’s language. Most of these improvements would likely require the help of computational linguists, detailed knowledge of language-specific tools and stemmers, and ideas on how to plug these tools efficiently into the editor’s correction and export process.
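To give an idea of what automatic correction of common OCR mistakes could look like, here is a small sketch, not an existing Revizor feature. It assumes a hand-picked table of character confusions (the pairs below are only examples) and proposes a correction only when a single substitution yields a word that already appears in a known word list, such as one built from earlier corrected texts.

```python
# Example confusion pairs (misread -> intended); these are assumptions,
# a real table would have to be tuned per typeface and language.
CONFUSIONS = {"rn": "m", "1": "l", "0": "o", "vv": "w"}


def suggest_corrections(word, known_words, confusions=CONFUSIONS):
    """Suggest known words reachable by one confusion substitution."""
    suggestions = set()
    for wrong, right in confusions.items():
        start = word.find(wrong)
        while start != -1:
            # Replace only this one occurrence and test the result.
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in known_words:
                suggestions.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(suggestions)


if __name__ == "__main__":
    known = {"modern", "hello"}
    for misread in ("rnodern", "he1lo", "modern"):
        print(misread, suggest_corrections(misread, known))
```

Because candidates are checked against a word list rather than language rules, a sketch like this needs no knowledge of the language itself, but it is only as good as the word list behind it.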
The code of Revizor (OCRUI) and many other projects is or will be made available on the National Library of Finland’s GitHub page.
Wouter van Hemel, Information System Specialist