What is the point of an online interactive OCR text editor?

The Digitization Project of Kindred Languages is not only about publishing Fenno-Ugric material online; it also aims to support linguistic research by developing purpose-built tools. In this blog entry, Wouter Van Hemel of the National Library of Finland sheds light on the OCR editor, which enables crowdsourced editing of machine-encoded text for the benefit of linguistic research.


The PAS (long-term preservation) team at the Kansalliskirjasto concentrates its efforts on the digital preservation of valuable historical and cultural materials, so that while the physical form of works might deteriorate and even disappear over time, the information therein can live on in the digital realm, accessible to all.

With the OCRUI editor, the Kansalliskirjasto wants to extend this approach to the correction process of OCR material. The process of digitisation and painstaking correction of source material – often in a suboptimal physical state – can be a long and arduous one. Why not share the work? By making the material available in an online multi-user environment, researchers and students – or indeed anyone with an interest in language or history – can improve the quality of the OCR text very quickly, which makes these texts easier to search and to study. Moreover, adequately digitised text enables language researchers to build a corpus or dictionary of languages long forgotten, further advancing the field of language research. As open-source software and organisations such as Wikipedia attest, the internet is a great platform for sharing work and avoiding the duplication of effort inherent in more insular approaches to the problem. We hope the OCR editor will be a humble yet useful tool enabling such contributions.

The ALTO format

Before we dive into the more technical aspects of the software, it is important to have a basic understanding of the ALTO file format: link1 & link2. ALTO is an XML format commonly used for OCR’ed text; it is also supported and even preferred by most software that digitises original manuscripts. Next to the actual recognised characters, it contains language and coordinate attributes and conveys basic layout information such as text lines and text blocks (roughly, paragraphs). Being an XML format, it is trivially extensible to include e.g. spelling suggestions or custom editor marks to exclude certain words when the time comes to build a corpus out of the text.
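
To make the structure concrete, here is a minimal Python sketch that reads the recognised words, their coordinates and their language from an ALTO document. It is not taken from OCRUI: the sample document is invented, the function name extract_words is hypothetical, and the attribute names (CONTENT, HPOS, VPOS, WIDTH, HEIGHT, LANG) follow the ALTO schema as we understand it. Which elements may carry LANG depends on the schema version, so falling back to the enclosing TextBlock is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented sample; real ALTO files carry much more metadata.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1" LANG="fi">
          <TextLine ID="TL1">
            <String ID="S1" CONTENT="kirja" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="50"/>
            <SP/>
            <String ID="S2" CONTENT="raamat" HPOS="520" VPOS="200" WIDTH="150" HEIGHT="50" LANG="et"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per recognised word: text, language and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        block_lang = block.get("LANG")
        for element in block.iter():
            if local_name(element.tag) != "String":
                continue
            words.append({
                "text": element.get("CONTENT", ""),
                # Assumption: a word without its own LANG inherits the block's.
                "lang": element.get("LANG") or block_lang,
                "box": tuple(
                    float(element.get(name, "0"))
                    for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
                ),
            })
    return words


words = extract_words(SAMPLE_ALTO)
for word in words:
    print(word["text"], word["lang"], word["box"])
```

Matching on the local tag name rather than a fixed namespace is a deliberate simplification, since different ALTO versions use different namespace URIs.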

The OCRUI application is in essence an editor for these ALTO XML files.


The software consists of two major components: a front-end and a back-end.

The front-end – or editor – is a JavaScript application that maps ALTO XML attributes such as language and word coordinates to words in a web page. Through its two-pane graphical interface, users can correct wrongly recognised words side by side with the actual scanned imagery, or improve meta-attributes such as the language of individual words.
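
The mapping from ALTO word coordinates to positions on the displayed scan essentially comes down to scaling. The actual editor does this in JavaScript; the sketch below uses Python only to illustrate the arithmetic. The function name scale_box is hypothetical, and it assumes the word coordinates use the same unit as the page dimensions.

```python
def scale_box(box, page_size, display_size):
    """Scale a word box from page coordinates to on-screen pixels.

    box          -- (hpos, vpos, width, height) in page units
    page_size    -- (width, height) of the page in the same units
    display_size -- (width, height) of the image as shown in the browser
    """
    hpos, vpos, width, height = box
    page_width, page_height = page_size
    display_width, display_height = display_size
    scale_x = display_width / page_width
    scale_y = display_height / page_height
    return {
        "left": round(hpos * scale_x),
        "top": round(vpos * scale_y),
        "width": round(width * scale_x),
        "height": round(height * scale_y),
    }


# A 2000 x 3000 page shown at half size.
overlay = scale_box((100, 200, 400, 50), (2000, 3000), (1000, 1500))
print(overlay)
```

With such a rectangle, a word in the text pane can be highlighted on top of the scanned image in the other pane.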

The more visual editor part is supported by a server back-end application written in Python. This component serves the actual data to the editor and handles revisions, users, collection listing and other tasks. It also takes care of “administrative” duties such as locking documents whose editing is considered finished so they can be exported, either as text or as a corpus. Administrators and more technically minded readers may be interested in the hierarchical permissions system: a user is given certain rights in a collection, such as editing OCR, locking documents or administering users, and recursively retains these rights in all child collections and for the documents they contain. This makes it trivial for the Kansalliskirjasto to add an administrator responsible for a collection, who can in turn add users, create other administrators and assign user rights for any sub-collections; this modus operandi will hopefully result in a mostly self-managing and well-organised community.
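
The inheritance of rights can be sketched in a few lines of Python. This is an illustration of the idea only, not the back-end's actual code: the Collection class, the right names and the user names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Collection:
    name: str
    parent: Optional["Collection"] = None
    # user name -> rights granted directly on this collection
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, user, *rights):
        """Give a user one or more rights on this collection."""
        self.grants.setdefault(user, set()).update(rights)

    def rights_for(self, user):
        """Collect the user's rights here and in every ancestor collection."""
        rights = set()
        node = self
        while node is not None:
            rights |= node.grants.get(user, set())
            node = node.parent
        return rights

    def can(self, user, right):
        return right in self.rights_for(user)


# Hypothetical collections and users.
top = Collection("kindred-languages")
sub = Collection("newspapers", parent=top)

top.grant("anna", "edit_ocr", "lock_documents", "administer_users")
sub.grant("ben", "edit_ocr")

print(sub.can("anna", "lock_documents"))  # right inherited from the parent
print(top.can("ben", "edit_ocr"))         # rights do not flow upwards
```

Walking up the parent chain at lookup time means that a right granted high in the tree immediately applies to every sub-collection, without copying anything.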


How do we actually use this editor to correct language materials and build a corpus? What are the steps involved?

  • materials consisting of high resolution scanned images and (preferably) ALTO files are delivered to the team at the Kansalliskirjasto
  • in case the works are delivered “bare” without ALTO files, the images are processed through digitisation software to produce best-effort OCR text
  • the ALTO and image files are imported into the back-end as revision 0 under an appropriate collection, creating JPG images and thumbnails for the web interface in the process
  • an administrator – i.e. a user who can create users and manage document state – is assigned to the containing collection if such a person does not already exist
  • the online editing process begins
  • when the quality of the OCR’ed text has reached satisfactory standards, the administrator can lock a document to prevent further editing and export the end result as text, ALTO XML or PDF
  • the end result can optionally be (re-)imported into one of our PAS (long-term preservation) systems.
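
The life cycle described in the steps above (import as revision 0, edit, lock, export) can be sketched as follows. This is a simplified illustration under our own assumptions, not the real back-end: the Document class and its methods are hypothetical, revisions are plain strings rather than ALTO files, and export is assumed to be allowed only after locking.

```python
class DocumentLocked(Exception):
    """Raised when someone tries to edit a locked document."""


class Document:
    def __init__(self, title, imported_text):
        self.title = title
        # Index 0 holds the imported, uncorrected OCR text.
        self.revisions = [imported_text]
        self.locked = False

    def save_revision(self, corrected_text):
        """Store a new revision and return its number."""
        if self.locked:
            raise DocumentLocked(f"{self.title} is locked")
        self.revisions.append(corrected_text)
        return len(self.revisions) - 1

    def lock(self):
        """Mark the editing as finished; no further revisions are accepted."""
        self.locked = True

    def export_text(self):
        """Return the latest revision of a finished document."""
        if not self.locked:
            raise ValueError("lock the document before exporting it")
        return self.revisions[-1]


doc = Document("sample page", "Tbe qnick brown fox")
doc.save_revision("The qnick brown fox")
latest = doc.save_revision("The quick brown fox")
doc.lock()
exported = doc.export_text()
print(latest, exported)
```

Keeping every revision, including the untouched import, makes it possible to compare or roll back corrections later.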


The development of the OCR editor at the Kansalliskirjasto is in full swing. This month has seen the first significant use of our application, and it is exhilarating to see researchers and laymen alike improve these works by leaps and bounds in such a short time. While the software is still considered “in beta”, the basic concept seems to be holding up fine in the field, and it is yielding valuable feedback from the early adopters. It is always easy to tinker and try things out by yourself in the early stages of software development, but we really appreciate the input of users: it tells us how people are using our tool and what we should prioritise in the development process.

Hence, our roadmap has not been set in stone, but we do have plans. At the moment we are focussing our efforts on the user-friendliness and usability of the application, largely based on feedback from users. We are discussing how to improve the multi-user aspect by enabling communication between users in the form of notes and by avoiding editing conflicts (such as when two users edit the same page at exactly the same time). Development of export functionality has started, after which we will try to tackle the problem of building a corpus and dictionary for specific languages.

So, that concludes this short, slightly more technical overview of the project. I hope to see some of you at our presentation. Keep the suggestions coming!

Invitation to the webinar as PDF here: OCR_Webinar_invitation_

4 thoughts on “What is the point of an online interactive OCR text editor?”

  1. I’m sure that many libraries and archives who deal with digitised text would like to contribute to a project that helps curate the imperfect OCR-derived text. Is the code for the frontend and backend components online? Can others help bugfix and make the editor better?

    • Thanks for your comment, Ben. I will take the liberty of passing this question to my colleague, who is in charge of the technical development of the back-end components. He’ll reply to you here tomorrow, but naturally we encourage you to join the webinar and raise your questions there too.

      / Jussi-Pekka

    • Hello Ben,

      It is our intention to make the source code publicly available in the hope of fostering contributions from other organisations and individuals. Apart from practical considerations such as where and how, for any sort of semi-official “release” the code needs to be cleaned up and documented better so it is easier for others to get their hands dirty if they so please. Consider it a work in progress.

      The front-end source code is (unofficially) online at GitHub:

      I hope we can set up an official Kansalliskirjasto account at GitHub, Gitorious, Bitbucket or elsewhere where we can release the code of some of our projects that might be interesting to others.

      Are you working on something similar yourself?

  2. Thanks for this interesting post. At Wikimedia we already have an interactive OCR text editor, Proofread Page, which powers an entire project, Wikisource (4 million pages in 60 languages, several hundred active proofreaders).
    We’re interested in improving it with pieces of software and experience like yours, so if you know a student with PHP experience, or you can reach the CS students of Helsinki universities, please encourage them to make a proposal for MediaWiki’s Google Summer of Code: https://www.mediawiki.org/w/index.php?title=Mentorship_programs/Possible_projects&oldid=931969#Merge_proofread_text_back_into_Djvu_files
