The Digitization Project of Kindred Languages is not only about publishing Fenno-Ugric material online; it also aims to support linguistic research by developing purpose-built tools. In this blog entry, Wouter Van Hemel of the National Library of Finland sheds light on the OCR editor, which enables crowdsourced editing of machine-encoded text for the benefit of linguistic research.
The PAS (long-term preservation) team at the Kansalliskirjasto concentrates its effort on the digital preservation of valuable historical and cultural materials, so that while the physical form of works might deteriorate and even disappear over time, the information therein can live on in the digital realm, accessible to all.
With the OCRUI editor, the Kansalliskirjasto wants to extend this approach to the correction process of OCR material. The process of digitisation and painstaking correction of source material – often in a suboptimal physical state – can be a long and arduous one. Why not share the work? By making the material available in an online multi-user environment, researchers and students – or indeed anyone with an interest in language or history – can improve the quality of the OCR text very quickly, and hence make these texts easier to search and research. Moreover, adequately digitised text enables language researchers to build a corpus or dictionary of languages long forgotten, further advancing the field of language research. As attested by, for instance, open-source software or organisations such as Wikipedia, the internet makes a great platform for sharing work and avoiding the duplication of effort inherent in more insular approaches to the problem. We hope the OCR editor will be a humble yet useful tool enabling such contributions.
The ALTO format
Before we dive into the more technical aspects of the software, it is important to have a basic understanding of the ALTO file format: link1 & link2. ALTO is an XML format commonly used for OCR’ed text; it is also supported and even preferred by most software that does the digitisation of original manuscripts. Besides the actual recognised characters, it contains language and coordinate attributes and conveys basic layout information regarding sentences and paragraphs. Being an XML format, it is trivially extensible to include e.g. spelling suggestions or custom editor marks to exclude certain words when the time comes to build a corpus out of the text.
The OCRUI application is in essence an editor for these ALTO XML files.
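To make the format a little more concrete, here is a minimal sketch of reading and correcting words in an ALTO file with Python's standard library. The sample fragment is hand-written for illustration (it is not taken from the project and is not schema-validated), the namespace shown is the one used by ALTO version 2, and the helper names `read_words` and `correct_word` are hypothetical rather than part of OCRUI.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 2; other ALTO versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# A tiny hand-written ALTO fragment (illustrative only).
SAMPLE_ALTO = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="40">
            <String ID="S1" CONTENT="Hyvaa" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP HPOS="280" VPOS="200" WIDTH="20"/>
            <String ID="S2" CONTENT="päivää" HPOS="300" VPOS="200" WIDTH="200" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def read_words(root):
    """Return (id, text, (hpos, vpos, width, height)) for every String element."""
    words = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        box = tuple(float(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        words.append((s.get("ID"), s.get("CONTENT"), box))
    return words


def correct_word(root, string_id, new_text):
    """Overwrite the CONTENT of the String with the given ID; True if found."""
    for s in root.iterfind(".//alto:String", ALTO_NS):
        if s.get("ID") == string_id:
            s.set("CONTENT", new_text)
            return True
    return False


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_ALTO)
    correct_word(root, "S1", "Hyvää")  # fix a typical OCR error
    print(" ".join(text for _, text, _ in read_words(root)))
```

Because the coordinates stay attached to each word, an editor can show the corrected text next to the matching region of the scanned image.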
The software consists of two major components: a front-end and a back-end.
The more visual editor part is supported by a server back-end application in Python. This component serves the actual data to the editor and handles revisions, users, collection listing and other tasks. It also takes care of “administrative” duties such as locking documents whose editing process has been considered finished so they can be exported – either as text or corpus. Administrators and more technically minded readers may be interested in the hierarchical permissions system: a user is given certain rights, such as editing OCR, locking documents or administering users in a collection, and will recursively retain these permissions in all child collections and for the documents they contain. This way, it is trivial for the Kansalliskirjasto to add an administrator responsible for a collection, who in turn can add users and even create other administrators and assign users rights for any sub-collections; this modus operandi hopefully results in a mostly self-managing and well-organised community.
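The inheritance rule of the permissions system can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the actual back-end code: the `Collection` class, the right names and the user names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Collection:
    """Hypothetical collection node; rights granted here apply to all children."""
    name: str
    parent: Optional["Collection"] = None
    # user name -> rights granted directly on this collection
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, user: str, *rights: str) -> None:
        self.grants.setdefault(user, set()).update(rights)

    def rights_for(self, user: str) -> Set[str]:
        """Union of the rights granted here and on every ancestor collection."""
        rights: Set[str] = set()
        node: Optional[Collection] = self
        while node is not None:
            rights |= node.grants.get(user, set())
            node = node.parent
        return rights

    def allows(self, user: str, right: str) -> bool:
        return right in self.rights_for(user)


# Illustrative names only.
top = Collection("top-level collection")
sub = Collection("sub-collection", parent=top)
top.grant("alice", "edit_ocr", "lock_documents", "administer_users")
sub.grant("bob", "edit_ocr")

print(sub.allows("alice", "lock_documents"))  # alice inherits from the parent
print(top.allows("bob", "edit_ocr"))          # bob's right does not flow upwards
```

Walking up the parent chain at lookup time keeps grants in one place; a real system might instead cache or precompute the effective rights.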
How do we actually use this editor to correct language materials and build a corpus? What are the steps involved?
- materials consisting of high resolution scanned images and (preferably) ALTO files are delivered to the team at the Kansalliskirjasto
- in case the works are delivered “bare” without ALTO files, the images are processed through digitisation software to produce best-effort OCR text
- the ALTO and image files are imported into the back-end as revision 0 under an appropriate collection, creating JPG and thumbnails for the web interface in the process
- an administrator – i.e. a user who can create users and manage document state – is assigned to the containing collection if such a person does not already exist
- the online editing process begins
- when the quality of the OCR’ed text has reached satisfactory standards, the administrator can lock a document to prevent further editing and export the end result as text, ALTO XML or PDF
- the end result can optionally be (re-)imported into one of our PAS (long-term preservation) systems.
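The steps above amount to a small document life cycle: import as revision 0, edit, lock, export. A minimal sketch of that life cycle in Python follows. The class and method names are hypothetical, and the rule that a document must be locked before export is an assumption drawn from the description here, not a statement about how the real back-end enforces it.

```python
class DocumentLockedError(Exception):
    """Raised when someone tries to edit a locked document."""


class Document:
    """Hypothetical document with a simple revision history."""

    def __init__(self, name, ocr_text):
        self.name = name
        self.revisions = [ocr_text]  # index 0 is the imported OCR text
        self.locked = False

    def save(self, text):
        """Store a corrected version and return its revision number."""
        if self.locked:
            raise DocumentLockedError(f"{self.name} is locked")
        self.revisions.append(text)
        return len(self.revisions) - 1

    def lock(self):
        """Done by an administrator once the quality is satisfactory."""
        self.locked = True

    def export_text(self):
        """Export the latest revision; assumed to require a locked document."""
        if not self.locked:
            raise ValueError("lock the document before exporting")
        return self.revisions[-1]


doc = Document("example-page", "Hyvaa paivaa")  # revision 0: raw OCR output
doc.save("Hyvää paivaa")                        # revision 1
doc.save("Hyvää päivää")                        # revision 2
doc.lock()
print(doc.export_text())
```

Keeping every revision means the original OCR output is never lost and any correction can be reviewed or reverted.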
The development of the OCR editor at the Kansalliskirjasto is in full swing. This month has seen the first significant use of our application, and it is exhilarating to see researchers and laymen alike improve these works by leaps and bounds in such a short amount of time. While the software is still considered “in beta”, the basic concept seems to be holding up fine in the field, yielding valuable feedback from the early adopters. It is always easy to tinker and experiment on your own in the early stages of software development, but we really appreciate the input of users so we know how people are using our tool and what we should prioritise in the development process.
Hence, our roadmap has not been set in stone – but we do have plans. At the moment we are focussing our efforts on the user-friendliness and usability of the application, largely based on the feedback from users. We are discussing how to improve the multi-user aspect by enabling communication between users in the form of notes and by avoiding editing conflicts (such as when two users edit the same page at exactly the same time). Development of export functionality has started, after which we will try to tackle the problem of building a corpus and dictionary for specific languages.
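How editing conflicts will be handled is still under discussion, so the following is only one possible approach, not a description of what OCRUI does. A common technique is optimistic concurrency: each save states which revision it was based on, and the server rejects it if the page has moved on in the meantime. All names in this sketch are hypothetical.

```python
class EditConflict(Exception):
    """Raised when a save is based on an outdated revision."""


class PageStore:
    """Hypothetical in-memory store mapping page ids to (revision, text)."""

    def __init__(self):
        self._pages = {}

    def add(self, page_id, text):
        self._pages[page_id] = (0, text)

    def load(self, page_id):
        """Return (revision, text) so the editor knows what it started from."""
        return self._pages[page_id]

    def save(self, page_id, text, base_revision):
        """Accept the edit only if nobody else saved since base_revision."""
        current, _ = self._pages[page_id]
        if base_revision != current:
            raise EditConflict(
                f"page {page_id} is at revision {current}, "
                f"but the edit was based on revision {base_revision}"
            )
        self._pages[page_id] = (current + 1, text)
        return current + 1


store = PageStore()
store.add("p1", "Hyvaa paivaa")

# Two users open the same page at the same time.
rev_a, _ = store.load("p1")
rev_b, _ = store.load("p1")

store.save("p1", "Hyvää paivaa", base_revision=rev_a)  # first save wins
try:
    store.save("p1", "Hyvaa päivää", base_revision=rev_b)
except EditConflict as exc:
    print("conflict:", exc)  # second user must reload and merge
```

The second user loses nothing: the editor can reload the newer revision and let them reapply their correction on top of it.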
So, that concludes this short, slightly more technical overview of the project. I hope to see some of you at our presentation, and please keep the suggestions coming!
Invitation to the webinar as PDF here: OCR_Webinar_invitation_