What is the point of an online interactive OCR text editor?

The Digitization Project of Kindred Languages is not only about publishing Fenno-Ugric material online; it also aims to support linguistic research by developing purpose-built tools for it. In this blog entry, Wouter Van Hemel of the National Library of Finland sheds light on the OCR editor, which uses crowdsourcing to enable the editing of machine-encoded text for the benefit of linguistic research.

***

The PAS (long-term preservation) team at the Kansalliskirjasto concentrates its efforts on the digital preservation of valuable historical and cultural materials so that, while the physical form of works might deteriorate and even disappear over time, the information therein can live on in the digital realm, accessible to all.

With the OCRUI editor, the Kansalliskirjasto wants to extend this approach to the correction process of OCR material. The process of digitisation and painstaking correction of source material – often in a suboptimal physical state – can be a long and arduous one. Why not share the work? By making the material available in an online multi-user environment, researchers and students – or indeed anyone with an interest in language or history – can improve the quality of the OCR text very quickly, and hence make these texts easier to search and research. Moreover, adequately digitised text enables language researchers to build a corpus or dictionary of languages long forgotten, further advancing the field of language research. As attested by, for instance, open-source software and organisations such as Wikipedia, the internet makes a great platform for sharing work and avoiding the duplication of effort inherent in more insular approaches to the problem. We hope the OCR editor will be a humble yet useful tool enabling such contributions.

The ALTO format

Before we dive into the more technical aspects of the software, it is important to have a basic understanding of the ALTO file format: link1 & link2. ALTO is an XML format commonly used for OCR’ed text; it is also supported and even preferred by most software that digitises original manuscripts. Besides the actual recognised characters, it contains language and coordinate attributes and conveys basic layout information regarding sentences and paragraphs. Being an XML format, it is trivially extensible to include e.g. spelling suggestions or custom editor marks to exclude certain words when the time comes to build a corpus out of the text.
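To make the format a little more concrete, here is a minimal Python sketch of reading the words out of an ALTO file. It is an illustration only, not code from OCRUI: the sample document is invented, the function names are hypothetical, and the LANG attribute on String elements is assumed to be present (it is optional in ALTO).

```python
import xml.etree.ElementTree as ET

# Invented sample; a real ALTO file carries many more elements and attributes.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String ID="S1" CONTENT="kala" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="30" LANG="fi"/>
            <SP WIDTH="10"/>
            <String ID="S2" CONTENT="vesi" HPOS="190" VPOS="200" WIDTH="85" HEIGHT="30" LANG="fi"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per recognised word: its text, language and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        box = tuple(
            float(element.get(name, 0))
            for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        words.append({
            "text": element.get("CONTENT", ""),
            "lang": element.get("LANG"),  # None when the attribute is absent
            "box": box,                   # (x, y, width, height) on the scan
        })
    return words


words = extract_words(SAMPLE_ALTO)
for word in words:
    print(word["text"], word["lang"], word["box"])
```

The bounding box is what lets an editor place each word next to the matching region of the scanned image.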

The OCRUI application is in essence an editor for these ALTO XML files.

Architecture

The software consists of two major components: a front-end and a back-end.

The front-end – or editor – is a JavaScript application that maps ALTO XML attributes such as language and word coordinates to words in a web page. Through its two-pane graphical interface, users can correct wrongly recognised words side by side with the actual scanned imagery, or improve meta-attributes such as the language of individual words.

The more visual editor part is supported by a server back-end application written in Python. This component serves the actual data to the editor and handles revisions, users, collection listing and other tasks. It also takes care of “administrative” duties such as locking documents whose editing process is considered finished so they can be exported – either as text or corpus. Administrators and more technically minded readers may be interested in the hierarchical permissions system: a user is given certain rights in a collection, such as editing OCR, locking documents or administrating users, and recursively retains these permissions in all child collections and for the documents they contain. This way, it is trivial for the Kansalliskirjasto to add an administrator responsible for a collection, who can then in turn add users, create other administrators and assign users rights for any sub-collections; this modus operandi hopefully results in a mostly self-managing and well-organised community.
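The recursive permission idea can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the back-end's actual data model: the Collection class, the right names and the user name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Collection:
    """A node in the collection tree (hypothetical model for illustration)."""
    name: str
    parent: Optional["Collection"] = None
    # user name -> rights granted directly on this collection
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, user, *rights):
        self.grants.setdefault(user, set()).update(rights)

    def rights_for(self, user):
        """Collect rights granted here and on every ancestor collection."""
        rights = set()
        node = self
        while node is not None:
            rights |= node.grants.get(user, set())
            node = node.parent
        return rights

    def allows(self, user, right):
        return right in self.rights_for(user)


# Hypothetical tree: root -> newspapers -> one yearly sub-collection.
root = Collection("root")
newspapers = Collection("newspapers", parent=root)
year_1925 = Collection("1925", parent=newspapers)

root.grant("anna", "edit_ocr")
newspapers.grant("anna", "lock_documents", "admin_users")

print(year_1925.allows("anna", "lock_documents"))  # inherited from "newspapers"
print(root.allows("anna", "lock_documents"))       # not granted this high up
```

Walking up from a collection to its ancestors means a right only has to be stored once, at the highest collection where it applies.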

Process

How do we actually use this editor to correct language materials and build a corpus? What are the steps involved?

  • materials consisting of high resolution scanned images and (preferably) ALTO files are delivered to the team at the Kansalliskirjasto
  • in case the works are delivered “bare” without ALTO files, the images are processed through digitisation software to produce best-effort OCR text
  • the ALTO and image files are imported into the back-end as revision 0 under an appropriate collection, creating JPG and thumbnails for the web interface in the process
  • an administrator – i.e. a user who can create users and manage document state – is assigned to the containing collection if such a person does not already exist
  • the online editing process begins
  • when the quality of the OCR’ed text has reached satisfactory standards, the administrator can lock a document to prevent further editing and export the end result as text, ALTO XML or PDF
  • the end result can optionally be (re-)imported into one of our PAS (long-term preservation) systems.
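The life cycle of a single document in the steps above can be sketched as follows. This is a toy Python model with hypothetical class and method names, assuming that every correction creates a new revision and that only locked documents are exported; the real back-end is certainly more involved.

```python
from dataclasses import dataclass, field
from typing import List


class DocumentLockedError(Exception):
    """Raised when someone tries to edit a locked document."""


@dataclass
class Document:
    title: str
    # revisions[0] is the imported OCR text; each edit appends a new revision
    revisions: List[List[str]] = field(default_factory=list)
    locked: bool = False

    @classmethod
    def import_ocr(cls, title, words):
        """Import best-effort OCR output as revision 0."""
        return cls(title=title, revisions=[list(words)])

    @property
    def current(self):
        return self.revisions[-1]

    def correct_word(self, index, new_text):
        """Store a corrected copy as a new revision and return its number."""
        if self.locked:
            raise DocumentLockedError(self.title)
        updated = list(self.current)
        updated[index] = new_text
        self.revisions.append(updated)
        return len(self.revisions) - 1

    def lock(self):
        """Administrator step: freeze the document once quality is good enough."""
        self.locked = True

    def export_text(self):
        if not self.locked:
            raise ValueError("lock the document before exporting")
        return " ".join(self.current)


doc = Document.import_ocr("sample page", ["ka1a", "vesi"])  # revision 0
revision = doc.correct_word(0, "kala")                      # a user fixes a word
doc.lock()
print(revision, doc.export_text())
```

Keeping revision 0 untouched means the original OCR output can always be compared with the corrected text.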

Future

The development of the OCR editor at the Kansalliskirjasto is in full swing. This month has seen the first significant use of our application, and it is exhilarating to see researchers and laymen alike improve these works by leaps and bounds in such a short amount of time. While the software is still considered “in beta”, the basic concept seems to be holding up fine in the field, yielding valuable feedback from the early adopters. It is always easy to tinker and try things out by yourself in the early stages of software development, but we really appreciate the input of users, which tells us how people are using our tool and what we should prioritise in the development process.

Hence, our roadmap has not been set in stone – but we do have plans. At the moment we are focussing our efforts on the user-friendliness and usability of the application, largely based on the feedback from users. We are discussing how to improve the multi-user aspect by enabling communication between users in the form of notes and by avoiding editing conflicts (such as when two users edit the same page at exactly the same time). Development of export functionality has started, after which we will try to tackle the problem of building a corpus and dictionary for specific languages.
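As an illustration of the editing-conflict problem, here is one common way to detect it, often called optimistic concurrency: every save states which revision it was based on, and a save based on a stale revision is rejected. This is only a sketch of a general technique with hypothetical names, not a description of what the editor does or will do.

```python
class EditConflict(Exception):
    """Raised when a save is based on an outdated revision of the page."""


class Page:
    def __init__(self, text):
        self.text = text
        self.revision = 0

    def save(self, new_text, base_revision):
        """Accept the edit only if nobody else saved since base_revision."""
        if base_revision != self.revision:
            raise EditConflict(
                f"page is at revision {self.revision}, "
                f"edit was based on revision {base_revision}"
            )
        self.text = new_text
        self.revision += 1
        return self.revision


page = Page("ka1a vesi")

# Two users open the same page at the same time.
base_for_user_a = page.revision
base_for_user_b = page.revision

page.save("kala vesi", base_for_user_a)      # first save succeeds
try:
    page.save("ka1a vesí", base_for_user_b)  # second save is based on old text
except EditConflict as conflict:
    print("conflict:", conflict)
```

The rejected user can then be shown the newer text and asked to reapply the change, so no correction is silently overwritten.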

So, that concludes this short, slightly more technical overview of the project. I hope to see some of you at our presentation, and please keep the suggestions coming!

Invitation to the webinar as PDF here: OCR_Webinar_invitation_
