Anis Moubarik, an information system specialist at the National Library and a member of the DPKL team, will walk you through what happens to a digitized book in our post-production process. During the project, Anis has been in charge of creating both the OCR’ed PDFs that are available in our Fenno-Ugrica collection and the ALTO XML files for each book, which are made available for editing in Revizor, the text editor for enhancing the data.
OCR’ed text is often lacking in quality because of errors during the optical recognition process, especially when the source material is old or otherwise in a bad state. These errors make it hard to rely on the text for building a corpus or word lists, and they make the source material less accessible for study or for incorporation into other tooling for language researchers. Our OCR editor tries to eradicate this problem, or at least to contribute a possible solution to it.
Thanks for participating in our OCR Webinar on the 5th of March 2014. Below you will find all four presentations given during the webinar, the transcript of the chat comments, and a link to the webinar recording.
- Hakkarainen: An Introduction to the OCR Webinar: Making the Impact on Research and Society
- Van Hemel: OCRUI: Interface for the correction of OCR text material
- Rueter: Subversioned OCR Editing, an Opening for Community Involvement
- Vanhasalo: Kotus Experience: Crowdsourcing the Old Literary Finnish for the Research’s Benefit
- Chat Transcript from OCR Webinar
- Webinar recording (Opens in Adobe Connect)
If you have further enquiries, please don’t hesitate to leave a comment on our KIWI page or contact us by e-mail: email@example.com
The Digitization Project of Kindred Languages is not only about publishing Fenno-Ugric material online; it also aims to support linguistic research by developing tools built for that purpose. In this blog entry, Wouter Van Hemel of the National Library of Finland sheds light on the OCR editor, which enables crowdsourced editing of machine-encoded text for the benefit of linguistic research.