We are getting there

In 2015, our mission has been to improve the use and usability of digitized content. During the project, we have developed methods that refine the raw data for further use, especially in linguistic research. We have finally reached the stage at which we can start producing word lists from the digitized and edited content, and we have now released some word lists (arranged by frequency and “all hits”) in 12 languages in Fenno-Ugrica.
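For readers who want a feel for what a frequency-ordered word list involves, here is a minimal Python sketch. It is not the procedure used for the Fenno-Ugrica lists: the function name `build_word_list`, the sample sentence and the simple tokenization rule (lowercase the text and treat every run of letters or digits as a word) are all assumptions made for illustration only.

```python
import re
from collections import Counter


def build_word_list(text):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    # Assumption: a "word" is any run of Unicode letters or digits.
    # Real corpora usually need language-specific tokenization rules.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then by the word itself.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample text, not taken from the collection.
sample = "Kala ui. Kala söi. Lintu lauloi."
for word, count in build_word_list(sample):
    print(f"{count}\t{word}")
```

A list built this way is only as good as the text behind it, which is why the corrections described next matter.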

What’s behind this action? The machine-encoded text (OCR) quite often contains too many mistakes to be used as such in research. Therefore, the mistakes in OCR texts must be corrected. To enhance the OCR texts, the National Library of Finland developed Revizor, an open-source text editor that enables the editing of machine-encoded text for the benefit of linguistic research.

It was necessary to implement this tool because these rare and peripheral prints often include obsolete characters. Modern OCR software developers sadly neglect such characters, yet they belong to the historical context of the kindred languages and are thus an essential part of the linguistic heritage.

In 2015, we have conducted a series of proofreading campaigns with Revizor. To enhance the digitized content, the National Library of Finland has assigned the tasks to the crowd. We have described this method as nichesourcing, a specific type of crowdsourcing in which the tasks are distributed amongst a small crowd of citizen scientists (communities). Although communities provide smaller pools of resources to draw on, their specific richness in skill suits the complex tasks and high-quality product expectations found in nichesourcing.

Communities have a purpose and an identity, and their regular interactions engender social trust and reputation. These communities can respond to the needs of research more precisely. Instead of repetitive and rather trivial tasks, we are trying to utilize the knowledge and skills of citizen scientists to produce high-quality results. The outcomes of our nichesourcing aspirations are now available in Fenno-Ugrica in the form of word lists.

In my opinion, our mission has not yet been completed, but we are heading towards the goal. One should treat these word lists as a mere sample of all the digitized content rather than as a full description of each language. They consist mostly of educational material from the 1920s to the 1950s. This means that there is plenty of work ahead of us, and we are constantly looking for opportunities to improve our procedures and the data. In addition to the word lists, we will release the data of the edited works in Korp. The master images and all XML files (edited and unedited) will also be available on Fenno-Ugrica later this year.

If you have any enquiries about our procedures or the generated word lists, please don’t hesitate to contact us for additional information: kk-fennougrica@helsinki.fi

Yours &c.,
Jussi-Pekka
