The National Library of Finland (NLF) has been awarded funding by the Kone Foundation to carry out the Minority Languages Project in 2016. The project will produce digitized materials mostly in Uralic languages, but also in Yiddish and Romani. NLF's objective is to make sure that new corpora in these languages are made available for open and interactive use by both the academic community and the language communities.
The project supports language documentation
Language documentation through digitization is an emerging field within linguistics. NLF sees its role in language documentation as that of an organization able to carry out work that is often nearly impossible for individual researchers or even larger research groups. NLF considers itself a producer of raw, or primary, data that supports linguistic research and language revitalization.
The primary data, in this context the digitized items, are made available to a wider range of users through the open-access Fenno-Ugrica collection. Because NLF clears any copyrights, access to the material can be provided to different audiences, such as native-speaker and research communities. NLF also has a strong tradition of long-term data preservation, and the Minority Languages Project follows the national definitions for long-term preservation, releasing the data in recommended and acceptable file formats together with administrative and structural metadata and material packaging. This procedure ensures that the digitized content is available to potential users now and in the future.
NLF carried out the Digitization Project of Kindred Languages in 2012-2015. The methods of international co-operation, digitization, citizen science and language documentation developed during that project will be used in this follow-up project.
Materials to be digitized
During the Minority Languages Project in 2016, the NLF will digitize around 600 monographs, close to 50 periodicals and approximately 10,500 manuscript pages in the following languages:
- Inari Sami
- Northern Sami
- Southern Sami
The majority of the literature was originally published in the Soviet Union in the 1920s and 1930s, an era when many Uralic languages were turned into a medium of popular education, enlightenment and dissemination of information pertinent to the developing political agenda of the Soviet state.
The ‘deluge’ of popular literature in the 1920s and 1930s suddenly challenged the lexical and orthographic norms of the limited ecclesiastical publications from the 1880s. Newspapers were now written in orthographies and word forms that the locals would understand. Textbooks were written to address the separate needs of adults and children. New concepts were introduced into the languages.
The Romani literature from the early Soviet period reflects the same pattern, but in addition to the Russian prints in Romani (Xaladytka), we will also digitize manuscripts from the mid- and late 19th century in Scandoromani and Finnish Kalo (Finnish Romani) in order to track the early stages of their development.
Yiddish differs from the Romani and Uralic languages: the NLF has selected 20 textbooks on various subjects to reflect the use of Yiddish in the interwar period in the Soviet Union. In addition to these textbooks, NLF will digitize 16 issues of the Pravda newspaper in Yiddish from October-November 1918 and around 140 prose works by Yiddish-language authors from the 1880s to the 1910s. The majority of the prose works were originally printed in Vilnius and Warsaw, but there is also a good number of literary pieces from Odessa, Piotrków Trybunalski, Saint Petersburg, Chisinau, Pińsk, Minsk and Zhytomyr.
Most of the material for the Minority Languages Project is held in libraries and archives outside Finland. During the project, the National Library of Finland will digitize the required materials from the collections of the National Library of Russia, the Literary Museum of Estonia, the National Library of Sweden, Kungliga Vitterhetsakademien and the Archives of Peter the Great Museum of Anthropology and Ethnography. Through co-operation with these partners, previously unstudied or less-studied material will be made available to the general public and academia.
In addition, some Sami-language material from NLF's Lapponica collection will also be digitized. See the attachments below for more detailed information.
Language Technology and Crowdsourcing
The Minority Languages Project is also linked to research in language technology. The aim is to improve the use and usability of digitized content. During the project, NLF will develop methods for refining the raw data for further use, especially in linguistic research.
Machine-encoded text (OCR) quite often contains too many mistakes to be used as such in research, so the mistakes must be corrected. To improve the OCR texts, the National Library of Finland developed an open-source editor, Revizor, which enables the editing of machine-encoded text for the benefit of linguistic research.
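One common way to quantify how many mistakes an OCR text contains is the character error rate: the edit distance between the OCR output and a corrected version, divided by the length of the corrected version. The following is a minimal sketch of that measure only. It is not part of Revizor, and the function names `edit_distance` and `character_error_rate` are hypothetical, chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edit distance normalized by the length of the corrected text."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)
```

Comparing a page before and after manual correction with `character_error_rate(ocr_page, corrected_page)` gives a rough indication of how much editing that page required; a value near zero means the raw OCR was already close to the corrected text.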
To enhance the digitized content, the NLF will assign tasks to the crowd. NLF has described this method as nichesourcing, a specific type of crowdsourcing in which tasks are distributed among a small crowd of citizen scientists (communities). Although communities provide smaller pools from which to draw resources, their specific richness in skill suits the complex tasks with high-quality product expectations found in nichesourcing.
Communities have purpose and identity, and their regular interactions engender social trust and reputation. These communities can respond to the needs of research more precisely. Instead of repetitive and rather trivial tasks, we are trying to use the knowledge and skills of citizen scientists to provide qualitative results.
Some selection must be made, since NLF does not aim to correct all digitized content but to give citizen scientists assignments that precisely fill gaps in linguistic research. A typical task would be editing and collecting words in fields of vocabulary where researchers need more information.
From the perspective of nichesourcing, it is essential that altruism plays a central role when language communities are involved. Our goal with nichesourcing is to reach a level of interplay where the language communities benefit from the results. For instance, the corrected words in Ingrian could be added to the online dictionary, which is freely available for the public and society to benefit from. This objective of interplay can be understood as an aspiration to support endangered languages and the maintenance of linguistic diversity, but also as a servant of ‘two masters’, research and society.
App_1_Monographs_Sami_NLR (National Library of Russia)
App_2_Periodicals_Monographs_Sami_NLF (National Library of Finland)
App_3_Manuscripts_Romani_NBA (National Board of Antiquities; National Library of Sweden; Eesti Kirjandusmuuseum; Kungliga Vitterhetsakademien)
App_4_Monographs_Romani_NLR (National Library of Russia)
App_5_Monographs_Romani_NLF (National Library of Finland)
App_6_Monographs_Komi_NLR (National Library of Russia)
App_7.1_Monographs_Yiddish_NLR (National Library of Russia)
App_7.2_Monographs_Yiddish_NLF (National Library of Finland)
App_8_Monographs_Nenets_NLR (National Library of Russia)
App_9_Journals_NLR (National Library of Russia) *decision pending
App_10_Newspapers_NLR (National Library of Russia)
App_11_Manuscripts (Archives of Peter the Great Museum of Anthropology and Ethnography – the Kunstkamera) *decision pending
Contact Details and Additional Information
Jussi-Pekka Hakkarainen (Mr.)
Digitization Project of Kindred Languages
National Library of Finland
P.O. Box 15 (Fabianinkatu 35)
00014 University of Helsinki
+358 2941 40793