We recently published the first material produced in the continued Digitisation Project of Kindred Languages in the Fenno-Ugrica collection, a total of 75 monographs in the Mari languages. To discuss this material, we met with Finno-Ugrian researcher Mrs Julia Kuprina, a project researcher at the Morphological Analyzers for Minority Finno-Ugrian Languages project. We spoke with her about the material in the collections, her own research in language technologies, and naturally also the Hill Mari language.
Could you tell us a little about your research background and your current work?
I began my work with Hill Mari, my native language, in Finno-Ugrian studies at the University of Tartu, where I researched verb forms (Hill Mari has five of them) and their use for my seminar essay which later became my Master’s thesis. After graduation I moved to Finland in 2001, but I could not find work in my field, so I got an education in marketing and sales. That has been my occupation for approximately a decade. At the same time, however, I have continued researching Hill Mari in my spare time. During maternity and parental leave I translated children’s books and other texts into my native language.
When I was offered the opportunity to work on my native language just over a year ago, I was delighted to join the research group. Our project manager, Jack Rueter, encouraged me to overcome my initial hesitation towards language technology, and my work started to go well. I am currently involved in glossing (processing lists of lexemes, e.g., translating Hill Mari words into Finnish) and the definition of continuation lexicons. More information can be found on the Giellatekno website maintained by the University of Tromso, more on the work on Hill Mari here.
What questions does research in language technology answer; what is the aim of your project?
Our project is based on open-source language technology and caters to the needs of the general public in our digitised world. We are working on morphological analyzers that can be reused as sophisticated online dictionaries of the languages involved in the project, to help both linguists and language enthusiasts. At the moment, the dictionary of Hill Mari includes approximately 22,000 lexemes. The dictionary also features a morphological analyser, which allows users to look up a word from any of its many inflected forms. Hill Mari school pupils and learners of the language will also surely benefit from the bilingual online dictionary, which includes sample phrases and examples of how the words are used. Language professionals such as teachers and journalists are also served by a derivative of our analyzer in the Voikko proofreading application for Hill Mari. Our current work on the analyzer shows up immediately as improvements in the beta version of the Hill Mari spellchecker, which is available for testing.
The project has also been able to make use of development work at Giellatekno on a browser-based application that makes it possible for anyone fluent in Finnish to read old Finno-Ugrian newspapers or books just by alt-double-clicking on a word in the PDFs. The National Library has now provided a great deal of material in the kindred languages of Finland for open use in the Fenno-Ugrica collection. The online application, a bookmarklet, is already helping users read the material.
The materials in Fenno-Ugrica for both the pilot and continuation stages of the Digitisation Project for Kindred Languages were selected in cooperation with researchers. The intention has been to involve researchers from the planning stage of the project onward, since they have the most comprehensive understanding of the accessibility of relevant materials as well as their research impact. Last August, the Digitisation Project for Kindred Languages arranged a researcher meeting which led to the drafting of a grant application to continue the project. The application was submitted to the funder at the end of September.
Several criteria, defined in cooperation between the National Library, its partner libraries and researchers, were employed in the selection of the materials. The key criterion was the creation and establishment of the contemporary written language. Material from the time when the written language was being established is also important for activists seeking to preserve the language today. Neologisms from the 1920s and 1930s as well as texts that use them serve as both source material and a source of innovation and inspiration for the developers of the contemporary languages. The works were proposed for digitisation so that they would not only represent the innovative 1920s accurately, but also reflect the changes in language policy which occurred in the 1930s.
Mrs Kuprina, you participated in proposing material for digitisation, and many of the works you proposed have now been published in Fenno-Ugrica. What type of language material is this?
At the time I wanted to propose more books in my native language for digitisation, but I’m really grateful to you and the Kone Foundation that we researchers now have as much Hill Mari material at our disposal as we do. It’s a wonderful collection of rare material from the 1920s and 1930s! The Hill Mari language has not been widely researched, so there’s a lot of ground to cover.
Personally I’m interested in the development of the Hill Mari orthography. Its development can be roughly divided into four stages, which emphasises the contrast between the way words in minority languages were written at the beginning of the last century and contemporary spellings. When selecting the material, I based my proposals on books written or edited by Mari people to ensure natural linguistic expression, as free from outside influences as possible. Russia began developing its minority languages in the 1920s, and the books reflect the consequent excitement of discovering and creating words and phrases to describe new concepts.
You mentioned you discover linguistic treasures every day?
Every time I open a new newspaper or monograph, I make fascinating discoveries. My latest discovery was when I read the first official Hill Mari orthography from 1940. I had been waiting to read the book for a long time, and was surprised how similar it was to later orthographies. The end of the 1930s saw a total departure from the previous spellings which were more Mari-based. The efforts to Russify the Mari language are already evident in the 1940 orthography. The word “finger” went from парньа to парня, “to study” from тымэньӓш to тыменяш, and “pants” from йалаш to ялаш, etc.
What significance does the Fenno-Ugrica collection hold for your work? How do you use the collection in your research?
The collection has tremendous significance for my work. I am currently in the process of adding my lexical discoveries to our online dictionary in hopes that they will be used in the future. I’m hoping to find specialist terminology from old textbooks. For example, it would be interesting from the perspective of linguistic history to study the anatomical vocabulary of Hill Mari. We currently have no Hill Mari word for the lens of the eye – it is referred to with the Russian word. This is of course the result of the 1938 decision that banned minority peoples from using terminology in their own languages in school textbooks.
It is important for our online dictionary to discover more (native) words which cannot be found in printed dictionaries. This will also improve the example sentences and phrases. The unique Fenno-Ugrica collection will hold even greater significance for future research. It is also likely that school children will find the textbooks of their great-grandparents interesting along with the other digitised material, since they offer an avenue for analysing their own identity and gaining more information on the history of the Hill Mari region. As a reader, I found the history of my native region springing to life as in a documentary film when I read old Hill Mari newspapers.
Julia Kuprina was interviewed by Jussi-Pekka Hakkarainen
Additional Information Regarding the Project Open-source Language Technology for Uralic Minority Languages:
The project is based on open-source language technology and tries to cater to the needs of the general public in our digitised world. We are working on a sophisticated online dictionary of the languages involved in the project, which aims to help linguists, language enthusiasts and users of the language in general.
In the course of our work, we were surprised to learn how much wider the range of applications for morphological analysers is than we had expected. The plan for 2013-2014 was to create a morphological transducer based on the Giellatekno user interface for the languages covered by our project (Olonets Karelian, Livonian, Tundra Nenets, Moksha, Hill Mari), using Finnish open-source technology, following in the footsteps of work carried out for Finnish and Northern Saami. Every language’s transducer has a lexicon, on the basis of which we create a multi-faceted derivation system.
At the moment, the dictionary of Hill Mari includes approximately 22,000 lexemes and serves as a free web dictionary, in combination with the analyser and the glossing. It can be used for reading Hill Mari websites, including Wikipedia, blog articles and others. The morphological analyser is a good tool for linguists aiming to verify their hypotheses and for language enthusiasts aiming to improve their language skills. Earlier this year we created a version of the Finnish-Hill Mari dictionary including declensional information for the Finno-Ugric Winter School in Szeged, where a Hill Mari course was taught.
Local school children and other learners of the language will also surely benefit from the bilingual online dictionary, which includes sample phrases and examples of how the words are used. Language professionals such as teachers and journalists are also served by the Voikko proofreading application in the languages of the project. The project is currently working on the application and has released a beta version for Linux, MacOS and Windows.
By the end of the year, we will also publish a pedagogical demo, taking solutions created for other languages into consideration.
The project has also worked on a browser-based application which would enable any Finn to read old Finno-Ugrian newspapers or books just by clicking on a source word. The National Library has now provided a great deal of material in the kindred languages of Finland for open use in the Fenno-Ugrica collection. The online application will help users read the material.
Additional information was provided by the project-lead Jack Rueter