CIFU XII, Day 3

The day was dedicated to our symposium, Language Technology through Citizen Science, which consisted of nine fine presentations. These either (1) presented open-source language technology achievements and tools directed at the documentation of minority Uralic languages through the application of Citizen Science methods and crowdsourcing, or (2) presented and developed innovations for advancing the use of Citizen Science and crowdsourcing in open-source language technology.

I am not going to analyze all the individual presentations in this blog entry; I will merely try to give a brief overview of my thoughts on some selected discussions.

As the first speaker was unable to attend, we decided to have a brief discussion with the attendees in order to use the available time properly. Jeremy Bradley also gave a brief introduction to interface development for linguistic tools. This discussion reminded me a lot of the paper by Joris Van Zundert, presented at DH2015 in Sydney a few weeks ago. The key question for Van Zundert, in my interpretation, was that since many scholarly decisions are made by the code, it is vital to understand what is inside it. In the DH field, it is probably hard to combine textual criticism and critical code studies. According to Van Zundert, there is a distinction between scholarly code and code for scholars, and there is an emerging trend to combine the two fields. But it takes two to tango: people in DH (researchers) should get better acquainted with code, and the people developing the code should not define themselves as mere servants of scholars, as they have been educated to do. I firmly believe that this is the approach to take when designing interfaces, which should meet the needs of a hard-core linguist on the one hand and the curiosity of a layman on the other. This issue should also be discussed in my home organization, especially as we try to provide services for research.

Due to the absence of the first speaker, I had the pleasure of opening the symposium. As usual, I briefly introduced our project and the Fenno-Ugrica collection to the audience, but I mainly tried to focus on the concept of nichesourcing. I have been rallying for this method for the past 18 months or so, but unfortunately the results have been disappointing so far. Despite the noble aspiration of engaging with the language communities, we have not managed to get enough material edited for it to be usable in research. Perhaps we need to look for other methods in order to accelerate the production of edited data. The slides of my presentation can be retrieved here.

Heidi (and Tommi) Jauhiainen presented their research project The Finno-Ugric Languages and the Internet, whose aim is to build an automated system that searches the Internet for text written in small Uralic languages. Heidi Jauhiainen also showcased their Wanca collection, which is available in beta at the moment; native speakers and scholars are invited to create a personal account and help them verify the language labels assigned to the pages by their automatic language identifier. The site is mesmerizing! It is great to browse the collection and find text extracts in so many Finno-Ugric languages, like Kven, Livvi, Nganasan or Ludic. It would be nice to help them find an audience to ease their work. Ask what you could do for them and drop a line to Heidi for additional information: name.surname@helsinki.fi

Antti Kanner of the University of Helsinki presented a paper titled Multilingual terminology work and lexicography on virtual open source collaboration platforms. Kanner introduced the possibilities offered by wiki platforms for terminology and lexicography work and presented two relevant projects: the Bank of Finnish Terms in Arts and Sciences (BFT) and the lexicographical wiki platform sanat.csc.fi, which so far contains only a Ludic Karelian dictionary. What interested me in his presentation was the use of the crowd to enrich both the terminology of the BFT and the Ludic dictionary. The former reminds me of the concept of nichesourcing, since they have engaged professionals in every field to define the terms. A good example of qualitative crowdsourcing.

The remaining speakers of the symposium were:

  • Esther Simon (Research Institute for Linguistics, Hungarian Academy of Sciences) Language technology support for Finno-Ugric digital communities.
  • Marja-Liisa Olthuis and Trond Trosterud (University of Tromsø) Aanaar Saami e-lexicography
  • Tommi Pirinen (Dublin City University) Omorfi – A free and open source lexical database for computational linguistics of Finnish through combination of expert and crowdsourced data
  • Sven-Erik Soosaar (The University of Tartu) Creating open source language technology for Tundra Nenets – development problems and future prospects
  • Jeremy Bradley (Ludwig Maximilian University of Munich) A corpus-based analysis of syntactic structures: Postpositional constructions in Mari
  • Jack Rueter (University of Helsinki) On the development of open-source morphological analyzers for Uralic minority languages

On my own behalf, I would like to thank all the presenters as well as the many attendees for their active participation. It was a pleasure, and I hope to see you all again soon.

Yours &c.,
Jussi-Pekka
