Helsinki-NLP at the EACL workshops: VarDial and SlavNLP

The Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) celebrated its tenth edition in 2023 with a record number of submissions. Various members of the Helsinki-NLP group have contributed to this success, most notably Jörg Tiedemann as a co-founder of the workshop series, and Yves Scherrer and Tommi Jauhiainen as current main organizers. This year, three research papers were co-authored by researchers from Helsinki (Olli Kuparinen, Aleksandra Miletic, Janine Siewert, Yves Scherrer). Tommi and Yves also contributed to the three shared tasks proposed in this year's evaluation campaign. After two fully online editions and one hybrid edition, the 2023 edition saw growing on-site participation, with about two thirds of talks and posters presented in person.

One of the highlights of VarDial was the panel discussion (initially called “round table”, but we were assigned a fairly small room without tables, let alone round ones) that reflected on the past of VarDial and its future in the era of large language models. Our experts (Antonis Anastasopoulos, Gabriel Bernier-Colborne, Preslav Nakov, Tanja Samardžić, Ivan Vulić) agreed that even though the methods have changed drastically over the last ten years, the VarDial themes are more relevant than ever. VarDial has also been known for its evaluation campaign, which primarily focuses on “hard” language identification problems such as distinguishing between closely related language varieties. The panelists were happy to see the continued interest in this campaign, but also wished for more varied downstream tasks to be included in it. As organizers, we have always strived to propose a wide variety of tasks, but have struggled to attract sufficient numbers of participants. Furthermore, it was highlighted that dialects are first and foremost spoken varieties and that we should therefore focus more on spoken data in the future. The panelists viewed the ongoing consolidation of methods and the dominance of Transformer-based paradigms as a potential silver bullet that would hopefully make it easier for everybody to participate in future shared tasks.

This year, Slavic NLP (formerly Balto-Slavic NLP) included two papers from members of our unit (Roman Yangarber and Anna Dmitrieva), who also contributed to the organization of the workshop and the shared task. The Slav-NER shared task is a multilingual named entity recognition challenge for Czech, Polish, and Russian, which also includes name normalization and entity linking. The top system’s performance on NER and normalization this year reached an F1 score of 90. Entity linking, a more challenging task, had an F1 score of 72-80, while cross-lingual entity linking had an F1 score of ~67, which is a great improvement compared to the previous challenge. The workshop’s best paper, “Resources and Few-shot Learners for In-context Learning in Slavic Languages” (Štefánik et al.), also touched on Polish, Czech, and Russian NER. The authors created an evaluation benchmark for in-context learning in these languages; the supported tasks are NER, classification, QA, and NLI.

Tommi Nieminen joins Helsinki-NLP

My name is Tommi Nieminen, and I recently joined the Helsinki-NLP research group as a new PhD student. For the past two decades, I have worked in the translation industry, starting as a translator and gradually drifting to more technical roles, such as CAT tool support, localization engineering, translation process automation, and machine translation development. Due to my work history, my research focuses mostly on the use of language technology in professional translation.

I have a long history with the University of Helsinki. I enrolled in an MA course in philosophy at the university in 2001, and after a long period of academic absence and part-time study (and a change of disciplines) I finally graduated with an MA in language technology in 2018. Since then I have participated in two academic projects involving the university, Fiskmö and OPUS-MT: Open Translation Models, Tools and Services. In the course of these projects I developed the OPUS-CAT tool, which enables translators to use machine translation models from the OPUS-MT project in their normal working environments. The motivation behind OPUS-CAT is to make open-source machine translation technology and resources directly available to individual translators, so that they may have more control over how machine translation is integrated into their profession.

I am thrilled to be part of the team in Helsinki and the GreenNLP project, and to work on issues that have recently become more significant than ever before. I live far from Helsinki, so I am usually at the university only one day a week. I look forward to meeting those of you whom I have not met yet.

Introducing Elaine Zosa

Hello there! I’m Elaine, a new postdoctoral researcher in the Helsinki-NLP Research Group. To start off, I have not always worked in NLP. I worked in the financial technology sector before I decided to study for a master’s degree. I obtained my MSc in Computer Science at the University of Helsinki, where my concentration was on algorithmic bioinformatics. After that, I was a research assistant in computational genomics at the Technical University of Munich. Then in late 2018, I started my doctoral research at the University of Helsinki, in the Discovery Research Group led by Prof. Hannu Toivonen.

During my PhD, I worked on two EU Horizon 2020 projects: NewsEye (https://www.newseye.eu/) and EMBEDDIA (http://embeddia.eu/). Both these projects involved building tools to help analyse large-scale news collections. In the former, we focused on historical news collections from Finland, France, and Austria, and in the latter, on news media from less-represented European languages such as Finnish, Estonian, and Croatian. I worked on various tasks in the projects and helped develop new methods in topic modeling, lexical semantic change, news headline generation, and multilingual news matching. Methodological innovations aside, these projects exposed me to the inherently interdisciplinary nature of NLP and language technology and that, I think, is the most exciting thing about this field. I enjoy building tools that could be useful to researchers in the humanities and social sciences, and beyond.

Now I am investigating methods to quantify and model uncertainty in various linguistic tasks. You can find out more about my work on my homepage: https://ezosa.github.io/

2022 Steven Krauwer Award for OPUS-MT for Ukrainian

Helsinki-NLP received the 2022 Steven Krauwer Award for CLARIN Achievements for its work on open machine translation for Ukrainian. Thank you very much for this award, and special thanks to everyone who contributed data, software, and help with putting this all together! Let us continue to help people in need, recognizing the importance of open and transparent language technology and the responsibilities we have in society. Thank you!

Donate spoken language data in Finnish and Swedish

The Language Bank of Finland and the Swedish Literary Society in Finland are collecting Finland-Swedish speech data (https://doneraprat.fi) (see YLE article with video: Vill du att röststyrning ska fungera på finlandssvenska? Kom med och donera prat [Do you want voice control to work in Finland-Swedish? Come and donate speech]). Note that the campaign for collecting spoken Finnish also continues (https://lahjoitapuhetta.fi/).

If you wish to know how the database may affect everyday life and how it can be used in research, listen to the YLE podcast “Second Last Word” (YLE pod: Så här lär sig din dammsugare finlandssvensk dialekt [How your vacuum cleaner learns Finland-Swedish dialect]) and read the YLE article Pratande kylskåp och smarta glasögon hjälper dig handla mat – det här är den röststyrda framtiden [Talking refrigerators and smart glasses help you buy food – this is the voice-controlled future].

Language Technology group visible at DHN 2018

The Helsinki Language Technology group had five papers at the recent Digital Humanities in the Nordic Countries (DHN 2018) conference, held on 7–9 March 2018 in Helsinki.

Jörg Tiedemann presenting his paper “Emerging Language Spaces Learned From Massively Multilingual Corpora” at DHN 2018

Distinguished Short Paper

  • Jörg Tiedemann: Emerging Language Spaces Learned From Massively Multilingual Corpora [pdf]

Long Paper

  • Emily Öhman, Kaisla Kajava: Sentimentator: Gamifying Fine-grained Sentiment Annotation [pdf]

Posters

  • Yves Scherrer, Tanja Samardžić: ArchiMob: A multidialectal corpus of Swiss German oral history interviews [pdf]
  • Seppo Nyrkkö: An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM) [pdf]
  • Mika Hämäläinen, Tanja Säily, Eetu Mäkelä: Normalizing Early English Letters for Neologism Retrieval [pdf]