News – Language Technology

26.7.2024

Visit by Dana Roemling

Portrait photo of Dana. A person presenting as a woman with bold black glasses, blonde long hair and facial piercings. They wear a black jumper and smile at the camera.

A couple of months ago, I had the amazing opportunity to come to the University of Helsinki as a visiting researcher. This was part of my PhD journey, to familiarise myself further with machine learning. As forensic linguists, we often focus our work on less automated ways of analysis, so I wanted to explore how I can make use of (semi-)automated processes in the forensic context.

Since my PhD is exploring regional linguistic variation and its usefulness in authorship profiling, I wanted to work with someone who focuses their work in language technology on regional varieties. That is why I got in touch with Yves Scherrer and the Corpus-Based Computational Dialectology (CorCoDial) research group. Having read many of the publications coming out of VarDial, learning from and working with people active in this community was an excellent fit for me.

In total, I spent three months in Helsinki, from the end of February to the end of May. This meant that I could join some of the modules being held during that time, which allowed me to learn even more than just through collaboration. Additionally, I was able to join the research seminars held by the Language Technology research group, which meant I had an easy contact point to meet other researchers, but also to introduce my work to the group. Generally, the work culture is excellent and the joint lunch breaks make it easy to get to know people.

The main focus of my visit was on a project within the CorCoDial group. We replicated a study which had come out only a couple of weeks before I arrived. In our experiments, we concentrated on extracting dialect features used by dialect classifiers. This was to evaluate whether dialect classifiers can be explained or interpreted, since, in forensic linguistics, one of the big problems with automated processes is that they are opaque and can’t be used in evidential contexts. We have submitted and presented our work at several conferences, for instance ICLaVE12, and were able to publish a preprint of our work (soon to be published in the NLPAICS proceedings).

Although I had some goals in mind for my time in Helsinki, this visit has exceeded them by far. I have learnt a great deal for my PhD, but I also acquired skills I will be able to use beyond the PhD. I have met outstanding researchers and was welcomed with genuine warmth and enthusiasm. I remain grateful that this visit was possible and I had this opportunity, but most of all I am thankful for the friends I have made and collaborations built for the future.

20.7.2024

Research visit by Michal Štefánik from Masaryk University in Brno

Why come to Helsinki?

I was already familiar with what the Helsinki group does, as I had built a lot upon their results in my previous work; I’ve used OPUS for evaluating the distributional robustness of translation models, and in a few of my previous papers, I’ve also used Helsinki-NLP models as my base models. Additionally, I’ve had a chance to attend one of Prof. Tiedemann’s (the group leader) public talks, which left a good impression on me.

How was the arrangement?

I wrote a message to Prof. Tiedemann and later talked to Timothee, one of the lab members, which allowed me to align with the group’s topics of interest. Eventually, I picked the dates from February to the end of April: these fitted my plans well, and it was also the time when the days in Helsinki grew longer but there was still enough snow for winter sports.

What was I working on?

There were several interesting directions, but I eventually decided to focus on the topic of modularity in language models, which is what I spent most of my time working on.

I was looking into the effects of modularizing languages in multilingual machine translation models: how modular language models disentangle language-specific features in their representations and, on the practical side, how modularization affects downstream translation quality in over 100 languages.

One of the rougher morning rides at the end of April 🙂

How was the stay, work-wise?

Building on so much previous work, the group has great access to both data and compute, which was really helpful in scaling my experiments to so many languages. There is someone to turn to regarding anything resource-related, which meant I essentially never got stuck along the way and could stay focused on my main goals.

Most importantly, though, I think the group has a very friendly culture of open communication about essentially anything. It is perfectly common to ask anyone for help, as well as to ask anyone about what they did over the weekend or if they want to join for a beer after work!

How was the stay outside work?

Helsinki, on the verge of spring (February-March), is a great place for anyone who likes the proximity of ever-present nature and winter sports. If the city were a cross-country ski resort, it would likely be the biggest one in Europe, with around 1,000 km of maintained trails 🙂 I was able to ski around 400 km of them in March. On the other hand, people who don’t like snow and winter might be surprised by 30 cm of new snow at the end of April 🙂 but then perhaps everyone would enjoy Helsinki’s mild summers, with daylight until at least 9:30 pm from April to October.

Helsinki is also a good starting spot for weekend trips nearby: I recommend taking overnight trains to Lapland for aurora and wild nature or a 2-hour ferry ride to Tallinn with not-too-terrible historical sights!

A trip to Tallinn is only a 2-hour ferry ride

14.2.2024 / 26.2.2024

Public examination of Aarne Talman’s PhD thesis

Another FoTran team member, Aarne Talman, will defend the doctoral dissertation entitled “Towards Natural Language Understanding: Developing and Assessing Approaches and Benchmarks”.

The public examination will take place in the Faculty of Arts, University of Helsinki, on 23 February 2024 at 13:00 in the lecture hall at Unioninkatu 33 (room 303). Assistant Professor Vered Shwartz, University of British Columbia, will serve as the opponent, and Jörg Tiedemann as the custos.

The defense will also be streamed on Unitube: https://video.helsinki.fi/unitube/live-stream.html?room=l44

The dissertation will be published in the series Dissertationes Universitatis Helsingiensis (http://hdl.handle.net/10138/568130).

[press release]

15.11.2023 / 14.2.2024

Public examination of Raúl Vázquez’ PhD thesis

Our FoTran team member Raúl Vázquez will defend the doctoral dissertation entitled “Representation Learning in Multilingual Neural Machine Translation”.

The public examination will take place in the Faculty of Arts, University of Helsinki, on 17 November 2023 at 10:00 in the lecture hall at Unioninkatu 33 (room 303). Senior Researcher Dr. Cristina España i Bonet, DFKI, will serve as the opponent, and Jörg Tiedemann as the custos.

The defense will also be streamed on Unitube: https://video.helsinki.fi/unitube/live-stream.html?room=l44

The dissertation will be published in the series Dissertationes Universitatis Helsingiensis (http://hdl.handle.net/10138/566538).

8.11.2023

Helsinki-NLP is busy at the FCAI AI day

The language technology group from Helsinki has a substantial presence at the AI day organized by the Finnish Center for AI (FCAI). Check out the program and see our presentations about:

SHROOM, a shared task on hallucinations and related observable overgeneration mistakes
Unsupervised Feature Selection for Effective Parallel Corpus Filtering using OpusFilter
Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging from the project on Uncertainty-Aware NLP
The OPUS-MT Dashboard – A Toolkit for a Systematic Evaluation of Open Machine Translation Models
Grammatical Error Correction for Sentence-level Assessment in Language Learning
HPLT: High Performance Language Technologies
Releasing the MAMMOTH – a framework for modular multilingual translation models
Is implicit assessment of language learning during practice as accurate as assessment through testing?

Come and see us at the presentations and posters!

26.10.2023

Anna Dmitrieva is the Language Bank of Finland researcher of the month

Our doctoral student, Anna Dmitrieva, has been highlighted as the researcher of the month at the Language Bank of Finland.

Anna is working on automatic text simplification with a focus on Russian and Finnish, and has created freely available resources and tools that can be retrieved from public repositories. One of the recent resources is a Parallel Corpus of Finnish and Easy-to-read Finnish compiled from public news at YLE.

28.8.2023 / 20.9.2023

New team member: Joseph Attieh

Hello everyone!

My name is Joseph Attieh, and I’m thrilled to introduce myself as the newest member joining the Language Technology research group led by Prof. Jörg Tiedemann. As I embark on my PhD journey, my primary focus will be on Green NLP, particularly in the areas of knowledge distillation and modularization.

My academic journey started in Lebanon, where I received my bachelor’s degree in Computer Engineering. Afterwards, I pursued a double master’s degree at Aalto University in Finland and KTH Royal Institute of Technology in Sweden. During the course of my studies, I’ve had the privilege of conducting research at EPFL, where I delved into decentralized and distributed systems. My journey then took me to Munich, where I worked with BMW, navigating the intricate world of data analytics. More recently, I’ve worked at Huawei Technologies in Helsinki, starting as an Automatic Speech Recognition researcher and later transitioning to a full-time NLP researcher role. The challenges and discoveries during these experiences solidified my decision to further my studies and research, leading me to pursue this exciting PhD opportunity.

Outside the world of algorithms and research papers, I cherish my time swimming, a love I owe to the Mediterranean vibes of my Lebanese homeland. Roller and ice skating are my go-to escapes. And when I’m not in the research lab or on the rink, you might find me binge watching a TV show or reading a book, guilty pleasures I wholeheartedly embrace.

I am excited to join the research team and am looking forward to the collaborations, challenges, and innovations ahead. Research in Green NLP has the potential to revolutionize the way NLP models are built and used, and I am eager to contribute to this vision and collaborate with the brilliant minds here.

Thank you for taking a moment to get to know me. I am excited to embark on this shared journey of discovery, challenges, and innovation with all of you.

Best regards,

Joseph Attieh

19.6.2023

Helsinki-NLP at TALN 2023

TALN (Traitement Automatique des Langues Naturelles) is an annual conference covering topics from computational linguistics, NLP and speech processing organized by ATALA, the French association for NLP. This year’s edition took place from 5 to 9 June 2023 in Paris and was co-located with CORIA (Conférence en Recherche d’Information et Applications).

Two of our members participated: Timothee Mickus and Aleksandra Miletić. More details below!

Presentation on low-resource languages
Aleksandra presented a paper on lemmatization experiments for Occitan in a session dedicated to low-resource settings. During the coffee break there was an interesting exchange of ideas with researchers working on Alsatian, another minority language spoken in France.

Timothee won Best Thesis Award
Timothee won ATALA’s Best Thesis Award for 2023 for his thesis On the Status of Word Embeddings as Implementations of the Distributional Hypothesis. ATALA’s Best Thesis Award is given to outstanding PhD theses defended in a francophone university during the last two years. Exceptionally, this year there were two awardees, the other being Tobias Mayer for his thesis Fouille d’arguments à partir des essais cliniques. The jury congratulated both authors on the high quality of their work.

Timothee’s thesis was praised in particular for the depth of its theoretical and practical contributions. Timothee’s presentation during a plenary session prompted a long and lively discussion, showing the interest of the community in the topic.

Aleksandra was elected to the Board of ATALA
During the conference, the elections for the Board of ATALA also took place. Aleksandra was among the members elected for a period of three years. The Board coordinates ATALA’s various activities, such as the organization of the TALN conference, the editing of the TAL journal, the administration of the LN mailing list, the organization of workshops and the awarding of the Best Thesis Award.

19.6.2023

Jörg Tiedemann elected as the president of NEALT

Jörg Tiedemann will succeed Barbara Plank as the next president of the Northern European Association for Language Technology (NEALT). The elections were held during the NoDaLiDa conference in the Faroe Islands (22-24 May 2023). Jörg will take office in January 2024 for a period of two years.

NEALT is a scientific, non-profit and non-political association focused on promoting language technologies developed in the Northern European countries and for the languages spoken in those countries. During his presidency, Jörg will be giving special attention to the following topics:

Supporting the organisation of the next edition of NoDaLiDa
NoDaLiDa (Nordic Conference on Computational Linguistics) is a biennial event showcasing work on all aspects of computational linguistics, NLP and speech processing, including interdisciplinary work with neighbouring disciplines such as linguistics or psychology. The next iteration will be a jubilee: it will mark the 25th edition of the conference.

Promoting NEJLT
NEJLT (Northern European Journal of Language Technology) publishes high quality peer-reviewed research in computational linguistics and language technology for all natural languages. NEJLT’s goal is to provide an alternative to the high-pace publication process typical of the top language technology conferences. The hope is that making journal publications more accessible will reduce the reviewing load of the community and have a positive effect on the quality of published research.

Improving NEALT’s visibility, in particular for its Special Interest Groups (SIGs)
NEALT currently has five SIGs, which serve as forums for communities working on different topics to exchange ideas and share resources. The existing SIGs are dedicated to the Nordic Constraint Grammar, technology infrastructures in the North European region, language technology-related teaching, intelligent computer-assisted language learning, and high-performance computing for NLP in the Nordics. NEALT is also open to establishing new SIGs: instructions for starting a new one are available here.

Community building
All of NEALT’s activities above strive to support collaborations between people and teams working on language technologies in the Nordic countries. In the coming years, the association will push for more shared projects, co-organized scientific events and joint publications across the region.

Supporting young researchers and students
In the 2023 edition of NoDaLiDa, more than half of the accepted papers were student papers (i.e., the first author was a student). An important aspect of NEALT’s activity will be ensuring support for young researchers and especially students to attend scientific events such as NoDaLiDa.

24.5.2023

Helsinki-NLP at the EACL workshops: VarDial and SlavNLP

The Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) celebrated its tenth edition in 2023 with a record number of submissions. Various members of the Helsinki-NLP group have contributed to this success, most notably Jörg Tiedemann as a co-founder of the workshop series, and Yves Scherrer and Tommi Jauhiainen as current main organizers. This year, three research papers were co-authored by researchers from Helsinki (Olli Kuparinen, Aleksandra Miletić, Janine Siewert, Yves Scherrer). Tommi and Yves also contributed to the three shared tasks proposed in the evaluation campaign this year. After two fully online editions and one hybrid edition, the 2023 edition was characterized by growing on-site participation, with about two thirds of talks and posters presented in person.

One of the highlights of VarDial was the panel discussion (initially called “round table”, but we were assigned a fairly small room without tables, let alone round ones) that reflected on the past of VarDial and its future in the era of large language models. Our experts (Antonis Anastasopoulos, Gabriel Bernier-Colborne, Preslav Nakov, Tanja Samardzic, Ivan Vulic) agreed that even though the methods have changed drastically over the last ten years, the VarDial themes are more relevant than ever. VarDial has also been known for its evaluation campaign, which primarily focuses on “hard” language identification problems such as those between closely related language varieties. The panelists were happy to see the continued interest in this campaign, but also wished for more varied downstream tasks to be included in it. As organizers, we have always strived to propose a wide variety of tasks, but have struggled to attract sufficient numbers of participants. Furthermore, it was highlighted that dialects are first and foremost spoken varieties and that we should therefore focus more on spoken data in the future. The panelists viewed the ongoing consolidation of methods and the dominance of Transformer-based paradigms as a potential silver bullet that would hopefully make it easier for everybody to participate in future shared tasks.

This year, Slavic NLP (formerly Balto-Slavic NLP) included two papers from members of our unit (Roman Yangarber and Anna Dmitrieva), who also contributed to the organization of the workshop and the shared task. The Slav-NER shared task is a multilingual named entity recognition challenge for Czech, Polish, and Russian, which also includes name normalization and entity linking. The top system’s performance on NER and normalization this year reached an F1 score of 90. Entity linking, a more challenging task, had an F1 score of 72-80, while cross-lingual entity linking had an F1 score of ~67, which is a great improvement compared to the previous challenge. The workshop’s best paper, “Resources and Few-shot Learners for In-context Learning in Slavic Languages” (Štefánik et al.), also touched on Polish, Czech, and Russian NER. The authors created an evaluation benchmark for in-context learning for these languages; the supported tasks are NER, classification, QA, and NLI.