Jörg Tiedemann – Language Technology

26.7.2024

Visit by Dana Roemling

Portrait photo of Dana.

A couple of months ago, I had the amazing opportunity to come to the University of Helsinki as a visiting researcher. This was part of my PhD journey, to familiarise myself further with machine learning. As forensic linguists, we often focus our work on less automated ways of analysis, so I wanted to explore how I can make use of (semi-)automated processes in the forensic context.

Since my PhD is exploring regional linguistic variation and its usefulness in authorship profiling, I wanted to work with someone who focuses their work in language technology on regional varieties. That is why I got in touch with Yves Scherrer and the Corpus-Based Computational Dialectology (CorCoDial) research group. Having read many of the publications coming out of VarDial, learning from and working with people active in this community was an excellent fit for me.

In total, I spent three months in Helsinki, from the end of February to the end of May. This meant that I could join some of the modules being held during that time, which allowed me to learn even more than just through collaboration. Additionally, I was able to join the research seminars held by the Language Technology research group, which meant I had an easy contact point to meet other researchers, but also to introduce my work to the group. Generally, the work culture is excellent and the joint lunch breaks make it easy to get to know people.

The main focus of my visit was on a project within the CorCoDial group. We replicated a study which had come out only a couple of weeks before I arrived. In our experiments, we concentrated on extracting dialect features used by dialect classifiers. This was to evaluate whether dialect classifiers can be explained or interpreted, since, in forensic linguistics, one of the big problems with automated processes is that they are opaque and can’t be used in evidential contexts. We have submitted and presented our work at several conferences, for instance ICLaVE12, and were able to publish a preprint of our work (soon to be published in the NLPAICS proceedings).

Although I had some goals in mind for my time in Helsinki, this visit has exceeded them by far. I have learnt a great deal for my PhD, but I also acquired skills I will be able to use beyond the PhD. I have met outstanding researchers and was welcomed with genuine warmth and enthusiasm. I remain grateful that this visit was possible and I had this opportunity, but most of all I am thankful for the friends I have made and collaborations built for the future.

20.7.2024

Research visit by Michal Štefánik from Masaryk University in Brno

Why come to Helsinki?

I was already familiar with what the Helsinki group does, as I built a lot upon their outcomes in my previous work; I’ve used OPUS for evaluating the distributional robustness of translation models, and in a few of my previous papers, I’ve also used Helsinki-NLP models as my base models. Additionally, I’ve had a chance to attend one of the public talks by Prof. Tiedemann (the group leader), which left a good impression on me.

How was the arrangement?

I wrote a message to Prof. Tiedemann and later talked to Timothee, one of the lab members, which allowed me to align with the group’s topics of interest. Eventually, I picked the dates from February to the end of April: these fitted my plans well, and it was also the time when the days in Helsinki grew longer but there was still enough snow for winter sports.

What was I working on?

There were several interesting directions, but I eventually decided to focus on the topic of modularity in language models, which is what I spent most of my time working on.

I was looking into the effects of the modularization of languages in multilingual machine translation models: how modular language models disentangle language-specific features in their representations, and on the practical side, on the impact of modularization on downstream quality of translation in over 100 languages.

One of the rougher morning rides at the end of April 🙂

How was the stay, work-wise?

Building on so much previous work, the group has great access to both data and compute, which was really helpful in scaling my experiments to so many languages. There is someone to turn to regarding anything resource-related, which meant I essentially never got stuck along the way and stayed focused on my main goals.

Most importantly, though, I think the group has a very friendly culture of open communication about essentially anything. It is perfectly common to ask anyone for help, as well as to ask anyone about what they did over the weekend or if they want to join for a beer after work!

How was the stay outside work?

Helsinki, on the verge of spring (February-March), is a great place for anyone who likes the proximity of ever-present nature and winter sports. If the city were a cross-country ski resort, it would likely be the biggest one in Europe, with around 1,000 km of maintained trails 🙂 I was able to ski around 400 km of them in March. On the other hand, people who don’t like snow and winter might be surprised by 30 cm of new snow at the end of April 🙂 but then perhaps everyone would enjoy Helsinki’s mild summers, with daylight until at least 9:30 pm from April to October.

Helsinki is also a good starting spot for weekend trips nearby: I recommend taking overnight trains to Lapland for aurora and wild nature or a 2-hour ferry ride to Tallinn with not-too-terrible historical sights!

A trip to Tallinn is only a 2-hour ferry ride

14.2.2024 / 26.2.2024

Public examination of Aarne Talman’s PhD thesis

Another FoTran team member, Aarne Talman, will defend the doctoral dissertation entitled “Towards Natural Language Understanding: Developing and Assessing Approaches and Benchmarks”.

The public examination will take place in the Faculty of Arts, University of Helsinki, on 23 February 2024 at 13:00 in the lecture hall at Unioninkatu 33 (room 303). Assistant Professor Vered Shwartz, University of British Columbia, will serve as the opponent, and Jörg Tiedemann as the custos.

The defense will also be streamed on Unitube: https://video.helsinki.fi/unitube/live-stream.html?room=l44

The dissertation will be published in the series Dissertationes Universitatis Helsingiensis (http://hdl.handle.net/10138/568130)

[press release]

15.11.2023 / 14.2.2024

Public examination of Raúl Vázquez’ PhD thesis

Our FoTran team member Raúl Vázquez will defend the doctoral dissertation entitled “Representation Learning in Multilingual Neural Machine Translation”.

The public examination will take place in the Faculty of Arts, University of Helsinki, on 17 November 2023 at 10:00 in the lecture hall at Unioninkatu 33 (room 303). Senior Researcher Dr. Cristina España i Bonet, DFKI, will serve as the opponent, and Jörg Tiedemann as the custos.

The defense will also be streamed on Unitube: https://video.helsinki.fi/unitube/live-stream.html?room=l44

The dissertation will be published in the series Dissertationes Universitatis Helsingiensis (http://hdl.handle.net/10138/566538)

8.11.2023

Helsinki-NLP is busy at the FCAI AI day

The language technology group from Helsinki has a substantial presence at the AI day organized by the Finnish Center for AI (FCAI). Check out the program and see our presentations about

SHROOM, a shared task on hallucinations and related observable overgeneration mistakes
Unsupervised Feature Selection for Effective Parallel Corpus Filtering using OpusFilter
Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging, from the project on Uncertainty-Aware NLP
The OPUS-MT Dashboard: A Toolkit for a Systematic Evaluation of Open Machine Translation Models
Grammatical Error Correction for Sentence-level Assessment in Language Learning
HPLT: High Performance Language Technologies
Releasing the MAMMOTH: a framework for modular multilingual translation models
Is implicit assessment of language learning during practice as accurate as assessment through testing?

Come and see us at the presentations and posters!

26.10.2023

Anna Dmitrieva is the Language Bank of Finland researcher of the month

Our doctoral student, Anna Dmitrieva, has been highlighted as the researcher of the month at the Language Bank of Finland.

Anna is working on automatic text simplification with a focus on Russian and Finnish, and has created freely available resources and tools that can be retrieved from public repositories. One of the recent resources is a Parallel Corpus of Finnish and Easy-to-read Finnish compiled from public news at YLE.

28.8.2023 / 20.9.2023

New team member: Joseph Attieh

Hello everyone!

My name is Joseph Attieh, and I’m thrilled to introduce myself as the newest member joining the Language Technology research group led by Prof. Jörg Tiedemann. As I embark on my PhD journey, my primary focus will be on Green NLP, particularly in the areas of knowledge distillation and modularization.

My academic journey started in Lebanon, where I received my bachelor’s degree in Computer Engineering. Afterwards, I pursued a double master’s degree at Aalto University in Finland and KTH Royal Institute of Technology in Sweden. During the course of my studies, I’ve had the privilege of conducting research at EPFL, where I delved into decentralized and distributed systems. My journey then took me to Munich, where I worked with BMW, navigating the intricate world of data analytics. More recently, I’ve worked at Huawei Technologies in Helsinki, starting as an Automatic Speech Recognition researcher and later transitioning to a full-time NLP researcher role. The challenges and discoveries during these experiences solidified my decision to further my studies and research, leading me to pursue this exciting PhD opportunity.

Outside the world of algorithms and research papers, I cherish my time swimming, a love I owe to the Mediterranean vibes of my Lebanese homeland. Roller and ice skating are my go-to escapes. And when I’m not in the research lab or on the rink, you might find me binge watching a TV show or reading a book, guilty pleasures I wholeheartedly embrace.

I am excited to join the research team and am looking forward to the collaborations, challenges, and innovations ahead. Research in Green NLP has the potential to revolutionize the way NLP models are built and used, and I am eager to contribute to the team’s vision and collaborate with the brilliant minds here.

Thank you for taking a moment to get to know me. I am excited to embark on this shared journey of discovery, challenges, and innovation with all of you.

Best regards,

Joseph Attieh

15.5.2023

Notes from EACL 2023

Hello! I’m Timothee Mickus, one of the postdocs in the FoTran project. I was at this year’s EACL to present a paper we submitted a year ago to TACL. The piece was about throwing linear algebra at Transformers and seeing what comes out. It was my first live conference since the pandemic, so that was a nice change of pace from Zoom-based and Underline-based conferences. It’s always easier to get feedback with live audiences.

I left the conference with a few tentative opportunities to collaborate with people in other labs, some new ideas for experiments, and, of course, numerous papers to read, with topics ranging from biases in annotation practices to the interpretability of contextualization in Transformer embeddings and from unsupervised machine translation evaluation to knowing when to expect failure and success in multitask scenarios.

The keynotes covered a wide variety of topics, and special care had been taken to ensure that the speakers came from different backgrounds. Ed Grefenstette talked about large language models and why instruction-based LMs seemed especially promising from a very NLP-centric standpoint. Kevin Munger provided a more media-studies-oriented take on the same topic, and questioned how we should change our habits (from how we teach to how we write about AI to what new social policies we need) given the rise of chatbots in today’s and tomorrow’s information landscape. Joyce Chai brought an overview of how advancements in NLP impact other AI research fields, and in particular embodied AI.

How do we keep ACL events affordable?

Coming into EACL, I expected that most of this blog post would be about the science. It turned out that EACL 2023 was also out of the ordinary in terms of organization.

One contentious point that was thoroughly discussed during the conference breaks was the price. It turned out to be one of the major points addressed during the business meeting (which was the only session where questions could be directed to the conference organizers). I tend to agree that $800 for a five-day event (without counting the ACL membership dues) is a hefty sum; it contributed to the decisions of some of my colleagues and co-authors to not attend the conference. Next year’s venue has yet to be decided, since next year’s organizers want to explore more affordable options than what they initially had in mind. I would personally appreciate some transparency as to how the conference is funded (how much comes from sponsors? from the ACL dues? from the conference attendees?) and where that funding goes (how much are we paying for the conference venue and coffee break catering? for the conference handbooks? for our EACL-branded tote bags?).

The challenge of organizing EACL 2023

But maybe the most important thing to mention is the repeated rescheduling. The conference was initially supposed to happen in Kiev, and did not happen in Kiev for reasons. The backup venue in Dubrovnik was something of a last-minute change of plans: the original plan once Kiev was ruled out was that the conference would just go online. I suppose this rushed relocation also explains some of the mishaps and communication issues during the final weeks before the conference. To take some concrete examples: the handbook we were provided contained duplicate entries for some of the papers presented at the conference; conversely, some poster sessions were not included. On a more personal note, I had no clear indication as to whether my paper would be presented in an oral or poster session: instead, I was sent a link to the Underline page while it was still under construction, and directed to search for my name on this semi-functional website.

Presentations, posters, findings and all the confusion

This leads me to another unconventional organizational decision: every paper that was presented orally also had a poster presentation slot. This happened to be my case as well. I’m still not fully certain what I think of it: I’m not complaining about the extra opportunity to advertise my work, but it is extra work (both ahead of and during the conference). It’s also somewhat unfair that some of the presenters didn’t get the opportunity, and this is all the more jarring when it comes to Findings papers: officially, Findings are works that are good enough but for which there’s no extra space in the conference itself, and yet organizers did manage to double-book quite a number of papers in the main conference. This means we have a tiered acceptance system: some of us got two presentation slots, some of us only got a poster, and some of us (Findings authors) only got virtual presentations. In and of itself, that’s not necessarily a bad thing, but I’m uncomfortable with this tiered system being left implicit. Let’s hope that future editions of EACL and *ACL conferences will be more transparent on that front.

Where is NLP as a field and EACL as a conference?

The last point I’d like to mention comes from the opening session. As it turns out, most submissions to the European Chapter of the Association for Computational Linguistics did not come from Europe (with a good third of submissions stemming from the USA), and very few actually focused on linguistics. Most works submitted focused on engineering, processing and dataset description rather than syntax, semantics or phonology modeling. While this state of affairs raises the question of what EACL precisely is, it also highlights how things have changed over the last few years: NLP has become a more international and engineering-focused field of research. Despite these changes, it’s good to see that the community remains very directly involved in deciding where it is going, be it local volunteers in Dubrovnik picking up the ball at the last minute to ensure that EACL would not be an online-only event, or attendees openly discussing whether our current conference practices are a good fit for the community.

17.4.2023

CrowdMT workshop at EAMT in June

Are you tired of closed-source NLP models and the dependency on commercial services that drain us users of all our data and input?

Join our efforts to build transparent and open NLP with data and models that can be shared and used by anyone! Together with colleagues from the MaCoCu project, the HPLT project is organizing a workshop on Open Community-Driven Machine Translation, CrowdMT, at EAMT in Tampere in June.

27.3.2023

Tommi Nieminen joins Helsinki-NLP

My name is Tommi Nieminen, and I recently joined the Helsinki-NLP research group as a new PhD student. For the past two decades, I have worked in the translation industry, starting as a translator and gradually drifting to more technical roles, such as CAT tool support, localization engineering, translation process automation, and machine translation development. Due to my work history, my research focuses mostly on the use of language technology in professional translation.

I have a long history with the University of Helsinki. I enrolled on an MA Philosophy course in the university in 2001, and after a long period of academic absence and part-time study (and a change of disciplines) I finally graduated with an MA in language technology in 2018. Since then I have participated in two academic projects involving the university, Fiskmö and OPUS-MT: Open Translation Models, Tools and Services. In the course of these projects I developed the OPUS-CAT tool, which enables translators to use machine translation models from the OPUS-MT project in their normal working environments. The motivation behind OPUS-CAT is to make open-source machine translation technology and resources directly available to the individual translators, so that they may have more control over how machine translation is integrated into their profession.

I am thrilled to be part of the team in Helsinki and the GreenNLP project, and to work on issues that have recently become more significant than ever before. I live far from Helsinki, so I am usually at the university only one day a week. I look forward to meeting all of you that I have not met yet.