New team member: Joseph Attieh

Hello everyone!

My name is Joseph Attieh, and I’m thrilled to introduce myself as the newest member joining the Language Technology research group led by Prof. Jörg Tiedemann. As I embark on my PhD journey, my primary focus will be on Green NLP, particularly in the areas of knowledge distillation and modularization.

My academic journey started in Lebanon, where I received my bachelor’s degree in Computer Engineering. Afterwards, I pursued a double master’s degree at Aalto University in Finland and KTH Royal Institute of Technology in Sweden. During the course of my studies, I’ve had the privilege of conducting research at EPFL, where I delved into decentralized and distributed systems. My journey then took me to Munich, where I worked with BMW, navigating the intricate world of data analytics. More recently, I’ve worked at Huawei Technologies in Helsinki, starting as an Automatic Speech Recognition researcher and later transitioning to a full-time NLP researcher role. The challenges and discoveries during these experiences solidified my decision to further my studies and research, leading me to pursue this exciting PhD opportunity.

Outside the world of algorithms and research papers, I cherish my time swimming, a love I owe to the Mediterranean vibes of my Lebanese homeland. Roller and ice skating are my go-to escapes. And when I’m not in the research lab or on the rink, you might find me binge-watching a TV show or reading a book, guilty pleasures I wholeheartedly embrace.

I am excited to join the research team and am looking forward to the collaborations, challenges, and innovations ahead. Research in Green NLP has the potential to revolutionize the way NLP models are built and used, and I am eager to contribute to the team’s vision and to collaborate with the brilliant minds here.

Thank you for taking a moment to get to know me. I am excited to embark on this shared journey of discovery, challenges, and innovation with all of you.

Best regards,

Joseph Attieh

Helsinki-NLP at TALN 2023

TALN (Traitement Automatique des Langues Naturelles) is an annual conference covering topics in computational linguistics, NLP and speech processing, organized by ATALA, the French association for NLP. This year’s edition took place from 5 to 9 June 2023 in Paris and was co-located with CORIA (Conférence en Recherche d’Information et Applications).

Two of our members participated: Timothee Mickus and Aleksandra Miletić. More details below!

Presentation on low-resourced languages
Aleksandra presented a paper on lemmatization experiments for Occitan in a session dedicated to low-resource settings. During the coffee break there was an interesting exchange of ideas with researchers working on Alsatian, another minority language spoken in France.

Timothee won Best Thesis Award
Timothee won ATALA’s Best Thesis Award for 2023 for his thesis “On the Status of Word Embeddings as Implementations of the Distributional Hypothesis”. ATALA’s Best Thesis Award is given to outstanding PhD theses defended in a francophone university during the last two years. Exceptionally, this year there were two awardees, the other being Tobias Mayer for his thesis “Fouille d’arguments à partir des essais cliniques”. The jury congratulated both authors on the high quality of their work.

Timothee’s thesis was praised in particular for the depth of its theoretical and practical contributions. His presentation during a plenary session prompted a long and lively discussion, showing the community’s interest in the topic.

Aleksandra was elected to the Board of ATALA
During the conference, the elections for the Board of ATALA also took place. Aleksandra was among the members elected for a period of three years. The Board coordinates ATALA’s various activities, such as the organization of the TALN conference, the publication of the TAL journal, the administration of the LN mailing list, the organization of workshops and the awarding of the Best Thesis Award.


Jörg Tiedemann elected as the president of NEALT

Jörg Tiedemann will succeed Barbara Plank as the next president of the Northern European Association for Language Technology (NEALT). The elections were held during the NoDaLiDa conference in the Faroe Islands (22-24 May 2023). Jörg will take office in January 2024 for a period of two years.

NEALT is a scientific, non-profit and non-political association focused on promoting language technologies developed in the Northern European countries and for the languages spoken in those countries. During his presidency, Jörg will be giving special attention to the following topics:

Supporting the organisation of the next edition of NoDaLiDa
NoDaLiDa (Nordic Conference on Computational Linguistics) is a biennial event showcasing work on all aspects of computational linguistics, NLP and speech processing, including interdisciplinary work with neighbouring disciplines such as linguistics or psychology. The next iteration will be a jubilee: it will mark the 25th edition of the conference.

Promoting NEJLT
NEJLT (Northern European Journal of Language Technology) publishes high-quality peer-reviewed research in computational linguistics and language technology for all natural languages. NEJLT’s goal is to provide an alternative to the fast-paced publication process typical of the top language technology conferences. The hope is that making journal publications more accessible will reduce the reviewing load of the community and have a positive effect on the quality of published research.

Improving NEALT’s visibility, in particular for its Special Interest Groups (SIGs)
NEALT currently has five SIGs, which serve as forums for communities working on different topics to exchange ideas and share resources. The existing SIGs are dedicated to the Nordic Constraint Grammar, technology infrastructures in the North European region, language technology-related teaching, intelligent computer-assisted language learning, and high-performance computing for NLP in the Nordics. NEALT is also open to establishing new SIGs and provides instructions for starting one.

Community building
All of NEALT’s activities above strive to support collaborations between people and teams working on language technologies in the Nordic countries. In the coming years, the association will be pushing for more shared projects, co-organized scientific events and joint publications across the region.

Supporting young researchers and students
In the 2023 edition of NoDaLiDa, more than half of the accepted papers were student papers (i.e., the first author was a student). An important aspect of NEALT’s activity will be ensuring support for young researchers and especially students to attend scientific events such as NoDaLiDa.

Helsinki-NLP at the EACL workshops: VarDial and SlavNLP

The Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) celebrated its tenth edition in 2023 with a record number of submissions. Various members of the Helsinki-NLP group have contributed to this success, most notably Jörg Tiedemann as a co-founder of the workshop series, and Yves Scherrer and Tommi Jauhiainen as current main organizers. This year, three research papers were co-authored by researchers from Helsinki (Olli Kuparinen, Aleksandra Miletić, Janine Siewert, Yves Scherrer). Tommi and Yves also contributed to the three shared tasks proposed in the evaluation campaign this year. After two fully online editions and one hybrid edition, the 2023 edition was characterized by growing on-site participation, with about two thirds of talks and posters presented in person.

One of the highlights of VarDial was the panel discussion (initially called “round table”, but we were assigned a fairly small room without tables, let alone round ones) that reflected on the past of VarDial and its future in the era of large language models. Our experts (Antonis Anastasopoulos, Gabriel Bernier-Colborne, Preslav Nakov, Tanja Samardzic, Ivan Vulic) agreed that even though the methods changed drastically over the last ten years, the VarDial themes were more relevant than ever. VarDial has also been known for its evaluation campaign primarily focusing on “hard” language identification problems such as those between closely related language varieties. The panelists were happy to see the continued interest in this campaign, but also wished for more varied downstream tasks to be included in it. As organizers, we have always strived to propose a wide variety of tasks, but have struggled to attract sufficient numbers of participants. Furthermore, it was highlighted that dialects are first and foremost spoken varieties and that we should therefore focus more on spoken data in the future. The panelists viewed the ongoing consolidation of methods and the dominance of Transformer-based paradigms as a potential silver bullet that would hopefully make it easier for everybody to participate in future shared tasks.

This year Slavic NLP (formerly Balto-Slavic NLP) included two papers from the members of our unit (Roman Yangarber and Anna Dmitrieva) who also contributed to the organization of the workshop and the shared task. The Slav-NER shared task is a multilingual named entity recognition challenge for Czech, Polish, and Russian, which also includes name normalization and entity linking. The top system’s performance on NER and normalization this year reached an F1 score of 90. Entity linking, a more challenging task, had an F1 score of 72-80, while cross-lingual entity linking had an F1 score of ~67, which is a great improvement compared to the previous challenge. The workshop’s best paper, “Resources and Few-shot Learners for In-context Learning in Slavic Languages” (Štefánik et al.), also touched on Polish, Czech, and Russian NER. The authors created an evaluation benchmark for in-context learning for these languages, supported tasks being NER, Classification, QA, and NLI.

Notes from EACL 2023

Hello! I’m Timothee Mickus, one of the postdocs in the FoTran project. I was at this year’s EACL to present a paper we submitted a year ago to TACL. The piece was about throwing linear algebra at Transformers and seeing what comes out. It was my first live conference since the pandemic, so that was a nice change of pace from Zoom-based and Underline-based conferences. It’s always easier to get feedback with live audiences.

I left the conference with a few tentative opportunities to collaborate with people in other labs, some new ideas for experiments, and, of course, numerous papers to read, with topics ranging from biases in annotation practices to the interpretability of contextualization in Transformer embeddings and from unsupervised machine translation evaluation to knowing when to expect failure and success in multitask scenarios.

The keynotes covered a wide variety of topics, and special attention had been paid to ensuring that the speakers came from different backgrounds. Ed Grefenstette talked about large language models and why instruction-based LMs seemed especially promising from a very NLP-centric standpoint. Kevin Munger provided a more media-studies oriented take on the same topic, and questioned how we should change our habits (from how we teach to how we write about AI to what new social policies we need) given the rise of chatbots in today’s and tomorrow’s information landscape. Joyce Chai gave an overview of how advancements in NLP impact other AI research fields, and in particular embodied AI.

How do we keep ACL events affordable?

Coming into EACL, I expected that most of this blog post would be about the science. It turned out that EACL 2023 was also out of the ordinary in terms of organization.

One contentious point that was thoroughly discussed during the conference breaks was the price. It turned out to be one of the major points addressed during the business meeting (which was the only session where questions could be directed to the conference organizers). I tend to agree that $800 for a five-day event (without counting the ACL membership dues) is a hefty sum; it contributed to the decisions of some of my colleagues and co-authors not to attend the conference. Next year’s venue has yet to be decided, since next year’s organizers want to explore more affordable options than what they initially had in mind. I would personally appreciate some transparency as to how the conference is funded (how much comes from sponsors? from the ACL dues? from the conference attendees?) and where that funding goes (how much are we paying for the conference venue and coffee-break catering? for the conference handbooks? for our EACL-branded tote bags?).

The challenge of organizing EACL 2023

But maybe the most important thing to mention is the repeated rescheduling. The conference was initially supposed to happen in Kiev, and did not happen in Kiev for reasons. The backup venue in Dubrovnik was something of a last-minute change of plans: the original plan once Kiev was ruled out was that the conference would just go online. I suppose this rushed relocation also explains some of the mishaps and communication issues during the final weeks before the conference. To take some concrete examples: the handbook we were provided contained duplicate entries for some of the papers presented at the conference; conversely, some poster sessions were not included. On a more personal note, I had no clear indication as to whether my paper would be presented in an oral or poster session: instead, I was sent a link to the Underline page while it was still under construction, and directed to search for my name in this semi-functional website.

Presentations, posters, findings and all the confusion

This leads me to another unconventional organizational decision: every paper that was presented orally also had a poster presentation slot. This happened to be my case as well. I’m still not fully certain what I think of it: I’m not complaining about the extra opportunity to advertise my work, but it is extra work (both ahead of and during the conference). It’s also somewhat unfair that some of the presenters didn’t get the opportunity, and it is all the more jarring when it comes to Findings papers: officially, Findings are works that are good enough but for which there’s no extra space in the conference itself, and yet organizers did manage to double-book quite a number of papers in the main conference. This means we have a tiered acceptance system: some of us got two presentation slots, some of us only got a poster, and some of us (Findings authors) only got virtual presentations. In and of itself, that’s not necessarily a bad thing, but I’m uncomfortable with this tiered system being left implicit. Let’s hope that future editions of EACL and *ACL conferences will be more transparent on that front.

Where is NLP as a field and EACL as a conference?

The last point I’d like to mention comes from the opening session. As it turns out, most submissions to the European Chapter of the Association for Computational Linguistics did not come from Europe (with a good third of submissions stemming from the USA), and very few actually focused on linguistics. Most works submitted focused on engineering, processing and dataset description rather than syntax, semantics or phonology modeling. While this state of affairs raises the question of what EACL is precisely, it also highlights how things have changed over the last few years: NLP has become a more international and engineering-focused field of research. Despite these changes, it’s good to see that the community remains very directly involved in deciding where it is going, be it local volunteers in Dubrovnik picking up the ball at the last minute to ensure that EACL would not be an online-only event, or attendees openly discussing whether our current conference practices are a good fit for the community.

Tommi Nieminen joins Helsinki-NLP

My name is Tommi Nieminen, and I recently joined the Helsinki-NLP research group as a new PhD student. For the past two decades, I have worked in the translation industry, starting as a translator and gradually drifting to more technical roles, such as CAT tool support, localization engineering, translation process automation, and machine translation development. Due to my work history, my research focuses mostly on the use of language technology in professional translation.

I have a long history with the University of Helsinki. I enrolled in an MA programme in Philosophy at the university in 2001, and after a long period of academic absence and part-time study (and a change of disciplines) I finally graduated with an MA in language technology in 2018. Since then I have participated in two academic projects involving the university, Fiskmö and OPUS-MT: Open Translation Models, Tools and Services. In the course of these projects I developed the OPUS-CAT tool, which enables translators to use machine translation models from the OPUS-MT project in their normal working environments. The motivation behind OPUS-CAT is to make open-source machine translation technology and resources directly available to individual translators, so that they may have more control over how machine translation is integrated into their profession.

I am thrilled to be part of the team in Helsinki and the GreenNLP project, and to work on issues that have recently become more significant than ever before. I live far from Helsinki, so I am usually at the university only one day a week. I look forward to meeting those of you I have not met yet.

Introducing Elaine Zosa

Hello there! I’m Elaine, a new postdoctoral researcher in the Helsinki-NLP research group. To start off, I have not always worked in NLP. I worked in the financial technology sector before I decided to study for a master’s degree. I obtained my MSc in Computer Science at the University of Helsinki where my concentration was on algorithmic bioinformatics. After that, I was a research assistant in computational genomics at the Technical University of Munich. Then in late 2018, I started my doctoral research at the University of Helsinki, in the Discovery Research Group led by Prof. Hannu Toivonen.

During my PhD, I worked on two EU Horizon 2020 projects: NewsEye and EMBEDDIA. Both these projects involved building tools to help analyse large-scale news collections. In the former, we focused on historical news collections from Finland, France, and Austria, and in the latter, on news media from less-represented European languages such as Finnish, Estonian, and Croatian. I worked on various tasks in the projects and helped develop new methods in topic modeling, lexical semantic change, news headline generation, and multilingual news matching. Methodological innovations aside, these projects exposed me to the inherently interdisciplinary nature of NLP and language technology and that, I think, is the most exciting thing about this field. I enjoy building tools that could be useful to researchers in the humanities and social sciences, and beyond.

Now I am investigating methods to quantify and model uncertainty in various linguistic tasks. You can also find out more about my work on my homepage!

New team member: Shaoxiong Ji


My name is Shaoxiong Ji. This year, I started as a Postdoctoral Researcher at the University of Helsinki, working on high-performance language technology at the Language Technology research group led by Prof. Jörg Tiedemann.

My research focuses on multilingual NLP and machine translation. In the HPLT project, I will be working on NLP resource development, such as large-scale data, and on training big models and knowledge distillation models. I am also interested in other topics such as modular neural networks, zero-shot cross-lingual tasks, and other low-resource problems.

I did my Ph.D. at Aalto University under the supervision of Assoc. Prof. Pekka Marttinen in the Machine Learning for Healthcare research group. My Ph.D. thesis is about natural language processing for healthcare applications. During my doctoral candidature, I was a visiting researcher with Prof. Hinrich Schütze at the University of Munich (LMU Munich, Germany) and Dr. Mikko Peltola at the Finnish Institute for Health and Welfare (THL, Finland). Prior to my doctoral candidature, I did MPhil research with Prof. Xue Li and Prof. Helen Zi Huang at the University of Queensland (UQ, Australia), working on NLP applications for social good. I was also a visiting researcher with Assoc. Prof. Erik Cambria at Nanyang Technological University (NTU, Singapore), where I worked on sentiment analysis, especially emotion recognition in conversations.

I also spent half a year as a research assistant and a visiting scholar with Dr. Guodong Long and Dr. Shirui Pan at the University of Technology Sydney (UTS, Australia), working on federated learning and mobile internet applications.

I am happy to join the group and to discuss interesting topics with you. Thank you for taking the time to read my introduction. I look forward to meeting you!


New team member: Ona de Gibert Bonet

Hello everyone!

My name is Ona de Gibert Bonet and I am thrilled to introduce myself as a new PhD student in the Department of Digital Humanities at the University of Helsinki. I joined the Helsinki-NLP group in January this year.

To tell you a bit about my background, I was born and raised in Barcelona. I received a B.A. in Modern Languages and Literature from the University of Barcelona (2016) and earned an M.S. in Language Analysis and Processing from the University of the Basque Country (2018).

I am passionate about Machine Translation (MT), which is what led me to pursue a PhD in Language Technology. I am excited to join the research team led by Jörg Tiedemann, where I will be exploring MT for low-resourced languages with a focus on knowledge distillation. I believe that this research area has great potential to make a positive impact on the democratization of MT, broadening its accessibility, and I am eager to contribute to this work.

Apart from my academic interests, I love dancing ballet and lindy hop. In my free time, you can often find me doing very Finnish things: knitting or sitting in the sauna. Both activities help me relax and clear my mind after a long day of research.

Thank you for taking the time to read my presentation, and I cannot wait to see what this academic journey has in store for us all. If you see me around campus, please don’t hesitate to say hello!