This is the timetable of the general research seminar in language technology for spring 2022. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. We will alternate between online seminars (on Zoom) and in-person seminars at the University of Helsinki. The latter may also be broadcast so that they can be followed online. Details will be announced for each individual talk. Please subscribe to our mailing list below.
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to firstname.lastname@example.org with the following text in the body of your e-mail:
subscribe lt-research email@example.com
Tentative Schedule – Fall 2023
- September 7 – welcoming coffee with our new students?
- September 14 – reports and news from summer conferences
- September 28: Rob van der Goot, ITU Copenhagen (Zoom only)
- Abstract: Robustness is key for any NLP application that is to be used in a real-world situation. The first part of this talk will be about lexical normalization of social media data: the progress made on this task over the past decade, its effect on downstream tasks, and remaining challenges. In the second part of the talk, I will cover experiments with autoencoder language models in multi-task setups. We will start out with simple setups where 2 tasks are combined, but also look at a much more challenging setup where we consider all SemEval text-based tasks of 2022 and 2023. This setup is extremely challenging, because the tasks (and task types) are very diverse, and the datasets differ along many dimensions (size, language, domain, etc.).
- October 12: Robert Östling, Stockholm University (Zoom only)
- Title: Neural models as typologists: Can neural NLP models discover language generalizations? [video]
- Abstract: This presentation summarizes our work on language representations derived from massively multilingual neural models. What can we really say about how these models generalize across languages? What problems arise when trying to answer this question? We found that only a few of the models we investigated appear to make generalizations that hold over the 132 different language families present in our data, and in only a few cases did these align with traditional generalizations from linguistic typology. As a point for discussion during the seminar, I would like to ask how we go forward with designing neural models on the scale of thousands of languages that are computationally efficient NLP models, but also contribute to our understanding of human languages.
- October 26: Joseph Attieh, University of Helsinki
- Title: From Basics to Breakthroughs: A Deep Dive into Knowledge Distillation in Neural Machine Translation [slides][video]
- Abstract: As Neural Machine Translation (NMT) models continue to grow in complexity, there’s an increasing need for efficient and compact models without loss in translation quality. Knowledge Distillation has emerged as a pivotal solution to address this challenge. It involves training a smaller student model to emulate the behavior of a large teacher model. This presentation provides a high-level survey of knowledge distillation techniques specifically designed for NMT models. The initial segment of the talk will provide an overview of the foundational principles of knowledge distillation. Subsequently, we will offer a high-level review of papers published on this topic, delving into the latest advancements and breakthroughs in the domain.
- November 2: Timothee Mickus, University of Helsinki
- Title: Distributional, yes—but semantics? [slides] [video]
- Abstract: Representations learned from word distributions have played an increasingly dominant role in NLP applications and research. The aim of this talk is to push back on the narrative that they capture what words mean, and that this aspect is what makes them so successful, and instead instill a more nuanced overview. Going through a series of experimental results, we will highlight practical cases where they fall short of what one expects of semantic representations, as well as some alternative explanations of what linguistic aspects they depict.
- November 9: Timothée Bernard, Université de Paris (Zoom only)
- Title: Studying the syntax and semantics of emergent languages [slides] [video]
- Abstract: In signaling games, two or more agents (human or otherwise) have to cooperate through information exchange in order to retrieve some target. In particular, we start with an experimental setup in which one agent has to describe an image using arbitrary symbols so that another agent can recognise it among a set of distractors. The agents are implemented as neural networks and their training is based on reinforcement. Our long-term goal is to understand under which conditions the language developed by the two agents, similarly to natural language, exhibits a syntax and a compositional semantics. As first steps, (i) we study how many implementation choices affect the reliability of the training procedure and the performance of the agents, (ii) we introduce an array of automatic metrics designed to uncover syntactic and semantic aspects of the messages exchanged by successful pairs of agents, and (iii) we show that these pairs of agents can acquire and communicate about high-level concepts even when apparently simpler strategies are available (and indeed discovered).
- November 16: Cristina España i Bonet, DFKI
- Title: Human biases in multilingual systems [slides]
- Abstract: Some human preferences are universal. The odor of vanilla, for example, is perceived as pleasant all around the world. We expect neural models trained on human texts to exhibit these kinds of preferences, i.e. human biases, but this is not always the case. In this talk, I will explore 16 static and contextual embedding models in 9 languages and, when possible, compare them under similar training conditions. I will also introduce and motivate CA–WEAT, multilingual culturally aware tests to quantify biases, and compare them to previous English-centric tests. Our experiments confirm that monolingual static embeddings do exhibit human biases, but values differ across languages, being far from universal. Biases are less evident in contextual models, to the point that the original human associations might be reversed. Multilinguality proves to be another variable that attenuates and even reverses the effect of the bias, especially in contextual multilingual models. These findings highlight the importance of being cautious when choosing multilingual systems over monolingual ones and when employing literal translations in multilingual NLP applications.
- November 23: Guangpu Huang, Aalto University (Zoom only)
- Title: Multimodal Representation Learning for Speech Emotion Recognition: A Review [slides] [video]
- Abstract: Human communication relies on multimodal representations: speech, language, facial expressions, gestures, etc. Could we learn something fundamental from these representations? Is it possible that multimodal representation learning could provide us with some answers as to how to optimize the performance of machine learning models, and ultimately achieve human-level performance on ASR, NLP and CV tasks? Transformers might just seem to be the right candidates. With the increasing amount of multimodal data online, transformer-based methods have increasingly shown great potential in multimodal machine learning. However, there are many unanswered questions as well as practical challenges, especially when deploying ‘in the wild’. In this talk we will review multimodal representation learning using transformers. We will design experimental studies on speech emotion recognition (alternatively facial expression recognition or affect recognition) to jointly study the modalities: speech, text, and video/image capture of the human face (and hand gestures if possible). More importantly, we will raise some key research questions in multimodal representation learning. We will also discuss open challenges, potential research directions, and collaborations among ASR, NLP, and CV research teams.
- November 30: Aman Sinha, Université de Lorraine, Nancy, France (Zoom only)
- Title: Stepping towards Medical Language Understanding [slides] [video]
- Abstract: Medical language, characterized by specific vocabulary, heterogeneity, unstructuredness, and temporality, poses a significant challenge for modeling using general domain deep learning techniques. The sources of medical language (data) are not just limited to hospitals. My thesis extensively explores comprehending medical language across diverse sources. While each source’s information richness has been extensively studied, deriving insights from each presents challenges in terms of variance and verifiability. This issue underscores the crucial need to establish validity through the integration of multiple data sources. In this presentation, I will delve into the realm of medical language from the perspectives of deep learning and NLP, emphasizing its temporal aspects. I aim to showcase our project, SLAN, designed to manage irregular temporal medical data, along with its subsequent extension. Lastly, I will address the manifold challenges encountered in handling temporal medical data.
- December 7: Sofoklis Kakouros, University of Helsinki (postponed)
- December 14: Christmas coffee