This is the timetable of the general research seminar in language technology for fall 2022. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. We will alternate between online seminars (on Zoom) and in-person seminars at the University of Helsinki. The latter may also be broadcast so that they can be followed online. Details will be announced for each individual talk. Please subscribe to our mailing list below.
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to email@example.com with the following text in the body of your e-mail:
subscribe lt-research firstname.lastname@example.org
Tentative Schedule – Fall 2022
- September 15: Mikko Kurimo, Aalto University
- Title: Developing ASR systems and resources for low-resourced languages [video]
- Abstract: In this talk I will present the latest results of Aalto’s ASR group on the two new 3500+ hour training and evaluation datasets, Finnish Parliament and Lahjoita Puhetta, which we have recently prepared. We studied some of the fundamental questions in developing ASR systems, such as:
- How much training data is needed and how to collect that effectively?
- How much manually transcribed spontaneous speech is needed and how accurate can that be?
- How to measure the ASR accuracy for all speaker groups and how much it varies?
- How much help comes from untranscribed speech and big pre-trained transformers?
- And finally, how to process speech input if no ASR is available?
- September 22: Janine Siewert, University of Helsinki (moved to October 20)
- September 29: Aarne Talman, University of Helsinki / silo.ai
- Title: How Does Data Corruption Affect Natural Language Understanding Models? [video]
- Abstract: A central question in natural language understanding (NLU) research is whether high performance demonstrates the models’ strong reasoning capabilities. In this talk I will present an extensive series of controlled experiments where pre-trained language models are exposed to data that have undergone specific corruption transformations. The transformations involve removing instances of specific word classes and often lead to non-sensical sentences. Our results show that performance remains high for most GLUE tasks when the models are fine-tuned or tested on corrupted data, suggesting that the models leverage other cues for prediction even in non-sensical contexts. Our proposed data transformations can be used as a diagnostic tool for assessing the extent to which a specific dataset constitutes a proper testbed for evaluating models’ language understanding capabilities.
- October 6: Mark Ellison, University of Cologne
- Title: NLP and Human Processing: prominence, referring expressions and metaphors [video]
- Abstract: Human language has adapted to our associative memory and our pragmatic reasoning. In this talk, I discuss some of these adaptations, and their implications for the machine processing of language, and in particular, referring-expression form. When do we refer to something with a name, a descriptive noun phrase, a pronoun or (in languages where this is possible) zero? I will discuss the extent to which there is a “right” form for a context, and ways in which we can evaluate machine-generated form choices. While the discourse prominence of a referent drives much of this selection, semantic factors – such as metaphoricity – may also limit the referring expressions appropriate for a given context.
- Dr. Mark Ellison is a postdoctoral fellow at the DFG Collaborative Research Centre on Prominence in Language, University of Cologne, Germany.
- October 13: Elisa Bassignana, IT University of Copenhagen
- Title: Current challenges in Information Extraction [video]
- Abstract: With the increase of digitized data and extensive access to it, the task of extracting relevant information with respect to a given query has become crucial both for individuals and for companies. The variety of applications of this task, and the impossibility of annotating data from every text domain, require models to be robust to data shifts. In this talk I will present the findings of my PhD project with respect to one of the most important challenges of information extraction: the ability of a model to work in unseen scenarios (i.e. unknown text domains and unknown queries). More specifically, given the time and data constraints for model training, this concerns the ability of an information extraction model to easily adapt to new text domains and to new information types to extract.
- October 20: Janine Siewert, University of Helsinki
- Title: Low Saxon dialects across time and borders
- Abstract: The focus of my research is dialectal variation in Low Saxon and its change from the 19th century to today. For this, I use a corpus-based approach and perform different clustering methods on PoS and morphological feature n-grams in order to measure the distance of Low Saxon dialects to each other as well as to Standard Dutch and German.
Low Saxon has no interregional standard and, despite its official status in both Germany and the Netherlands, the number of speakers has been in decline for decades. Therefore, one might expect to observe a tendency of the Low Saxon dialects growing apart along the border and converging towards German or Dutch respectively. However, the level of PoS and morphological features does not reveal such a tendency. Instead, the Low Saxon dialects from both Germany and the Netherlands seem to be converging towards each other as well as to Standard Dutch.
- October 27: Seppo Nyrkkö, University of Helsinki
- November 3: Yves Scherrer, University of Helsinki
- Title: Dialect-to-Standard Normalization: A Multilingual Benchmark and Evaluation [video]
- Abstract: Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this talk, I propose to consider dialect-to-standard normalization — i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety — as a distinct sentence-level character transduction task. Concretely, we compile a multilingual benchmark dataset covering four languages from three different families, provide three different data splits corresponding to different use cases of automatic normalization, and present a large-scale evaluation of dialect-to-standard normalization methods.
- November 10: Elaine Zosa, University of Helsinki
- Title: Multilingual and Multimodal Topic Modelling with Contrastive Learning [video]
- Abstract: Topic models extract latent themes from a set of unstructured documents. Due to their interpretable outputs they are useful for exploring, summarising, and organising large datasets. Neural topic models have been proposed to improve on classical topic models, making it easier to incorporate information from external sources such as embeddings from pretrained language models. In this talk, I will present a neural topic model for comparable multilingual data that uses a contrastive objective to induce a multilingual topic space and learn topics aligned across multiple languages. I will discuss how this technique can also be applied to the multimodal setting with minimal adjustments. Our experimental results show that this technique improves on some classical and neural topic models for multilingual and multimodal data.
- November 17: Khalid Alnajjar, University of Helsinki
- Title: Utilizing Multimodalities in Recognizing Figurative Language
- Abstract: In this talk I will describe our recent research on considering multiple modalities for detecting figurative language such as metaphors, sarcasm and humor. I will also discuss the importance of multimodal approaches and the shortcomings of the models. Our results suggest that, most of the time, multimodal models enhance the detection as more context is being captured which yields higher accuracies.
- November 24: Aleksandra Miletić, University of Helsinki
- Title: Occitan in Wikipedia discussions: Corpus extraction, language identification and user practices [video]
- Abstract: Occitan is a regional language spoken in southern France and in parts of Italy and Spain. It exhibits a rich system of dialects and it is not standardized. Like many such languages, it has only recently started to enter the digital era. Basic digital tools and resources (text databases, electronic dictionaries, text-to-speech tools) have been created and Occitan Wikipedia is also being developed. However, until recently, Occitan still lacked a large downloadable corpus. In my talk, I will present OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with four off-the-shelf tools, including HeLI (Jauhiainen et al., 2022) and a new fasttext-based language identification model from Meta AI’s No Language Left Behind initiative (Costa-jussà et al., 2022). I will also present an analysis of the usage of Occitan dialects and spelling norms on a corpus sample, which was conducted in an attempt to describe user practices on this medium.
- December 1: Christian Hardmeier, IT University of Copenhagen
- Title: Understanding and Modelling Pronouns in Translation: Resources, Methods, Challenges and Insights [video]
- Abstract: The difficulty of pronoun translation is typically illustrated with examples of anaphoric pronouns requiring gender agreement in the target language. However, pronoun translation is more complex than that. In this talk, I present our efforts to understand and model the generation and interpretation of pronouns in translation. A core resource is the ParCorFull corpus, a multilingual parallel dataset with a rich annotation of coreferential phenomena going beyond simple anaphoric references. ParCorFull has found a range of applications, from the cross-lingual study of texts to machine translation evaluation, leading to insights into translation processes, but also uncovering challenges due to how corpus annotation resolves ambiguity, potentially creating conflicts in parallel data. Additional insights can be gained from studies of pronoun generation and interpretation we’ve conducted with human participants, highlighting the variance of typical patterns across five European languages. I also present our work on modelling pronoun translation in the context of cross-lingual coreference resolution and neural machine translation with the help of cross-lingual mention attention, resulting in consistent, but rather modest performance gains.
- December 8: Maha Elbayad, Facebook AI Research
- NOTE! Different time! Start at 16:15
- The talk will be on Zoom, but we will also live-stream it in Metsätalo, room 5
- Live streaming: Zoom link
- Title: The No-Language-Left-Behind project [video]
- Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. In No Language Left Behind, we took on the challenge of breaking the 200 language barrier by first contextualising the need for low-resource language translation support. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark in Flores-200. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.
- Bio: Maha Elbayad is an AI research scientist at FAIR (Meta AI) working on Low-resource and Massively Multilingual Machine Translation.
Prior to joining FAIR, she received a PhD in Applied Mathematics & Computer Science from Université Grenoble Alpes under the supervision of Prof. Laurent Besacier and Jakob Verbeek. Her research projects include work on convolutional sequence-to-sequence models, computationally efficient Transformer decoders with early-exits and simultaneous machine translation.
- December 15: Denis Paperno, Utrecht University
- Title: To the limits of Distributional Semantics and Beyond
- Abstract: In contemporary NLP/AI research, there is an active discussion of the role of grounding in language processing. According to some, models trained on text data alone (aka distributional models), which form the basis for most applications nowadays, are not capable of characterizing semantics properly.
Aiming to contribute to this discussion, I will discuss concrete evidence on how far distributional signal from text can go, and what could help supplement it efficiently.