Research Seminar in Language Technology

2015-2016 / 2016-2017 / 2017-2018 / 2018-2019 / 2019-2020 / 2020-2021

General Information

This is the timetable of the general research seminar in language technology for spring 2021. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. Because of the current situation, we will organize the seminars as online meetings via Zoom. The link to the Zoom session will be sent together with the announcement of each talk. Please subscribe to our mailing list below.

Place and Time:
Thursdays 14:00 – 15:00, Zoom room

Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to majordomo@helsinki.fi with the following text in the body of your e-mail:

subscribe lt-research your.email@helsinki.fi

Tentative Schedule – Spring 2021

  • January 14: Khalid Alnajjar: Interpretation and Generation of Metaphors and Metaphorical Expressions
    • Abstract: Metaphors are found in our everyday communication; we use them to convey creative comparisons and ideas. Coming up with apt metaphors, or even comprehending them, is a challenging task for computers. In this presentation, we’ll talk about a data-driven method for interpreting nominal metaphors and how such a method could be used to generate nominal metaphors and metaphorical expressions in English and Finnish.
  • January 21: Alessandro Raganato: Lexical Ambiguity and Attentive Patterns in Neural Machine Translation

    • Abstract: In this talk I will present some recent work on two different topics: i) Word Sense Disambiguation (WSD), and ii) Self-Attention Patterns in Transformer-Based Machine Translation.
      First, I will describe different multilingual benchmarks for the Word Sense Disambiguation task, assessing the ability of systems to correctly distinguish the meanings of a word with and without the need for a sense inventory. These multilingual benchmarks, named XL-WSD and XL-WiC, feature gold standards in several non-English languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer. I will also present an evaluation benchmark on WSD for machine translation, named MuCoW, which measures the ability of a translation system to handle lexical ambiguity.
      Finally, I will move the topic of the talk to Neural Machine Translation, presenting how the encoder of the Transformer architecture can be substantially simplified with trivial attentive patterns at training time, reducing the parameter footprint without loss of translation quality, and even improving quality in low-resource scenarios.
      The talk is based on these papers:
    • Tommaso Pasini, Alessandro Raganato, Roberto Navigli.
      XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation. In AAAI 2021 (to appear).
    • Alessandro Raganato, Tommaso Pasini, José Camacho-Collados and Mohammad Taher Pilehvar.
      XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization. In EMNLP 2020.
    • Alessandro Raganato, Yves Scherrer and Jörg Tiedemann.
      An Evaluation Benchmark for Testing the Word Sense Disambiguation Capabilities of Machine Translation Systems. In LREC 2020.
    • Alessandro Raganato, Yves Scherrer and Jörg Tiedemann.
      Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation. In Findings of EMNLP 2020.
  • January 28: no seminar today
  • February 4: no seminar today
  • February 11: Wilker Aziz, University of Amsterdam
    • Abstract: Neural sequence generation systems oftentimes generate sequences by searching for the most likely sequence under the learnt probability distribution. This assumes that the most likely sequence, i.e. the mode, under such a model must also be the best sequence it has to offer (often in a given context, e.g. conditioned on a source sentence in translation). Recent findings in neural machine translation (NMT) show that the true most likely sequence oftentimes is empty under many state-of-the-art NMT models. This follows a large list of other pathologies and biases observed in NMT and other sequence generation models: a length bias, larger beams degrading performance, exposure bias, and many more. Many of these works blame the probabilistic formulation of NMT or maximum likelihood estimation. We provide a different view on this: it is mode-seeking search, e.g. beam search, that introduces many of these pathologies and biases, and such a decision rule is not suitable for the type of distributions learnt by NMT systems. We show that NMT models spread probability mass over many translations, and that the most likely translation oftentimes is a rare event. We further show that translation distributions do capture important aspects of translation well in expectation. Therefore, we advocate for decision rules that take into account the entire probability distribution and not just its mode. We provide one example of such a decision rule, and show that this is a fruitful research direction.
  • February 18: Jussi Karlgren
    • Title: The TREC Podcasts Track – One New Dataset, Two New Shared Tasks, and Many New Interesting Questions
    • Abstract:
      Podcasts are spoken documents across a wide range of genres and styles, with skyrocketing listenership across the world. To promote research on podcast material, which differs from other sets of language in interesting ways, Spotify has released a data set of 100 000 English language podcast episodes with full audio, full automatic transcriptions, and metadata. Transcribed podcast material differs from other collections of English language data in some respects. The talk will describe some of the observed differences between this collection and some other collections and discuss how some further differences might be studied. The data set was used at the TREC Podcasts Track organised by the US National Institute of Standards and Technology with two shared tasks: segment retrieval and summarisation. This talk will describe the data set, give a brief overview of the 2020 Podcasts Track, and describe how the tasks will develop for next year’s track.
  • February 25: Marianna Apidianaki
    • Title: What does BERT know about words? Unveiling hidden lexical semantic properties.
    • Abstract:
      Given the high performance of pre-trained language models on natural language understanding tasks, an important strand of work has focused on the linguistic knowledge encoded inside the models, mainly addressing structure-related aspects. In our work, we explore the knowledge BERT encodes about lexical semantics. We specifically probe BERT representations for scalar adjective ranking and noun property prediction. We perform intrinsic evaluations against hand-crafted data, and test the extracted representations on the tasks of indirect question-answering and in-context lexical entailment. We show that the model encodes rich information about adjective intensity acquired through pre-training, but has only marginal knowledge of noun properties and their prevalence. 

      The presented work was carried out in collaboration with my PhD student, Aina Garí Soler (University Paris-Saclay), within the framework of the MULTISEM ANR project.
  • March 4: Roberto Navigli, Sapienza University of Rome

    • Title: Scaling Semantic Role Labeling and Semantic Parsing Across Languages
    • Abstract: Sentence-level semantics is hampered by the lack of large-scale annotated data in non-English languages. In this talk I will focus on two key tasks aimed at enabling Natural Language Understanding, that is, Semantic Role Labeling (SRL) and semantic parsing, and put forward innovative approaches which we developed to scale across languages. I will show you how new, language-independent techniques – including new deep learning architectures and high-quality silver data creation, as well as a brand-new, wide-coverage, multilingual verb frame resource, namely VerbAtlas – will help significantly close the gap between English and low-resource languages, and achieve the state of the art across the board.
    • Bio: Roberto Navigli is Professor of Computer Science at the Sapienza University of Rome, where he leads the Sapienza NLP Group. He is one of the few researchers to have received two prestigious ERC grants in computer science on multilingual word sense disambiguation (2011-2016) and multilingual language- and syntax-independent open-text unified representations (2017-2022). In 2015 he received the META prize for groundbreaking work in overcoming language barriers with BabelNet, a project also highlighted in The Guardian and Time magazine, and winner of the Artificial Intelligence Journal prominent paper award 2017. He is the co-founder of Babelscape, a successful company which enables Natural Language Understanding in dozens of languages. He is a Program Chair of ACL-IJCNLP 2021.
  • March 11: Edoardo Ponti, University of Cambridge / Mila Montreal

    • Title: Sample Efficiency, Generalisation, and Robustness in Multilingual NLP
    • Abstract: The typological variation across the world’s languages makes knowledge transfer very challenging for the current state-of-the-art models, as they are often vulnerable to the differences in source and target data distributions and suffer from the paucity of labelled data. Hence, I will explore how to endow machines with the ability to learn a new language from a few examples, to generalize to new domains by recombining previously acquired knowledge in original ways, and to make robust predictions in non-i.i.d. settings. I will show that solutions motivated by cognitive science and language acquisition in children can be elegantly given form within a Bayesian neural framework. Specifically, I will propose methods for the construction of an adequate inductive bias for neural weights and architectures, a modular design of the neural parameter space, and game-theoretic criteria for meta-learning. These models can be learned via approximate inference techniques, such as Laplace and variational methods.
    • References: "Towards Zero-shot Language Modeling" and "Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages"
    • Bio: I am a postdoctoral fellow in computer science jointly at Mila and McGill University in Montreal. In 2020, I obtained a Ph.D. in computational linguistics at the University of Cambridge, St John’s College, where I was supervised by Anna Korhonen and Ivan Vulić. I am also a frequent collaborator of Ryan Cotterell’s lab at ETH Zürich. Previously, I interned as an AI/ML researcher at Apple in Cupertino. My research earned a Google Research Faculty Award and an ERC Proof of Concept Grant for projects co-written with my supervisors, a Fulbright scholarship, and a Best Paper Award at RepL4NLP 2019. Finally, I am a board member of SIGTYP, an ACL special interest group for computational typology, and a member of the ELLIS Society.
  • March 18: Raul Vazquez
    • Title: On the differences between BERT and MT embedding spaces
    • Abstract: In this talk I will present some recent work on the embedding spaces created using BERT and MT encoders. I will present a comparison of the representation spaces of both models using average cosine similarity and anisotropy-adjusted contextuality metrics, revealing that BERT and NMT encoder representations look significantly different from one another. Finally, I will present a technique we propose for overcoming this mismatch: a transformation from one embedding space into the other using explicit alignment and fine-tuning. Our results demonstrate the need for such a transformation to improve the applicability of BERT in MT.
  • March 25: Anna Dmitrieva

    • Title: Creating an Aligned Russian Text Simplification Dataset from Language Learner Data
    • Abstract: Parallel language corpora where regular texts are aligned with their simplified versions can be used in both natural language processing and theoretical linguistic studies. They are essential for the task of automatic text simplification, but can also provide valuable insights into the characteristics that make texts more accessible and reveal strategies that human experts use to simplify texts. Today, there exist a few parallel datasets for English and Simple English, but many other languages lack such data. In this presentation I will describe the work on creating an aligned Russian-Simple Russian dataset composed of Russian literature texts adapted for learners of Russian as a foreign language.
  • April 1: Roman Yangarber
    • Title: NLP and language learning
    • Abstract: We discuss how recent advances in natural language processing and educational data science can be leveraged to create machines that can help people learn languages effectively.  We present our approach — Revita — which tries to simulate a good tutor, by continually assessing the learner’s current state and progress.  A key aspect of the approach is allowing the user to use any content as learning material — arbitrary, authentic texts, chosen by the learners themselves.  The system collects data about the learning process — about typical patterns of mistakes, paths of progress, etc., — to continually improve the teaching ability of the system.  We review the current state and open challenges.
    • Mini-BIO: Roman Yangarber is Associate Professor at the Department of Digital Humanities, University of Helsinki (UH).  He holds the Chair in Linguistic Inequalities and Translation Technologies at INEQ: The Helsinki Inequality Initiative.  His research group — the Language Learning Lab — researches a variety of topics: studying how language works, and how computers can better understand language.  His earlier research themes include analysis of news media, and modeling language evolution.  More recent focus is on AI support for language learning.  This has resulted in a system currently in use by learners and teachers at several universities.  The research focuses on support for learning smaller, endangered languages, as well as for “majority” languages.
  • April 8: Mikko Aulamo
    • Title: Improving NMT for Sámi languages
    • Abstract: Historically, most of the machine translation work on Sámi languages has been conducted with rule-based (RBMT) methods. We show that low-resource neural machine translation (NMT) techniques are also a viable option. This work focuses mainly on translation from Finnish to Northern Sámi, with some attention to the other two Sámi languages spoken in Finland: Skolt Sámi and Inari Sámi. We show that backtranslations from RBMT can be used to boost the quality of NMT. Additionally, we show preliminary experiments and plans on other ways of improving NMT for Sámi languages, such as pivot translations and multilingual models.
  • April 15: Martijn Wieling, Rijksuniversiteit Groningen
    • Title: Approaches to investigating language variation and supporting minority languages
    • Abstract: In this presentation I will illustrate three computational approaches we use in our lab to investigate language variation and support minority languages. Specifically, I will show how we use an edit-distance-based approach to quantify pronunciation variation when transcriptions are available, and how these approaches may help in making different dialect data sets comparable in order to quantify pronunciation change. The second approach I will explain has the same goal, but instead of using transcriptions (which take much effort to obtain), this approach uses the acoustic signal only and, importantly, is able to outperform the transcription-based approach. I will end with a slightly different study in which we aimed to assess whether language models (such as BERT) may be usefully employed for minority languages. By using a POS-tagging task, we show that a high level of performance can be reached which does not require any annotated data.
  • May 6: Hande Celikkanat
    • Title: Expressing User-Specific Connotations through Probabilistic Representations
    • Abstract:
      It is well accepted that language is a negotiation of meaning, and that each single word gets its meaning assigned via a continuously ongoing negotiation throughout the lifetime of a language. However, such negotiations are by no means global to all the users of a language. Instead, tightly connected clusters of language users commonly develop their own “mini-dialects” by overloading certain existing words with special connotations. This makes sense from a language economy point of view, which creates a pressure to express commonly occurring concepts within the discourse of such groups with minimal-length expressions. The phenomenon can be seen, for example, (1) within professional groups, which overload words for ease and efficiency of professional communication, (2) among users of similar political/social stance, who overload words to stress important concepts for the community, as well as to create a sense of in-group unity, and even (3) within fan and hobby groups, which use overloaded words both as a means of rapid (e.g., game-related) communication and potentially as a part of their identity and self-expression, among many others. Motivated by this phenomenon, we claim that concepts occupy regions in the meaning space, whose coverage depends on the existence and strength of such negotiated in-group connotations, among other factors. This is in contrast to current state-of-the-art language models, which are optimized towards point estimates for the meanings of tokens. In this study, which is still under development, we try to extend the BERT language model to allow probabilistic meanings for tokens, which are in turn assumed to be conditioned not only on the sentence-level contexts of texts, but also on the predicted group affiliations of their authors. We discuss the problem definition, general model design, specific design choices and their alternatives, as well as possible evaluation options.
  • May 20: Janine Siewert
    • Title: Creating resources for Low Saxon dialectometry
    • Abstract: Despite its long written history, a standardised language for Low Saxon has not been developed so far, which makes it an excellent object for dialectometric investigations. Unfortunately, sufficiently annotated corpora for Low Saxon are extremely scarce, which is why the creation of such resources forms a central part of my PhD research. In my talk, I will present two such resources I have created or am still working on. The first of these is the LSDC dataset, including some of the dialect classification tasks I have conducted with it, and in the other half of my talk, I will introduce my ongoing work on the production of a large POS and morphologically tagged corpus spanning more than five centuries.
  • June 3: Kaisla Kajava
    • Title: A novel trilingual dataset for crisis news categorization across languages
    • Abstract: This paper describes a novel dataset of crisis-reporting news articles in three languages as a starting point for training news-domain-specific transformer models for crisis news monitoring. The automatically labeled dataset of news articles in English, French, and Russian is compiled from Common Crawl web archives. The dataset is used to assess the applicability of pre-trained language models based on BERT/RoBERTa, Sentence-Transformers, and LASER for unsupervised clustering of crisis news by topic. Using pre-trained sentence embeddings out of the box gives a negative result compared to standard tf-idf vectorization with LSA, suggesting that pre-trained sentence models require domain-specific fine-tuning to capture features in automatically labeled news article data on the document and sentence level. Fine-tuning a BERT-based model yields a macro F1 score of 66% for Russian crisis news topic classification, and 63% for English.
  • (June 10 – NAACL)
  • June 17: Iacer Calixto
    • Title: Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
    • Abstract: In this talk, I will present our recent work investigating the reasoning ability of pretrained vision and language (V&L) models in a task that requires multimodal integration: counting entities in an image. Previous work in the computer vision literature shows that this is a difficult task, and that standard model architectures struggle with counting. In this work, we show that the same holds for general-purpose pretrained V&L models, even though we generally believe such models “solve” many harder tasks. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that none of the pretrained V&L models is able to adequately solve our counting probe, and that they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings, and find evidence that models are impacted by dataset bias and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.