Research Seminar in Language Technology

2015-2016 / 2016-2017 / 2017-2018 / 2018-2019 / 2019-2020 / 2020-2021 / 2021-2022 / 2022-2023 / 2023-2024

General Information

This is the timetable of the general research seminar in language technology for spring 2024. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. We will alternate between online seminars (on Zoom) and in-person seminars at the University of Helsinki. The latter may also be broadcast so that they can be followed online. Details will be announced for each individual talk. Please subscribe to our mailing list below.

Place and Time:
Thursdays 14:15 – 15:45 (Metsätalo, room 26)
Live streaming: Zoom link

Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to majordomo@helsinki.fi with the following text in the body of your e-mail:

subscribe lt-research your.email@helsinki.fi

Tentative Schedule – Spring 2024

  • February 8: Erofili Psaltaki, University of Crete (on-site)
    • Abstract: At the beginning of this talk, I will mention the NLP projects in which I have participated. Following that, I will shift my focus to dialectology, specifically Cretan dialectology, which is my primary interest. I will present my research on Cretan interrogative elements, namely ίντα [ˈi.da] and γιάντα [ˈʝa.da], examining their evolution from diachrony to synchrony.
      Notably, there is no corpus available specifically for the Cretan dialect or Greek dialects in general. Due to this limitation, I conducted my research by reading Cretan literature and folklore materials, utilizing them as my personal corpus.
      Finally, I will draw connections between this work and my master’s thesis, for which it served as the initial inspiration.
  • February 15: Alan Ansell, University of Cambridge (remote)
    • Title: Modular and Efficient Adaptation with Sparsity [video]
    • Abstract: Fine-tuning enables us to adapt pretrained models to specific domains and tasks. But often we are interested in combining domains and tasks, or reusing knowledge learned about one task to assist with another. Cross-lingual transfer is a typical example, where we simultaneously wish to adapt to a specific task and a specific language. In this talk, I will discuss modular and efficient fine-tuning techniques, with a focus on the sparse fine-tuning techniques I have developed during my PhD, and how they can be applied in multilingual NLP. (A minimal code sketch of the sparse fine-tuning idea appears after the schedule.)
  • February 22: Vered Shwartz, University of British Columbia (on-site)
    • Part of the mini-workshop Goodbye FoTran
    • Different Zoom link
    • Title: Reality Check: Natural Language Processing in the era of Large Language Models [video]
    • Abstract: Large language models (LLMs) contributed to a major breakthrough in NLP, both in terms of understanding natural language queries, commands or questions; and in generating relevant, coherent, grammatical, human-like text. LLMs like ChatGPT became a product used by many, for getting advice, writing essays, troubleshooting and writing code, creative writing, and more. This calls for a reality check: which NLP tasks did LLMs solve? What are the remaining challenges, and which new opportunities did LLMs create? In this talk, I will discuss several areas of NLP that can benefit from but are still challenging for LLMs: grounding, i.e. interpreting language based on non-linguistic context; reasoning; and real-world applications.
    • Bio: Vered Shwartz is an Assistant Professor of Computer Science at the University of British Columbia, and a CIFAR AI Chair at the Vector Institute. Her research interests include commonsense reasoning, computational semantics and pragmatics, and multiword expressions. Previously, Vered was a postdoctoral researcher at the Allen Institute for AI (AI2) and the University of Washington, and received her PhD in Computer Science from Bar-Ilan University. Vered’s work has been recognized with several awards, including The Eric and Wendy Schmidt Postdoctoral Award for Women in Mathematical and Computing Sciences, the Clore Foundation Scholarship, and an ACL 2016 outstanding paper award.
  • February 29: Dana Roemling, University of Birmingham (on-site)
    • Title: Geolinguistic Authorship Profiling in Forensic Linguistics [video]
    • Abstract: Forensic authorship profiling, i.e. the analysis of texts to infer characteristics about an author, has traditionally relied on qualitative methods and the analyst’s expertise. Consequently, this project explores how forensic authorship analyses can be advanced by employing statistical and computational methods and a large corpus of German social media posts. This project focuses on geolinguistic profiling to infer an author’s regional background by utilising regional linguistic variation 1) as a reference tool for qualitative analyses and 2) for the computational prediction of (dialect) regions. Both aim to enhance the reliability and reproducibility of forensic authorship analysis.
  • March 7: Michal Stefanik, Masaryk University (on-site)
    • Title: Robustness of Language Models and Perspectives of Modularization [video]
    • Abstract: As the abilities of Language Models are heavily conditioned by their training data, their practical quality shows best on data that substantially differs from their training mix. Data from other sources, often called out-of-distribution (OOD) data, is commonly used to evaluate models’ practical quality. In this talk, we’ll take a look at different ways to make Language Models more robust to OOD data, outlining promising directions for low-resource scenarios. We will also discuss the blind spots of OOD evaluations that remain commonly overlooked even in very recent work. Finally, we’ll take a look at the prospective role of modularization in creating more practical and robust models, with a short survey of what a “module” can constitute and the limitations of existing modularization techniques.
  • March 14: Tommi Buder-Gröndahl (on-site)
    • Title: Linguistic Representations in Large Language Models: Some Foundational Problems [video]
    • Abstract: Due to the black-box status of large language models (LLMs), increasing their explainability on a human-readable level has become a central requirement. As part of this endeavour, numerous studies have aimed to interpret LLM-internal embeddings via familiar linguistic formalisms, such as phrase structures or dependency graphs. The most prominent methodology here is called probing, and involves mapping embeddings (or their sequences) to pre-defined linguistic target labels (a minimal sketch of such a probe appears after the schedule). Moreover, the probing literature has regularly endorsed the idea that LLMs internally represent such linguistic structures. However, it is unclear how this claim should be interpreted in the first place: some readings of “linguistic representation” would make assigning them to LLMs trivially true, while others would make it trivially false. Finding an appropriate middle ground is necessary for making the claim informative to begin with, but turns out to be more difficult than expected. In this presentation, I give an overview of how the problem plays out both in the standard supervised probing paradigm and in alternative parameter-free approaches. The take-home message is that the interpretation of LLMs is likely influenced by theoretical artefacts derived from the labeling formalism, which speaks against making swift claims of linguistic representation.
  • March 21 – EACL
  • March 28 – Easter holidays
  • April 4: Wietse de Vries, Rijksuniversiteit Groningen (remote)
    • Title: Evaluation and Adaptation of Language Models for Under-Resourced Languages [video]
    • Abstract: The development and application of pre-trained language models have focussed on English. Other languages are primarily served either by replicating what works for English or by using cross-lingual transfer learning with English as the source language. In this talk, I will present some projects that evaluate and challenge the effectiveness of this process. Firstly, we probe language models for Dutch linguistic information and find results similar to those for equivalent English models. Then we evaluate the state of Dutch language modelling with a novel Dutch benchmark and show that even English monolingual models can perform decently for Dutch after fine-tuning. Taking this a step further, we show that we can generate realistic Dutch and Italian text by only retraining the word embeddings of GPT-2 (a minimal code sketch of this idea appears after the schedule). Using a similar procedure, Germanic monolingual models can perform well on POS tagging for Frisian and Gronings (a Low Saxon variant). Finally, I give insight into how cross-lingual training would perform for other language families by fine-tuning and testing XLM-RoBERTa on over a hundred different languages, and show that English is generally not the best source language for other (unrelated) target languages. Cross-lingual modelling can be effective, but we should focus on non-English source languages.
  • April 11: Sofoklis Kakouros (on-site)
    • Title: Large speech and language models in prosody research [video]
    • Abstract: Recent advancements in large speech and language models have important implications for the modeling and comprehension of prosody, raising crucial questions about the depth of prosodic information these models can encode, their understanding of speech’s prosodic elements, and their ability to integrate prosody into language understanding. In this presentation, I will explore the use of Large Language Models (LLMs) to examine the intricate relationship between word predictability, emotion recognition, and word prominence. Specifically, I will discuss how word surprisal—a measure of a word’s predictability—correlates with word prominence, thereby illuminating distinct facets of language use and its potential applications in emotion recognition (a minimal code sketch of surprisal estimation appears after the schedule). Additionally, I will introduce studies that utilize large speech models, demonstrating their effectiveness in tasks like dialect identification and how they contribute to our understanding of linguistic variability. Moreover, I will present the development of innovative pooling methods, such as correlation pooling, for prosody. These methods mark a considerable advancement in the extraction and analysis of acoustic features, leading to enhanced accuracy in prosodic tasks, including emotion recognition.
  • April 18: Esther Ploeger, Aalborg University (on-site)
    • Title: Merging Close and Distant Perspectives on Language: Using Linguistic Typology in NLP [video]
    • Abstract: Research in NLP has increasingly taken a distant perspective on language, as state-of-the-art performance in multilingual language modelling is achieved by leveraging large volumes of text data. Still, this approach poses major challenges. In low-resource settings, where such data is not available, performance is trailing. Moreover, this data-driven view is often not resource-efficient. At the same time, we have detailed information about many of the world’s languages. Linguists have long taken a closer perspective on language, resulting, for instance, in linguistic grammars. These grammars are used in linguistic typology to study structural similarities and differences across languages. In this talk, I will give an overview of the problems that arise in merging these close and distant perspectives on language when it comes to interpreting, evaluating, and improving multilingual capabilities in NLP.
  • April 25: Jussi Karlgren, Silo.AI (on-site)
    • Title: How good are generative language models at their job (and what is the job)? [slides][video]
    • Abstract: There are numerous quality assessment metrics, datasets, and test collections to gauge the quality of generative language models. Many are useful; most are culturally specific; some measure capacities that arguably lie outside the remit of language models. I will discuss these issues, with some examples, and present the ELOQUENT shared tasks which are currently being run on a large number of research models. Suggestions for future tasks are welcomed!
  • May 2: Janine Siewert (on-site)
    • Title: Low Saxon corpus-based dialectometry [video]
    • Abstract: As a low-resource language without an interregional standard variety, Low Saxon belongs to the majority of the world’s languages that pose challenges to NLP and have thus remained under-researched. Another problem is that it is spoken in several countries and traditionally treated as dialects of the respective majority languages. As a result, linguistic research on Low Saxon has commonly stayed within national borders, leading to a scarcity of cross-border research and resources. This makes it difficult to trace the influence of language contact on diachronic change. During my PhD, I have created a diachronic corpus of Low Saxon covering both the Netherlands and Germany. I will discuss challenges related to automatic lemmatisation and present the UD dataset based on my corpus. Finally, I will compare changes in the dialectal usage of modal verbs to dialect change based on n-grams of PoS tags and morphological features.
  • May 9 – Ascension Day
  • May 16: John Bateman, University of Bremen (on-site)
    • Title: What are Large Language Models doing? A semiotic perspective [video]
    • Abstract: Although an ever-growing number of applications celebrate the capabilities of large language models (LLMs), there is still considerable disagreement concerning just what such models are doing, and what they are not doing. Several kinds of techniques have been developed for exploring the abstractions that appear to have been learned by such models and then relating these further to various aspects of linguistic theory. Some researchers have consequently claimed that such models are inherently restricted to form and so cannot capture ‘meaning’ at all (e.g., Bender/Koller), while others appear to consider it quite natural to refer to ‘understanding’ whenever LLMs are used (often, but not only, developers). In this talk, I address the challenge of ascertaining what an LLM might be doing from a semiotic perspective. By applying a more differentiating view of the linguistic system adopted from functional linguistics, I explore whether it might be possible to tease apart distinct aspects of ‘meaning’ that can then be used to characterise the abilities and non-abilities of LLMs more precisely than has been achieved in formal linguistic approaches, which have tended to rely on more straightforward characterisations such as the bare meaning-form distinction itself. I consider how this might impact how users should consider the outputs of LLMs in general, both currently and in the future, with particular attention to misunderstandings concerning the relations between interactional plausibility and truth, between explicit training and varied forms of acquired knowledge, and between generality and semiotic abstraction.
  • May 23 – LREC/COLING
  • May 30: Anna Dmitrieva
    • Title: Towards Automatic Finnish Text Simplification [video][slides]
    • Abstract: In this presentation, I will talk about expanding the Finnish-Easy Finnish parallel corpus and creating baseline simplification models. Two new parallel datasets were created: a document-aligned Finnish-Easy Finnish parallel dataset spanning the years 2014 to 2018, and a sentence-aligned dataset spanning the years 2014 to 2020. We also fine-tuned two models for text simplification. The first is mBART, a sequence-to-sequence denoising auto-encoder that has shown good results for monolingual translation tasks. The second is the Finnish GPT model, for which we utilize instruction fine-tuning. This work is the first attempt to create simplification models for Finnish using monolingual parallel data in this language. The data has been deposited in the Finnish Language Bank (Kielipankki) and is available for non-commercial use, and the models are accessible through Hugging Face.
  • June 13: Martijn Wieling, Rijksuniversiteit Groningen (postponed)
  • June 20 – NAACL
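
Code Sketches

For Alan Ansell’s talk on sparse fine-tuning (February 15): a minimal sketch, in Python/PyTorch, of one common flavour of the idea, in which an update step touches only a pre-selected sparse subset of parameters. The model, batch format, and masks are assumptions (a Hugging Face-style model returning a .loss, and masks chosen in advance, e.g. from the largest parameter changes after a first full fine-tuning pass); this is an illustration, not the speaker’s exact method.

import torch

def sparse_finetune_step(model, batch, optimizer, masks):
    # One update step that only touches the parameters selected by `masks`,
    # a dict mapping parameter names to 0/1 tensors of the same shape.
    loss = model(**batch).loss  # assumes a HF-style model with labels in `batch`
    loss.backward()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is not None:
                param.grad *= masks[name]  # zero the gradients of frozen entries
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()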
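For Tommi Buder-Gröndahl’s talk (March 14): a minimal sketch of a supervised probe, i.e. a linear classifier mapping frozen model embeddings to pre-defined linguistic target labels. The embeddings and POS labels below are random placeholders for what one would extract from a real model and an annotated corpus.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))    # placeholder for real hidden states
pos_labels = rng.integers(0, 17, size=1000)  # placeholder for real UPOS tags

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, pos_labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High probe accuracy (on real data) is often read as evidence that the model
# "represents" the labelled structure -- the very inference the talk questions.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")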
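For Wietse de Vries’s talk (April 4): a minimal sketch of the lexical-adaptation idea, keeping the GPT-2 transformer body frozen and retraining only the embeddings on target-language text. Tokenizer replacement and the training loop are omitted; note that GPT-2 ties its output layer to the token embeddings, so the output layer follows along.

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
for param in model.parameters():
    param.requires_grad = False                    # freeze everything ...
model.transformer.wte.weight.requires_grad = True  # ... except token embeddings
model.transformer.wpe.weight.requires_grad = True  # ... and position embeddings

# Only the two embedding matrices remain trainable:
print([n for n, p in model.named_parameters() if p.requires_grad])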
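For Sofoklis Kakouros’s talk (April 11): a minimal sketch of estimating word surprisal, -log2 P(w_t | w_<t), with a causal language model. GPT-2 is used purely as an example here; the talk does not prescribe a specific model.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# The surprisal of token t comes from the distribution predicted at position t-1.
surprisals = -log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)
surprisals /= torch.log(torch.tensor(2.0))  # convert nats to bits

for tok, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisals):
    print(f"{tok:>10s}  {s.item():.2f} bits")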