This is the timetable of the general research seminar in language technology for fall 2021. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. Because of the current situation, we will organize the seminars as online meetings via Zoom. The link to the Zoom session will be sent together with the announcement of each talk. Please subscribe to our mailing list below.
Place and Time:
Thursdays 14:00 – 15:00, Zoom room
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to firstname.lastname@example.org with the following text in the body of your e-mail:
subscribe lt-research email@example.com
Tentative Schedule – Fall 2021
- September 9: Maciej Janicki, University of Helsinki
- Title: Exploring Finnic folk poetry through string similarity
- Related project: The FILTER project
- Abstract: The collections “Suomen Kansan Vanhat Runot” (“Old Poems of the Finnish People”) and “Eesti Regilaulude Andmebaas” (“Estonian Database of Runosongs”) each contain around 100,000 folk songs written in Finnic tetrameter, collected from oral tradition mainly in the 19th and early 20th centuries. The texts are written in several closely related languages, and while many of them show similarities and overlaps, those are obscured by a large amount of dialectal, orthographical, morphological and compositional variation. In this talk, I present a computational method for detecting similar verses, passages and poems based on character bigram cosine similarity. Furthermore, I show how automatic alignment can be used to study variation between similar texts. Finally, I outline possible applications of the method, including studying the regional distribution of poetic formulas, the relationship between printed works (e.g. Kalevala) and collected oral poems, and even discovering cross-lingual and cross-regional similarities.
- September 23: Marianna Apidianaki, University of Helsinki
- Title: Polysemy and sense partitionability in contextual language models
- Abstract: Pre-trained language models (LMs) encode rich information about linguistic structure but their knowledge about lexical polysemy is unclear. In this talk, I will present a novel experimental setup for analysing this knowledge in LMs specifically trained for different languages (English, French, Spanish, and Greek) and in multilingual BERT. The analysis is conducted on datasets carefully designed to reflect different sense distributions, controlling for parameters that are highly correlated with polysemy such as frequency and grammatical category. We demonstrate that BERT-derived representations reflect words’ polysemy level and their partitionability into senses. We show that polysemy-related information is more clearly present in English BERT embeddings, but models in other languages also manage to establish relevant distinctions between words at different polysemy levels. The goal of this study is to contribute to a better understanding of the knowledge encoded in contextualised representations, opening up new avenues for multilingual lexical semantics research. This is joint work with Aina Garí Soler, which appeared in the TACL journal in August 2021.
- September 30: Alessandro Raganato, University of Helsinki
- Title: An Empirical Investigation of Word Alignment Supervision for Zero-Shot Multilingual Neural Machine Translation
- Abstract: Multilingual Neural Machine Translation (MNMT) systems are usually trained with a language label prepended to the input indicating the target language. However, these models have several flaws in zero-shot scenarios where language labels are ignored and the wrong language is generated, showing unstable results.
In this talk, I will present the benefits of an explicit alignment to language labels in Transformer-based MNMT models in the zero-shot context, by jointly training one cross attention head with word alignment supervision to stress the focus on the target language label. We compare and evaluate several MNMT systems on three multilingual MT benchmarks of different sizes, showing that simply supervising one cross attention head to focus both on word alignments and language labels reduces the bias towards translating into the wrong language, improving the zero-shot performance overall. Moreover, as an additional advantage, we find that the alignment supervision leads to more stable results across different training runs.
- This talk is based on a recent paper:
Alessandro Raganato, Raúl Vázquez, Mathias Creutz and Jörg Tiedemann.
An Empirical Investigation of Word Alignment Supervision for Zero-shot Multilingual Neural Machine Translation.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7-11 November 2021 (to appear).
- October 7: Michele Boggia, University of Helsinki
- Title: From the Higgs Boson to Language Technology
- Abstract: In this talk, I will focus on what I’ve been doing in the last few years.
“Research” originates from the Middle French “Recerche”, whose broad meaning is “the process to get something”.
In this sense, it is not just about publications and grant applications, but it is also about career aspirations and personal development.
This talk is about my personal experience, transitioning from a PhD in theoretical physics to NLP research, and the path I took to get to do what I like the most.
The goal of the talk is to encourage cross-disciplinary interactions by showing that taking the unbeaten path, if done wisely, can be rewarding – even if it requires moving out of the comfort zone. Above all, this will be a good opportunity to introduce myself to the Language Technology group, get to know each other a bit better, and build bridges for future collaborations.
- October 14: Lena Voita, University of Edinburgh / University of Amsterdam
- Title: Neural Machine Translation Inside Out
- Abstract: In the last decade, machine translation shifted from traditional statistical approaches (SMT) to end-to-end neural ones (NMT). While traditional approaches split the translation task into several components and use various hand-crafted features, NMT learns the translation task directly from data, without splitting it into subtasks. The main question of this talk is how NMT manages to do this, and I will try to answer it keeping in mind the traditional paradigm. First, I will show that NMT components can take roles corresponding to the features modelled explicitly in SMT. Then I will explain how NMT balances the two different types of context, the source and the prefix of the target sentence. Finally, we will see that NMT training consists of stages in which the model focuses on competences mirroring three core SMT components: target-side language modeling, lexical translation, and reordering.
- October 28: Filip Ginter and Sampo Pyysalo, University of Turku
- Title: Finnish NLP in the Deep Learning Age
- Abstract: Deep transfer learning methods have had a great impact on various machine learning tasks over the last decade, with models based on the Transformer architecture such as GPT and BERT being particularly influential in natural language processing. In this presentation, we discuss recent advances in deep learning methods for Finnish NLP tasks, with particular focus on resources and methods using these technologies created in the TurkuNLP lab of the University of Turku, including tools for syntactic analysis, named entity recognition, sentiment analysis, and text generation.
- November 4: AI Day
- November 11: (EMNLP)
- November 18: Lidia Pivovarova, University of Helsinki
- Title: Grammatical Profiling for Semantic Change Detection
- Abstract: The talk is based on our recent paper (https://aclanthology.org/2021.conll-1.33/), where we investigated whether word meaning change could be predicted based only on morphosyntactic features, without encoding semantics explicitly. We rely on grammatical profiling, which is a corpus-linguistics technique grounded in construction grammar and priming theory. We demonstrate that grammatical profiling outperforms count-based distributional semantic approaches to the task across different languages and datasets. In some cases it even outperforms state-of-the-art results obtained using large pretrained language models. These results call for more attention to the relation between morphosyntax and semantics.
- November 25: Anna Rogers, University of Copenhagen
- Abstract: BERT and its Transformer-based cousins are at the core of most state-of-the-art NLP systems, yet it is still unclear why they work so well, and what could be our next step. This talk presents a collection of findings on how BERT learns and generalizes, where it fails, and whether the lottery ticket hypothesis holds for this architecture.
- Bio: Anna Rogers is a post-doctoral researcher at the University of Copenhagen, focusing on analysis and evaluation of deep learning models for NLP, as well as their impact on society. She is also known for her work on NLP methodology, peer review, and as an organizer of the Workshop on Insights from Negative Results.
- December 9: Olli Kuparinen, University of Helsinki
- Title: A topic modeling approach to dialect analysis and visualization
- Abstract: The talk will serve two purposes:
1) to introduce myself and the recently started CorCoDial project to the group, and
2) to visualize and analyze the dialectal datasets we are about to use in the project with a topic modeling approach.
Topic models are often used to find latent semantic structure in a collection of text documents. For such a task, the texts generally need to be normalized and lemmatized, as differing forms of the same word are not wanted. For dialects, the opposite is true – we are most interested in the differing forms and not as much in semantic meaning. Thus, topic models can be used on phonetically transcribed text directly to obtain components that correspond to dialect areas and features.
In the talk, I present results of applying a non-negative matrix factorization (NMF) model on dialectal datasets from three language areas: Finnish, Norwegian and Swiss German. I will also comment on the effects of different inputs that include complete words, character n-grams, and automatically segmented morphological data.
- December 16: Christmas coffee (in person, we may open a Zoom link as well …)