This is the timetable of the general research seminar in language technology for spring 2022. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. Because of the current situation, we will organize the seminars as online meetings via Zoom. The link to the Zoom session will be sent together with the announcement of each talk. Please subscribe to our mailing list below.
Place and Time:
Thursdays 14:00 – 15:00, Zoom room
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to email@example.com with the following text in the body of your e-mail:
subscribe lt-research firstname.lastname@example.org
Tentative Schedule – Spring 2022
- February 3: Jörg Tiedemann (University of Helsinki): OPUS-MT – machine translation for everyone
- Abstract: OPUS is the major data hub for research and development in machine translation. OPUS-MT builds on this resource and provides public translation models and tools that can easily be deployed and used in research and professional workflows. The goal is to make automatic translation available to everyone with a wide coverage of languages and domains. OPUS-MT comes with open source server solutions and thousands of pre-trained neural translation models. The models can also be plugged into computer-assisted translation frameworks through OPUS-CAT. The core of the project is based on Marian-NMT, an efficient implementation of modern machine translation, but OPUS-MT has also been integrated in the popular transformers library to broaden its application. This talk will report on the current state of the project, including discussions on transfer learning across languages and knowledge distillation for the creation of compact models for real-time translation.
- February 17: Krister Lindén (University of Helsinki):
- Title: FIN-CLARIAH – a national research infrastructure for Social Sciences and Humanities
- Abstract: The project involves all Finnish universities with research in SSH. While historically the SSH have not been at the forefront of the use of digital technology, the field in Finland has the potential to enact such a transformation. The aim of this development project is to ensure that a digital transformation happens in an orderly fashion without duplication of efforts or reinventing the wheel.
FIN-CLARIAH is a national research infrastructure for Social Sciences and Humanities (SSH) comprising two components, FIN-CLARIN and DARIAH-FI. FIN-CLARIAH makes available large collections of textual and multimodal resources as well as tools for analysing and enriching them. The collections span different periods, genres and regions as well as different modalities such as text, audio, pictures and video containing or accompanied by language data.
The FIN-CLARIAH ecosystem has already made great strides, for example by creating unified processes for negotiating research rights to materials by the Language Bank of Finland, and through developing unified access mechanisms for the resulting datasets. Recent advances due to neural network technology, supercomputing availability, and large digital SSH datasets have created clear opportunities and needs for further development of a common SSH data and tools infrastructure.
In the current project, FIN-CLARIAH seeks to significantly broaden the scope of SSH infrastructural support in three major directions: first, to reach beyond the processing of spoken standard Finnish into processing everyday speech; second, to cater to a broad range of SSH research needs for processing unstructured text; and third, to facilitate research based on metadata.
- March 3: Veronika Laippala (University of Turku)
- Title: Multilingual modeling of registers in Web-scale corpora
- Abstract: In this talk, I present an ongoing TurkuNLP research project on the multilingual modeling of Web genres or registers, that is, texts such as news, forum discussions and encyclopedia articles. Web register identification has challenged NLP and linguistics since the early steps of Web-as-corpus research. In particular, the extreme range of linguistic variation found on the Web and the lack of representative register-annotated datasets have prevented the development of robust identification systems. During the presentation, I will first present the CORE corpus series that we have developed following the English CORE, a web register corpus featuring the full, unrestricted Web (Egbert et al. 2015). SweCORE, FinCORE and FreCORE consist of altogether 15,000 manually annotated texts following the same CORE register taxonomy, and additionally, we are collecting small evaluation datasets in more than ten languages. Second, I will discuss register identification experiments on these corpora, showing promising results in both single-language and multilingual settings. Finally, I will show how the analysis of model explanations using Integrated Gradients can help to interpret the model predictions and the model’s conception of the register classes as a whole.
- March 10: Andrey Kutuzov (University of Oslo): Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks
- Abstract: Russian SuperGLUE (RSG) is a recently published benchmark set and leader-board for Russian natural language understanding. We analyze it
to find out whether it features any statistical cues that machine
learning based language models can exploit. It turns out that its test
datasets are indeed vulnerable to shallow heuristics. Often, approaches
based on simple rules outperform or come close to the results of
pre-trained language models like GPT-3 or BERT. It is likely that a
significant part of the SOTA models' performance in the RSG leader-board
is due to exploiting these shallow heuristics, having nothing to do with
real language understanding.
- March 17: Lonneke van der Plas (IDIAP)
- Title: Crossing boundaries: from cross-lingual learning to creative thought
- Abstract: The first part of this presentation is concerned with multilingual
transformer-based language models. The question we are concerned with is
what the nature of the representations learned by these models is, also
after they cross language boundaries. We performed several experiments
where we fine-tuned mBERT in multiple ways for two different tasks,
cross-lingual part-of-speech tagging and natural language inference, and
studied the effect on the language-specific and language-neutral
representations in mBERT.
In the second part of my talk, I will argue in more general terms that
in order to encompass a larger share of what human intelligence entails,
a focus on models that are able to learn more from less data, and
extrapolate beyond the training distribution, possibly inspired by human
cognition, is needed. For example, creative thinking is a human ability
that has been underexplored in computational modelling, even though it
is an important ingredient for engaging technology, ethical AI, and
innovation. I will show some of our recent work on the topic of
computational creativity, more precisely on novel concept generation.
- March 24: Artur Kulmizev (Uppsala University)
- Title: Schrödinger’s Tree — On Syntax and Neural Language Models
- Abstract: In the last half-decade, the field of natural language processing (NLP) has undergone two major transitions: the switch to neural networks as the primary modeling paradigm and the homogenization of the training regime (pre-train, then fine-tune). Amidst this process, language models have emerged as NLP’s workhorse, displaying increasingly fluent generation capabilities and proving to be an indispensable means of knowledge transfer downstream. Due to the otherwise opaque, black-box nature of such models, researchers have employed aspects of linguistic theory in order to characterize their behavior. Questions central to syntax — the study of the hierarchical structure of language — have factored heavily into such work, shedding invaluable insights about models’ inherent biases and their ability to make human-like generalizations. In this talk, I will attempt to take stock of this growing body of literature. Namely, I will call attention to a lack of clarity across numerous dimensions, which influences the hypotheses that researchers form, as well as the conclusions they draw from their findings. To remedy this, I propose some careful considerations when investigating coding properties, selecting representations, and evaluating via downstream tasks. Lastly, I will outline the implications of the different types of research questions exhibited in studies on syntax, as well as the inherent pitfalls of aggregate metrics.
- March 31: Sami Itkonen (University of Helsinki)
- Title: No Mumin Trolls in the valley – Idioms in Finnish and other languages
- Abstract: Idioms and multiword expressions have traditionally posed many challenges for natural language processing. This presentation reviews properties of idioms and outlines two cases of idiom processing: the Helsinki-NLP entry for the SemEval 2022 idiomaticity detection shared task and Finnish idiomatic processing.
- April 7: Aarne Talman (Silo.ai / University of Helsinki)
- April 14: Easter break
- April 21: Magnus Sahlgren (AI Sweden)
- Title: Large-scale (generative) language models for Swedish
- Abstract: This talk gives an overview of the work done at AI Sweden on developing large-scale generative language models for Swedish. We cover the motivation and background for this work, as well as some of the challenges (and opportunities) we face when working on a smaller language such as Swedish.
- April 27-29: Helsinki NLP workshops: CoCorDial and UnGroundNLP
- May 5: Raul Vazquez (University of Helsinki)
- Title: LM and NMT systems are different, but why?
- Abstract: In this talk, I will go through some of my ongoing work at the University of Helsinki, focusing on understanding the differences between prevalent neural-LM and -MT learning objectives. I will present our findings on the differences in the encoder spaces of pretrained models, using contextuality and representational similarity measures. In addition, I will describe the discrepancies we encountered in the training dynamics of these learning objectives when training the models from scratch.
- May 12: Isabelle Augenstein (University of Copenhagen)
- Title: Accountable and Robust Automatic Fact Checking [video]
- Abstract: The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models, in general, has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions to generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generated explanations. Finally, the talk examines how to systematically reveal vulnerabilities of black-box fact checking models.
- May 19: Timothee Mickus (University of Helsinki)
- Title: On the Status of Word Embeddings as Implementations of the Distributional Hypothesis [video]
- Abstract: We study the status of word embeddings insofar as they are relevant to linguistic studies. We more specifically focus on the relation between word embeddings and distributional semantics.
Our first approach to this inquiry consists in comparing word embeddings to some other representation of meaning, namely dictionary definitions. We find limited success, suggesting that either distributional semantics and dictionaries encode different information, or that word embeddings are not linguistically coherent representations of distributional semantics. The second angle we adopt to study the relation between word embeddings and distributional semantics is to compare human judgments on distributional tasks to what we observe for word embeddings. While word embeddings attain some degree of performance on this task, their behavior and that of our human annotators are found to drastically differ. Strengthening these results, we observe that a large family of broadly successful embedding models all exhibit artifacts imputable to the neural network architecture they use, rather than to any semantically meaningful factor.
In all, the linguistic validity of word embeddings is not a solved problem. Interesting perspectives nonetheless emerge from these experiments, ranging from the elaboration of a well-delineated theoretical framework for distributional semantics to quantitative statements on the degree to which two distinct lexical semantic theories depict meaning in the same way.
- May 26: Ascension Day
- June 2: Teemu Vahtola (University of Helsinki)
- Title: Paraphrase detection with noisy labels and generation with syntactic information
- Abstract: This talk will give an overview of some of the recent work conducted within the Behind the words project at the University of Helsinki. I will first present our submission to the 2022 Language Resources and Evaluation Conference. In the submitted work, we study the integration of a noise processing layer into a BERT fine-tuning stage to alleviate the effect of label noise in paraphrase detection. Second, I will present my ongoing work regarding paraphrase generation utilising syntactic information as guidance for performing certain paraphrasing operations motivated by previous research about paraphrasing.
- June 9: Anna Dmitrieva (University of Helsinki)
- Title: Creating a list of word alignments from parallel Russian simplification data
- Abstract: This work describes the development of a list of monolingual word alignments taken from parallel Russian simplification data. A list of word pairs is made where each “source” word has an equivalent “target” word, which is a simpler (shorter, more often used, more modern) synonym of the source word. This list is compiled automatically and post-edited by human experts. Parallel simplification data in the Russian language was used as a data source for this project. Such word lists can be used both in rule-based simplification applications and in lexically constrained decoding for neural machine translation models. Moreover, the list is a valuable source of information for developers of educational materials for L2 teaching.
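
As a rough illustration of the rule-based use of such a word list mentioned in the last abstract, the sketch below substitutes words by dictionary lookup. It is not the method presented in the talk: the `SIMPLER` pairs are invented English placeholders rather than entries from the Russian list, and the `simplify` function is a hypothetical helper. It matches surface forms only, so an application for a morphologically rich language such as Russian would presumably need lemmatization and re-inflection on top of this.

```python
import re

# Hypothetical "source" -> "target" pairs (invented placeholders,
# not data from the project described in the abstract).
SIMPLER = {
    "utilise": "use",
    "commence": "start",
    "purchase": "buy",
}

# Matches runs of word characters; str patterns are Unicode-aware by default.
WORD = re.compile(r"\w+")


def simplify(text, pairs):
    """Replace every word that has an entry in `pairs` with its simpler target."""

    def swap(match):
        word = match.group(0)
        target = pairs.get(word.lower())
        if target is None:
            return word  # no simpler equivalent known, keep the word
        # Preserve an initial capital so sentence starts stay capitalised.
        return target.capitalize() if word[0].isupper() else target

    return WORD.sub(swap, text)


print(simplify("Commence the test, then purchase more.", SIMPLER))
# prints: Start the test, then buy more.
```

Words without an entry pass through unchanged, so the list can be extended incrementally without affecting other text.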