This is the timetable of the general research seminar in language technology for spring 2023. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. We will alternate between online seminars (on Zoom) and in-person seminars at the University of Helsinki. The latter may also be broadcast so that they can be followed online. Details will be announced for each individual talk. Please register for our mailing list below.
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to email@example.com with the following text in the body of your e-mail:
subscribe lt-research firstname.lastname@example.org
Tentative Schedule – Spring 2023
- January 12: Kyunghyun Cho (New York University)
- IMPORTANT: live talk in room 5 at Metsätalo (Unioninkatu 5)
- Title: Generative modeling for biological sequence – natural language generation to protein design [slides][video]
- Abstract: Prescient Design at Genentech investigates, develops and deploys a machine learning driven platform for protein design. At the core of Prescient Design’s strategy is the deep manifold sampler which is a non-autoregressive sequence denoising autoencoder combined with a function predictor. In this talk, I will describe the foundations behind the deep manifold sampler. In the first part, I will discuss the background on generative modeling in depth, starting from energy-based modeling, restricted Boltzmann machines to denoising autoencoders. This will be followed by the second part in which sampling from a discriminative model, such as a classifier, will be discussed carefully, particularly focusing on their shortcomings. In the final part of the talk, I will talk about how these two paradigms can be combined to complement each other and present the deep manifold sampler as a representative example specifically in the context of sequence-level protein design.
- Bio: Kyunghyun Cho is an associate professor of computer science and data science at New York University and a CIFAR Fellow of Learning in Machines & Brains. He is also a senior director of frontier research at the Prescient Design team within Genentech Research & Early Development (gRED). He was a research scientist at Facebook AI Research from June 2017 to May 2020 and a postdoctoral fellow at University of Montreal until Summer 2015 under the supervision of Prof. Yoshua Bengio, after receiving MSc and PhD degrees from Aalto University in April 2011 and April 2014, respectively, under the supervision of Prof. Juha Karhunen, Dr. Tapani Raiko and Dr. Alexander Ilin. He received the Samsung Ho-Am Prize in Engineering in 2021. He tries his best to find a balance among machine learning, natural language processing, and life, but almost always fails to do so.
- January 19: Ona de Gibert Bonet
- Title: Exploring Machine Translation Methods for the Promotion of Minority Languages [video]
- Abstract: The preservation and promotion of minority languages has become a crucial issue as the number of endangered languages continues to grow. Language technology has the capacity to play a significant role in supporting these languages. Machine translation, in particular, has the potential to greatly benefit these languages by increasing access to information and resources, as well as facilitating communication between speakers. We will present two research projects that demonstrate its potential. The first project is a case study on semantic-based parallel corpus filtering for the Catalan-English language pair. Through this project, we aim to demonstrate the effectiveness of parallel corpus filtering in improving machine translation performance for under-resourced languages. The second project expands on previous work done on unsupervised machine translation by applying it to multiple languages and domains in low-resource settings. Through this project, we aim to demonstrate the promise of unsupervised machine translation in supporting minority languages and overcoming data scarcity issues. In conclusion, this talk will provide insights into the practical application of machine translation as a tool for supporting minority languages.
- January 26: Shaoxiong Ji
- February 2: Tommi Nieminen
- Title: Integration and adaptation: bringing open-source machine translation to professional translators [video]
- Abstract: Neural machine translation (NMT) has been widely and successfully adopted in the translation industry in recent years. Almost all of the MT systems used in the translation industry are either non-public systems developed for the translation needs of a specific organization, or publicly available paid services offered by MT service providers (such as Google, DeepL or ModernMT). In the context of professional translation, an MT system is useful only if it meets two conditions: 1. it can be integrated into standard translation tools, and 2. it can be adapted to provide translations that are consistent with the stylistic and other requirements of the client. I will discuss these two aspects of MT (integration and adaptation) in relation to OPUS-CAT, which is an open-source alternative to the aforementioned commercial MT systems. OPUS-CAT combines the Marian NMT framework with the OPUS-MT models, integrates them with popular translation tools, and provides functionalities for adapting the general-domain OPUS-MT models for specific clients and domains. OPUS-CAT runs locally on the translator’s computer, which makes it inherently secure and confidential. Local MT generation is also more environmentally friendly, since no online services are required. OPUS-CAT is already used by many organizations and freelance translators, and its user base has grown steadily. From the point of view of research, the value of OPUS-CAT lies in providing a complete, production-ready MT system, which can be used for user studies and as an open-source basis for quickly developing and testing new forms of MT-assisted translation.
- February 16: Dennis Ulmer (ITU Copenhagen)
- Title: Uncertainty Quantification for Natural Language Processing: Current State & Open Questions [video]
- Abstract: Determining model confidence, or, conversely, uncertainty, is an important means to instill trust in end users and avert harm. While there exist many works on images and tabular data, obtaining reliable uncertainty estimates from neural networks remains underexplored in Natural Language Processing (NLP), which presents unique challenges compared to other modalities. I will briefly discuss the state of uncertainty quantification in Deep Learning and NLP, before presenting results from our own recent EMNLP paper “Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity”. Afterwards, I will dive into open problems and future work with respect to Natural Language Generation and Large Language Models.
- March 2: Wen Lai (LMU Munich)
- Title: Domain Adaptation for Machine Translation in Few-Shot Setting [video]
- Abstract: The success of Neural Machine Translation relies heavily on large-scale, high-quality parallel data, which is difficult to obtain in some domains. The simplest method for NMT domain adaptation is to fine-tune the general-domain machine translation model on out-of-domain data. However, this approach has some limitations: i) when fine-tuning on a new domain, catastrophic forgetting reduces the performance on other domains; ii) when new domains come in, a model fine-tuned on one specific domain no longer performs well on the new, unseen domains. To this end, in this talk, I will present our recent work addressing the above limitations from a bilingual and a multilingual perspective, respectively. More specifically, we present two effective methods that address the NMT domain adaptation problem in the few-shot setting and are shown to maintain domain robustness and domain adaptability at the same time.
- (March 9 – reading week)
- March 16: Anna Dmitrieva
- This presentation will consist of two parts, each part covering one of my recent projects. [video]
- Part 1. Automatic text simplification of Russian texts using control tokens
- This part describes the research on the possibilities to control automatic text simplification with special tokens that allow modifying the length, paraphrasing degree, syntactic complexity, and the CEFR (Common European Framework of Reference) grade level of the output texts, i.e., the level of language proficiency a non-native speaker would need to understand them. I focused on Russian texts and aimed to continue and broaden the existing research on controlled Russian text simplification. I did so by exploring available datasets for monolingual Russian machine translation (paraphrasing and simplification), experimenting with various model architectures, and adding control tokens that have not previously been used on Russian texts.
- Part 2. Creating a parallel Finnish-Easy Finnish dataset from news articles
- This part describes the creation of a parallel Finnish-Easy Finnish dataset from the Yle News archives. The dataset contains 1920 pairs of articles, each containing a piece of news in Easy Finnish and a corresponding article from Standard Finnish news. This new aligned resource was created automatically based on the Yle News archives from the Language Bank of Finland (Kielipankki) and manually checked by a human expert. This resource will allow for more effective Easy Language research as well as for creating applications for automatic simplification and/or summarization of Finnish texts. The dataset is now available for download from Kielipankki: http://urn.fi/urn:nbn:fi:lb-2022111625.
- March 23: Teemu Vahtola
- Title: How well do large language models represent antonymy and negation? [video]
- Abstract: The ability of large language models to embed essential linguistic properties makes them applicable across a wide range of tasks. However, it is still an open question how well they cope with systematic compositionality and to what level of abstraction they reflect the actual meaning behind a given sentence.
In this presentation, I present SemAntoNeg, a dedicated test suite focusing on a specific linguistic phenomenon. In particular, the test suite assesses the ability of language models to properly represent paraphrasticity, realised in expressions that incorporate antonymy and negation. Finally, I present a large scale evaluation of publicly available pretrained language models on the proposed test suite, showing large variation between models fine-tuned for different downstream tasks.
- April 13: Anisia Katinskaia
- Title: Assessing Grammatical Correctness in Language Learning
- (May 4 – EACL)
- May 11: Maria Copot (Université Paris Cité) and Sara Court (Ohio State University)
- Title: Rethinking Tools for the Morphosyntactic Analysis of Underdocumented Languages [slides] [video]
- Abstract: Recent decades have seen an increase in efforts to document endangered and underdocumented languages, thanks in part to technological innovations that allow for the collection, analysis, and storage of multimedia resources in unprecedented ways. Machine learning shows promise as a tool to make such efforts faster and more efficient, but continues to face the inherent bottleneck of data scarcity. We propose a novel workflow for the morphosyntactic annotation of underdocumented languages that relies on combining the power of unsupervised and active learning with human-in-the-loop feedback from members of the linguistic community of interest. The method is built within the theoretical framework of Word-and-Paradigm morphology: words do not need to be segmented, or assigned morphosyntactic features, though this is possible at later stages of the analysis. The annotation process consists of determining whether a word in context “belongs with” other words suggested by the model. The simplicity of the annotation system makes it straightforward to involve members of the target linguistic community, giving them greater agency over the process of documentary fieldwork. A pilot with speakers of Wao Terero, an endangered language in the Ecuadorian Amazon, confirmed that consultants could perform annotation with minimal training, and found the task easy and worthwhile.
- Bio Maria: Maria is a PhD student at the Université Paris Cité, and works at the intersection of linguistic theory and cognitive science, by means of quantitative methodologies. Their research centres on the subfield of morphology, which is concerned with the structure of humans’ knowledge of words. Maria’s work uses quantitative analysis of corpus and experimental data to probe the structure of human knowledge of words, and resulting human behaviours, by examining the interaction of memory and structured generalisations.
- Bio Sara: Sara is a PhD student of linguistics at the Ohio State University in the US. Her research applies computational methods to the analysis of linguistic phenomena in low-resource and language contact settings. Sara’s work takes a typological perspective, integrating frameworks from theoretical morphology, contact linguistics, and language documentation to develop new infrastructures for collaboration among scholars and speakers of the world’s languages.
- (May 25 – NoDaLiDa 22-24 May)
- June 1: Jue Hou (University of Helsinki)
- Title: Linguistic Constructs Represent the Domain Model in Intelligent Language Tutoring [slides] [video]
- Abstract: We present the development of the AI-based language-learning platform, Revita. It is an intelligent online tutor, developed to support learners of multiple languages, from lower-intermediate to advanced levels. It has been in pilot use with hundreds of students at several universities, whose feedback and needs shape the development. One of the main emerging features of Revita is the system of linguistic constructs to represent the domain knowledge. The system of constructs is developed in collaboration with experts in language pedagogy. Constructs define the types of exercises, and the content of the feedback, and enable detailed modeling and evaluation of learner progress.
- June 8
- June 15: CrowdMT workshop at EAMT
- (June 22)