Research Seminar in Language Technology – Fall 2020

2015-2016 / 2016-2017 / 2017-2018 / 2018-2019 / 2019-2020 / 2020-2021

General Information

This is the timetable of the general research seminar in language technology for fall 2020. Presentations will be given primarily by doctoral students, but also by other researchers in language technology, including external guests. Because of the current situation we will organize the seminars as online meetings via Zoom. The link to the Zoom session will be sent together with the announcement of each talk. Please register for our mailing list below.

Place and Time:
Thursdays 14:15-15:00, Zoom room

Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to majordomo@helsinki.fi with the following text in the body of your e-mail:

subscribe lt-research your.email@helsinki.fi

Tentative Schedule – Fall 2020

  • October 9: PhD Defense Senka Drobac (12:00)
  • October 15: Lidia Pivovarova: Detection of word meaning shifts
    • Abstract: Each word has a variety of senses and connotations, constantly evolving through usage in social interactions and changes in cultural and social practices. Identifying and understanding these changes is important for linguistic research and social analysis, since it allows the detection of cultural and linguistic trends and, possibly, the prediction of future changes. Detection of these changes can also be used to improve many NLP tasks, such as text classification and information retrieval. A large majority of modern methods for semantic shift detection leverage word embeddings. The most recent contextualized embedding models open new opportunities for automatic detection of word semantic change over time. Contextualized embedding models produce a separate vector for each word mention in the corpus. Differences in these vectors across time can be generalized into a measurable word meaning shift. Results obtained using contextualized embeddings are potentially more interpretable than models based on static word embeddings such as word2vec, though in some cases static embeddings demonstrate higher performance. A novel method for word meaning shift detection, based on clustering of BERT embeddings, will be presented in the talk. (A minimal illustrative sketch of the clustering idea follows the paper list below.)
      The talk is based on two recent papers:
    • Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova. Capturing evolution in word usage: Just add more clusters? Temporal Web Analytics Workshop, 2020
    • Matej Martinc, Syrielle Montariol, Lidia Pivovarova, and Elaine Zosa. Discovery team at SemEval-2020 Task 1: Context-sensitive embeddings not always better than static for semantic change detection. SemEval, 2020
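Below is a minimal sketch (not the authors' implementation) of the clustering idea from the abstract above: embed every occurrence of a target word with a pretrained BERT model, cluster the pooled occurrence vectors, and compare the cluster distributions of two time periods with Jensen-Shannon divergence. The model name, the inputs `sentences_t1`/`sentences_t2`, the number of clusters, and the single-wordpiece assumption for the target are illustrative choices, not details from the papers.

```python
# Minimal sketch: a meaning-shift score for one target word across two time slices.
# Assumes sentences_t1 and sentences_t2 are lists of sentences containing the target,
# and that the target word is kept as a single wordpiece by the tokenizer.
import numpy as np
import torch
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def occurrence_vectors(sentences, target):
    """One contextual vector per occurrence of `target`."""
    vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            states = model(**enc).last_hidden_state[0]            # (tokens, dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        for i, tok in enumerate(tokens):
            if tok == target:
                vectors.append(states[i].numpy())
    return np.array(vectors)

def shift_score(sentences_t1, sentences_t2, target, k=5):
    """Cluster pooled occurrence vectors; compare usage distributions of the periods."""
    v1 = occurrence_vectors(sentences_t1, target)
    v2 = occurrence_vectors(sentences_t2, target)
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(np.vstack([v1, v2]))
    p = np.bincount(labels[: len(v1)], minlength=k) / len(v1)
    q = np.bincount(labels[len(v1):], minlength=k) / len(v2)
    return jensenshannon(p, q)    # larger value = larger apparent meaning shift
```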
  • October 22: Elaine Zosa: Topic modelling and historical newspapers
    • Topic modelling is a method for extracting latent topics from a collection of documents. It is a popular tool in data-driven historical research for analysing large-scale historical collections such as newspaper archives. This talk is based on two recent works on the use of topic modelling on historical newspaper corpora. The first part of my talk will discuss our work on uncovering discourses in a diachronic text collection and tracking their prominence across time. We are interested in grasping broad societal topics, discourses that cannot be reduced to mere words, isolated events, or particular people. Our long-term goal is to investigate global change in the presence of such topics, and especially to find discourses that have disappeared or declined and could thus easily slip away from modern research. We believe that these research questions are better approached in a data-driven way, without necessarily deciding beforehand what we are looking for. To this end we applied unsupervised topic modelling methods to a collection of nineteenth-century Finnish newspapers and analysed discourse dynamics over time through a humanistic interpretation. In the second part I will talk about our recent effort to quantify the robustness of embedding-based topic models to OCR noise in digitised newspapers. Embedding-based topic models incorporate information from word embeddings directly during training and have been shown to improve topic quality on clean data, but these models have not seen much use in historical text analysis. In this work, we evaluate the robustness of two embedding-based topic models, Gaussian LDA and the Embedded Topic Model, to OCR noise. Our results show that these models are more resilient than LDA to noisy data on a variety of measures.
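As a rough illustration of the kind of comparison described in the second part of the abstract, the sketch below trains a plain LDA baseline with gensim and reports topic coherence, which one could compute separately for clean and OCR-noisy versions of a corpus. This is not the Gaussian LDA or Embedded Topic Model setup from the work itself; `docs_clean` and `docs_noisy`, the filtering thresholds, and the topic count are placeholder choices.

```python
# Minimal sketch: a plain LDA baseline with gensim plus a topic-coherence score,
# the kind of measure one could compare on clean vs. OCR-noisy versions of a corpus.
# docs_clean and docs_noisy are assumed to be lists of token lists (placeholders).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def topic_coherence(docs, num_topics=20):
    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare/common terms
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=0)
    coherence = CoherenceModel(model=lda, texts=docs,
                               dictionary=dictionary, coherence="c_v")
    return coherence.get_coherence()

# Compare, e.g.: topic_coherence(docs_clean) vs. topic_coherence(docs_noisy)
```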
  • October 29: Hande Celikkanat: How I Learned to Stop Worrying and Love the Uncertainty: An Introduction to Representing Uncertainty in Language Models.
    • State-of-the-art language models, in their never-ending endeavour to fit the available data as well as possible, pretend that there is no uncertainty in the world. They try to find the best possible parameters over deterministic data points, giving us 100% certain predictions even in cases where they do not have much of a clue. In this talk I will walk us through the motivations for allowing uncertainty in our models, some basic (Bayesian) strategies for doing so, and the difficulties along the way. Pretending that uncertainty does not exist is convenient, because it is rather difficult and computationally costly to model. I will talk about basic Bayesian approximations and how they are utilized in a couple of important models, including Monte Carlo Dropout and Variational Auto-Encoders, to capture some of the uncertainty in the world. We will also talk about the shortcomings of these fundamental models, which will pave the way for a second chapter in this series of talks on the topic.
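For readers unfamiliar with Monte Carlo Dropout, the sketch below shows the basic recipe in PyTorch: keep dropout active at inference time and average several stochastic forward passes, treating the spread of the predictions as a rough uncertainty estimate. The toy classifier and all hyperparameters are illustrative assumptions, not material from the talk.

```python
# Minimal sketch of Monte Carlo Dropout in PyTorch: keep dropout stochastic at test
# time and average several forward passes; the spread of the class probabilities is a
# rough uncertainty estimate. The classifier below is a toy placeholder.
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    def __init__(self, in_dim=100, hidden=64, n_classes=3, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                       # .train() keeps dropout active during inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)   # predictive mean and per-class spread

# mean, spread = mc_dropout_predict(ToyClassifier(), torch.randn(8, 100))
```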
  • November 5: Andrey Indukaev: Automating text annotation when doing interpretative social science: what can state-of-the-art supervised machine learning do?

    • What can recent developments in machine learning bring to disciplines characterized by an interpretative approach to research, such as history, cultural sociology and the like? Interpretative research presumes that social action can only be studied if one takes into account what that action means for those performing it and for those affected by it. To put it simply, interpretative research has an inherent “bottom-up” aspect, where the categories of analysis, and even the definition of the research object, emerge from the study. In other words, the researcher’s analytical categories and research objects are constructed and revised in light of the empirical examination of the webs of meaning in which the studied phenomenon is embedded. At first sight, unsupervised machine learning has an affinity with this type of research, as some researchers have already suggested. In this presentation, I address the specificities of applying supervised machine learning in research that relies heavily on interpretation. First, I focus on an essential but often overlooked activity: the production of training data. I suggest that this process not only relies heavily on interpretation, but may also lead to advances in the interpretative analysis of data, i.e. the updating of analytical categories. In particular, measures of inter-annotator agreement can be used not only to assess the quality of annotated data, but also to reveal problematic annotation categories and, as a consequence, weak points in the theories that informed category selection. Second, I suggest that the results of automated annotation, in particular inter-category discrepancies in classification accuracy, can also provide hints for the interpretative analysis. The demonstration is based on ongoing work on automating text annotation informed by “Justification Theory” by Luc Boltanski and Laurent Thévenot, an influential social theory providing a typology of socially shared perspectives on value and worth. In this project, a BERT model is trained on multi-class, multi-label annotations produced by expert annotators.
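One concrete form of the idea that inter-annotator agreement can expose problematic categories is to compute agreement per category. The sketch below does this with per-category Cohen's kappa from scikit-learn for two annotators on a multi-label task; the binary annotation matrices and the example category names are hypothetical, and this is not claimed to be the project's actual procedure.

```python
# Minimal sketch: per-category Cohen's kappa for two annotators on a multi-label task.
# A low kappa for a category hints that the category (or its definition) is problematic.
# ann1 and ann2 are assumed to be arrays of shape (n_texts, n_categories) with 0/1 entries;
# the category names below are hypothetical.
from sklearn.metrics import cohen_kappa_score

def per_category_kappa(ann1, ann2, category_names):
    scores = {name: cohen_kappa_score(ann1[:, j], ann2[:, j])
              for j, name in enumerate(category_names)}
    # Sort ascending so the most problematic categories come first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1]))

# per_category_kappa(ann1, ann2, ["market", "civic", "industrial"])
```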
  • November 12: Tatu Ylönen: Inside-out parsing of Wikitext and extraction of a large multilingual machine-readable dictionary with translations, pronunciations, inflections, and more

    • I will talk about the syntax used in Wikimedia projects (e.g., Wiktionary, Wikipedia) and how to parse and expand their contents (including templates and Lua modules) programmatically for data extraction. The syntax consists of multiple layers, some of which process the output of template expansion, and it has token-level ambiguity that makes traditional parsing difficult. An inside-out parsing approach with partial template expansion was developed. As far as I know this is also the first Python tool for general expansion of MediaWiki templates and Lua macros. A large machine-readable dictionary covering thousands of languages with English glosses was then extracted using the tools. The dictionary includes pronunciations, translation linkages, conjugation/declension classes, and linguistic and semantic tags for word senses. The data is output in JSON format, and web-based dictionaries are generated for each language, with download options for subsets for individual languages, semantic categories, linguistic categories, etc. The resource may be useful for NLP applications (e.g., machine translation, semantic parsing, speech recognition) as well as other linguistic research.
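The toy sketch below illustrates what "inside-out" template expansion means in general: repeatedly find the innermost {{...}} calls, i.e. those containing no further braces, and replace them from a table of template definitions until nothing changes. It is only a didactic illustration with a hypothetical template table and drastically simplified syntax handling; it is not the tool described in the talk.

```python
# Toy illustration of "inside-out" template expansion (not the tool from the talk):
# repeatedly find innermost {{name|arg|...}} calls, i.e. ones with no nested braces,
# and replace them from a small table of (hypothetical) template definitions.
import re

TEMPLATES = {
    "l": "{0}",           # hypothetical link template: keep just the word
    "label": "({0}) ",    # hypothetical usage-label template
}

INNERMOST = re.compile(r"\{\{([^{}]*)\}\}")

def expand_once(match):
    parts = [p.strip() for p in match.group(1).split("|")]
    name, args = parts[0], parts[1:]
    body = TEMPLATES.get(name, "")
    return body.format(*args) if args else body

def expand(text, max_rounds=20):
    for _ in range(max_rounds):
        new = INNERMOST.sub(expand_once, text)
        if new == text:               # nothing left to expand
            return new
        text = new
    return text

# expand("{{label|obsolete}}{{l|hello}}")  ->  "(obsolete) hello"
```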
  • November 19: Mikko Kivelä: Polarised communication patterns – structure and content
    • Abstract: Networks are a powerful tool for analyzing social systems. After around a hundred years of social network analysis, and especially after large-scale automatic data collection was made possible by technological advancements, we know a great deal about the structure of social networks. However, apart from the structure much less is known: the context of discussions, the attitudes of the people in the networks, etc. These factors have been shown to be crucial in theoretical models, but joint empirical analysis of network structure and content is less developed. I will illustrate how content can inform structural network analysis, and how network structure can inform content analysis, using political communication on Twitter as a case. First, I will illustrate how the content of the discussion can reveal clear structures in networks that otherwise seem much less structured. Then I will show how structural analysis can be used to find groups, which can then be used to analyze the type of content that glues these groups together. For a better understanding of communication systems, network analysis and language analysis need to be combined, as structure and content are clearly intertwined in human communication and social systems.
      I will present these ideas drawing on the following two recent papers (a minimal illustrative sketch of the group-finding step appears after the list):
    • T. H. Y. Chen, A. Salloum, A. Gronow, T. Ylä-Anttila, M. Kivelä. “Polarization of Climate Politics Results from Partisan Sorting: Evidence from Finnish Twittersphere.” arXiv:2007.02706 (2020)
    • Y. Xia, T. H. Y. Chen, M. Kivelä. “Spread of Tweets in Climate Discussions.” arXiv:2010.09801 (2020)
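As a minimal illustration of the two-step idea in the abstract (find groups from network structure, then inspect the content that holds each group together), the sketch below builds a retweet graph with networkx, detects communities by modularity, and counts the most frequent words per community. The edge list and the per-user text dictionary are hypothetical inputs; this is not the analysis pipeline of the papers above.

```python
# Minimal sketch: find groups from retweet-network structure, then list the words that
# characterise each group. retweet_edges is an iterable of (retweeter, original_author)
# pairs and user_texts maps user -> list of tokenised tweets; both are hypothetical inputs.
from collections import Counter

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def groups_and_words(retweet_edges, user_texts, top_n=10):
    G = nx.Graph()
    G.add_edges_from(retweet_edges)
    communities = greedy_modularity_communities(G)       # structural grouping
    summaries = []
    for users in communities:
        words = Counter(tok for u in users
                        for tweet in user_texts.get(u, [])
                        for tok in tweet)
        summaries.append((len(users), words.most_common(top_n)))
    return summaries
```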
  • November 26: Aleksi Sahala: From signs to semantic analysis: A pipeline for ancient Akkadian cuneiform texts
    • Abstract: Akkadian is an ancient East Semitic Mesopotamian language (2400 BCE – 100 CE), best known as the language of the Babylonians and the Assyrians. Several hundred thousand cuneiform tablets have been excavated, and a part of them (about 3 million words out of 10 million) has been either manually or semi-automatically digitized into more or less annotated text corpora. Due to the extent of spelling variation and the morphological complexity of the Akkadian language, rich annotation is mandatory before any kind of quantitative linguistic, literary or related analysis can feasibly be done. In this presentation I will present some research conducted in Helsinki in 2017-2020 to advance the automatic annotation and quantitative lexicographic analysis of Akkadian texts. This research has been part of the Semantic Domains in Akkadian Texts project and the Centre of Excellence in Ancient Near Eastern Empires. The presentation consists of two topics: first, the use of neural sequence-to-sequence models and finite-state transducers to convert transliterated Akkadian cuneiform text into lemmatized, POS-tagged and morphologically analyzed corpora, and second, improving the performance of word association measures and word embeddings on the sparse and fragmentary Akkadian data.
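For context on the second topic, the sketch below computes a plain PMI word association table from windowed co-occurrence counts, the kind of baseline association measure whose behaviour on sparse, fragmentary data the talk addresses. The corpus format, window size and count threshold are illustrative assumptions, not the project's settings.

```python
# Minimal sketch: a plain PMI association table from windowed co-occurrence counts,
# i.e. PMI(w, v) = log2( count(w, v) * N / (count(w) * count(v)) ).
# docs is assumed to be a list of token lists; window and min_count are placeholders.
import math
from collections import Counter

def pmi_table(docs, window=5, min_count=2):
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for doc in docs:
        for i, w in enumerate(doc):
            word_counts[w] += 1
            total += 1
            for v in doc[i + 1 : i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    pmi = {}
    for (w, v), c in pair_counts.items():
        if c >= min_count:
            pmi[(w, v)] = math.log2(c * total / (word_counts[w] * word_counts[v]))
    return pmi
```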
  • December 3: Jose Camacho Collados: Contextualized Embeddings and Lexical Ambiguity: Word Sense Disambiguation and Beyond
    • Abstract: Embeddings have been one of the most important topics of interest in Natural Language Processing (NLP) for the past decade. Representing knowledge through a low-dimensional vector which is easily integrable in modern machine learning models has played a central role in the development of the field. Embedding techniques initially focused on words, but attention soon started to shift to other forms. Recently, contextualized embeddings such as ELMo or BERT have taken NLP by storm, providing improvements in many downstream tasks. Unlike static word embeddings, contextualized models can dynamically capture the meaning of a word in context. In this talk, I will explain to what extent this is true, showing the main advantages and limitations of current approaches. I will take word sense disambiguation as a proxy to answer these questions, presenting an overview of the field from a language modelling perspective.
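A common way to use contextualized embeddings for word sense disambiguation, and a useful mental model for this talk, is nearest-sense matching: represent each sense by the average contextual vector of a few example sentences and assign a new occurrence to the closest sense. The sketch below implements that idea with a generic BERT encoder; the model name, the pooling strategy and the `sense_examples` structure are illustrative choices, not necessarily the speaker's method.

```python
# Minimal sketch of nearest-sense disambiguation with contextual vectors: each sense is
# the average vector of a few example sentences, and a new occurrence is assigned to the
# closest sense by cosine similarity. Model name and pooling are illustrative choices.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_vector(sentence):
    """Mean of the last hidden states (a crude stand-in for a target-word vector)."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**enc).last_hidden_state[0].mean(dim=0).numpy()

def disambiguate(sentence, sense_examples):
    """sense_examples: dict mapping a sense label to a list of example sentences."""
    v = sentence_vector(sentence)
    best_sense, best_sim = None, -1.0
    for sense, examples in sense_examples.items():
        centroid = np.mean([sentence_vector(e) for e in examples], axis=0)
        sim = float(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid)))
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense, best_sim
```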
  • December 17: Emily Öhman: Cross-lingual expressions of emotions in text
    • Emotions are expressed differently in different types of texts, languages, and cultures. This is the basic premise of my doctoral studies on sentiment analysis and emotion detection. I will present the central tenets of my dissertation work, which deals with emotion detection for a multitude of languages.