Research Seminar in Language Technology – Fall 2019

2015-2016 / 2016-2017 / 2017-2018 / 2018-2019 / 2019-2020

General Information

This is the timetable of the general research seminar in language technology. Presentations are given primarily by doctoral students, but also by other researchers in language technology, including external guests.

Place and Time:
Thursdays 14:15-16, Metsätalo lecture room 26
(group meetings and reading groups are in Metsätalo B610)

Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to majordomo@helsinki.fi with the following text in the body of your e-mail:

subscribe lt-research your.email@helsinki.fi

Tentative Schedule – Autumn 2019

  • September 5: Wrap up from summer conferences
  • September 12: David Mareček
    • Searching for hidden linguistic structures in neural networks 
    • I am going to present my recent work on neural network
      interpretation, more specifically on what linguistic structures are
      latently learned by neural networks when solving downstream tasks like
      machine translation or sentiment analysis. Instead of using probing
      (supervised learning of linguistic features from hidden states), I
      focus on an unsupervised extraction of the features. For instance, I
      am going to show how part-of-speech tags are encoded in word
      embeddings or what syntax is learned inside the transformer’s
      self-attentions.
  • September 19: FoTran Retreat to Tallinn (no seminar!)
  • October 3: Peter Revesz
  • October 17: Tommaso Pasini
    • Different Approaches for Enabling Multilinguality in Word Sense Disambiguation.
    • Word Sense Disambiguation (WSD) lies at the basis of Natural Language Processing and aims at assigning a word to one of its possible meanings depending on the context in which the target word appears. While knowledge-based approaches to WSD have proved flexible enough to scale easily across different languages, they still fall behind supervised approaches which, on the other hand, struggle to generalise over different languages due to the paucity of annotated data.
      In this talk I am going to present a few published and still unpublished works which address this problem and aim at making life easier for WSD models when it comes to scaling over different languages.
      In particular, I will introduce Train-O-Matic, a knowledge-based approach for generating sense-annotated datasets which enable supervised WSD models to scale over different languages; DaD, a knowledge-based approach for automatically inducing the distribution of word senses from a given corpus; a neural approach for encoding documents in a latent cross-lingual space; and a knowledge-based, multilingual approach for building representations of word senses that lie in the same vector space as BERT.
  • October 24: Anssi Yli-Jyrä
    • Transition-Based Coding and Formal Language Theory for Ordered Digraphs
    • Transition-based parsing of natural language uses transition systems to build directed annotation graphs (digraphs) for sentences. In this paper, we define, for an arbitrary ordered digraph, a unique decomposition and a corresponding linear encoding that are associated bijectively with each other via a new transition system. These results give us an efficient and succinct representation for digraphs and sets of digraphs. Based on the system and our analysis of its syntactic properties, we give structural bounds under which the set of encoded digraphs is restricted and becomes a context-free or a regular string language. The context-free restriction is essentially a superset of the encodings used previously to characterize properties of noncrossing digraphs and to solve maximal subgraph problems. The regular restriction with a tight bound is shown to capture the Universal Dependencies v2.4 treebanks in linguistics. I will try to relate the technical results informally to
      • neural dependency parsing (seq2seq parsing and parsing as labeling)
      • finite-state limits of natural language complexity
      • graph-based semantic parsing and the development of a graph-based Constraint Grammar
      • linear representation of translation alignment, with applications to interlinear texts
    • However, the core presentation aims at explaining how the vertex-ordered graphs can be encoded, because this is the problem where the paper contributes to the state of the art.
  • October 31: Tatu Ylönen
    • Title: Interpreting the meaning of text and discourse
    • Abstract: My research aims to develop interpretable meaning representations and natural language parsing and generation methods for computers. I view language use as a communicative process that aims to instill beliefs, emotions, and desires in the recipient. I view the computational interpretation task as understanding the objective, intended meaning, and situation depicted by the speaker/writer, either as the recipient or as an outside observer. I view the discourse context as providing critical background information for interpretation and disambiguation, and likely to make the interpretation problem computationally easier.
      In contrast, most existing work in computational semantics has used logic-oriented representations of the literal meaning of individual sentences, shallow lexical semantic representations, or difficult-to-interpret vector-based representations. Interpreting sentences out-of-context makes it harder to disambiguate intended meanings, understand the scope of various modifiers, and how clauses and sentences connect to each other. It doesn’t help us understand the situation depicted in the communication. The shallow representations sidestep the whole problem of understanding the objective and intended meaning, and vector-based representations are difficult to integrate with symbolic components, such as knowledge bases.
      This early-stage presentation reviews the foundations and starting point of my research and aims to provoke discussion on the underlying assumptions used in most computational linguistics research.
  • November 4: Lunch seminar with Vinit Ravishankar (University of Oslo)
    • NOTE: different day and time!
    • Title: Probing for syntax – cross-lingual and temporal perspectives
    • Place: B610 Metsätalo (coffee room, 6th floor)
    • Time: 12:15
  • November 7: HELDIG Summit
  • November 14: Aarne Talman
    • Predicting Prosodic Prominence from Text with Pre-Trained Contextualized Word Representations
    • In this talk I will describe a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge, this is the largest publicly available dataset with prosodic labels. I will first describe the dataset construction and the resulting benchmark dataset and then present the results of our baseline experiments with different feature-based classifiers and neural network systems. I will show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally, I will discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and the methods for predicting prosodic prominence from text. This is joint work with Antti Suni, Hande Celikkanat, Sofoklis Kakouros, Jörg Tiedemann and Martti Vainio, to be presented at NoDaLiDa 2019.
  • November 19: Lunch seminar with Johannes Graën (University of Zurich)
    • NOTE: different day and time!
    • Place: B610 Metsätalo (coffee room, 6th floor)
    • Time: 12:00
    • Title: Language learning on large parallel corpora: aggregation and classification
    • Abstract: Most applications for parallel corpora aim at extracting general
      knowledge from them, often in the form of language models. Individual
      examples have seen use in the compilation of dictionaries (as good
      dictionary examples, GDEX) or in online concordancers such as Linguee,
      Reverso or Tradooit. Those tools are often labelled as dictionaries,
      although their content has not been curated, and we often see errors due
      to misalignment.
      In my presentation, I want to showcase some applications that we have created with language learners in mind and discuss my current project, which aims at supporting language learners with the help of examples
      from parallel corpora. Besides filtering algorithms, feedback from users
      is envisaged to be used for improving quality (i.e. crowd correction).
  • November 21: Eetu Sjöblom and Mathias Creutz
    • Title: Paraphrasing and error correction
    • A paraphrase is an alternate way of expressing a meaning using other words than in the original utterance. Paraphrasing is interesting for many reasons. Paraphrases allow us to study possible underlying common semantic representations for sentences that essentially mean the same thing but have different surface realizations, such as the sentence pair: “Why don’t you watch your mouth?” vs. “Take care what you say.” Paraphrases can also serve as a useful tool for language users looking for more effective ways of conveying their message. This applies to native and non-native language users alike. Any user makes mistakes in spelling and grammar occasionally. In addition, non-native users, in particular, may be struggling to find fluent, natural sounding idiomatic expressions.
      This talk presents joint work with Eetu Sjöblom and with important contributions from Mikko Aulamo and Yves Scherrer. The talk touches upon paraphrase detection, paraphrase generation, the creation of a paraphrase corpus and the use of paraphrasing for error correction. Neural methods are applied. Experiments are carried out on six European languages: Finnish, Swedish, English, German, French and Russian, with particular focus on the first three languages mentioned.
  • November 28: Dorota Glowacka and Alan Medlar
    • Manuscripts and newspapers: Using machine learning to map the sciences and identify political bias
    • Dorota Glowacka and Alan Medlar will talk about their research on using topic models to analyse scientific literature. This work stems from a concern that the content and/or structure of scientific articles could impact the results of interactive information retrieval experiments. They will show that how well an abstract summarises the full text is subfield-specific, which could adversely affect perceived retrieval performance. They will demonstrate that this is not random, but a consequence of style and writing conventions in different disciplines. Indeed, these features can be used to infer an “evolutionary” tree that highlights the inter-relatedness of different subfields within Computer Science. Finally, they will touch upon their current work on using neural language models to identify political bias in online newspapers.
  • December 5: Raul Vazquez
    • Title: Is understanding embedding spaces a key to perform unsupervised MT?
    • Abstract: I will present advances in my research on (i) unsupervised MT, and (ii) The Flexibridge: A flexible implementation of the attention bridge.
      • (i): Current techniques for unsupervised machine translation propose automated alignment between the word embedding spaces of the source and the target languages. For the most part, such methods have not focused sufficiently on the properties of the embedded points they seek to align.
      • (ii): The Flexibridge is a modification of an inner-attention-based multilingual translation model originally proposed for the extraction of sentence representations. The model can now handle audio files as input, resulting in promising end-to-end speech-to-text translation capabilities.
  • December 12: Yves Scherrer
    • Mapping dialects – from atlas to corpus data
    • Dialect atlases have always constituted the main data source for quantitative dialect studies. However, most atlases of European dialect areas were compiled and published in the first half of the 20th century, which poses challenges with regard to their accessibility for computational methods. In the first part of this talk, I will discuss some digitization efforts and present interactive visualisations of dialectometrical analyses that have been enabled by digitization.
      In the second part of the talk, I will describe how corpus-based dialectology can benefit from machine translation technology. In the context of the compilation of the multi-dialect ArchiMob corpus, we have used statistical MT to provide a normalization layer. The resulting normalization models allow us to easily study phonological and morphological processes and map their geographical extents. I will also show that a neural MT model trained on the same corpus yields interesting correlations with metadata such as the geographical origin of the texts.
  • December 19: canceled