This is the timetable of the general research seminar in language technology. Presentations are given primarily by doctoral students, but also by other researchers in language technology, including external guests.
Place and Time:
Thursdays 14:15-16, Metsätalo lecture room 26
(group meetings and reading groups are in Metsätalo B610)
Registration and Announcements:
Please subscribe to our e-mail list lt-research by sending a message to firstname.lastname@example.org with the following text in the body of your e-mail:
subscribe lt-research email@example.com
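As a sketch of what the subscription request looks like, the snippet below builds such a message with Python's standard email library. The addresses are the placeholders from the announcement above, and actually sending the message (e.g. via smtplib) is left out.

```python
from email.message import EmailMessage

def make_subscribe_message(your_address: str) -> EmailMessage:
    """Hypothetical helper: builds the lt-research subscription request."""
    msg = EmailMessage()
    # List server address is the placeholder given in the announcement.
    msg["To"] = "firstname.lastname@example.org"
    msg["From"] = your_address
    # The list server reads the command from the message body, not the subject.
    msg.set_content(f"subscribe lt-research {your_address}")
    return msg

msg = make_subscribe_message("email@example.com")
print(msg.get_content().strip())  # → subscribe lt-research email@example.com
```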
Tentative Schedule – Spring 2019
- January, 9: Måns Huldén
Wednesday 9.1.2019 at 14-16
Metsätalo Lecture room 24 (Unioninkatu 40, 5th floor)
Title: Linguistics with Neural Networks
Abstract: Neural networks have led to previously unimaginable advances in computational linguistics. The main criticism against them from a linguistic point of view is that neural models – while fine for “language engineering tasks” – are thought of as black boxes, and that their parameter opacity prevents us from discovering new facts about the nature of language itself, or about specific languages. In this talk I will challenge that assumption and show that there are ways to uncover facts about language even with a black-box learner. I will discuss specific experiments with neural models and sound embeddings that reveal new information about the organization of sound systems in human languages, give us insight into the limits of complexity in word formation, offer models of why and when irregular forms – surely an inefficiency in a communication system – can persist over long periods of time, and reveal where the boundaries of pattern learning lie (how much information we minimally need to learn a grammatical aspect of a language, such as its word inflection or sentence formation).
BIO: Mans Hulden is an assistant professor in Linguistics at the University of Colorado Boulder where he also holds an affiliation with the Institute of Cognitive Science. His research focuses on developing computational methods to infer and model linguistic structure using varying degrees of prior linguistic knowledge, particularly in the domains of phonology and morphology. He has worked extensively with linguistic applications of finite state technology, modeling of linguistic theory, grammatical inference, and the development of language resources.
- January, 17: group meeting / reading group
- January, 24: Emily Öhman: New developments with Sentimentator & teaching computational methods.
- January, 31: Mikko Aulamo and Jörg Tiedemann: Fiskmö and OPUS: Tools and interfaces
OPUS is a large collection of freely available parallel corpora. Fiskmö is a project dedicated to collecting Finnish–Swedish translations in order to build better machine translation for the two national languages of Finland. In this presentation we will introduce the resources and tools included in, and currently developed within, both projects. In particular, we will present tools for finding, accessing and converting data in OPUS; the ongoing development of our resource repository software with its frontend and backend; and the machine translation demonstrator that we created for the Fiskmö project.
- February, 7: Kimmo Koskenniemi
Title: Discovering morphophonemes mechanically from carefully chosen examples
More about the method: see morphophon and its implementation on GitHub.
- February, 14: reading group
- February, 21: — no seminar this week —
- February, 28: Tommi Gröndahl: Using logical form encodings for unsupervised linguistic transformation: Theory and applications
Abstract: We present a novel method to architect automatic linguistic transformations for a number of tasks, including controlled grammatical or lexical changes, style transfer, text generation, and machine translation. Our approach consists of creating an abstract representation of a sentence’s meaning and grammar, which we use as input to an encoder-decoder network trained to reproduce the original sentence. Manipulating the abstract representation allows the transformation of sentences according to user-provided parameters, both grammatically and lexically, in any combination. Additionally, the same architecture can be used for controlled text generation, and even unsupervised machine translation, where the network is used to translate between different languages using no parallel corpora outside of a lemma-level dictionary. This strategy holds the promise of enabling many tasks that were hitherto outside the scope of NLP techniques for want of sufficient training data. We provide empirical evidence for the effectiveness of our approach by reproducing and transforming English sentences, and evaluating the results both manually and automatically. A single unsupervised model is used for all tasks. We report BLEU scores between 55.29 and 81.82 for sentence reproduction as well as back-and-forth grammatical transformations between 14 class pairs. Paper on arXiv: https://arxiv.org/abs/1902.09381
- March, 7: Sami Virpioja: In the search for optimal lexicons
Abstract: One question common to many natural language processing applications is how to select the discrete lexical units of representation from textual data. Especially for morphologically rich languages, words are rarely suitable as lexical units. The current trend for neural network models makes it attractive for efficiency reasons to use subword units even for languages such as English. We have shown that a simple but extensible method, Morfessor, can be used to induce well-performing subword lexicons from raw text for various applications. For automatic speech recognition, our subword-based approach has provided state-of-the-art results for several datasets across languages, and won the 2017 MGB Challenge for Arabic. For machine translation, we have introduced different ways to improve the consistency of the segmentation both for individual languages and between the translated languages. Finally, we have shown that the optimization principle in Morfessor seems to be related to how humans process morphologically complex Finnish words.
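Morfessor itself is not reproduced here, but the general idea of inducing a subword lexicon from raw text can be illustrated with a toy byte-pair-encoding (BPE) sketch, a related greedy technique widely used for the subword units mentioned above. The tiny corpus and merge count are invented for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: greedily merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols (characters to start with).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere, fusing the pair into one symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Tiny made-up corpus: a shared stem should surface as a subword unit.
corpus = ["talo", "talossa", "talon", "autossa", "auton"]
print(learn_bpe(corpus, 4))  # the stem "talo" is built up over the first merges
```

The learned merge list is the subword lexicon: applying the merges in order segments new words into the induced units.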
- March, 14: reading group
- March, 21: Danny Merkx: Learning semantic representations of sentences from visually grounded language without words
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. Our study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. We show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. This shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics.
- March, 28: Visit by translation studies
Irina Kudasheva, M.A.: Technological competence in translators’ work and translation teaching in Finland
Juha Eskelinen, M.A., Synergies in MT research and PE training
Matias Tamminen, M.A., Automatic recognition of translation-typical patterns
Simo Määttä, Ph.D.: Challenges to non-human translation and interpretation
- April, 4: Timo Honkela: Meaning negotiations as phenomena and as LT challenges, with Iiro Jääskeläinen (Aalto University) on the study of individualized meanings using brain research
Different room! Topelia, F211
Models of linguistic semantics can be viewed through representation and reasoning. This distinction concerns how we represent the world that we refer to with linguistic expressions, and what kind of reasoning we apply based on these representations. It has been commonplace to assume that each word or expression has one sense or a limited number of distinct senses, and the classification task of disambiguation has been devised to find the right sense in each case. It is also possible to represent the world using a high-dimensional continuous space. In that case, we do not need to assume that the world is represented as a network of nodes and their connections. These mathematical representations go beyond the capacities provided by symbolic logic. Word embeddings have a history that stems from vector space representations in information retrieval. When a framework of multidimensional continuous spaces is available, it is possible to study nuances of meaning that go beyond conducting disambiguation or choosing between alternatives within a logical framework.
In the present work, it is postulated that semantic processes are essentially subjective and thus individual. When high-dimensional continuous spaces are used to represent meanings and to define contextual distributions, subjective aspects can be modelled, and the subjectivity of meaning can be measured. This can be studied, for instance, in the framework of brain research (Saalasti et al. 2019) or motion tracking (Honkela & Förger 2013). The methodology for measuring subjective, contextually grounded meaning has been presented, for instance, in Raitio et al. 2014; further methodological work and an empirical demonstration are presented in Sintonen et al. (2014). When it is possible to represent the individually contextual meaning of expressions, it is consequently possible to analyse the differences in meaning between two individuals. A hypothesis is that suitable data for meaning negotiation can be collected, and computational algorithms devised and applied in real-world contexts that help in meaning negotiations. An alternative view is to aim at defining the meaning of words in a precise way and to teach all people to use these definitions. In the present work, it is claimed that such objectivity can be reached only to a degree, as it would require vast human cognitive and time resources, and the mapping between words and the world is doomed to be partial. This concern has implications for both scientific and real-world communication and representation, and has been applied in building the Peace Machine framework.
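As a minimal illustration of the continuous-space view described in this abstract, the toy snippet below represents words as vectors and measures graded semantic proximity with cosine similarity. The three-dimensional vectors are invented for illustration only; real embeddings are high-dimensional and learned from corpora.

```python
import math

# Invented toy "meaning" vectors -- real embeddings are learned, not hand-set.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: a graded, continuous notion of semantic proximity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Nuances of meaning become degrees of similarity, not discrete senses:
print(cosine(vectors["cat"], vectors["dog"]))  # high: similar meanings
print(cosine(vectors["cat"], vectors["car"]))  # clearly lower
```

In such a space, differences between two individuals' meanings could likewise be quantified by comparing their respective vectors for the same expression.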
- April, 15 at 12:00: Marianna Apidianaki
Metsätalo, B 610 (Note: different day, time and place)
Title: Word similarity and difference: discovering meaning through paraphrases, translations and context
Abstract: A key challenge for both natural language understanding and generation is recognizing different ways of expressing the same meaning. Paraphrases can reveal this semantic proximity, support textual inference, and promote diversity and flexibility in NLP applications. In this talk, I will show how paraphrases discovered through translations can also serve to analyze the internal semantics of words and their semantic relationships. I will demonstrate how in-context paraphrase substitution can reflect word usage similarity, and will present a comparison with usage similarity prediction performed by state-of-the-art contextualized representations. Finally, I will show that paraphrases can serve to detect sentiment intensity, and understand information expressed indirectly in text and dialogue.
- April, 18: Jack Rueter and Mika Hämäläinen
Title (Mika): Theoretically grounded creative movie title generation and socialization of creative agents.
- May, 2
- May, 9: reading group
- May, 16:
- May, 23: Ozan Çağlayan (tbc)
- June, 6 (cancelled because of NAACL?)
- June, 13: reading group