FoTran

Found in Translation: Natural Language Understanding with Cross-lingual Grounding (FoTran) is an ERC-funded project led by Jörg Tiedemann. It runs from 2018 to 2023 within the language technology research group at the University of Helsinki. Stay tuned for more updates.

The success of modern intelligent systems rests on their ability to learn from data. The goal of the project is to develop models for natural language understanding that are trained on the implicit information contained in large collections of human translations. We will use massively parallel data sets covering over a thousand languages to acquire language-independent meaning representations that can be used for reasoning with natural languages and for multilingual neural machine translation.

Natural language understanding is the “holy grail” of computational linguistics and a long-term goal in research on artificial intelligence. Understanding human communication is difficult due to the various ambiguities in natural languages and the wide range of contextual dependencies required to resolve them. Discovering the semantics behind language input is necessary for proper interpretation in interactive tools, which requires an abstraction from language-specific forms to language-independent meaning representations.

With this project, we propose a line of research focused on developing novel data-driven models that learn such meaning representations from the indirect supervision provided by human translations covering a substantial proportion of the world's linguistic diversity. A guiding principle is cross-lingual grounding: the effect of resolving ambiguities through translation. For example, the English word “bank” is translated into German as “Bank” for the financial institution but as “Ufer” for the side of a river, so the translation reveals the intended sense. The beauty of this idea is that it relies on naturally occurring data instead of artificially created resources and costly manual annotations. The framework is based on deep learning and neural machine translation, and our hypothesis is that training on increasing amounts of linguistically diverse data improves the abstractions found by the model. Eventually, this will lead to language-independent meaning representations, and we will test our ideas on multilingual machine translation and on tasks that require semantic reasoning and inference.

Principal Investigator

Jörg Tiedemann is Professor of Language Technology at the Department of Modern Languages at the University of Helsinki. His main research interests are cross-lingual NLP and machine translation.

Postdoctoral Researchers

Alessandro Raganato joined the team in March 2018. He received his PhD on multilingual semantic modelling from the University of Rome.

Hande Celikkanat joined our team in September 2018. She moved to our group from the computer science department in Helsinki.

Miikka Silfverberg moved back from Boulder, Colorado, and started his position as lecturer in language technology in September 2018. He will spend his research time in connection with FoTran.

Doctoral Students

Aarne Talman is a doctoral student in Language Technology at the University of Helsinki. His main research interests are natural language understanding, natural language inference and computational semantics.

Raul Vazquez started his PhD in March 2018 at the University of Helsinki.

Umut Sulubacak also started his PhD in March 2018 at our department. He mainly works for the MeMAD project but will also participate in FoTran.

 

Publications

  1. Tiedemann, J. (2018). Emerging Language Spaces Learned From Massively Multilingual Corpora. In Proceedings of the 3rd Conference on Digital Humanities in the Nordic Countries (DHN 2018), Helsinki, Finland.
  2. Östling, R., Scherrer, Y., Tiedemann, J., Tang, G. and Nieminen, T. (2017). The Helsinki Neural Machine Translation System. In Proceedings of WMT at EMNLP 2017, Copenhagen, Denmark.
  3. Östling, R. and Tiedemann, J. (2017). Continuous multilinguality with language vectors. In Proceedings of EACL, Valencia, Spain, pp. 644-649.
  4. Tiedemann, J. (2016). Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016).
  5. Lison, P. and Tiedemann, J. (2016). OpenSubtitles2015: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016).
  6. Östling, R. (2015). Word order typology through multilingual word alignment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 205-211. Association for Computational Linguistics.

Resources

Software

HNMT, the Helsinki Neural Machine Translation system, is a neural machine translation system developed at the University of Helsinki and Stockholm University.

Data

OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.