Research Projects

GreenNLP – controlling the carbon footprint in sustainable language technology

GreenNLP addresses the growing energy consumption caused by modern solutions in natural language processing (NLP). Neural language models and machine translation systems require heavy computation to train, and their size is constantly growing, which makes them expensive to deploy and run. In our project we will reduce training costs and model sizes through clever optimization of the underlying machine learning algorithms, with techniques that make use of knowledge transfer and compression. Furthermore, we will focus on multilingual solutions that can serve many languages with a single model, reducing the number of actively running systems. Finally, we will openly document and freely distribute all our results, enabling efficient reuse of ready-made components to further decrease the carbon footprint of modern language technology.
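
To give a concrete picture of one such technique, the sketch below shows a standard knowledge-distillation loss in PyTorch, where a small student model learns to match both the gold labels and the output distribution of a large teacher. This is a generic illustration of the technique, not code from the project; all names and hyperparameters are our own.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend hard-label cross-entropy with a soft-target KL term."""
        # Soft targets: the teacher's distribution, smoothed by the temperature
        # so the student also learns from low-probability classes.
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                      log_target=True) * temperature ** 2
        # Hard targets: the usual cross-entropy against the gold labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kl + (1 - alpha) * ce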

HPLT – High-Performance Language Technologies

  • PI: Jörg Tiedemann
  • 2022-2025, funded by the Horizon Europe Framework Programme (HORIZON)
  • project homepage

High-Performance Language Technologies (HPLT) combines petabytes of natural language data with large-scale model training. With trillions of words of text, HPLT will provide the largest open text collection to date. Cleaning and privacy-protecting services improve the quality and ethical properties of the text. Going beyond static repositories that require users to analyze each data set individually, the project will rate data sets by how much they improve end-to-end language models and machine translation systems. Continuous integration of models and data will result in freely downloadable, high-quality models for all official European Union languages and beyond. The models will be reproducible, with training information and evaluation metrics shown on a publicly available dashboard. By focusing on training at scale, the project complements the inference-focused European Language Grid, which in turn will be used for model deployment. Data sets, models and information about them will be published in recognized FAIR data repositories, aggregation catalogues and marketplaces for easy discovery, access, replication and exploitation.
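
The data-rating idea can be pictured as a simple ablation loop, sketched below in deliberately schematic Python; `train` and `evaluate` stand in for a full training pipeline and a benchmark metric, and are entirely hypothetical.

    def rate_dataset(candidate, base_data, train, evaluate):
        """Score a data set by the end-task gain it adds over a baseline.

        `train` and `evaluate` are placeholders for a complete training
        pipeline and an evaluation metric such as BLEU (hypothetical here).
        """
        baseline = evaluate(train(base_data))
        extended = evaluate(train(base_data + [candidate]))
        return extended - baseline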

Uncertainty-aware neural language models

  • PI: Jörg Tiedemann
  • 2022-2024, funded by the Academy of Finland
  • project homepage

In this project, we propose to develop methods that incorporate the interpretation uncertainty of natural language into modern neural language models. We expect these uncertainty-aware language models to provide a much better fit for human language and communication in general, which will become visible in practical tasks such as question answering and machine translation.
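
One standard way to surface such uncertainty, shown here only as an illustration of the general idea (the project develops its own models), is Monte Carlo dropout: dropout is kept active at inference time and the next-token distribution is sampled over several stochastic forward passes. The sketch assumes a Hugging-Face-style causal language model whose output carries a `logits` field.

    import torch

    def mc_dropout_next_token(model, input_ids, n_samples=10):
        """Estimate next-token uncertainty via Monte Carlo dropout."""
        model.train()  # keep dropout layers active during inference
        with torch.no_grad():
            samples = torch.stack([
                torch.softmax(model(input_ids).logits[:, -1, :], dim=-1)
                for _ in range(n_samples)
            ])
        mean_probs = samples.mean(dim=0)         # averaged predictive distribution
        spread = samples.var(dim=0).sum(dim=-1)  # larger spread = less certain
        return mean_probs, spread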

CoCorDial: Corpus-based Computational Dialectology

  • PI: Yves Scherrer
  • 2021-2025, funded by the Academy of Finland
  • project homepage

Dialectology is concerned with the study of language variation across space. This project aims to make dialect corpora comparable with each other. In particular, we will use machine translation techniques to normalize dialect texts, i.e., to transform them into standardized spelling. However, we are interested not only in the result of this normalization process, but also in the transformation operations that the normalization model learns. These model parameters will allow us to provide new visualisations of dialect landscapes and to confirm or challenge traditional dialect classifications.
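
Concretely, normalization can be framed as character-level machine translation: each dialect form and its standard counterpart are split into character sequences, and a seq2seq model is trained to map one to the other. The snippet below shows only this data framing, with made-up example pairs; it is not the project's pipeline.

    # Made-up dialect/standard pairs (Finnish-flavoured, purely illustrative).
    pairs = [
        ("miä meen kottii", "minä menen kotiin"),  # "I am going home"
        ("myö ollaan täälä", "me olemme täällä"),  # "we are here"
    ]

    def to_char_tokens(text):
        """Space-separate the characters; '_' marks word boundaries."""
        return " ".join("_" if c == " " else c for c in text)

    # Each pair becomes one training example for a character-level seq2seq model.
    for dialect, standard in pairs:
        print(to_char_tokens(dialect), "|||", to_char_tokens(standard))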

FoTran – Found in Translation

Found in Translation: Natural Language Understanding with Cross-lingual Grounding (FoTran) is a project that focuses on the development of models for natural language understanding trained on the implicit information given by large collections of human translations. A translation provides an alternative view of the same content, and this kind of cross-lingual grounding can be used to resolve ambiguities and to strengthen the signal that each language provides. The hypothesis in FoTran is that increasing the amount of linguistic diversity in training will lead to stronger abstractions and, ultimately, to language-agnostic meaning representations. We study the use of machine translation as a training objective and investigate its effect on neural embeddings and downstream applications.
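
The basic recipe can be sketched as follows: run a sentence through the encoder of a trained translation model and pool its hidden states into a fixed-size meaning vector. The example below uses a public OPUS-MT model via the Hugging Face transformers library purely for illustration; FoTran's own models and pooling strategies vary.

    import torch
    from transformers import MarianMTModel, MarianTokenizer

    name = "Helsinki-NLP/opus-mt-en-fi"  # any trained NMT model would do
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    def encode(sentence):
        """Mean-pool the translation encoder's hidden states into one vector."""
        batch = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            enc = model.get_encoder()(**batch)
        return enc.last_hidden_state.mean(dim=1).squeeze(0)

    # Two readings of "bank" end up as two different meaning vectors.
    v1 = encode("They sat on the bank of the river.")
    v2 = encode("The bank approved the loan.")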

Behind the words: Deep neural models of language meaning for industry-grade applications

  • PIs: Filip Ginter (University of Turku), Jörg Tiedemann and Mathias Creutz (University of Helsinki)
  • 2021-2023, funded by the Academy of Finland
  • Helsinki project website

In this project, we propose the development of a unique data resource, models and tools for improved neural meaning representations that are robust to non-trivial rephrasings of statements with equivalent or near-equivalent meaning. Such models and tools can significantly boost coverage and precision in industrial language content management, data mining, and internal as well as external communication procedures. The project will deliver a dedicated data collection and research focused on paraphrase encoding and generation. This paves the way to significantly improved language models that operate on more appropriate semantic abstractions, instead of relying on surface-oriented lexical or syntactic overlap or on spurious relations learned from purely collocational distributions. The focus is on Finnish and Swedish.
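
As a small illustration of what paraphrase encoding should achieve, the snippet below scores a Finnish paraphrase pair with an off-the-shelf multilingual encoder from the sentence-transformers library; the model name is a public example, not a project deliverable.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # "The meeting was moved to next week." / "The gathering was postponed
    # to the following week." -- little lexical overlap, same meaning.
    a = model.encode("Kokous siirrettiin ensi viikkoon.")
    b = model.encode("Tapaaminen lykättiin seuraavalle viikolle.")
    print(util.cos_sim(a, b))  # a good encoder scores this pair high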

LumiNMT: Multilingual Neural Machine Translation on LUMI Scale

  • PI: Jörg Tiedemann
  • 2022-2023, LUMI-G Extreme Access project

The goal of the project is to train neural machine translation models at large scale, using state-of-the-art transformer models as well as novel modular multilingual setups. Our focus is on increasing language coverage and on the efficient use of massively parallel data sets. We want to exploit the extensive parallel computing capabilities of LUMI to reduce training time and to scale up model size. We will cover bilingual as well as highly multilingual translation models trained on huge translation corpora, and we will study the effects of transfer learning across languages, extensive data augmentation techniques and knowledge distillation. All models will be made available for public use in research and development through data services at CSC (Allas and Fairdata/IDA).
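
One widely used augmentation technique of this kind is back-translation: monolingual target-language text is machine-translated into the source language to create synthetic parallel data. A minimal sketch, assuming public OPUS-MT models through the Hugging Face transformers pipeline:

    from transformers import pipeline

    # A reverse-direction model turns monolingual Finnish into synthetic English.
    back_translate = pipeline("translation", model="Helsinki-NLP/opus-mt-fi-en")

    monolingual_fi = ["Uusi silta avataan ensi vuonna."]
    synthetic_en = [out["translation_text"]
                    for out in back_translate(monolingual_fi)]
    # Each (synthetic_en[i], monolingual_fi[i]) pair augments the
    # English-to-Finnish training data.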

OPUS – the Open Parallel Corpus

  • PI: Jörg Tiedemann
  • long-term effort with various types of partial funding
  • homepage

OPUS is a growing collection of translated texts from the web and other sources. In the OPUS project we convert and align freely available online data in order to provide a large parallel corpus that is ready to be used for cross-lingual and multilingual natural language processing, as well as for extensive empirical studies in general linguistics and translation studies. OPUS includes tools and procedures for working with the data and also extends into a framework for the development of data-driven machine translation.
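
Aligned OPUS data can be downloaded from the project site or accessed programmatically; for instance, the OPUS-100 subset is mirrored on the Hugging Face hub (repository name assumed below) and can be loaded in a few lines:

    from datasets import load_dataset

    # English-Finnish portion of OPUS-100; the full collection is at opus.nlpl.eu.
    data = load_dataset("Helsinki-NLP/opus-100", "en-fi", split="train")
    for pair in data.select(range(3)):
        print(pair["translation"]["en"], "|||", pair["translation"]["fi"])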

Previous projects

EOSC-Nordic

EOSC-Nordic aims to facilitate the coordination of European Open Science Cloud (EOSC) relevant initiatives within the Nordic and Baltic countries. It exploits synergies to achieve greater harmonisation of policy and service provisioning across these countries, in compliance with agreed EOSC standards and practices.

FISKMÖ – Finnish-Swedish corpus & machine translation

  • PI: Jörg Tiedemann
  • 2018-2022, funded by the Swedish Culture Foundation (Svenska Kulturfonden)
  • project homepage

The primary goal of the FISKMÖ project is to create a massive parallel corpus of Finnish and Swedish to support linguistic research. The second goal of the project is to use the corpus to develop a public machine translation service for translating between Finnish and Swedish. As the quantity and quality of training data are the most important factors in machine translation quality, the FISKMÖ translation service can be expected to provide higher-quality translations than other public machine translation services, for which less training data is available.

OPUS-MT: Open Translation Models, Tools and Services

  • PI: Jörg Tiedemann
  • 2020-2021, ELG pilot project (funded by the European Commission H2020 program)
  • project homepage, github

OPUS-MT started as an extension of the FISKMÖ project, with the goal of creating state-of-the-art neural machine translation models that can be freely shared, re-used and integrated into open web services and professional translation workflows. The ELG-funded pilot project focused on European minority languages and on improving their support through multilingual NMT models and transfer learning. It also delivered easily deployable translation services and tools for quick domain adaptation and on-demand personalisation. OPUS-MT offers local solutions that are independent of online services, avoiding the security risks involved in sending data to external providers.
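
For example, a released model can be run locally with the Hugging Face transformers library (one of several supported routes; the models are also distributed in their native Marian-NMT format):

    from transformers import MarianMTModel, MarianTokenizer

    name = "Helsinki-NLP/opus-mt-fi-sv"  # Finnish-to-Swedish OPUS-MT model
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    batch = tokenizer(["Hyvää huomenta!"], return_tensors="pt", padding=True)
    translated = model.generate(**batch)
    print(tokenizer.batch_decode(translated, skip_special_tokens=True))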

MeMAD

The MeMAD project developed novel methods for the efficient re-use and re-purposing of multilingual audiovisual content. These methods support video management and digital storytelling in broadcasting and media production. The project tackled automatic video description by making the machine learn from human experts. The resulting descriptions are not only time-aligned semantic extractions of on-screen objects: they also make use of the audio track and recognize action sequences. The project addressed the challenge of improving the discoverability and findability of audiovisual data by developing novel methods for accessing and using the content. As a bonus, the methods developed for this automatic analysis have a substantial effect on the costs of multimedia production processes: tasks that previously required hundreds of hours of human labor can be carried out in just hours of machine time.

NLPL – The Nordic Language Processing Laboratory

The Nordic Language Processing Laboratory (NLPL) is a collaboration of university research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) implementing a common software, data and service stack in multiple Nordic HPC centres to enable data- and compute-intensive NLP research, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.

Natural Language Understanding with Cross-Lingual Grounding

  • PI: Jörg Tiedemann
  • 2018-2019, funded by the Academy of Finland

Natural language understanding requires a proper mapping of ambiguous expressions onto unambiguous abstract meaning representations. Building on recent advances in data-driven language modeling with deep learning, this project developed novel approaches to semantic representation learning from massively parallel corpora of human translations. The guiding principle is the use of human translations as indirect supervision for disambiguation: an expression with an unclear interpretation in one language is often explicit and unambiguous in another. The ambition is to train neural machine translation models on hundreds, and ultimately more than a thousand, languages in parallel, encoding input sequences into distributed vector representations.

DiscoMT – Discourse-Oriented SMT

  • PI: Jörg Tiedemann
  • 2012-2015, funded by the Swedish Research Council
  • project homepage

In this project, we developed novel models for discourse-oriented statistical machine translation that lead to more natural translations of coherent documents. By using information from beyond sentence boundaries, such models can adjust to different topics and domains in much the same way human translators do. To achieve this goal, we emphasized theoretical research on computational models of discourse-oriented translation, and we developed efficient algorithms and practical implementations that make it possible to apply these novel models in real-world settings. The project ideas were tested on a wide range of translation tasks across a number of language pairs and textual domains.