Workshop on Data Curation for (Neural) Machine Translation

    • Date: November 20, 2018
    • Place: European Commission, Luxembourg

Data curation is an important topic to discuss in the field of neural machine translation: these models rise and fall with the data on which they are trained. The outcomes of international evaluation campaigns show that the teams that can handle more data, and know how to pre-process it properly, tend to win. This workshop is about gathering more high-quality data for a larger variety of languages, with better support for specific domains and application scenarios.

In order to discuss this topic in the light of current research, we invited nine experts in the field to present different aspects of modern data collection and MT system development. The workshop schedule was designed around the following three themes:

  • Web-Crawling and Crowd-Sourcing – Coverage vs Quality
  • Compiling Multilingual Parallel Corpora of High Quality
  • Efficient use of Data in NMT

The first theme addresses the question of increasing the coverage of data sets by developing web-scale data collection techniques that include a broader selection of languages and domains, and that apply advanced techniques for filtering and cleaning the data that can be scraped from the web. The second theme focuses on the creation of clean data sets from business clients and public-sector providers in order to improve the quality of customized MT. The last theme emphasizes open questions and new developments in the direction of unsupervised machine translation and the support of low-resource languages.
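To give a flavour of the filtering and cleaning step mentioned under the first theme, here is a minimal sketch of heuristics commonly applied to crawled parallel data. The function, its thresholds, and the sample pairs are illustrative assumptions, not tools or values presented at the workshop.

```python
# Illustrative heuristics for cleaning crawled parallel data.
# The thresholds are arbitrary examples, not workshop recommendations.

def keep_pair(src: str, tgt: str,
              min_len: int = 3, max_len: int = 100,
              max_ratio: float = 2.0) -> bool:
    """Return True if a sentence pair passes simple noise filters."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Drop empty or extremely short/long segments.
    if not (min_len <= len(src_toks) <= max_len):
        return False
    if not (min_len <= len(tgt_toks) <= max_len):
        return False
    # Drop pairs whose lengths diverge too much (likely misalignments).
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return False
    # Drop pairs where source and target are identical (untranslated copies).
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

pairs = [
    ("The cat sat on the mat .", "Le chat était assis sur le tapis ."),
    ("Click here", "Cliquez ici pour en savoir plus sur nos offres"),
    ("Home", "Home"),
]
clean = [p for p in pairs if keep_pair(*p)]
```

Real pipelines add language identification, fluency scoring with language models, and deduplication on top of such rule-based filters.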

Jörg Tiedemann: Introduction to Data Curation for (Neural) Machine Translation: Nobody wants to hear the phrase “data is the new oil” anymore – but, nevertheless, it is true that our models rise and fall with the data they are trained on. The success of international evaluation campaigns shows that the teams that can handle more data and know how to pre-process it properly will win. This workshop is about gathering oil of high quality for neural MT and about finding other kinds of fuel that can run the engines.

Kenneth Heafield: Improving machine translation by mining the web: The ParaCrawl project is mining millions of parallel sentences from hundreds of thousands of websites and sharing them online for free in EU languages. I will talk about how we are scaling to a petabyte, our efforts to extract slightly less messy data from the messy web through translation, and ambiguous copyright. ParaCrawl is funded by the Connecting Europe Facility.

Filip Ginter and Jenna Kanerva: Finding translations in unordered text: We present recent advances in extracting sentence translation pairs from large monolingual corpora. Unlike much work on parallel data extraction, here we make no assumptions about the corpora being document-aligned or comparable in any way, and even sentence-shuffled datasets can be used to mine parallel sentences. Common methods build upon multilingual sentence embeddings, as well as automatically induced translation dictionaries and alignment scores, to create a vector representation of each sentence, which can be used to efficiently score candidate sentence pairs using the dot product of their vectors. We show how to efficiently utilize these techniques on Internet crawl datasets with hundreds of millions of unique monolingual sentences.
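The dot-product scoring step described in the abstract can be sketched in a few lines. Here random vectors stand in for real multilingual sentence embeddings; everything else (shapes, names) is an illustrative assumption.

```python
import numpy as np

# Sentences from two monolingual corpora are embedded in a shared
# multilingual space (random vectors stand in for real embeddings here),
# and candidate pairs are scored by the dot product of their vectors.

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 64))   # 5 source-language sentences
tgt_emb = rng.normal(size=(7, 64))   # 7 target-language sentences

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

src_n, tgt_n = normalize(src_emb), normalize(tgt_emb)

# With unit vectors the dot product equals cosine similarity, and all
# source-target scores come from a single matrix multiplication.
scores = src_n @ tgt_n.T             # shape (5, 7)
best_tgt = scores.argmax(axis=1)     # best candidate for each source sentence
```

At web scale the exhaustive matrix product is replaced by approximate nearest-neighbour search, but the scoring principle is the same.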

Miguel Esplà-Gomis: Developing and Using Bitextor: This talk will present Bitextor, a free/open-source tool to harvest parallel data from the Internet. The presentation will describe how the tool works and how it has been adapted for use in different scenarios and projects in recent years. It will also enumerate some of the most relevant challenges faced during the design and use of the tool, such as dealing with noisy data or evaluating performance, and will describe the solutions proposed within the Bitextor project.

Antonio Toral: Quality of parallel crawled data: translationese, machinese, transcreations: Translationese: finding out the original language of crawled text. It is known that an MT system for SL->TL works best when the source and target sides of the training data were originally written in those languages. I can talk about recent research in which we guess the original language of a text using PoS sequences, with an approach that is robust to domain variation. Machinese: discriminating between HT and MT. It is trickier to discriminate between HT and NMT than it was between HT and PBMT, which has an impact on crawling, as you want to make sure crawled data has not itself been machine-translated. I can talk about some ideas on how to approach this. Transcreations: not all HT is the same. Some HTs are more literal and foreignising, while others are more domesticating, or transcreations. What impact does this have on crawled data?
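To make the PoS-sequence idea concrete, here is a hypothetical sketch of guessing the original language of a text from part-of-speech trigram profiles. The tag sequences, class labels, and scoring scheme are invented for illustration and are not the method from the research mentioned above; PoS tagging itself is assumed to have been done beforehand.

```python
from collections import Counter

# Hypothetical sketch: represent each text as PoS-trigram counts and
# score it against per-class trigram profiles learned from labelled data.

def trigrams(tags):
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def train_profiles(labelled):
    """labelled: list of (pos_tag_sequence, class_label) pairs."""
    profiles = {}
    for tags, label in labelled:
        profiles.setdefault(label, Counter()).update(trigrams(tags))
    return profiles

def guess_origin(tags, profiles):
    """Pick the class whose trigram profile overlaps the text most."""
    grams = Counter(trigrams(tags))
    return max(profiles,
               key=lambda label: sum(min(c, profiles[label][g])
                                     for g, c in grams.items()))

# Toy training data: invented tag sequences for two classes.
profiles = train_profiles([
    (["DET", "NOUN", "VERB", "DET", "NOUN"], "original"),
    (["NOUN", "VERB", "ADV", "VERB", "NOUN"], "translated"),
])
```

A real classifier would use far more data and a proper learning algorithm, but the representation (PoS n-grams rather than lexical features) is what makes the approach robust to domain variation.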

Andreas Eisele:  Data collection and NMT for the EC: The eTranslation MT engines are mainly based on translation memory databases from the Euramis project, which cover all 24 EU languages and more.  We will describe the methods that are used to improve the quality of (N)MT engines by filtering and pre-processing the training data, as well as our efforts to extend the coverage of MT engines by adding training data from other sources.  We will present results from several approaches to improve MT quality via data curation.

Raivis Skadins: Tilde has experience in building customized NMT engines for both business and public-sector clients. The NMT systems for the 2017-2018 Presidency of the Council of the European Union and the NMT systems for the Latvian e-government machine translation platform are just a few of the most visible recent examples. We are also participating in several data collection projects for MT, for example the creation of the Tilde Model corpus, undertaken as part of the ODINE Open Data Incubator for Europe and of the European Language Resource Coordination (ELRC) action in the framework of the CEF Programme. The presentation will cover data collection issues in these projects, outline the methods and tools used for data collection, crawling, conversion, cleaning, evaluation and quality assurance, and describe the main challenges and the solutions found to handle them. Finally, we will share our experience in adapting NMT to narrow domains, combining both out-of-domain and in-domain parallel and monolingual data.

Robert Östling: NMT for (really) low-resource languages: NMT systems generally require large amounts of parallel text for training, although interesting recent developments in unsupervised NMT require only large amounts of monolingual text. What if you have neither? This talk is about our efforts to bring NMT to thousands of low-resource languages.

Mikel Artetxe: Towards Unsupervised Machine Translation: While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train machine translation systems in an unsupervised manner, relying on monolingual corpora alone. In this talk, I will introduce cross-lingual word embedding mappings, which are the basis of these systems, and present our work on both unsupervised neural and statistical machine translation.