Discourse-Oriented Statistical Machine Translation
A project funded by the Swedish Research Council (Vetenskapsrådet 2012-916).
Purpose and Aims
Automatic translation of human language (machine translation) has matured significantly in recent years. However, severe problems remain in today’s technology. They originate from the inability of current models to adapt to specific topics and domains and to produce coherent text. The aim of this project is to tackle these issues by incorporating discourse-wide information into modern translation engines.
One of the main problems of state-of-the-art machine translation (MT) is its focus on local information. MT systems typically disregard any information outside the individual sentence being translated. As a consequence, there is no way to enforce textual coherence in such a system, and it is impossible to detect topic shifts or other discourse-related events. Sentences are translated one by one in isolation, without making use of cross-sentential information, and the translation of each sentence is exactly the same under any arbitrary ordering of the sentences. The objective of our research is to develop novel models in the framework of statistical machine translation that change this situation in a principled way.
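The order-invariance described above can be illustrated with a minimal sketch. It is a toy, not the project's system or any real decoder: the lexicon, the function names and the example sentences are all hypothetical, and word-by-word lookup stands in for a full sentence-level translation model. The only point is that a translator which sees one sentence at a time returns the same translation for each sentence however the document is ordered.

```python
from typing import Dict, List

# Hypothetical toy lexicon, invented for illustration only.
TOY_LEXICON: Dict[str, str] = {
    "the": "le",
    "cat": "chat",
    "sleeps": "dort",
    "it": "il",
    "purrs": "ronronne",
}


def translate_sentence(sentence: str) -> str:
    """Translate one sentence word by word, with no access to other sentences."""
    words = sentence.lower().split()
    return " ".join(TOY_LEXICON.get(word, word) for word in words)


def translate_document(sentences: List[str]) -> List[str]:
    """Sentence-level translation: every sentence is handled in isolation."""
    return [translate_sentence(sentence) for sentence in sentences]


doc = ["The cat sleeps", "It purrs"]
shuffled = list(reversed(doc))

# Reordering the input only reorders the output; no single translation changes.
assert translate_document(shuffled) == list(reversed(translate_document(doc)))
```

In this toy, "it" is always rendered the same way, whatever its antecedent is. Choosing a pronoun that agrees with an antecedent in an earlier sentence is one kind of decision that needs information from outside the current sentence.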
In this project, we propose to develop novel models for discourse-oriented machine translation that lead to more natural translations. First of all, we would like to introduce models of textual cohesion in the target language, focusing on the fluency and coherence of the generated text. Second, we would like to include cross-sentential information from the source language input to improve the adequacy of the translated texts. The latter will enable models to adjust dynamically to various topics and domains, creating topic-aware machine translation systems that derive appropriate information from discourse in much the same way as human translators do.
People
- Jörg Tiedemann, project leader
- Christian Hardmeier, postdoctoral researcher
- Sara Stymne, associated postdoctoral researcher
- Aaron Smith, project assistant
Resources and Tools
- Docent – A Document-Level Local Search Decoder for Phrase-Based SMT
- ParCor – A Parallel Pronoun-Coreference Corpus (bitbucket)
- DiscoMT2015 data set on pronoun-focused MT evaluation
- Pronoun translation task at WMT2016 (results)
- DiscoMT – at EMNLP 2015 (data set)
- Docent at the MT Marathon in Trento, 2014
- DiscoMT – at ACL 2013
Publications
- Guillou, L., Hardmeier, C., Smith, A., Tiedemann, J. and Webber, B. ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014), pages 3191-3198, European Language Resources Association (ELRA), 2014. [pdf]
- Tiedemann, J., Agić, Ž. and Nivre, J. Treebank Translation for Cross-Lingual Parser Induction. In Proceedings of the 18th Conference on Computational Natural Language Learning (CoNLL), 2014. [pdf]
- Hardmeier, C., Tiedemann, J. and Nivre, J. Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2013.
- Webber, B., Popescu-Belis, A., Markert, K. and Tiedemann, J., eds. Proceedings of the Workshop on Discourse in Machine Translation. Association for Computational Linguistics, 2013. [pdf]
- Stymne, S., Hardmeier, C., Tiedemann, J. and Nivre, J. Tunable Distortion Limits and Corpus Cleaning for SMT. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 225-231, Association for Computational Linguistics, 2013. [pdf]
- Stymne, S., Hardmeier, C., Tiedemann, J. and Nivre, J. Feature Weight Optimization for Discourse-Level SMT. In Proceedings of the Workshop on Discourse in Machine Translation, pages 60-69, Association for Computational Linguistics, 2013. [pdf]
- Hardmeier, C., Stymne, S., Tiedemann, J. and Nivre, J. Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 193-198, Association for Computational Linguistics, 2013. [pdf]
- Stymne, S., Tiedemann, J., Hardmeier, C. and Nivre, J. Statistical Machine Translation with Readability Constraints. In Proceedings of Nodalida 2013, pages 375-386, NEALT, 2013. [pdf]
- Hardmeier, C., Nivre, J. and Tiedemann, J. Document-Wide Decoding for Phrase-Based Statistical Machine Translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179-1190, Association for Computational Linguistics, 2012. [pdf]