Cross-Lingual NLP

Parallel multilingual resources capture valuable linguistic information that can be used in various fields of computational linguistics. The most obvious application is machine translation, with systems trained on large collections of texts and their human translations. Aligned parallel corpora can also serve as knowledge sources for finding implicit linguistic patterns, treating translations as natural semantic annotation. Parallel data sets can further be used to port linguistic tools and resources to other languages, which is especially useful for supporting low-density languages. The language technology group in Helsinki works on several of these topics, including statistical machine translation and annotation projection. Read news and posts related to cross-lingual NLP.


  • FoTran – Found in Translation (ERC), 2018-2022
  • fiskmö – Finnish-Swedish Parallel Corpus and Machine Translation (SKF)
  • MeMAD – Method for Managing Audiovisual Data (EU H2020), 2018-2020
  • NLUxG – Natural Language Understanding with Cross-Lingual Grounding (AoF), 2018-2019
  • Cross-Linguistic and Multilingual NLP with the Focus on Low-Resource Languages and Language Variants (University of Helsinki), 2015-2019
  • NLPL – the Nordic Language Processing Laboratory
  • EOSC-nordic – the NLPL use case inside the Nordic chapter of the European Open Science Cloud
  • OPUS (supported by NLPL), including OPUS-MT

Multilingual Resources:

  • OPUS: The Open Parallel Corpus (search, lexicon, NMT)
  • MuCoW: Multilingual contrastive WSD test sets for MT
  • ParCor: A Parallel Corpus with Annotated Pronoun Coreference
  • DiscoMT2015: Data from a Shared Task on Pronoun-Focused MT
  • FinnWordNet: A bilingual lexical database
  • PIE: The Proto-Indo-European Lexicon
  • SemFi and SemUr: Semantic Databases for Finnish, Skolt Sami, Erzya, Moksha and Komi-Zyrian



Cross-lingual NLP has been part of the research group in Helsinki since 2015. Current and past members of the group who deal with related research are:

Related Publications:

  1. Robert Östling; Yves Scherrer; Jörg Tiedemann; Gongbo Tang; Tommi Nieminen: The Helsinki Neural Machine Translation System. In Proceedings of WMT at EMNLP 2017, Copenhagen/Denmark
  2. Jörg Tiedemann and Yves Scherrer: Neural Machine Translation with Extended Context. In Proceedings of DiscoMT at EMNLP 2017, Copenhagen/Denmark
  3. Robert Östling and Jörg Tiedemann: Continuous multilinguality with language vectors. In Proceedings of EACL, Valencia/Spain, pp. 644-649
  4. Jörg Tiedemann: Cross-lingual dependency parsing for closely related languages – Helsinki’s submission to VarDial 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 131-136
  5. Robert Östling and Jörg Tiedemann: Efficient Word Alignment with Markov Chain Monte Carlo. The Prague Bulletin of Mathematical Linguistics (PBML), Number 106, pp. 125–146
  6. Robert Östling: Morphological reinflection with convolutional neural networks. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
  7. Jörg Tiedemann, Fabienne Cap, Jenna Kanerva, Filip Ginter, Sara Stymne, Robert Östling and Marion Weller-Di Marco: Phrase-Based SMT for Finnish with More Data, Better Models and Alternative Alignment and Translation Tools. In Proceedings of the First Conference on Machine Translation at ACL 2016, pp. 391-398
  8. Liane Guillou, Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, Mauro Cettolo, Bonnie Webber and Andrei Popescu-Belis: Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction. In Proceedings of the First Conference on Machine Translation at ACL 2016, pp. 525-542
  9. Jörg Tiedemann: A Linear Baseline Classifier for Cross-Lingual Pronoun Prediction. In Proceedings of the First Conference on Machine Translation at ACL 2016, pp. 616-619
  10. Aaron Smith, Christian Hardmeier and Jörg Tiedemann: Climbing Mont BLEU: The Strange World of Reachable High-BLEU Translations. In Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association of Machine Translation (EAMT), 2016
  11. Jörg Tiedemann: OPUS – Parallel Corpora for Everyone. In Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association of Machine Translation (EAMT), 2016
  12. Jörg Tiedemann: Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016
  13. Pierre Lison and Jörg Tiedemann: OpenSubtitles2015: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016
  14. Jörg Tiedemann and Željko Agić: Synthetic Treebanking for Cross-Lingual Dependency Parsing. In Journal of Artificial Intelligence Research, 55, pages 209-248, January 2016
  15. Jörg Tiedemann, Filip Ginter and Jenna Kanerva: Morphological Segmentation and OPUS for Finnish-English Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 177-183, 2015
  16. Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo: Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation. In Proceedings of the Second Workshop on Discourse in Machine Translation (DiscoMT), pages 1-16, 2015
  17. Robert Östling: Word order typology through multilingual word alignment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 205-211, Beijing, China, July 2015. Association for Computational Linguistics.
  18. Robert Östling, Carl Börstell, and Lars Wallin: Enriching the Swedish Sign Language Corpus with part of speech tags using joint Bayesian word alignment and annotation transfer. In Proceedings of the 20th Nordic Conference on Computational Linguistics (NODALIDA 2015), volume 23 of NEALT Proceedings Series, pages 263–268, Vilnius, Lithuania, May 2015.