Parallel multilingual resources capture valuable linguistic information that can be used in many areas of computational linguistics. The most obvious application is machine translation, where systems are trained on large collections of texts and their human translations. Aligned parallel corpora can also serve as knowledge sources for discovering implicit linguistic patterns, treating translations as a natural form of semantic annotation. Parallel data sets can further be used to port linguistic tools and resources to other languages, which is especially useful for supporting low-density languages. The language technology group in Helsinki works on several of these topics, including statistical machine translation and annotation projection. Read news and posts related to cross-lingual NLP.

Projects:

  • FoTran – Found in Translation (ERC), 2018-2022
  • MeMAD – Methods for Managing Audiovisual Data (EU H2020), 2018-2020
  • NLUxG – Natural Language Understanding with Cross-Lingual Grounding (AoF), 2018-2019
  • Cross-Linguistic and Multilingual NLP with the Focus on Low-Resource Languages and Language Variants (University of Helsinki), 2015-2019
  • OPUS (supported by NLPL)

Multilingual Resources:

  • OPUS: The Open Parallel Corpus (search, lexicon)
  • ParCor: A Parallel Corpus with Annotated Pronoun Coreference
  • DiscoMT2015: Data from a Shared Task on Pronoun-Focused MT
  • FinnWordNet: A bilingual lexical database
  • PIE: The Proto-Indo-European Lexicon



