Parallel multilingual resources capture valuable linguistic information that can be used in various fields of computational linguistics. The most obvious application is machine translation, where systems are trained on large collections of texts and their human translations. Aligned parallel corpora can also serve as knowledge sources for finding implicit linguistic patterns, treating translations as a natural semantic annotation. Parallel data sets can also be used to port linguistic tools and resources to other languages, which is especially useful for supporting low-density languages. The language technology group in Helsinki works on several of these topics, including statistical machine translation and annotation projection.
- OPUS: The Open Parallel Corpus (search, lexicon)
- ParCor: A Parallel Corpus with Annotated Pronoun Coreference
- DiscoMT2015: Data from a Shared Task on Pronoun-Focused MT
- FinnWordNet: A bilingual lexical database
- PIE: The Proto-Indo-European Lexicon
- HNMT: The Helsinki Neural Machine Translation system
- BNAS: Basic Neural Architecture Subprograms
- Docent: Document-Level Statistical Machine Translation
- efmaral, eflomal and efmugal: Efficient Markov Chain Word Alignment
- UPlug: Tools for Processing Parallel Corpora (ICA & ISA)
- Lingua::Align: A Toolbox for Tree Alignment
- subalign: Tools for Aligning Translated Movie Subtitles
Cross-lingual NLP has been a research topic of the group in Helsinki since 2015. Current members of the group working on related research are:
- Jörg Tiedemann
- Mathias Creutz
- Yves Scherrer (2017-2019)
- Robert Östling (2016)
- Emily Öhman
- Jouna Pyysalo
- Austin Matthews (visiting PhD, October-December, 2016)
- Idan Tene (student project on NMT)
- Sabine Nyholm (bachelor thesis on NMT)
- Robert Östling and Jörg Tiedemann: Efficient Word Alignment with Markov Chain Monte Carlo. The Prague Bulletin of Mathematical Linguistics (PBML), Number 106, pp. 125–146
- Robert Östling: Morphological reinflection with convolutional neural networks. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
- Tiedemann, Jörg and Cap, Fabienne and Kanerva, Jenna and Ginter, Filip and Stymne, Sara and Östling, Robert and Weller-Di Marco, Marion: Phrase-Based SMT for Finnish with More Data, Better Models and Alternative Alignment and Translation Tools. In Proceedings of the First Conference on Machine Translation at ACL 2016, 391-398
- Guillou, Liane and Hardmeier, Christian and Nakov, Preslav and Stymne, Sara and Tiedemann, Jörg and Versley, Yannick and Cettolo, Mauro and Webber, Bonnie and Popescu-Belis, Andrei: Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction. In Proceedings of the First Conference on Machine Translation at ACL 2016, 525-542
- Tiedemann, Jörg: A Linear Baseline Classifier for Cross-Lingual Pronoun Prediction. In Proceedings of the First Conference on Machine Translation at ACL 2016, 616-619
- Aaron Smith, Christian Hardmeier and Jörg Tiedemann: Climbing Mont BLEU: The Strange World of Reachable High-BLEU Translations. In Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association of Machine Translation (EAMT), 2016
- Jörg Tiedemann: OPUS – Parallel Corpora for Everyone. In Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association of Machine Translation (EAMT), 2016
- Tiedemann, J.: Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016.
- Lison, P. and Tiedemann, J.: OpenSubtitles2015: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016.
- Jörg Tiedemann and Željko Agić: Synthetic Treebanking for Cross-Lingual Dependency Parsing. In Journal of Artificial Intelligence Research. 55, pages 209-248, January 2016
- Jörg Tiedemann, Filip Ginter and Jenna Kanerva: Morphological Segmentation and OPUS for Finnish-English Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 177-183, 2015
- Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo: Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation. In Proceedings of the Second Workshop on Discourse in Machine Translation (DiscoMT), pages 1-16, 2015
- Robert Östling: Word order typology through multilingual word alignment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 205-211, Beijing, China, July 2015. Association for Computational Linguistics.
- Robert Östling, Carl Börstell, and Lars Wallin: Enriching the Swedish Sign Language Corpus with part of speech tags using joint Bayesian word alignment and annotation transfer. In Proceedings of the 20th Nordic Conference on Computational Linguistics (NODALIDA 2015), volume 23 of NEALT Proceedings Series, pages 263–268, Vilnius, Lithuania, May 2015.