Parallel multilingual resources capture valuable linguistic information that can be used in various fields of computational linguistics. The most obvious application is machine translation, where systems are trained on large collections of texts and their human translations. Aligned parallel corpora can also serve as knowledge sources for discovering implicit linguistic patterns, treating translations as a natural form of semantic annotation. Parallel data sets can also be used to port linguistic tools and resources to other languages, which is especially useful for supporting low-density languages. The language technology group in Helsinki works on several of these topics, including statistical machine translation and annotation projection. Read news and posts related to cross-lingual NLP.
Projects:
- HPLT – High-Performance Language Technologies, 2022-2025
- FoTran – Found in Translation (ERC), 2018-2022
- OPUS-MT – Open Translation Models, Tools and Services (ELG pilot pro)
- fiskmö – Finnish-Swedish Parallel Corpus and Machine Translation (SKF)
- MeMAD – Method for Managing Audiovisual Data (EU H2020), 2018-2020
- NLUxG – Natural Language Understanding with Cross-Lingual Grounding (AoF), 2018-2019
- Cross-Linguistic and Multilingual NLP with the Focus on Low-Resource Languages and Language Variants (University of Helsinki), 2015-2019
- NLPL – the Nordic Language Processing Laboratory
- EOSC-Nordic – the NLPL use case inside the Nordic chapter of the European Open Science Cloud
- OPUS (supported by NLPL), including OPUS-MT
Multilingual Resources:
- OPUS: The Open Parallel Corpus (search, lexicon, NMT)
- MuCoW: Multilingual contrastive WSD test sets for MT
- ParCor: A Parallel Corpus with Annotated Pronoun Coreference
- DiscoMT2015: Data from a Shared Task on Pronoun-Focused MT
- FinnWordNet: A bilingual lexical database
- PIE: The Proto-Indo-European Lexicon
- SemFi and SemUr: Semantic Databases for Finnish, Skolt Sami, Erzya, Moksha and Komi-Zyrian
Tools:
- OPUS-MT: Web-apps for machine translation
- The NMT attention-bridge model
- HNMT: The Helsinki Neural Machine Translation system
- BNAS: Basic Neural Architecture Subprograms
- Docent: Document-Level Statistical Machine Translation
- efmaral, eflomal and efmugal: Efficient Markov Chain Word Alignment
- UPlug: Tools for Processing Parallel Corpora (ICA & ISA)
- Lingua::Align: A Toolbox for Tree Alignment
- subalign: Tools for Aligning Translated Movie Subtitles
Events:
- UnGroundNLP 2022
- VarDial 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014
- FoTran 2018
- SMART-Select-4
- FinMT 2017
- FinMT 2016
- CrossLingNLP 2016
Cross-lingual NLP has been part of the research group in Helsinki since 2015. Current and past members of the group who work on related research are:
- Jörg Tiedemann
- Mathias Creutz
- Yves Scherrer
- Michele Boggia
- Stig-Arne Grönroos
- Timothee Mickus
- Alessandro Raganato
- Hande Celikkanat
- Marianna Apidianaki
- Sami Virpioja
- Juan Raul Vazquez Carrillo
- Aarne Talman
- Umut Sulubacak
- Emily Öhman
- Mikko Aulamo
- Eetu Sjöblom
- Jouna Pyysalo
- Sharid Loáiciga (fall 2017)
- Robert Östling (2016)
- Austin Matthews (visiting PhD, October-December 2016)
- Idan Tene (student project on NMT)
- Sabine Nyholm (bachelor thesis on NMT)
Related Publications: