Cross-Lingual NLP – Language Technology

Parallel multilingual resources capture valuable linguistic information that can be used in various fields of computational linguistics. The most obvious application is machine translation, where systems are trained on large collections of texts and their human translations. Aligned parallel corpora can also serve as knowledge sources for discovering implicit linguistic patterns, treating translations as a natural form of semantic annotation. Parallel data sets can further be used to port linguistic tools and resources to other languages, which is especially useful for supporting low-density languages. The language technology group in Helsinki works on several of these topics, including statistical machine translation and annotation projection. Read news and posts related to cross-lingual NLP.

Projects:

HPLT – High-Performance Language Technologies, 2022-2025
FoTran – Found in Translation (ERC), 2018-2022
OPUS-MT – Open Translation Models, Tools and Services (ELG pilot pro)
fiskmö – Finnish-Swedish Parallel Corpus and Machine Translation (SKF)
MeMAD – Method for Managing Audiovisual Data (EU H2020), 2018-2020
NLUxG – Natural Language Understanding with Cross-Lingual Grounding (AoF), 2018-2019
Cross-Linguistic and Multilingual NLP with the Focus on Low-Resource Languages and Language Variants (University of Helsinki), 2015-2019
NLPL – the Nordic Language Processing Laboratory
EOSC-Nordic – the NLPL use case inside the Nordic chapter of the European Open Science Cloud
OPUS (supported by NLPL), including OPUS-MT

Multilingual Resources:

OPUS: The Open Parallel Corpus (search, lexicon, NMT)
MuCoW: Multilingual contrastive WSD test sets for MT
ParCor: A Parallel Corpus with Annotated Pronoun Coreference
DiscoMT2015: Data from a Shared Task on Pronoun-Focused MT
FinnWordNet: A bilingual lexical database
PIE: The Proto-Indo-European Lexicon
SemFi and SemUr: Semantic Databases for Finnish, Skolt Sami, Erzya, Moksha and Komi-Zyrian

Tools:

OPUS-MT: Web-apps for machine translation
The NMT attention-bridge model
HNMT: The Helsinki Neural Machine Translation system
BNAS: Basic Neural Architecture Subprograms
Docent: Document-Level Statistical Machine Translation
efmaral, eflomal and efmugal: Efficient Markov Chain Word Alignment
UPlug: Tools for Processing Parallel Corpora (ICA & ISA)
Lingua::Align: A Toolbox for Tree Alignment
subalign: Tools for Aligning Translated Movie Subtitles

Events:

UnGroundNLP 2022
VarDial 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014
FoTran 2018
SMART-Select-4
FinMT 2017
FinMT 2016
CrossLingNLP 2016

Cross-lingual NLP became a research topic of the group in Helsinki in 2015. Current and past members of the group who work on related research are:

Jörg Tiedemann
Mathias Creutz
Yves Scherrer
Michele Boggia
Stig-Arne Grönroos
Timothee Mickus
Alessandro Raganato
Hande Celikkanat
Marianna Apidianaki
Sami Virpioja
Juan Raul Vazquez Carrillo
Aarne Talman
Umut Sulubacak
Emily Öhman
Mikko Aulamo
Eetu Sjöblom
Jouna Pyysalo
Sharid Loáiciga (fall 2017)
Robert Östling (2016)
Austin Matthews (visiting PhD, October-December, 2016)
Idan Tene (student project on NMT)
Sabine Nyholm (bachelor thesis on NMT)

Related Publications: