Resources

Open-source software and resources

Various resources and tools are available from GitHub, Bitbucket and SourceForge, including tools for processing (parallel) corpora, alignment, translation and natural language inference, as well as official test sets for evaluating English-Finnish machine translation.

OPUS – the open parallel corpus

OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.

Website: http://opus.nlpl.eu

NLPL – The Nordic Language Processing Laboratory

The Nordic Language Processing Laboratory (NLPL) is a collaboration of academic research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) piloting innovative ways to share high-performance computing and data resources across country borders, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.

Website: http://wiki.nlpl.eu

DiscoMT 2015 Shared Task on Pronoun Translation

DiscoMT at EMNLP 2015 featured two shared tasks: one in pronoun-focused machine translation and one in cross-lingual pronoun prediction. The data sets, baseline systems and system submissions, along with additional manual annotation, are available from LINDAT/CLARIN:

Website: http://hdl.handle.net/11372/LRT-1611

ParCor – A Parallel Pronoun-Coreference Corpus

ParCor 1.0 is a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. It consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech) and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent.

Website: http://opus.lingfil.uu.se/ParCor

A Blacklist Classifier for language discrimination

The blacklist classifier is a simple tool for discriminating between closely related languages. It relies on lists of blacklisted words that can be trained on comparable data sets. The package comes with blacklists for distinguishing Bosnian, Croatian and Serbian. The data these lists were trained on is also included, together with a test set for the three languages.

Information and download: https://bitbucket.org/tiedemann/blacklist-classifier
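
The idea of discriminating languages with blacklisted words can be illustrated with a small Python sketch. It is not the code or the interface of the package above, and the details are assumptions made for illustration: a word is blacklisted for a language if it is frequent in another language's training data but unseen in that language's own data, and a text is assigned to the language whose blacklist it violates least. The function names, the min_count parameter and the toy data (with placeholder labels lang_a and lang_b) are all hypothetical.

```python
from collections import Counter


def train_blacklists(corpora, min_count=2):
    """Build one blacklist per language from comparable corpora.

    corpora maps a language label to a list of sentences.
    Assumption for this sketch: a word is blacklisted for a language
    if it occurs at least min_count times in some other language's
    data and never in the language's own data.
    """
    counts = {
        lang: Counter(w for sent in sents for w in sent.lower().split())
        for lang, sents in corpora.items()
    }
    blacklists = {}
    for lang in corpora:
        black = set()
        for other, other_counts in counts.items():
            if other == lang:
                continue
            for word, freq in other_counts.items():
                if freq >= min_count and counts[lang][word] == 0:
                    black.add(word)
        blacklists[lang] = black
    return blacklists


def classify(text, blacklists):
    """Return the language whose blacklist matches the fewest tokens.

    Ties are resolved in favour of the language listed first.
    """
    tokens = text.lower().split()
    penalties = {
        lang: sum(tok in black for tok in tokens)
        for lang, black in blacklists.items()
    }
    return min(penalties, key=penalties.get)


# Toy data with made-up tokens; real use would need comparable corpora.
toy_corpora = {
    "lang_a": ["alpha beta gamma", "alpha gamma delta"],
    "lang_b": ["alpha beta omega", "alpha sigma omega"],
}
toy_blacklists = train_blacklists(toy_corpora)
guess = classify("alpha omega beta", toy_blacklists)
```

With this toy data the token "omega" ends up on the blacklist for lang_a, so a text containing it is pushed towards lang_b. Whitespace tokenisation and the frequency threshold are simplifications; the package's own documentation describes what it actually does.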

Lingua::Align – a toolbox for Tree Alignment

Lingua::Align is a collection of command-line tools for automatic tree and word alignment of parallel corpora. Its main purpose is to provide a toolbox for experimenting with various feature sets and alignment strategies. Alignment is based on local classification and alignment inference. The local classifier is typically trained on available aligned training data.

Information and download: https://bitbucket.org/tiedemann/lingua-align
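
The two-step scheme of local classification followed by alignment inference can be sketched in Python as follows. This is only an illustration and not the Lingua::Align interface: it assumes that a local classifier has already assigned a link probability to each candidate pair of source and target nodes, and it uses a greedy one-to-one search as one possible inference strategy. The function name, the threshold and the node labels are hypothetical.

```python
def greedy_align(scores, threshold=0.5):
    """Greedy one-to-one alignment inference over local link scores.

    scores maps (source_node, target_node) pairs to the link
    probability assumed to come from a local classifier.
    Candidates are visited from the highest score down; a link is
    kept if neither node is already aligned, and the search stops
    once scores drop below the threshold.
    """
    links = []
    used_src, used_trg = set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (src, trg), prob in ranked:
        if prob < threshold:
            break
        if src in used_src or trg in used_trg:
            continue
        links.append((src, trg))
        used_src.add(src)
        used_trg.add(trg)
    return links


# Hypothetical classifier output for a handful of node pairs.
toy_scores = {
    ("NP1", "NP_a"): 0.9,
    ("NP1", "VP_b"): 0.6,
    ("VP2", "VP_b"): 0.8,
    ("VP2", "NP_a"): 0.3,
    ("PP3", "NP_a"): 0.55,
}
toy_links = greedy_align(toy_scores)
```

Here the pairs (NP1, NP_a) and (VP2, VP_b) are linked, while the remaining candidates are dropped because one of their nodes is already aligned or their score is below the threshold. Other inference strategies could replace the greedy search without changing the local classification step.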

Uplug – tools for creating and processing parallel corpora

Information and download: https://bitbucket.org/tiedemann/uplug

Docent – A document-level search decoder for SMT

Information and download: https://github.com/chardmeier/docent/wiki

pdf2xml – convert PDF documents to XML

Information and download: https://bitbucket.org/tiedemann/pdf2xml

subalign – convert and align movie subtitles

Information and download: https://bitbucket.org/tiedemann/subalign


Setup and Data from EACL 2012 – Character-Based Pivot Translations for Under-Resourced Languages and Domains

Tiedemann, J. Character-Based Pivot Translations for Under-Resourced Languages and Domains. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.

Catalan – English (via Spanish): parallel data (675 MB), monolingual data (2.7 GB)
Galician – English (via Spanish): parallel data (511 MB), monolingual data (2.7 GB)
Macedonian – English (via Bulgarian and Bosnian): parallel data (412 MB), monolingual data (2.4 GB)
Norwegian – English (via Danish, Swedish, German, French): parallel data (260 MB), monolingual data (2.6 GB)