Open-source software and resources
Various resources and tools are available from GitHub, Bitbucket, and SourceForge, including tools for processing (parallel) corpora, alignment, translation, and natural language inference, as well as official test sets for evaluating English-Finnish machine translation.
OPUS – the open parallel corpus
OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
NLPL – The Nordic Language Processing Laboratory
The Nordic Language Processing Laboratory (NLPL) is a collaboration of academic research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) piloting innovative ways to share high-performance computing and data resources across country borders, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.
DiscoMT 2015 Shared Task on Pronoun Translation
DiscoMT at EMNLP 2015 featured two shared tasks: one in pronoun-focused machine translation and one in cross-lingual pronoun prediction. The data sets, baseline systems, and system submissions, along with additional manual annotation, are available from LINDAT/CLARIN.
ParCor – A Parallel Pronoun-Coreference Corpus
ParCor 1.0 is a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. It consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech) and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent.
A Blacklist Classifier for language discrimination
The blacklist classifier is a simple tool for discriminating related languages. It uses lists of blacklisted words, which can be trained on comparable data sets. The package comes with blacklists for distinguishing Bosnian, Croatian and Serbian. The data these lists were trained on is also included, together with a test set for the three languages.
Information and download: https://bitbucket.org/tiedemann/blacklist-classifier
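The general idea of blacklist-based language discrimination can be illustrated with a minimal Python sketch. This is not the code or the interface of the blacklist-classifier package: the function names, the `min_count` threshold, the language labels and the toy data are all assumptions made for illustration. Here, a word is blacklisted for a language if it occurs repeatedly in another language's data but never in that language's own data, and a text is assigned to the language whose blacklist it violates least.

```python
from collections import Counter


def train_blacklists(corpora, min_count=2):
    """Build one blacklist per language (hypothetical helper).

    corpora maps a language label to a list of sentences. A word is
    blacklisted for a language if it occurs at least min_count times in
    some other language's data and never in that language's own data.
    """
    counts = {
        lang: Counter(word for sent in sents for word in sent.lower().split())
        for lang, sents in corpora.items()
    }
    blacklists = {}
    for lang, own in counts.items():
        banned = set()
        for other, other_counts in counts.items():
            if other == lang:
                continue
            banned.update(
                word
                for word, n in other_counts.items()
                if n >= min_count and word not in own
            )
        blacklists[lang] = banned
    return blacklists


def classify(text, blacklists):
    """Return the language whose blacklist matches the fewest tokens."""
    tokens = text.lower().split()
    hits = {
        lang: sum(token in banned for token in tokens)
        for lang, banned in blacklists.items()
    }
    return min(hits, key=hits.get)


# Toy data with made-up language labels, for illustration only.
toy_corpora = {
    "lang_a": ["red cat sat", "red cat ran"],
    "lang_b": ["blue cat sat", "blue cat ran"],
}
toy_blacklists = train_blacklists(toy_corpora)
print(classify("the red cat", toy_blacklists))
```

In this toy setup, "blue" ends up on the blacklist for `lang_a` and "red" on the blacklist for `lang_b`, so a text containing "red" is attributed to `lang_a`. Ties are broken by the order in which languages appear in the input.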
Lingua::Align – a toolbox for Tree Alignment
Lingua::Align is a collection of command-line tools for automatic tree and word alignment of parallel corpora. Its main purpose is to provide a toolbox for experimenting with various feature sets and alignment strategies. Alignment is based on local classification and alignment inference. The local classifier is typically trained on available aligned training data.
Information and download: https://bitbucket.org/tiedemann/lingua-align
Uplug – tools for creating and processing parallel corpora
Information and download: https://bitbucket.org/tiedemann/uplug
Docent – A document-level search decoder for SMT
Information and download: https://github.com/chardmeier/docent/wiki
pdf2xml – convert PDF documents to XML
Information and download: https://bitbucket.org/tiedemann/pdf2xml
subalign – convert and align movie subtitles
Information and download: https://bitbucket.org/tiedemann/subalign
Setup and Data from EACL 2012 – Character-Based Pivot Translations for Under-Resourced Languages and Domains
Tiedemann, J. Character-Based Pivot Translations for Under-Resourced Languages and Domains. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
|Catalan – English (via Spanish)|parallel data (675 MB)|monolingual data (2.7 GB)|
|Galician – English (via Spanish)|parallel data (511 MB)|monolingual data (2.7 GB)|
|Macedonian – English (via Bulgarian and Bosnian)|parallel data (412 MB)|monolingual data (2.4 GB)|
|Norwegian – English (via Danish, Swedish, German, French)|parallel data (260 MB)|monolingual data (2.6 GB)|