The Fiskmö project is interested in all Swedish–Finnish and Finnish–Swedish translations. There are three main ways in which we obtain data: Translation Memory donation, Crawl and filter, and Data alignment.
How to provide data:
The simplest way to upload data is to use our anonymous upload service from the translation demo interface: https://translate.ling.helsinki.fi
There you can upload translation memories in TMX or XLIFF format, translated documents in a variety of common formats (XML, PDF, DOC, HTML, EPUB) or links to translated webpages that will be fetched automatically from the original source. On request, the pre-processed and aligned data can be sent back in TMX format to whoever provided the translated documents or webpage URLs. Just specify your e-mail address and tick the box in the form.
Another option is to use the data repository interface at https://opus-repository.ling.helsinki.fi. In this interface, you can create your own user account and upload data directly within the online system. The data will be processed and aligned if necessary, and you can freely decide whether the data set should be public or private. You can also create groups for sharing the data, and the system provides various tools for managing and processing it. Simply create your own translation memory using that tool.
The last option is to contact us via the contact links and discuss file transfer and the usage of the data. A common way to send data securely is to use the Funet FileSender app. As an external user, you can obtain upload vouchers that enable sending large files to us. We can also consider encrypting the data further before upload, for example by using password-protected zip files. Talk to us for more details.
Three ways to donate:
Option A: Translation Memory donation
- The donor simply hands over their translation memories in .tmx or a comparable format. This is the easiest way when the translation memories don’t contain any sensitive or secret data. Depending on what we and the donor agree upon, the data may either be made public or be used only internally for research and machine translation training.
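For donors curious about what a .tmx export contains, here is a minimal Python sketch (not Fiskmö's actual pipeline) that reads Finnish and Swedish segment pairs from a TMX file using only the standard library. The function name `read_tmx_pairs` and the default language codes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="fi", trg_lang="sv"):
    """Return (source, target) segment pairs from a TMX file path or file object.

    Hypothetical helper for illustration; language codes are compared on
    their primary subtag only, so "sv-FI" counts as "sv".
    """
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):  # one <tu> is one translation unit
        segments = {}
        for tuv in tu.iter("tuv"):  # one <tuv> per language variant
            # TMX 1.4 uses xml:lang; some older files use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens any inline elements down to their text.
                text = "".join(seg.itertext()).strip()
                segments[lang.lower().split("-")[0]] = text
        if src_lang in segments and trg_lang in segments:
            pairs.append((segments[src_lang], segments[trg_lang]))
    return pairs
```

Calling `read_tmx_pairs("memory.tmx")` on a hypothetical file returns a list of Finnish–Swedish tuples; units lacking either language are skipped.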
Option B: Crawl and filter
- When the translation memories contain sensitive or secret data, we may extract all the publicly available texts from the donor’s web site. We send these datasets to the donor, where they are pretranslated with the translation memories using a computer-assisted translation tool, such as SDL Trados Studio. Alternatively, if agreed upon, we can do this pretranslation ourselves. The results of these pretranslations are bilingual files which contain only the translation memory content that is already publicly available on the internet. These bilingual files are then exported into .tmx or a comparable file format and given to us, either as a public dataset or as an internal-use-only dataset.
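The filtering idea behind this option can be sketched in a few lines of Python. This is a simplification under stated assumptions: a real computer-assisted translation tool also uses fuzzy matching, whereas this sketch keeps a translation memory pair only when its source side matches a crawled public sentence exactly (after whitespace and case normalization). The names `normalize` and `filter_public_pairs` are hypothetical.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_public_pairs(tm_pairs, public_sentences):
    """Keep only the (source, target) pairs whose source text was found in the crawl.

    Exact-match assumption: pairs that are not literally present in the
    public text are dropped, so sensitive segments never leave the donor.
    """
    public = {normalize(sentence) for sentence in public_sentences}
    return [(src, trg) for src, trg in tm_pairs if normalize(src) in public]
```

Because the filter runs on the donor's side, only the pairs that survive it would need to be exported and handed over.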
Option C: Data alignment
- If the donor does not have any translation memories but has, for example, a bilingual website, we may extract the texts and align the corresponding sentences, thus creating a bilingual dataset. This is also a good option if the donor has the same documents in both Finnish and Swedish but no translation memories.
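To give a rough idea of what sentence alignment involves, the sketch below pairs two sentence lists with a simple length-based dynamic program. It is a toy illustration, not the aligner used by the project: the cost function (difference in character length) and the two penalty parameters are assumptions chosen for readability, and production aligners use statistical or lexical evidence as well.

```python
def align_sentences(src, trg, skip_penalty=10, merge_penalty=3):
    """Align two sentence lists, allowing 1-1, 1-0, 0-1, 2-1 and 1-2 groupings.

    Returns a list of (source_text, target_text) pairs; a skipped sentence
    is paired with an empty string. Costs and penalties are illustrative.
    """
    n, m = len(src), len(trg)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i source and j target sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    shapes = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in trg[j:nj])
                if di == 0 or dj == 0:
                    # Leaving a sentence unaligned costs its length plus a penalty.
                    step = skip_penalty + s_len + t_len
                else:
                    # Translations tend to have similar lengths; merging costs extra.
                    step = abs(s_len - t_len)
                    if di + dj > 2:
                        step += merge_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen groupings.
    aligned = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        aligned.append((" ".join(src[pi:i]), " ".join(trg[pj:j])))
        i, j = pi, pj
    aligned.reverse()
    return aligned
```

The 1-2 and 2-1 groupings matter in practice because translators often split or merge sentences, so a strict one-to-one pairing would drift out of step.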
- By donating, you make a valuable contribution towards Fiskmö’s main goals, building a massive Finnish–Swedish parallel corpus for language research and creating a public machine translation service between Finnish and Swedish.
- We can train a machine translation engine on your material and give you access to a specialized machine translator tailored to your texts.
- If you donate language versions of a text but don’t have a translation memory, we give you the aligned translation memory to use.
- We can create bilingual dictionaries/termbases for you based on word frequencies in the donated data.
- We can help you maintain and clean existing data, for example by fixing:
  - wrongly tagged languages (using language detection)
  - character encoding problems (e.g. question marks in place of å, ä, ö)
  - XML/HTML code in the data
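Two of the cleaning steps listed above can be illustrated with a short, standard-library-only Python sketch. It is an assumption-laden example rather than the project's tooling: the helper names are hypothetical, the markup stripper is a plain regular expression rather than a full HTML parser, and the repair function handles only the common case of UTF-8 text that was mistakenly decoded as Latin-1. Characters already replaced by question marks cannot be recovered automatically, so such segments are only flagged.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# A question mark or U+FFFD between two word characters, or any U+FFFD at all,
# suggests that a character such as å, ä or ö was lost during conversion.
SUSPICIOUS_RE = re.compile(r"\w[?\ufffd]\w|\ufffd")


def strip_markup(segment):
    """Remove XML/HTML tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub(" ", segment)
    text = html.unescape(text)
    return " ".join(text.split())


def looks_misencoded(segment):
    """Flag segments that probably lost characters to an encoding problem."""
    return bool(SUSPICIOUS_RE.search(segment))


def repair_mojibake(segment):
    """Undo UTF-8 text that was wrongly decoded as Latin-1; otherwise return it unchanged."""
    try:
        return segment.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return segment
```

Flagging rather than silently rewriting keeps a human in the loop for segments where the original character cannot be determined with certainty.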