The Fiskmö project is interested in all Swedish–Finnish and Finnish–Swedish translations. There are three main ways in which we obtain data: Translation Memory donation, Crawl and filter, and Data alignment.
Three ways to donate:
Option A: Translation Memory donation
- The donor simply hands over their translation memories in .tmx or a comparable format. This is the easiest option when the translation memories contain no sensitive or secret data. Depending on what we and the donor agree upon, the data may either be made public or be used only internally for research and machine translation training.
Option B: Crawl and filter
- When the translation memories contain sensitive or secret data, we may extract all the publicly available texts from the donor’s website. We send these datasets to the donor, who pretranslates them with the translation memories using a computer-assisted translation tool, such as SDL Trados Studio. Alternatively, if agreed upon, we can do this pretranslation ourselves. The results of the pretranslation are bilingual files that contain only the translation memory content that is already publicly available on the internet. These bilingual files are then exported into .tmx or a comparable file format and given to us, either as a public dataset or as an internal-use-only dataset.
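The filtering idea behind Option B can be illustrated with a minimal Python sketch. It is not the workflow of any particular translation tool: the function names `normalize` and `public_subset` are hypothetical, and the sketch assumes exact matching after whitespace and case normalization, whereas a real computer-assisted translation tool typically also handles fuzzy matches.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def public_subset(memory, public_sentences):
    """Keep only the translation memory entries whose source text was also
    found in the crawled, publicly available sentences.

    memory: list of (source, target) pairs (assumed in-memory representation)
    public_sentences: iterable of sentences extracted from the website
    """
    public = {normalize(sentence) for sentence in public_sentences}
    return [(src, tgt) for src, tgt in memory if normalize(src) in public]


memory = [
    ("Tervetuloa sivuillemme.", "Välkommen till vår webbplats."),
    ("Sisäinen muistio, ei julkinen.", "Internt memo, inte offentligt."),
]
crawled = ["Tervetuloa  sivuillemme.", "Ota yhteyttä."]

# Only the first entry appears on the public website, so only it is kept.
filtered = public_subset(memory, crawled)
```

The point of the design is that nothing leaves the donor's memory unless its source text was already published, which is what keeps sensitive entries out of the exported file.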
Option C: Data alignment
- If the donor does not have any translation memories but has, for example, a bilingual website, we may extract the texts and align the corresponding sentences, thus creating a bilingual dataset. This is also a good option if the donor has the same documents in both Finnish and Swedish but no translation memories.
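To give a rough idea of what aligning corresponding sentences involves, here is a small Python sketch. It is an illustration under simplifying assumptions, not the aligner actually used in the project: it assumes both texts are already split into sentences, it only considers one-to-one pairings and skipped sentences, and it scores pairs purely by character length. The function name `align` and the `skip_cost` parameter are hypothetical.

```python
def align(src, tgt, skip_cost=1.0):
    """Pair up sentences from two lists using dynamic programming.

    A one-to-one pairing costs the relative difference in character length;
    leaving a sentence unpaired costs skip_cost. Returns the list of
    (source, target) pairs on the cheapest path.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:  # pair src[i-1] with tgt[j-1]
                a, b = len(src[i - 1]), len(tgt[j - 1])
                step = abs(a - b) / max(a, b, 1)
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:  # leave src[i-1] unpaired
                candidates.append((cost[i - 1][j] + skip_cost, (i - 1, j)))
            if j > 0:  # leave tgt[j-1] unpaired
                candidates.append((cost[i][j - 1] + skip_cost, (i, j - 1)))
            for total, prev in candidates:
                if total < cost[i][j]:
                    cost[i][j], back[i][j] = total, prev

    # Walk back from the end and collect the one-to-one pairings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((src[i - 1], tgt[j - 1]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


finnish = ["Hyvää päivää.", "Tämä on pitkä lause, jossa on paljon sanoja.", "Kiitos."]
swedish = ["God dag.", "Tack."]

# The long Finnish sentence has no counterpart and is left unpaired.
aligned = align(finnish, swedish)
```

Length alone is a crude signal. Practical aligners usually combine it with lexical evidence and also allow one sentence to match two, so the output of a sketch like this would still need checking.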
- By donating, you make a valuable contribution towards Fiskmö’s main goals: building a massive Finnish–Swedish parallel corpus for language research and creating a public machine translation service between Finnish and Swedish.
- We can train a machine translation engine on your material and give you access to a specialized engine tailored to your texts.
- If you donate language versions of a text but don’t have a translation memory, we give you the resulting aligned translation memory to use.
- We can create bilingual dictionaries/termbases for you based on word frequencies in the donated data.
- We can help you maintain and clean existing data by fixing, for example:
  - wrongly tagged languages (using language detection)
  - character encoding problems (e.g. question marks in place of å, ä, ö)
  - XML/HTML code in the data
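
As an illustration of the kinds of checks listed above, the following Python sketch shows one simple way each of them could be approached. It is only a sketch under stated assumptions: the function names and the tiny word lists are hypothetical, the markup removal is a plain regular expression rather than a full parser, and real language detection would normally rely on a trained language identification model.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

# Tiny illustrative word lists; far too small for real use.
FI_WORDS = {"ja", "on", "ei", "että", "se"}
SV_WORDS = {"och", "är", "inte", "att", "det"}


def strip_markup(segment: str) -> str:
    """Remove XML/HTML tags, decode entities and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", segment)
    return re.sub(r"\s+", " ", html.unescape(without_tags)).strip()


def looks_mis_encoded(segment: str) -> bool:
    """Flag segments showing typical signs of encoding damage."""
    # U+FFFD is the Unicode replacement character left behind by a failed decode.
    if "\ufffd" in segment:
        return True
    # A question mark between two letters often replaces a lost å, ä or ö.
    return re.search(r"\w\?\w", segment) is not None


def guess_language(segment: str):
    """Return 'fi', 'sv' or None based on counts of common function words."""
    tokens = re.findall(r"\w+", segment.lower())
    fi = sum(token in FI_WORDS for token in tokens)
    sv = sum(token in SV_WORDS for token in tokens)
    if fi == sv:
        return None
    return "fi" if fi > sv else "sv"


clean = strip_markup("<b>Hyvää</b> p&auml;iv&auml;&auml;")
damaged = looks_mis_encoded("Hyv?? p?iv??")
language = guess_language("Det är inte sant")
```

A segment tagged as Finnish for which `guess_language` returns `"sv"` would be a candidate for a language tag correction, and segments flagged by `looks_mis_encoded` would be set aside for repair or removal.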