Machine Translation and MWEs – Lab Session

Goals of this Lab Session

The aim of this session is to give you a first impression of phrase-based statistical machine translation and the basics of training a model from data. The example is a toy example, but large enough to give reasonable results. We also include a special data set in which some multi-word units have been marked so that the model treats them differently. Does this make a difference? And if so, what kind of difference can be seen? You will see …

Getting Started With Linux

During this lab session, we will use the command line of the Linux system to run all tools and to inspect the results of the training and translation processes. If you are not familiar with the command line and basic Linux commands, first get comfortable with the most basic operations you can run in the terminal window; the terminal gives you access to the file system and the tools installed on the machine.

  • start a new terminal (ask for help if you don’t know what that means)
  • there will be a prompt waiting for your input
  • you will be in your own home directory after starting the terminal window
  • commands that you can run are typically names of programs followed by some parameters that control the behaviour of the program
  • type the whole command on one line and press <enter> to run it

Try to get help if you feel lost already …

  • There are special characters that have special meanings on the command line – don’t use them when giving names to files: ' ' (space), *, &, |, >, <, (, )
  • The file system is organised in directories that may contain any number of sub-directories

Try the following commands:

ls                              list all files and sub-directories in the current directory
mkdir lab                       create a directory with the name ‘lab’
cd lab                          change into the directory ‘lab’
echo 'test'                     print the string ‘test’ in the terminal window
echo 'test' > file.txt          print ‘test’ and save it in a file called ‘file.txt’
                                NOTE: this will overwrite the file if it already exists
echo 'hi' >> file.txt           print ‘hi’ and add it to the file ‘file.txt’
                                NOTE: this will append a line at the end of the file
cat file.txt                    print the contents of the file ‘file.txt’ in the terminal window
cp file.txt new.txt             copy the file ‘file.txt’ to a new file ‘new.txt’
grep 'hi' file.txt              search for lines with the string ‘hi’ in the file ‘file.txt’
grep 'hi' file.txt > out.txt    save the result of your search in ‘out.txt’
less file.txt                   show the contents of the file ‘file.txt’ and allow scrolling
                                (stop showing the file by pressing ‘q’ for quit)
zless file.gz                   show the contents of a compressed file
nano file.txt                   edit the file ‘file.txt’;
                                save your changes by pressing ctrl+o (ctrl+s also works in recent versions)
                                leave the editor by pressing ctrl+x
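The basic commands above can be combined into a short session. A sketch, assuming you start in a directory that does not yet contain ‘lab’:

```shell
mkdir lab                  # create a working directory
cd lab                     # change into it
echo 'test' > file.txt     # create file.txt with a single line: test
echo 'hi' >> file.txt      # append a second line: hi
cat file.txt               # prints both lines
grep 'hi' file.txt         # prints only the line containing 'hi'
cp file.txt new.txt        # new.txt is now an identical copy of file.txt
```

Try variations of this yourself; nothing here can break anything outside the ‘lab’ directory.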

Data Sets and Experimental Setup

We will use data from Tatoeba – a collaborative data set of translated sentences – to test statistical machine translation models and tools. We will focus on French and English using French as the input (source) language and English as the target language. Feel free to test the other direction as well if you like. The data sets are already prepared for you and you can find them in the following directory on the local file system: ……

You can also download the data sets from here (uncompress the files using the command ‘unzip’):

The data sets are split into training data, development data and test data. The file names tell you which part of the data sets each file contains (train, dev, test) and the final extension specifies the language (en for English and fr for French).

Aligned sentences are on the same line in corresponding files. Look at the test data to see what they look like:

less tatoeba.tok.test.en
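Because the translations are aligned line by line, the two files of a language pair must have exactly the same number of lines. A quick sanity check, illustrated here on toy stand-in files (run the same commands on the real tatoeba files):

```shell
# Toy stand-ins for tatoeba.tok.test.fr / tatoeba.tok.test.en:
printf 'le chat dort .\nje suis ici .\n' > toy.fr
printf 'the cat sleeps .\ni am here .\n' > toy.en
wc -l toy.fr toy.en   # the line counts must be identical for aligned data
paste toy.fr toy.en   # show each sentence pair side by side (tab-separated)
```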

We will work with two different sets of files:

  1. tokenised lowercased data (look at the files to understand what tokenisation does)
  2. tokenised data with marked multi-word expressions (try to find MWEs and see how they are marked)

MWEs are taken from automatically parsed data. Only a few MWEs are available. Try to find MWEs and see how they are marked in the text. What is the purpose of marking them in this way?
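One way to hunt for marked MWEs is to search for tokens joined with ‘+’ characters (the MWE markup used later in this lab; the real file names follow the tok→mwe naming). A toy illustration:

```shell
# Toy line standing in for a line of the MWE-marked data:
echo 'he kicked+the+bucket yesterday' > toy.mwe
# Extract '+'-joined token groups:
grep -ohE '[[:alpha:]]+(\+[[:alpha:]]+)+' toy.mwe   # prints: kicked+the+bucket
# On the real data you could count the most frequent MWEs, e.g.:
#   grep -ohE '[[:alpha:]]+(\+[[:alpha:]]+)+' tatoeba.mwe.test.* | sort | uniq -c | sort -nr | head
```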

Make a new directory for your experiments and copy all the data into that directory:

mkdir mtlab
cd mtlab
cp -R /path/to/data/* .

Train a Language Model

The first step in our training procedure is to train a probabilistic language model. We will use a 5-gram model in our case:

lmplz -o 5 < tatoeba.tok.train.en > train-tok.arpa

(here we call the ARPA output file ‘train-tok.arpa’; you can pick any name, but the next step must use the same one)

What is a language model and what is it good for in statistical machine translation? Look at the output file (use ‘less’). You can find explanations of the ARPA file format online.
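In the ARPA format, each n-gram order is listed in its own section: a log10 probability, the n-gram itself and (where applicable) a backoff weight. A schematic miniature example (all numbers are invented for illustration; real files are much larger):

```shell
cat <<'EOF'
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.2  <unk>
-0.8  <s>  -0.3
-0.7  the  -0.2
-1.0  cat

\2-grams:
-0.5  <s> the
-0.9  the cat

\end\
EOF
```

The first column is the log10 probability of the n-gram; the optional last column is the backoff weight used when a longer n-gram is not found.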

The next step is to create a compact binary format that we can use in machine translation:

build_binary train-tok.arpa train-tok.kenlm

If there are problems training the language models then get pre-trained models from here.

Train a Translation Model

The next step is to train a probabilistic phrase table from the parallel data. This involves several steps that we have introduced during the course, including word alignment, phrase extraction and phrase scoring.

  • Word alignment

Use efmaral to run word alignments in both directions (the ‘-r’ flag reverses the alignment direction):

efmaral -i tatoeba.tok.train.en tatoeba.tok.train.fr > train-tok.fr-en
efmaral -r -i tatoeba.tok.train.en tatoeba.tok.train.fr > train-tok.en-fr

Look at the output files and try to understand the data format. Words are aligned based on their positions in the text. Source and target word positions are separated by a hyphen. The first token in a sentence is at position 0. Does the alignment make sense?
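To see how the 0-based positions map to words, you can number the tokens of a sentence yourself (a small awk one-liner; the sentence is just an example):

```shell
# Print each token of a sentence with its 0-based position:
echo 'je ne suis pas fatigué' | awk '{for (i = 1; i <= NF; i++) print i-1, $i}'
# 0 je
# 1 ne
# 2 suis
# 3 pas
# 4 fatigué
# An alignment link such as '3-2' would then connect source position 3 ('pas')
# to position 2 in the corresponding target sentence.
```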

  • Alignment symmetrisation

We need to combine both alignment directions to get one symmetric alignment to be used for extracting phrase translations:

mkdir -p train-tok/model
atools -c grow-diag-final -i train-tok.fr-en -j train-tok.en-fr > train-tok/model/aligned.grow-diag-final
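To build an intuition for symmetrisation: grow-diag-final starts from the alignment links that both directions agree on (the intersection) and then adds further links from the union according to a set of heuristics. A toy illustration with two hand-written alignment files (the links are invented):

```shell
# Hypothetical forward and backward alignment links for one sentence pair:
printf '0-0\n1-1\n2-2\n' | sort > fwd   # forward direction
printf '0-0\n1-2\n2-2\n' | sort > bwd   # backward direction
comm -12 fwd bwd   # intersection: links both directions agree on (0-0, 2-2)
sort -u fwd bwd    # union: the pool grow-diag-final may add links from
```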
  • Phrase extraction

All phrase-pairs that are consistent with the symmetrised word alignments are extracted from the parallel corpus. The maximum length of a phrase is 7 tokens by default. Run the following command:

train-model.perl --corpus tatoeba.tok.train -e en -f fr --root-dir train-tok -do-steps 5

This command creates two files with extracted phrase pairs in the directory train-tok/model (this will take some time and the files will be quite big). Look at the files and try to understand how these phrases are extracted from the word-aligned parallel training corpus. Note that the files are compressed! Use the command zless instead of less.

  • Phrase scoring

Finally, we have to estimate scores for each unique phrase pair that has been extracted from the training data. There are 4 scores that will be used in the phrase table, two phrase translation probabilities and two lexical translation weights. Run the following commands to estimate the scores:

train-model.perl --corpus tatoeba.tok.train -e en -f fr --root-dir train-tok -do-steps 4
train-model.perl --corpus tatoeba.tok.train -e en -f fr --root-dir train-tok -do-steps 6
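The phrase translation probabilities are simple relative frequencies over the extracted phrase pairs, e.g. p(e|f) = count(f,e) / count(f). A toy computation with invented phrase pairs:

```shell
# Three extracted occurrences of the source phrase 'chat':
printf 'chat ||| cat\nchat ||| cat\nchat ||| kitty\n' > pairs
grep -c '^chat ||| cat$' pairs   # count(f,e) = 2
grep -c '^chat' pairs            # count(f)   = 3, so p(cat|chat) = 2/3
```

The lexical weights are computed in a similar spirit, but from the word-level translation probabilities inside each phrase pair.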

This will, again, take quite some time. Read more about Moses in the meantime. When ready, look at the phrase table, which you will find in train-tok/model. This file is also compressed, so use the command zless instead of less.
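Each line of the phrase table has the shape ‘source phrase ||| target phrase ||| scores ||| word alignment ||| counts’. A schematic example line (phrases, scores and counts are invented here; check the Moses documentation for the exact order of the four scores):

```shell
cat <<'EOF'
ne suis pas ||| am not ||| 0.4 0.1 0.5 0.2 ||| 0-1 1-0 2-1 ||| 5 4 2
EOF
```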

Create Configuration File and Translate

Now we are ready with the two main components and we can use the models to translate sentences from French into English. But first we have to create the configuration file for our model to tell the translation engine (the “decoder”) how to run the model:

train-model.perl --corpus tatoeba.tok.train -e en -f fr --root-dir train-tok -lm 0:5:${PWD}/train-tok.kenlm -do-steps 9

This creates the configuration file (moses.ini) in the directory train-tok/model. Look at the file and try to figure out what the components of this model are.

Finally, we can translate sentences in our test set:

moses -f train-tok/model/moses.ini < tatoeba.tok.test.fr > test-tok.en

This will take a few minutes and you can start setting up a system for the MWE-corpus in the meantime. Open a new terminal window and repeat the whole procedure for the other data set. We recommend that you create a brand-new working directory to avoid overwriting model files. Copy the MWE data into the new directory and start all over again. Don’t forget to change ‘tok’ into ‘mwe’ in all commands and files!


Let’s also look at the translation results. First we will run automatic evaluation using BLEU and the reference translations:

multi-bleu.perl tatoeba.tok.test.en < test-tok.en

What is the score that you get? What is your impression of the translations?
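As a reminder, BLEU combines modified n-gram precisions $p_n$ (usually up to $n = 4$, with uniform weights $w_n = 1/4$) with a brevity penalty BP that punishes translations shorter than the reference, where $c$ is the total length of the candidate translations and $r$ the reference length:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{4} w_n \log p_n \Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Note that BLEU is a corpus-level score: the n-gram counts are accumulated over all test sentences before the precisions are computed.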

Comparison – With or Without MWEs

So, what happens when we mark MWEs as single tokens? Now it is time to compare the two models based on the translations of the test sets. Make sure that you translate the proper version for each model (the one with marked MWEs for the MWE model)!

After translating the test set with the MWE markup, run the following command to get rid of the ‘+’ characters that connect the MWE tokens:

tr '+' ' ' < test-mwe.en > test-mwe.en.tok

(assuming that ‘test-mwe.en’ is the file with your translated test set)

Compute BLEU scores after this conversion with the reference translation without MWE markup. What is the difference in terms of BLEU between the two models for the given test set? Look also more closely at the difference between the two translations. You can put them together with the following command:

paste -d "\n" tatoeba.tok.test.fr tatoeba.tok.test.en test-tok.en test-mwe.en.tok | sed 'n;n;n;G' > test.merged

Look at the file test.merged. In each block, the first two lines are the input and the reference translation. The third line is the translation without MWE markup and the fourth line is the one with MWE markup. What do you observe? Can you see systematic differences and patterns that may be caused by the different treatment of the training data? What is your conclusion? You may also look at the phrase tables of each model to figure out what is going on with the translations of MWEs. Would you have any ideas to improve the models?

Inspecting extracted phrase translations

Let’s have a look at the phrases that we have extracted to see if they make sense and what they cover. We are mainly interested in multi-word expressions and, therefore, we would like to filter the gigantic phrase tables to see some interesting examples. First, we can sort the extracted phrases (from the model without marked MWEs) to get the most frequently extracted phrase pairs with multi-word units that only involve words with alphabetic characters. Run the following command to get a sorted list of those phrases and their translations:

zgrep '^[[:alpha:]]* [[:alpha:]][[:alpha:] ]* |||' train-tok/model/extract.sorted.gz | uniq -c | sort -nr > extract.tok.mwe
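To see what each stage of that pipeline contributes, here is the same ‘uniq -c | sort -nr’ counting idea applied to a toy sorted file (the lines are invented):

```shell
# A tiny, already-sorted stand-in for the extracted phrase pairs:
printf 'a b ||| y\na b ||| y\na c ||| z\n' > toy.sorted
uniq -c toy.sorted             # collapse adjacent duplicates, prefixing each with its count
uniq -c toy.sorted | sort -nr  # sort numerically, most frequent pair first
```

Note that uniq only merges adjacent duplicates, which is why the input must be sorted first (extract.sorted.gz already is).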

Look at the beginning of the file extract.tok.mwe (using ‘less’). If you don’t already know what this command line does, ask the teachers or your fellow trainees. What do you see? Are there any interesting MWEs in the list and do their translations make sense? Select some of the phrases and try to find them in the probabilistic phrase table of the translation model. For example, to look for translations of ‘je ne suis pas’ you can run the following command:

zgrep '^je ne suis pas' train-tok/model/phrase-table.gz | less

What do you see in the phrase table?

Let’s also look at the other phrase table with the MWEs marked by ‘+’ characters. Run the following commands on the extracted phrases in the other model with marked MWEs:

zgrep '^[[:alpha:]]*+[[:alpha:]]* |||' train-mwe/model/extract.sorted.gz | uniq -c | sort -nr > extract.mwe.src
zgrep '||| [[:alpha:]]*+[[:alpha:]]* |||' train-mwe/model/extract.sorted.gz | uniq -c | sort -nr > extract.mwe.trg

What can you see in those files (extract.mwe.src and extract.mwe.trg)? What kind of translations are listed for the selected MWEs? Can you also find the corresponding entries in the probabilistic phrase table?


Links

  • Moses – decoder and toolkit for statistical machine translation
  • efmaral – efficient word aligner
  • kenlm – scalable language modelling