Sheffield Submissions for WMT18 Multimodal Translation Shared Task

This paper describes the University of Sheffield’s submissions to the WMT18 Multimodal Machine Translation shared task. We participated in both Tasks 1 and 1b. For Task 1, we build on a standard sequence-to-sequence attention-based neural machine translation (NMT) system and investigate the utility of multimodal re-ranking approaches. More specifically, n-best translation candidates from this system are re-ranked using novel multimodal cross-lingual word sense disambiguation models. For Task 1b, we explore three approaches: (i) re-ranking based on cross-lingual word sense disambiguation (as for Task 1), (ii) re-ranking based on consensus of NMT n-best lists from German-Czech, French-Czech and English-Czech systems, and (iii) data augmentation by generating English source data through machine translation from French to English and from German to English, followed by hypothesis selection using a multimodal re-ranker.


Introduction
This paper describes the University of Sheffield's submissions for both Tasks 1 and 1b of the third edition of the Multimodal Machine Translation shared task. Task 1 consists in translating source sentences in English that describe an image into German (DE) or French (FR) or Czech (CS), given the image. Task 1b consists in translating source sentences in English that describe an image into Czech, given the image and the French and German translations of the source sentence.
This task poses the challenging problem of building models that use both language and image modalities. The dataset for the shared task has sentences with simple language constructions, and earlier work (Elliott et al., 2017) has observed that standard text-only sequence-to-sequence neural machine translation (NMT) models with attention obtain very high performance.
Building on this, we built our own standard NMT systems for the EN-DE, EN-FR and EN-CS language directions and noticed that the translation hypotheses other than the 1-best output are also of high quality. We had our systems produce 20 translation hypotheses for each English description in the validation set, selected the hypothesis with the highest sentence-level METEOR (Denkowski and Lavie, 2014) score, which we call the Oracle, and compared it to the 1-best. The Oracle outperforms the 1-best output by a wide margin (10.84 to 13.49 METEOR points; see Table 1). This preliminary experiment motivated us to investigate re-ranking approaches.

Table 1: Motivation for re-ranking. In this preliminary experiment, we observe that re-ranking the 20-best translation hypotheses generated by a standard NMT model has the potential to improve translation by 10.84 to 13.49 METEOR points for the three language pairs.
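The oracle selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: `unigram_f1` is a hypothetical toy stand-in for sentence-level METEOR, and the example sentences are invented.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy sentence-level score; a stand-in for METEOR (assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def oracle(nbest, reference, score=unigram_f1):
    """Return the hypothesis in the n-best list with the highest score."""
    return max(nbest, key=lambda hyp: score(hyp, reference))


# Invented 3-best list; the first entry plays the role of the 1-best output.
nbest = ["ein Mann geht", "ein Mann rennt im Park", "Mann im Park"]
reference = "ein Mann rennt im Park"
one_best = nbest[0]
best = oracle(nbest, reference)
print(one_best, "|", best)
```

The gap between the score of `best` and the score of `one_best`, averaged over a validation set, is the headroom that a perfect re-ranker could recover.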
For a re-ranking strategy, we were inspired by how humans use images to translate image descriptions. We believe humans usually look at the image to disambiguate ambiguous words in the source sentence, especially in those instances where the text alone is not sufficient. For example, translating 'A sportsperson is playing football' into French requires us to know whether the sportsperson is male or female; accordingly, the translation is 'Un sportif joue au football' (male) or 'Une sportive joue au football' (female). In such cases, humans usually look at the image to disambiguate and select the correct translation, which is what we try to mimic in our approach.
More specifically, in our systems we adopt a two-step pipeline approach. In the first step, we use an ensemble of text-only models initialized with different seeds to produce lists of 10-best translation hypotheses. In the second step, we re-rank the 10-best hypotheses using a novel multimodal cross-lingual Word Sense Disambiguation (WSD) approach. For control experiments, we also compare our results with monomodal cross-lingual WSD (Lefever and Hoste, 2013) and a system that performs re-ranking using the Most Frequent Sense (MFS) baseline (Section 3.1.2).
Our main goal is to investigate a multimodal, image-based, cross-lingual WSD that predicts the translation candidate which correctly disambiguates ambiguous words in the source sentence. Our baseline NMT system is based on the attentive encoder-decoder (Bahdanau et al., 2015) approach with a Conditional GRU (CGRU) (Cho et al., 2014) decoder and is built using the NMTPY toolkit (Caglayan et al., 2017b).
For Task 1b, we explore three approaches. The first approach concatenates the 10-best translation hypotheses from DE-CS, FR-CS and EN-CS MT systems and then re-ranks them using the image-aware multimodal cross-lingual WSD mentioned earlier (the same way as in Task 1) (Section 3.1.2).
The second approach explores the consensus between the different 10-best lists. The best hypothesis is selected according to the number of times it appeared in the different 10-best lists. We followed the order of the n-best lists, meaning that the highest ranked hypothesis with the majority votes was selected.
The third approach uses data augmentation that hinges on the fact that the objective is to translate from English into Czech. Extra source data is generated by building systems that translate from German into English and French into English. With this extra data, we build an EN-CS system. We then obtain a 10-best list over training, development and test sets respectively. For selecting the best hypothesis from the 10-best list, we experiment with a classification-based approach. We calculate METEOR (Denkowski and Lavie, 2014) scores for each hypothesis in the 10-best list of the training set and threshold the scores to build classifiers to distinguish good from bad translations using a) word embeddings and image features with a Random Forest model and b) a multimodal Recurrent Neural Network (RNN) model.
We describe the data in Section 2 and our systems in detail in Section 3. The results are discussed in Section 4.

Translation models
We use the Multi30K dataset provided by the organizers. Each image i is paired with one English description en_i taken from Flickr30K and human translations into German de_i, French fr_i and Czech cz_i. In other words, each instance is a 5-tuple of the form (i, en_i, de_i, fr_i, cz_i). The dataset contains 29,000 training and 1,014 development instances.
For Task 1, the test sets of the previous two editions (2016 and 2017) have also been provided for validation purposes. These do not contain Czech translations. A new test set of 1,071 tuples containing an English description and its corresponding image is provided for evaluation.
For Task 1b, a test set of 1,000 tuples containing English, French, and German descriptions and their corresponding images is provided for evaluation. This test set corresponds to the unseen portion of the Czech Test 2017 data. The test set of 2016 is provided for validation purposes.

Cross-lingual WSD models
For the cross-lingual WSD models, we use the Multimodal Lexical Translation Dataset (MLTD) (Lala and Specia, 2018), which was extracted from the Multi30K dataset. MLTD consists of 4-tuples of the form (x, i, en_i, x_t), where x is an ambiguous word in the English description en_i of the image i, and x_t is the lexical translation of x in a specified target language t ∈ {German, French, Czech} that conforms with the image and the description. Only instances from the training portion of the Multi30K dataset are used to train the cross-lingual WSD models.
For English-German, MLTD consists of 745 ambiguous words in English with 4.09 different translations per word (on average) in German and 17.69 instances per translation (on average) totalling 53,868 MLTD instances.
For English-French, MLTD consists of 661 ambiguous words in English with 2.98 different translations per word (on average) in French and 22.73 instances per translation (on average) totalling 44,779 MLTD instances.
For English-Czech, MLTD consists of 3,217 ambiguous words in English with 5.15 different translations per word (on average) in Czech and 11.32 instances per translation (on average) totalling 187,495 MLTD instances.
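As a quick consistency check on the statistics above, the product of the number of ambiguous words, the average translations per word and the average instances per translation should roughly reproduce each reported total. The sketch below only re-derives numbers stated in the text; small gaps are expected because the averages are rounded to two decimals.

```python
# (ambiguous words, avg translations per word,
#  avg instances per translation, reported total), as given in the text
stats = {
    "EN-DE": (745, 4.09, 17.69, 53_868),
    "EN-FR": (661, 2.98, 22.73, 44_779),
    "EN-CS": (3_217, 5.15, 11.32, 187_495),
}


def estimated_total(words, translations_per_word, instances_per_translation):
    """Approximate instance count implied by the reported averages."""
    return words * translations_per_word * instances_per_translation


gaps = {}
for pair, (w, t, i, reported) in stats.items():
    estimate = estimated_total(w, t, i)
    gaps[pair] = abs(estimate - reported) / reported
    print(f"{pair}: estimated {estimate:,.0f}, reported {reported:,}, "
          f"relative gap {gaps[pair]:.3%}")
```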

Image features
We used the ResNet-50 image features provided by the task organizers. These are 2048-dimensional features extracted from pool5 of a pretrained ResNet-50 (He et al., 2016) model which has been trained on the ImageNet dataset (Russakovsky et al., 2015).

System descriptions
In this section we describe the systems submitted for both tasks.

Task 1 systems
Our two-step pipeline consists of first obtaining high-quality hypotheses from an NMT model, followed by a re-ranking step. We describe the setup of the NMT model in Section 3.1.1. The cross-lingual WSD models used for re-ranking are described in Section 3.1.2 and the re-ranking formulation with examples is shown in Section 3.1.3.

Baseline NMT model setup
We make use of an ensemble of text-only attention-based NMT models (Bahdanau et al., 2015) with a conditional gated recurrent unit (CGRU) (Cho et al., 2014) decoder. We build the system using the NMTPY toolkit (Caglayan et al., 2017b).
Our models have a setting similar to Caglayan et al. (2016), with a bi-directional 256-dimensional recurrent GRU followed by a conditional GRU which is initialized with a non-linear transformation of the mean of the encoder states. We use a simple feedforward network to compute the attention scores as described in Caglayan et al. (2016). We use the Adam optimizer with a learning rate of 5e-5 and a batch size of 64. We set the embedding dimensionality of the encoder and decoder to 128 and follow the default parametrization in Caglayan et al. (2017a). Our final baseline model is an ensemble of five runs of the model with different seeds.

Cross-lingual WSD models
The goal of cross-lingual WSD (Lefever and Hoste, 2013) is to generate contextually correct translations of ambiguous words in the source language into the target language. For this, the sense inventory for the ambiguous words is created from the parallel corpus. MLTD (Lala and Specia, 2018) (Section 2.2) provides us with the data settings needed for this task.
As a baseline we have the Most Frequent Sense (MFS) model, which returns the most frequent translation of a given ambiguous word as seen in the training corpus. For example, in the English-French MLTD, the ambiguous word woods appears 95 times in the training set. In 16 of these cases the translation is forêt (forest), while in the remaining 79 the translation is bois (timber/wood). In this case, the MFS model translates the word woods as bois irrespective of the context.
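The MFS baseline amounts to a frequency lookup. The following is a minimal sketch, assuming the training data is available as (ambiguous word, lexical translation) pairs; the function name `train_mfs` is hypothetical, and the counts for woods are those reported above.

```python
from collections import Counter, defaultdict


def train_mfs(pairs):
    """Map each ambiguous source word to its most frequent translation."""
    counts = defaultdict(Counter)
    for source_word, translation in pairs:
        counts[source_word][translation] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Counts for "woods" as reported in the text: 16 x forêt, 79 x bois.
training_pairs = [("woods", "forêt")] * 16 + [("woods", "bois")] * 79
mfs = train_mfs(training_pairs)
print(mfs["woods"])  # bois
```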
As a second baseline, we have a text-only Lexical Translation (LT) model. This is a single-layer Bidirectional Long Short-Term Memory (BiLSTM) network (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) used as a sequence tagger, as depicted in Figure 1.
For the LT model, we convert the classification task of cross-lingual WSD into a sequence tagging task as demonstrated in Raganato et al. (2017). The 4-tuples of MLTD are transformed into a sequential tagged dataset. This consists of English sentences where each word is tagged with itself if it is unambiguous and with the correct lexical translation in the target language if it is ambiguous. In the example of Figure 1, the ambiguous words trail and woods are tagged with their lexical translations sentier and bois respectively, while every unambiguous word is tagged with itself.
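The conversion to a tagging dataset can be illustrated with a short sketch. It assumes that the MLTD tuples for one sentence have already been gathered into a word-to-translation dictionary and that whitespace tokenization is sufficient; the sentence and the helper name `to_tagged_sequence` are hypothetical.

```python
def to_tagged_sequence(sentence, lexical_translations):
    """Tag unambiguous words with themselves and ambiguous words
    with their lexical translation in the target language."""
    tokens = sentence.split()
    tags = [lexical_translations.get(token, token) for token in tokens]
    return list(zip(tokens, tags))


# Hypothetical sentence; the two translations follow the Figure 1 example.
sentence = "a man walks on a trail in the woods"
translations = {"trail": "sentier", "woods": "bois"}
tagged = to_tagged_sequence(sentence, translations)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

A dictionary keyed by word type cannot distinguish two occurrences of the same ambiguous word within one sentence; a real pipeline would likely key on token positions instead.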
Our proposed model is a Multimodal Lexical Translation (MLT) model. It has the same architecture as the LT model except that the LSTM weights are initialized with the image features. 4 To avoid dimensionality mismatch, the image features (Section 2.3) undergo a dimensionality reduction via a fully connected layer, which is also trained.
Training: Both LT and MLT models are trained on only those sentences which have at least one ambiguous word as per MLTD. For optimization, we use the ADAM (Kingma and Ba, 2014) algorithm with a learning rate = 0.001 and batch size = 32. The LSTM hidden state dimensions and the word embedding dimensions are set to 300 and the dropout rate is set to 0.3. Training is stopped early if model accuracy over the validation set does not improve for 30 epochs. These models are implemented and trained in the TensorFlow framework.
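The early-stopping rule can be made concrete with a small sketch. It is framework-independent and illustrative only: the function name and the accuracy trace are invented, and the actual TensorFlow training loop is not reproduced here.

```python
def early_stopping_point(validation_accuracies, patience=30):
    """Return (best_epoch, epochs_run) for a per-epoch accuracy trace.

    Training stops once `patience` consecutive epochs have passed
    without improving on the best validation accuracy so far.
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(validation_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch + 1
    return best_epoch, len(validation_accuracies)


# Invented trace: accuracy peaks at epoch 1 and never improves again.
trace = [0.50, 0.60] + [0.55] * 40
best_epoch, epochs_run = early_stopping_point(trace, patience=30)
print(best_epoch, epochs_run)
```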
The performance of the models (Table 2), measured in terms of the percentage of correctly translated ambiguous words (accuracy), suggests that the image-aware MLT model is slightly better than the text-only LT and MFS models. The performance of the cross-lingual WSD models for the EN-CS language direction could not be evaluated because the EN-CS Multimodal Lexical Translation Dataset was noisy. The clean 'filtered by human' versions of the EN-CS MLTD test sets were not ready at the time of submitting this paper.

4 We tried a few other ways of using the image features, such as concatenating them to the word embeddings or using them as a separate word, but these did not result in any improvements.

Re-ranking
Our re-ranking strategy is depicted in Figure 2. First, given an English source sentence, the base model generates an n-best list of translation candidates, each with a likelihood score. The idea is to select the translation candidate in the n-best list which correctly disambiguates as many ambiguous words in the source sentence as possible. The source sentence in our example (Figure 2) contains two ambiguous words, trail and woods, as per the English-French MLTD. We use a cross-lingual WSD model (MFS, LT or MLT) to predict the lexical translations of these words (the correct ones being sentier and forêt respectively in this example). Next, we match these to the words in the translation candidates and add the number of matching words to the original score of each candidate. Then, the n-best translations are re-ranked using the new scores and the top candidate (which has the highest number of matches) is used in the evaluation.
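The scoring step above can be sketched as follows. This is an illustration under stated assumptions rather than the submitted implementation: the candidate sentences and their scores are invented, the scores are assumed to be log-likelihood-like values where higher is better, and matching is done on lowercased whitespace tokens.

```python
def rerank(nbest, predicted_translations):
    """Re-rank (hypothesis, score) pairs by the original score plus the
    number of predicted lexical translations found in the hypothesis."""
    def new_score(item):
        hypothesis, score = item
        tokens = set(hypothesis.lower().split())
        matches = sum(1 for word in predicted_translations if word in tokens)
        return score + matches

    return sorted(nbest, key=new_score, reverse=True)


# Invented 3-best list with made-up model scores.
nbest = [
    ("un homme marche sur un chemin dans les bois", -0.9),
    ("un homme marche sur un sentier dans la forêt", -1.2),
    ("un homme marche sur un sentier dans les bois", -1.0),
]
# Lexical translations predicted by a WSD model for "trail" and "woods".
predicted = ["sentier", "forêt"]
reranked = rerank(nbest, predicted)
print(reranked[0][0])
```

Because each match adds a full point, the match count dominates whenever the original scores of the candidates differ by less than one, which is the case for the made-up scores here.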

Task 1b systems
Three different approaches were explored in our submissions for Task 1b. The first approach follows the re-ranking experiments using MLT for Task 1. The second approach exploits consensus-based selection and the third explores data augmentation and n-best selection through classification. We try two different types of classifiers: Random Forest and Recurrent Neural Network.

Re-ranking using MLT
For the re-ranking approach, we first train three baseline EN-CS, DE-CS and FR-CS NMT models. Given a source sentence in the test set, we generate 10-best translation hypotheses using each of the three models. The three 10-best lists are concatenated to form a list of 30 translation hypotheses. We then use the trained EN-CS MLT model for cross-lingual WSD and perform re-ranking as described in Sections 3.1.2 and 3.1.3.

Consensus-based selection
For the consensus-based selection approach, we again use the three 10-best lists of translation hypotheses coming from the EN-CS, DE-CS and FR-CS systems. We then explore consensus between the different 10-best lists. The best hypothesis is selected according to the number of times it appears in the different lists. We follow the order of the EN-CS 10-best list: the highest ranked hypothesis in the EN-CS list with the majority of the votes (measured in terms of whether it occurs in the DE-CS and FR-CS 10-best lists) is selected.
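One possible reading of this selection rule is sketched below. It makes two assumptions that the text does not confirm: hypotheses are compared by exact string match, and "majority of the votes" is interpreted as the maximum number of votes, with ties resolved in favour of the higher EN-CS rank. The hypothesis strings are placeholders.

```python
def consensus_select(en_cs, de_cs, fr_cs):
    """Pick the highest-ranked EN-CS hypothesis with the most votes,
    where a vote is an occurrence in the DE-CS or FR-CS n-best list."""
    de_set, fr_set = set(de_cs), set(fr_cs)
    best, best_votes = en_cs[0], -1
    for hypothesis in en_cs:  # EN-CS order: earlier means higher rank
        votes = int(hypothesis in de_set) + int(hypothesis in fr_set)
        if votes > best_votes:  # strict: ties keep the higher-ranked one
            best, best_votes = hypothesis, votes
    return best


# Placeholder hypotheses standing in for Czech sentences.
en_cs = ["h1", "h2", "h3"]
de_cs = ["h2", "h3", "h9"]
fr_cs = ["h3", "h2", "h7"]
selected = consensus_select(en_cs, de_cs, fr_cs)
print(selected)  # h2
```

When no EN-CS hypothesis appears in the other lists, this sketch falls back to the EN-CS 1-best.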

Data augmentation
We explore data augmentation by creating systems that first translate source sentences from French, German and Czech into English. This leads to variants of the source data that translate into the same Czech sentence. The augmented data is used to train an NMT system to translate test source sentences from English into Czech. We then obtain a 10-best list for the training, development and test sets. For the selection approach, we compute METEOR scores for each of the hypotheses in the 10-best list of the training set. To treat this as a binary classification task, we set a threshold, defined empirically, such that the top four hypotheses are assumed to be the best translations and are chosen as positive samples, with the remaining six as negative samples. This labelled data is then used to train two types of classifiers:
• Random Forest (RF) classifier: we use the image vectors concatenated with sentence embeddings from source and target sentences as features for training the classifier. For extracting sentence embeddings, we use the approach of Arora et al. (2016). Pre-trained embeddings for English and Czech from MUSE (Conneau et al., 2018) are used. The RF algorithm in the scikit-learn framework (Pedregosa et al., 2011) is trained to distinguish between good and bad translations.
• RNN classifier: we use a simple RNN-based classifier where the last hidden state of the encoded sentence is concatenated with the image vector and used with a hinge loss to distinguish between good and bad translations.
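The construction of the classifier training labels can be sketched as follows. The sketch covers only the labelling step (top four hypotheses by METEOR are positive, the remaining six negative); feature extraction and classifier training are omitted, and the hypothesis names and scores are invented.

```python
def label_nbest(hypotheses, meteor_scores, num_positive=4):
    """Label the `num_positive` best-scoring hypotheses as 1 (good)
    and all remaining hypotheses as 0 (bad)."""
    ranked = sorted(range(len(hypotheses)),
                    key=lambda idx: meteor_scores[idx], reverse=True)
    positive = set(ranked[:num_positive])
    return [(hyp, int(idx in positive)) for idx, hyp in enumerate(hypotheses)]


# Invented 10-best list with made-up sentence-level METEOR scores.
hypotheses = [f"hyp{k}" for k in range(10)]
scores = [0.31, 0.52, 0.47, 0.29, 0.60, 0.33, 0.41, 0.55, 0.20, 0.38]
labelled = label_nbest(hypotheses, scores)
print([label for _, label in labelled])
```

Each labelled hypothesis would then be paired with its features (for the RF classifier, the image vector concatenated with source and target sentence embeddings) before training.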

Results
For both tasks, the initial evaluation was performed in terms of METEOR, BLEU (Papineni et al., 2002) and TER (Snover et al., 2006), with METEOR as the primary metric. Direct human assessments of translation adequacy will be used for the final evaluation by the task organizers.
For Task 1, our submitted systems consisted of: a) SHEF LT: re-ranking using the LT model; b) SHEF MLT: re-ranking using the MLT model; c) SHEF MFS: re-ranking using the MFS model; and d) SHEF Baseline: our baseline text-only ensemble NMT model. Table 3 shows the official evaluation results of our systems submitted to Task 1 and the baseline system provided by the organizers. For all language pairs, our systems outperform the official baseline for all metrics.
For EN-DE and EN-FR, the systems with LT and MLT are slightly better than the system with MFS. For EN-CS, however, the MFS system scores better than the LT and MLT variants. This is, perhaps, because the EN-CS MLTD (on which the LT and MLT models are trained) is noisy, as previously mentioned. The dataset has been extracted using the same procedure as in Lala and Specia (2018) except for the human filtering step, which is crucial for a clean dataset.
On further inspection, we observe that the cross-lingual WSD re-ranking affects only 127 to [...] (see Table 4). These changes usually involve only one or two words, so the re-ranking affects only 180 to 244 words in the entire test set (see Table 4). In other words, only 1.4% of the words in the entire test set are affected by the re-ranking, which may explain why the performance of all the systems is so similar. (We ignore EN-CS in this observation because the EN-CS MLTD is noisy and thus the trained cross-lingual WSD models are not reliable for this language pair.) It also suggests that automatic metrics like BLEU, METEOR and TER may not be sufficient to detect subtle changes in translation quality, making it difficult to draw insights from our re-ranking approaches. We hope that Direct Human Assessment and other more sensitive metrics will help us better understand the effects.
For Task 1b, we submitted four models: a) SHEF CON: a consensus-based model; b) SHEF MLT: a re-ranking approach using the MLT model; c) SHEF ARNN: a data augmentation and hypothesis selection approach using an RNN classifier; and d) SHEF ARF: a data augmentation and hypothesis selection approach using an RF classifier. Table 5 shows the automatic metric scores for our systems and the official baseline. Our systems outperform the baseline in terms of BLEU and METEOR. For TER, all systems are better than the baseline except for SHEF ARF. Our best performing system is SHEF CON.

Conclusions
We have described our submissions to the Multimodal Machine Translation shared task at WMT18. We explored novel multimodal n-best re-ranking approaches for Task 1, and consensus-based approaches for Task 1b using image information for re-ranking of an augmented n-best list with outputs from different translation models.
All our models perform better than the official baseline for all metrics and language pairs in Task 1. However, on this dataset and in the current setup, SHEF LT and SHEF MLT are not significantly different and their performance is nearly identical, which indicates that the image information does not contribute significantly to this task and that cross-lingual WSD is, perhaps, not very useful. On the other hand, it is worth emphasising that the corpora used may not contain many ambiguous words, in which case our model is not expected to be beneficial.
For Task 1b, our models also outperform the official baseline, with the best model being SHEF CON. As for Task 1, the use of image information does not lead to improvements when evaluated using the automatic metrics METEOR, BLEU and TER.