DCU System Report on the WMT 2017 Multi-modal Machine Translation Task

We report experiments with multi-modal neural machine translation models that incorporate global visual features in different parts of the encoder and decoder, and use the VGG19 network to extract features for all images. In our experiments, we explore both different strategies for including global image features and how ensembling different models at inference time impacts translations. Our submissions ranked 3rd best for translating from English into French, always improving considerably over a neural machine translation baseline across all language pairs evaluated, e


Introduction
In this paper we report on our application of three different multi-modal neural machine translation (NMT) systems to translate image descriptions. We use encoder-decoder attentive multi-modal NMT models where each training example consists of one source variable-length sequence, one image, and one target variable-length sequence, and a model is trained to translate sequences in the source language into corresponding sequences in the target language while taking the image into consideration. We use the three models introduced in Calixto et al. (2017b), which integrate global image features extracted using a pre-trained convolutional neural network into NMT (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state.
We are inspired by the recent success of multi-modal NMT models applied to the translation of image descriptions (Huang et al., 2016; Calixto et al., 2017a). Huang et al. (2016) incorporate global visual features into NMT with some success, and Calixto et al. (2017a) propose to use local visual features instead, achieving better results. We follow Calixto et al. (2017b) and investigate whether we can achieve better results while still using global visual features, which are considerably smaller and simpler to integrate than local features.
We expect that, by integrating visual information when translating image descriptions, we are able to exploit valuable information from both modalities when generating the target description, effectively grounding machine translation (Glenberg and Robertson, 2000).

Model Description
The models used in our experiments can be viewed as expansions of the attentive NMT framework introduced by Bahdanau et al. (2015) with the addition of a visual component that incorporates visual features from images. A bi-directional recurrent neural network (RNN) with gated recurrent units (GRU) (Cho et al., 2014) is used as the encoder. The final annotation vector for a given source position $i$ is the concatenation of the forward and backward RNN hidden states, $h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$.
We use the publicly available pre-trained convolutional neural network VGG19 of Simonyan and Zisserman (2014) to extract global image feature vectors for all images. These features are the 4096D activations of the penultimate fully-connected layer FC7, henceforth referred to as $q$.
We now describe the three multi-modal NMT models used in our experiments. For a detailed explanation of these models, see Calixto et al. (2017b).

IMG$_{2W}$: Image as source words
In model IMG$_{2W}$, the image features are used as the first and last words of the source sentence, and the source-language attention model learns when to attend to the image representations. Specifically, given the global image feature vector $q \in \mathbb{R}^{4096}$:
$$d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2, \qquad (1)$$
where $W_I^1 \in \mathbb{R}^{4096 \times 4096}$ and $W_I^2 \in \mathbb{R}^{4096 \times d_x}$ are image transformation matrices, $b_I^1 \in \mathbb{R}^{4096}$ and $b_I^2 \in \mathbb{R}^{d_x}$ are bias vectors, and $d_x$ is the dimensionality of the source word vector space, all trained with the model.
We directly use $d$ as the first and last words of the source sentence. In other words, given the word embeddings for a source sentence $X = (x_1, x_2, \cdots, x_N)$, we concatenate the transformed image vector $d$ to it, i.e. $X = (d, x_1, x_2, \cdots, x_N, d)$, and apply the forward and backward encoder RNN. By including images into the encoder in model IMG$_{2W}$, our intuition is that (i) by including the image as the first word, we propagate image features into the source sentence vector representations when applying the forward RNN to produce vectors $\overrightarrow{h}_i$, and (ii) by including the image as the last word, we propagate image features into the source sentence vector representations when applying the backward RNN to produce vectors $\overleftarrow{h}_i$.
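As a rough illustration of the construction above, the following NumPy sketch projects a global image vector and places it at both ends of the embedded source sentence. It assumes that the projection is a composition of two affine maps with the parameter shapes given in the text (applied in row-vector convention). The function names and the toy sizes (16 in place of 4096, 6 in place of $d_x$) are hypothetical and chosen only to keep the example light.

```python
import numpy as np

def project_image(q, W1, b1, W2, b2):
    """Apply two stacked affine maps to the global image vector q."""
    return (q @ W1 + b1) @ W2 + b2

def add_image_tokens(X, d):
    """Return (d, x_1, ..., x_N, d) as an array of shape (N + 2, d_x)."""
    return np.vstack([d, X, d])

rng = np.random.default_rng(0)
d_img, d_x, N = 16, 6, 5            # toy sizes, not the paper's 4096 and 620
q = rng.standard_normal(d_img)      # stands in for the FC7 feature vector
W1 = rng.normal(0.0, 0.01, (d_img, d_img))
b1 = np.zeros(d_img)
W2 = rng.normal(0.0, 0.01, (d_img, d_x))
b2 = np.zeros(d_x)
X = rng.standard_normal((N, d_x))   # stands in for the source word embeddings

d = project_image(q, W1, b1, W2, b2)
X_img = add_image_tokens(X, d)      # input to the forward and backward RNNs
```

The forward RNN reads `X_img` from the first row, so it sees the image before any word, while the backward RNN reads from the last row and also sees the image first.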

IMG$_E$: Image for encoder initialisation
In the original attention-based NMT model of Bahdanau et al. (2015), the hidden state of the encoder is initialised with the zero vector $\vec{0}$. Instead, we propose to use two new single-layer feed-forward neural networks to compute the initial states of the forward and the backward RNN, respectively.
Similarly to what we do for model IMG$_{2W}$ described in Section 2.1, given a global image feature vector $q \in \mathbb{R}^{4096}$, we compute a vector $d$ using Equation (1), only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the hidden states of the source-language encoder.
The feed-forward networks used to initialise the encoder hidden state are computed as in (2):
$$\overrightarrow{h}_{\text{init}} = \tanh(W_f d + b_f), \qquad \overleftarrow{h}_{\text{init}} = \tanh(W_b d + b_b), \qquad (2)$$
where $W_f$ and $W_b$ are multi-modal projection matrices that project the image features $d$ into the dimensionality of the encoder forward and backward hidden states, respectively, and $b_f$ and $b_b$ are bias vectors. $\overrightarrow{h}_{\text{init}}$ and $\overleftarrow{h}_{\text{init}}$ are directly used as the forward and backward RNN initial hidden states, respectively.
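The encoder initialisation can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the tanh squashing, the row-vector convention, the function name and the toy hidden size are choices made for this example rather than details confirmed by the text.

```python
import numpy as np

def encoder_init_states(d, W_f, b_f, W_b, b_b):
    """Map the projected image vector d to forward/backward initial states."""
    h_fwd_init = np.tanh(d @ W_f + b_f)   # tanh is an assumption of this sketch
    h_bwd_init = np.tanh(d @ W_b + b_b)
    return h_fwd_init, h_bwd_init

rng = np.random.default_rng(0)
d_hid = 4                                  # toy size; the paper uses 1024D states
d = rng.standard_normal(d_hid)             # image vector already projected to d_hid
W_f = rng.normal(0.0, 0.01, (d_hid, d_hid))
W_b = rng.normal(0.0, 0.01, (d_hid, d_hid))
b_f = np.zeros(d_hid)
b_b = np.zeros(d_hid)

h_fwd_init, h_bwd_init = encoder_init_states(d, W_f, b_f, W_b, b_b)
```

With all-zero projection matrices and biases, both states reduce to the zero vector, i.e. the text-only initialisation of Bahdanau et al. (2015).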

IMG$_D$: Image for decoder initialisation
To incorporate an image into the decoder, we introduce a new single-layer feed-forward neural network. Originally, the decoder initial hidden state is computed using a summary of the encoder hidden states. This is often the concatenation of the last hidden states of the encoder forward RNN and backward RNN, respectively $\overrightarrow{h}_N$ and $\overleftarrow{h}_1$, or the mean of the source-language annotation vectors $h_i$.
We propose to include the image features as additional input to initialise the decoder's hidden state, as described in (3):
$$s_0 = \tanh\big(W_{di} [\overleftarrow{h}_1 ; \overrightarrow{h}_N] + W_m d + b_{di}\big), \qquad (3)$$
where $s_0$ is the decoder initial hidden state, $W_m$ is a multi-modal projection matrix that projects the image features $d$ into the decoder hidden state dimensionality, and $W_{di}$ and $b_{di}$ are learned model parameters.
Once again we compute $d$ by applying Equation (1) to a global image feature vector $q \in \mathbb{R}^{4096}$, only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the decoder hidden states.
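A minimal NumPy sketch of the decoder initialisation follows. It assumes that the encoder summary is the concatenation of the last backward and forward states (one of the two options mentioned above) and that a tanh non-linearity is applied. The function name, the row-vector convention and the toy sizes are hypothetical.

```python
import numpy as np

def decoder_init_state(h_bwd_first, h_fwd_last, d, W_di, W_m, b_di):
    """Combine the encoder summary and the image vector d into s_0."""
    summary = np.concatenate([h_bwd_first, h_fwd_last])
    # tanh is an assumption of this sketch
    return np.tanh(summary @ W_di + d @ W_m + b_di)

rng = np.random.default_rng(0)
enc_dim, dec_dim = 4, 5                     # toy sizes
h_bwd_first = rng.standard_normal(enc_dim)  # last state of the backward RNN
h_fwd_last = rng.standard_normal(enc_dim)   # last state of the forward RNN
d = rng.standard_normal(dec_dim)            # image vector projected to dec_dim
W_di = rng.normal(0.0, 0.01, (2 * enc_dim, dec_dim))
W_m = rng.normal(0.0, 0.01, (dec_dim, dec_dim))
b_di = np.zeros(dec_dim)

s_0 = decoder_init_state(h_bwd_first, h_fwd_last, d, W_di, W_m, b_di)
```

Setting `W_m` to zero removes the image term and leaves a text-only initial state, which makes the contribution of the visual input easy to isolate.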

Experiments
We report results for Task 1, specifically when translating from English into German (en-de) and French (en-fr). We conducted experiments on the constrained version of the shared task, which means that the only training data we used is the data released by the shared task organisers, i.e. the translated Multi30k (M30k$_T$) data set (Elliott et al., 2016) with the additional French image descriptions, included for the 2017 run of the shared task.
Our encoder is a bi-directional RNN with GRU, consisting of one 1024D single-layer forward RNN and one 1024D single-layer backward RNN. Throughout, we parameterise our models using 620D source and target word embeddings, and both are trained jointly with our model. All non-recurrent matrices are initialised by sampling from a Gaussian distribution ($\mu = 0$, $\sigma = 0.01$), recurrent matrices are random orthogonal and bias vectors are all initialised to $\vec{0}$. We apply dropout (Srivastava et al., 2014) with a probability of 0.3 in source and target word embeddings, in the image features, in the encoder and decoder RNN inputs and recurrent connections, and before the readout operation in the decoder RNN. We follow Gal and Ghahramani (2016) and apply dropout to the encoder bi-directional RNN and decoder RNN using the same mask in all time steps.
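The initialisation scheme and the shared dropout mask can be illustrated as follows. This is a sketch only: drawing the orthogonal matrix through a QR decomposition and rescaling kept units by 1/(1 - p) (inverted dropout) are assumptions of this example, and all function names are hypothetical.

```python
import numpy as np

def gaussian_init(rng, shape, sigma=0.01):
    """Non-recurrent matrices: zero-mean Gaussian with a small std."""
    return rng.normal(0.0, sigma, size=shape)

def orthogonal_init(rng, n):
    """Recurrent matrices: random orthogonal, drawn here via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))      # sign fix; columns stay orthonormal

def shared_dropout_mask(rng, size, p=0.3):
    """Sample one mask per sequence (inverted-dropout scaling assumed)."""
    keep = 1.0 - p
    return rng.binomial(1, keep, size=size) / keep

def apply_mask_all_steps(seq, mask):
    """Reuse the same mask at every time step of a (T, size) sequence."""
    return seq * mask

rng = np.random.default_rng(0)
W_emb = gaussian_init(rng, (10, 8))     # toy sizes
W_rec = orthogonal_init(rng, 8)
mask = shared_dropout_mask(rng, 8, p=0.3)
states = np.ones((5, 8))                # five time steps of toy activations
dropped = apply_mask_all_steps(states, mask)
```

Because the mask is sampled once and broadcast over the time axis, the same units are dropped at every step, which is the property that distinguishes this scheme from sampling a fresh mask per step.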
The translated Multi30k training and validation sets contain 29K and 1,014 images respectively, each accompanied by a sentence triple: the original English sentence and its gold-standard translations into German and into French.
We use the scripts in the Moses SMT Toolkit (Koehn et al., 2007) to normalise, lowercase and tokenise the English, German and French descriptions, and we also convert space-separated tokens into subwords (Sennrich et al., 2016). The subword models are trained jointly for English-German descriptions and separately for English-French descriptions using the English-German and English-French WMT 2015 data (Bojar et al., 2015). English-German models have a final vocabulary of 74K English and 81K German subword tokens, and English-French models 82K English and 82K French subword tokens. Sentences in English, German or French that are longer than 80 tokens are discarded.
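The length filter described above can be sketched as below. The text does not say whether length is measured before or after subword segmentation, nor whether the whole sentence pair is dropped when only one side is too long. This sketch assumes whitespace-separated tokens and drops the pair, and its function names are hypothetical.

```python
MAX_LEN = 80  # maximum sentence length in tokens, as stated in the text

def keep_pair(src, tgt, max_len=MAX_LEN):
    """True if both sides have at most max_len whitespace-separated tokens."""
    return len(src.split()) <= max_len and len(tgt.split()) <= max_len

def filter_corpus(pairs, max_len=MAX_LEN):
    """Drop every (source, target) pair in which either side is too long."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_len)]

toy_pairs = [
    ("a man rides a bike", "ein mann fährt fahrrad"),
    (" ".join(["word"] * 81), "zu lang"),
]
kept = filter_corpus(toy_pairs)
```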
Finally, we use the 29K entries in the M30k$_T$ training set for training our models, and the 1,014 entries in the M30k$_T$ development set for model selection, stopping training early if the model stops improving BLEU scores on this development set. We evaluate our English-German models on three held-out test sets, the Multi30k 2016/2017 and the MSCOCO 2017 test sets, and our English-French models on the Multi30k 2017 and the MSCOCO 2017 test sets.
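Model selection with early stopping on development-set BLEU could look like the sketch below. The patience value and the BLEU scores are made up for illustration, since the text does not report how many evaluations without improvement are tolerated. The function name is hypothetical.

```python
def select_checkpoint(dev_bleu, patience=3):
    """Return (index of best checkpoint, index at which training stops).

    Training stops once `patience` consecutive evaluations fail to
    improve on the best development BLEU seen so far.
    """
    best_idx, best = -1, float("-inf")
    for i, score in enumerate(dev_bleu):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(dev_bleu) - 1

# Hypothetical development BLEU scores, one per evaluation.
scores = [30.1, 32.4, 33.0, 32.8, 32.9, 32.7, 34.0]
best_idx, stop_idx = select_checkpoint(scores, patience=3)
```

In this toy run the third checkpoint (index 2) is selected and training stops at index 5, so the later, better score at index 6 is never reached. This shows how strongly the patience setting can affect the selected model.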

Results
In Table 1, we show results when translating the Multi30k 2017 test sets. Models are trained on the original M30k$_T$ training data only. The NMT baseline is the attention-based NMT model of Bahdanau et al. (2015) and its results are the ones reported by the shared task organisers. When compared to other submissions of the multi-modal MT task under the constrained data regime, our models ranked sixth best when translating the English-German Multi30k 2017 test set, and fourth best when translating the English-German MSCOCO 2017 test set. When translating both the Multi30k 2017 and the MSCOCO 2017 English-French test sets, our models are ranked third best, scoring only 1-2 points (BLEU, METEOR) less than the best system.
In Table 2, we show results when translating the MSCOCO 2017 English-German and English-French test sets. Again, all models are trained on the original M30k$_T$ training data only. When compared to other submissions of the multi-modal MT task under the constrained data regime, our submission ranked fourth best for the English-German and third best for the English-French language pair, scoring only 1 to 1.5 points less than the best system. These are promising results, especially taking into consideration that we are using global image features, which are smaller and simpler than local features (used in Calixto et al. (2017a)).
Table 3: Results for the best model of Calixto et al. (2017a), which is pre-trained on the English-German WMT 2015 data (Bojar et al., 2015), consisting of ∼4.3M sentence pairs, and different combinations of multi-modal models, all trained on the original M30k$_T$ training data only, evaluated on the M30k$_T$ 2016 test set.
Ensemble decoding. We now report on how ensemble decoding can be used to improve multi-modal MT. In Table 3, we show results when translating the Multi30k 2016 test set. We ensembled different models by starting with one of the best-performing multi-modal models of Calixto et al. (2017b) on this data set, IMG$_D$, and adding new models to the ensemble one by one, until we reach a maximum of four independent models, all of which are trained separately and on the original M30k$_T$ training data only. We also report results for the best model of Calixto et al. (2017a), which is pre-trained on the English-German WMT 2015 data (Bojar et al., 2015) and uses local visual features extracted with the ResNet-50 network (He et al., 2015). We first note that adding more models to the ensemble seems to always improve translations by a large margin (∼3 BLEU/METEOR points). Adding model IMG$_{2W}$ to the ensemble already consisting of models IMG$_E$ and IMG$_D$ still improves translations, according to all metrics evaluated. This is an interesting result, since compared to these other two multi-modal models, model IMG$_{2W}$ performs the worst according to BLEU, METEOR and chrF3 (see Calixto et al. (2017b)). Our best results are obtained with an ensemble of four different multi-modal models.
By using an ensemble of four different multi-modal NMT models trained on the translated Multi30k training data, we were able to obtain translations comparable to or even better than those obtained with the strong multi-modal NMT model of Calixto et al. (2017a), which is pre-trained on large amounts of WMT data and uses local image features.
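The text does not spell out how the outputs of the individual models are combined at inference time. A common choice, assumed in the sketch below, is to average the next-token probability distributions of all models at each decoding step and let the search procedure operate on the averaged distribution. The function names and the toy logits are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_step(logits_per_model):
    """Average the next-token distributions of several models.

    logits_per_model: list of 1-D arrays, one per model, all over the
    same target vocabulary.
    """
    probs = np.stack([softmax(l) for l in logits_per_model])
    return probs.mean(axis=0)

# Toy logits from three models over a three-word vocabulary.
toy_logits = [
    np.array([10.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
]
p_ensemble = ensemble_step(toy_logits)
next_token = int(np.argmax(p_ensemble))
```

Averaging probabilities rather than taking a majority vote lets a single confident model outweigh several uncertain ones, as the toy example shows. Other combination rules, such as averaging log-probabilities, are equally plausible.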

Conclusions and Future work
In this work, we evaluated multi-modal NMT models which integrate global image features into both the encoder and the decoder. We experimented with ensembling different multi-modal NMT models introduced in Calixto et al. (2017b), and results show that these models can generate translations that compare favourably to multi-modal models that use local image features. We observe consistent improvements over a text-only NMT baseline trained on the same data, and these are typically very large (e.g., 7.0-9.2 METEOR points across language pairs and test sets). In future work we plan to study how to generalise these models to other multi-modal natural language processing tasks, e.g. visual question answering.