Incorporating Global Visual Features into Attention-Based Neural Machine Translation

We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. We utilise global image features extracted using a pre-trained convolutional neural network and incorporate them (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate how these different strategies to incorporate global image features compare and which ones perform best. We also study the impact that adding synthetic multi-modal, multilingual data brings and find that the additional data have a positive impact on multi-modal models. We report new state-of-the-art results and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k data set according to all metrics evaluated. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this data set.


Introduction
Neural Machine Translation (NMT) has recently been proposed as an instantiation of the sequence to sequence (seq2seq) learning problem (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b; Sutskever et al., 2014). In this problem, each training example consists of one source and one target variable-length sequence, with no prior information regarding the alignments between the two.
A model is trained to translate sequences in the source language into corresponding sequences in the target. This framework has been successfully used in many different tasks, such as handwritten text generation (Graves, 2013), image description generation (Hodosh et al., 2013; Kiros et al., 2014; Mao et al., 2014; Elliott et al., 2015), machine translation (Cho et al., 2014b; Sutskever et al., 2014) and video description generation.
Recently, there has been an increase in the number of natural language generation models that explicitly use attention-based decoders, i.e. decoders that model an intra-sequential mapping between source and target representations. For instance, Xu et al. (2015) proposed an attention-based model for the task of Image Description Generation (IDG) where the model learns to attend to specific parts of an image (the source) as it generates its description (the target). In MT, one can intuitively interpret this attention mechanism as inducing an alignment between source and target sentences, as first proposed by Bahdanau et al. (2015). The common idea is to explicitly frame a learning task in which the decoder learns to attend to the relevant parts of the source sequence when generating each part of the target sequence.
We are inspired by recent successes in using attention-based models in both IDG and NMT. Our main goal in this work is to propose end-to-end multi-modal NMT models which effectively incorporate visual features in different parts of the attention-based NMT framework. The main contributions of our work are:
• We propose novel attention-based multi-modal NMT models which incorporate visual features into the encoder and the decoder.
• We discuss the impact that adding synthetic multi-modal and multilingual data brings to multi-modal NMT.
• We show that images bring useful information to an NMT model and report state-of-the-art results.
One additional contribution of our work is that we corroborate previous findings by  that suggested that using image features directly as additional context to update the hidden state of the decoder (at each time step) prevents learning.
The remainder of this paper is structured as follows. In §1.1 we briefly discuss relevant previous work. We then review the attention-based NMT framework and further expand it into different multi-modal NMT models (§2). In §3 we introduce the data sets we use in our experiments. In §4 we detail the hyperparameters, parameter initialisation and other relevant details of our models. Finally, in §6 we draw conclusions and provide some avenues for future work.

Related work
Attention-based encoder-decoder models for MT have been actively investigated in recent years. Some researchers have studied how to improve attention mechanisms (Luong et al., 2015; Tu et al., 2016) and how to train attention-based models to translate between many languages (Dong et al., 2015; Firat et al., 2016).
However, multi-modal MT has only recently been addressed by the MT community in the form of a shared task. We note that in the official results of this first shared task no submissions based on a purely neural architecture could improve on the phrase-based SMT (PBSMT) baseline. Nevertheless, researchers have proposed to include global visual features in reranking n-best lists generated by a PBSMT system or directly in a purely NMT framework with some success (Caglayan et al., 2016; Calixto et al., 2016; Libovický et al., 2016; Shah et al., 2016). The best results achieved by a purely NMT model in this shared task are those of Huang et al. (2016), who proposed to use global and regional image features extracted with the VGG19 (Simonyan and Zisserman, 2014) and the RCNN (Girshick et al., 2014) convolutional neural networks (CNNs).
Similarly to one of the three models we propose, Huang et al. (2016) extract global features for an image, project these features into the vector space of the source words and then add them as a word in the input sequence. Their best model improves over a strong NMT baseline and is comparable to results obtained with a PBSMT model trained on the same data, although not significantly better. For that reason, their models are used as baselines in our experiments. Next, we point out some key differences between the work of Huang et al. (2016) and ours.
Architecture Their implementation is based on the attention-based model of Luong et al. (2015), which has some differences to that of Bahdanau et al. (2015), used in our work ( §2.1). Their encoder is a single-layer unidirectional LSTM and they use the last hidden state of the encoder to initialise the decoder's hidden state, therefore indirectly using the image features to do so. We use a bi-directional recurrent neural network (RNN) with GRU (Cho et al., 2014a) Huang et al. (2016) only use it as a word. We also show it is better to include an image exclusively for the encoder or the decoder initialisation (Tables 1 and 2).
Data Huang et al. (2016) use object detections obtained with the RCNN of Girshick et al. (2014) as additional data, whereas we study the impact that additional back-translated data brings.
Performance All our models outperform Huang et al. (2016)'s according to all metrics evaluated, even when they use additional object detections. If we use additional back-translated data, the difference becomes even larger.

Attention-based NMT
In this section, we briefly review the attention-based NMT framework (§2.1) and expand it into a multi-modal NMT framework (§2.2).

Text-only attention-based NMT
We follow the notation of Bahdanau et al. (2015) and Firat et al. (2016) throughout this section.
Given a source sequence $X = (x_1, x_2, \cdots, x_N)$ and its translation $Y = (y_1, y_2, \cdots, y_M)$, an NMT model aims at building a single neural network that translates $X$ into $Y$ by directly learning to model $p(Y \mid X)$. Each $x_i$ is a row index in a source lookup matrix $W_x \in \mathbb{R}^{|V_x| \times d_x}$ (the source word embeddings matrix) and each $y_j$ is an index in a target lookup matrix $W_y \in \mathbb{R}^{|V_y| \times d_y}$ (the target word embeddings matrix). $V_x$ and $V_y$ are source and target vocabularies and $d_x$ and $d_y$ are source and target word embeddings dimensionalities, respectively.
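A bidirectional RNN with GRU is used as the encoder. A forward RNN $\overrightarrow{\Phi}_{enc}$ reads $X$ word by word, from left to right, and generates a sequence of forward annotation vectors $(\overrightarrow{h}_1, \overrightarrow{h}_2, \cdots, \overrightarrow{h}_N)$. Similarly, a backward RNN $\overleftarrow{\Phi}_{enc}$ reads $X$ from right to left, word by word, and generates a sequence of backward annotation vectors $(\overleftarrow{h}_N, \overleftarrow{h}_{N-1}, \cdots, \overleftarrow{h}_1)$, as in (1):
$$\overrightarrow{h}_i = \overrightarrow{\Phi}_{enc}\left(W_x[x_i], \overrightarrow{h}_{i-1}\right), \qquad \overleftarrow{h}_i = \overleftarrow{\Phi}_{enc}\left(W_x[x_i], \overleftarrow{h}_{i+1}\right). \tag{1}$$
The final annotation vector for a given time step $i$ is the concatenation of forward and backward vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. In other words, each source sequence $X$ is encoded into a sequence of annotation vectors $h = (h_1, h_2, \cdots, h_N)$, which are in turn used by the decoder: essentially a neural language model (LM) (Bengio et al., 2003) conditioned on the previously emitted words and the source sentence via an attention mechanism.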
At each time step $t$ of the decoder, we compute a time-dependent context vector $c_t$ based on the annotation vectors $h$, the decoder's previous hidden state $s_{t-1}$ and the target word $\tilde{y}_{t-1}$ emitted by the decoder in the previous time step. (At training time, the correct previous target word $y_{t-1}$ is known and therefore used instead of $\tilde{y}_{t-1}$. At test or inference time, $y_{t-1}$ is not known and $\tilde{y}_{t-1}$ is used instead. Problems that may arise from this difference between training and inference distributions have been discussed in the literature.) We follow Bahdanau et al. (2015) and use a single-layer feed-forward network to compute an expected alignment $e_{t,i}$ between each source annotation vector $h_i$ and the target word to be emitted at the current time step $t$, as in (2):
$$e_{t,i} = (v_a)^{\top} \tanh\left(U_a s_{t-1} + W_a h_i\right), \tag{2}$$
where $v_a$, $U_a$ and $W_a$ are model parameters. In Equation (3), these expected alignments are normalised and converted into probabilities:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{N} \exp(e_{t,j})}, \tag{3}$$
where $\alpha_{t,i}$ are called the model's attention weights, which are in turn used in computing the time-dependent context vector $c_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$. Finally, the context vector $c_t$ is used in computing the decoder's hidden state $s_t$ for the current time step $t$, as shown in Equation (4):
$$s_t = \Phi_{dec}\left(s_{t-1}, W_y[\tilde{y}_{t-1}], c_t\right), \tag{4}$$
where $s_{t-1}$ is the decoder's previous hidden state, $W_y[\tilde{y}_{t-1}]$ is the embedding of the word emitted in the previous time step, and $c_t$ is the updated time-dependent context vector.
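The attention step described above can be sketched in a few lines of plain Python. This is only an illustration, assuming the additive alignment model of Bahdanau et al. (2015); the function names `matvec` and `attention` and the parameter names `U_a`, `W_a` and `v_a` are hypothetical and chosen for readability, not taken from our implementation.

```python
import math

def matvec(M, v):
    """Multiply a matrix M (list of rows) by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention(h, s_prev, U_a, W_a, v_a):
    """Return attention weights alpha and context vector c_t.

    h      : list of N annotation vectors
    s_prev : previous decoder hidden state s_{t-1}
    U_a, W_a, v_a : alignment parameters (hypothetical names)
    """
    Us = matvec(U_a, s_prev)
    scores = []
    for h_i in h:
        Wh = matvec(W_a, h_i)
        hidden = [math.tanh(a + b) for a, b in zip(Us, Wh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    # Softmax over source positions (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector: attention-weighted sum of the annotation vectors.
    dim = len(h[0])
    c_t = [sum(a * h_i[k] for a, h_i in zip(alpha, h)) for k in range(dim)]
    return alpha, c_t
```

The weights always sum to one, so the context vector is a convex combination of the annotation vectors regardless of how the alignment scores are parameterised.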
We use a single-layer feed-forward neural network to initialise the decoder's hidden state $s_0$ at time step $t = 0$ and feed it the concatenation of the last hidden states of the encoder's forward RNN ($\overrightarrow{\Phi}_{enc}$) and backward RNN ($\overleftarrow{\Phi}_{enc}$), as in (5):
$$s_0 = \tanh\left(W_{di}[\overrightarrow{h}_N; \overleftarrow{h}_1] + b_{di}\right), \tag{5}$$
where $W_{di}$ and $b_{di}$ are model parameters. Since RNNs normally better store information about recent inputs in comparison to more distant ones (Hochreiter and Schmidhuber, 1997; Bahdanau et al., 2015), we expect to initialise the decoder's hidden state with a strong source sentence representation, i.e. a representation with a strong focus on both the first and the last tokens in the source sentence.

Multi-modal NMT (MNMT)
Our models can be seen as expansions of the attention-based NMT framework described in §2 with the addition of a visual component to incorporate image features. Simonyan and Zisserman (2014) trained and evaluated an extensive set of deep Convolutional Neural Networks (CNNs) for classifying images into one out of the 1000 classes in ImageNet (Russakovsky et al., 2015). We use their 19-layer VGG network (VGG19) to extract image feature vectors for all images in our dataset. We feed an image to the pre-trained VGG19 network and use the 4096D activations of the penultimate fully-connected layer FC7 as our image feature vector, henceforth referred to as $q$.
We propose three different methods to incorporate images into the attentive NMT framework: using an image as words in the source sentence (§2.2.1), using an image to initialise the source language encoder (§2.2.2) and the target language decoder (§2.2.3).
Figure 1 panels: (a) An encoder bidirectional RNN that uses image features as words in the source sequence. (b) Using an image to initialise the encoder hidden states. (c) Image as additional data to initialise the decoder hidden state $s_0$.
We also evaluated a fourth mechanism to incorporate images into NMT, namely to use an image as one of the different contexts available to the decoder at each time step of the decoding process. We added the image features directly as an additional context, in addition to $W_y[\tilde{y}_{t-1}]$, $s_{t-1}$ and $c_t$, to compute the hidden state $s_t$ of the decoder at a given time step $t$. We corroborate previous findings that adding the image features in this way prevents the model from learning.

Images as source words: IMG W
One way we propose to incorporate images into the encoder is to project an image feature vector into the space of the words of the source sentence. We use the projected image as the first and/or last word of the source sentence and let the attention model learn when to attend to the image representation. Specifically, given the global image feature vector $q \in \mathbb{R}^{4096}$, we compute (6):
$$d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2, \tag{6}$$
where $W_I^1$ and $W_I^2$ are image transformation matrices, $b_I^1$ and $b_I^2$ are bias vectors (with $b_I^2 \in \mathbb{R}^{d_x}$), and $d_x$ is the source words vector space dimensionality, all trained with the model. We then directly use $d$ as words in the source words vector space: as the first word only (model IMG 1W ), and as the first and last words of the source sentence (model IMG 2W ).
An illustration of this idea is given in Figure 1a, where a source sentence that originally contained N tokens, after including the image as source words will contain N + 1 tokens (model IMG 1W ) or N + 2 tokens (model IMG 2W ). In model IMG 1W , the image is projected as the first source word only (solid line in Figure 1a); in model IMG 2W , it is projected into the source words space as both first and last words (both solid and dashed lines in Figure 1a).
Given a sequence $X = (x_1, x_2, \cdots, x_N)$ in the source language, we concatenate the transformed image vector $d$ to $W_x[X]$ and apply the forward and backward encoder RNN passes, generating hidden vectors as in Figure 1a. When computing the context vector $c_t$ (Equations (2) and (3)), we effectively make use of the transformed image vector, i.e. the attention weights $\alpha_{t,i}$ will use this information to attend or not to the image features.
By including images into the encoder in models IMG 1W and IMG 2W , our intuition is that (i) by including the image as the first word, we propagate image features into the source sentence vector representations when applying the forward RNN $\overrightarrow{\Phi}_{enc}$, and (ii) by including the image as the last word, we propagate image features into the source sentence vector representations when applying the backward RNN $\overleftarrow{\Phi}_{enc}$.
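To make the two variants concrete, the following sketch projects an image feature vector into the source word space and attaches it to a sequence of source word embeddings. It assumes that the projection of Equation (6) is a stack of two affine maps; the helper names `affine`, `project_image` and `add_image_words`, as well as the parameter names, are hypothetical and only serve the illustration.

```python
def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def project_image(q, W1, b1, W2, b2):
    """Map image features q to a pseudo-word d (assumed two affine maps)."""
    return affine(W2, b2, affine(W1, b1, q))

def add_image_words(embeddings, d, variant="1W"):
    """Attach the projected image d to the source word embeddings.

    variant "1W": image as first word only   (N + 1 tokens)
    variant "2W": image as first and last word (N + 2 tokens)
    """
    if variant == "1W":
        return [d] + embeddings
    if variant == "2W":
        return [d] + embeddings + [d]
    raise ValueError("variant must be '1W' or '2W'")
```

The encoder and the attention mechanism then treat the extra positions exactly like ordinary source tokens, which is what allows the attention weights to decide when to attend to the image.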

Images for encoder initialisation: IMG E
In the original attention-based NMT model described in §2, the hidden state of the encoder is initialised with the zero vector $\vec{0}$. Instead, we propose to use two new single-layer feed-forward neural networks to compute the initial states of the forward RNN $\overrightarrow{\Phi}_{enc}$ and the backward RNN $\overleftarrow{\Phi}_{enc}$, respectively, as illustrated in Figure 1b. Similarly to §2.2.1, given a global image feature vector $q \in \mathbb{R}^{4096}$, we compute a vector $d$ using Equation (6), only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the textual encoder hidden states.
The feed-forward networks used to initialise the encoder hidden state are computed as in (7):
$$\overleftarrow{h}_{init} = \tanh\left(W_b d + b_b\right), \qquad \overrightarrow{h}_{init} = \tanh\left(W_f d + b_f\right), \tag{7}$$
where $W_f$ and $W_b$ are multi-modal projection matrices that project the image features $d$ into the encoder forward and backward hidden states dimensionality, respectively, and $b_f$ and $b_b$ are bias vectors.
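A minimal sketch of this initialisation follows. It assumes a tanh non-linearity in both feed-forward networks, and the function names `affine` and `init_encoder_states` are hypothetical.

```python
import math

def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def init_encoder_states(d, W_f, b_f, W_b, b_b):
    """Return initial (forward, backward) encoder states from image vector d.

    The tanh activation is an assumption of this sketch; in the text-only
    model both states would simply be zero vectors.
    """
    h_fwd_init = [math.tanh(z) for z in affine(W_f, b_f, d)]
    h_bwd_init = [math.tanh(z) for z in affine(W_b, b_b, d)]
    return h_fwd_init, h_bwd_init
```

Because the two directions have separate projection matrices, the forward and backward RNNs can start from different views of the same image.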

Images for decoder initialisation: IMG D
To incorporate an image into the decoder, we introduce a new single-layer feed-forward neural network to be used instead of the one described in Equation (5). Originally, the decoder's initial hidden state was computed using the concatenation of the last hidden states of the encoder forward RNN ($\overrightarrow{\Phi}_{enc}$) and backward RNN ($\overleftarrow{\Phi}_{enc}$), respectively $\overrightarrow{h}_N$ and $\overleftarrow{h}_1$. Our proposal is that we include the image features as additional input to initialise the decoder hidden state at time step $t = 0$, as in (8):
$$s_0 = \tanh\left(W_{di}[\overrightarrow{h}_N; \overleftarrow{h}_1] + W_m d + b_{di}\right), \tag{8}$$
where $W_m$ is a multi-modal projection matrix that projects the image features $d$ into the decoder hidden state dimensionality and $W_{di}$ and $b_{di}$ are the same as in Equation (5).
Once again we compute $d$ by applying Equation (6) onto a global image feature vector $q \in \mathbb{R}^{4096}$, only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the decoder hidden states. We illustrate this idea in Figure 1c.
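The difference between the text-only and the multi-modal decoder initialisation can be sketched as below. The tanh activation and the order of the concatenation are assumptions of this sketch, and the names `affine` and `init_decoder_state` are hypothetical.

```python
import math

def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def init_decoder_state(h_fwd_last, h_bwd_first, d, W_di, b_di, W_m):
    """Initial decoder state from encoder states plus image vector d.

    h_fwd_last  : last forward encoder state
    h_bwd_first : last state of the backward pass (position 1)
    Setting W_m to zeros recovers the text-only initialisation.
    """
    textual = affine(W_di, b_di, h_fwd_last + h_bwd_first)  # list concat
    visual = affine(W_m, [0.0] * len(b_di), d)  # projection without bias
    return [math.tanh(t + v) for t, v in zip(textual, visual)]
```

The image enters only through the additive term, so the textual parameters keep the same role as in the baseline.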

Data set
Our multi-modal NMT models need bilingual sentences accompanied by one or more images as training data. The original Flickr30k data set contains 30K images and 5 English sentence descriptions for each image (Young et al., 2014). We use the translated and the comparable Multi30k datasets, henceforth referred to as M30k T and M30k C , respectively, which are multilingual expansions of the original Flickr30k.
For each of the 30K images in the Flickr30k, the M30k T has one of its English descriptions manually translated into German by a professional translator. Training, validation and test sets contain 29K, 1014 and 1K images, respectively, each accompanied by one sentence pair (the original English sentence and its German translation). For each of the 30K images in the Flickr30k, the M30k C has five descriptions in German collected independently of the English descriptions. Training, validation and test sets contain 29K, 1014 and 1K images, respectively, each accompanied by 5 English and 5 German sentences.
We use the scripts in Moses (Koehn et al., 2007) to normalise, truecase and tokenize English and German descriptions and we also convert space-separated tokens into subwords (Sennrich et al., 2016b). All models use a common vocabulary of ∼83K English and ∼91K German subword tokens. If sentences in English or German are longer than 80 tokens, they are discarded.
We use the entire M30k T training set for training, its validation set for model selection with BLEU, and its test set to evaluate our models. In order to study the impact that additional training data brings to the models, we use the baseline model described in §2 trained on the textual part of the M30k T data set (German→English and English→German) without the images to build back-translation models (Sennrich et al., 2016a). We back-translate the 145K German (English) descriptions in the M30k C into English (German) and include the triples (synthetic English description, German description, image) as additional training data when translating into German, and (synthetic German description, English description, image) when translating into English.
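The construction of the synthetic triples can be sketched as follows. The callable `back_translate` stands in for a trained text-only back-translation model and is purely hypothetical, as is the function name `build_synthetic_triples`; applying the 80-token length limit to the synthetic data is an assumption of this sketch.

```python
def build_synthetic_triples(descriptions, image_ids, back_translate, max_len=80):
    """Build (synthetic source, genuine target, image) training triples.

    descriptions   : genuine target-language descriptions
    image_ids      : identifier of the image each description belongs to
    back_translate : callable mapping a target sentence to a synthetic
                     source sentence (placeholder for a trained model)
    max_len        : sentences longer than this many tokens are dropped
    """
    triples = []
    for sentence, image_id in zip(descriptions, image_ids):
        synthetic = back_translate(sentence)
        too_long = (len(sentence.split()) > max_len
                    or len(synthetic.split()) > max_len)
        if not too_long:
            triples.append((synthetic, sentence, image_id))
    return triples
```

Only the source side of each triple is synthetic; the target side and the image are always genuine, so the decoder is still trained on human-written descriptions.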
We train models to translate from English into German and from German into English, and report evaluation of cased, tokenized sentences with punctuation.

Experimental setup
Our encoder is a bidirectional RNN with GRU (one 1024D single-layer forward RNN and one 1024D single-layer backward RNN). Source and target word embeddings are 620D each and both are trained jointly with our model. All non-recurrent matrices are initialised by sampling from a Gaussian (µ = 0, σ = 0.01), recurrent matrices are orthogonal and bias vectors are all initialised to zero. Our decoder RNN also uses GRU and is a neural LM (Bengio et al., 2003) conditioned on its previous emissions and the source sentence by means of the source attention mechanism. Image features are obtained by feeding images to the pre-trained VGG19 network of Simonyan and Zisserman (2014) and using the activations of the penultimate fully-connected layer FC7. We apply dropout with a probability of 0.2 in both source and target word embeddings and with a probability of 0.5 in the image features (in all MNMT models), in the encoder and decoder RNNs inputs and recurrent connections, and before the readout operation in the decoder RNN. We follow Gal and Ghahramani (2016) and apply dropout to the encoder bidirectional RNN and decoder RNN using the same mask in all time steps.
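The idea of reusing one dropout mask across all time steps can be illustrated with the short sketch below. The rescaling of kept units by 1 / (1 - p) (inverted dropout) is an assumption of this sketch, and the function names are hypothetical.

```python
import random

def sample_mask(dim, p_drop, rng):
    """Sample one dropout mask; kept units are rescaled by 1 / (1 - p_drop)."""
    keep = 1.0 - p_drop
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(dim)]

def apply_shared_mask(sequence, mask):
    """Apply the same mask to the vector at every time step of a sequence."""
    return [[x * m for x, m in zip(vec, mask)] for vec in sequence]

# Usage: one mask per sequence, reused at every time step.
rng = random.Random(0)
mask = sample_mask(dim=4, p_drop=0.5, rng=rng)
inputs = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]  # three time steps
dropped = apply_shared_mask(inputs, mask)
```

In contrast to sampling a fresh mask at each step, the same units are dropped throughout the sequence, so the recurrent state is not perturbed by different noise at every step.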
Our models are trained using stochastic gradient descent with Adadelta (Zeiler, 2012) and minibatches of size 40 for improved generalisation (Keskar et al., 2017), where each training instance consists of one English sentence, one German sentence and one image. We apply early stopping for model selection based on BLEU scores, so that if a model does not improve on the validation set for more than 20 epochs, training is halted.
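The early stopping criterion can be written down as a small function over the per-epoch validation scores. This is a sketch of the rule as stated above; the function name and the convention of returning 0-indexed epochs are choices made for the illustration.

```python
def early_stopping(bleu_per_epoch, patience=20):
    """Return (best_epoch, stop_epoch) for a list of validation BLEU scores.

    Training halts at the first epoch where more than `patience` epochs
    have passed without improving on the best score seen so far.
    stop_epoch is None if the criterion never triggers. Epochs are 0-indexed.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, bleu in enumerate(bleu_per_epoch):
        if bleu > best_score:
            best_score, best_epoch = bleu, epoch
        elif epoch - best_epoch > patience:
            return best_epoch, epoch
    return best_epoch, None
```

The model selected for evaluation is the one saved at `best_epoch`, not the one at the epoch where training halts.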
As our main baseline we train an attention-based NMT model (§2) in which only the textual part of M30k T is used for training. We also train a PBSMT model built with Moses on the same English→German (German→English) data, where the LM is a 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the German (English) of the M30k T dataset. We use minimum error rate training (Och, 2003) for tuning the model parameters for BLEU scores. Our third baseline (English→German) is the best comparable multi-modal model by Huang et al. (2016) and also their best model with additional object detections: respectively models m1 (image at head) and m3 in the authors' paper. Finally, our fourth baseline (German→English) is the best-performing model in the WMT'16 multi-modal MT shared task (Shah et al., 2016), henceforth PBSMT+. It uses image features as well as additional data from WordNet (Miller, 1995) to rerank n-best lists.
Table 1 (caption fragment): We highlight in parentheses the improvements brought by our models compared to the best corresponding text-only baseline score. Results differ significantly from PBSMT baseline (†) or NMT baseline (‡) with p = 0.05.
Footnote 5: We specifically compute character 6-gram F3 scores.

Results
The Multi30K dataset contains images and bilingual descriptions. Overall, it is a small dataset with a small vocabulary whose sentences have simple syntactic structures and not much ambiguity. This is reflected in the fact that even the simplest baselines perform fairly well on it: the lowest BLEU score when translating into German is 32.9, which is still a reasonably good result.
Table 2: BLEU4, METEOR, TER and chrF3 scores on the M30k T test set for models trained on original and additional back-translated data. Best text-only baselines are underscored and best overall results in bold. We highlight in parentheses the improvements brought by our models compared to the best baseline score. Results differ significantly from PBSMT baseline (†) or NMT baseline (‡) with p = 0.05. We also show the improvements each model yields in each metric when only trained on the original M30k T training set vs. also including additional back-translated data. PBSMT+ is the best model in the multi-modal MT shared task.

Multi30k
In Table 1, we show results for translating from English→German and German→English. When translating into German, our multi-modal models perform well, with models IMG E and IMG D improving on both baselines according to all metrics analysed. We also note that all models but IMG 2W+D perform consistently better than the strong multi-modal NMT baseline of Huang et al. (2016), even when this model has access to more data (+RCNN features). Combining image features in the encoder and the decoder at the same time does not seem to improve results compared to using the image features in only the encoder or the decoder when translating into German. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model in all metrics on this data set. When translating into English, all multi-modal models significantly improve over the NMT baseline, with the only exception being model IMG 2W 's BLEU scores. In this scenario, model IMG E+D is the best performing one according to all but one metric. However, differences between multi-modal models are not statistically significant, i.e. all multi-modal models but IMG 2W and IMG 2W+D perform comparably.
Additional back-translated data Arguably, the main downside of applying multi-modal NMT in a real-world scenario is the small amount of publicly available training data (∼30K entries). For that reason, we back-translated the German and English sentences in the M30k C and created two sets of 145K synthetic triples, one for each translation direction, as described in §3.
In Table 2, we present results for some of the models evaluated in Table 1 but when also trained on the additional data. In order to add more data to our PBSMT baseline, we simply added the German sentences in the M30k C to train the LM. (Adding the synthetic sentence pairs to train the baseline PBSMT model, as we did with all neural MT models, deteriorated the results.) We also include results for PBSMT+, which uses image features as well as additional features extracted using WordNet (Shah et al., 2016). When translating into German, both our models IMG E and IMG D that use global image features to initialise the encoder and the decoder, respectively, significantly improve according to BLEU, METEOR and TER with the additional back-translated data, and also achieved better chrF3 scores. When translating into English, IMG E is the only model to significantly improve over both baselines according to all metrics with the additional back-translated data, also improving chrF3 scores. We highlight that our best-performing model IMG E significantly outperforms PBSMT+ according to BLEU and TER, and all our other multi-modal models perform comparably to it. This is a noteworthy finding, since PBSMT+ uses image features as well as additional data from WordNet and, to the best of our knowledge, is the best published model in this language pair and data set to date.
Ensemble decoding We now report on how ensemble decoding can be used to improve translations obtained with our multi-modal NMT models. In order to do that, we use different combinations of models trained on the original M30k T training set to translate from English into German. We built ensembles of different models by starting with our best performing multi-modal model on this language pair and data set, IMG D , and by adding new models to the ensemble one by one, until we reach a maximum of four independent models, all of which are trained separately and on the original M30k T training data only. In Table 3, we show results when translating the M30k T 's test set. These models were also evaluated in our recent participation in the WMT 2017 multi-modal MT shared task (Calixto et al., 2017a). We first note that adding more models to the ensemble seems to always improve translations, and by a considerable margin (∼3 BLEU/METEOR points). Adding model IMG 2W to the ensemble already consisting of models IMG E and IMG D improves translations according to all metrics evaluated. This is an interesting result, since compared to these other two multi-modal models, model IMG 2W performs poorly according to BLEU, METEOR and chrF3. Regardless of that fact, our best results are obtained with an ensemble of four different multi-modal models, including model IMG 2W .
By using an ensemble of four different multi-modal NMT models trained on the translated Multi30k training data, we were able to obtain translations comparable to or even better than those obtained with the strong multi-modal NMT model of Calixto et al. (2017b), which is pre-trained on large amounts of English-German data and uses local image features. Finally, we have recently participated in the WMT 2017 multi-modal MT shared task, and our system submissions ranked among the best performing systems under the constrained data regime (Calixto et al., 2017a). We note that our models performed particularly well on the ambiguous MSCOCO test set (Elliott et al., 2017), which indicates their ability to use the image information in disambiguating difficult source sentences into their correct translations.
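One common way to realise ensemble decoding is to average the next-word probability distributions of the individual models at each decoding step. The sketch below illustrates that scheme; uniform averaging is an assumption of this illustration and not necessarily the combination rule used in our experiments, and the function name is hypothetical.

```python
def ensemble_next_word(distributions):
    """Average per-model next-word distributions and pick the best word.

    distributions : one probability vector over the target vocabulary per
                    model, all for the same decoding step
    Returns (index of the most probable word, averaged distribution).
    """
    n_models = len(distributions)
    vocab_size = len(distributions[0])
    averaged = [sum(p[k] for p in distributions) / n_models
                for k in range(vocab_size)]
    best = max(range(vocab_size), key=averaged.__getitem__)
    return best, averaged
```

Under such a scheme, a model that is weaker on its own can still help the ensemble when its errors differ from those of the other models.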

Error Analysis
In Table 4 we show translations into German generated by different models for one entry of the M30k test set. In this example, the last three multi-modal models extrapolate the reference+image and describe "ceremony" as a "wedding ceremony" (IMG 2W ) and as an "Olympics ceremony" (IMG E and IMG D ). This could be due to the fact that the training set is small, depicts a small variation of different scenes and contains different forms of biases (van Miltenburg, 2015).
In Table 5, we draw attention to an example where some models generate what seem to be novel visual terms. Neither the source German sentence nor the English reference translation contained the translated units "having fun" or "Mexican restaurant", although both could have been inferred at least partially from the image. In this example, the visual term "having fun" is also generated by the baseline NMT model, making it clear that at times what seems like a translation extracted exclusively from the image may have been learnt from the training text data. However, neither of the two baselines translated "Mexikanischen Setting" as "Mexican restaurant", but four out of the five multi-modal models did. The multi-modal models also had problems translating the German "trinkt Shots" (drinking shots). We observe translations such as "having drinks" (IMG 2W ), which although not a novel translation is still correct, but also "drinking apples" (IMG E ), "drinking food" (IMG D ), and "drinking dishes" (IMG E+D ), which are clearly incorrect.
Table 4 (fragment): src. a woman with long hair is at a graduation ceremony . ref. eine

Conclusions and future work
In this work, we introduced models that incorporate images into state-of-the-art attention-based NMT, by using images as words in the source sentence, to initialise the encoder's hidden state and as additional data in the initialisation of the decoder's hidden state. The intuition behind our effort is to use images to visually ground translations, and consequently increase translation quality. We demonstrate with extensive experiments that adding global image features into NMT significantly improves the translations of image descriptions compared to text-only NMT and PBSMT. It also improves significantly on the previous state-of-the-art model of Huang et al. (2016) (English→German), and performs comparably to the best published results of Shah et al. (2016) (German→English). Overall, we note that using images as words in the source sequence (IMG 1W , IMG 2W ), an idea similarly entertained by Huang et al. (2016), does not fare as well as directly incorporating the image either in the encoder or the decoder (IMG E and IMG D ), independently of the target language. The fact that multi-modal NMT models can benefit from back-translated data is also an interesting finding.
In future work, we will conduct a more systematic study on the impact that synthetic back-translated data brings to multi-modal NMT, and run an error analysis to identify what particular types of errors our models make (and prevent).