Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.


Introduction
Neural Machine Translation (NMT) has been successfully tackled as a sequence-to-sequence learning problem (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b; Sutskever et al., 2014) where each training example consists of one source and one target variable-length sequence, with no prior information on the alignment between the two.
In the context of NMT, Bahdanau et al. (2015) first proposed to use an attention mechanism in the decoder, which is trained to attend to the relevant source-language words as it generates each word of the target sentence. Similarly, Xu et al. (2015) proposed an attention-based model for the task of image description generation (IDG) where a model learns to attend to specific parts of an image representation (the source) as it generates its description (the target) in natural language.
We are inspired by recent successes in applying attention-based models to NMT and IDG. In this work, we propose an end-to-end attention-based multi-modal neural machine translation (MNMT) model which effectively incorporates two independent attention mechanisms, one over source-language words and the other over different areas of an image.
Our main contributions are:
• We propose a novel attention-based MNMT model which incorporates spatial visual features in a separate visual attention mechanism;
• We use a medium-sized, back-translated multi-modal in-domain data set and large general-domain text-only MT corpora to pre-train our models and show that our MNMT model can efficiently exploit them;
• We show that images bring useful information into an NMT model in situations in which sentences describe objects illustrated in the image.
To the best of our knowledge, previous MNMT models in the literature that utilised spatial visual features did not significantly improve over a comparable model that used global visual features or even only textual features (Caglayan et al., 2016; Calixto et al., 2016; Huang et al., 2016; Libovický et al., 2016). In this work, we wish to address this issue and propose an MNMT model that uses, in addition to an attention mechanism over the source-language words, an additional visual attention mechanism to incorporate spatial visual features, and still improves on simpler text-only and multi-modal attention-based NMT models.
The remainder of this paper is structured as follows. We first briefly revisit the attention-based NMT framework (§2) and expand it into an MNMT framework (§3). In §4, we introduce the datasets we use to train and evaluate our models, and in §5 we discuss our experimental setup and analyse our results. Finally, in §6 we discuss relevant previous work and in §7 we draw conclusions and provide some avenues for future work.

Attention-based NMT
We describe the attention-based NMT model introduced by Bahdanau et al. (2015) in this section. Given a source sequence $X = (x_1, x_2, \cdots, x_N)$ and its translation $Y = (y_1, y_2, \cdots, y_M)$, an NMT model aims to build a single neural network that translates $X$ into $Y$ by directly learning to model $p(Y \mid X)$. The entire network consists of one encoder and one decoder with one attention mechanism, typically implemented as two Recurrent Neural Networks (RNN) and one multilayer perceptron, respectively. Each $x_i$ is a row index in a source lookup or word embedding matrix $E_x \in \mathbb{R}^{|V_x| \times d_x}$, and each $y_j$ is a row index in a target lookup or word embedding matrix $E_y \in \mathbb{R}^{|V_y| \times d_y}$, where $V_x$ and $V_y$ are the source and target vocabularies, and $d_x$ and $d_y$ are the source and target word embedding dimensionalities, respectively.
The encoder is a bi-directional RNN with GRU (Cho et al., 2014a), where a forward RNN $\overrightarrow{\Phi}_{enc}$ reads $X$ word by word, from left to right, and generates a sequence of forward annotation vectors $(\overrightarrow{h}_1, \overrightarrow{h}_2, \cdots, \overrightarrow{h}_N)$. Similarly, a backward RNN $\overleftarrow{\Phi}_{enc}$ reads $X$ from right to left, word by word, and generates a sequence of backward annotation vectors $(\overleftarrow{h}_N, \overleftarrow{h}_{N-1}, \cdots, \overleftarrow{h}_1)$. The final annotation vector for each source position is the concatenation of forward and backward vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, and $C = (h_1, h_2, \cdots, h_N)$ denotes the source annotation vectors. These annotation vectors are in turn used by the decoder, which is essentially a neural language model (LM) (Bengio et al., 2003) conditioned on the previously emitted words and the source sentence via an attention mechanism. A multilayer perceptron is used to initialise the decoder's hidden state $s_0$ at time step $t = 0$, where the input to this network is the concatenation of the last forward and backward vectors $[\overrightarrow{h}_N; \overleftarrow{h}_1]$. At each time step $t$ of the decoder, a time-dependent source context vector $c_t$ is computed based on the annotation vectors $C$ and the decoder's previous hidden state $s_{t-1}$. This is part of the formulation of the conditional GRU and is described further in §2.2. In other words, the encoder is a bi-directional RNN with GRU and the decoder is an RNN with a conditional GRU.
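As a minimal sketch of how the annotation vectors could be assembled, the snippet below concatenates forward and backward encoder states per source position. The function name `build_annotations`, the toy state values and the use of plain Python lists are illustrative assumptions, not part of the model description; the point is only that the backward states, produced in right-to-left reading order, must be re-aligned before concatenation.

```python
def build_annotations(forward_states, backward_states_in_reading_order):
    """Concatenate forward and backward encoder states per source position.

    forward_states[i] is the forward state for source position i + 1.
    backward_states_in_reading_order[k] is the k-th state emitted by the
    backward RNN, which reads the sentence from right to left.
    """
    if len(forward_states) != len(backward_states_in_reading_order):
        raise ValueError("both directions must cover the same source length")
    # Re-align backward states so that index i refers to source position i + 1.
    backward_states = list(reversed(backward_states_in_reading_order))
    return [fwd + bwd for fwd, bwd in zip(forward_states, backward_states)]


# Toy example (hypothetical values): N = 3 words, 2-dimensional states.
forward = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward = [[3.0, 3.1], [2.0, 2.1], [1.0, 1.1]]  # emitted for x_3, x_2, x_1

C = build_annotations(forward, backward)

# Input to the network that initialises the decoder state:
# last forward state (position N) and last backward state (position 1).
decoder_init_input = forward[-1] + backward[-1]
```

Each entry of `C` has twice the dimensionality of a single-direction state, which is why a 1024D forward and a 1024D backward RNN would yield 2048D annotation vectors.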
Given a hidden state $s_t$, the probabilities for the next target word are computed using one projection layer followed by a softmax layer as described in eq. (1), where the matrices $L_o$, $L_s$, $L_w$ and $L_c$ are transformation matrices and $c_t$ is a time-dependent source context vector generated by the conditional GRU:
$$p(y_t = k \mid y_{<t}, c_t) = \mathrm{Softmax}(L_o \tanh(L_s s_t + L_w E_y[\hat{y}_{t-1}] + L_c c_t)). \tag{1}$$

Conditional GRU
The conditional GRU has three main components computed at each time step $t$ of the decoder:
• REC$_1$ computes a hidden state proposal $s'_t$ based on the previous hidden state $s_{t-1}$ and the previously emitted word $\hat{y}_{t-1}$;
• ATT$_{src}$ is an attention mechanism over the hidden states of the source-language RNN and computes $c_t$ using all source annotation vectors $C$ and the hidden state proposal $s'_t$;
• REC$_2$ computes the final hidden state $s_t$ using the hidden state proposal $s'_t$ and the time-dependent source context vector $c_t$.
We use the conditional GRU in our text-only attention-based NMT model. First, a single-layer feed-forward network is used to compute an expected alignment $e^{src}_{t,i}$ between each source annotation vector $h_i$ and the target word $\hat{y}_t$ to be emitted at the current time step $t$, as shown in Equations (2) and (3):
$$e^{src}_{t,i} = (v^{src}_a)^{\top} \tanh(U^{src}_a s'_t + W^{src}_a h_i), \tag{2}$$
$$\alpha^{src}_{t,i} = \frac{\exp(e^{src}_{t,i})}{\sum_{j=1}^{N} \exp(e^{src}_{t,j})}, \tag{3}$$
where $\alpha^{src}_{t,i}$ is the normalised alignment matrix between each source annotation vector $h_i$ and the word $\hat{y}_t$ to be emitted at time step $t$, and $v^{src}_a$, $U^{src}_a$ and $W^{src}_a$ are model parameters. Finally, a time-dependent source context vector $c_t$ is computed as a weighted sum over the source annotation vectors, where each vector is weighted by the attention weight $\alpha^{src}_{t,i}$, as in eq. (4):
$$c_t = \sum_{i=1}^{N} \alpha^{src}_{t,i} h_i. \tag{4}$$
Figure 1: A doubly-attentive decoder learns to attend to image patches and source-language words independently when generating translations.

Multi-modal NMT
Our MNMT model can be seen as an expansion of the attention-based NMT framework described in §2.1 with the addition of a visual component to incorporate spatial visual features, and is comparable to the model evaluated by Calixto et al. (2016). We use publicly available pre-trained CNNs for image feature extraction. Specifically, we extract spatial image features for all images in our dataset using the pre-trained 50-layer Residual network (ResNet-50). These spatial features are the activations of the res4f layer, which can be seen as encoding an image in a 14×14 grid, where each of the entries in the grid is represented by a 1024D feature vector that only encodes information about that specific region of the image. We vectorise this 3-tensor into a 196×1024 matrix $A = (a_1, a_2, \cdots, a_L)$, $a_l \in \mathbb{R}^{1024}$, where each of the $L = 196$ rows is a 1024D feature vector that represents one cell of the image grid.
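The vectorisation of the 14×14×1024 feature map into a 196×1024 matrix can be illustrated with a short sketch. It assumes a height × width × channels layout stored as nested Python lists and a row-major ordering of grid cells; the actual memory layout produced by a given CNN framework may differ, and `vectorise_feature_map` is a hypothetical helper name.

```python
def vectorise_feature_map(feature_map):
    """Flatten an H x W grid of D-dimensional vectors into H*W rows.

    Row l of the result corresponds to grid cell (l // W, l % W),
    i.e. cells are visited in row-major order (an assumption).
    """
    return [cell for row in feature_map for cell in row]


H, W, D = 14, 14, 1024

# Placeholder activations (zeros); real values would come from the CNN.
feature_map = [[[0.0] * D for _ in range(W)] for _ in range(H)]

A = vectorise_feature_map(feature_map)
L = len(A)  # number of image annotation vectors, 14 * 14
```

Under these assumptions, every row of `A` is one image annotation vector $a_l$ that the visual attention can weight independently.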

NMT$_{SRC+IMG}$: decoder with two independent attention mechanisms
Model NMT$_{SRC+IMG}$ integrates two separate attention mechanisms over the source-language words and visual features in a single decoder RNN. Our doubly-attentive decoder RNN is conditioned on the previous hidden state of the decoder and the previously emitted word, as well as the source sentence and the image via two independent attention mechanisms, as illustrated in Figure 1. We implement this idea by expanding the conditional GRU described in §2.2 into a doubly-conditional GRU. To that end, in addition to the source-language attention, we introduce a new attention mechanism ATT$_{img}$ to the original conditional GRU proposal. This visual attention computes a time-dependent image context vector $i_t$ given a hidden state proposal $s'_t$ and the image annotation vectors $A = (a_1, a_2, \cdots, a_L)$ using "soft" attention (Xu et al., 2015). This attention mechanism is very similar to the source-language attention with the addition of a gating scalar, explained further below. First, a single-layer feed-forward network is used to compute an expected alignment $e^{img}_{t,l}$ between each image annotation vector $a_l$ and the target word to be emitted at the current time step $t$, as in eqs. (6) and (7):
$$e^{img}_{t,l} = (v^{img}_a)^{\top} \tanh(U^{img}_a s'_t + W^{img}_a a_l), \tag{6}$$
$$\alpha^{img}_{t,l} = \frac{\exp(e^{img}_{t,l})}{\sum_{j=1}^{L} \exp(e^{img}_{t,j})}, \tag{7}$$
where $\alpha^{img}_{t,l}$ is the normalised alignment matrix between all the image patches $a_l$ and the target word to be emitted at time step $t$, and $v^{img}_a$, $U^{img}_a$ and $W^{img}_a$ are model parameters. Note that Equations (2) and (3), which compute the expected source alignment $e^{src}_{t,i}$ and the weight matrices $\alpha^{src}_{t,i}$, and eqs. (6) and (7), which compute the expected image alignment $e^{img}_{t,l}$ and the weight matrices $\alpha^{img}_{t,l}$, compute similar statistics over the source and image annotations, respectively.
In eq. (8) we compute $\beta_t \in [0, 1]$, a gating scalar used to weight the expected importance of the image context vector in relation to the next target word at time step $t$:
$$\beta_t = \sigma(W_\beta s_{t-1} + b_\beta), \tag{8}$$
where $W_\beta$, $b_\beta$ are model parameters. It is in turn used to compute the time-dependent image context vector $i_t$ for the current decoder time step $t$, as in eq. (9):
$$i_t = \beta_t \sum_{l=1}^{L} \alpha^{img}_{t,l} a_l. \tag{9}$$
The only difference between Equations (4) (source context vector) and (9) (image context vector) is that the latter uses a gating scalar, whereas the former does not. We use $\beta$ following Xu et al. (2015), who empirically found it to improve the variability of the image descriptions generated with their model. Finally, we use the time-dependent image context vector $i_t$ as an additional input to a modified version of REC$_2$ (§2.2), which now computes the final hidden state $s_t$ using the hidden state proposal $s'_t$ and the time-dependent source and image context vectors $c_t$ and $i_t$, as in (10):
$$s_t = \mathrm{REC}_2(s'_t, c_t, i_t). \tag{10}$$
In Equation (5), the probabilities for the next target word are computed using the new multi-modal hidden state $s_t$, the previously emitted word $\hat{y}_{t-1}$, and the two context vectors $c_t$ and $i_t$, where $L_o$, $L_s$, $L_w$, $L_{cs}$ and $L_{ci}$ are projection matrices trained with the model:
$$p(y_t = k \mid y_{<t}, c_t, i_t) = \mathrm{Softmax}(L_o \tanh(L_s s_t + L_w E_y[\hat{y}_{t-1}] + L_{cs} c_t + L_{ci} i_t)). \tag{5}$$
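The following is a minimal pure-Python sketch of one decoder step of the two attention mechanisms, written to mirror Equations (2) to (4) and (6) to (9). It is not the authors' implementation: the helper names, the toy dimensions and parameter values are hypothetical, and feeding the previous decoder state into the gate is an assumption made here in line with Xu et al. (2015).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(annotations, state_proposal, v_a, U_a, W_a, gate=1.0):
    """Return (context, weights) for one decoder time step.

    score_j = v_a . tanh(U_a s'_t + W_a annotation_j), weights = softmax(scores),
    context = gate * sum_j weights_j * annotation_j.
    With gate=1.0 this is the source attention; the image attention
    passes the gating scalar beta_t instead.
    """
    projected_state = matvec(U_a, state_proposal)
    scores = []
    for annotation in annotations:
        projected_annotation = matvec(W_a, annotation)
        hidden = [math.tanh(p + q)
                  for p, q in zip(projected_state, projected_annotation)]
        scores.append(sum(v * h for v, h in zip(v_a, hidden)))
    weights = softmax(scores)
    context = [
        gate * sum(w * annotation[d] for w, annotation in zip(weights, annotations))
        for d in range(len(annotations[0]))
    ]
    return context, weights


def image_gate(W_beta, b_beta, previous_state):
    """Gating scalar in [0, 1]: sigmoid of an affine map of the decoder state."""
    pre_activation = sum(w * s for w, s in zip(W_beta, previous_state)) + b_beta
    return 1.0 / (1.0 + math.exp(-pre_activation))


# Toy setup (all values hypothetical): 2-D states and annotations.
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
s_prev = [0.0, 0.0]                        # previous decoder state
s_prop = [0.5, -0.5]                       # hidden state proposal
C = [[1.0, 0.0], [0.0, 1.0]]               # source annotation vectors
A = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]   # image annotation vectors

c_t, alpha_src = soft_attention(C, s_prop, v, identity, identity)
beta_t = image_gate([1.0, 1.0], 0.0, s_prev)
i_t, alpha_img = soft_attention(A, s_prop, v, identity, identity, gate=beta_t)
```

The two calls share one function on purpose: the source and image attentions differ only in their parameters, their annotation sets and the gate, so a small `beta_t` shrinks the image context towards zero while leaving the source context untouched.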

Data
The Flickr30k data set contains 30k images and 5 descriptions in English for each image (Young et al., 2014). In this work, we use the Multi30k dataset, which consists of two multilingual expansions of the original Flickr30k: one with translated data and another one with comparable data, henceforth referred to as M30k$_T$ and M30k$_C$, respectively.
For each of the 30k images in the Flickr30k, the M30k$_T$ has one of the English descriptions manually translated into German by a professional translator. Training, validation and test sets contain 29k, 1,014 and 1k images respectively, each accompanied by a sentence pair (the original English sentence and its translation into German). For each of the 30k images in the Flickr30k, the M30k$_C$ has five descriptions in German collected independently from the English descriptions. Training, validation and test sets contain 29k, 1,014 and 1k images respectively, each accompanied by five sentences in English and five sentences in German.
We use the entire M30k$_T$ training set for training our MNMT models, its validation set for model selection with BLEU (Papineni et al., 2002), and its test set for evaluation. In addition, since the amount of training data available is small, we build a back-translation model using the text-only NMT model described in §2.1 trained on the M30k$_T$ data set (German→English), without images. We use this model to back-translate the 145k German descriptions in the M30k$_C$ training set (29k images with five descriptions each) into English and include the triples (synthetic English description, German description, image) as additional training data (Sennrich et al., 2016a).
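The construction of the synthetic triples can be sketched as follows. The `back_translate` callable stands in for a trained German→English model; here it is replaced by a toy word-lookup stub, and all names and example data are hypothetical.

```python
def build_synthetic_triples(german_descriptions, back_translate):
    """Build (synthetic English, German, image id) training triples.

    german_descriptions: iterable of (image_id, german_sentence) pairs.
    back_translate: callable mapping a German sentence to English.
    """
    triples = []
    for image_id, german in german_descriptions:
        synthetic_english = back_translate(german)
        triples.append((synthetic_english, german, image_id))
    return triples


def toy_back_translate(german):
    """Stand-in for a German->English NMT model (illustration only)."""
    lexicon = {"ein": "a", "mann": "man", "mit": "with", "hut": "hat"}
    return " ".join(lexicon.get(token.lower(), token) for token in german.split())


comparable = [("img_001", "Ein Mann mit Hut")]
triples = build_synthetic_triples(comparable, toy_back_translate)

# Size check for the comparable training set: 29k images x 5 descriptions.
num_back_translated = 29_000 * 5
```

Only the English side is synthetic; the German target sentences and the images remain human-produced data, which is the usual motivation for back-translation.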
We also use the WMT 2015 text-only parallel corpora available for the English-German language pair, consisting of about 4.3M sentence pairs (Bojar et al., 2015). These include the Europarl v7 (Koehn, 2005), News Commentary and Common Crawl corpora, which are concatenated and used for pre-training.
We use the scripts in the Moses SMT Toolkit (Koehn et al., 2007) to normalise and tokenize English and German descriptions, and we also convert space-separated tokens into subwords (Sennrich et al., 2016b). All models use a common vocabulary of 83,093 English and 91,141 German subword tokens. If sentences in English or German are longer than 80 tokens, they are discarded. We train models to translate from English into German and report evaluation of cased, tokenized sentences with punctuation.
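The length filter can be sketched as below. The sketch assumes that a sentence pair is dropped as soon as either side exceeds the limit and that length is counted over whitespace-separated tokens of the already segmented text; both points are assumptions, and the function names are hypothetical.

```python
MAX_TOKENS = 80


def keep_pair(source_tokens, target_tokens, max_tokens=MAX_TOKENS):
    """Keep a sentence pair only if neither side exceeds max_tokens."""
    return len(source_tokens) <= max_tokens and len(target_tokens) <= max_tokens


def filter_corpus(pairs, max_tokens=MAX_TOKENS):
    """Filter (source, target) string pairs by whitespace token count."""
    return [
        (source, target)
        for source, target in pairs
        if keep_pair(source.split(), target.split(), max_tokens)
    ]


corpus = [
    ("a man in a hat", "ein Mann mit Hut"),
    (" ".join(["word"] * 81), "zu lang"),    # source side too long
    (" ".join(["word"] * 80), "gerade noch"),  # exactly at the limit
]
filtered = filter_corpus(corpus)
```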

Experimental setup
Our encoder is a bidirectional RNN with GRU, one 1024D single-layer forward and one 1024D single-layer backward RNN. Source and target word embeddings are 620D each and trained jointly with the model. Word embeddings and other non-recurrent matrices are initialised by sampling from a Gaussian $\mathcal{N}(0, 0.01^2)$, recurrent matrices are random orthogonal and bias vectors are all initialised to zero. Visual features are obtained by feeding images to the pre-trained ResNet-50 and using the activations of the res4f layer. We apply dropout with a probability of 0.5 in the encoder bidirectional RNN, the image features, the decoder RNN and before emitting a target word. We follow Gal and Ghahramani (2016) and apply dropout to the encoder bidirectional RNN and the decoder RNN using the same mask at all time steps.
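The idea of reusing one dropout mask at every time step, as opposed to resampling a mask per step, can be illustrated with a small sketch. The rescaling of kept units by 1/(1 - p) ("inverted" dropout) and the function names are assumptions for illustration; the text only states the dropout probability and the shared mask.

```python
import random


def sample_dropout_mask(size, drop_prob, rng):
    """Sample one mask; kept units are rescaled by 1 / (1 - drop_prob)."""
    keep_prob = 1.0 - drop_prob
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for _ in range(size)]


def apply_shared_mask(states, mask):
    """Apply the same mask to the state at every time step."""
    return [[m * x for m, x in zip(mask, state)] for state in states]


rng = random.Random(0)
mask = sample_dropout_mask(size=8, drop_prob=0.5, rng=rng)

# Five time steps of an 8-dimensional state filled with ones (toy data).
states = [[1.0] * 8 for _ in range(5)]
dropped = apply_shared_mask(states, mask)
```

Because the mask is sampled once per sequence, a unit that is dropped at the first time step stays dropped for the whole sequence.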
All models are trained using stochastic gradient descent with ADADELTA (Zeiler, 2012) with minibatches of size 80 (text-only NMT) or 40 (MNMT), where each training instance consists of one English sentence, one German sentence and one image (MNMT). We apply early stopping for model selection based on BLEU4, so that if a model does not improve on BLEU4 in the validation set for more than 20 epochs, training is halted.
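The early-stopping rule can be sketched as follows, assuming one validation BLEU4 score per epoch is available as a list; the function name and the toy scores are hypothetical, and actual training and evaluation are not shown.

```python
def select_with_early_stopping(validation_bleu_per_epoch, patience=20):
    """Return (best_epoch, best_bleu, last_epoch_run).

    Training halts once the validation score has failed to improve
    for more than `patience` consecutive epochs.
    """
    best_bleu = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    epoch = 0
    for epoch, bleu in enumerate(validation_bleu_per_epoch, start=1):
        if bleu > best_bleu:
            best_bleu, best_epoch = bleu, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                break
    return best_epoch, best_bleu, epoch


# Toy scores: improvement up to epoch 2, then a long plateau.
scores = [30.0, 31.0] + [30.5] * 25
best_epoch, best_bleu, stopped_at = select_with_early_stopping(scores)
```

With these toy scores the best model is the one from epoch 2, and training stops at epoch 23, the 21st consecutive epoch without improvement.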

Baselines
We train a text-only phrase-based SMT (PBSMT) system and a text-only NMT model for comparison. Our PBSMT baseline is built with Moses and uses a 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995). It is trained on the English-German descriptions of the M30k$_T$, whereas its LM is trained on the German descriptions only. We use minimum error rate training to tune the model (Och, 2003) with BLEU. The text-only NMT baseline is the one described in §2.1 and is trained on the M30k$_T$'s English-German descriptions.
We also compare our model against two multimodal attention-based NMT models. The first model is Huang et al. (2016)'s best model trained on the same data, and the second is their best model using additional object detections, respectively models m1 (image at head) and m3 in the authors' paper.

Results
In Table 1, we show results for our text-only baselines NMT and PBSMT, the multi-modal models of Huang et al. (2016) and our MNMT models trained on the M30k$_T$, and pre-trained on the in-domain back-translated M30k$_C$ and the general-domain text-only English-German MT corpora from WMT 2015.
Training on M30k$_T$. One main finding is that our model consistently outperforms the comparable model of Huang et al. (2016), with improvements of +1.4 BLEU and +2.7 METEOR. In fact, even when their model has access to more data our model still improves by +0.9 METEOR, while maintaining the same BLEU4 scores.
Moreover, we can also conclude from Table 1 that PBSMT performs better at recall-oriented metrics, i.e. METEOR and chrF3, whereas NMT is better at precision-oriented ones, i.e. BLEU4. This is somewhat expected, since the attention mechanism in NMT (Bahdanau et al., 2015) does not explicitly take attention weights from previous time steps into account, and thus lacks the notion of source coverage as in SMT (Koehn et al., 2003; Tu et al., 2016). We note that these ideas are complementary and incorporating coverage into model NMT$_{SRC+IMG}$ could lead to more improvements, especially in recall-oriented metrics. Nonetheless, our doubly-attentive model shows consistent gains in both precision- and recall-oriented metrics in comparison to the text-only NMT baseline, i.e. it is significantly better according to BLEU4, METEOR and TER (p < 0.01), and it also improves chrF3 by +2.1. In comparison to the PBSMT baseline, our proposed model still significantly improves according to both BLEU4 and TER (p < 0.01), also increasing METEOR by +0.7 but with an associated p-value of p = 0.071, therefore not significant for p < 0.05. Although chrF3 is the only metric in which the PBSMT model scores best, the difference between our model and the latter is only 0.1, meaning that they are practically equivalent. We note that model NMT$_{SRC+IMG}$ consistently increases character recall in comparison to the text-only NMT baseline. Although it can happen at the expense of character precision, gains in recall are always much higher than any eventual loss in precision, leading to consistent improvements in chrF3.
Table 1: BLEU4, METEOR, chrF3, character-level precision and recall (higher is better) and TER scores (lower is better) on the translated Multi30k (M30k$_T$) test set. Best text-only baselines results are underlined and best overall results appear in bold. We show Huang et al. (2016)'s improvements over the best text-only baseline in parentheses. Results are significantly better than the NMT baseline (†) and the SMT baseline (‡) with p < 0.01 (no pre-training) or p < 0.05 (when pre-training either on the back-translated M30k$_C$ or WMT'15 corpora). Best viewed in colour.
Pre-training. We now discuss results for models pre-trained using different data sets. We first pre-trained the two text-only baselines PBSMT and NMT, and our MNMT model on the back-translated M30k$_C$, a medium-sized in-domain image description data set (145k training instances). We also pre-trained the same models on the English-German parallel sentences of much larger MT data sets, i.e. the concatenation of the Europarl (Koehn, 2005), Common Crawl and News Commentary corpora, used in WMT 2015 (∼4.3M parallel sentences). Model PBSMT (concat.) used the concatenation of the pre-training and training data for training, and model PBSMT (LM) used the general-domain German sentences as additional data to train the LM. From Table 1, it is clear that model NMT$_{SRC+IMG}$ can learn from both in-domain, multi-modal pre-training data sets as well as text-only, general-domain ones.
Pre-training on M30k$_C$. When pre-training on the back-translated M30k$_C$, the recall-oriented chrF3 shows a difference of 1.4 points between PBSMT and our model, mostly due to character recall; nonetheless, our model still improved by the same margin on the text-only NMT baseline. Our model still outperforms the PBSMT baseline according to BLEU4 and TER, and the text-only NMT baseline according to all metrics (p < 0.05).
Figure 2: Visualisation of image- and source-target word alignments for the M30k$_T$ test set. Panel (b): source-target word alignments.
Pre-training on WMT 2015 corpora. We also pre-trained our models on the WMT 2015 corpora, which took 10 days, i.e. ∼6-7 epochs. Results show that model NMT$_{SRC+IMG}$ improves significantly over the NMT baseline according to BLEU4, and is consistently better than the PBSMT baseline according to all four metrics. This is a strong indication that model NMT$_{SRC+IMG}$ can exploit the additional pre-training data efficiently, both general- and in-domain. While the PBSMT model is still competitive when using additional in-domain data (according to METEOR and chrF3), the same cannot be said when using general-domain pre-training corpora. From our experiments, NMT models in general, and especially model NMT$_{SRC+IMG}$, thrive when training and test domains are mixed, which is a very common real-world scenario.
Textual and visual attention. In Figure 2, we visualise the visual and textual attention weights for an entry of the M30k$_T$ test set. In the visual attention, the $\beta$ gate (written in parentheses after each word) caused the image features to be used mostly to generate the words Mann (man) and Hut (hat), two highly visual terms in the sentence. We observe that in general visually grounded terms, e.g. Mann and Hut, usually have a high associated $\beta$ value, whereas other less visual terms like mit (with) or auf (at) do not. That causes the model to use the image features when it is describing a visual concept in the sentence, which is an interesting feature of our model. Interestingly, our model is very selective when choosing to use image features: it only assigned $\beta > 0.5$ to 20% of the output target words, and $\beta > 0.8$ to only 8%. A manual inspection of translations shows that these words are mostly concrete nouns with a strong visual appeal. Lastly, using two independent attention mechanisms is a good compromise between model compactness and flexibility. While the attention-based NMT model baseline has ∼200M parameters, model NMT$_{SRC+IMG}$ has ∼213M, thus using just ∼6.6% more parameters than the former.
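As a quick sanity check of the parameter overhead, the relative increase can be computed from the rounded counts quoted above. The inputs are the approximate figures from the text, so the result is itself approximate: the rounded counts give about 6.5%, and the reported ∼6.6% presumably reflects the exact, unrounded counts.

```python
def relative_increase(baseline, extended):
    """Relative growth of `extended` over `baseline`, as a fraction."""
    return (extended - baseline) / baseline


baseline_params = 200e6  # ~200M, text-only attention-based NMT (rounded)
mnmt_params = 213e6      # ~213M, doubly-attentive model (rounded)

overhead = relative_increase(baseline_params, mnmt_params)
print(f"relative parameter overhead: {overhead:.1%}")
```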

Related work
Multi-modal MT was just recently addressed by the MT community by means of a shared task. However, there has been a considerable amount of work on natural language generation from non-textual inputs. Mao et al. (2014) introduced a multi-modal RNN that integrates text and visual features and applied it to the tasks of image description generation and image-sentence ranking. In their work, the authors incorporate global image features in a separate multi-modal layer that merges the RNN textual representations and the global image features. Vinyals et al. (2015) proposed an influential neural IDG model based on the sequence-to-sequence framework, which is trained end-to-end. Elliott et al. (2015) put forward a model to generate multilingual descriptions of images by learning and transferring features between two independent, non-attentive neural image description models. Venugopalan et al. (2015) introduced a model trained end-to-end to generate textual descriptions of open-domain videos from the video frames based on the sequence-to-sequence framework. Finally, Xu et al. (2015) introduced the first attention-based IDG model where an attentive decoder learns to attend to different parts of an image as it generates its description in natural language.
In the context of NMT, Dong et al. (2015) proposed a multi-task learning approach where a model is trained to translate from one source language into multiple target languages. They used attention-based decoders where each language has one decoder RNN with a separate attention mechanism. Each translation task has a shared source-language encoder in common with all the other translation tasks. Firat et al. (2016) proposed a multi-way model trained to translate between many different source and target languages. Instead of one attention mechanism per language pair as in Dong et al. (2015), which would lead to a quadratic number of attention mechanisms in relation to language pairs, they use a shared attention mechanism where each target language has one attention shared by all source languages. Luong et al. (2016) proposed a multi-task approach where they train a model using two tasks and a shared decoder: the main task is to translate from German into English and the secondary task is to generate English image descriptions. They show improvements in the main translation task when also training for the secondary image description task. Although not an NMT model, Hitschler et al. (2016) recently used image features to re-rank translations of image descriptions generated by an SMT model and reported significant improvements.
Although no purely neural multi-modal model to date significantly improves on both text-only NMT and SMT models, different research groups have proposed to include global and spatial visual features in re-ranking n-best lists generated by an SMT system or directly in an NMT framework with some success (Caglayan et al., 2016; Calixto et al., 2016; Huang et al., 2016; Libovický et al., 2016; Shah et al., 2016). To the best of our knowledge, the best published results of a purely MNMT model are those of Huang et al. (2016), who proposed to use global visual features extracted with the VGG19 network (Simonyan and Zisserman, 2015) for an entire image, and also for regions of the image obtained using the RCNN of Girshick et al. (2014). Their best model improves over a strong text-only NMT baseline and is comparable to results obtained with an SMT model trained on the same data. For that reason, their models are used as baselines in our experiments.
Our work differs from previous work in that, first, we propose attention-based MNMT models. This is an important difference since the use of attention in NMT has become standard and is the current state-of-the-art (Jean et al., 2015; Luong et al., 2015; Firat et al., 2016; Sennrich et al., 2016b). Second, we propose a doubly-attentive model where we effectively fuse two mono-modal attention mechanisms into one multi-modal decoder, training the entire model jointly and end-to-end. In addition, we are interested in how to merge textual and visual representations into multi-modal representations when generating words in the target language, which differs substantially from text-only translation tasks even when these translate from many source languages into many target languages (Dong et al., 2015; Firat et al., 2016). To the best of our knowledge, we are the first to integrate multi-modal inputs in NMT via independent attention mechanisms.

Conclusions and Future Work
We have introduced a novel attention-based, multi-modal NMT model to incorporate spatial visual information into NMT. We have reported new state-of-the-art results on the M30k$_T$ test set, improving on previous multi-modal attention-based models. We also pre-trained our model on one in-domain multi-modal data set and many general-domain text-only MT corpora, finding that it learns efficiently and is able to exploit the additional data regardless of the domain. Our model also compares favourably to both NMT and PBSMT baselines evaluated on the same training data.
In the future, we will incorporate coverage into our model and study how to apply it to other Natural Language Processing tasks.