Attention Strategies for Multi-Source Sequence-to-Sequence Learning

Modeling attention in neural multi-source sequence-to-sequence learning remains a relatively unexplored area, despite its usefulness in tasks that incorporate multiple source languages or modalities. We propose two novel approaches, flat and hierarchical attention combination, for combining the outputs of attention mechanisms over each source sequence. We compare the proposed methods with existing techniques and present results of a systematic evaluation of those methods on the WMT16 Multimodal Translation and Automatic Post-editing tasks. We show that the proposed methods achieve competitive results on both tasks.


Introduction
Sequence-to-sequence (S2S) learning with an attention mechanism recently became the most successful paradigm, with state-of-the-art results in machine translation (MT) (Sennrich et al., 2016a), image captioning (Xu et al., 2015; Lu et al., 2016), text summarization (Rush et al., 2015), and other NLP tasks.
All of the above applications of S2S learning make use of a single encoder. Depending on the modality, it can be either a recurrent neural network (RNN) for textual input data, or a convolutional network for images.
In this work, we focus on a special case of S2S learning with multiple input sequences of possibly different modalities and a single output-generating recurrent decoder. We explore various strategies the decoder can employ to attend to the hidden states of the individual encoders.
The existing approaches to this problem do not explicitly model the different importance of the inputs to the decoder (Zoph and Knight, 2016). In multimodal MT (MMT), where an image and its caption are on the input, we might expect the caption to be the primary source of information, whereas the image itself would only play a role in output disambiguation. In automatic post-editing (APE), where a sentence in a source language and its automatically generated translation are on the input, we might want to attend to the source text only in case the model decides that there is an error in the translation.
We propose two interpretable attention strategies that take into account the roles of the individual source sequences explicitly: flat and hierarchical attention combination. This paper is organized as follows: In Section 2, we review the attention mechanism in single-source S2S learning. Section 3 introduces new attention combination strategies. In Section 4, we evaluate the proposed models on the MMT and APE tasks. We summarize the related work in Section 5, and conclude in Section 6.

Attentive S2S Learning
The attention mechanism in S2S learning allows an RNN decoder to directly access information about the input each time before it emits a symbol. Inspired by content-based addressing in Neural Turing Machines (Graves et al., 2014), the attention mechanism estimates a probability distribution over the encoder hidden states in each decoding step. This distribution is used for computing the context vector, the weighted average of the encoder hidden states, which serves as an additional input to the decoder. The standard attention model (Bahdanau et al., 2015) defines the attention energies e_ij, the attention distribution α_ij, and the context vector c_i in the i-th decoder step as:

e_ij = v_a^T tanh(W_a s_i + U_a h_j),    (1)

α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik),    (2)

c_i = Σ_{j=1}^{T_x} α_ij h_j.    (3)

The trainable parameters W_a and U_a are projection matrices that transform the decoder and encoder states s_i and h_j into a common vector space, and v_a is a weight vector over the dimensions of this space. T_x denotes the length of the input sequence. For the sake of clarity, bias terms (applied every time a vector is linearly projected using a weight matrix) are omitted.
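The single-source attention step described above can be sketched in NumPy. This is an illustrative implementation, not the authors' code: the function name `attention` and all parameter shapes are assumptions, and bias terms are omitted as in the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(s_i, H, W_a, U_a, v_a):
    """One attention step (Equations 1-3).

    s_i : (d_dec,)      current decoder state
    H   : (T_x, d_enc)  encoder hidden states
    W_a : (d_dec, d_att), U_a : (d_enc, d_att), v_a : (d_att,)
    Returns the attention distribution alpha_i and context vector c_i.
    """
    # e_ij = v_a^T tanh(W_a s_i + U_a h_j); broadcasting adds the decoder
    # projection to every projected encoder state.
    e_i = np.tanh(s_i @ W_a + H @ U_a) @ v_a   # (T_x,)
    alpha_i = softmax(e_i)                     # distribution over input positions
    c_i = alpha_i @ H                          # weighted average of encoder states
    return alpha_i, c_i
```

The distribution sums to one over the input positions, so the context vector always lies in the convex hull of the encoder states.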
Recently, Lu et al. (2016) introduced the sentinel gate, an extension of the attentive RNN decoder with LSTM units (Hochreiter and Schmidhuber, 1997). We adapt the extension for gated recurrent units (GRU), which we use in our experiments:

ψ_i = σ(W_y y_i + W_s s_{i−1}),    (4)

where W_y and W_s are trainable parameters, y_i is the embedded decoder input, and s_{i−1} is the previous decoder state. Analogously to Equation 1, we compute a scalar energy term for the sentinel:

e_i^(ψ) = v_a^T tanh(W_a s_i + U_a^(ψ) (ψ_i ⊙ s_i)),    (5)

where W_a and U_a^(ψ) are the projection matrices, v_a is the weight vector, and ψ_i ⊙ s_i is the sentinel vector. Note that the sentinel energy term does not depend on any hidden state of any encoder. The sentinel vector is projected to the same vector space as the encoder state h_j in Equation 1. The term e_i^(ψ) is added as an extra attention energy term to Equation 2, and the sentinel vector ψ_i ⊙ s_i is used as the corresponding vector in the summation in Equation 3.
This technique should allow the decoder to choose whether to attend to the encoder or to focus on its own state and act more like a language model. This can be beneficial if the encoder does not contain much relevant information for the current decoding step.
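A minimal sketch of an attention step with the sentinel appended, again under illustrative assumptions: the encoder states are taken as already projected into the attention space (so the sentinel term can be appended uniformly to both the energies and the weighted sum), and all names and shapes are hypothetical rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_with_sentinel(s_i, s_prev, y_i, Hp, W_a, v_a, W_y, W_s, U_psi):
    """Attention step extended with a sentinel gate for a GRU decoder.

    Hp : (T_x, d_att) encoder states already projected into the attention space
    s_i, s_prev : current and previous decoder states; y_i : embedded input
    """
    # Sentinel gate from the embedded input and the previous decoder state.
    psi_gate = sigmoid(y_i @ W_y + s_prev @ W_s)        # psi_i, a gate vector
    sentinel = psi_gate * s_i                           # psi_i ⊙ s_i
    proj_s = s_i @ W_a                                  # shared decoder projection
    proj_sent = sentinel @ U_psi                        # sentinel in attention space
    # Encoder energies plus one extra sentinel energy term.
    e_enc = np.tanh(proj_s + Hp) @ v_a                  # (T_x,)
    e_psi = np.tanh(proj_s + proj_sent) @ v_a           # scalar
    alpha = softmax(np.append(e_enc, e_psi))            # (T_x + 1,)
    # The sentinel joins the weighted sum as one extra value vector.
    values = np.vstack([Hp, proj_sent])
    c_i = alpha @ values
    return alpha, c_i
```

When the sentinel receives most of the attention mass, the context vector is dominated by a function of the decoder's own state, which is exactly the "act like a language model" behavior described above.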

Attention Combination
In S2S models with multiple encoders, the decoder needs to be able to combine the attention information collected from the encoders.
A widely adopted technique for combining multiple attention models in a decoder is concatenation of the context vectors (Zoph and Knight, 2016). As mentioned in Section 1, this setting forces the model to attend to each encoder independently and lets the attention combination be resolved implicitly in the subsequent network layers. In this section, we propose two alternative strategies of combining attentions from multiple encoders. We either let the decoder learn the α_i distribution jointly over all encoder hidden states (flat attention combination) or factorize the distribution over individual encoders (hierarchical combination).
Both of the alternatives allow us to explicitly compute distribution over the encoders and thus interpret how much attention is paid to each encoder at every decoding step.

Flat Attention Combination
Flat attention combination projects the hidden states of all encoders into a shared space and then computes a joint distribution over the projections. The difference between the concatenation of the context vectors and the flat attention combination is that the α_i coefficients are computed jointly for all encoders:

α_ij^(k) = exp(e_ij^(k)) / Σ_{n=1}^{N} Σ_{m=1}^{T_x^(n)} exp(e_im^(n)),    (6)

where N is the number of encoders, T_x^(n) is the length of the input sequence of the n-th encoder, and e_ij^(k) is the attention energy of the j-th state of the k-th encoder in the i-th decoding step. These attention energies are computed as in Equation 1. The parameters v_a and W_a are shared among the encoders, and U_a is different for each encoder and serves as an encoder-specific projection of hidden states into a common vector space.
The states of the individual encoders occupy different vector spaces and can have a different dimensionality, therefore the context vector cannot be computed as their weighted sum. We project them into a single space using linear projections:

c_i = Σ_{k=1}^{N} Σ_{j=1}^{T_x^(k)} α_ij^(k) U_c^(k) h_j^(k),    (7)

where U_c^(k) are additional trainable parameters. The matrices U_c^(k) project the hidden states into a common vector space. This raises the question whether this space can be the same as the one the hidden states are projected into in the energy computation using the matrices U_a. In our experiments, we explore both options. We also try both adding and not adding the sentinel vector to the attention combination.
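The flat combination can be sketched as follows: one softmax runs over the concatenated energies of all encoders, and encoder-specific projections map states of different dimensionality into one shared value space. This is an assumed NumPy illustration with separate U_a^(k) and U_c^(k) projections, not the authors' implementation.

```python
import numpy as np

def flat_attention(s_i, encoders, W_a, v_a, U_a_list, U_c_list):
    """Flat attention combination: one joint softmax over all encoder states.

    encoders : list of arrays H^(k) of shape (T_k, d_k); d_k may differ
    U_a_list : per-encoder projections into the shared attention space
    U_c_list : per-encoder projections into the shared context space
    W_a, v_a : shared across encoders, as in the text
    """
    proj_s = s_i @ W_a
    energies, values = [], []
    for H, U_a, U_c in zip(encoders, U_a_list, U_c_list):
        energies.append(np.tanh(proj_s + H @ U_a) @ v_a)  # energies as in Eq. 1
        values.append(H @ U_c)                            # states in common space
    e = np.concatenate(energies)                          # all positions together
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # joint distribution
    c_i = alpha @ np.vstack(values)                       # single context vector
    return alpha, c_i
```

Because the normalization runs over every position of every encoder, the mass assigned to one encoder directly trades off against the others, which is what makes the distribution over encoders interpretable.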

Hierarchical Attention Combination
The hierarchical attention combination model computes every context vector independently, similarly to the concatenation approach. Instead of concatenation, a second attention mechanism is constructed over the context vectors.
We divide the computation of the attention distribution into two steps: First, we compute the context vector for each encoder independently using Equation 3. Second, we project the context vectors (and optionally the sentinel) into a common space (Equation 8), we compute another distribution over the projected context vectors (Equation 9), and their corresponding weighted average (Equation 10):

e_i^(k) = v_b^T tanh(W_b s_i + U_b^(k) c_i^(k)),    (8)

β_i^(k) = exp(e_i^(k)) / Σ_{n=1}^{N} exp(e_i^(n)),    (9)

c_i = Σ_{k=1}^{N} β_i^(k) U_c^(k) c_i^(k),    (10)

where c_i^(k) is the context vector of the k-th encoder, the additional trainable parameters v_b and W_b are shared for all encoders, and U_b^(k) and U_c^(k) are encoder-specific projection matrices.
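The second-level attention over per-encoder context vectors can be sketched like this. The per-encoder context vectors are assumed to be computed beforehand by single-source attention; names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(s_i, contexts, W_b, v_b, U_b_list, U_c_list):
    """Hierarchical combination: attention over per-encoder context vectors.

    contexts : list of context vectors c_i^(k), one per encoder, each produced
               independently by single-source attention
    W_b, v_b : shared parameters; U_b^(k), U_c^(k) : encoder-specific projections
    """
    proj_s = s_i @ W_b
    # Energy of each encoder's context vector (one scalar per encoder).
    e = np.array([np.tanh(proj_s + c @ U_b) @ v_b
                  for c, U_b in zip(contexts, U_b_list)])
    beta = softmax(e)                       # distribution over the encoders
    # Weighted average of the projected context vectors.
    c_i = sum(b * (c @ U_c) for b, c, U_c in zip(beta, contexts, U_c_list))
    return beta, c_i
```

Here β_i is a distribution over encoders rather than over input positions, so it can be read off directly as "how much each source mattered" at decoding step i.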

Experiments
We evaluate the attention combination strategies presented in Section 3 on the tasks of multimodal translation (Section 4.1) and automatic post-editing (Section 4.2). The models were implemented using the Neural Monkey sequence-to-sequence learning toolkit (Helcl and Libovický, 2017). In both setups, we process the textual input with a bidirectional GRU network with 300 units in the hidden state in each direction and 300 units in the embeddings. For the attention projection space, we use 500 hidden units. We optimize the network to minimize the output cross-entropy using the Adam algorithm (Kingma and Ba, 2014) with learning rate 10^{−4}.

Multimodal Translation
The goal of multimodal translation is to generate target-language image captions given both the image and its caption in the source language.
We train and evaluate the model on the Multi30k dataset. It consists of 29,000 training instances (images together with English captions and their German translations), 1,014 validation instances, and 1,000 test instances. The results are evaluated using the BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2011) metrics.
In our model, the visual input is processed with a pre-trained VGG 16 network (Simonyan and Zisserman, 2014) without further fine-tuning. Attention distribution over the visual input is computed from the last convolutional layer of the network.
The decoder is an RNN with 500 conditional GRU units in the recurrent layer. We use byte-pair encoding (Sennrich et al., 2016b) with a vocabulary of 20,000 subword units shared between the textual encoder and the decoder.
The results of our experiments in multimodal MT are shown in Table 1. We achieved the best results using the hierarchical attention combination without the sentinel mechanism, which also showed the fastest convergence. The flat combination strategy achieves similar results eventually. Sharing the projections for energy and context vector computation does not improve over the concatenation baseline and slows the training almost prohibitively. Multimodal models were not able to surpass the textual baseline (BLEU 33.0).
Using the conditional GRU units brought an improvement of about 1.5 BLEU points on average, with the exception of the concatenation scenario, where the performance dropped by almost 5 BLEU points. We hypothesize this is caused by the fact that the model has to learn the implicit attention combination in multiple places: once in the output projection and three times inside the conditional GRU unit (Firat and Cho, 2016, Equations 10-12). We thus report the scores of the introduced attention combination techniques trained with conditional GRU units and compare them with the concatenation baseline trained with plain GRU units.

Automatic MT Post-editing
Automatic post-editing is the task of improving an automatically generated translation given the source sentence; the translation system itself is treated as a black box.
We used the data from the WMT16 APE Task, which consists of 12,000 training, 2,000 validation, and 1,000 test sentence triplets from the IT domain. Each triplet contains an English source sentence, an automatically generated German translation of the source sentence, and a manually post-edited German sentence as a reference. In this dataset, the MT outputs are often almost perfect, so only little effort was required to post-edit the sentences. The results are evaluated using the human-targeted translation edit rate (HTER) (Snover et al., 2006) and the BLEU score (Papineni et al., 2002).
Following Libovický et al. (2016), we encode the target sentence as a sequence of edit operations transforming the MT output into the reference. By this technique, we prevent the model from paraphrasing the input sentences. The decoder is a GRU network with 300 hidden units. Unlike in the MMT setup (Section 4.1), we do not use the conditional GRU because it is prone to overfitting on the small dataset we work with. The models were able to slightly, but significantly, improve over the baseline of leaving the MT output as is (HTER 24.8). The differences between the attention combination strategies are not significant.
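The edit-operation encoding can be sketched with Python's `difflib`. The exact operation inventory of Libovický et al. (2016) may differ; the `<keep>`/`<delete>` tags and both helper functions here are illustrative assumptions. The key property is that the operation sequence deterministically reconstructs the reference from the MT output.

```python
import difflib

def edit_operations(mt, ref):
    """Encode the reference as edit operations over the MT output tokens.

    mt, ref : lists of tokens. Kept and deleted MT tokens become the special
    tags <keep> and <delete>; inserted tokens appear verbatim.
    """
    ops = []
    sm = difflib.SequenceMatcher(a=mt, b=ref, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.extend(["<keep>"] * (i2 - i1))
        else:
            ops.extend(["<delete>"] * (i2 - i1))  # drop the mismatched MT tokens
            ops.extend(ref[j1:j2])                # emit the reference tokens
    return ops

def apply_operations(mt, ops):
    """Replay the operations on the MT output to reconstruct the reference."""
    out, i = [], 0
    for op in ops:
        if op == "<keep>":
            out.append(mt[i])
            i += 1
        elif op == "<delete>":
            i += 1
        else:
            out.append(op)
    return out
```

Since the decoder only emits operations, a near-perfect MT output is represented almost entirely by `<keep>` tokens, which matches the observation above that little post-editing effort was required on this dataset.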

Related Work
Attempts to use S2S models for APE are relatively rare. Niehues et al. (2016) concatenate both inputs into one long sequence, which forces the encoder to work with both the source and target language. Their attention is then similar to our flat combination strategy; however, it can only be used for sequential data.
The best system from the WMT'16 competition (Junczys-Dowmunt and Grundkiewicz, 2016) trains two separate S2S models, one translating from the MT output to post-edited targets and the second one from source sentences to post-edited targets. The decoders average their output distributions similarly to decoder ensembling. The biggest source of improvement in this state-of-the-art post-editor came from additional training data generation, rather than from changes in the network architecture. Caglayan et al. (2016) used an architecture very similar to ours for multimodal translation. They made a strong assumption that the network can be trained in such a way that the hidden states of the encoder and the convolutional network occupy the same vector space, and thus sum the context vectors from both modalities. In this way, their multimodal MT system (BLEU 27.82) remained far below the text-only setup (BLEU 32.50).
New state-of-the-art results on the Multi30k dataset were achieved very recently by Calixto et al. (2017). The best-performing architecture uses the last fully-connected layer of VGG-19 network (Simonyan and Zisserman, 2014) as decoder initialization and only attends to the text encoder hidden states. With a stronger monomodal baseline (BLEU 33.7), their multimodal model achieved a BLEU score of 37.1. Similarly to Niehues et al. (2016) in the APE task, even further improvement was achieved by synthetically extending the dataset.

Conclusions
We introduced two new strategies of combining attention in a multi-source sequence-to-sequence setup. Both methods are based on computing a joint distribution over hidden states of all encoders.
We conducted experiments with the proposed strategies on the multimodal translation and automatic post-editing tasks, and we showed that the flat and hierarchical attention combination can be applied to these tasks while maintaining scores competitive with previously used techniques.
Unlike the simple context vector concatenation, the introduced combination strategies can be used with the conditional GRU units in the decoder. On top of that, the hierarchical combination strategy exhibits faster learning than the other strategies.