Input Combination Strategies for Multi-Source Transformer Decoder

In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierarchical. We evaluate our methods on tasks of multimodal translation and translation with multiple source languages. The experiments show that the models are able to use multiple sources and improve over single source baselines.


Introduction
The Transformer model recently demonstrated superior performance in neural machine translation (NMT) and other sequence generation tasks such as text summarization or image captioning. However, all of these setups consider only a single input to the decoder part of the model.
In the Transformer architecture, the representation of the source sequence is supplied to the decoder through the encoder-decoder attention. This attention sub-layer is applied between the self-attention and feed-forward sub-layers in each Transformer layer. Such an arrangement leaves many options for incorporating multiple encoders.
So far, attention in sequence-to-sequence learning with multiple source sequences has mostly been studied in the context of recurrent neural networks (RNNs). Prior work on RNNs explicitly captures the distribution over multiple inputs by projecting the input representations to a shared vector space and either computing the attention over all hidden states at once, or hierarchically, using another level of attention applied on the context vectors. Zoph and Knight (2016) employ a gating mechanism for combining the context vectors. Voita et al. (2018) adapted the gating mechanism for use within the Transformer model for context-aware MT. The other approaches, however, are not directly usable in the Transformer model.
We propose a number of strategies for combining the different sources in the Transformer model. Some of the strategies described in this work are adaptations of strategies previously used with recurrent neural networks, whereas the rest are novel contributions devised for the Transformer architecture. We test these strategies on multimodal machine translation (MMT) and multi-source machine translation (MSMT) tasks.
This paper is organized as follows. In Section 2, we briefly describe the decoder part of the Transformer model. We propose a number of input combination strategies for the multi-source Transformer model in Section 3. Section 4 describes the experiments we performed, and Section 5 shows the results of quantitative evaluation. An overview of the related work is given in Section 6. We discuss the results and conclude in Section 7.

Transformer Decoder
The Transformer architecture is based on the use of attention. Attention can be viewed as a soft-lookup function operating on an associative memory. For each query vector in the query set Q, the attention computes a weighted sum of the values V associated with a set of keys K, where the weights are based on the similarity of each key to the query.
The variant of the attention function used in the Transformer architecture is called multi-head scaled dot-product attention. The scaled dot-product of queries and keys is used as the similarity measure. Given the dimension of the input vectors d, the attention is computed as follows:
$$A(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$
In the multi-head variant, the vectors that represent the queries, keys, and values are linearly transformed to a number of projections (usually with smaller dimension), called attention heads. The attention is computed in each head independently and the outputs are concatenated and projected back to the original dimension:
$$A^h(Q, K, V) = \sum_{i=1}^{h} C_i W_i^O, \qquad C_i = A(Q W_i^Q, K W_i^K, V W_i^V)$$
where $W_i^O \in \mathbb{R}^{d_h \times d}$ are trainable parameter matrices used as projections of the attention head outputs of dimension $d_h$ to the model dimension $d$, and where $W_i^Q$, $W_i^K$, and $W_i^V \in \mathbb{R}^{d \times d_h}$ are trainable projection matrices used to project the attention inputs to the attention heads. Summing the per-head projections is equivalent to concatenating the head outputs and applying a single projection.
The model itself consists of a number of layers, each of which is divided into three sub-layers: self-attention, encoder-decoder (or cross) attention, and a feed-forward layer. Both attention types use identical sets for keys and values, and the states of the previous layer are used as the query set. The self-attention sub-layer attends to the previous decoder layer (i.e., the sets of queries and keys are identical). Since the decoder works autoregressively from left to right, the self-attention is masked during training to prevent attending to future positions in the sequence. The encoder-decoder attention sub-layer attends to the final layer of the encoder. The feed-forward sub-layer consists of a single non-linear projection (usually to a space with larger dimension), followed by a linear projection back to the vector space with the original dimension. The input of each sub-layer is summed with its output, creating a residual connection chain throughout the whole layer stack.
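The attention computation described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiments: matrices are lists of rows, the function names are hypothetical, and each head is assumed to be given as a tuple of four projection matrices (for queries, keys, values, and the output).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, heads):
    """Sum over heads of A(Q W_q, K W_k, V W_v) W_o.

    `heads` is a list of (w_q, w_k, w_v, w_o) tuples (an assumed layout);
    w_q, w_k, w_v have shape d x d_h and w_o has shape d_h x d.
    """
    output = [[0.0] * len(q[0]) for _ in q]
    for w_q, w_k, w_v, w_o in heads:
        head = attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        projected = matmul(head, w_o)
        output = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(output, projected)]
    return output
```

With a single head and identity projections, `multi_head_attention` reduces to plain `attention`, which is a convenient sanity check.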

Proposed Strategies
We propose four input combination strategies for the multi-source variant of the Transformer network, as illustrated in Figure 1. Two of them, serial and parallel, model the encoder-decoder attentions independently and are a natural extension of the sub-layer scheme in the Transformer decoder. The other two, flat and hierarchical, are inspired by approaches previously proposed for RNNs and model joint distributions over the inputs.
Serial. The serial strategy (Figure 1a) computes the encoder-decoder attention one by one for each input encoder. The query set of the first cross-attention is the set of the context vectors computed by the preceding self-attention. The query set of each subsequent cross-attention is the output of the preceding sub-layer. All of these sub-layers are interconnected with residual connections.
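A minimal sketch of the serial strategy follows. It is an illustration under simplifying assumptions rather than the implementation used in the experiments: attention is single-head without learned projections, layer normalization is omitted, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    contexts = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        contexts.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return contexts

def serial_combination(decoder_states, encoders):
    """Attend to each encoder in turn; every step adds a residual connection."""
    states = decoder_states
    for encoder_states in encoders:
        contexts = attention(states, encoder_states, encoder_states)
        # The output of this sub-layer becomes the query set of the next one.
        states = [[s + c for s, c in zip(state, context)]
                  for state, context in zip(states, contexts)]
    return states
```

Because each cross-attention consumes the output of the previous one, the order of the encoders matters in this strategy.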
Parallel. In the parallel combination strategy (Figure 1b), the model attends to each encoder independently and then sums up the context vectors. Each encoder is attended using the same set of queries, i.e., the output of the self-attention sub-layer. A residual connection links the queries and the summed context vectors from the parallel attention.
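The parallel strategy can be sketched in the same simplified setting. As an assumption for illustration, attention is single-head without learned projections, layer normalization is omitted, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    contexts = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        contexts.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return contexts

def parallel_combination(decoder_states, encoders):
    """Attend to every encoder with the same queries and sum the contexts."""
    summed = [list(state) for state in decoder_states]  # residual connection
    for encoder_states in encoders:
        contexts = attention(decoder_states, encoder_states, encoder_states)
        summed = [[s + c for s, c in zip(row, context)]
                  for row, context in zip(summed, contexts)]
    return summed
```

Unlike the serial variant, all cross-attentions here share the same queries, so the result does not depend on the order of the encoders.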
Flat. The encoder-decoder attention in the flat combination strategy (Figure 1c) uses all the states of all input encoders as a single set of keys and values. Thus, the attention models a joint distribution over a flattened set of all encoder states. Unlike the approach taken in the recurrent setup, where the flat combination strategy requires an explicit projection of the encoder states to a shared vector space, in the Transformer models, the vector spaces of all layers are tied with residual connections. Therefore, the intermediate projection of the states of each encoder is not necessary.
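The flat strategy amounts to concatenating the encoder states before a single attention call. The sketch below is illustrative only: it assumes single-head attention without learned projections, returns the context vectors without the residual connection, and uses hypothetical function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    contexts = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        contexts.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return contexts

def flat_combination(decoder_states, encoders):
    """Attend once over the concatenation of all encoder states."""
    all_states = [state for encoder_states in encoders for state in encoder_states]
    return attention(decoder_states, all_states, all_states)
```

Since the softmax runs over the flattened set, a longer input sequence contributes more candidate states to the single distribution than a shorter one.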
Hierarchical. In the hierarchical combination (Figure 1d), we first compute the attention independently over each input. The resulting contexts are then treated as states of another input and the attention is computed once again over these states.
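A sketch of the hierarchical strategy is given below. It assumes, for illustration, that the second attention is computed separately for each decoder position over the per-encoder context vectors of that position, that the contexts serve directly as both keys and values, and that attention is single-head without learned projections. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    contexts = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        contexts.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return contexts

def hierarchical_combination(decoder_states, encoders):
    """First attend to each encoder, then attend over the resulting contexts."""
    per_encoder = [attention(decoder_states, enc, enc) for enc in encoders]
    combined = []
    for position, query in enumerate(decoder_states):
        # One context vector per encoder for this decoder position.
        contexts = [encoder_contexts[position] for encoder_contexts in per_encoder]
        combined.append(attention([query], contexts, contexts)[0])
    return combined
```

With a zero query and the encoders `[[2, 0], [4, 0]]` and `[[0, 6]]`, this yields `[1.5, 3.0]`: each encoder receives equal weight at the second level regardless of its length, whereas a single attention over the three flattened states would weight every state equally.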

Experiments
We conduct our experiments on two different tasks: multimodal translation and multi-source machine translation. We use Neural Monkey for the design, training, and evaluation of the experiments.
In all experiments, the encoder part of the network follows the original Transformer architecture.
We optimize the model parameters using the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$, an initial learning rate of 0.2, and Noam learning rate decay with 4,000 warm-up steps. The mini-batch size is 32 for the MMT experiments and 24 for the multi-source MT experiments.
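The Noam schedule increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step number. The sketch below is illustrative: it assumes that the initial learning rate of 0.2 acts as a multiplicative factor on the schedule and that the model dimension is 512. Both are assumptions about how the pieces combine, and the function name is hypothetical.

```python
def noam_learning_rate(step, model_dim=512, warmup_steps=4000, factor=0.2):
    """Learning rate at a given training step under the Noam schedule.

    `factor` is assumed to be the configured initial learning rate.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    warmup_term = step * warmup_steps ** -1.5  # linear increase
    decay_term = step ** -0.5                  # inverse square-root decay
    return factor * model_dim ** -0.5 * min(warmup_term, decay_term)
```

Under these assumptions, the rate peaks at step 4,000 and is halfway to its peak at step 2,000.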
During decoding, we use beam search of width 10 and length normalization of 1.0 (Wu et al., 2016).
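Length normalization in the style of Wu et al. (2016) divides the log-probability of a hypothesis by a length penalty so that longer outputs are not unduly disfavored. The sketch below assumes that the value 1.0 corresponds to the exponent alpha of that penalty; the function names and the hypothesis format are hypothetical.

```python
def length_penalty(length, alpha=1.0):
    """Length penalty ((5 + |Y|) / (5 + 1)) ** alpha from Wu et al. (2016)."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha=1.0):
    """Log-probability of a hypothesis divided by its length penalty."""
    return log_prob / length_penalty(length, alpha)

def best_hypothesis(beam, alpha=1.0):
    """Pick the best (tokens, log_prob) pair from a finished beam."""
    return max(beam, key=lambda hyp: normalized_score(hyp[1], len(hyp[0]), alpha))
```

With alpha set to 0, the penalty is always 1 and the ranking falls back to raw log-probabilities.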

Multimodal Translation
The goal of MMT is translating image captions from one language into another given both the source caption and the image as input. We use the Multi30k dataset containing triplets of images, English captions, and their translations into German, French, and Czech. The dataset contains 29k triplets for training, 1,014 for validation, and a test set of 1,000. We experiment with all language pairs available in this dataset.
We extract image features using the last convolutional layer of the ResNet network (He et al., 2016) trained for ImageNet classification. We apply a linear projection into 512 dimensions on the image representation, so it has the same dimension as the rest of the model. For each language pair, we create a shared wordpiece-based vocabulary of approximately 40k subwords. We share the embedding matrices across the languages and we use the transposed embedding matrix as the output projection matrix as proposed by Press and Wolf (2017).
We use 6 layers in the textual encoder and decoder, and set the model dimension to 512. We set the dimension of the hidden layers in the feed-forward sub-layers to 4096. We use 16 heads in the attention layers.
During the evaluation, we follow the preprocessing used in the WMT Multimodal Translation Shared Task.
Previous work (Elliott and Kádár, 2017) shows that the improved performance of multimodal models compared to textual models can come from improving the input representation. In order to test whether this is also the case with our models, or whether the models explicitly use the visual input, we perform an adversarial evaluation similar to Elliott (2018). We evaluate the model while providing a random image and observe whether the translation quality drops.
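The adversarial test set can be built by pairing each source sentence with an image taken from another example. The sketch below is one possible way to do this, not necessarily the procedure used in the experiments: it assumes examples are (sentence, image) pairs and deliberately never reuses an example's own image. The function name is hypothetical.

```python
import random

def randomize_images(examples, seed=0):
    """Pair every sentence with an image drawn from a different example."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to swap images")
    rng = random.Random(seed)  # fixed seed keeps the evaluation reproducible
    adversarial = []
    for index, (sentence, _image) in enumerate(examples):
        other = rng.randrange(len(examples) - 1)
        if other >= index:
            other += 1  # skip the example's own image
        adversarial.append((sentence, examples[other][1]))
    return adversarial
```

Comparing the scores on the original and the randomized test sets then indicates how much the model relies on the visual input.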

Multi-Source MT
In this set of experiments, we attempt to generate a sentence in a target language, given equivalent sentences in multiple source languages.
We use the Europarl corpus (Tiedemann, 2012) for training and testing the MSMT. We use Spanish, French, German, and English as source languages and Czech as a target language. We selected an intersection of the bilingual sub-corpora using English as a pivot language. Our dataset contains 511k 5-tuples of sentences for training, 1k for validation and another 1k for testing.
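Selecting the intersection of bilingual sub-corpora with English as the pivot can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing actually used: each sub-corpus is assumed to be a list of (English, foreign) sentence pairs, sentences are matched by exact string equality, and for repeated English sentences the last pair wins. The function name is hypothetical.

```python
def intersect_on_pivot(bitexts):
    """Build multi-parallel tuples from bilingual corpora sharing English.

    `bitexts` maps a language code to a list of (english, foreign) pairs.
    Returns tuples of (english, translation in each language, sorted by code).
    """
    tables = {lang: dict(pairs) for lang, pairs in bitexts.items()}
    languages = sorted(tables)
    # Keep only English sentences that occur in every bilingual sub-corpus.
    shared = set.intersection(*(set(table) for table in tables.values()))
    return [(english,) + tuple(tables[lang][english] for lang in languages)
            for english in sorted(shared)]
```

For example, with Czech, German, Spanish, and French sub-corpora, each returned item is a 5-tuple containing the English sentence followed by its four translations.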
Due to the memory demands of having four encoders, we use a smaller model than in the previous experiment. The encoders only have 4 layers and the decoder has 6 layers with embedding size 256, feed-forward layer dimension 2048, and 8 attention heads. We use a shared wordpiece vocabulary of 48k subwords. As in the MMT experiments, the transposition of the embedding matrix is reused as the parameters of the output projection layer (Press and Wolf, 2017).
We use bilingual English-to-Czech translation as a single-source baseline. The baseline uses a vocabulary of 42k subwords from Czech and English only.
Similarly to the MMT, we also perform an adversarial evaluation. To evaluate the importance of each source language for the translation quality, we evaluate the models while randomizing one of the source languages.

Results
We evaluate the results using BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2011) as implemented in MultEval. The results of the MMT task are tabulated in Table 1. The results of the multi-source MT are shown in Table 2.
In MMT, the input combination significantly surpassed the text-only baseline in English-to-French translation. The performance in other target languages is only slightly better than the textual baseline.
The only worse score was achieved by the flat combination strategy. We hypothesize this might be because the optimization failed to find a common representation of the input modalities that could be used to compute the joint distribution.
The adversarial evaluation with randomly selected input images shows that all our models rely on both inputs while generating the target sentence and that providing incorrect visual input harms the model performance. The modality gating in the hierarchical attention combination seems to make the models more robust to noisy visual input.
In the multi-source translation task, all the proposed strategies perform better than single-source translation from English to Czech. Among the combination strategies, the best-scoring is the serial stacking of the attentions. In multimodal translation, the flat combination proved to be the worst-performing strategy.
Analysis of the attention distribution shows that the serial strategy uses information from all source languages. The parallel strategy hardly uses the Spanish source, and the flat strategy prefers the English source. The hierarchical strategy uses information from all source languages; however, the attentions are sometimes fuzzier than in the previous strategies. Figure 2 shows what source languages were attended on different layers of the encoder. Other examples of the attention visualization are shown in Appendix A.

Related Work
So far, MMT has been addressed only with RNN-based architectures. Elliott et al. (2015) report significant improvements with a non-attentive model. With attentive models (Bahdanau et al., 2014), the additional visual information usually did not improve the models significantly (Caglayan et al., 2016) in terms of BLEU score. Our models slightly outperform these models in the single-model setup. Besides serving as a direct input to the model, the image features can be used as an auxiliary objective (Elliott and Kádár, 2017). In this setup, the visually grounded representation improves the MMT significantly, achieving results similar to those our models achieved using only the Multi30k dataset.
To our knowledge, multi-source MT has also been studied only using RNN-based models. Dabre et al. (2017) use simple concatenation of source sentences in various languages and process them with a single multilingual encoder. Zoph and Knight (2016) try context concatenation and a hierarchical gating method for combining context vectors in attention models with multiple inputs encoded by separate encoders. In all of their experiments, the multi-source methods significantly surpass the single-source baseline. Nishimura et al. (2018) extend the former approach to situations when one of the source languages is missing, so that the translation system does not overly rely on a single source language like some of the models presented in this work.

Conclusions
We proposed several input combination strategies for multi-source sequence-to-sequence learning using the Transformer model. Two of the strategies are a straightforward extension of cross-attention in the Transformer model: the cross-attentions are combined either serially, interleaved with residual connections, or in parallel. The two remaining strategies are an adaptation of the flat and the hierarchical attention combination strategies introduced by Libovický and  in the context of recurrent sequence-to-sequence models.
The results on the MMT task show similar properties as in RNN-based models (Caglayan et al., 2017). Adding visual features significantly improves translation into French and brings minor improvements on other language pairs. All the attention combinations perform similarly, with the exception of the flat strategy, which probably struggles with learning a shared representation of the input tokens and the image representation.
Evaluation on multi-source MT shows significant improvements over the single-source baseline. However, the adversarial evaluation suggests that the model relies heavily on the English input and only uses the additional source languages for minor modifications of the output. All attention combinations performed similarly.