An Evaluation of Language-Agnostic Inner-Attention-Based Representations in Machine Translation

In this paper, we explore a multilingual translation model with a cross-lingually shared layer that can be used as fixed-size sentence representation in different downstream tasks. We systematically study the impact of the size of the shared layer and the effect of including additional languages in the model. In contrast to related previous work, we demonstrate that the performance in translation does correlate with trainable downstream tasks. In particular, we show that larger intermediate layers not only improve translation quality, especially for long sentences, but also push the accuracy of trainable classification tasks. On the other hand, shorter representations lead to increased compression that is beneficial in non-trainable similarity tasks. We hypothesize that the training procedure on the downstream task enables the model to identify the encoded information that is useful for the specific task whereas non-trainable benchmarks can be confused by other types of information also encoded in the representation of a sentence.


Introduction
Neural Machine Translation (NMT) has rapidly become the new Machine Translation (MT) paradigm, significantly improving over the traditional statistical machine translation procedure . Recently, several models and variants have been proposed with increased research efforts towards multilingual machine translation (Firat et al., 2016;Lakew et al., 2018;Wang et al., 2018;Blackwood et al., 2018;Lu et al., 2018). The main motivation of multilingual models is the effect of transfer learning that enables machine translation systems to benefit from relationships between languages and training signals that come from different datasets (Ha et al., 2016;Johnson et al., 2017;Gu et al., 2018). Another aspect that draws interest in translation models is the effective computation of sentence representations using the translation task as an auxiliary semantic signal (Hill et al., 2016;McCann et al., 2017;Schwenk and Douze, 2017;Subramanian et al., 2018). An important feature that enables an immediate use of the MT-based representations in other downstream tasks is the creation of fixed-sized sentence embeddings (Cífka and Bojar, 2018).
However, the effects of the size of sentence embeddings and the relation between translation performance and meaning representation quality are not entirely clear. Recent studies based on NMT either focus entirely on the use of MT-based sentence embeddings in other tasks (Schwenk, 2018), on translation quality (Lu et al., 2018), on speed comparison (Britz et al., 2017), or only exploring a bilingual scenario (Cífka and Bojar, 2018).
In this paper, we are interested in exploring a cross-lingual intermediate shared layer (called attention bridge) in an attentive encoder-decoder MT model. This shared layer serves as a fixedsize sentence representation that can be straightforwardly applied to downstream tasks. We examine this model with a systematic evaluation on different sizes of the attention bridge and extensive experiments to study the abstractions it learns from multiple translation tasks. In contrast to previous work (Cífka and Bojar, 2018), we demonstrate that there is a correlation between translation performance and trainable downstream tasks when adjusting the size of the intermediate layer.
The trend is different for non-trainable tasks that benefit from the increased compression that denser representations achieve, which typically hurts the translation performance because of the decreased capacity of the model. We also show that multilingual models improve trainable downstream tasks even further, demonstrating the additional abstraction that is pushed into the representations through additional translation tasks involved in training.

Architecture
Our architecture follows the standard setup of an encoder-decoder model in machine translation with a traditional attention mechanism (Luong et al., 2015). However, we augment the network with language specific encoders and decoders to enable multilingual training as in Lu et al. (2018), plus we introduce an inner-attention layer (Liu et al., 2016;Lin et al., 2017) that summarizes the encoder information in a fixed-size vector representation that can easily be shared among different translation tasks with the language-specific encoders and decoders connecting to it. The overall architecture is illustrated in Figure 1 (see also Vázquez et al., 2019). Due to the attentive connection between encoders and decoders we call this layer attention bridge, and its architecture is an adaptation from the model proposed by Cífka and Bojar (2018). Finally, each decoder follows a common attention mechanism in NMT, with the only exception that the context vector is computed on the attention bridge, and the initialization is performed by a mean pooling over it. Hence, the decoder receives the information only through the shared attention bridge.
The fixed-sized representation coming out of the shared layer can immediately be applied to downstream tasks. 1 However, selecting a reasonable size of the attention bridge in terms of attention heads (m i in Figure 1) is crucial for the performance both in a bilingual and multilingual sce-1 As in Lu et al. (2018), we note that the attention bridge is independent of the underlying encoder and decoder. While we use LSTM, it could be easily replaced with a transformer type network (Vaswani et al., 2017) or with a CNN (Gehring et al., 2017). nario as we will see in the experiments below.

Experimental setup
All models are implemented using the OpenNMT framework (Klein et al., 2017) trained using the same set of hyper-parameters. 2 We use embedding layers of 512 dimensions, two stacked bidirectional LSTM layers with 512 hidden units (256 per direction) and an attentive decoder composed of two unidirectional LSTM layers with 512 units. Regarding the attention bridge, we experimented with four different configurations: 1, 10, 25 and 50 attention heads with 1024 hidden units each. For multilingual models, we used a language-rotating scheduler, in which each mini-batch contains sentences from a different language pair, cycling through all the language pairs uniformly. We selected the best model according to the BLEU score on the validation set. We train all the models using the Europarl Corpus v7 (Koehn, 2005), focusing on 4 languages: English (EN), French (FR), German (DE) and Spanish (ES). First we train bilingual models for EN→DE; then we train multilingual models {DE,ES,FR}↔EN; lastly we train a final Many-to-Many model using the biggest size, i.e., 50 attention heads, involving all translation directions between the three languages, i.e., we also include DE-ES, DE-FR and ES-FR.
To evaluate the sentence representations we utilize the SentEval toolkit (Conneau and Kiela, 2018)  sentences. 3 In order to obtain a sentence vector out of multiple attention heads we apply mean pooling over the attention bridge.
We are also interested in the translation quality to verify the appropriateness of our models with respect to the main objective they are trained for. For this, we adopt the in-domain development and evaluation dataset from the ACL-WMT07 shared task. Sentences are encoded using Byte-Pair Encoding (Sennrich et al., 2016), with 32,000 merge operations for each language.
4 SentEval: Classification tasks Table 1 shows the performance of our models on two popular tasks (SNLI and SICK-E) as in Cífka and Bojar (2018) as well as the average of all 10 SentEval downstream tasks. The experiments reveal two important findings: (1) In contrast with the results from Cífka and Bojar (2018), our scores demonstrate that an increasing number of attention heads is beneficial for classification-based downstream tasks. All models perform best with more than one attention head and the general trend is that the accuracies improve with larger representations. The previous claim was that there is the opposite effect and lower numbers of attention heads lead to higher performances in downstream tasks, but we do not see that effect in our setup, at least not in the classification tasks.
(2) The second outcome is the positive effect 3 Due to the large number of SentEval tasks, we report results on natural language inference (SNLI, SICK-E/SICK-R) and the average of all tasks.  Table 2: Results from supervised similarity tasks (SICK-R and STSB), measured using Pearson's (r) and Spearman's (ρ) correlation coefficients (r/ρ). The average across unsupervised similarity tasks on Pearson's measures are displayed in the right-most column. Results with † taken from Cífka and Bojar (2018). of multilingual training. We can see that multilingual training objectives are generally helpful for the trainable downstream tasks.
Particularly interesting is the fact that the Manyto-Many model performs best on average even though it does not add any further training examples for English (compared to the other multilingual models), which is the target language of the downstream tasks. This suggests that the model is able to improve generalizations even from other language pairs (DE-ES, FR-ES, FR-DE) that are not directly involved in training the representations of English sentences.
Comparing against benchmarks, our results are in line with competitive baselines (Arora et al., 2017). While our aim is not to beat the state of the art trained on different data, but rather to understand the impact of various sizes of attention heads in a bi-and multilingual scenario, we argue that a larger attention bridge and multilinguality constitute a preferable starting point to learn more meaningful sentence representations. Table 2 summarizes the results using Pearson's and Spearman's coefficient on the two SentEval supervised textual similarity tasks, SICK-R and STSB, and the average Pearson's measure on the remaining unsupervised similarity tasks.

SentEval: Similarity tasks
Two different trends become visible: i) On the unsupervised textual similarity tasks, having fewer attention heads is beneficial. Contrary to the results in the classification tasks, the best overall  Table 3: BLEU scores for multilingual models. Baseline system in the right-most column.
model is provided by a bilingual setting with only one attention head. This is in line with the findings of Cífka and Bojar (2018) and could also be expected as the model is more strongly pushed into a dense semantic abstraction that is beneficial for measuring similarities without further training. More surprising is the negative effect of the multilingual models. We believe that the multilingual information encoded jointly in the attention bridge hampers the results for the monolingual semantic similarity measured with the cosine distance, while it becomes easier in a bilingual scenario where the vector encodes only one source language, English in this case. ii) On the supervised textual similarity tasks, we find a similar trend as in the previous section for SICK: both a higher number of attention heads and multilinguality contribute to better scores, while for STSB, we notice a different pattern.
This general discrepancy between results in supervised and unsupervised tasks is not new in the literature (Hill et al., 2016). We hypothesize that the training procedure is able to pick up the information needed for the task, while in the unsupervised case a more dense representation is essential.

Translation quality
Finally, we also look at the translation performance of the multilingual models we have introduced above compared with a baseline, an standard encoder-decoder model with attention (Luong et al., 2015). In this section, we verify that the attention bridge model is stable and successfully learns to translate in the multilingual case. Table 3 shows the comparison between the multilingual models. In general, we observe the same trend as in the bilingual evaluation concerning the size of the attention bridge. Namely, more attention heads lead to a higher BLEU score. The model with 50 heads achieves the best results among our models. It obtains scores that range in the same ballpark as the baseline, only in a few Figure 2: The BLEU scores obtained by the multilingual models and baseline system with respect to different sentence length.
cases there is a degradation of few BLEU points. Notably, we do not see any increase in translation quality from the {DE,ES,FR}↔EN model to the Many-to-Many model; the BLEU scores are statistically equivalent for all six translation directions.
One of the main motivations for having more attention heads lies in the better support of longer sentences. To study the effect, we group sentences of similar length and compute the BLEU score for each group. As we can see from Figure 2 a larger number of attention heads has, indeed, a positive impact when translating longer sentences. Interestingly enough, on sentences with up to 45 words, there is no real gap between the results of the baseline model and our bridge models with a high number of attention heads. It looks like the performance drop of the attention bridge models is entirely due to sentences longer than 45 words.
We hypothesize that this might be due to the increasing syntactic divergences between the languages that have to be encoded. The shared selfattention layer needs to learn to focus on different parts of a sentence depending on the language it reads and, with increasing lengths of a sentence, this ability becomes harder and more difficult to pick up from the data alone.

Conclusion
We have shown that fixed-size sentence representations can effectively be learned with multilingual machine translation using a inner-attention layer and scheduled training with multiple translation tasks. The performance of the model heavily depends on the size of the intermediate representation layer and we show that a higher number of attention heads leads to improved translation and stronger representations in supervised downstream tasks (contradicting earlier findings) and multilinguality also helps in the same downstream tasks. Our analysis reveals that the attention bridge model mainly suffers on long sentences. The next steps will include a deeper linguistic analysis of the translation model and the extension to multilingual models with more languages with greater linguistic diversity.