Multilingual NMT with a Language-Independent Attention Bridge

In this paper, we propose an architecture for machine translation (MT) that obtains multilingual sentence representations by incorporating an intermediate attention bridge shared across all languages. We train the model with language-specific encoders and decoders that are connected through an inner-attention layer on the encoder side. The attention bridge exploits the semantics from each language for translation and develops into a language-agnostic meaning representation that can be used efficiently for transfer learning. We present a new framework for the efficient development of multilingual neural machine translation (NMT) using this model and scheduled training, and we test the approach systematically on a multi-parallel data set. The model achieves substantial improvements over strong bilingual baselines and performs well in zero-shot translation, demonstrating its capacity for abstraction and transfer learning.


Introduction
Neural machine translation (NMT) provides an ideal setting for multilingual MT because it can efficiently share model parameters and take advantage of the various similarities found by the model in the hidden layers and word embeddings (Firat et al., 2016a; Johnson et al., 2017; Blackwood et al., 2018). Furthermore, multilingual NMT has the potential to considerably improve the performance of neural translation systems for low-resource languages (Lakew et al., 2017) and enables zero-shot translation, i.e., translating between language pairs that were not seen during training (Firat et al., 2016b; Johnson et al., 2017).
For this study we focus on models for multilingual translation that learn language-agnostic representations, where we outline the development of a language-independent representation based on an attention bridge shared across all languages. For this, we apply an architecture based on shared self-attention with language-specific encoders and decoders that can easily scale to a large number of languages while addressing the task of obtaining language-independent sentence embeddings (Cífka and Bojar, 2018; Lu et al., 2018; Lin et al., 2017). These embeddings are created from the encoder's self-attention and connect to the language-specific decoders that attend to them, hence the name 'bridge'. Additionally, we add a penalty term to avoid redundancy in the shared layer. More details of the architecture are given in Section 2.
To summarise our contributions, we i) present a multilingual translation system that efficiently tackles the task of learning language-agnostic sentence representations; ii) verify that this model enables effective transfer learning and zero-shot translation through the shared representation layer; and iii) show that multilingually trained embeddings improve the majority of downstream and sentence probing tasks, demonstrating the abstractions learned from the combined translation tasks.

Model Architecture
Our architecture follows the standard setup of an encoder-decoder model of machine translation with a traditional attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). However, to enable multilingual training we augment the network with language-specific encoders and decoders trainable with a language-rotating scheduler (Dong et al., 2015; Schwenk and Douze, 2017). We also incorporate a self-attention layer (the attention bridge), shared among all language pairs, to serve as a language-agnostic layer (Cífka and Bojar, 2018; Lu et al., 2018).

Attention bridge: Each encoder takes as input a sequence of tokens $(x_1, \dots, x_n)$ and produces $n$ $d_h$-dimensional hidden states $H = (h_1, \dots, h_n)$ with $h_i \in \mathbb{R}^{d_h}$, in our case using a bidirectional long short-term memory (LSTM) network (Graves and Schmidhuber, 2005). Next, we encode this variable-length sentence-embedding matrix $H$ into a fixed-size matrix $M$ capable of focusing on $k$ different components of the sentence (Lin et al., 2017; Chen et al., 2018; Cífka and Bojar, 2018), using self-attention as follows:

$$A = \mathrm{softmax}\big(W_2\,\mathrm{ReLU}(W_1 H^\top)\big) \qquad (1)$$
$$M = A H \qquad (2)$$

where $W_1 \in \mathbb{R}^{d_w \times d_h}$ and $W_2 \in \mathbb{R}^{k \times d_w}$ are weight matrices, with $d_w$ a hyper-parameter set arbitrarily and $k$ the number of attention heads in the attention bridge; each row of $A \in \mathbb{R}^{k \times n}$ is an attention distribution over the tokens, and $M \in \mathbb{R}^{k \times d_h}$ is the fixed-size sentence representation. Each decoder follows a common attention mechanism in NMT (Luong et al., 2015), with an initial state computed by mean pooling over $M$, and uses $M$ instead of the hidden states of the encoder for computing the context vector.
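To make the computation concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(2). Class and variable names are ours, not those of the released OpenNMT-py fork; it is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBridge(nn.Module):
    """Minimal sketch of Eqs. (1)-(2): maps variable-length encoder states
    H (n x d_h) to a fixed-size matrix M (k x d_h) via self-attention."""

    def __init__(self, d_h: int = 512, d_w: int = 1024, k: int = 10):
        super().__init__()
        self.W1 = nn.Linear(d_h, d_w, bias=False)  # W1 in R^{d_w x d_h}
        self.W2 = nn.Linear(d_w, k, bias=False)    # W2 in R^{k x d_w}

    def forward(self, H: torch.Tensor):
        # H: (batch, n, d_h) -- encoder hidden states
        scores = self.W2(F.relu(self.W1(H)))                # (batch, n, k)
        A = torch.softmax(scores.transpose(1, 2), dim=-1)   # (batch, k, n); rows sum to 1
        M = torch.bmm(A, H)                                 # (batch, k, d_h), Eq. (2)
        return M, A

# The decoder's initial state is the mean pool over the k rows of M:
# s0 = M.mean(dim=1)
```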
Penalty term: The attention bridge matrix $M$ from Eq. (2) could learn repetitive information across its different attention heads. To address this issue, we add a penalty term to the loss function, proven effective in related work (Lin et al., 2017; Chen et al., 2018; Tao et al., 2018), which forces each head to focus on different aspects of the sentence by making the rows of $A$ approximately orthogonal:

$$\mathcal{L}_{\mathrm{penalty}} = \big\| A A^\top - I \big\|_F^2 \qquad (3)$$

where $\|\cdot\|_F$ is the Frobenius norm, whose square equals the sum of the squared singular values of the matrix. By incorporating this term into the loss function we push $A A^\top$ towards the identity matrix, that is, $(A A^\top)_{ii} = \sum_j a_{ij}^2 \approx 1$ and $(A A^\top)_{il} \approx 0$ for $i \neq l$. Since each row of $A$ sums to 1, with entries in $[0, 1]$, the diagonal constraint forces each attention distribution to be sharply peaked while the off-diagonal constraint forces different heads to attend to different tokens, hence penalizing redundancy, similar to the doubly stochastic attention in Xu et al. (2015).
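The penalty of Eq. (3) is a few lines on top of the sketch above; again, this is our own illustration, not the released code.

```python
import torch

def attention_penalty(A: torch.Tensor) -> torch.Tensor:
    """Eq. (3): ||A A^T - I||_F^2, averaged over the batch.
    A: (batch, k, n); pushes the k attention rows toward orthogonality."""
    batch, k, _ = A.shape
    I = torch.eye(k, device=A.device, dtype=A.dtype).expand(batch, k, k)
    AAt = torch.bmm(A, A.transpose(1, 2))   # (batch, k, k)
    return ((AAt - I) ** 2).sum(dim=(1, 2)).mean()

# total_loss = nll_loss + penalty_weight * attention_penalty(A)
```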

Experimental Setup
We conducted four translation experiments and tested the learned sentence representations via downstream tasks. We used the Multi30k dataset (Elliott et al., 2016) for training and validation in all available languages (Czech, German, French and English), tested the trained models with the Flickr 2016 test data of the same dataset, and obtained BLEU scores using the sacreBLEU script (Post, 2018). We lowercased, normalized and tokenized the data using the Moses toolkit (Koehn et al., 2007), and applied a Byte Pair Encoding (BPE) model with 10k merge operations per language (Sennrich et al., 2016). Each encoder consists of 2 stacked BiLSTMs of size $d_h = 512$, i.e., the hidden states per direction are of size 256. Each decoder includes 2 stacked unidirectional LSTMs with hidden states of size 512. For the model input and output, the word embeddings have dimension $d_x = d_y = 512$. The attention bridge layer uses $k = 10$ attention heads with $d_w = 1024$, the inner dimension of $W_1$ and $W_2$ from Eq. (1). We chose $k = 10$ because the mean length of a preprocessed sentence in the training data is 13.2 tokens; a much smaller $k$ would create a bottleneck in the flow of information, while a bigger one would make the model slower and prone to overfitting (Raganato et al., 2019).
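For reference, corpus-level BLEU can be computed with the sacrebleu Python package roughly as follows; the example sentences are illustrative, and we omit file handling and tokenization options.

```python
import sacrebleu

# hyps: detokenized system outputs; refs: one reference per hypothesis
hyps = ["a man is riding a bicycle ."]
refs = ["a man rides a bicycle ."]

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```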
We used a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1.0 and a batch size of 64, and selected the best model on the development set for each experiment. We implemented our model on top of an OpenNMT-py (Klein et al., 2017) fork, which we make available for reproducibility purposes.
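The language-rotating schedule mentioned in Section 2 can be sketched as follows, reusing the AttentionBridge and attention_penalty sketches above. The module interfaces, dictionary layout, and penalty weight are our own illustrative assumptions, not the authors' exact training loop.

```python
import itertools

def train_multilingual(encoders, decoders, bridge, batches, optimizer,
                       steps, penalty_weight=0.1):
    """Round-robin ("language-rotating") training sketch.

    encoders, decoders: dicts mapping a language code to its nn.Module;
    bridge: the shared AttentionBridge defined above;
    batches: dict mapping a (src, tgt) pair to an iterator of batches;
    optimizer: covers the parameters of all modules.
    """
    pair_cycle = itertools.cycle(batches.keys())
    for _ in range(steps):
        src_lang, tgt_lang = next(pair_cycle)
        src, tgt = next(batches[(src_lang, tgt_lang)])
        H = encoders[src_lang](src)        # language-specific encoder
        M, A = bridge(H)                   # shared, language-agnostic layer
        nll = decoders[tgt_lang](tgt, M)   # decoder attends over M only
        loss = nll + penalty_weight * attention_penalty(A)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```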

Results
First, we verify the correct functionality of the architecture in a bilingual setting; these bilingual models, trained both with and without an attention bridge, become our baselines for comparison with the multilingual models.
On the left side of Table 1 we see that the bilingual attention bridge models score slightly below the standard bilingual baselines. This drop is expected since we pass the information through a fixed-size representation made of 10 self-attention heads without including multilingual information. However, the drop is less than one BLEU point except for English to French, which seems to be an exceptional outlier.
This result confirms the validity of the architecture: the additional bottleneck does not cause significant deterioration, so we can move on to the multilingual models.

Many-To-One and One-To-Many Models
The power of the attention bridge comes from its ability to share information across various language pairs. We now assess the effects of multilingual information on the translation of individual language pairs by training many-to-one and one-to-many models. This setup allows us to test the abstraction potential of the attention bridge and its effectiveness at encoding multilingual information in zero-shot translation.
First we trained a {De,Fr,Cs}↔En model (Table 1 (center-top)), which resulted in substantial improvements for the language pairs seen during training, exceeding both bilingual baselines. However, this model is entirely incapable of performing zero-shot translation. We believe that this inability to generalize to unseen language pairs arises from the fact that every non-English encoder (or decoder) only learned to process information that was to be decoded into English (or encoded from English input), a finding consistent with Lu et al. (2018). To address this problem, we incorporated monolingual data during training: for each available language A, we included pairs of identical copies of each sentence in A in the training data. All examples come from the same parallel corpus as before and no additional data is used.
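Constructing these additional examples is a one-pass transformation of the existing corpus; a sketch (the function name and tuple layout are ours):

```python
def add_monolingual_pairs(examples):
    """For every (src_lang, tgt_lang, src, tgt) example, add the identity
    pairs src->src and tgt->tgt built from the same parallel corpus."""
    extra = []
    for src_lang, tgt_lang, src, tgt in examples:
        extra.append((src_lang, src_lang, src, src))
        extra.append((tgt_lang, tgt_lang, tgt, tgt))
    return examples + extra
```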
As a consequence, we see a remarkable increase in the BLEU scores, including a substantial boost for the language pairs not seen during training (Table 1 (center-bottom)). It seems that the monolingual data informs the model that English is not the only source/target language. Additionally, there is a positive effect on the seen language pairs (up to almost 2 BLEU points for French to English), the cause of which is not immediately evident. It is possible that the shared layer acquires additional information that can be included in the abstraction process yet is not available to the other models.

Many-to-Many Models
We also tested the architecture in a many-to-many setting with all language pairs included, and summarize our results in Table 1 (right). As in the previous case, we compare settings with and without monolingual training data.
The inclusion of all language pairs results in improved performance compared to the bilingual baselines, as well as to the Many↔En cases, except for the En→Fr and En→De tasks. Moreover, the addition of monolingual data leads to even higher scores, producing the overall best model. The improvements in BLEU range from 1.40 to a remarkable 4.43 points when compared to the standard bilingual models.
The zero-shot translation capabilities also deserve a closer look. Figure 1 summarizes a systematic evaluation in which we trained six different models, each including all but one of the available language pairs in training. The cyan bars illustrate the performance of each model on its unseen language pair compared to our best multilingual model (in red) and the bilingual, fully supervised model (in dark blue). Note that these zero-shot models are generally better than the ones from the previously discussed {De,Fr,Cs,En}↔En model in Table 1. In most cases, they come very close to the supervised model and even fare well against the multilingual ones.

Figure 1: For every language pair, we compare the BLEU scores of our best model (M-2-M with monolingual data), the zero-shot performance of the model trained without that specific language pair, and the bilingual model for that language pair.

Downstream Tasks
We apply the sentence representations learned by our model to the downstream tasks collected in the SentEval toolkit (Conneau and Kiela, 2018) to evaluate the quality of our language-agnostic sentence embeddings. We run each experiment with five different seeds and report the averaged scores in Table 2, where we compare our multilingual models against a baseline consisting of the best score achieved by the bilingual models with attention bridge. Since our models were trained on limited data and are not directly comparable to models trained on large-scale data sets, we also present, for reference, results obtained with GloVe-BoW vectors (Pennington et al., 2014) trained on the same BPE-encoded data as our models.
The sentence embeddings produced by the multilingual models show consistent improvements on the classification tasks of the SentEval collection, with only two exceptions. Moreover, our many-to-many model obtains better results on the SICK Relatedness (SICKR) and STS-Benchmark (STS-B) tasks, that is, the trainable semantic similarity tasks. For the SentEval probing tasks we use the default recommended settings, i.e., a multilayer perceptron classifier with sigmoid nonlinearity, 200 hidden units, and a 0.1 dropout rate. Again, we observe improvements in the majority of cases when adding multiple languages to the training procedure. Remarkably, we observe a significant increase in accuracy for the Length (a surface property), Top Constituents (a syntactic property) and Object Number (semantic information) tasks when training the encoders with multilingual data. Multilingual models outperform the bilingual models in all but one test.
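A typical SentEval setup for these probing experiments looks roughly as follows. Here encode_with_bridge is a hypothetical stand-in for the trained encoder plus attention bridge (pooled to one vector per sentence), and the parameter dictionary mirrors the settings described above; exact keys may vary across SentEval versions.

```python
import senteval

def prepare(params, samples):
    return  # nothing to fit: the encoder is fixed

def batcher(params, batch):
    # batch is a list of tokenized sentences; encode each with the trained
    # encoder + attention bridge, pooling M over its k heads into one vector.
    sentences = [' '.join(tokens) for tokens in batch]
    return encode_with_bridge(sentences)  # hypothetical: (len(batch), d_h) array

params = {
    'task_path': 'SentEval/data',  # path to the toolkit's task data
    'usepytorch': True,
    'classifier': {'nhid': 200, 'dropout': 0.1, 'optim': 'adam',
                   'batch_size': 64, 'tenacity': 5, 'epoch_size': 4},
}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['Length', 'TopConstituents', 'ObjNumber'])
```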

Effect of the Penalty Term
To study the effect of the penalty term, we train additional bilingual models without the penalty term (Eq. 3) in the loss. We then compare BLEU scores with and without the penalty term, as shown in Table 3.
Overall, both types of models perform in the same ballpark, yielding similar results. As discussed in Lin et al. (2017), the quantitative effect of the penalty term may not be obvious for some tasks, while it retains the positive effect of encouraging the attention matrix to focus on different aspects of the sentence. Although the effect of the penalty term is not very pronounced in this case, we note that adding it does not hurt performance while helping the model avoid learning redundant information.

Conclusion
We propose a multilingual NMT architecture with three modifications to the common attentive encoder-decoder architecture: language-specific encoders and decoders, a shared language-independent attention bridge, and a penalty term that forces this layer to attend to different parts of the input sentence. This constitutes a multilingual translation system that efficiently incorporates transfer learning and can also tackle the task of learning multilingual sentence representations. The results suggest that the attention bridge layer can efficiently share parameters in a multilingual setting, with gains of up to 4.4 BLEU points over the bilingual baselines. Additionally, we use the sentence representations produced by the shared attention bridge of the trained models for downstream testing, which verifies the generalization capabilities of the model. The results suggest that sentence embeddings improve with additional languages involved in training the underlying machine translation model.