Language-aware Interlingua for Multilingual Neural Machine Translation

Multilingual neural machine translation (NMT) has led to impressive accuracy improvements in low-resource scenarios by sharing common linguistic information across languages. However, the traditional multilingual model fails to capture the diversity and specificity of different languages, resulting in inferior performance compared with individual models that are sufficiently trained. In this paper, we incorporate a language-aware interlingua into the Encoder-Decoder architecture. The interlingual network enables the model to learn a language-independent representation from the semantic spaces of different languages, while still allowing for language-specific specialization of a particular language-pair. Experiments show that our proposed method achieves remarkable improvements over state-of-the-art multilingual NMT baselines and produces comparable performance with strong individual models.


Introduction
Neural Machine Translation (NMT) (Sutskever et al., 2014; Vaswani et al., 2017) has significantly improved translation quality due to its end-to-end modeling and continuous representation. While conventional NMT performs single-pair translation well, training a separate model for each language pair is resource-consuming, considering there are thousands of languages in the world. Multilingual NMT was therefore introduced to handle multiple language pairs in one model, reducing the online serving and offline training cost. Furthermore, the multilingual NMT framework facilitates cross-lingual knowledge transfer, which improves translation performance on low-resource language pairs (Wang et al., 2019).
Despite these advantages, multilingual NMT remains a challenging task, since language diversity and model capacity limitations lead to inferior performance compared with individual models that are sufficiently trained. Recent efforts in multilingual NMT therefore mainly focus on enlarging the model capacity, either by introducing multiple Encoders and Decoders to handle different languages (Firat et al., 2016; Zoph and Knight, 2016), or by enhancing the attention mechanism with language-specific signals (Blackwood et al., 2018). On the other hand, there have been some efforts to model the specificity of different languages. Johnson et al. (2017) and Ha et al. (2016) tackle this by simply adding pre-designed tokens at the beginning of the source/target sequence. Based on our observations, however, we argue that such signals are not strong enough to learn the language-specific information needed to transform the continuous representation of each language into the shared semantic space.
In this paper, we incorporate a language-aware Interlingua module into the Encoder-Decoder architecture. It explicitly models the shared semantic space for all languages and acts as a bridge between the Encoder and Decoder networks. Specifically, we first introduce a language embedding to represent the unique characteristics of each language and an interlingua embedding to capture the common semantics across languages. We then use the two embeddings to augment the self-attention mechanism, which transforms the Encoder representation into the shared semantic space. To minimize information loss and keep semantic consistency during the transformation, we also introduce a reconstruction loss and a semantic consistency loss into the training objective. Besides, to further enhance the language-specific signal, we incorporate a language-aware positional embedding for both Encoder and Decoder, and take the language embedding as the initial state of the target side. We conduct experiments on both standard WMT data sets and large-scale in-house data sets. Our proposed model achieves remarkable improvements over state-of-the-art multilingual NMT baselines and produces comparable performance with sufficiently trained individual models.

Model Architecture
As shown in Figure 1, we propose a universal Encoder-Interlingua-Decoder architecture for multilingual NMT. The Encoder and Decoder are identical to the generic self-attention TRANSFORMER (Vaswani et al., 2017), except for some modifications to the positional embedding. The Interlingua is shared across languages but takes a language-specific embedding as input, so we call it a language-aware Interlingua. The Interlingua module is composed of a stack of N identical layers. Each layer has a multi-head attention sub-layer and a feed-forward sub-layer.

Interlingua
The Interlingua module uses a multi-head attention mechanism to map the Encoder output H_enc of different languages to a language-independent representation I.
Here, H_enc denotes the hidden states output by the Encoder, d is the hidden size, and n is the length of the source sentence. ATT(.) is the multi-head attention mechanism (Vaswani et al., 2017). The keys and values (K, V) are computed from the Encoder output H_enc. The query Q is a simple linear combination of two parts: a language-specific part, the language embedding L_emb, and a shared matrix I_emb, which we call the interlingua embedding. Note that the interlingua embedding I_emb has a fixed size of [d×r]. The i-th column of I_emb represents an initial semantic subspace that guides which semantic information in H_enc should be attended to at the corresponding position i of the Interlingua output. The value r means that every Encoder output H_enc is mapped into a fixed-size representation of r hidden states; it is set to 10 in all of our experiments, similar to Vázquez et al. (2018). By incorporating a shared interlingua embedding, we expect the model to exploit the semantics of various subspaces of the encoded representation, so that the same semantic components of different sentences, whether from the same or different languages, are mapped to the same position i ∈ [1, r]. The language embedding L_emb indicates to the Interlingua which language it is attending to, since different languages have their own characteristics; this is why we call the module a language-aware Interlingua. FFN(.) is a simple position-wise feed-forward network. By introducing the Interlingua module into the Encoder-Decoder structure, we explicitly model the intermediate semantics. In this framework, the language-sensitive Encoder models the characteristics of each language, and the language-independent Interlingua enhances cross-language knowledge transfer.
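The fixed-size mapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head, omits the learned projections, the FFN, and the layer stacking, and assumes that the "simple linear combination" forming Q is a plain sum of I_emb and L_emb. The function names and the toy dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interlingua_attention(h_enc, i_emb, l_emb):
    """Map n encoder states to r interlingua states.

    h_enc: n vectors of size d (encoder output, used as keys and values)
    i_emb: r vectors of size d (shared interlingua embedding)
    l_emb: one vector of size d (embedding of the source language)
    """
    d = len(l_emb)
    # Assumption: the query is a plain sum, with the language embedding
    # added to every one of the r interlingua positions.
    queries = [[q + l for q, l in zip(row, l_emb)] for row in i_emb]
    outputs = []
    for query in queries:
        # Scaled dot-product scores against all n encoder states.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in h_enc]
        weights = softmax(scores)
        # Weighted average of the encoder states (the values).
        outputs.append([sum(w * value[j] for w, value in zip(weights, h_enc))
                        for j in range(d)])
    return outputs


random.seed(0)
d, r = 8, 10  # toy hidden size; r = 10 as in the text
i_emb = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
l_emb = [random.gauss(0, 1) for _ in range(d)]

# Sentences of different lengths map to the same fixed number of states.
short_sentence = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
long_sentence = [[random.gauss(0, 1) for _ in range(d)] for _ in range(23)]
short_out = interlingua_attention(short_sentence, i_emb, l_emb)
long_out = interlingua_attention(long_sentence, i_emb, l_emb)
print(len(short_out), len(long_out))  # prints "10 10"
```

Because the queries come from I_emb rather than from the sentence itself, the output length depends only on r, which is what lets the Decoder attend to a representation of the same shape for every language and sentence length.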

Language Embedding as Initial State
The universal Encoder-Decoder model (Johnson et al., 2017) uses a special token (e.g. <2en>) at the beginning of the source sentence, which signals the Decoder to translate the sentence into the right target language. But this is a weak signal, as the language information must pass through N = 6 Encoder self-attention layers and then N = 6 Encoder-Decoder attention layers before the Decoder attends to it. Inspired by Wang et al. (2018), we build a language embedding explicitly and use it directly as the initial state of the Decoder.

Language-aware Positional Embedding
Considering the structural differences between languages, each language should have a specific positional embedding. Wang et al. (2018) use trigonometric functions with different orders or offsets in the Decoder for different languages. Inspired by this, we provide a language-aware positional embedding for both Encoder and Decoder by adding language-specific offsets to the original sine(x) and cosine(x) functions in TRANSFORMER. The offset is calculated as W_L L_emb, where W_L is a weight matrix and L_emb is the language embedding.
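One possible reading of this offset is sketched below. The text does not specify the shape of W_L or exactly where the offset enters, so this sketch assumes one offset per sine/cosine frequency pair, added to the argument of both functions. The function names and toy values are hypothetical.

```python
import math


def language_offset(w_l, l_emb):
    """Compute the offset vector W_L L_emb with plain Python lists."""
    return [sum(w * e for w, e in zip(row, l_emb)) for row in w_l]


def language_aware_position(pos, d_model, offset):
    """Sinusoidal encoding of one position, shifted by a language offset.

    Assumption: offset has d_model // 2 entries, one per sine/cosine pair,
    and is added to the argument of both functions.
    """
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model)) + offset[i]
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


d_model = 8
l_emb = [0.5, -1.0, 0.25]  # hypothetical language embedding
# Hypothetical weight matrix W_L with d_model // 2 rows.
w_l = [[0.1 * (i + j) for j in range(3)] for i in range(d_model // 2)]
offset = language_offset(w_l, l_emb)

# A zero offset recovers the standard TRANSFORMER encoding.
plain = language_aware_position(3, d_model, [0.0] * (d_model // 2))
shifted = language_aware_position(3, d_model, offset)
```

Since W_L and L_emb would both be learned, each language can shift the phase of its positional signal while the underlying frequencies stay shared.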

Training Objective
We introduce three types of training objectives in our model, similar to Escolano et al. (2019).
(i) Translation objective: Generally, a bilingual NMT model adopts the cross-entropy loss as the training objective, which we denote as L_s2t. In addition, we incorporate another loss L_t2s for translation from the target to the source.
(ii) Reconstruction objective: The Interlingua transforms the Encoder output into an intermediate representation I. During translation, the Decoder uses only I and no other Encoder information. Inspired by Lample et al. (2017), Tu et al. (2017) and Lample et al. (2018), we incorporate a reconstruction loss to minimize information loss. We denote X' = Decoder(Interlingua(Encoder(X))) as the reconstruction of X. We employ the cross-entropy between X' and X as our reconstruction loss, denoted L_s2s for the source and L_t2t for the target.
(iii) Semantic consistency objective: Sentences from different languages with the same semantics should have the same intermediate representation. We therefore use a simple but effective measure, cosine similarity, to quantify the consistency. Similar objectives were incorporated in zero-shot translation (Al-Shedivat and Parikh, 2019; Arivazhagan et al., 2019). Here, I_s and I_t denote the Interlingua representations of the source and target sides respectively, and I^i is the i-th column of matrix I. L_dist = 1 − sim(I_s, I_t) is used as the distance loss in our training objective. Finally, the objective function of our learning algorithm combines the translation losses L_s2t and L_t2s, the reconstruction losses L_s2s and L_t2t, and the distance loss L_dist.

Experiments

Experimental Settings
We conduct our experiments on both WMT data and in-house data. For WMT data, we use the WMT13 English-French (En-Fr) and English-Spanish (En-Es) data. The En-Fr and En-Es data consist of 18M and 15M sentence pairs respectively. We use newstest2012 and newstest2013 as our validation set and test set respectively. Our in-house data contains about 130M parallel sentences for each language pair in En-Fr, En-Es, En-Pt (Portuguese), and 80M for En-Tr (Turkish). In all our experiments, we follow the settings of TRANSFORMER-base (Vaswani et al., 2017) with hidden/embedding size 512, 6 hidden layers and 8 attention heads. We use 3 layers for the Interlingua and set r = 10, similar to Vázquez et al. (2018). We apply sub-word NMT (Sennrich et al., 2015), where a joint BPE model is trained for all languages with 50,000 operations. We use a joint vocabulary of 50,000 sub-words for all language pairs.

Multilingual NMT vs Bilingual NMT
We take the UNIV model introduced by Johnson et al. (2017) as our multilingual NMT baseline, and individual models trained for each language pair as our bilingual NMT baseline.
The experimental results on WMT data are shown in Table 1. Compared with the UNIV model (Johnson et al., 2017), our model obtains statistically significant improvements in both many-to-one and one-to-many translation directions on WMT data. Note that we set the Encoder of the UNIV model to 9 layers, which makes it comparable to our model in terms of model size. Compared with the individual models, our model is slightly better for Fr/Es-En in the many-to-one scenario. In the one-to-many scenario, the individual models get the best BLEU scores, while our model outperforms the universal model in all language pairs. Similarly, the experimental results on the in-house large-scale data are shown in Table 2. In the one-to-many setting, our model achieves BLEU scores comparable to the bilingual NMT baselines (Individual model), with an improvement of around 1 BLEU point in En-Pt translation. Our model gets the best BLEU score in the many-to-one directions for all language pairs. Besides, the proposed model significantly exceeds the multilingual baseline (Universal model) in all directions.
The results show that multilingual NMT models perform better in big-data scenarios. This may be because the intermediate representation can be trained more fully in a large-scale setting.

Zero-shot Translation
To examine whether our language-aware Interlingua can help cross-lingual knowledge transfer, we perform zero-shot translation on WMT data. The Fr-Es and Es-Fr translation directions are the zero-shot directions. As shown in Table 1, our method yields an improvement of more than 10 BLEU points over the universal Encoder-Decoder approach and significantly narrows the gap with sufficiently trained individual models.

Ablation study on training objectives
We further verify the impact of different training objectives in Table 1. Compared with the INTL baseline, the REC training objective further improves the translation quality of both supervised and zero-shot language pairs. The SIM objective, in contrast, contributes significantly to zero-shot translation quality, with a slight decrease on supervised language pairs. Integrating both REC and SIM into INTL ultimately achieves balanced improvements across supervised and zero-shot language pairs. This suggests that constraints on the Interlingua can lead to better intermediate semantic representations and translation quality.
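The variants compared in this ablation differ only in which loss terms are included. A minimal sketch of how the terms might be combined is given below. It assumes that sim(I_s, I_t) averages the cosine similarity over the r Interlingua positions and that the overall objective is an unweighted sum, neither of which is confirmed by the text here. The function names and the use_rec / use_sim flags are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distance_loss(i_src, i_tgt):
    """L_dist = 1 - sim(I_s, I_t).

    i_src, i_tgt: r interlingua states each (the columns I^i in the text).
    Assumption: sim averages the cosine similarity over the r positions.
    """
    sims = [cosine(u, v) for u, v in zip(i_src, i_tgt)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(l_s2t, l_t2s, l_s2s, l_t2t, l_dist, use_rec=True, use_sim=True):
    """Assumption: an unweighted sum; REC and SIM terms can be switched off."""
    loss = l_s2t + l_t2s          # translation losses (always used)
    if use_rec:
        loss += l_s2s + l_t2t     # reconstruction losses (REC)
    if use_sim:
        loss += l_dist            # semantic consistency loss (SIM)
    return loss


# Toy interlingua representations with r = 2 positions of size 2.
i_a = [[1.0, 0.0], [0.0, 2.0]]
i_b = [[0.0, 1.0], [3.0, 0.0]]
same = distance_loss(i_a, i_a)        # identical representations
orthogonal = distance_loss(i_a, i_b)  # orthogonal at every position
print(same, orthogonal)  # prints "0.0 1.0"
```

Under these assumptions, INTL corresponds to use_rec=False and use_sim=False, while the full model enables both flags.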

Related Work
Multilingual NMT was first proposed by Dong et al. (2015) in a one-to-many scenario and generalized by Firat et al. (2016) to the many-to-many scenario. Multilingual NMT suffers from the language diversity and model capacity problems. One direction is therefore to enlarge the model capacity, for example by introducing multiple Encoders and Decoders to handle different languages (Luong et al., 2015; Dong et al., 2015; Firat et al., 2016; Zoph and Knight, 2016), or by enhancing the attention mechanism with language-specific signals (Blackwood et al., 2018). The other direction aims at a unified framework that handles all language pairs (Ha et al., 2016; Johnson et al., 2017). These approaches handle diversity by enhancing language-specific signals, either by adding designed language tokens (Ha et al., 2016) or by using language-dependent positional encoding (Wang et al., 2018). Our work follows the second line by explicitly building a language-aware Interlingua network, which provides a much stronger language signal than previous work.
With regard to generating language-independent representations, Lu et al. (2018) and Vázquez et al. (2018) both attempted to build a similar language-independent representation. However, both are based on multiple language-dependent LSTM Encoder-Decoders, which significantly increases the model complexity. They also lack a specially designed training objective to minimize information loss and keep semantic consistency. In contrast, our work is simpler and more effective in these regards, and is validated on a much stronger TRANSFORMER-based system.

Conclusion
We have introduced a language-aware Interlingua module to tackle the language diversity problem for multilingual NMT. Experiments show that our method achieves remarkable improvements over state-of-the-art multilingual NMT baselines and produces comparable performance with strong individual models.