A Compact and Language-Sensitive Multilingual Translation Method

Multilingual neural machine translation (Multi-NMT) with one encoder-decoder model has made remarkable progress due to its simple deployment. However, this multilingual translation paradigm does not make full use of language commonality and parameter sharing between encoder and decoder. Furthermore, this kind of paradigm cannot outperform individual models trained on bilingual corpora in most cases. In this paper, we propose a compact and language-sensitive method for multilingual translation. To maximize parameter sharing, we first present a universal representor to replace both the encoder and decoder. To make the representor sensitive to specific languages, we further introduce language-sensitive embedding, attention, and discriminator modules that enhance model performance. We verify our method on various translation scenarios, including one-to-many, many-to-many and zero-shot. Extensive experiments demonstrate that our proposed method remarkably outperforms strong standard multilingual translation systems on WMT and IWSLT datasets. Moreover, we find that our model is especially helpful in low-resource and zero-shot translation scenarios.

The dominant paradigm of Multi-NMT contains one encoder to represent multiple languages and one decoder to generate output tokens of separate languages (Johnson et al., 2017; Ha et al., 2016). This paradigm is widely used in Multi-NMT systems due to its simple implementation and convenient deployment. However, it has two drawbacks. On the one hand, using a single encoder-decoder framework for all language pairs usually yields inferior performance compared to individually trained single-pair models in most cases (Lu et al., 2018; Platanios et al., 2018; Wang et al., 2018). On the other hand, although this paradigm saves many parameters compared to another Multi-NMT framework which employs separate encoders and decoders to handle different languages (Dong et al., 2015; Firat et al., 2016), parameter sharing between encoder and decoder is not fully explored. Since the encoder and decoder have similar structures but use different parameters, the commonality of languages cannot be fully exploited in this paradigm. A natural question arises: why not share the parameters between the encoder and decoder in the multilingual translation scenario?
To address these issues, we present a compact and language-sensitive method in this work, as shown in Figure 1. We first propose a unified representor by tying encoder and decoder weights in the Multi-NMT model, which not only reduces parameters but also makes full use of language commonality and universal representation. To enhance the model's ability to distinguish different languages, we further introduce language-sensitive embedding, attention, and discriminator modules.
We conduct extensive experiments to verify the effectiveness of our proposed model on various Multi-NMT tasks, including one-to-many and many-to-many, the latter of which is further divided into balanced, unbalanced and zero-shot settings. Experimental results demonstrate that our model significantly outperforms strong standard multilingual baseline systems and achieves even better performance than individually trained models on most language pairs. Specifically, our contributions are three-fold: (1) We present a universal representor to replace the encoder and decoder, leading to a compact translation model which fully explores the commonality between languages.
(2) We introduce language-sensitive embedding, attention, and discriminator modules, which augment the ability of the Multi-NMT model to distinguish different languages.
(3) Extensive experiments demonstrate the superiority of our proposed method on various translation tasks, including one-to-many, many-to-many and zero-shot scenarios. Moreover, for many-to-many translation with unbalanced translation pairs, we achieve new state-of-the-art results on IWSLT-15 English-Vietnamese. For zero-shot translation, our method achieves even better results than individually trained models with parallel corpora.

Background
In this section, we introduce the background of the encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014) and the self-attention-based Transformer (Vaswani et al., 2017).

Encoder-Decoder Framework
Given a set of sentence pairs $D = \{(x, y)\}$, the encoder $f_{enc}$ with parameters $\theta_{enc}$ maps an input sequence $x = (x_1, x_2, \cdots, x_n)$ to a sequence of continuous representations $h^{enc} = (h^{enc}_1, h^{enc}_2, \cdots, h^{enc}_n)$, whose size varies with the source sentence length. The decoder $f_{dec}$ with parameters $\theta_{dec}$ generates an output sequence $y = (y_1, y_2, \cdots, y_m)$ by computing $P(y_t \mid y_{<t})$ as follows:

$$P(y_t \mid y_{<t}) = \mathrm{softmax}\big(f_{dec}(h^{dec}_t, c_t)\big) \quad (1)$$

where $h^{dec}$ is a sequence of continuous representations for the decoder and $c_t$ is the context vector, which is calculated as follows:

$$c_t = \sum_{i=1}^{n} a_{i,t}\, h^{enc}_i \quad (2)$$

where $a_{i,t}$ is the attention weight:

$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{n} \exp(e_{j,t})} \quad (3)$$

where $e_{i,t}$ is a similarity score between the source and target representations. The parameters for computing the cross-attention weight $a_{i,t}$ are denoted as $\theta_{attn}$. The encoder and decoder are trained to maximize the conditional probability of the target sequence given the source sequence:

$$\mathcal{L}(\theta_{enc}, \theta_{dec}, \theta_{attn}) = \sum_{t=1}^{M} \log P(y_t \mid y_{<t}, x; \theta_{enc}, \theta_{dec}, \theta_{attn}) \quad (4)$$

where $M$ is the target sentence length. For simplicity, we do not specify the sentence index $d$ in this formula.
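To make Equations (2)-(3) concrete, the following minimal PyTorch sketch (ours, not the paper's code) computes the attention weights and the context vector for a single decoding step, using a plain dot product as the similarity score $e_{i,t}$; the actual scoring function is architecture-specific (e.g., Equation (5) for the Transformer).

```python
import torch

def attention_context(h_enc: torch.Tensor, h_dec_t: torch.Tensor) -> torch.Tensor:
    """Compute the context vector c_t of Equation (2).

    h_enc:   (n, d) encoder states h^enc_1 .. h^enc_n
    h_dec_t: (d,)   decoder state at step t
    A plain dot product stands in for the similarity e_{i,t}; a real model
    would plug in the architecture-specific score (e.g. Equation (5)).
    """
    e_t = h_enc @ h_dec_t                          # similarity scores e_{i,t}, shape (n,)
    a_t = torch.softmax(e_t, dim=0)                # attention weights a_{i,t}, Equation (3)
    c_t = (a_t.unsqueeze(-1) * h_enc).sum(dim=0)   # context vector, Equation (2)
    return c_t

# toy usage
h_enc = torch.randn(7, 512)    # n = 7 source positions, d = 512
h_dec_t = torch.randn(512)
print(attention_context(h_enc, h_dec_t).shape)  # torch.Size([512])
```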
Both the encoder and decoder can be implemented with different basic neural model structures, such as RNNs (LSTM/GRU) (Sutskever et al., 2014; Cho et al., 2014), CNNs (Gehring et al., 2017), and self-attention (Vaswani et al., 2017). Our proposed method can be applied to any encoder-decoder architecture. Considering the excellent translation performance of the self-attention-based Transformer (Vaswani et al., 2017), we implement our method on this architecture.

Transformer Network
The Transformer is a stacked network with several layers, each containing two or three basic blocks. A single encoder layer consists of a multi-head self-attention block and a position-wise feed-forward network. In each decoder layer, besides the above two basic blocks, a multi-head cross-attention block follows the multi-head self-attention. In this block, the similarity score $e_{i,t}$ in Equation 3 is calculated slightly differently from Luong et al. (2015) and Bahdanau et al. (2015):

$$e_{i,t} = \frac{(W_q h^{dec}_t)(W_k h^{enc}_i)^{\top}}{\sqrt{d_m}} \quad (5)$$

where $d_m$ is the dimension of hidden units, and $W_k$ and $W_q$ are parameters of this cross-attention block, which are denoted as $\theta_{attn}$ in Equation 4. All the basic blocks are associated with residual connections, followed by layer normalization (Ba et al., 2016). Since the Transformer network contains no recurrence, positional embeddings are used to make use of sequence order. More details regarding the architecture can be found in Vaswani et al. (2017).
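For illustration, here is a hedged single-head sketch of the score in Equation (5), with $W_q$ and $W_k$ as learned projections (module and variable names are ours, not the paper's implementation).

```python
import math
import torch
import torch.nn as nn

class CrossAttentionScore(nn.Module):
    """Similarity score e_{i,t} of Equation (5): a scaled dot product
    between projected decoder queries and encoder keys (single head)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_k
        self.scale = math.sqrt(d_model)                     # sqrt(d_m)

    def forward(self, h_dec: torch.Tensor, h_enc: torch.Tensor) -> torch.Tensor:
        # h_dec: (m, d), h_enc: (n, d)  ->  scores e: (m, n)
        q = self.w_q(h_dec)
        k = self.w_k(h_enc)
        return q @ k.transpose(0, 1) / self.scale

score = CrossAttentionScore(512)
e = score(torch.randn(5, 512), torch.randn(9, 512))
print(e.shape)  # torch.Size([5, 9])
```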

Multilingual Translation
In contrast to single-pair NMT models, multilingual models follow a multi-task paradigm with some degree of parameter sharing, in which models are jointly trained on multiple language pairs. We mainly focus on the mainstream multilingual translation method proposed by Johnson et al. (2017), which uses a unified encoder-decoder framework with a shared attention module for multiple language pairs. It decomposes the probability of the target sequences into products of per-token probabilities over all translation directions:

$$\mathcal{L}(\theta_{enc}, \theta_{dec}, \theta_{attn}) = \sum_{l=1}^{L} \sum_{d=1}^{|D^{l}|} \sum_{t=1}^{M} \log P(y^{l}_{t} \mid x^{l}, y^{l}_{<t}; \theta_{enc}, \theta_{dec}, \theta_{attn}) \quad (6)$$

where $L$ is the number of translation pairs and $P(y^{l}_{t} \mid x^{l}, y^{l}_{<t}; \theta)$ denotes the translation probability of the $t$-th word of the $d$-th sentence in the $l$-th translation pair. Note that the translation process for all target languages uses the same parameter set $\theta$.

Our Method
In this section, we introduce our compact and language-sensitive method for multilingual translation, which compresses the model with a representor and improves the model's ability with language-sensitive modules.

A Compact Representor
In a Multi-NMT model, the encoder and decoder are two key components, which play analogous roles and have a similar structure in each layer. We argue that the encoder and decoder can share the same parameters. Thus, we introduce a representor to replace both encoder and decoder by sharing the weight parameters of the self-attention block, feed-forward block and normalization block, as shown in Figure 2. The representor parameters are denoted as $\theta_{rep}$. Therefore, the objective function (Equation 6) becomes:

$$\mathcal{L}(\theta_{rep}, \theta_{attn}) = \sum_{l=1}^{L} \sum_{d=1}^{|D^{l}|} \sum_{t=1}^{M} \log P(y^{l}_{t} \mid x^{l}, y^{l}_{<t}; \theta_{rep}, \theta_{attn}) \quad (7)$$

This representor ($\theta_{rep}$) coordinates the semantic representations of multiple languages at a closely related universal level, which also increases the utilization of commonality across different languages.
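To make the weight sharing concrete, here is a minimal PyTorch-style sketch of the representor idea (the paper's implementation is built on tensor2tensor; all names here are illustrative): the same stack of self-attention/feed-forward layers is reused for both the encoding and the decoding pass, so $\theta_{enc}$ and $\theta_{dec}$ collapse into one $\theta_{rep}$, while the cross-attention parameters $\theta_{attn}$ remain separate.

```python
import torch
import torch.nn as nn

class Representor(nn.Module):
    """A single shared stack that plays both the encoder and decoder roles.
    Self-attention, feed-forward and layer-norm parameters are shared (theta_rep);
    only the cross-attention modules (theta_attn) stay separate.
    Causal masking and the paper's exact block ordering are omitted for brevity."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff=2048):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_layers)
        ])
        self.cross_attn = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])

    def encode(self, src_emb):
        h = src_emb
        for layer in self.layers:
            h = layer(h)                        # shared self-attention + feed-forward
        return h

    def decode(self, tgt_emb, memory):
        h = tgt_emb
        for layer, attn in zip(self.layers, self.cross_attn):
            h = layer(h)                        # the SAME shared parameters as in encode()
            h = h + attn(h, memory, memory)[0]  # cross-attention over encoder states
        return h

rep = Representor()
memory = rep.encode(torch.randn(2, 9, 512))     # source sentence of length 9
out = rep.decode(torch.randn(2, 5, 512), memory)
print(out.shape)  # torch.Size([2, 5, 512])
```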

Language-Sensitive Modules
The compact representor maximizes parameter sharing and makes full use of language commonality. However, it lacks the ability to discriminate between different languages. In our method, we introduce three language-sensitive modules to enhance the model, as follows: 1) Language-Sensitive Embedding: Previously, Press and Wolf (2017) tied the weights of input and output embeddings in NMT models. Generally, a shared vocabulary is built upon subword units like BPE (Sennrich et al., 2016b) and wordpiece (Wu et al., 2016; Schuster and Nakajima, 2012). However, it remains under-explored which kind of embedding sharing is best for Multi-NMT. We divide the sharing manners into four categories: the language-based manner (LB, different languages have separate input embeddings), the direction-based manner (DB, languages on the source side and target side have different input embeddings), the representor-based manner (RB, shared input embeddings for all languages) and the three-way weight tying manner (TWWT) proposed in Press and Wolf (2017), in which the output embedding of the target side is also shared in addition to representor-based sharing. We compare these four sharing manners for Multi-NMT in our experiments and discuss the results in Section 5.
Considering that the last three sharing manners cannot model which language a token belongs to, we propose a new language-sensitive embedding to specify different languages explicitly. Similar to the position embeddings described in Section 2, this embedding is added to the embedding of each token of the corresponding language, which indicates the translation direction on the source side and guides the generation process for target languages. This embedding is denoted as $E_{lang} \in \mathbb{R}^{|K| \times d_{model}}$, where $|K|$ is the number of languages involved and $d_{model}$ is the dimension of hidden states in our model. Note that this embedding is learned during training.
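Here is a minimal sketch of how such a language-sensitive embedding could be combined with the shared token and position embeddings (a learned positional embedding is used for brevity; class and argument names are illustrative, not the paper's code).

```python
import torch
import torch.nn as nn

class LanguageSensitiveEmbedding(nn.Module):
    """Token embedding + positional embedding + a learned per-language vector
    (E_lang in R^{|K| x d_model}) broadcast over every position."""
    def __init__(self, vocab_size: int, num_languages: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)      # shared (representor-based) token embedding
        self.pos = nn.Embedding(max_len, d_model)         # positional embedding (learned here for brevity)
        self.lang = nn.Embedding(num_languages, d_model)  # E_lang, learned during training

    def forward(self, tokens: torch.Tensor, lang_id: int) -> torch.Tensor:
        # tokens: (batch, seq_len) token ids; lang_id identifies the language of this side
        positions = torch.arange(tokens.size(1), device=tokens.device)
        lang_vec = self.lang(torch.tensor(lang_id, device=tokens.device))
        return self.tok(tokens) + self.pos(positions) + lang_vec  # broadcast over batch and positions

# usage: embed a batch for one language of a translation direction
emb = LanguageSensitiveEmbedding(vocab_size=37000, num_languages=4, d_model=512)
x = torch.randint(0, 37000, (2, 10))
print(emb(x, lang_id=1).shape)  # torch.Size([2, 10, 512])
```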
2) Language-Sensitive Attention: In the NMT architecture, cross-attention, which appears only in the decoder network, locates the most relevant source parts when generating each token in the target language. For Multi-NMT, we investigate three different ways to design the cross-attention mechanism: i) shared attention, ii) hybrid attention, and iii) the language-sensitive attention utilized in our method.
i): In our proposed compact representor, we share the self-attention block between encoder and decoder. For the shared attention, we take a further step and share the parameters of cross-attention and self-attention, which can be regarded as coordinating information from both the source side and the target side.
ii): Different from the above mechanism, the hybrid attention uses an independent cross-attention module, but it is shared across all translation tasks.
iii): The language-sensitive attention allows the model to dynamically select the cross-attention parameters associated with the specific translation task.
In this paper, we investigate all three attention mechanisms. We argue that both the shared and hybrid mechanisms tend to get confused when extracting information from different source languages, particularly when decoding from multiple source languages with different word orders. Thus, we mainly focus on language-sensitive attention in our method.
To this end, we use multiple sets of parameters $\theta_{attn}$ to represent the cross-attention modules of different translation tasks, as sketched below. However, language-sensitive attention does not support zero-shot translation because there is no explicit training set for such a translation task. Therefore, we employ the hybrid attention mechanism in our zero-shot experiments.
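The following small sketch shows how language-sensitive attention might dispatch to a per-task cross-attention parameter set, keyed by the translation direction (names and the ModuleDict-based design are ours, not the paper's implementation).

```python
import torch
import torch.nn as nn

class LanguageSensitiveCrossAttention(nn.Module):
    """Keeps one cross-attention parameter set (theta_attn) per translation task
    and selects it dynamically from a task identifier such as "en-de"."""
    def __init__(self, tasks, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.ModuleDict({
            task: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for task in tasks
        })

    def forward(self, task: str, h_dec: torch.Tensor, h_enc: torch.Tensor) -> torch.Tensor:
        # h_dec: (batch, tgt_len, d), h_enc: (batch, src_len, d)
        out, _ = self.attn[task](h_dec, h_enc, h_enc)
        return out

attn = LanguageSensitiveCrossAttention(tasks=["en-de", "en-fi", "en-vi"])
h_dec, h_enc = torch.randn(2, 5, 512), torch.randn(2, 9, 512)
print(attn("en-de", h_dec, h_enc).shape)  # torch.Size([2, 5, 512])
```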
3) Language-Sensitive Discriminator: In our method, the representor, which shares the encoder and decoder, makes full use of language commonality, but it weakens the model's ability to distinguish different languages. Hence we introduce a language-sensitive discriminator to strengthen the model's representations.
In the NMT framework, the hidden states on the top layer can be viewed as a fine-grained abstraction (Anastasopoulos and Chiang, 2018). For this language-sensitive module, we first employ a neural model $f_{dis}$ on the top-layer states of the representor $h^{rep}_{top}$, and the output of this model is a language judgment score $P_{lang}$:

$$P_{lang}(d) = \mathrm{softmax}\big(W_{dis}\, f_{dis}(h^{rep}_{top}) + b_{dis}\big) \quad (8)$$

where $P_{lang}(d)$ is the language judgment score for sentence pair $d$, and $W_{dis}$, $b_{dis}$ are parameters, which are denoted as $\theta_{dis}$. We test two different types of neural models for $f_{dis}$: a convolutional network with a max-pooling layer and a two-layer feed-forward network. We then obtain a discriminant objective function as follows:

$$\mathcal{L}_{dis}(\theta_{rep}, \theta_{dis}) = \sum_{d=1}^{|D|} \sum_{k=1}^{|K|} \mathbb{I}\{g_d = k\}\, \log P^{k}_{lang}(d) \quad (9)$$

where $\mathbb{I}\{\cdot\}$ is the indicator function, $g_d = k$ means that sentence pair $d$ belongs to language $k$, and $P^{k}_{lang}(d)$ is the probability that the discriminator assigns to language $k$.
Finally, we incorporate the language-sensitive discriminator into our Multi-NMT model, which can be optimized in an end-to-end manner for all translation language pairs $D$ with the following objective function:

$$\mathcal{L}(\theta_{rep}, \theta_{attn}, \theta_{dis}) = \mathcal{L}(\theta_{rep}, \theta_{attn}) + \lambda\, \mathcal{L}_{dis}(\theta_{rep}, \theta_{dis}) \quad (10)$$

where $\lambda$ is a learned or pre-defined weight that balances the translation task and the language judgment task, and $\mathcal{L}(\theta_{rep}, \theta_{attn})$ is the translation objective of Equation 7.
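As an illustration of Equations (8)-(10), here is a minimal sketch using the two-layer feed-forward variant of $f_{dis}$; the mean-pooling step is our own assumption (the paper also tests a convolutional discriminator with max pooling), and the log-likelihood of Equation (9) is written as a minimized cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    """Two-layer feed-forward f_dis on the top-layer representor states,
    producing the language judgment scores P_lang (Equation (8))."""
    def __init__(self, d_model: int, num_languages: int, d_hidden: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU())
        self.out = nn.Linear(d_hidden, num_languages)  # W_dis, b_dis

    def forward(self, h_rep_top: torch.Tensor) -> torch.Tensor:
        # h_rep_top: (batch, seq_len, d_model); mean-pooling over positions is an assumption
        pooled = self.ffn(h_rep_top).mean(dim=1)
        return self.out(pooled)                        # logits over the |K| languages

def joint_loss(translation_loss, lang_logits, lang_labels, lam=0.05):
    """Equation (10): translation loss plus a lambda-weighted language judgment loss.
    Equation (9) is expressed here as a minimized cross-entropy over languages."""
    disc_loss = F.cross_entropy(lang_logits, lang_labels)
    return translation_loss + lam * disc_loss

# toy usage
disc = LanguageDiscriminator(d_model=512, num_languages=4)
h_top = torch.randn(8, 20, 512)            # top-layer states for a batch of 8 sentences
labels = torch.randint(0, 4, (8,))         # gold language id g_d of each sentence
loss = joint_loss(torch.tensor(2.3), disc(h_top), labels)
print(loss.item())
```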
Experimental Settings

Data
In this section, we describe the datasets used in our experiments on the one-to-many and many-to-many multilingual translation scenarios.
Many-to-Many: For many-to-many translation, we test our methods on the IWSLT-17 translation datasets, including English, Italian, Romanian, and Dutch (briefly, En, It, Ro, Nl). In order to perform zero-shot translation, we discard some particular language pairs. We also evaluate our method on an unbalanced training corpus. To this end, we construct the training corpus using resource-rich En-De and En-Fi from the WMT datasets and low-resource English-Vietnamese (briefly, En-Vi) from IWSLT-15.
The statistical information of all the datasets is detailed in Table 1.

Training Details
We implement our compact and language-sensitive method for Multi-NMT based on the tensor2tensor library. We use the wordpiece method (Wu et al., 2016; Schuster and Nakajima, 2012) to encode the combination of source-side and target-side sentences. The vocabulary size is 37,000 for both sides. We train our models using the transformer_base configuration adopted by Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 512-dimensional hidden representations. Each mini-batch contains roughly 3,072 source and 3,072 target tokens belonging to one translation direction. We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. For evaluation, we use beam search with beam size $k = 4$ and length penalty $\alpha = 0.6$. All our models are trained and tested on a single Nvidia P40 GPU.
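For reference only, the stated optimizer and decoding settings map onto standard API calls roughly as follows (the actual experiments use the tensor2tensor transformer_base configuration, not this code).

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the Multi-NMT model parameters
optimizer = torch.optim.Adam(model.parameters(),
                             betas=(0.9, 0.98),  # beta_1 and beta_2 reported above
                             eps=1e-9)           # epsilon = 10^-9
BEAM_SIZE, LENGTH_PENALTY = 4, 0.6               # decoding settings reported above
```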

Results and Analysis
In this section, we discuss the experimental results of our compact and language-sensitive method for Multi-NMT. Translation performance is evaluated by character-level BLEU-5 for En→Zh translation and case-sensitive BLEU-4 (Papineni et al., 2002) for the other translation tasks.
In our experiments, the models trained on individual language pairs are denoted NMT Baselines, and the baseline Multi-NMT models are denoted Multi-NMT Baselines.

Main Results
The main results on the one-to-many translation scenario, including the one-to-two, one-to-three and one-to-four translation tasks, are reported in Table 2.

Figure 3: Comparison of model scale among the individually trained systems, the baseline Multi-NMT systems and our methods. The y-axis represents the model parameters per language pair, calculated by averaging model parameters over all translation tasks involved.
With respect to our proposed method, it is clear that our compact method consistently outperforms the baseline systems. Compared with another strong one-to-many translation model, Three-Stgy, proposed by Wang et al. (2018), our compact method achieves better results as well. Moreover, our method performs even better than the individually trained systems in most cases (eleven out of sixteen). These results demonstrate the effectiveness of our method.

Model Size
Besides improving the translation results, we also compress the model size by introducing the representor. We investigate the number of parameters used on average per translation direction, comparing three models: the NMT Baselines model, the Multi-NMT Baselines model, and our compact Multi-NMT model. As shown in Figure 3, all the multilingual translation models reduce the number of parameters. Compared with the Multi-NMT Baselines, our method further reduces the model size of Multi-NMT. Considering Table 2 and Figure 3 together, we note that even though our proposed method in the one-to-four translation task uses only 18.8% of the parameters of the NMT Baselines, it achieves better performance on En→Zh and En→Lv.

Discussion of Language-Sensitive Modules

Table 2 shows that our proposed language-sensitive modules are complementary with each other. In this subsection, we analyze each module in detail.

Language-Sensitive Embedding: As mentioned in Section 3.2, the embedding sharing manners for Multi-NMT are divided into four categories. We show the results of these sharing manners in Table 3. To make a fair comparison, we sample 4.5M sentence pairs from the En-Zh dataset. As shown in this table, our representor-based sharing manner consistently outperforms both the direction-based manner and the three-way weight tying manner. Furthermore, even though the representor-based manner has about 40% fewer parameters than the language-based manner, it achieves comparable or even better performance. We find that the language-based sharing manner is unstable: it achieves the highest BLEU score on Multi-NMT of similar languages (En→De/Zh) but the worst quality on dissimilar languages (En→De/Lv). Taking translation quality and stability into account, we choose the representor-based sharing manner in our method.
As described in Section 3.2, our proposed language-sensitive embedding is added to the input embedding of each token, unlike the conventional Multi-NMT method, which adds a special token to the source-side sentences or vocabularies (Johnson et al., 2017; Ha et al., 2016). A natural question is whether this kind of embedding is essential given our representor. To verify this, we conduct an ablation study without this module. We observe that the Multi-NMT model does not converge during training, which demonstrates that the language-sensitive embeddings play a significant role in our model.

Language-Sensitive Attention:
We present three types of cross-attention mechanisms in Section 3.2. We adopt shared attention and language-sensitive attention for Rep+Emb and Rep+Emb+Attn, respectively. Comparing these two methods in Table 2, the Rep+Emb+Attn method outperforms the Rep+Emb method in all cases, which demonstrates that the language-sensitive attention is useful for multiple language pairs with different word orders. We also conduct experiments with our representor and the hybrid attention mechanism. Since this method performs similarly to Rep+Emb but is larger in size, we omit its results here.
Language-Sensitive Discriminator: In Section 3.2, we employ two different types of neural models as the language-sensitive discriminator, and there is a hyper-parameter $\lambda$ in Equation 10. We present the effect of the convolutional network and the feed-forward network with different hyper-parameters on the development datasets in Figure 4. Considering that distinguishing between languages is only an auxiliary task in Multi-NMT, we set the maximum of $\lambda$ to 0.1. As shown in Figure 4, when we adopt the convolutional network as our discriminator with $\lambda = 0.05$, our language-sensitive method performs best. We also conduct experiments in which the hyper-parameter $\lambda$ is learnable. The results are similar to the best settings mentioned above on both En→De (23.35 vs. 23.19) and En→Lv (22.97 vs. 22.72). For simplicity, all our experiments listed in Tables 2 and 4 adopt the convolutional network as the language-sensitive discriminator with $\lambda = 0.05$.

Table 4 reports the detailed results of different methods under the many-to-many translation scenario. We analyze the performance below.

Table 4: Translation performance under the many-to-many scenario, consisting of supervised four-to-four and zero-shot translation on the balanced corpus, and supervised three-to-three translation on the unbalanced corpus. Note that we do not use the Nl-Ro and It-Nl language pairs in our many-to-many translation task for the balanced corpus.

Results of Balanced Corpus
In part I of Table 4, we report the supervised four-to-four translation results on the balanced corpus.

Zero-Shot Results
Part II of Table 4 shows the performance of zero-shot translation. Note that we conduct the experiments for this scenario using the hybrid attention mechanism. Compared with the Multi-NMT Baselines, our compact and language-sensitive method performs significantly better, with an improvement as large as 2.28 BLEU points on Nl→It. Note that the training datasets contain no parallel data for Nl-Ro and It-Nl. It is interesting to examine the translation performance of Nl↔Ro and It↔Nl when a bilingual training corpus is available. We therefore train NMT Baselines on Nl-Ro and It-Nl with all available sentence pairs, which is similar in scale to the other training pairs in our balanced corpus. As shown in part II, the Multi-NMT Baselines underperform the NMT Baselines in all cases. However, our method performs better than the NMT Baselines, achieving an improvement of up to 1.76 BLEU points on the Nl→It translation task.

Related Work
Our work is related to two lines of research, which we describe as follows. Model Compactness and Multi-NMT: To reduce the model size in NMT, weight pruning, knowledge distillation, quantization, and weight sharing (Kim and Rush, 2016; See et al., 2016; He et al., 2018; Zhou et al., 2018) have been explored. Due to the benefit of compactness, multilingual translation has been extensively studied in Dong et al. (2015) and Johnson et al. (2017). Owing to its excellent translation performance and ease of use, many researchers (Blackwood et al., 2018; Lakew et al., 2018) have conducted translation based on the framework of Johnson et al. (2017) and Ha et al. (2016). Zhou et al. (2019) propose to perform decoding in two translation directions synchronously, which can be applied to different target languages and opens a new research direction for Multi-NMT. In our method, we present a compact method for Multi-NMT, which not only compresses the model but also yields superior performance.
Low-Resource and Zero-Shot NMT: Many researchers have explored low-resource NMT using transfer learning (Neubig and Hu, 2018) and data augmentation (Sennrich et al., 2016a; Zhang and Zong, 2016) approaches. For zero-shot translation, pivot-based methods have been utilized, which bridge source-to-pivot and pivot-to-target translation in two steps. Multilingual translation is another direction for dealing with both low-resource and zero-shot translation. Gu et al. (2018) enable sharing of lexical and sentence representations across multiple languages, especially for extremely low-resource Multi-NMT. Firat et al. (2016), Lakew et al. (2017), and Johnson et al. (2017) propose to make use of multilinguality in Multi-NMT to address the zero-shot problem. In this work, we propose a method for Multi-NMT that boosts the accuracy of multilingual translation and fits both the low-resource and zero-shot scenarios better.

Conclusion
In this paper, we have proposed a compact and language-sensitive method for multilingual translation. We first introduce a representor to replace both the encoder and decoder so as to fully explore the commonality among languages. Based on the representor architecture, we then propose three language-sensitive modules, dealing with embedding, attention and language discrimination respectively, in order to give the multilingual translation model the ability to distinguish among different languages. Empirical experiments demonstrate that our proposed method outperforms strong standard multilingual translation systems on one-to-many and many-to-many translation tasks. Moreover, our method proves to be especially helpful in the low-resource and zero-shot translation scenarios.