Three Strategies to Improve One-to-Many Multilingual Translation

Due to the benefits of model compactness, multilingual translation (including many-to-one, many-to-many and one-to-many) based on a universal encoder-decoder architecture attracts more and more attention. However, previous studies show that one-to-many translation based on this framework cannot perform on par with the individually trained models. In this work, we introduce three strategies to improve one-to-many multilingual translation by balancing the shared and unique features. Within the architecture of one decoder for all target languages, we first exploit the use of unique initial states for different target languages. Then, we employ language-dependent positional embeddings. Finally and especially, we propose to divide the hidden cells of the decoder into shared and language-dependent ones. The extensive experiments demonstrate that our proposed methods can obtain remarkable improvements over the strong baselines. Moreover, our strategies can achieve comparable or even better performance than the individually trained translation models.


Introduction
Encoder-decoder based neural machine translation (NMT) has achieved state-of-the-art results thanks to powerful end-to-end modeling (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016). Under this end-to-end framework, many researchers attempt to improve the translation quality between two languages by exploiting monolingual data (Sennrich et al., 2016; Zhang and Zong, 2016), by taking advantage of both NMT and statistical machine translation (Wang et al., 2017a; Tang et al., 2016; Zhao et al., 2018; Zhou et al., 2017), and so on.
Another research direction, namely how to perform multilingual translation within this encoder-decoder architecture, has recently drawn increasing attention (Dong et al., 2015; Luong et al., 2016; Johnson et al., 2017; Firat et al., 2016b).
In multilingual translation scenarios, one can employ a multi-task learning framework to perform many-to-one or one-to-many translation using multiple encoders or multiple decoders (Luong et al., 2016; Dong et al., 2015). Firat et al. (2016a) and Lu et al. (2018) further propose to share a universal attention mechanism for many-to-many translation. In these methods, the encoder or decoder is language-dependent and the number of network parameters increases linearly with the number of languages. Johnson et al. (2017) and Ha et al. (2016) present an appealing approach in which a universal encoder-decoder framework is designed for many-to-one, many-to-many and one-to-many multilingual translation tasks. The network model is compact and the model size does not grow as the number of languages increases. However, Johnson et al. (2017) observe that only the many-to-one paradigm can achieve better translation results than the individually trained models. For the other two paradigms, there are various degrees of quality degradation. In this work, we focus on one-to-many multilingual translation under the universal encoder-decoder framework and attempt to boost its performance while maintaining the model compactness.
To this end, we propose three strategies which exploit the unique features of each target language while keeping as many parameters shared as possible. First, we design two special labels at the tail of the encoder and the head of the decoder to mark the target language and guide the generation of different target languages. Then, we introduce language-dependent positional embeddings into the bottom layer of the decoder network so that the structural differences between target languages can be captured. Finally and especially, we propose a new parameter-sharing mechanism in which we divide the hidden units of each decoder layer into shared and language-dependent ones.
We verify the effectiveness of our proposed methods on two one-to-many tasks: Chinese-to-English/Japanese translation and English-to-German/French translation. The experimental results demonstrate that the three strategies significantly outperform the baseline multilingual models and can achieve comparable or even better performance than the individually trained translation models.
Specifically, our contributions in this paper are two-fold:
• The three proposed strategies can take advantage of the unique features of each target language while sharing as many network parameters as possible.
• The extensive experiments on multiple translation tasks show that the three proposed strategies improve the translation quality. Moreover, the effects of the strategies are complementary and the combined one can perform on par with or better than the individually optimized translation models.

Background
Our proposed approach can be applied to any encoder-decoder architecture. Considering the excellent translation performance of the Transformer network (Vaswani et al., 2017), we implement our method entirely on top of it in this work. The Transformer consists of stacked encoder and decoder layers. The encoder maps an input sequence $x = (x_1, x_2, \cdots, x_n)$ to a sequence of continuous representations $z = (z_1, z_2, \cdots, z_n)$ whose size varies with the source sentence length. The decoder generates an output sequence $y = (y_1, y_2, \cdots, y_m)$ from the continuous representations $z$. Since the Transformer network contains no recurrence, positional embeddings are used in the model to make use of sequence order. The encoder and decoder are trained to maximize the conditional probability of the target sequence given a source sequence:
$$P(y \mid x; \theta) = \prod_{t=1}^{m} P(y_t \mid x, y_{<t}; \theta)$$
For the sake of brevity, we refer the reader to Vaswani et al. (2017) for more details regarding the architecture.

Method Description
In this section, we introduce our general strategies for extending the Transformer network to the one-to-many translation task. We decompose the probability of the target sequences into the product of per-token probabilities across all target languages:
$$\prod_{l=1}^{M} P(y^l \mid x; \theta) = \prod_{l=1}^{M} \prod_{t=1}^{m_l} P(y^l_t \mid x, y^l_{<t}; \theta)$$
where $M$ is the number of target languages, $m_l$ is the length of the target sentence in the $l$-th language, and $P(y^l_t \mid x, y^l_{<t}; \theta)$ denotes the translation probability of the $t$-th word of the $l$-th target language. Note that the translation process for all target languages uses the same parameter set $\theta$.
Our methods mainly concentrate on improving one-to-many multilingual translation by designing a new decoder structure under the universal encoder-decoder framework. The idea is to exploit the shared and unique features of different target languages. To this end, we propose three strategies: special label initialization, language-dependent positional embedding and a new parameter-sharing mechanism.
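The factorization described above can be illustrated with a small numerical sketch. The per-token probabilities below are made-up values standing in for the model outputs, and the function names are hypothetical; the only point is that the joint log-probability is a sum over target languages and, within each language, over tokens.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of log per-token probabilities for one target sentence."""
    return sum(math.log(p) for p in token_probs)

def joint_log_prob(per_language_token_probs):
    """Sum the sentence log-probabilities over all target languages.

    Every language is scored by the same model, mirroring the single
    shared parameter set theta in the text.
    """
    return sum(sequence_log_prob(probs)
               for probs in per_language_token_probs.values())

# Hypothetical values of P(y^l_t | x, y^l_<t; theta) for two target languages.
example = {
    "en": [0.5, 0.25],
    "ja": [0.5, 0.5, 0.5],
}
total = joint_log_prob(example)
```

Working in log space turns the double product into a double sum, which is the form normally used as a training objective.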

Special Label Initialization
In the universal encoder-decoder network for one-to-many multilingual translation (Johnson et al., 2017), a special token (e.g. en2fr) is added at the end of the source sentence to indicate the translation direction. Although it is an effective mechanism, we find that the initial states of the decoder are very important to guide the generation process for different target languages. In order to enhance the model, we utilize another special language-dependent label at the beginning of the decoder and we regard it as the first generated token of the target language (e.g. 2fr).
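As a minimal sketch of the labeling scheme described above, the helper below appends the direction token to the source and prepends the target-language label to the decoder input. The function name and the whitespace-tokenized example are assumptions for illustration; only the label forms en2fr and 2fr come from the text.

```python
def add_language_labels(source_tokens, target_tokens, src_lang, tgt_lang):
    """Attach the two special labels used for one-to-many translation.

    - The direction token (e.g. "en2fr") goes at the END of the source.
    - The target-language label (e.g. "2fr") goes at the START of the
      target and is treated as its first generated token.
    """
    direction_label = f"{src_lang}2{tgt_lang}"
    target_label = f"2{tgt_lang}"
    return source_tokens + [direction_label], [target_label] + target_tokens

src, tgt = add_language_labels(
    ["hello", "world"], ["bonjour", "le", "monde"], "en", "fr"
)
```

At decoding time the target label would be supplied as a forced first token, so that the decoder starts from a language-specific state.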

Language-dependent Positional Embedding
Positional embeddings give the model a sense of which part of the sequence is currently being processed. Intuitively, different target languages should have different positional embeddings to distinguish the structural differences between them. Therefore, we design language-dependent positional embeddings for universal encoder-decoder multilingual translation. For the fixed embedding method (Vaswani et al., 2017), sine and cosine functions are used to generate positional embeddings. In this case, we introduce trigonometric functions with different orders or offsets on the decoder to distinguish different target languages. For the dynamic embedding method (Gehring et al., 2017), we equip the target inputs by embedding the absolute positions of different languages separately.
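The two variants can be sketched as follows. The sinusoidal formula is the standard one from Vaswani et al. (2017); the text does not specify the exact orders or offsets, so the per-language phase offset used here, the offset values, and all function names are assumptions chosen only for illustration. The dynamic variant is sketched as one randomly initialized (to-be-learned) position table per language.

```python
import math
import random

def sinusoidal_embedding(position, d_model, phase=0.0):
    """Fixed positional embedding with an optional per-language phase offset."""
    vec = []
    for i in range(0, d_model, 2):  # i plays the role of 2k in the usual formula
        angle = position / (10000 ** (i / d_model)) + phase
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

# Hypothetical phase offsets, one per target language.
LANGUAGE_PHASE = {"en": 0.0, "ja": 0.5}

def fixed_position(position, d_model, lang):
    return sinusoidal_embedding(position, d_model, LANGUAGE_PHASE[lang])

def make_position_tables(languages, max_len, d_model, seed=0):
    """Dynamic variant: a separate trainable position table per language."""
    rng = random.Random(seed)
    return {
        lang: [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
               for _ in range(max_len)]
        for lang in languages
    }

tables = make_position_tables(["en", "ja"], max_len=4, d_model=8)
```

In both variants the same absolute position maps to a different vector depending on the target language, which is the property the strategy relies on.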

Shared and Language-dependent Hidden Units per Layer
In the universal encoder-decoder multilingual translation, the hidden layers of the decoder are responsible for generating different target language sentences. As a result, the hidden layers should embody some language-dependent information.
In this work, we propose to divide the hidden units of each decoder layer into shared units and language-dependent ones. On the one hand, shared units can learn the commonality of languages and enable one-to-many translation to share as many network parameters as possible. On the other hand, language-dependent units are capable of capturing the characteristics of each specific language. Figure 1 gives a brief description of our proposed strategy. For instance, in a training step for one target language (tar-1), we tune the shared units and the language-dependent units of tar-1, and mask out the other parts. In the decoding step, we only use the shared and language-dependent hidden units of target language tar-1 to predict translation results.
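One way to picture the masking described above is a per-language binary mask over the hidden dimension. This is a sketch under assumptions, not the authors' implementation: the layout (shared units first, then one contiguous block per language), the function names and the toy hidden size are all hypothetical.

```python
def build_unit_masks(hidden_size, languages, shared_fraction=0.5):
    """Return one 0/1 mask per language over the hidden units of a layer.

    The first `shared_fraction` of the units is shared by all languages;
    the remaining units are split evenly into language-dependent blocks.
    """
    n_shared = int(hidden_size * shared_fraction)
    n_private = (hidden_size - n_shared) // len(languages)
    masks = {}
    for idx, lang in enumerate(languages):
        mask = [0.0] * hidden_size
        for j in range(n_shared):
            mask[j] = 1.0  # shared units are active for every language
        start = n_shared + idx * n_private
        for j in range(start, start + n_private):
            mask[j] = 1.0  # this language's own units
        masks[lang] = mask
    return masks

def apply_mask(hidden, mask):
    """Zero out the units that belong to other target languages."""
    return [h * m for h, m in zip(hidden, mask)]

masks = build_unit_masks(8, ["tar-1", "tar-2"], shared_fraction=0.5)
masked = apply_mask([1.0] * 8, masks["tar-2"])
```

Applying the same mask during training and decoding means that, for a given target language, only the shared units and that language's own units contribute to the output. In an automatic-differentiation framework, the masked-out units would also receive no gradient through this layer.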

Experimental Settings
In this section, we test the proposed methods on two one-to-many translation tasks, including (i) Chinese→English/Japanese in general domain, and (ii) English→French/German in WMT14 task.
Chinese→English/Japanese For this translation task, the training sets of Chinese-to-English (briefly, Zh→En) and Chinese-to-Japanese (briefly, Zh→Ja) each contain about 10 million parallel sentence pairs. We evaluate our methods on NIST03-06 (MT03-06) for Zh→En translation and on 400 sentences extracted from our general corpus for Zh→Ja translation.
English→French/German The training set consists of about 4.5 million bilingual sentence pairs in the WMT14 English-German (briefly, En→De) task and about 36 million sentence pairs in the WMT14 English-French (briefly, En→Fr) task. We use the combination of newstest2012 and newstest2013 as our validation set, and we use newstest2014 as our test set on the En→De and En→Fr tasks.
We adopt the tensor2tensor library for training and evaluating our basic Transformer translation model. We use the wordpiece method (Wu et al., 2016; Schuster and Nakajima, 2012) to encode the source-side sentences and the combination of target-side sentences. The vocabulary size is 37,000 for both sides. We train our models using the transformer_big configuration adopted by Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 1024-dimensional hidden representations. During training, each mini-batch on one GPU contains a set of sentence pairs with roughly 3,072 source and 3,072 target tokens. We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. For our model, we train for 400,000 steps on one machine with 8 NVIDIA Tesla M40 GPUs.
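For quick reference, the settings listed above can be collected in a small configuration object. This is only a summary sketch: the class and field names are hypothetical and are not tensor2tensor hyperparameter names, and the per-step token count assumes synchronous data-parallel training across the 8 GPUs, which the text does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingSetup:
    """Values taken from the experimental settings; names are illustrative."""
    encoder_layers: int = 6
    decoder_layers: int = 6
    hidden_size: int = 1024
    vocab_size: int = 37000
    source_tokens_per_gpu: int = 3072  # approximate, per mini-batch
    target_tokens_per_gpu: int = 3072  # approximate, per mini-batch
    num_gpus: int = 8
    train_steps: int = 400000
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    adam_epsilon: float = 1e-9

    def source_tokens_per_step(self):
        # Assumes every GPU processes its own mini-batch in each step.
        return self.source_tokens_per_gpu * self.num_gpus

setup = TrainingSetup()
```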

Results and Analysis
We show the results of one-to-many translation experiments using our proposed strategies. The translation performance is evaluated by case-insensitive BLEU4 for Zh→En translation, character-level BLEU5 for Zh→Ja translation, and case-sensitive BLEU4 (Papineni et al., 2002) for the En→De/Fr translation tasks. Table 1 reports the main translation results of the Zh→En/Ja and En→De/Fr translation tasks.

Table 1: Translation performance of our methods on Zh→En/Ja and En→De/Fr tasks. Indiv means the translation model of an individual language pair. O2M is our baseline system. 1, 2 and 3 denote our three proposed strategies of special label initialization, language-dependent positional embedding and the new parameter-sharing mechanism, respectively. 2 (Dyn) and 2 (Fixed) represent the two variants of the language-dependent positional embedding method. For the shared and language-dependent method, we set one half of the hidden units as shared units and split the other half so that each of the two output languages is assigned a quarter of the hidden units.

We apply the universal one-to-many translation method of Johnson et al. (2017) on the Transformer framework as our baseline system (briefly, the O2M method). From the first two lines, we can see that the O2M method cannot perform on par with the individually trained systems in most cases. As mentioned before, our goal is to improve the universal one-to-many multilingual translation framework while maintaining the parameter-sharing property. We can observe from the table that all our proposed strategies (last part of Table 1) improve the translation performance compared to the baseline (O2M). Specifically, the combined use of the three strategies performs best and achieves improvements of up to 1.96 BLEU points (45.51 vs. 43.55 on Zh→En MT04). As for the language-dependent positional embedding, we find that the fixed and dynamic styles perform similarly.
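To make the character-level metric mentioned above concrete, the sketch below computes a sentence-level, single-reference BLEU over characters with n-grams up to order 5. It is a simplified illustration under stated assumptions: "BLEU5" is read as using 1- to 5-grams, no smoothing is applied, and reported scores are normally computed at the corpus level with standard tooling rather than with a helper like this. The function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(symbols, n):
    """Count all n-grams of order n in a list of symbols."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

def char_bleu(hypothesis, reference, max_n=5):
    """Unsmoothed sentence-level BLEU over characters (whitespace removed)."""
    hyp = [c for c in hypothesis if not c.isspace()]
    ref = [c for c in reference if not c.isspace()]
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)

identical = char_bleu("今日は良い天気です", "今日は良い天気です")
truncated = char_bleu("今日は良い天気", "今日は良い天気です")
```

Scoring at the character level avoids the need for word segmentation on the Japanese side, which is presumably why it is used for Zh→Ja.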

Our Strategies vs. Baseline
Our ultimate goal is to make the universal one-to-many framework as good as or better than the individually trained systems. Table 1 demonstrates some encouraging results. It shows that the universal one-to-many architecture enhanced with our strategies can outperform the individually trained models on three out of four translation directions (Zh→En, Zh→Ja, En→Fr). The results verify the effectiveness of our proposed methods.

Comparison of Shared Unit Size
For the new parameter-sharing mechanism, it is an open question how many hidden units should be shared and how many should be language-dependent. To answer this question, we further conduct an experiment to investigate the effect of the shared unit size. Figure 2 reports the results. We can observe different trends for different language pairs. On the En→De/Fr translation task, the performance is best when we share one half of the hidden units. In contrast, the best results on Zh→En/Ja translation are obtained when we share only 37.5% of the hidden units. This indicates that similar languages (De/Fr) can share more hidden units, while languages that differ greatly (En/Ja) may share fewer hidden units.

Related Work
In this work, we explore the balancing problem of shared and unique parameters, and attempt to incorporate the language-dependent presentation features to distinguish different target languages under the scenario of one-to-many multilingual translation.
Multilingual translation has been extensively studied in Dong et al. (2015), Firat et al. (2016a), Luong et al. (2016) and Johnson et al. (2017). Owing to its excellent translation performance and ease of use, many researchers (Blackwood et al., 2018; Lakew et al., 2018) have conducted translation of multiple languages based on the framework of Johnson et al. (2017) and Ha et al. (2016). As for the low-resource translation scenario (Wang et al., 2017b), similar to the above method, Gu et al. (2018) enable sharing of lexical and sentence representations across multiple languages, especially for low-resource multilingual NMT. Different from previous methods, our work mainly focuses on improving the one-to-many multilingual translation framework while sharing as many parameters as possible.

Conclusion
In this paper, we have proposed three effective strategies to improve the universal one-to-many multilingual translation, including special label initialization, language-dependent positional embedding and a new parameter-sharing mechanism. The empirical experiments on four language pairs demonstrate that our strategies can obtain significant improvement over the strong baseline, and can achieve comparable or even better results than the individually trained models.
For future work, we plan to extend our strategies to many-to-many multilingual translation scenarios, and to explore other effective strategies to balance parameter sharing.