Multi-Task Learning for Multiple Language Translation

In this paper, we investigate the problem of learning a machine translation model that can simultaneously translate sentences from one source language to multiple target languages. Our solution is inspired by the recently proposed neural machine translation model which generalizes machine translation as a sequence learning problem. We extend the neural machine translation to a multi-task learning framework which shares source language representation and separates the modeling of different target language translation. Our framework can be applied to situations where either large amounts of parallel data or limited parallel data is available. Experiments show that our multi-task learning model is able to achieve signiﬁcantly higher translation quality over individually learned model in both situations on the data sets publicly available.


Introduction
Translation from one source language to multiple target languages at the same time is a difficult task for humans. A person often needs to be familiar with specific translation rules for different language pairs. Machine translation systems suffer from the same problems too. Under the current classic statistical machine translation framework, it is hard to share information across different phrase tables among different language pairs. Translation quality decreases rapidly when the size of training corpus for some minority language pairs becomes smaller. To conquer the problems described above, we propose a multi-task learning framework based on a sequence learning model to conduct machine translation from one source language to multiple target languages, inspired by the recently proposed neural machine translation(NMT) framework proposed by . Specifically, we extend the recurrent neural network based encoder-decoder framework to a multi-task learning model that shares an encoder across all language pairs and utilize a different decoder for each target language.
The neural machine translation approach has recently achieved promising results in improving translation quality. Different from conventional statistical machine translation approaches, neural machine translation approaches aim at learning a radically end-to-end neural network model to optimize translation performance by generalizing machine translation as a sequence learning problem.
Based on the neural translation framework, the lexical sparsity problem and the long-range dependency problem in traditional statistical machine translation can be alleviated through neural networks such as long shortterm memory networks which provide great lexical generalization and long-term sequence memorization abilities.
The basic assumption of our proposed framework is that many languages differ lexically but are closely related on the semantic and/or the syntactic levels. We explore such correlation across different target languages and realize it under a multi-task learning framework. We treat a separate translation direction as a sub RNN encode-decoder task in this framework which shares the same encoder (i.e. the same source language representation) across different translation directions, and use a different decoder for each specific target language. In this way, this proposed multi-task learning model can make full use of the source language corpora across different language pairs. Since the encoder part shares the same source language representation across all the translation tasks, it may learn semantic and structured predictive representations that can not be learned with only a small amount of data. Moreover, during training we jointly model the alignment and the translation process simultaneously for different language pairs under the same framework. For example, when we simultaneously translate from English into Korean and Japanese, we can jointly learn latent similar semantic and structure information across Korea and Japanese because these two languages share some common language structures.
The contribution of this work is three folds. First, we propose a unified machine learning framework to explore the problem of translating one source language into multiple target languages. To the best of our knowledge, this problem has not been studied carefully in the statistical machine translation field before. Second, given large-scale training corpora for different language pairs, we show that our framework can improve translation quality on each target language as compared with the neural translation model trained on a single language pair. Finally, our framework is able to alleviate the data scarcity problem, using language pairs with large-scale parallel training corpora to improve the translation quality of those with few parallel training corpus.
The following sections will be organized as follows: in section 2, related work will be described, and in section 3, we will describe our multi-task learning method. Experiments that demonstrate the effectiveness of our framework will be described in section 4. Lastly, we will conclude our work in section 5.

Related Work
Statistical machine translation systems often rely on large-scale parallel and monolingual training corpora to generate translations of high quality. Unfortunately, statistical machine translation system often suffers from data sparsity problem due to the fact that phrase tables are extracted from the limited bilingual corpus. Much work has been done to address the data sparsity problem such as the pivot language approach (Wu and Wang, 2007;Cohn and Lapata, 2007) and deep learning techniques (Devlin et al., 2014;Gao et al., 2014;Sundermeyer et al., 2014;.
On the problem of how to translate one source language to many target languages within one model, few work has been done in statistical machine translation. A related work in SMT is the pivot language approach for statistical machine translation which uses a commonly used language as a "bridge" to generate source-target translation for language pair with few training corpus. Pivot based statistical machine translation is crucial in machine translation for resource-poor language pairs, such as Spanish to Chinese. Considering the problem of translating one source language to many target languages, pivot based SMT approaches does work well given a large-scale source language to pivot language bilingual corpus and large-scale pivot language to target languages corpus. However, in reality, language pairs between English and many other target languages may not be large enough, and pivot-based SMT sometimes fails to handle this problem. Our approach handles one to many target language translation in a different way that we directly learn an end to multi-end translation system that does not need a pivot language based on the idea of neural machine translation.
Neural Machine translation is a emerging new field in machine translation, proposed by several work recently (Kalchbrenner and Blunsom, 2013;Sutskever et al., 2014;, aiming at end-to-end machine translation without phrase table extraction and language model training.
Different from traditional statistical machine translation, neural machine translation encodes a variable-length source sentence with a recurrent neural network into a fixed-length vector representation and decodes it with another recurrent neural network from a fixed-length vector into variable-length target sentence. A typical model is the RNN encoder-decoder approach proposed by , which utilizes a bidirectional recurrent neural network to compress the source sentence information and fits the conditional probability of words in target languages with a recurrent manner. Moreover, soft alignment parameters are considered in this model. As a specific example model in this paper, we adopt a RNN encoder-decoder neural machine translation model for multi-task learning, though all neural network based model can be adapted in our framework.
In the natural language processing field, a notable work related with multi-task learning was proposed by Collobert et al. (2011) which shared common representation for input words and solve different traditional NLP tasks such as part-of-Speech tagging, name entity recognition and semantic role labeling within one framework, where the convolutional neural network model was used. Hatori et al. (2012) proposed to jointly train word segmentation, POS tagging and dependency parsing, which can also be seen as a multi-task learning approach. Similar idea has also been proposed by  in Chinese dependency parsing. Most of multi-task learning or joint training frameworks can be summarized as parameter sharing approaches proposed by Ando and Zhang (2005) where they jointly trained models and shared center parameters in NLP tasks. Researchers have also explored similar approaches (Sennrich et al., 2013;Cui et al., 2013) in statistical machine translation which are often refered as domain adaption. Our work explores the possibility of machine translation under the multitask framework by using the recurrent neural networks. To the best of our knowledge, this is the first trial of end to end machine translation under multi-task learning framework.

Multi-task Model for Multiple Language Translation
Our model is a general framework for translating from one source language to many targets. The model we build in this section is a recurrent neural network based encoder-decoder model with multiple target tasks, and each task is a specific translation direction. Different tasks share the same translation encoder across different language pairs. We will describe model details in this section.

Objective Function
Given a pair of training sentence {x, y}, a standard recurrent neural network based encoder-decoder machine translation model fits a parameterized model to maximize the conditional probability of a target sentence y given a source sentence x , i.e., argmax p(y|x). We extend this into multiple languages setting. In particular, suppose we want to translate from English to many different languages, for instance, French(Fr), Dutch(Nl), Spanish(Es). Parallel training data will be collected before training, i.e. En-Fr, En-Nl, En-Es parallel sentences. Since the English representation of the three language pairs is shared in one encoder, the objective function we optimize is the summation of several conditional probability terms conditioned on representation generated from the same encoder.
And Θ trg Tp is the parameter set of the T p th target language. N p is the size of parallel training corpus of the pth language pair. For different target languages, the target encoder parameters are seperated so we have T m decoders to optimize. This parameter sharing strategy makes different language pairs maintain the same semantic and structure information of the source language and learn to translate into target languages in different decoders.

Model Details
Suppose we have several language pairs (x Tp , y Tp ) where T p denotes the index of the T p th language pair. For a specific language pair, given a sequence of source sentence input (x Tp 1 , x Tp 2 , · · · , x Tp n ), the goal is to jointly maximize the conditional probability for each generated target word.
The probability of generating the tth target word is estimated as: (2) where the function g is parameterized by a feedforward neural network with a softmax output layer. And g can be viewed as a probability predictor with neural networks. s Tp t is a recurrent neural network hidden state at time t, which can be estimated as: the context vector c Tp t depends on a sequence of annotations (h 1 , · · · , h Lx ) to which an encoder maps the input sentence, where L x is the number of tokens in x. Each annotation h i is a bidirectional recurrent representation with forward and backward sequence information around the ith word.
a Tp tj is a normalized score of e tj which is a soft alignment model measuring how well the input context around the jth word and the output word in the tth position match. e tj is modeled through a perceptron-like function: To compute h j , a bidirectional recurrent neural network is used. In the bidirectional recurrent neural network, the representation of a forward sequence and a backward sequence of the input sentence is estimated and concatenated to be a single vector. This concatenated vector can be used to translate multiple languages during the test time.
From a probabilistic perspective, our model is able to learn the conditional distribution of several target languages given the same source corpus. Thus, the recurrent encoder-decoders are jointly trained with several conditional probabilities added together. As for the bidirectional recurrent neural network module, we adopt the recently proposed gated recurrent neural network . The gated recurrent neural network is shown to have promising results in several sequence learning problem such as speech recognition and machine translation where input and output sequences are of variable length. It is also shown that the gated recurrent neural network has the ability to address the gradient vanishing problem compared with the traditional recurrent neural network, and thus the long-range dependency problem in machine translation can be handled well. In our multi-task learning framework, the parameters of the gated recurrent neural network in the encoder are shared, which is formulated as follows.
Where I is identity vector and denotes element wise product between vectors. tanh(x) and σ(x) are nonlinear transformation functions that can be applied element-wise on vectors. The recurrent computation procedure is illustrated in 1, where x t denotes one-hot vector for the tth word in a sequence. Figure 1: Gated recurrent neural network computation, where r t is a reset gate responsible for memory unit elimination, and z t can be viewed as a soft weight between current state information and history information.
The overall model is illustrated in Figure 2 where the multi-task learning framework with four target languages is demonstrated.
The soft alignment parameters A i for each encoderdecoder are different and only the bidirectional recurrent neural network representation is shared.

Optimization
The optimization approach we use is the mini-batch stochastic gradient descent approach (Bottou, 1991). The only difference between our optimization and the commonly used stochastic gradient descent is that we learn several minibatches within a fixed language pair for several mini-batch iterations and then move onto the next language pair. Our optimization procedure is shown in Figure 3.

Translation with Beam Search
Although parallel corpora are available for the encoder and the decoder modeling in the training phrase, the ground truth is not available during test time. During test time, translation is produced by finding the most likely sequence via beam search.
Given the target direction we want to translate to, beam search is performed with the shared encoder and a specific target decoder where search space belongs to the decoder T p . We adopt beam search algorithm similar as it is used in SMT system (Koehn, 2004) except that we only utilize scores produced by each decoder as features. The size of beam is 10 in our experiments for speedup consideration. Beam search is ended until the endof-sentence eos symbol is generated.

Experiments
We conducted two groups of experiments to show the effectiveness of our framework. The goal of the first experiment is to show that multi-task learning helps to improve translation performance given enough training corpora for all language pairs. In the second experiment, we show that for some resource-poor language pairs with a few parallel training data, their translation performance could be improved as well.

Dataset
The Europarl corpus is a multi-lingual corpus including 21 European languages. Here we only choose four language pairs for our experiments. The source language is English for all language pairs. And the target languages are Spanish (Es), French (Fr), Portuguese (Pt) and Dutch (Nl).
To demonstrate the validity of our learning framework, we do some preprocessing on the training set. For the source language, we use 30k of the most frequent words for source language vocabulary which is shared across different language pairs and 30k most frequent words for each target language. Outof-vocabulary words are denoted as unknown words, and we maintain different unknown word labels for different languages. For test sets, we also restrict all words in the test set to be from our training vocabulary and mark the OOV words as the corresponding labels as in the training data. The size of training corpus in experiment 1 and 2 is listed in Table 1 where  • Initialization of all parameters are from uniform distribution between -0.01 and 0.01. • We use stochastic gradient descent with recently proposed learning rate decay strategy Ada-Delta (Zeiler, 2012).
• Mini batch size in our model is set to 50 so that the convergence speed is fast. • We train 1000 mini batches of data in one language pair before we switch to the next language pair. • For word representation dimensionality, we use 1000 for both source language and target language. • The size of hidden layer is set to 1000.
We trained our multi-task model with a multi-GPU implementation due to the limitation of Graphic memory. And each target decoder is trained within one GPU card, and we synchronize our source encoder every 1000 batches among all GPU card. Our model costs about 72 hours on full large parallel corpora training until convergence and about 24 hours on partial parallel corpora training. During decoding, our implementation on GPU costs about 0.5 second per sentence.

Evaluation
We evaluate the effectiveness of our method with EuroParl Common testset and WMT 2013 dataset. BLEU-4 (Papineni et al., 2002) is used as the evaluation metric. We evaluate BLEU scores on EuroParl Common test set with multi-task NMT models and single NMT models to demonstrate the validity of our multi-task learning framework. On the WMT 2013 data sets, we compare performance of separately trained NMT models, multi-task NMT models and Moses. We use the EuroParl Common test set as a development set in both neural machine translation experiments and Moses experiments. For single NMT models and multi-task NMT models, we select the best model with the highest BLEU score in the EuroParl Common testset and apply it to the WMT 2013 dataset. Note that our experiment settings in NMT is equivalent with Moses, considering the same training corpus, development sets and test sets.

Experimental Results
We report our results of three experiments to show the validity of our methods. In the first experiment, we train multi-task learning model jointly on all four parallel corpora and compare BLEU scores with models trained separately on each parallel corpora. In the second experiment, we utilize the same training procedures as Experiment 1, except that we mimic the situation where some parallel corpora are resource-poor and maintain only 15% data on two parallel training corpora. In experiment 3, we test our learned model from experiment 1 and experiment 2 on WMT 2013 dataset. Table 3 and 4 show the case-insensitive BLEU scores on the Europarl common test data. Models learned from the multitask learning framework significantly outperform the models trained separately. Table 4 shows that given only 15% of parallel training corpus of English-Dutch and English-Portuguese, it is possible to improve translation performance on all the target languages as well. This result makes sense because the correlated languages benefit from each other by sharing the same predictive structure, e.g. French, Spanish and Portuguese, all of which are from Latin. We also notice that even though Dutch is from Germanic languages, it is also possible to increase translation performance under our multi-task learning framework which demonstrates the generalization of our model to multiple target languages.

Lang-Pair
En

Model Analysis and Discussion
We try to make empirical analysis through learning curves and qualitative results to explain why multi-task learning framework works well in multiple-target machine translation problem.
From the learning process, we observed that the speed of model convergence under multi-task learning is faster than models trained separately especially when a model is trained for resourcepoor language pairs. The detailed learning curves are shown in Figure 4. Here we study the learning curve for resource-poor language pairs, i.e. English-Dutch and En-Portuguese, for which only 15% of the bilingual data is sampled for training. The BLEU scores are evaluated on the Europarl common test set. From Figure 4, it can be seen that in the early stage of training, given the same amount of training data for each language pair, the translation performance of the multi-task learning model is improved more rapidly. And the multi-task models achieve better translation quality than separately trained models within three iterations of training. The reason of faster and better convergence in performance is that the encoder parameters are shared across different language pairs, which can make full use of all the source language training data across the language pairs and improve the source language   representation.
The sharing of encoder parameters is useful especially for the resource-poor language pairs. In the multi-task learning framework, the amount of the source language is not limited by the resource-poor language pairs and we are able to learn better representation for the source language. Thus the representation of the source language learned from the multi-task model is more stable, and can be viewed as a constraint that leverages translation performance of all language pairs. Therefore, the overfitting problem and the data scarcity problem can be alleviated for language pairs with only a few training data. In Table 6, we list the three nearest neighbors of some source words whose similarity is computed by using the cosine score of the embeddings both in the multi-task learning framework (from Experiment two ) and in the single model (the resourcepoor English-Portuguese model). Although the nearest neighbors of the high-frequent words such as numbers can be learned both in the multi-task model and the single model, the overall quality of the nearest neighbors learned by the resource-poor single model is much poorer compared with the multi-task model.

Conclusion
In this paper, we investigate the problem of how to translate one source language into several different target languages within a unified translation model. Our proposed solution is based on the English Students, meanwhile, say the course is one of the most interesting around.

English
In addition, they limited the right of individuals and groups to provide assistance to voters wishing to register.
Reference-Fr De plus, ils ont limité le droit de personnes et de groupes de fournir une assistance auxélecteurs désirant s' inscrire.

Multi-Fr
De plus, ils restreignent le droit des individus et des groupesà fournir une assistance auxélecteurs qui souhaitent enregistrer. Table 7: Translation of different target languages given the same input in our multi-task model.
recently proposed recurrent neural network based encoder-decoder framework. We train a unified neural machine translation model under the multitask learning framework where the encoder is shared across different language pairs and each target language has a separate decoder. To the best of our knowledge, the problem of learning to translate from one source to multiple targets has seldom been studied. Experiments show that given large-scale parallel training data, the multitask neural machine translation model is able to learn good predictive structures in translating multiple targets. Significant improvement can be observed from our experiments on the data sets publicly available. Moreover, our framework is able to address the data scarcity problem of some resource-poor language pairs by utilizing largescale parallel training corpora of other language pairs to improve the translation quality. Our model is efficient and gets faster and better convergence for both resource-rich and resource-poor language pair under the multi-task learning.
In the future, we would like to extend our learning framework to more practical setting. For example, train a multi-task learning model with the same target language from different domains to improve multiple domain translation within one model. The correlation of different target languages will also be considered in the future work.