Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism

We propose multi-way, multilingual neural machine translation. The proposed approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages. This is made possible by having a single attention mechanism that is shared across all language pairs. We train the proposed multi-way, multilingual model on ten language pairs from WMT'15 simultaneously and observe clear performance improvements over models trained on only one language pair. In particular, we observe that the proposed model significantly improves the translation quality of low-resource language pairs.


Introduction
Neural Machine Translation It has been shown that a deep (recurrent) neural network can successfully learn a complex mapping between variablelength input and output sequences on its own. Some of the earlier successes in this task have, for instance, been handwriting recognition (Bottou et al., 1997;Graves et al., 2009) and speech recognition (Graves et al., 2006;. More recently, a general framework of encoderdecoder networks has been found to be effective at learning this kind of sequence-to-sequence mapping by using two recurrent neural networks (Cho et al., 2014b;Sutskever et al., 2014).
A basic encoder-decoder network consists of two recurrent networks. The first network, called an encoder, maps an input sequence of variable length into a point in a continuous vector space, resulting in a fixed-dimensional context vector. The other recurrent neural network, called a decoder, then generates a target sequence again of variable length starting from the context vector. This approach however has been found to be inefficient in (Cho et al., 2014a) when handling long sentences, due to the difficulty in learning a complex mapping between an arbitrary long sentence and a single fixed-dimensional vector.
In (Bahdanau et al., 2014), a remedy to this issue was proposed by incorporating an attention mechanism to the basic encoder-decoder network. The attention mechanism in the encoder-decoder network frees the network from having to map a sequence of arbitrary length to a single, fixed-dimensional vector. Since this attention mechanism was introduced to the encoder-decoder network for machine translation, neural machine translation, which is purely based on neural networks to perform full end-to-end translation, has become competitive with the existing phrase-based statistical machine translation in many language pairs Luong et al., 2015b).
Multilingual Neural Machine Translation Existing machine translation systems, mostly based on a phrase-based system or its variants, work by directly mapping a symbol or a subsequence of symbols in a source language to its corresponding symbol or subsequence in a target language. This kind of mapping is strictly specific to a given language pair, and it is not trivial to extend this mapping to work on multiple pairs of languages.
A system based on neural machine translation, on the other hand, can be decomposed into two mod-ules. The encoder maps a source sentence into a continuous representation, either a fixed-dimensional vector in the case of the basic encoder-decoder network or a set of vectors in the case of attentionbased encoder-decoder network. The decoder then generates a target translation based on this source representation. This makes it possible conceptually to build a system that maps a source sentence in any language to a common continuous representation space and decodes the representation into any of the target languages, allowing us to make a multilingual machine translation system. This possibility is straightforward to implement and has been validated in the case of basic encoderdecoder networks . It is however not so, in the case of the attention-based encoder-decoder network, as the attention mechanism, or originally called the alignment function in (Bahdanau et al., 2014), is conceptually language pair-specific. In (Dong et al., 2015), the authors cleverly avoided this issue of language pair-specific attention mechanism by considering only a one-tomany translation, where each target language decoder embedded its own attention mechanism. Also, we notice that both of these works have only evaluated their models on relatively small-scale tasks, making it difficult to assess whether multilingual neural machine translation can scale beyond lowresource language translation.
Multi-Way, Multilingual Neural Machine Translation In this paper, we first step back from the currently available multilingual neural translation systems proposed in Dong et al., 2015) and ask the question of whether the attention mechanism can be shared across multiple language pairs. As an answer to this question, we propose an attention-based encoder-decoder network that admits a shared attention mechanism with multiple encoders and decoders. We use this model for all the experiments, which suggests that it is indeed possible to share an attention mechanism across multiple language pairs. The next question we ask is the following: in which scenario would the proposed multi-way, multilingual neural translation have an advantage over the existing, single-pair model? Specifically, we consider a case of the translation between a low-resource language pair. The experiments show that the proposed multi-way, multilingual model generalizes better than the single-pair translation model, when the amount of available parallel corpus is small. Furthermore, we validate that this is not only due to the increased amount of target-side, monolingual corpus.
Finally, we train a single model with the proposed architecture on all the language pairs from the WMT'15; English, French, Czech, German, Russian and Finnish. The experiments show that it is indeed possible to train a single attention-based network to perform multi-way translation.

Background: Attention-based Neural Machine Translation
The attention-based neural machine translation was proposed in (Bahdanau et al., 2014). It was motivated from the observation in (Cho et al., 2014a) that a basic encoder-decoder translation model from (Cho et al., 2014b;Sutskever et al., 2014) suffers from translating a long source sentence efficiently. This is largely due to the fact that the encoder of this basic approach needs to compress a whole source sentence into a single vector. Here we describe the attention-based neural machine translation. Neural machine translation aims at building a single neural network that takes as input a source sequence X = (x 1 , . . . , x Tx ) and generates a corresponding translation Y = y 1 , . . . , y Ty . Each symbol in both source and target sentences, x t or y t , is an integer index of the symbol in a vocabulary.
The encoder of the attention-based model encodes a source sentence into a set of context vectors C = {h 1 , h 2 , . . . , h Tx }, whose size varies w.r.t. the length of the source sentence. This context set is constructed by a bidirectional recurrent neural network (RNN) which consists of a forward RNN and reverse RNN. The forward RNN reads the source sentence from the first token until the last one, resulting in the forward context vectors and E x ∈ R |Vx|×d is an embedding matrix containing row vectors of the source symbols. The reverse RNN in an opposite direction, resulting in − → Ψ enc and ← − Ψ enc are recurrent activation functions such as long short-term memory units (LSTM, (Hochreiter and Schmidhuber, 1997)) or gated recurrent units (GRU, (Cho et al., 2014b)). At each position in the source sentence, the forward and reverse context vectors are concatenated to form a full context vector, i.e., The decoder, which is implemented as an RNN as well, generates one symbol at a time, the translation of the source sentence, based on the context set returned by the encoder. At each time step t in the decoder, a time-dependent context vector c t is computed based on the previous hidden state of the decoder z t−1 , the previously decoded symbolỹ t−1 and the whole context set C.
This starts by computing the relevance score of each context vector as for all i = 1, . . . , T x . f score can be implemented in various ways (Luong et al., 2015b), but in this work, we use a simple single-layer feedforward network. This relevance score measures how relevant the i-th context vector of the source sentence is in deciding the next symbol in the translation. These relevance scores are further normalized: and we call α t,i the attention weight. The time-dependent context vector c t is then the weighted sum of the context vectors with their weights being the attention weights from above: With this time-dependent context vector c t , the previous hidden state z t−1 and the previously decoded symbolỹ t−1 , the decoder's hidden state is updated by where Ψ dec is a recurrent activation function. The initial hidden state z 0 of the decoder is initialized based on the last hidden state of the reverse RNN: where f init is a feedforward network with one or two hidden layers. The probability distribution for the next target symbol is computed by where g k is a parametric function that returns the unnormalized probability for the next target symbol being k.
Training this attention-based model is done by maximizing the conditional log-likelihood where the log probability inside the inner summation is from Eq. (7). It is important to note that the ground-truth target symbols y (n) t are used during training. The entire model is differentiable, and the gradient of the log-likelihood function with respect to all the parameters θ can be computed efficiently by backpropagation. This makes it straightforward to use stochastic gradient descent or its variants to train the whole model jointly to maximize the translation performance.

Multi-Way, Multilingual Translation
In this section, we discuss issues and our solutions in extending the conventional single-pair attentionbased neural machine translation into multi-way, multilingual model.

Problem Definition
We assume N > 1 source languages X 1 , X 2 , . . . , X N and M > 1 target languages Y 1 , Y 2 , . . . , Y M , and the availability of L ≤ M × N bilingual parallel corpora {D 1 , . . . , D L }, each of which is a set of sentence pairs of one source and one target languages. We use s(D l ) and t(D l ) to indicate the source and target languages of the l-th parallel corpus.
For each parallel corpus l, we can directly use the log-likelihood function from Eq. (7) to define a pair-specific log-likelihood L s(D l ),t(D l ) . Then, the goal of multi-way, multilingual neural machine translation is to build a model that maximizes the joint log-likelihood function . Once the training is over, the model can do translation from any of the source languages to any of the target languages included in the parallel corpora.

Existing Approaches
Neural Machine Translation without Attention In , the authors extended the basic encoder-decoder network for multitask neural machine translation. As they extended the basic encoder-decoder network, their model effectively becomes a set of encoders and decoders, where each of the encoder projects a source sentence into a common vector space. The point in the common space is then decoded into different languages.
The major difference between  and our work is that we extend the attentionbased encoder-decoder instead of the basic model. This is an important contribution, as the attentionbased neural machine translation has become de facto standard in neural translation literatures recently (Jean et al., 2014;Luong et al., 2015b;Sennrich et al., 2015b;Sennrich et al., 2015a), by opposition to the basic encoder-decoder.
There are two minor differences as well. First, they do not consider multilinguality in depth. The authors of  tried only a single language pair, English and German, in their model. Second, they only report translation perplexity, which is not a widely used metric for measuring translation quality. To more easily compare with other machine translation approaches it would be important to evaluate metrics such as BLEU, which counts the number of matched n-grams between the generated and reference translations.
One-to-Many Neural Machine Translation The authors of (Dong et al., 2015) earlier proposed a multilingual translation model based on the attention-based neural machine translation. Unlike this paper, they only tried it on one-to-many translation, similarly to earlier work by (Collobert et al., 2011) where one-to-many natural language processing was done. In this setting, it is trivial to extend the single-pair attention-based model into multilingual translation by simply having a single encoder for a source language and pairs of a decoder and attention mechanism (Eq. (2)) for each target language. We will shortly discuss more on why, with the attention mechanism, one-to-many translation is trivial, while multi-way translation is not.

Challenges
A quick look at neural machine translation seems to suggest a straightforward path toward incorporating multiple languages in both source and target sides. As described earlier already in the introduction, the basic idea is simple. We assign a separate encoder to each source language and a separate decoder to each target language. The encoder will project a source sentence in its own language into a common, language-agnostic space, from which the decoder will generate a translation in its own language.
Unlike training multiple single-pair neural translation models, in this case, the encoders and decoders are shared across multiple pairs. This is computationally beneficial, as the number of parameters grows only linearly with respect to the number of languages (O(L)), in contrary to training separate single-pair models, in which case the number of parameters grows quadratically (O(L 2 ).) The attention mechanism, which was initially called a soft-alignment model in (Bahdanau et al., 2014), aligns a (potentially non-contiguous) source phrase to a target word. This alignment process is largely specific to a language pair, and it is not clear whether an alignment mechanism for one language pair can also work for another pair.
The most naive solution to this issue is to have O(L 2 ) attention mechanisms that are not shared across multiple language pairs. Each attention mechanism takes care of a single pair of source and target languages. This is the approach employed in (Dong et al., 2015), where each decoder had its own attention mechanism.
There are two issues with this naive approach. First, unlike what has been hoped initially with multilingual neural machine translation, the number of parameters again grows quadratically w.r.t. the number of languages. Second and more importantly, having separate attention mechanisms makes it less likely for the model to fully benefit from having mul- tiple tasks (Caruana, 1997), especially for transfer learning towards resource-poor languages.
In short, the major challenge in building a multiway, multilingual neural machine translation is in avoiding independent (i.e., quadratically many) attention mechanisms. There are two questions behind this challenge. The first one is whether it is even possible to share a single attention mechanism across multiple language pairs. The second question immediately follows: how can we build a neural translation model to share a single attention mechanism for all the language pairs in consideration?

Multi-Way, Multilingual Model
We describe in this section a proposed multiway, multilingual attention-based neural machine translation. The proposed model consists of N encoders {Ψ n enc } N n=1 (see Eq. (1) (7)) and a shared attention mechanism f score (see Eq. (2) in the single language pair case).
Encoders Similarly to (Luong et al., 2015b), we have one encoder per source language, meaning that a single encoder is shared for translating the language to multiple target languages. In order to handle different source languages better, we may use for each source language a different type of encoder, for instance, of different size (in terms of the number of recurrent units) or of different architecture (con- volutional instead of recurrent.) This allows us to efficiently incorporate varying types of languages in the proposed multilingual translation model. This however implies that the dimensionality of the context vectors in Eq. (1) may differ across source languages. Therefore, we add to the original bidirectional encoder from Sec. 2, a linear transformation layer consisting of a weight matrix W n adp and a bias vector b n adp , which is used to project each context vector into a common dimensional space: h t) and b n adp ∈ R d . In addition, each encoder exposes two transformation functions φ n att and φ n init . The first transformer φ n att transforms a context vector to be compatible with a shared attention mechanism: This transformer can be implemented as any type of parametric function, and in this paper, we simply apply an element-wise tanh to h n t . The second transformer φ n init transforms the first context vector h n 1 to be compatible with the initializer of the decoder's hidden state (see Eq. (6)): Similarly to φ n att , it can be implemented as any type of parametric function. In this paper, we use a feedforward network with a single hidden layer and share one network φ init for all encoder-decoder pairs, like attention mechanism.
Decoders We first start with an initialization of the decoder's hidden state. Each decoder has its own parametric function ϕ m init that maps the last context vectorĥ n Tx of the source encoder from Eq. (10) into the initial hidden state: init can be any parametric function, and in this paper, we used a feedforward network with a single tanh hidden layer.
Each decoder exposes a parametric function ϕ m att that transforms its hidden state and the previously decoded symbol to be compatible with a shared attention mechanism. This transformer is a parametric function that takes as input the previous hidden state z m t−1 and the previous symbolỹ m t−1 and returns a vector for the attention mechanism: which replaces z t−1 in Eq. 2. In this paper, we use a feedforward network with a single tanh hidden layer for each ϕ m att . Given the previous hidden state z m t−1 , previously decoded symbolỹ m t−1 and the time-dependent context vector c m t , which we will discuss shortly, the decoder updates its hidden state: where f m adp affine-transforms the time-dependent context vector to be of the same dimensionality as the decoder. We share a single affine-transformation layer f m adp , for all the decoders in this paper. Once the hidden state is updated, the probability distribution over the next symbol is computed exactly as for the pair-specific model (see Eq. (7).) Attention Mechanism Unlike the encoders and decoders of which there is an instance for each language, there is only a single attention mechanism, shared across all the language pairs. This shared mechanism uses the attention-specific vectorsh n t andz m t−1 from the encoder and decoder, respectively. The relevance score of each context vector h n t is computed based on the decoder's previous hidden state z m t−1 and previous symbolỹ m t−1 : e m,n t,i =f score h n t ,z m t−1 ,ỹ m t−1 These scores are normalized according to Eq. (3) to become the attention weights α m,n t,i . With these attention weights, the time-dependent context vector is computed as the weighted sum of the original context vectors: c m,n t = Tx i=1 α m,n t,i h n i . See Fig. 1   We report the BLEU scores on the development and test sets (separated by /) by the single-pair model (Single), the singlepair model with monolingual corpus (Single+DF) and the proposed multi-way, multilingual model (Multi).

Datasets
We evaluate the proposed multi-way, multilingual translation model on all the pairs available from WMT'15-English (En) ↔ French (Fr), Czech (Cs), German (De), Russian (Ru) and Finnish (Fi)-, totalling ten directed pairs. For each pair, we concatenate all the available parallel corpora from WMT'15 and use it as a training set. We use newstest-2013 as a development set and newstest-2015 as a test set, in all the pairs other than Fi-En. In the case of Fi-En, we use newsdev-2015 and newstest-2015 as a development set and test set, respectively.
Data Preprocessing Each training corpus is tokenized using the tokenizer script from the Moses decoder. The tokenized training corpus is cleaned following the procedure in . Instead of using space-separated tokens, or words, we use sub-word units extracted by byte pair encoding, as recently proposed in (Sennrich et al., 2015b). For each and every language, we include 30k sub-word symbols in a vocabulary. See Table 1 for the statistics of the final, preprocessed training corpora.
Evaluation Metric We mainly use BLEU as an evaluation metric using the multi-bleu script from Moses. BLEU is computed on the tokenized text after merging the BPE-based sub-word symbols. We further look at the average log-probability assigned to reference translations by the trained model as an additional evaluation metric, as a way to measure the model's density estimation performance free from any error caused by approximate decoding.

Two Scenarios
Low-Resource Translation First, we investigate the effect of the proposed multi-way, multilingual model on low-resource language-pair translation. Among the five languages from WMT'15, we choose En, De and Fi as source languages, and En and De as target languages. We control the amount of the parallel corpus of each pair out of three to be 5%, 10%, 20% and 40% of the original corpus.
In other words, we train four models with different sizes of parallel corpus for each language pair (En-De, De-En, Fi-En.) As a baseline, we train a single-pair model for each multi-way, multilingual model. We further finetune the single-pair model to incorporate the target-side monolingual corpus consisting of all the target side text from the other language pairs (e.g., when a single-pair model was trained on Fi-En, the target-side monolingual corpus consists of the target sides from De-En.) This is done by the recently proposed deep fusion . The latter is included to tell whether any improvement from the multilingual model is simply due to the increased amount of target-side monolingual corpus.
Large-scale Translation We train one multi-way, multilingual model that has five encoders and five decoders, corresponding to the five languages from WMT'15;En,Fr,De,Cs,Ru,Fi → En,Fr,De,Cs,Ru,Fi. We use the full corpora for all of them.

Model Architecture
Each symbol, either source or target, is projected on a 620-dimensional space. The encoder is a bidirectional recurrent neural network with 1,000 gated recurrent units (GRU) in each direction, and the decoder is a recurrent neural network with also 1,000 GRU's. The decoder's output function g k from Eq. (7) is a feedforward network with 1,000 tanh hidden units. The dimensionalities of the context vector h n t in Eq. (8), the attention-specific context vectorh n t in Eq. (9) and the attention-specific decoder hidden stateh m t−1 in Eq. (11) are all set to 1,200. We use the same type of encoder for every source language, and the same type of decoder for every target language. The only difference between the single-pair models and the proposed multilingual ones is the numbers of encoders N and decoders M . We leave those multilingual translation specific components, such as the ones in Eqs. (8)-(11), in the single-pair models in order to keep the number of shared parameters constant.

Training
Basic Settings We train each model using stochastic gradient descent (SGD) with Adam (Kingma and Ba, 2015) as an adaptive learning rate algorithm. We use the initial learning rate of 2 · 10 −4 and leave all the other hyperparameters as suggested in (Kingma and Ba, 2015). Each SGD update is computed using a minibatch of 80 examples, unless the model is parallelized over two GPUs, in which case we use a minibatch of 60 examples. We only use sentences of length up to 50 symbols. We clip the norm of the gradient to be no more than 1 (Pascanu et al., 2012). All training runs are early-stopped based on BLEU on the development set. As we observed in preliminary experiments better scores on the development set when finetuning the shared parameters and output layers of the decoders in the case of multilingual models, we do this for all the multilingual models. During finetuning, we clip the norm of the gradient to be no more than 5.
Schedule As we have access only to bilingual corpora, each sentence pair updates only a subset of the parameters. Excessive updates based on a single language pair may bias the model away from the other pairs. To avoid it, we cycle through all the language pairs, one pair at a time, in Fi En, De En, Fr En, Cs En, Ru En order. 1 Model Parallelism The size of the multilingual model grows linearly w.r.t. the number of languages. We observed that a single model that handles five source and five target languages does not fit in a single GPU during training. We address this by distributing computational paths according to different translation pairs over multiple GPUs. The shared pa-  rameters, mainly related to the attention mechanism, is duplicated on both GPUs. The implementation was based on the work in (Ding et al., 2014).

Results and Analysis
Low-Resource Translation It is clear from Table 2 that the proposed model (Multi) outperforms the single-pair one (Single) in all the cases. This is true even when the single-pair model is strengthened with a target-side monolingual corpus (Sin-gle+DF). This suggests that the benefit of generalization from having multiple languages goes beyond that of simply having more target-side monolingual corpus. The performance gap grows as the size of target parallel corpus decreases.
Large-Scale Translation In Table 3, we observe that the proposed multilingual model outperforms or is comparable to the single-pair models for the majority of the all ten pairs/directions considered. This happens in terms of both BLEU and average logprobability. This is encouraging, considering that there are twice more parameters in the whole set of single-pair models than in the multilingual model.
It is worthwhile to notice that the benefit is more apparent when the model translates from a foreign language to English. This may be due to the fact that all of the parallel corpora include English as either a source or target language, leading to a better parameter estimation of the English decoder. In the future, a strategy of using a pseudo-parallel corpus to increase the amount of training examples for the decoders of other languages (Sennrich et al., 2015a) should be investigated to confirm this conjecture.

Conclusion
In this paper, we proposed multi-way, multilingual attention-based neural machine translation. The proposed approach allows us to build a single neural network that can handle multiple source and target languages simultaneously. The proposed model is a step forward from the recent works on multilingual neural translation, in the sense that we support attention mechanism, compared to  and multi-way translation, compared to (Dong et al., 2015). Furthermore, we evaluate the proposed model on large-scale experiments, using the full set of parallel corpora from WMT'15.
We empirically evaluate the proposed model in large-scale experiments using all five languages from WMT'15 with the full set of parallel corpora and also in the settings with artificially controlled amount of the target parallel corpus. In both of the settings, we observed the benefits of the proposed multilingual neural translation model over having a set of single-pair models. The improvement was especially clear in the cases of translating low-resource language pairs. We observed the larger improvements when translating to English. We conjecture that this is due to a higher availability of English in most parallel corpora, leading to a better parameter estimation of the English decoder. More research on this phenomenon in the future will result in further improvements from using the proposed model. Also, all the other techniques proposed recently, such as ensembling and large vocabulary tricks, need to be tried together with the proposed multilingual model to improve the translation quality even further. Finally, an interesting future work is to use the proposed model to translate between a language pair not included in a set of training corpus.