Zero-Resource Translation with Multi-Lingual Neural Machine Translation

In this paper, we propose a novel finetuning algorithm for the recently introduced multi-way, multilingual neural machine translation model that enables zero-resource machine translation. When used together with novel many-to-one translation strategies, we empirically show that this finetuning algorithm allows the multi-way, multilingual model to translate a zero-resource language pair (1) as well as a single-pair neural translation model trained with up to 1M direct parallel sentences of the same language pair and (2) better than a pivot-based translation strategy, while keeping only one additional copy of attention-related parameters.


Introduction
Recently introduced neural machine translation (Forcada and Ñeco, 1997; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) has proven to be a platform for new opportunities in machine translation research. Rather than word-level translation with language-specific preprocessing, neural machine translation has been found to work well with statistically segmented subword sequences as well as sequences of characters (Chung et al., 2016; Luong and Manning, 2016; Sennrich et al., 2015b; Ling et al., 2015). Also, recent works show that neural machine translation provides a seamless way to incorporate multiple modalities other than natural language text in translation (Luong et al., 2015a; Caglayan et al., 2016). Furthermore, neural machine translation has been found to translate between multiple languages, achieving better translation quality by exploiting positive language transfer (Dong et al., 2015; Firat et al., 2016; Zoph and Knight, 2016).
In this paper, we conduct an in-depth investigation into the recently proposed multi-way, multilingual neural machine translation (Firat et al., 2016). Specifically, we are interested in its potential for zero-resource machine translation, in which there do not exist any direct parallel examples between a target language pair. Zero-resource translation has been addressed by pivot-based translation in traditional machine translation research (Wu and Wang, 2007; Utiyama and Isahara, 2007), but we explore a way to use the multi-way, multilingual neural model to translate directly from a source to a target language.
In doing so, we begin by studying different translation strategies available in the multi-way, multilingual model in Sec. 3-4. The strategies include the usual one-to-one translation as well as variants of many-to-one translation for multi-source translation (Zoph and Knight, 2016). We empirically show that the many-to-one strategies significantly outperform the one-to-one strategy.
We move on to zero-resource translation by first evaluating a vanilla multi-way, multilingual model on a zero-resource language pair in Sec. 6.1, which revealed that the vanilla model cannot do zero-resource translation. Based on the many-to-one strategies we proposed earlier, we design in Sec. 5.2 a novel finetuning strategy that does not require any direct parallel corpus between a target, zero-resource language pair and that uses the idea of generating a pseudo-parallel corpus (Sennrich et al., 2015a). This strategy makes an additional copy of the attention mechanism and finetunes only this small set of parameters.
Large-scale experiments with Spanish, French and English show that the proposed finetuning strategy allows the multi-way, multilingual neural translation model to perform zero-resource translation as well as a single-pair neural translation model trained with up to 1M true parallel sentences. This result re-confirms the potential of the multi-way, multilingual model for low/zero-resource language translation, which was earlier argued by Firat et al. (2016).
⋆ Work carried out while the author was at IBM Research.

Multi-Way, Multilingual Neural Machine Translation
Recently, Firat et al. (2016) proposed an extension of attention-based neural machine translation (Bahdanau et al., 2015) that can handle multi-way, multilingual translation with a shared attention mechanism. This model was designed to handle multiple source and target languages. In this section, we briefly overview this multi-way, multilingual model. For a more detailed exposition, we refer the reader to (Firat et al., 2016).

Model Description
The goal of the multi-way, multilingual model is to build a neural translation model that can translate a source sentence given in one of $N$ languages into one of $M$ target languages. To handle those $N$ source and $M$ target languages, the model consists of $N$ encoders and $M$ decoders. Unlike these language-specific encoders and decoders, only a single attention mechanism is shared across all $M \times N$ language pairs.

Encoder
An encoder for the $n$-th source language reads a source sentence $X = (x_1, \ldots, x_{T_x})$ as a sequence of linguistic symbols and returns a set of context vectors $C^n = \{h^n_1, \ldots, h^n_{T_x}\}$. The encoder is usually implemented as a bidirectional recurrent network (Schuster and Paliwal, 1997), and each context vector $h^n_t$ is a concatenation of the forward and reverse recurrent networks' hidden states at time $t$. Without loss of generality, we assume that the context vectors of all source languages have the same dimensionality.

Decoder and Attention Mechanism
A decoder for the $m$-th target language is a conditional recurrent language model (Mikolov et al., 2010). At each time step $t'$, it updates its hidden state by
$$z^m_{t'} = \phi^m\left(z^m_{t'-1}, \tilde{y}^m_{t'-1}, c^m_{t'}\right),$$
based on the previous hidden state $z^m_{t'-1}$, the previous target symbol $\tilde{y}^m_{t'-1}$ and the time-dependent context vector $c^m_{t'}$. $\phi^m$ is a gated recurrent unit (GRU; Cho et al., 2014).
The time-dependent context vector is computed by the shared attention mechanism as a weighted sum of the context vectors from the encoder $C^n$:
$$c^m_{t'} = \sum_{t=1}^{T_x} \alpha_{t,t'} h^n_t, \qquad (1)$$
where the weights $\alpha_{t,t'}$ are obtained by normalizing, over the source positions $t$, the scores assigned by a scoring function $f_{\text{score}}$ (Eq. (2)). The scoring function $f_{\text{score}}$ returns a scalar and is implemented as a feedforward neural network with a single hidden layer. For more variants of the attention mechanism for machine translation, see (Luong et al., 2015b).
The initial hidden state of the decoder is set by an initializer $\phi_{\text{init}}$ (Eq. (3)). With the new hidden state $z^m_{t'}$, the probability distribution over the next symbol is computed by normalizing, over the target vocabulary, the outputs of $g^m_w$ (Eq. (4)), where $g^m_w$ is a decoder-specific parametric function that returns the unnormalized probability for the next target symbol being $w$.
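The weighted-sum computation of the time-dependent context vector can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names are hypothetical, and a plain dot product stands in for $f_{\text{score}}$, which the text describes as a feedforward network with a single hidden layer.

```python
import math

def softmax(scores):
    """Normalize real-valued scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(h, z):
    """Stand-in for f_score (assumption): a dot product, not the paper's
    single-hidden-layer feedforward network."""
    return sum(a * b for a, b in zip(h, z))

def context_vector(encoder_states, decoder_state, score_fn=dot_score):
    """Return (weights, c): attention weights over source positions and the
    weighted sum of the encoder context vectors."""
    weights = softmax([score_fn(h, decoder_state) for h in encoder_states])
    dim = len(encoder_states[0])
    c = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, c

# Toy example: three source positions with 2-dimensional context vectors.
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_prev = [0.5, -0.5]  # previous decoder hidden state (made-up values)
alpha, c = context_vector(C, z_prev)
```

Because the attention only consumes a list of context vectors and a decoder state, the same function can serve any encoder-decoder pair, which is what makes sharing it across language pairs possible.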

Learning
Training this multi-way, multilingual model does not require multi-way parallel corpora but only a set of bilingual corpora. For each bilingual pair, the conditional log-probability of a ground-truth translation given a source sentence is maximized by adjusting the relevant parameters following the gradient of the log-probability.

One-to-One Translation
In the original paper by Firat et al. (2016), only one translation strategy was evaluated, that is, one-to-one translation. This one-to-one strategy works on a source sentence given in one language by taking the encoder of that source language, the decoder of a target language and the shared attention mechanism. These three components are glued together as if they form a single-pair neural translation model and translate the source sentence into a target language.
We notice, however, that this is not the only translation strategy available with the multi-way, multilingual model. As we end up with multiple encoders, multiple decoders and a shared attention mechanism, this model naturally enables us to exploit a source sentence given in multiple languages, leading to a many-to-one translation strategy, which was proposed recently by Zoph and Knight (2016) in the context of neural machine translation.
Unlike the model of Zoph and Knight (2016), the multi-way, multilingual model is not trained with multi-way parallel corpora. This does not, however, necessarily imply that the model cannot be used in this way. In the remainder of this section, we propose two alternatives for doing multi-source translation with the multi-way, multilingual model.

Many-to-One Translation
In this section, we consider a case where a source sentence is given in two languages, $X_1$ and $X_2$. However, any of the approaches described below applies trivially to more than two source languages.
In this multi-way, multilingual model, multi-source translation can be thought of as averaging two separate translation paths. For instance, in the case of Es+Fr to En, we want to combine Es→En and Fr→En so as to get a better English translation. We notice that there are two points in the multi-way, multilingual model where this averaging may happen.

Early Average
The first candidate is to average the two translation paths when computing the time-dependent context vector (see Eq. (1)). At each time $t$ in the decoder, we compute a time-dependent context vector for each source language, $c^1_t$ and $c^2_t$ respectively for the two source languages. In this early averaging strategy, we simply take the average of these two context vectors:
$$c_t = \frac{c^1_t + c^2_t}{2}.$$
Similarly, we initialize the decoder's hidden state to be the average of the initializers of the two encoders, where $\phi_{\text{init}}$ is the decoder's initializer (see Eq. (3)).

Late Average
Alternatively, we can average those two translation paths (e.g., Es→En and Fr→En) at the output level. At each time $t$, each translation path computes the distribution over the target vocabulary, i.e., $p(y_t = w \mid y_{<t}, X_1)$ and $p(y_t = w \mid y_{<t}, X_2)$. We then average them to get the multi-source output distribution:
$$p(y_t = w \mid y_{<t}, X_1, X_2) = \frac{1}{2}\left(p(y_t = w \mid y_{<t}, X_1) + p(y_t = w \mid y_{<t}, X_2)\right).$$
An advantage of this late averaging strategy over the early averaging one is that it can work even when those two translation paths do not come from a single multilingual model. They can be two separately trained single-pair models. In fact, if $X_1$ and $X_2$ are the same and the two translation paths are simply two different models trained on the same language pair-direction, this is equivalent to constructing an ensemble, which was found to greatly improve translation quality (Sutskever et al., 2014).

Early+Late Average
The two strategies above can be further combined by late-averaging the output distributions from the early averaged model and the late averaged one. We empirically evaluate this early+late average strategy as well.
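The three averaging strategies reduce to element-wise means taken at different points of the decoding step. The following sketch uses made-up context vectors and output distributions for a single time step; the function names and all numeric values are illustrative assumptions, and a real decoder would produce the distributions itself.

```python
def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

def early_average(context_vectors):
    """Early averaging: average the per-source context vectors before decoding."""
    return average(context_vectors)

def late_average(distributions):
    """Late averaging: average the per-path output distributions."""
    return average(distributions)

# Made-up context vectors from the Es and Fr encoders at one decoder step.
c_es = [0.2, 0.8, -0.4]
c_fr = [0.6, 0.0, 0.4]
c_early = early_average([c_es, c_fr])

# Made-up output distributions over a 3-word English vocabulary.
p_es = [0.7, 0.2, 0.1]
p_fr = [0.5, 0.4, 0.1]
p_late = late_average([p_es, p_fr])

# Early+late: average the distribution decoded from the early-averaged context
# (p_early is a made-up value here) with the late-averaged distribution.
p_early = [0.6, 0.2, 0.2]
p_both = late_average([p_early, p_late])
```

Averaging probability distributions with equal weights always yields a valid distribution, so no renormalization is needed after the late or early+late step.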

Experiments: Translation Strategies and Multi-Source Translation
Before continuing on with zero-resource machine translation, we first evaluate the translation strategies described in the previous section on multi-source translation, as these translation strategies form a basic foundation on which we extend the multi-way, multilingual model for zero-resource machine translation.

Settings
When evaluating the multi-source translation strategies, we use English, Spanish and French, and focus on a scenario where only En-Es and En-Fr parallel corpora are available.

Monolingual Corpora
We do not use any additional monolingual corpus.

Preprocessing
All the sentences are tokenized using the tokenizer script from Moses (Koehn et al., 2007). We then replace special tokens, such as numbers, dates and URLs, with predefined markers, which will be replaced back with the original tokens after decoding. After using byte pair encoding (BPE, (Sennrich et al., 2015b)

Models and Training
We start from the code made publicly available as a part of (Firat et al., 2016). We made two changes to the original code. First, we replaced the decoder with the conditional gated recurrent network with the attention mechanism as outlined in (Firat and Cho, 2016). Second, we feed a binary indicator vector of which encoder(s) the source sentence was processed by to the output layer of each decoder ($g^m_w$ in Eq. (4)). Each dimension of the indicator vector corresponds to one source language, and in the case of multi-source translation, more than one dimension may be set to 1.
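As an illustration of the binary indicator vector, the sketch below builds one from the set of active source languages. The language ordering and the function name are assumptions made for this example only; the text does not specify them.

```python
# The ordering of the dimensions is an assumption made for illustration.
SOURCE_LANGUAGES = ["En", "Es", "Fr"]

def encoder_indicator(active_sources, languages=SOURCE_LANGUAGES):
    """Return a binary vector with a 1 for every encoder that processed the input."""
    unknown = set(active_sources) - set(languages)
    if unknown:
        raise ValueError("unknown source language(s): %s" % sorted(unknown))
    return [1 if lang in active_sources else 0 for lang in languages]

single_source = encoder_indicator(["Es"])        # one-to-one translation from Es
multi_source = encoder_indicator(["Es", "Fr"])   # many-to-one translation from Es+Fr
```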
We train the following models: four single-pair models (Es↔En and Fr↔En) and one multi-way, multilingual model (Es,Fr,En↔Es,Fr,En). As proposed by Firat et al. (2016), we share one attention mechanism for the latter case.

Training
We closely follow the setup from (Firat et al., 2016). Each symbol is represented as a 620-dimensional vector. Any recurrent layer, be it in the encoder or decoder, consists of 1000 gated recurrent units (GRU; Cho et al., 2014), and the attention mechanism has a hidden layer of 1200 tanh units ($f_{\text{score}}$ in Eq. (2)). We use Adam (Kingma and Ba, 2015) to train a model, and the gradient at each update is computed using a minibatch of at most 80 sentence pairs. The gradient is clipped to have a norm of at most 1 (Pascanu et al., 2012). We early-stop any training using the T-B score on a development set.
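Two of the training details above, gradient norm clipping and early stopping on the T-B score, can be sketched as follows. This assumes the T-B score is (TER − BLEU)/2, as given in the paper's footnote, so lower values are better. The `patience` parameter and the flat-list gradient representation are assumptions for illustration; the text does not state the stopping criterion in detail.

```python
import math

def clip_gradient(grad, max_norm=1.0):
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * (max_norm / norm) for g in grad]

def tb_score(ter, bleu):
    """T-B score, (TER - BLEU) / 2; lower is better."""
    return (ter - bleu) / 2.0

def should_stop(tb_history, patience=3):
    """Stop once the best (lowest) T-B score is at least `patience` evaluations old.
    `patience` is a hypothetical parameter, not taken from the paper."""
    if len(tb_history) <= patience:
        return False
    best_index = tb_history.index(min(tb_history))
    return len(tb_history) - 1 - best_index >= patience

clipped = clip_gradient([3.0, 4.0])   # original norm is 5, rescaled to norm 1
dev_tb = tb_score(60.0, 30.0)         # made-up TER and BLEU values
```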

One-to-One Translation
We first confirm that the multi-way, multilingual translation model indeed works as well as single-pair models on the translation paths that were considered during training, which was the major claim in (Firat et al., 2016). In Table 2, we present the results on four language pair-directions (Es↔En and Fr↔En).
The code is available at https://github.com/nyu-dl/dl4mt-multi
The T-B score is defined as (TER − BLEU)/2, which we found to be more stable than either TER or BLEU alone for the purpose of early-stopping (Zhao and Chen, 2009).
It is clear that the multi-way, multilingual model indeed performs comparably in all four cases with fewer parameters (due to the shared attention mechanism). As observed earlier in (Firat et al., 2016), we also see that the multilingual model performs better when the target language is English.

Many-to-One Translation
We consider translating from a pair of source sentences in Spanish (Es) and French (Fr) to English (En). It is important to note that the multilingual model was not trained with any multi-way parallel corpus. Despite this, we observe that the early averaging strategy improves the translation quality (measured in BLEU) by 3 points in the case of the test set (compare Table 2 (a-b) and Table 3 (a)). We conjecture that this happens because training the multilingual model has implicitly encouraged the model to find a common context vector space across multiple source languages.
The late averaging strategy, however, outperforms early averaging, albeit marginally, both with the multilingual model and with a pair of single-pair models (see Table 3 (b)). The best quality was observed when the early and late averaging strategies were combined at the output level, achieving up to +3.5 BLEU (compare Table 2 (a) and Table 3 (c)).
We emphasize again that there was no multi-way parallel corpus consisting of Spanish, French and English during training. The result presented in this section shows that the multi-way, multilingual model can exploit multiple sources effectively without requiring any multi-way parallel corpus, and we will rely on this property together with the proposed many-to-one translation strategies in the later sections, where we propose and investigate zero-resource translation.

Zero-Resource Translation Strategies
The network architecture of the multi-way, multilingual model suggests the potential for translating between two languages without any direct parallel corpus available. In the setting considered in this paper (see Sec. 4.1), these translation paths correspond to Es↔Fr, as the only parallel corpora used for training were Es↔En and Fr↔En.
The most naive approach for translating along a zero-resource path is to simply treat it as any other path that was included as a part of training. This corresponds to the one-to-one strategy from Sec. 3.1.
In our experiments, however, it turned out that this naive approach does not work at all, as can be seen in Table 4 (a).
In this section, we investigate this potential of zero-resource translation with the multi-way, multilingual model in depth. More specifically, we propose a number of approaches that enable zero-resource translation without requiring any additional bilingual corpus.

Pivot-based Translation
The first set of approaches exploits the fact that the target zero-resource translation path can be decomposed into a sequence of high-resource translation paths (Wu and Wang, 2007; Utiyama and Isahara, 2007). For instance, in our case, Es→Fr can be decomposed into a sequence of Es→En and En→Fr. In other words, we translate a source sentence (Es) into a pivot language (En) and then translate the English translation into a target language (Fr).

One-to-One Translation
The most basic approach here is to perform each translation path in the decomposed sequence independently of the others. This one-to-one approach introduces only a minimal computational overhead (a multiplicative factor of two). We can further improve this one-to-one pivot-based translation by maintaining a set of k-best translations from the first stage (Es→En), but this increases the overall computational complexity by a factor of k, making it impractical. We therefore focus only on the former approach of keeping the best pivot translation in this paper.

Many-to-One Translation
With the multi-way, multilingual model considered in this paper, we can extend the naive one-to-one pivot-based strategy by replacing the second stage (En→Fr) with many-to-one translation from Sec. 4.4, using both the original source language and the pivot language as a pair of source languages. We first translate the source sentence (Es) into English, and use both the original source sentence and the English translation (Es+En) to translate into the final target language (Fr).
Neither of the approaches described and proposed above requires any additional action on an already-trained multilingual model. They are simply different translation strategies specifically aimed at zero-resource translation.
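The data flow of the two pivot-based strategies can be sketched with toy stand-ins for the trained translation paths. Everything below is hypothetical: the word-by-word dictionaries replace real models, and the multi-source stub only shows that the second stage receives both the original source and the pivot translation.

```python
def pivot_one_to_one(source, src_to_pivot, pivot_to_tgt):
    """Translate source -> pivot -> target, keeping only the single best pivot."""
    pivot = src_to_pivot(source)
    return pivot_to_tgt(pivot)

def pivot_many_to_one(source, src_to_pivot, multi_source_to_tgt):
    """Translate source -> pivot, then decode the target from (source, pivot) jointly."""
    pivot = src_to_pivot(source)
    return multi_source_to_tgt([source, pivot])

# Toy word-by-word stand-ins for trained translation paths (hypothetical).
ES_EN = {"hola": "hello", "mundo": "world"}
EN_FR = {"hello": "bonjour", "world": "monde"}

def es_to_en(sentence):
    return " ".join(ES_EN.get(word, word) for word in sentence.split())

def en_to_fr(sentence):
    return " ".join(EN_FR.get(word, word) for word in sentence.split())

received = []  # records what the multi-source stub was called with

def es_en_to_fr(sources):
    # A real model would combine both inputs, e.g. by early averaging;
    # this stub only records them and translates the pivot.
    received.append(list(sources))
    source, pivot = sources
    return en_to_fr(pivot)

fr_one_to_one = pivot_one_to_one("hola mundo", es_to_en, en_to_fr)
fr_many_to_one = pivot_many_to_one("hola mundo", es_to_en, es_en_to_fr)
```

Both strategies call the first stage once, so the many-to-one variant adds only the cost of running a second encoder in the second stage.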

Finetuning with Pseudo Parallel Corpus
The failure of the naive zero-resource translation earlier (see Table 4 (a)) suggests that the context vectors returned by the encoder are not compatible with the decoder when the combination was not included during training. The good translation qualities of the translation paths included in training, however, imply that the representations learned by the encoders and decoders are good. Based on these two observations, we conjecture that all that is needed for a zero-resource translation path is a simple adjustment that makes the context vectors from the encoder compatible with the target decoder. Thus, we propose to adjust this zero-resource translation path, but without any additional parallel corpus.
First, we generate a small set of pseudo bilingual pairs of sentences for the zero-resource language pair of interest (Es→Fr). We randomly select N sentence pairs from a parallel corpus between the target language (Fr) and a pivot language (En) and translate the pivot side (En) into the source language (Es). Then, the pivot side is discarded, and we construct a pseudo parallel corpus consisting of sentence pairs of the source and target languages (Es-Fr).
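The construction of the pseudo parallel corpus can be sketched as follows. The function name, the toy Fr-En corpus and the dictionary-based En→Es translator are all hypothetical stand-ins; in the paper's setting the pivot-to-source translation would come from the trained multilingual model.

```python
import random

def build_pseudo_parallel(target_pivot_pairs, pivot_to_source, n, seed=0):
    """Sample n (target, pivot) pairs, translate the pivot side into the source
    language, discard the pivot, and return (pseudo_source, target) pairs."""
    rng = random.Random(seed)
    sampled = rng.sample(target_pivot_pairs, n)
    return [(pivot_to_source(pivot), target) for target, pivot in sampled]

# Toy Fr-En parallel corpus and a toy En->Es translator (both hypothetical).
FR_EN = [("bonjour", "hello"), ("merci", "thanks"),
         ("au revoir", "goodbye"), ("oui", "yes")]
EN_ES = {"hello": "hola", "thanks": "gracias", "goodbye": "adiós", "yes": "sí"}

def en_to_es(sentence):
    return EN_ES.get(sentence, sentence)

pseudo_es_fr = build_pseudo_parallel(FR_EN, en_to_es, n=2)
```

Note that the target side (Fr) of every pair is a true sentence; only the source side (Es) is machine-generated.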
We make a copy of the existing attention mechanism, to which we refer as the target-specific attention mechanism. We then finetune only this target-specific attention mechanism, using the generated pseudo parallel corpus. We do not update any other parameters in the encoder and decoder, because they are already well-trained (as evidenced by the high translation qualities in Table 2) and we want to avoid disrupting the well-captured structures underlying each language.
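The idea of copying the attention parameters and updating only the copy can be sketched with a plain parameter dictionary. This is an illustration under stated assumptions: the parameter names are hypothetical, the gradients are made up, and a plain gradient step stands in for the Adam optimizer used in the paper.

```python
import copy

def add_target_specific_attention(params, name="attention_es_fr"):
    """Return a parameter dict extended with an independent copy of the shared attention."""
    extended = dict(params)
    extended[name] = copy.deepcopy(params["attention"])
    return extended

def finetune_step(params, grads, trainable, lr=0.1):
    """Apply one gradient step to the trainable parameters only; all others stay frozen.
    A plain gradient step is used here as a stand-in for Adam."""
    updated = dict(params)
    for key in trainable:
        updated[key] = [p - lr * g for p, g in zip(params[key], grads[key])]
    return updated

# Hypothetical, tiny parameter vectors for illustration.
params = {"encoder_es": [0.1, 0.2], "decoder_fr": [0.3, 0.4], "attention": [0.5, -0.5]}
params = add_target_specific_attention(params)
grads = {key: [1.0] * len(value) for key, value in params.items()}  # made-up gradients
finetuned = finetune_step(params, grads, trainable=["attention_es_fr"])
```

Since the shared attention is left untouched, the translation paths seen during training behave exactly as before finetuning.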
Once the model has been finetuned with the pseudo parallel corpus, we can use any of the translation strategies described earlier in Sec. 3 for the finetuned zero-resource translation path.We expect a similar gain by using many-to-one translation, which we empirically confirm in the next section.

Settings
We use the same multi-way, multilingual model trained earlier in Sec. 4.2 to evaluate the zero-resource translation strategies. We emphasize here that this model was trained only using Es-En and Fr-En bilingual parallel corpora without any Es-Fr parallel corpus.
We evaluate the proposed approaches to zero-resource translation with the same multi-way, multilingual model from Sec. 4.1. We specifically select the path from Spanish to French (Es→Fr) as a target zero-resource translation path.

Result and Analysis
As mentioned earlier, we observed that the multi-way, multilingual model cannot directly translate between two languages when the translation path between those two languages was not included in training (Table 4 (a)). On the other hand, the model was able to translate decently with the pivot-based one-to-one translation strategy, as can be seen in Table 4 (b). Unsurprisingly, all the many-to-one strategies resulted in worse translation quality, which is due to the inclusion of the useless translation path (the direct path between the zero-resource pair, Es-Fr). These results clearly indicate that the multi-way, multilingual model trained with only bilingual parallel corpora is not capable of direct zero-resource translation as it is.

Settings
The proposed finetuning strategy raises a number of questions. First, it is unclear how many pseudo sentence pairs are needed to achieve a decent translation quality. Because the purpose of this finetuning stage is simply to adjust the shared attention mechanism so that it can properly bridge from the source-side encoder to the target-side decoder, we expect it to work with only a small number of pseudo pairs. We validate this by creating pseudo corpora of different sizes: 1k, 10k, 100k and 1m.
Second, we want to know how detrimental it is to use the generated pseudo sentence pairs compared to using true sentence pairs between the target language pair.
In order to answer this question, we compiled a true multi-way parallel corpus by combining the subsets of UN (7.8m), Europarl-v7 (1.8m), OpenSubtitles-2013 (1m), news-commentary-v7 (174k), LDC2011T07 (335k) and news-crawl (310k), and use it to finetune the model. This allows us to evaluate the effect of the pseudo and true parallel corpora on finetuning for zero-resource translation.
Lastly, we train single-pair models translating directly from Spanish to French by using the true parallel corpora. These models work as a baseline against which we compare the multi-way, multilingual models.
See the last row of Table 1.

Training
Unlike the usual training procedure described in Sec. 4.2, we compute the gradient for each update using only 60 sentence pairs when finetuning the model with the multi-way parallel corpus (either pseudo or true).

Result and Analysis
Table 5 summarizes all the results. The most important observation is that the proposed finetuning strategy with pseudo-parallel sentence pairs outperforms the pivot-based approach (using the early averaging strategy from Sec. 4.4) even when we used only 1,000 such pairs (compare (b) and (d)). As we increase the size of the pseudo-parallel corpus, we observe a clear improvement. Furthermore, these models perform comparably to or better than the single-pair model trained with 1M true parallel sentence pairs, although they never saw a single true bilingual sentence pair of Spanish and French (compare (a) and (d)). Even when we trained a single-pair model with 11m true parallel pairs, the model could not match the multilingual model finetuned with 1m true parallel pairs by achieving the translation quality of 24.26 BLEU on the test set.
Another interesting finding is that it is only beneficial to use true parallel pairs for finetuning the multi-way, multilingual models when there are enough of them (1m or more). When there are only a small number of true parallel sentence pairs, we even found using pseudo pairs to be more beneficial than true ones. This effect was more apparent when the direct one-to-one translation of the zero-resource pair was considered (see (c) in Table 5). This implies that the misalignment between the encoder and decoder can be largely fixed by using pseudo-parallel pairs only, and we conjecture that it is easier to learn from pseudo-parallel pairs as they better reflect the inductive bias of the trained model. When there is a large number of true parallel sentence pairs available, however, our results indicate that it is better to exploit them.
Unlike what we observed with multi-source translation in Sec. 3.2, we were not able to see any improvement by further averaging the early-averaged and late-averaged decoding schemes (compare (d) and (e)). This may be explained by the fact that the context vectors computed when creating a pseudo source (e.g., En from Es when Es→Fr) already contain all the information about the pseudo source. It is simply enough to take those context vectors into account via the early averaging scheme.
These results clearly indicate and verify the potential of the multi-way, multilingual neural translation model in performing zero-resource machine translation.More specifically, it has been shown that the translation quality can be improved even without any direct parallel corpus available, and if there is a small amount of direct parallel pairs available, the quality may improve even further.

Conclusion: Implications and Limitations
Implications
There are two main results in this paper. First, we showed that the multi-way, multilingual neural translation model by Firat et al. (2016) is able to exploit common, underlying structures across many languages in order to better translate when a source sentence is given in multiple languages. This confirms the usefulness in machine translation of positive language transfer, which has been believed to be an important factor in human language learning (Odlin, 1989; Ringbom, 2007). Furthermore, our result significantly expands the applicability of multi-source translation (Zoph and Knight, 2016), as it does not assume the availability of multi-way parallel corpora for training.
Second, the experiments on zero-resource translation revealed that it is not necessary to have a direct parallel corpus, or deep linguistic knowledge, between two languages in order to build a machine translation system. Importantly, we observed that the proposed approach of zero-resource translation is better both in terms of translation quality and data efficiency than more traditional pivot-based translation (Wu and Wang, 2007; Utiyama and Isahara, 2007). Considering that this is the first attempt at such zero-resource, or extremely low-resource, translation using neural machine translation, we expect substantial progress in the near future.

Limitations
Despite the promising empirical results presented in this paper, there are a number of shortcomings that need to be addressed in follow-up research. First, our experiments have been done only with three European languages: Spanish, French and English. More investigation with a diverse set of languages needs to be done in order to draw a more solid conclusion, such as was done in (Firat et al., 2016; Chung et al., 2016). Furthermore, the effect of varying sizes of available parallel corpora on the performance of zero-resource translation must be studied more in the future.
Second, although the proposed many-to-one translation is indeed generally applicable to any number of source languages, we have only tested a source sentence in two languages.We expect even higher improvement with more languages, but it must be tested thoroughly in the future.
Lastly, the proposed finetuning strategy requires the model to have an additional set of parameters relevant to the attention mechanism for a target, zero-resource pair. This implies that the number of parameters may grow linearly with respect to the number of target language pairs. We expect future research to address this issue by, for instance, mixing in the parallel corpora of high-resource language pairs during finetuning as well.

Table 2: One-to-one translation qualities using the multi-way, multilingual model and four separate single-pair models.

Table 3: Many-to-one quality (Es+Fr→En) using three translation strategies. Compared to Table 2 (a-b) we observe a significant improvement (up to 3+ BLEU), although the model was never trained in these many-to-one settings. The second column shows the quality by the ensemble of two separate single-pair models.

Table 4: Zero-resource translation from Spanish (Es) to French

Table 5: Zero-resource translation from Spanish (Es) to French (Fr) with finetuning. When pivot is √, English is used as a pivot language. Row (a) is from Table 4 (b).