A Teacher-Student Framework for Zero-Resource Neural Machine Translation

While end-to-end neural machine translation (NMT) has made remarkable progress recently, it still suffers from the data scarcity problem for low-resource language pairs and domains. In this paper, we propose a method for zero-resource NMT by assuming that parallel sentences have close probabilities of generating a sentence in a third language. Based on the assumption, our method is able to train a source-to-target NMT model (“student”) without parallel corpora available guided by an existing pivot-to-target NMT model (“teacher”) on a source-pivot parallel corpus. Experimental results show that the proposed method significantly improves over a baseline pivot-based model by +3.0 BLEU points across various language pairs.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015), which directly models the translation process in an end-to-end way, has attracted intensive attention from the community. Although NMT has achieved state-of-the-art translation performance on resource-rich language pairs such as English-French and German-English (Luong et al., 2015; Jean et al., 2015; Johnson et al., 2016), it still suffers from the unavailability of large-scale parallel corpora for translating low-resource languages. Due to the large parameter space, neural models usually learn poorly from low-count events, which makes NMT a poor choice for low-resource language pairs. Zoph et al. (2016) indicate that NMT obtains much worse translation quality than a statistical machine translation (SMT) system on low-resource languages.
As a result, a number of authors have endeavored to explore methods for translating language pairs without parallel corpora available. These methods can be roughly divided into two broad categories: multilingual and pivot-based. Firat et al. (2016b) present a multi-way, multilingual model with shared attention to achieve zero-resource translation. They fine-tune the attention part using pseudo bilingual sentences for the zero-resource language pair. Another direction is to develop a universal NMT model in multilingual scenarios (Johnson et al., 2016; Ha et al., 2016). They use parallel corpora of multiple languages to train one single model, which is then able to translate a language pair without parallel corpora available. Although these approaches prove to be effective, the combination of multiple languages in modeling and training leads to increased complexity compared with standard NMT.
Another direction is to achieve source-to-target NMT without parallel data via a pivot, which is either text (Cheng et al., 2016a) or image (Nakayama and Nishida, 2016). Cheng et al. (2016a) propose a pivot-based method for zero-resource NMT: it first translates the source language to a pivot language, which is then translated to the target language. Nakayama and Nishida (2016) show that using multimedia information as pivot also benefits zero-resource translation. However, pivot-based approaches usually need to divide the decoding process into two steps, which is not only more computationally expensive, but also potentially suffers from the error propagation problem (Zhu et al., 2013).
In this paper, we propose a new method for zero-resource neural machine translation. Our method assumes that parallel sentences should have close probabilities of generating a sentence in a third language. To train a source-to-target NMT model without parallel corpora available ("student"), we leverage an existing pivot-to-target NMT model ("teacher") to guide the learning process of the student model on a source-pivot parallel corpus. Compared with pivot-based approaches (Cheng et al., 2016a), our method allows direct parameter estimation of the intended NMT model, without the need to divide decoding into two steps. This strategy not only improves efficiency but also avoids error propagation in decoding. Experiments on the Europarl and WMT datasets show that our approach achieves significant improvements in terms of both translation quality and decoding efficiency over a baseline pivot-based approach to zero-resource NMT on Spanish-French and German-French translation tasks.
Figure 1: (a) The pivot-based approach and (b) the teacher-student approach to zero-resource neural machine translation. X, Y, and Z denote source, target, and pivot languages, respectively. We use a dashed line to denote that there is a parallel corpus available for the connected language pair. Solid lines with arrows represent translation directions. The pivot-based approach leverages a pivot to achieve indirect source-to-target translation: it first translates x into z, which is then translated into y. Our training algorithm is based on the translation equivalence assumption: if x is a translation of z, then $P(y|x; \theta_{x \to y})$ should be close to $P(y|z; \theta_{z \to y})$. Our approach directly trains the intended source-to-target model $P(y|x; \theta_{x \to y})$ ("student") on a source-pivot parallel corpus, with the guidance of an existing pivot-to-target model $P(y|z; \hat{\theta}_{z \to y})$ ("teacher").

Background
Neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) advocates the use of neural networks to model the translation process in an end-to-end manner. As a data-driven approach, NMT treats parallel corpora as the major source for acquiring translation knowledge. Let $x$ be a source-language sentence and $y$ be a target-language sentence. We use $P(y|x; \theta_{x \to y})$ to denote a source-to-target neural translation model, where $\theta_{x \to y}$ is a set of model parameters. Given a source-target parallel corpus $D_{x,y}$, which is a set of parallel source-target sentences, the model parameters can be learned by maximizing the log-likelihood of the parallel corpus:
$$\hat{\theta}_{x \to y} = \operatorname*{argmax}_{\theta_{x \to y}} \Big\{ \sum_{\langle x, y \rangle \in D_{x,y}} \log P(y|x; \theta_{x \to y}) \Big\}.$$
Given learned model parameters $\hat{\theta}_{x \to y}$, the decision rule for finding the translation with the highest probability for a source sentence $x$ is given by
$$\hat{y} = \operatorname*{argmax}_{y} \big\{ P(y|x; \hat{\theta}_{x \to y}) \big\}. \tag{1}$$
As a data-driven approach, NMT heavily relies on the availability of large-scale parallel corpora to deliver state-of-the-art translation performance (Johnson et al., 2016). Zoph et al. (2016) report that NMT obtains much lower BLEU scores than SMT if only small-scale parallel corpora are available. Therefore, the heavy dependence on the quantity of training data poses a severe challenge for NMT to translate zero-resource language pairs.
Simple and easy-to-implement, pivot-based methods have been widely used in SMT for translating zero-resource language pairs (de Gispert and Mariño, 2006; Cohn and Lapata, 2007; Utiyama and Isahara, 2007; Wu and Wang, 2007; Bertoldi et al., 2008; Wu and Wang, 2009; Zahabi et al., 2013; Kholy et al., 2013). As pivot-based methods are agnostic to model structures, they have been adapted to NMT recently (Cheng et al., 2016a; Johnson et al., 2016). Figure 1(a) illustrates the basic idea of pivot-based approaches to zero-resource NMT (Cheng et al., 2016a). Let X, Y, and Z denote source, target, and pivot languages. We use dashed lines to denote language pairs with parallel corpora available and solid lines with arrows to denote translation directions.
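Intuitively, the source-to-target translation can be indirectly modeled by bridging two NMT models via a pivot:
$$P(y|x; \theta_{x \to z}, \theta_{z \to y}) = \sum_{z} P(z|x; \theta_{x \to z}) \, P(y|z; \theta_{z \to y}). \tag{2}$$
As shown in Figure 1(a), pivot-based approaches assume that the source-pivot parallel corpus $D_{x,z}$ and the pivot-target parallel corpus $D_{z,y}$ are available. As it is impractical to enumerate all possible pivot sentences, the two NMT models are trained separately in practice:
$$\hat{\theta}_{x \to z} = \operatorname*{argmax}_{\theta_{x \to z}} \Big\{ \sum_{\langle x, z \rangle \in D_{x,z}} \log P(z|x; \theta_{x \to z}) \Big\}, \qquad \hat{\theta}_{z \to y} = \operatorname*{argmax}_{\theta_{z \to y}} \Big\{ \sum_{\langle z, y \rangle \in D_{z,y}} \log P(y|z; \theta_{z \to y}) \Big\}.$$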
Due to the exponential search space of pivot sentences, the decoding process of translating an unseen source sentence $x$ has to be divided into two steps:
$$\hat{z} = \operatorname*{argmax}_{z} \big\{ P(z|x; \hat{\theta}_{x \to z}) \big\}, \qquad \hat{y} = \operatorname*{argmax}_{y} \big\{ P(y|\hat{z}; \hat{\theta}_{z \to y}) \big\}.$$
The above two-step decoding process potentially suffers from the error propagation problem (Zhu et al., 2013): the translation errors made in the first step (i.e., source-to-pivot translation) will affect the second step (i.e., pivot-to-target translation). Therefore, it is necessary to explore methods to directly model source-to-target translation without parallel corpora available.
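The error propagation problem of two-step decoding can be illustrated with a small sketch. The Python snippet below is not taken from the experiments: the two dictionaries are hypothetical toy distributions standing in for trained source-to-pivot and pivot-to-target models, and every name and probability in it is made up for illustration. It contrasts two-step decoding with decoding under the marginal of Equation (2), which is tractable here only because the toy pivot space contains two sentences.

```python
# Hypothetical toy models; all sentences and probabilities are invented.
P_Z_GIVEN_X = {
    "x1": {"z_good": 0.45, "z_bad": 0.55},
}
P_Y_GIVEN_Z = {
    "z_good": {"y_correct": 0.9, "y_wrong": 0.1},
    "z_bad": {"y_correct": 0.2, "y_wrong": 0.8},
}


def argmax(dist):
    """Return the key with the highest probability."""
    return max(dist, key=dist.get)


def two_step_decode(x):
    """Pick the best pivot first, then the best target given that pivot."""
    z_hat = argmax(P_Z_GIVEN_X[x])
    y_hat = argmax(P_Y_GIVEN_Z[z_hat])
    return z_hat, y_hat


def marginal_decode(x):
    """Score each target by summing over all pivots, as in Equation (2)."""
    scores = {}
    for z, p_z in P_Z_GIVEN_X[x].items():
        for y, p_y in P_Y_GIVEN_Z[z].items():
            scores[y] = scores.get(y, 0.0) + p_z * p_y
    return argmax(scores), scores


print(two_step_decode("x1"))
print(marginal_decode("x1"))
```

In this toy setting the two-step decoder commits to the slightly more probable pivot and then outputs y_wrong, whereas the marginal distribution prefers y_correct. A directly trained source-to-target model aims to avoid this kind of early commitment.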

Assumptions
In this work, we propose to directly model the intended source-to-target neural translation based on a teacher-student framework. The basic idea is to use a pre-trained pivot-to-target model ("teacher") to guide, on a source-pivot parallel corpus, the learning process of a source-to-target model ("student") for which no parallel training data are available. One advantage of our approach is that Equation (1) can be used as the decision rule for decoding, which avoids the error propagation problem faced by two-step decoding in pivot-based approaches.
As shown in Figure 1(b), we still assume that a source-pivot parallel corpus $D_{x,z}$ and a pivot-target parallel corpus $D_{z,y}$ are available. Unlike pivot-based approaches, we first use the pivot-target parallel corpus $D_{z,y}$ to obtain a teacher model $P(y|z; \hat{\theta}_{z \to y})$, where $\hat{\theta}_{z \to y}$ is a set of learned model parameters. Then, the teacher model "teaches" the student model $P(y|x; \theta_{x \to y})$ on the source-pivot parallel corpus $D_{x,z}$ based on the following assumptions.
Assumption 1. If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target sentence y from x should be close to that from its counterpart z.
We can further introduce a word-level assumption:
Assumption 2. If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target word y from x should be close to that from its counterpart z, given the already obtained partial translation $y_{<j}$.
The two assumptions are empirically verified in our experiments (see Table 2). In the following subsections, we will introduce two approaches to zero-resource neural machine translation based on the two assumptions.

Sentence-Level Teaching
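Given a source-pivot parallel corpus $D_{x,z}$, our training objective based on Assumption 1 is defined as follows:
$$J_{\mathrm{SENT}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) = \sum_{\langle x, z \rangle \in D_{x,z}} \mathrm{KL}\Big( P(y|z; \hat{\theta}_{z \to y}) \,\Big\|\, P(y|x; \theta_{x \to y}) \Big), \tag{5}$$
where the KL divergence sums over all possible target sentences:
$$\mathrm{KL}\Big( P(y|z; \hat{\theta}_{z \to y}) \,\Big\|\, P(y|x; \theta_{x \to y}) \Big) = \sum_{y} P(y|z; \hat{\theta}_{z \to y}) \log \frac{P(y|z; \hat{\theta}_{z \to y})}{P(y|x; \theta_{x \to y})}. \tag{6}$$
As the teacher model parameters are fixed, the teacher's entropy term does not depend on $\theta_{x \to y}$, and the training objective can be equivalently written as
$$J_{\mathrm{SENT}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) = - \sum_{\langle x, z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z \to y}} \Big[ \log P(y|x; \theta_{x \to y}) \Big]. \tag{7}$$
In training, our goal is to find a set of source-to-target model parameters that minimizes the training objective:
$$\hat{\theta}_{x \to y} = \operatorname*{argmin}_{\theta_{x \to y}} \Big\{ J_{\mathrm{SENT}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) \Big\}. \tag{8}$$
With learned source-to-target model parameters $\hat{\theta}_{x \to y}$, we use the standard decision rule as shown in Equation (1) to find the translation $\hat{y}$ for a source sentence $x$.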
However, a major difficulty faced by our approach is the intractability in calculating the gradients because of the exponential search space of target sentences. To address this problem, it is possible to construct a sub-space by either sampling (Shen et al., 2016), generating a k-best list (Cheng et al., 2016b) or mode approximation (Kim and Rush, 2016). Then, standard stochastic gradient descent algorithms can be used to optimize model parameters.
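As a minimal sketch of the mode approximation mentioned above, the Python snippet below replaces the expectation over all target sentences with the teacher's single most probable sentence, which turns sentence-level teaching into ordinary likelihood training on a pseudo source-target corpus. The dictionaries are hypothetical toy stand-ins for the teacher and student models, and all names and numbers are assumptions made for illustration. The exact loss is computed only because the toy sentence space is tiny.

```python
import math

# Hypothetical toy teacher P(y|z) over whole target sentences.
TEACHER = {
    "z1": {"y_a": 0.6, "y_b": 0.3, "y_c": 0.1},
    "z2": {"y_a": 0.2, "y_d": 0.8},
}
# Hypothetical toy student P(y|x).
STUDENT = {
    "x1": {"y_a": 0.5, "y_b": 0.25, "y_c": 0.25},
    "x2": {"y_a": 0.5, "y_d": 0.5},
}
SOURCE_PIVOT_CORPUS = [("x1", "z1"), ("x2", "z2")]


def teacher_mode(z):
    """Most probable target sentence under the teacher for pivot z."""
    return max(TEACHER[z], key=TEACHER[z].get)


def build_pseudo_corpus(pairs):
    """Pair each source sentence with the teacher's mode translation."""
    return [(x, teacher_mode(z)) for x, z in pairs]


def mode_approx_loss(pseudo_corpus):
    """Negative log-likelihood of the teacher's mode under the student."""
    return -sum(math.log(STUDENT[x][y]) for x, y in pseudo_corpus)


def exact_loss(pairs):
    """Full expectation under the teacher; feasible only for toy spaces."""
    return -sum(
        p * math.log(STUDENT[x][y])
        for x, z in pairs
        for y, p in TEACHER[z].items()
    )


pseudo_corpus = build_pseudo_corpus(SOURCE_PIVOT_CORPUS)
print(pseudo_corpus)
print(mode_approx_loss(pseudo_corpus), exact_loss(SOURCE_PIVOT_CORPUS))
```

With a real NMT teacher, the mode would come from greedy or beam search on the pivot sentence rather than from an explicit table, and the student would be updated by stochastic gradient descent on the resulting pseudo corpus.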

Word-Level Teaching
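Instead of minimizing the KL divergence between the teacher and student models at the sentence level, we further define a training objective at the word level based on Assumption 2:
$$J_{\mathrm{WORD}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) = \sum_{\langle x, z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z \to y}} \Big[ J(x, y, z, \hat{\theta}_{z \to y}, \theta_{x \to y}) \Big], \tag{9}$$
where
$$J(x, y, z, \hat{\theta}_{z \to y}, \theta_{x \to y}) = \sum_{j=1}^{|y|} \mathrm{KL}\Big( P(y|z, y_{<j}; \hat{\theta}_{z \to y}) \,\Big\|\, P(y|x, y_{<j}; \theta_{x \to y}) \Big). \tag{10}$$
Equation (9) suggests that the teacher model $P(y|z, y_{<j}; \hat{\theta}_{z \to y})$ "teaches" the student model $P(y|x, y_{<j}; \theta_{x \to y})$ in a word-by-word way. Note that the KL divergence between the two models is defined at the word level:
$$\mathrm{KL}\Big( P(y|z, y_{<j}; \hat{\theta}_{z \to y}) \,\Big\|\, P(y|x, y_{<j}; \theta_{x \to y}) \Big) = \sum_{y \in V_y} P(y|z, y_{<j}; \hat{\theta}_{z \to y}) \log \frac{P(y|z, y_{<j}; \hat{\theta}_{z \to y})}{P(y|x, y_{<j}; \theta_{x \to y})}, \tag{11}$$
where $V_y$ is the target vocabulary. As the parameters of the teacher model are fixed, the training objective can be equivalently written as:
$$J_{\mathrm{WORD}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) = - \sum_{\langle x, z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z \to y}} \Big[ \sum_{j=1}^{|y|} \sum_{y \in V_y} P(y|z, y_{<j}; \hat{\theta}_{z \to y}) \log P(y|x, y_{<j}; \theta_{x \to y}) \Big]. \tag{12}$$
Therefore, our goal is to find a set of source-to-target model parameters that minimizes the training objective:
$$\hat{\theta}_{x \to y} = \operatorname*{argmin}_{\theta_{x \to y}} \Big\{ J_{\mathrm{WORD}}(D_{x,z}, \hat{\theta}_{z \to y}, \theta_{x \to y}) \Big\}.$$
We use approaches similar to those described for sentence-level teaching in Section 3.2 to approximate the full search space. After obtaining $\hat{\theta}_{x \to y}$, the same decision rule as shown in Equation (1) can be utilized to find the most probable target sentence $\hat{y}$ for a source sentence $x$.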
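The word-level objective can be sketched for a single target sentence as follows. The snippet assumes that the teacher and student each expose one probability vector over a three-word vocabulary per target position; these vectors and all function names are hypothetical. It also checks numerically that the summed KL divergence equals the summed cross-entropy minus the teacher's entropy, which is why dropping the entropy term leaves the minimizer unchanged.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """Cross-entropy of student q measured under teacher p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) over one vocabulary-sized distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def word_level_kl(teacher_steps, student_steps):
    """Sum of per-position KL divergences for one target sentence."""
    return sum(kl_divergence(p, q) for p, q in zip(teacher_steps, student_steps))


def word_level_cross_entropy(teacher_steps, student_steps):
    """Sum of per-position cross-entropies for one target sentence."""
    return sum(cross_entropy(p, q) for p, q in zip(teacher_steps, student_steps))


# Hypothetical per-position distributions for a two-word target sentence.
TEACHER_STEPS = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
STUDENT_STEPS = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

kl_total = word_level_kl(TEACHER_STEPS, STUDENT_STEPS)
ce_total = word_level_cross_entropy(TEACHER_STEPS, STUDENT_STEPS)
teacher_entropy = sum(entropy(p) for p in TEACHER_STEPS)
print(kl_total, ce_total - teacher_entropy)
```

Unlike the sentence-level sketch with a single pseudo target, each position here receives the teacher's full distribution over the vocabulary as a soft label.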

Setup
We evaluate our approach on the Europarl (Koehn, 2005) and WMT corpora. To compare with pivot-based methods, we use the same dataset as Cheng et al. (2016a). All the sentences are tokenized by the tokenize.perl script. All the experiments treat English as the pivot language and French as the target language.
For the Europarl corpus, we evaluate our proposed methods on Spanish-French (Es-Fr) and German-French (De-Fr) translation tasks in a zero-resource scenario. To avoid a trilingual corpus being formed by the source-pivot and pivot-target corpora, we split the overlapping pivot sentences of the original source-pivot and pivot-target corpora into two equal parts and merge them separately with the non-overlapping parts for each language pair. The development and test sets are from the WMT 2006 shared task (footnote 1). The evaluation metric is case-insensitive BLEU (Papineni et al., 2002) as calculated by the multi-bleu.perl script. To deal with out-of-vocabulary words, we adopt byte pair encoding (BPE) (Sennrich et al., 2016) to split words into sub-words. The sub-word vocabulary size is set to 30K for each language.
For the WMT corpus, we evaluate our approach on a Spanish-French (Es-Fr) translation task in a zero-resource setting. We combine the following corpora to form the Es-En and En-Fr parallel corpora: Common Crawl, News Commentary, Europarl v7 and UN. All the sentences are tokenized by the tokenize.perl script. Newstest2011 serves as the development set and Newstest2012 and Newstest2013 serve as test sets. We use case-sensitive BLEU to evaluate translation results. BPE is also used to reduce the vocabulary size. The sub-word vocabulary size is set to 43K, 33K, and 43K for Spanish, English and French, respectively. See Table 1 for detailed statistics for the Europarl and WMT corpora.
We leverage dl4mt, an open-source NMT toolkit implemented in Theano (footnote 2), for all the experiments and compare our approach with state-of-the-art multilingual methods (Firat et al., 2016b) and pivot-based methods (Cheng et al., 2016a). Two variations of our framework are used in the experiments:
1. Sentence-Level Teaching: for simplicity, we use the mode, as suggested by Kim and Rush (2016), to approximate the target sentence space when calculating the gradients of the expectation in Equation (7). We run beam search on the pivot sentence with the teacher model and choose the highest-scoring target sentence as the mode. Beam sizes k = 1 (greedy decoding) and k = 5 are investigated in our experiments, denoted as sent-greedy and sent-beam, respectively (footnote 3).
2. Word-Level Teaching: we use the same mode approximation approach as in sentence-level teaching to approximate the expectation in Equation (12), denoted as word-greedy (beam search with k = 1) and word-beam (beam search with k = 5), respectively. Besides, Monte Carlo estimation by sampling from the teacher model is also investigated since it introduces more diverse data, denoted as word-sampling.
Footnote 1: http://www.statmt.org/wmt07/shared-task.html
Footnote 2: dl4mt-tutorial: https://github.com/nyu-dl
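The difference between mode approximation and Monte Carlo sampling can be sketched with a toy teacher. In the Python snippet below, the table of next-word distributions is a hypothetical stand-in for the teacher's predictions for one fixed pivot sentence, conditioning only on the previous target word for brevity; the vocabulary, probabilities, and function names are assumptions made for illustration. Greedy decoding corresponds to beam search with k = 1; beam search with k = 5 is not implemented here.

```python
import random

# Hypothetical teacher for one fixed pivot sentence: next-word
# distribution given the previous target word (made-up numbers).
TEACHER_NEXT = {
    "<s>": {"le": 0.6, "un": 0.4},
    "le": {"chat": 0.7, "chien": 0.3},
    "un": {"chat": 0.5, "chien": 0.5},
    "chat": {"</s>": 1.0},
    "chien": {"</s>": 1.0},
}


def greedy_decode(max_len=10):
    """Always take the most probable next word (beam search with k = 1)."""
    prev, words = "<s>", []
    for _ in range(max_len):
        dist = TEACHER_NEXT[prev]
        word = max(dist, key=dist.get)
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)


def sample_decode(rng, max_len=10):
    """Draw each next word from the teacher's distribution."""
    prev, words = "<s>", []
    for _ in range(max_len):
        dist = TEACHER_NEXT[prev]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = {sample_decode(rng) for _ in range(200)}
print(greedy_decode())
print(sorted(samples))
```

Greedy decoding yields the same pseudo target every time, while sampling exposes the student to several distinct target sentences for the same pivot sentence, which is the data diversity referred to above.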

Assumptions Verification
To verify the assumptions in Section 3.1, we train a source-to-target translation model $P(y|x; \theta_{x \to y})$ and a pivot-to-target translation model $P(y|z; \theta_{z \to y})$ using the trilingual Europarl corpus. Then, we measure the sentence-level and word-level KL divergence from the source-to-target model $P(y|x; \theta_{x \to y})$ at different iterations to the trained pivot-to-target model $P(y|z; \hat{\theta}_{z \to y})$ by calculating $J_{\mathrm{SENT}}$ (Equation (5)) and $J_{\mathrm{WORD}}$ (Equation (9)) on 2,000 parallel source-pivot sentences from the development set of the WMT 2006 shared task. Table 2 shows the results. The source-to-target model is randomly initialized at iteration 0. We find that $J_{\mathrm{SENT}}$ and $J_{\mathrm{WORD}}$ decrease over time, suggesting that the source-to-target and pivot-to-target models do have small KL divergence at both sentence and word levels.
Table 3 gives BLEU scores on the Europarl corpus of our best performing sentence-level method (sent-beam) and word-level method (word-sampling) compared with pivot-based methods (Cheng et al., 2016a). We use the same data preprocessing as Cheng et al. (2016a).
Footnote 3: We can also adopt sampling and k-best list for approximation. Random sampling brings a large variance (Sutskever et al., 2014; Ranzato et al., 2015) for sentence-level teaching. For k-best list, we renormalize the probabilities
$$\tilde{P}(y|z; \hat{\theta}_{z \to y}) = \frac{P(y|z; \hat{\theta}_{z \to y})^{\alpha}}{\sum_{y' \in Y_k} P(y'|z; \hat{\theta}_{z \to y})^{\alpha}},$$
where $Y_k$ is the k-best list from beam search of the teacher model and $\alpha$ is a hyperparameter controlling the sharpness of the distribution (Och, 2003). We set $k = 5$ and $\alpha = 5 \times 10^{-3}$. The results on the test set for the Europarl corpus are 32.24 BLEU on Spanish-French translation and 24.91 BLEU on German-French translation, which are slightly better than the sent-beam method. However, considering the training time and the memory consumption, we believe mode approximation is already a good way to approximate the target sentence space for sentence-level teaching.
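The k-best renormalization described in footnote 3 can be sketched as follows. The function name and the example probabilities are hypothetical; the snippet works in log space because real sentence probabilities are typically far too small to exponentiate directly, which is an implementation choice made here rather than a detail stated in the text.

```python
import math


def renormalize_k_best(log_probs, alpha):
    """Raise k-best probabilities to the power alpha and renormalize.

    log_probs holds log P(y|z) for each sentence in the k-best list.
    """
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical teacher probabilities for a 5-best list.
K_BEST_LOG_PROBS = [math.log(p) for p in (0.05, 0.02, 0.01, 0.005, 0.001)]

flat = renormalize_k_best(K_BEST_LOG_PROBS, alpha=5e-3)
sharp = renormalize_k_best(K_BEST_LOG_PROBS, alpha=1.0)
print(flat)
print(sharp)
```

With a small alpha such as the 5×10^-3 used in footnote 3, the renormalized distribution over the k-best list is close to uniform, so all five candidates contribute almost equally to training; alpha = 1 keeps the teacher's original relative weights.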
We find that both the sent-beam and word-sampling methods outperform the pivot-based approaches in a zero-resource scenario across language pairs. Our word-sampling method improves over the best performing zero-resource pivot-based method (soft) on Spanish-French translation by +3.29 BLEU points and German-French translation by +3.24 BLEU points. In addition, the word-sampling method surprisingly obtains improvement over the likelihood method, which leverages a source-target parallel corpus. The significant improvements can be explained by the error propagation problem of pivot-based methods, which propagates translation errors from the source-to-pivot translation process to the pivot-to-target translation process.
Table 4: Comparison of our proposed methods on Spanish-French and German-French translation tasks from the Europarl corpus. English is treated as the pivot language.
Table 4 shows BLEU scores on the Europarl corpus of our five proposed methods.
For sentence-level approaches, the sent-beam method outperforms the sent-greedy method by +0.59 BLEU points on Spanish-French translation and +2.51 BLEU points on German-French translation on the test set. The results are in line with our observation in Table 2 that sentence-level KL divergence by beam approximation is smaller than that by greedy approximation. However, as the time complexity grows linearly with the number of beams k, the better performance is achieved at the expense of search time.
For word-level experiments, we observe that the word-sampling method performs much better than the other two methods: +1.94 BLEU points on Spanish-French translation and +1.88 BLEU points on German-French translation over the word-greedy method; +2.65 BLEU points on Spanish-French translation and +2.84 BLEU points on German-French translation over the word-beam method. Although Table 2 shows that word-level KL divergence approximated by sampling is larger than that by greedy or beam, sampling approximation introduces more data diversity for training, which outweighs the effect of the difference in KL divergence.
Table 5: Comparison with previous work on Spanish-French translation in a zero-resource scenario over the WMT corpus. The BLEU scores are case-sensitive. †: the method depends on two-step decoding.
We plot validation loss and BLEU scores over iterations on the German-French translation task in Figure 2. We observe that word-level models tend to have lower validation loss compared with sentence-level methods. Generally, models with lower validation loss tend to have higher BLEU. Our results indicate that this is not necessarily the case: the sent-beam method converges to a validation BLEU score 0.31 points higher than that of the word-beam method, even though its validation loss is 13 higher. Kim and Rush (2016) report a similar observation in data distillation for NMT and explain that student distributions are more peaked for sentence-level methods. This is indeed the case in our results: on the German-French translation task, the argmax for the sent-beam student model accounts (on average) for approximately 3.49% of the total probability mass, while the corresponding number is 1.25% for the word-beam student model and 2.60% for the teacher model.

Results on the WMT Corpus
The word-sampling method obtains the best performance among our five proposed approaches according to experiments on the Europarl corpus. To further verify this approach, we conduct experiments on the large-scale WMT corpus for Spanish-French translation. Table 5 shows the results of our word-sampling method in comparison with other state-of-the-art baselines. Cheng et al. (2016a) use the same datasets and the same preprocessing as ours. Firat et al. (2016b) utilize a much larger training set. Our method obtains significant improvement over the pivot baseline by +3.46 BLEU points on Newstest2012 and over many-to-one by +5.84 BLEU points on Newstest2013. Note that both methods depend on a source-pivot-target decoding path.
Table 6 shows translation examples of the pivot and likelihood methods proposed by Cheng et al. (2016a) and our proposed word-sampling method. For the pivot and likelihood methods, the Spanish sentence segment 'sentáis al volante' is lost when translated to English. Therefore, both methods miss this information in the translated French sentence. However, the word-sampling method generates 'volant sur', which partially translates 'sentáis al volante', resulting in improved translation quality of the target-language sentence.
groundtruth source: Os sentáis al volante en la costa oeste , en San Francisco , y vuestra misión es llegar los primeros a Nueva York .
pivot: You get in the car on the west coast , in San Francisco , and your task is to be the first one to reach New York .
Table 6: Examples and corresponding sentence BLEU scores of translations using the pivot and likelihood methods of Cheng et al. (2016a) and the proposed word-sampling method. We observe that our approach generates better translations than the methods of Cheng et al. (2016a). We italicize correct translation segments which are no shorter than 2-grams.

Results with Small Source-Pivot Data
The word-sampling method can also be applied to zero-resource NMT with a small source-pivot corpus. Specifically, the size of the source-pivot corpus is orders of magnitude smaller than that of the pivot-target corpus. To fulfill this task, we combine our best performing word-sampling method with the initialization and parameter freezing strategy proposed by Zoph et al. (2016). The Europarl corpus is used in the experiments. We set the size of German-English training data to 100K and use the same teacher model trained with 900K English-French sentences. Table 7 gives the BLEU score of our method on German-French translation compared with three other methods. Note that our task is much harder than transfer learning (Zoph et al., 2016) since the latter depends on a parallel German-French corpus. Surprisingly, our method outperforms all other methods. We significantly improve over the baseline pivot method by +5.63 BLEU points and over the state-of-the-art transfer learning method by +0.56 BLEU points.
Table 7: Comparison on the German-French translation task from the Europarl corpus with 100K German-English sentences. English is regarded as the pivot language. Transfer represents the transfer learning method of Zoph et al. (2016). 100K parallel German-French sentences are used for the MLE and transfer methods.

Related Work
Training NMT models in a zero-resource scenario by leveraging other languages has attracted intensive attention in recent years. Firat et al. (2016b) propose an approach that applies the multi-way, multilingual NMT model of Firat et al. (2016a) to zero-resource translation. They use the multi-way NMT model trained on other language pairs to generate a pseudo parallel corpus and fine-tune the attention mechanism of the multi-way NMT model to enable zero-resource translation. Several authors propose a universal encoder-decoder network in multilingual scenarios to perform zero-shot learning (Johnson et al., 2016; Ha et al., 2016). This universal model extracts translation knowledge from multiple different languages, making zero-resource translation feasible without direct training.
Besides multilingual NMT, another important line of research attempts to bridge source and target languages via a pivot language. This idea is widely used in SMT (de Gispert and Mariño, 2006; Cohn and Lapata, 2007; Utiyama and Isahara, 2007; Wu and Wang, 2007; Bertoldi et al., 2008; Wu and Wang, 2009; Zahabi et al., 2013; Kholy et al., 2013). Cheng et al. (2016a) propose pivot-based NMT by simultaneously improving source-to-pivot and pivot-to-target translation quality in order to improve source-to-target translation quality. Nakayama and Nishida (2016) achieve zero-resource machine translation by utilizing images as a pivot and training multimodal encoders to share a common semantic representation.
Our work is also related to knowledge distillation, which trains a compact model to approximate the function learned by a larger, more complex model or an ensemble of models (Bucila et al., 2006; Ba and Caruana, 2014; Li et al., 2014; Hinton et al., 2015). Kim and Rush (2016) first introduce knowledge distillation in neural machine translation. They suggest generating a pseudo corpus to train the student network. Compared with their work, we focus on zero-resource learning instead of model compression.

Conclusion
In this paper, we propose a novel framework to train the student model without parallel corpora available under the guidance of the pre-trained teacher model on a source-pivot parallel corpus. We introduce sentence-level and word-level teaching to guide the learning process of the student model. Experiments on the Europarl and WMT corpora across languages show that our proposed word-level sampling method significantly outperforms the state-of-the-art pivot-based methods and multilingual methods in terms of translation quality and decoding efficiency.
We also analyze zero-resource translation with small source-pivot data, and combine our word-level sampling method with the initialization and parameter freezing suggested by Zoph et al. (2016). The experiments on the Europarl corpus show that our approach obtains a significant improvement over the pivot-based baseline.
In the future, we plan to test our approach on more diverse language pairs, e.g., zero-resource Uyghur-English translation using Chinese as a pivot. It is also interesting to extend the teacher-student framework to other cross-lingual NLP applications as our method is transparent to architectures.