Unsupervised Neural Machine Translation with Future Rewarding

In this paper, we alleviate the local optimality of back-translation by learning a policy (an encoder-decoder defined by its parameters) with future rewarding under the reinforcement learning framework, which aims to optimize the global word predictions for unsupervised neural machine translation. To this end, we design a novel reward function that characterizes high-quality translations from two aspects: n-gram matching and semantic adequacy. The n-gram matching reward serves as an alternative to the discrete BLEU metric, and the semantic adequacy reward measures how adequately the meaning of the source sentence is conveyed to the target. During training, our model strives to earn higher rewards by learning to produce grammatically more accurate and semantically more adequate translations. In addition, a variational inference network (VIN) is proposed to constrain corresponding sentences in the two languages to have the same or similar latent semantic code. On the widely used WMT’14 English-French, WMT’16 English-German and NIST Chinese-to-English benchmarks, our models obtain 27.59/27.15, 19.65/23.42 and 22.40 BLEU points, respectively, without using any labeled data, demonstrating consistent improvements over previous unsupervised NMT models.


Introduction
Neural Machine Translation (Sutskever et al., 2014; Bahdanau et al., 2015) directly models the entire translation process by training an encoder-decoder model, and has achieved remarkable performance (Gehring et al., 2017; Vaswani et al., 2017) when provided with massive amounts of parallel corpora. However, the lack of large-scale parallel data is a serious problem for the vast majority of language pairs.
As a result, several works have recently tried to remove the dependence on parallel corpora by using an unsupervised setting, in which the NMT model only has access to two independent monolingual corpora, one for each language (Lample et al., 2018a; Artetxe et al., 2018b; Yang et al., 2018). In these works, the encoder and decoder act as a standard auto-encoder (AE) that is trained to reconstruct the inputs from their noised versions. Due to the lack of cross-language signals, unsupervised NMT usually requires pseudo-parallel data generated with the back-translation method to achieve the final goal of translating between source and target languages.
Back-translation typically uses beam search (Sennrich et al., 2016a) or just greedy search (Lample et al., 2018a,b) to generate synthetic sentences. Both are approximate algorithms to identify the maximum a posteriori (MAP) output, i.e. the sentence with the highest estimated probability given an input. Although back-translation with MAP prediction has been proved to be successful, it suffers from several apparent issues when trained with maximum likelihood estimation (MLE) only, including exposure bias and loss-evaluation mismatch. Thus, this method often fails to produce the optimal synthetic sentences for the subsequent training.
In this paper, we address the problem mentioned above with future rewarding for unsupervised NMT. The basic idea is to model the future direction of a translation and optimize the global word predictions under the policy gradient reinforcement learning framework. More concretely, we sample N translations via the policy for each input sentence and build a new objective function by combining the cross-entropy loss used in prior works with sequence-level rewards from policy gradient reinforcement learning. We consider the sequence-level reward from two aspects: 1) n-gram matching, which is the precision or recall of all sub-sequences of 1, 2, 3 and 4 tokens in the generated sequence and measures the accuracy of surface word predictions; 2) semantic adequacy, which is the similarity between the underlying semantic representations of the generated translation and the input sentence. These two aspects are inspired by general criteria for what properties a high-quality translation should have, and they are complementary to each other. Additionally, a variational inference network (VIN) is proposed to model the underlying semantics of monolingual sentences explicitly. It is used to map the source and target languages into a shared semantic space during auto-encoding, and to constrain sentences and their translated counterparts to have the same or similar semantic code during cross-language training.
The major contributions of this paper can be summarized as follows:
• We propose a novel learning paradigm for unsupervised NMT that models future rewards to optimize the global word predictions via policy gradient reinforcement learning. To enforce the underlying semantic space, we introduce a VIN into our model.
• We introduce an effective reward function that jointly accounts for the n-gram matching and the semantic adequacy of generated translations.
• We conduct extensive experiments on English-French, English-German and NIST Chinese-to-English translation tasks. Experimental results show that the proposed approach achieves significant improvements across different language pairs.

Unsupervised Neural Machine Translation
In this section, we first describe the composition of the introduced model and then give details of the newly proposed unsupervised training method.

Model Composition
The introduced translation model consists of six components: two encoders that share their last few layers, two completely independent decoders (one for each language), and two newly introduced VINs (one for each language). For the encoders and decoders, we follow the Transformer (Vaswani et al., 2017). Specifically, each encoder is composed of a stack of four identical layers, and each layer consists of a multi-head self-attention sub-layer and a fully connected feed-forward sub-layer. The encoders of the source and target languages are parameterized as $\Theta^{enc}_{src}$ and $\Theta^{enc}_{tgt}$ respectively, and the encoding operation is denoted as $e(x_l; \Theta^{enc}_l)$, where $x_l$ is the input sequence of word embeddings and $l \in \{src, tgt\}$. The decoders are also composed of four identical layers. In addition to the two sub-layers in each encoder layer, each decoder layer inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack; we refer the reader to Vaswani et al. (2017) for details. Similarly, we denote the source decoder as $\Theta^{dec}_{src}$, the target decoder as $\Theta^{dec}_{tgt}$, and the decoding operation as $d(x_l; \Theta^{dec}_l)$, $l \in \{src, tgt\}$. Each VIN is composed of a standard Gaussian distribution $\mathcal{N}(0, 1)$ as the prior and a neural posterior that is implemented as a feed-forward neural network and parameterized by $\psi_l$, $l \in \{src, tgt\}$.
In this work, the entire model is trained in an unsupervised manner by optimizing two objectives: 1) variational denoising auto-encoding; 2) cross-language training with future rewarding.

Variational Denoising Auto-Encoding
First, two auto-encoders are trained to reconstruct their respective inputs. In this form, each encoder should learn to compose the input sentence of its corresponding language, and each decoder is expected to learn to recover the original input sentence from this composition. However, without any constraint, the auto-encoder would make very literal word-by-word copies without capturing any internal structure of the input sentence. To address this issue, prior works often adopt the same strategy as Denoising Auto-Encoding (DAE) (Vincent et al., 2008) and add some noise to the input sentences (Hill et al., 2016). As shown in Figure 1, we augment the DAE with a variational inference network (VIN) to model the underlying semantics of monolingual sentences explicitly, which assumes that there exists a latent variable $z$ from this semantic space. This variable, together with the noised input sentence, guides the decoding process. With this assumption, we define the reconstruction objective as follows:
where $\Theta_{l \to l} = \Theta^{enc}_l \circ \Theta^{dec}_l \circ \psi_l$ represents the combination of $\Theta^{enc}_l$, $\Theta^{dec}_l$ and $\psi_l$, $l \in \{src, tgt\}$. $C$ denotes a stochastic noise model, for which we apply the same method as Lample et al. (2018a).
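To make the role of the noise model $C$ concrete, the following is a minimal Python sketch. It assumes that the noise consists of word dropout followed by a local shuffle, which is how the noise of Lample et al. (2018a) is commonly described; the function name `add_noise` and the default values `p_drop=0.1` and `k=3` are illustrative assumptions rather than settings stated in this paper.

```python
import random


def add_noise(tokens, p_drop=0.1, k=3, rng=None):
    """Corrupt a token list with word dropout followed by a local shuffle.

    Assumed form of the noise model C; p_drop and k are illustrative.
    """
    if not tokens:
        return []
    rng = rng or random.Random()
    # Word dropout: drop each token independently with probability p_drop,
    # but never return an empty sentence.
    kept = [t for t in tokens if rng.random() >= p_drop]
    if not kept:
        kept = [rng.choice(tokens)]
    # Local shuffle: add a uniform offset in [0, k + 1) to every position
    # and sort by the perturbed keys. No token moves more than k positions.
    keys = [i + rng.random() * (k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]
```

With `k=0` the order is preserved, and with `p_drop=0.0` the output is a local permutation of the input, so the two kinds of corruption can be studied separately.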
The continuous latent variable $z$, which acts as the underlying semantics here, is approximated by a neural posterior inference network $q_{\psi_l}(z \mid x_l)$. Following prior work, the posterior approximation is regarded as a diagonal Gaussian $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, and its mean $\mu$ and variance $\sigma^2$ are parameterized with deep neural networks. We also reparameterize $z$ as a function of $\mu$ and $\sigma$ (i.e., $z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is a standard Gaussian variable that introduces noise) rather than using the standard sampling method. We aim to map the source and target languages into a shared semantic space and use the following objective function for the VINs:
where $l \in \{src, tgt\}$ and $\mathrm{KL}(Q \| P)$ is the Kullback-Leibler divergence between $Q$ and $P$. We finally incorporate the auto-encoder and the VIN into an end-to-end neural network, and the overall training objective of auto-encoding is to minimize a loss function that combines the two objectives above.
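As an illustration of the two VIN ingredients described above (not the full training loss), the sketch below shows the reparameterization $z = \mu + \sigma \epsilon$ and the closed-form KL divergence between a diagonal Gaussian and the standard Gaussian prior. Parameterizing the variance through its logarithm (`log_var`) is an assumption made here for numerical convenience, and the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))
```

The KL term is zero exactly when the posterior equals the prior, which is what pulls the latent codes of both languages toward the same region of the semantic space.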

Cross-language Training with Future Rewarding
Besides auto-encoding, the second objective of unsupervised NMT is to constrain the model to map an input sentence from the source (target) language to the target (source) language. Due to the lack of alignment information between the two independent monolingual corpora, the back-translation method (Sennrich et al., 2016a) is used to synthesize a pseudo-parallel corpus for cross-language training. More concretely, an input sentence in one language is first translated into the other language (i.e., using the corresponding encoder and the decoder of the other language) by applying the model in inference mode with greedy decoding. The model is then trained to reconstruct the original sentence from this translation. The most widely used method in previous works for training sequence generation models, maximum likelihood estimation (MLE), assumes that the ground truth is provided at each step during training. The objective of MLE is defined as the maximization of the following log-likelihood:
where $\Theta_{l_2 \to l_1} = \Theta^{enc}_{l_2} \circ \Theta^{dec}_{l_1} \circ \psi_{l_2}$ represents the combination of $\Theta^{enc}_{l_2}$, $\Theta^{dec}_{l_1}$ and $\psi_{l_2}$. $z_p$ is approximated by the introduced VIN (i.e., reparameterized from the Gaussian $q_{\psi_{l_2}}(z_p \mid \tilde{x}_{l_2})$). $\tilde{x}_{l_2} = d(e(x_{l_1}; \Theta^{enc}_{l_1}); \Theta^{dec}_{l_2})$ is obtained by greedy decoding in inference mode ($l_1 = src$, $l_2 = tgt$ or $l_1 = tgt$, $l_2 = src$).
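The data flow of back-translation described above can be sketched as follows. The translator is passed in as a plain function; `make_pseudo_parallel` and the word-by-word `toy_translate` are hypothetical stand-ins for the model run in inference mode with greedy decoding, used only to show which side of each pair is synthetic.

```python
def make_pseudo_parallel(sentences_l1, translate_l1_to_l2):
    """Build (synthetic input, real target) pairs for cross-language training.

    Each real sentence in language l1 is translated into l2; the model is
    then trained to reconstruct the real l1 sentence from the synthetic one.
    """
    pairs = []
    for x_l1 in sentences_l1:
        x_l2 = translate_l1_to_l2(x_l1)  # stand-in for greedy decoding
        pairs.append((x_l2, x_l1))
    return pairs


# Toy word-by-word translator, for illustration only.
TOY_EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}


def toy_translate(tokens):
    return [TOY_EN_FR.get(t, "<unk>") for t in tokens]
```

For example, `make_pseudo_parallel([["the", "cat"]], toy_translate)` pairs the synthetic French input with the real English sentence it came from; the real sentence is always the training target.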

Future Rewarding
Unfortunately, maximizing $\mathcal{L}^{l_1}_{mle}$ does not always produce the best results on discrete evaluation metrics such as BLEU (Papineni et al., 2002): the accumulation of errors caused by exposure bias, together with the inconsistency between training and testing measurements, makes the models short-sighted. We bridge the discrepancy between training and testing modes caused by MLE by learning a policy to model future rewards, which can directly optimize the global word predictions and is made possible with reinforcement learning, as illustrated in Figure 2. To reduce the variance of the model, we use the self-critical policy gradient learning algorithm (Rennie et al., 2017).
For self-critical policy gradient learning, we produce two separate output sequences at each training iteration: $\hat{x}$, the sampled translation, which is obtained by sampling from the final output probability distribution, and $x^g$, the baseline output, obtained by performing a greedy search. Thus, the objective function of cross-language training can be redefined as the expected advantage of the sampled sequence over the baseline sequence:
where a terminal reward $r$ is observed after the generation reaches the end of each sentence. Note that including a baseline reward in the training objective reduces the variance of the model. Maximizing $\mathcal{L}_{rl}$ is equivalent to maximizing the conditional likelihood of the sampled sequence $\hat{x}$ if it obtains a higher reward than the baseline $x^g$, thus increasing the expected reward of our model.
Figure 2: Illustration of the proposed method for cross-language training with future rewarding. The three losses are abbreviated as $\mathcal{L}^{l_1}_{z}$, $\mathcal{L}^{l_1}_{mle}$ and $\mathcal{L}^{l_1}_{rl}$. $\mathcal{L}^{l_1}_{z}$ is an auxiliary function that constrains sentences and their translated counterparts in the other language to have the same or similar semantic codes.
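The self-critical objective can be illustrated with a small Python sketch. It assumes the commonly used form from Rennie et al. (2017), in which the advantage $r(\hat{x}) - r(x^g)$ scales the summed token log-probabilities of the sampled sequence; the function name and the choice to return a quantity to be minimized are assumptions, not details taken from this paper.

```python
def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Loss to minimize for one sampled sequence (assumed self-critical form).

    sample_log_probs: per-token log-probabilities of the sampled translation.
    sample_reward:    terminal reward of the sampled translation.
    greedy_reward:    terminal reward of the greedy (baseline) translation.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing -advantage * log p(sample) raises the likelihood of samples
    # that beat the greedy baseline and lowers it for samples that do not.
    return -advantage * sum(sample_log_probs)
```

When the sampled and greedy translations receive the same reward, the loss is zero and the sample contributes no gradient, which is the variance-reduction effect of the baseline.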

Reward
$r$ in Equation 5 denotes the sequence-level reward that evaluates the quality of generated translations. In this subsection, we discuss two major factors that contribute to the success of a translation, namely n-gram matching and semantic adequacy, and describe how to approximate these factors through computable reward functions.
N-gram matching. For a translation generated by an NMT model, we need to measure the accuracy of surface word predictions. For that purpose, the BLEU score (Papineni et al., 2002) is often used in previous works. However, the BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. Thus, we apply the smoothed version of GLEU as the reward for measuring n-gram precision or recall. More concretely, given a generated translation $\hat{x}_{l_1}$ in one language and the ground-truth reference $x_{l_1}$, we record all sub-sequences of 1, 2, 3 and 4 tokens in $\hat{x}_{l_1}$ and $x_{l_1}$, and start all n-gram counts from 1 instead of 0. Then we compute a recall $R_{gleu}$, which is the ratio of the number of matching n-grams to the total number of n-grams in $x_{l_1}$ (ground truth), and a precision $P_{gleu}$, which is the ratio of the number of matching n-grams to the total number of n-grams in $\hat{x}_{l_1}$ (generated output). Finally, the reward of the generated translation $\hat{x}_{l_1}$ on n-gram matching is defined as:
where $r_1$ ranges from zero to one and is symmetrical when switching $\hat{x}$ and $x$.
Semantic adequacy. We want the model to convey the meaning of the source sentence to the target as adequately as possible. Thus, we introduce another crucial reward function that measures the semantic adequacy of the generated translations. More concretely, for a generated translation $\hat{x}_{l_1}$ in one language, we compute the representation of $\hat{x}_{l_1}$ as:
$e_i = \mathrm{TFIDF}(w_i), \; w_i \in \hat{x}_{l_1}$
$w_i = e_i / \mathrm{Sum}(e_1, e_2, \ldots, e_{T_{\hat{x}_{l_1}}})$
Identically, for the corresponding input sentence in the other language, its representation $c_{\tilde{x}_{l_2}}$ can be extracted from the embedding matrix $\tilde{x}_{l_2}$.
As the source and target word embeddings are often mapped to a shared latent space in unsupervised NMT, we can directly use the following cosine similarity as the reward for semantic adequacy:
where $(\cdot, \cdot)$ indicates the dot product operation.
The final reward for a translation $\hat{x}_{l_1}$ is a linear combination of the rewards discussed above:
where $r_1(\hat{x}_{l_1})$ and $r_2(\hat{x}_{l_1})$ complement each other and work jointly to guide the learning of our model. The combination of these two rewards helps because it prevents a generated translation with high n-gram matching but low semantic adequacy from receiving a relatively high reward, and vice versa.
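The reward components discussed in this subsection can be sketched in Python as follows. Several details are assumptions made for illustration: the n-gram reward is taken to be the minimum of precision and recall (the usual GLEU definition), the smoothing is implemented by adding one to each count, the sentence representation is a TF-IDF weighted average of word embeddings, and the mixing weight `alpha` of the linear combination is hypothetical. Embeddings and TF-IDF scores are passed in as plain dictionaries.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of length 1..max_n in a token list."""
    return Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def ngram_reward(hyp, ref, max_n=4):
    """r1: min of smoothed n-gram precision and recall (assumed GLEU form)."""
    hyp_counts, ref_counts = ngram_counts(hyp, max_n), ngram_counts(ref, max_n)
    matches = sum((hyp_counts & ref_counts).values())
    # Assumed smoothing: every count starts from 1 instead of 0.
    precision = (matches + 1) / (sum(hyp_counts.values()) + 1)
    recall = (matches + 1) / (sum(ref_counts.values()) + 1)
    return min(precision, recall)


def sentence_vector(tokens, embeddings, tfidf):
    """TF-IDF weighted average of word embeddings; weights sum to one."""
    known = [t for t in tokens if t in embeddings]
    total = sum(tfidf.get(t, 0.0) for t in known)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    if total == 0.0:
        return vec
    for t in known:
        weight = tfidf.get(t, 0.0) / total
        for d, value in enumerate(embeddings[t]):
            vec[d] += weight * value
    return vec


def cosine(u, v):
    """r2: cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_reward(r1, r2, alpha=0.5):
    """Linear combination of the two rewards; alpha is a hypothetical weight."""
    return alpha * r1 + (1.0 - alpha) * r2
```

Because `ngram_reward` takes the minimum of precision and recall, swapping hypothesis and reference leaves it unchanged, which matches the symmetry property stated for $r_1$.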

Overall Objective Function
In addition to the aforementioned MLE objective function (Eq. 4) and the RL objective function (Eq. 5), there is an auxiliary function that constrains sentences and their translated counterparts to have the same or similar semantic code; it is defined as:
Finally, the overall training objective of cross-language training is to minimize the following loss function:
where $\eta$ is a scaling factor. At the beginning of training, $\eta = 0$; as training proceeds, we increase $\eta$ to slowly reduce the effect of the MLE loss. $\eta$ is updated as follows:
$\eta = \min\left(0.8, \max\left(0.0, \frac{steps - n_s}{n_e - n_s}\right)\right)$
where $steps$ is the number of global update steps performed so far, and $n_s$ and $n_e$ are the start and end steps for increasing $\eta$, respectively.
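The update rule for $\eta$ translates directly into a small Python function. The default values for `n_start` and `n_end` are the $n_s$ = 300K and $n_e$ = 400K reported in the hyper-parameter section; the function and parameter names are illustrative.

```python
def eta_schedule(step, n_start=300_000, n_end=400_000, eta_max=0.8):
    """Scaling factor: 0 before n_start, then linear growth capped at eta_max."""
    progress = (step - n_start) / (n_end - n_start)
    return min(eta_max, max(0.0, progress))
```

With these defaults, $\eta$ stays at 0 up to step 300K, reaches 0.5 at step 350K, and hits the cap of 0.8 at step 380K, that is, before $n_e$ is reached.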

Training Procedure
There are two stages in the proposed unsupervised training. In the first stage, we pre-train the proposed model with denoising auto-encoding and cross-language training until no improvement is achieved on the development set. This ensures that the model starts with a much better policy than a random one, so that it can focus on the good part of the search space. In the second stage, we use an annealing schedule to teach the model to produce stable sequences gradually. That is, after the initial pre-training steps, we continue training the model with future rewarding. During each iteration, we alternately perform one batch of denoising auto-encoding and cross-language training for the source and target languages. For model selection, we randomly extract 3000 source and target sentences to form a development set. Following Lample et al. (2018a), we translate the source sentences to the target language and then convert the resulting sentences back to the source language. The quality of the model is then evaluated by computing the BLEU score over the original inputs and their reconstructions via this two-step translation process. The performance is finally averaged over the two directions, and the selected model is the one with the highest score.

Experiments
We mainly evaluate the proposed approach on the widely used English-German, English-French and NIST Chinese-to-English translation tasks.

Datasets
For English-French and English-German, we use 30M sentences from the WMT monolingual News Crawl datasets from years 2007 through 2017. We use the publicly available implementation of the Moses scripts for tokenization. In addition, we use a shared vocabulary for source and target languages with 60K subword tokens based on byte-pair encoding (Sennrich et al., 2016b). We remove sentences longer than 50 subword tokens. Experimental results are reported on newstest2014 for English-French translation and newstest2016 for English-German translation. We adopt the same method as Lample et al. (2018b) to obtain cross-lingual embeddings.
For NIST Chinese-to-English translation, our training data consists of 1.6M sentence pairs randomly extracted from LDC corpora, which have been widely used in previous works. Similar to Yang et al. (2018), we build the monolingual datasets by randomly shuffling the Chinese and English sentences respectively, since the dataset is not big enough. We set the vocabulary size to 30K for both Chinese and English. The average BLEU score over NIST02∼06 is reported in this paper. To pre-train cross-lingual embeddings, we use the monolingual corpora to train the embeddings for each language independently with word2vec (Mikolov et al., 2013). Then we apply the public implementation (vecmap) proposed by Artetxe et al. (2017) to map these embeddings into a shared latent space, and keep the mapped embeddings fixed during training. For NIST Chinese-to-English, we evaluate translation performance with case-insensitive NIST BLEU computed by the script mteval-v13a.pl. For English-German and English-French, we evaluate translation performance with the script multi-bleu.perl.

Hyper-parameters
We set the following hyper-parameters: a word embedding dimension of 512, a self-attention hidden size of 512, a fully connected layer hidden size of 1024 and 8 attention heads. We share the last layer of the encoders across both languages. The dropout rate is set to 0.1, 0.3 and 0.2 during training for En-Fr, En-De and Zh-to-En, respectively. We train each model for a fixed number of iterations (500K) and set $n_s$ = 300K and $n_e$ = 400K to gradually increase the effect of future rewarding. We use the Adam optimizer with a simple learning rate schedule: we start with a learning rate of $10^{-4}$ and, after 300K updates, begin to halve the learning rate every 100K steps. We set the mini-batch size to 64. At decoding time, we use greedy search.
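The learning rate schedule can be written as a step function. The text leaves open whether the first halving happens at step 300K or at step 400K; the sketch below assumes the first halving takes effect at step 300K, and the function and parameter names are illustrative.

```python
def learning_rate(step, base_lr=1e-4, decay_start=300_000, decay_every=100_000):
    """Constant rate before decay_start, then halved every decay_every steps.

    Assumption: the first halving is applied at decay_start itself.
    """
    if step < decay_start:
        return base_lr
    halvings = (step - decay_start) // decay_every + 1
    return base_lr * 0.5 ** halvings
```

Under this reading, the rate is $10^{-4}$ up to step 300K, $5 \times 10^{-5}$ from 300K to 400K, and $2.5 \times 10^{-5}$ for the last 100K of the 500K training steps. Under the alternative reading, the `+ 1` would be dropped.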

Overall Results
Our method is compared with several previous unsupervised NMT systems (Artetxe et al., 2018b; Lample et al., 2018a,b; Yang et al., 2018; Wu et al., 2019; Song et al., 2019). Although Song et al. (2019) achieve results comparable to supervised NMT systems by using larger monolingual data (Wikipedia data) and a bigger model, we list the results they obtained with the same data and model as ours for a fair comparison. We also consider a "Baseline" model, which has the same architecture as described in Section 2.1 except for the variational inference network and is trained using MLE only. We directly copy the experimental results of previous models as reported in their papers and report the BLEU scores on the English-French, English-German and NIST Chinese-to-English test sets in Table 1.
vecmap: https://github.com/artetxem/vecmap
As shown in Table 1, our approach achieves BLEU scores of 27.59 and 27.15 on En→Fr and Fr→En translation respectively, outperforming Lample et al. (2018b) by more than 2 BLEU points in both directions. For En-De, we achieve 19.65 and 23.42 BLEU on En→De and De→En respectively, an improvement of up to +10.09 BLEU points over previous unsupervised NMT models. For Chinese-to-English translation, the proposed method leads to a substantial improvement (up to 54%) over the previous system reported in Yang et al. (2018). Compared to the baseline, our approach demonstrates significant improvements of more than 2 BLEU points on all three benchmarks. These results indicate that the newly proposed training method, which models future rewards to optimize global word predictions for unsupervised NMT, is promising and enables the model to generate high-quality translations.

Analysis
In this section, we conduct some analysis over the proposed method by taking English-French translation as an example.

Ablation Study
To understand the importance of different components of the proposed system, we perform an ablation study by training multiple versions of our model with some components missing: the variational inference network and the future rewarding method. Results are reported in Table 2. From the table, we can see that removing future rewarding causes the accuracy to drop by 0.98/1.02 BLEU points. Without the variational inference networks, the accuracy decreases by 0.62/0.69 BLEU points. These findings demonstrate that both future rewarding and the VIN are important, and both contribute to the improvement of translation accuracy. The more critical component is the future rewarding technique, which is vital for optimizing the global word predictions.

Qualitative Comparison of Back-translating
We perform a qualitative evaluation of the pseudo-parallel data generated with the back-translation method. To this end, we conduct a "round-trip" translation (e.g., $src \to \tilde{tgt} \to \hat{src}$), where $src$ and $\tilde{tgt}$ form a pseudo-parallel corpus and $\hat{src}$ is the reconstruction from $\tilde{tgt}$. We explore three measures for qualitative evaluation: 1) UNKs, the ratio of the number of unknown words to the total number of words in $\tilde{tgt}$; 2) SA, the average semantic adequacy over all sentences in $\tilde{tgt}$; 3) r-BLEU, the BLEU score over the original inputs and their reconstructions. All measures are finally averaged over the two directions.
Results are shown in Table 3. The proposed training method introduces significant boosts in all three measures, reducing the ratio of unknown words by 1.34 percentage points, increasing the semantic adequacy by 0.088 and improving r-BLEU by 5.85 points. This is in line with our expectations, as the proposed future rewarding method is not optimized to predict the next token, but rather to increase the long-term reward.

                    UNKs    SA      r-BLEU
Baseline            3.51%   0.794   54.23
+Future Rewarding   2.17%   0.882   60.08
Table 3: Qualitative comparison of the generated pseudo-parallel sentences from the models trained with MLE only and with the proposed training method on the English-French test set.

Table 4 shows four example translations. The first part shows examples for which the proposed model reached a higher BLEU score than the baseline model. We find that the translation produced by the baseline model does not adequately convey the meaning of the source sentence to the target. By contrast, the proposed future rewarding method enables the model to generate translations that are more diverse while preserving the meaning of the source sentences, such as "circonstance" and "come to light". A possible reason is that we apply the semantic adequacy reward, which rewards translations that have different syntactic structures and expressions but share the same meaning as the ground-truth sentence. The second part contains examples where the baseline achieved a better BLEU score than our model; that is, in a few cases, our model chooses inappropriate words that fall under the same topic as the reference words.

Related Work
In order to reduce the exposure bias and directly optimize the metrics used to evaluate sequence modeling tasks (like BLEU, ROUGE or METEOR), reinforcement learning (RL) has been widely used in many recent works on machine translation (Ranzato et al., 2016; Shen et al., 2016; He et al., 2017; Bahdanau et al., 2017; Li et al., 2017), text summarization (Paulus et al., 2018; Wu and Hu, 2018; Li et al., 2018), dialogue generation (Li et al., 2016), and question answering. However, our proposed method is the first to combine reinforcement learning with unsupervised NMT to explicitly enhance back-translation.
Recently, motivated by the success of cross-lingual embeddings (Artetxe et al., 2016; Zhang et al., 2017; Conneau et al., 2017), several works have tried to train NMT or SMT models in an unsupervised setting, in which the model only has access to unlabeled data. For example, Lample et al. (2018a) propose a model that consists of a single encoder and a single decoder for both languages, which are respectively responsible for encoding source and target sentences into a shared latent space and for decoding from that latent space to the source or target domain. Different from Lample et al. (2018a), Artetxe et al. (2018b) introduce a shared encoder but two independent decoders, one for each language. Both of these works utilize denoising auto-encoding to reconstruct their noisy inputs and incorporate back-translation into the cross-language training procedure. Further, Yang et al. (2018) extend the single encoder by using two independent encoders that share some of their weights, which alleviates the weakness of the shared encoder in keeping language-specific characteristics. The entire system is then fine-tuned by introducing two global GANs, one for each language. More recently, Artetxe et al. (2018a) and Lample et al. (2018b) propose an alternative approach based on phrase-based statistical machine translation, which profits from the modular architecture of SMT. In addition, Lample et al. (2018b) introduce a novel cross-lingual embedding training method that is particularly suitable for related languages (e.g., English-French and English-German). Ren et al. (2019) introduce SMT models as posterior regularization, in which SMT and NMT models boost each other through iterative back-translation in a unified EM training algorithm. Wu et al. (2019) propose an alternative to back-translation, extract-edit, which extracts and then edits real sentences from the target monolingual corpora. Lample and Conneau (2019) and Song et al. (2019) propose to pre-train cross-lingual language models for the initialization stage of unsupervised neural machine translation, which is critical to the performance of their proposed models. In contrast to these works, we propose an effective training method for unsupervised NMT that models future rewards to optimize the global word predictions via neural policy reinforcement learning, and that can easily be applied to arbitrary architectures and language pairs.

Conclusion
In this paper, we have proposed a novel learning paradigm for unsupervised NMT that models future rewards to optimize the global word predictions via reinforcement learning, in which we design an effective reward function that jointly accounts for the n-gram matching and the semantic adequacy of generated translations. To constrain corresponding sentences in the two languages to have the same or similar semantic code, we also introduce a variational inference network into the proposed model.
We test the proposed model on the WMT'14 English-French, WMT'16 English-German and NIST Chinese-to-English translation tasks. Experimental results show that our approach leads to significant improvements across various language pairs, especially on distantly related languages such as Chinese and English.

Acknowledgments
We would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Key Research and Development Program of China (No. 2017YFB0803301). Yue Hu is the corresponding author.