A Retrieve-and-Rewrite Initialization Method for Unsupervised Machine Translation

The commonly used framework for unsupervised machine translation builds initial translation models for both translation directions, and then performs iterative back-translation to jointly boost their translation performance. The initialization stage is very important since bad initialization may wrongly squeeze the search space, and too much noise introduced in this stage may hurt the final performance. In this paper, we propose a novel retrieve-and-rewrite method to better initialize unsupervised translation models. We first retrieve semantically comparable sentences from the monolingual corpora of the two languages and then rewrite the target side with a designed rewriting model to minimize the semantic gap between the source and retrieved targets. The rewritten sentence pairs are used to initialize SMT models, which in turn generate pseudo data for two NMT models, followed by iterative back-translation. Experiments show that our method builds better initial unsupervised translation models and improves the final translation performance by over 4 BLEU points. Our code is released at https://github.com/Imagist-Shuo/RRforUNMT.git.


Introduction
Recent work has shown successful practices of unsupervised machine translation (UMT) (Artetxe et al., 2017; Lample et al., 2017, 2018; Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019; Lample and Conneau, 2019). The common framework is to build two initial translation models (i.e., source-to-target and target-to-source) and then do iterative back-translation (Sennrich et al., 2016a; Zhang et al., 2018) with pseudo data generated by each other. The initialization stage is important because bad initialization may wrongly squeeze the search space, and too much noise introduced in this stage may hurt the final performance. Previous methods for UMT (Lample et al., 2018; Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019) usually use the following n-gram embedding based initialization. They first build phrase translation tables with the help of unsupervised cross-lingual n-gram embeddings (Conneau et al., 2017; Artetxe et al., 2018a), and then use them together with two language models to build two initial Phrase-based Statistical Machine Translation (PBSMT) (Koehn et al., 2003) models. However, there are two problems with these initialization methods.

* Contribution during internship at MSRA.
(1) Some complex sentence structures of the original training sentences are hard to recover with the n-gram translation tables. (2) The initial translation tables inevitably contain much noise, which is amplified in the subsequent process.
In this paper, we propose a novel retrieve-and-rewrite initialization method for UMT. Specifically, we first retrieve semantically similar sentence pairs from the monolingual corpora of the two languages with the help of unsupervised cross-lingual sentence embeddings. Next, with those retrieved similar sentence pairs, we run GIZA++ (Och and Ney, 2003) to get word alignments, which are used to delete unaligned words in the target side of the retrieved sentences. The modified target sentences are then rewritten with a designed sequence-to-sequence rewriting model to minimize the semantic gap between the source and target sides. Taking the pairs of source sentences and their corresponding rewritten targets as pseudo parallel data, we build two initial PBSMT models (source-to-target and target-to-source), which are used to generate pseudo parallel data to warm up NMT models, followed by an iterative back-translation training process.
Our contributions are threefold. (1) We propose a novel method to initialize unsupervised MT models with a retrieve-and-rewrite schema, which can preserve rich sentence structures and provide high-quality phrases. (2) We design an effective sequence-to-sequence architecture based on the Transformer to rewrite sentences with semantic constraints. (3) Our method significantly outperforms previous non-pre-training based UMT results on en-fr and en-de translation tasks, and gives the first unsupervised en-zh translation results on WMT17.

Method
Our method consists of three steps, as shown in Figure 1. First, we perform similar sentences retrieval (§2.1) from two monolingual corpora with the help of unsupervised cross-lingual sentence embeddings. Next, to minimize the semantic gap between the source and retrieved targets, we perform target sentences rewriting (§2.2): we delete unaligned words in the target side and generate complete, better-aligned targets via our rewriting model, using the missing information provided by the source. After that, we treat the rewritten pairs as pseudo parallel data for translation models initialization and training (§2.3).

Similar Sentences Retrieval
Given two monolingual corpora D_x and D_y of languages X and Y respectively, we first build unsupervised cross-lingual word embeddings of X and Y using fastText (Bojanowski et al., 2017) and vecmap (Artetxe et al., 2018a), and then obtain cross-lingual sentence embeddings from the cross-lingual word embeddings via SIF (Arora et al., 2017). After that, we use margin-based scoring (Artetxe and Schwenk, 2018) to retrieve similar sentences from the two corpora. Examples retrieved from monolingual English and Chinese corpora are shown in Figure 5 in Appendix A.
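The retrieval step above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: `sif_embeddings` computes the SIF sentence embedding (frequency-weighted average of word vectors followed by removal of the first principal component), and `margin_scores` implements the ratio variant of margin-based scoring, where each cosine similarity is normalized by the mean similarity to the k nearest neighbours in both directions. All function names, the smoothing constant `a`, and `k=4` are assumptions for illustration.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """SIF sentence embeddings (Arora et al., 2017): weighted average of
    word vectors, then remove the projection onto the first principal
    component. sentences: list of token lists; word_vecs: token -> vector."""
    dim = len(next(iter(word_vecs.values())))
    embs = []
    for sent in sentences:
        v, n = np.zeros(dim), 0
        for w in sent:
            if w in word_vecs:
                # down-weight frequent words: a / (a + p(w))
                v += a / (a + word_prob.get(w, 1e-5)) * word_vecs[w]
                n += 1
        embs.append(v / max(n, 1))
    embs = np.array(embs)
    u, _, _ = np.linalg.svd(embs.T @ embs)  # first PC of the embedding set
    pc = u[:, :1]
    return embs - embs @ pc @ pc.T

def margin_scores(src_embs, tgt_embs, k=4):
    """Margin-based scoring (Artetxe and Schwenk, 2018), ratio variant:
    cos(x, y) divided by the mean cosine to the k nearest neighbours
    of x and of y."""
    s = src_embs / (np.linalg.norm(src_embs, axis=1, keepdims=True) + 1e-9)
    t = tgt_embs / (np.linalg.norm(tgt_embs, axis=1, keepdims=True) + 1e-9)
    cos = s @ t.T
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return cos / ((knn_src + knn_tgt) / 2)
```

In practice one would retrieve, for each source sentence, the target sentence with the highest margin score and keep the pair only if the score exceeds a threshold.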

Target Sentences Rewriting
As shown in Figure 2, having retrieved similar sentence pairs {x, y}, we first run GIZA++ (Och and Ney, 2003) on these pairs to obtain word alignment information. Then, for each target sentence y, we remove its unaligned words according to the lexical translation probabilities of the GIZA++ output, replacing each deleted word with DEL to get the incomplete target sentence y′. Meanwhile, we record the unaligned words in the source as x_1^m, where m is the number of unaligned source words. Next, we feed y′ and x_1^m into a sequence-to-sequence model to generate the refined target sentence ŷ. The rewritten pairs {x, ŷ} are used as training data to train initial UMT systems.

Figure 3: The architecture of the rewriting model. We modify the input of the Transformer encoder into two parts. The first part is the incomplete target sentence y′, which is the same as the original Transformer input; the second part is the sequence of unaligned source words x_1^m, for which we remove the positional encoding because the order of these words is not a concern.
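The construction of the rewriting model's inputs from the aligner's output can be sketched as follows. This is an illustrative helper, not the paper's code: `build_rewrite_inputs` and its index-set arguments (which would come from parsing GIZA++ alignments, a step omitted here) are hypothetical names.

```python
def build_rewrite_inputs(src_tokens, tgt_tokens, aligned_src_idx, aligned_tgt_idx):
    """Given the sets of aligned token indices from a word aligner
    (e.g. GIZA++), replace unaligned target words with DEL to form the
    incomplete target y', and collect unaligned source words as x_1^m."""
    y_prime = [w if i in aligned_tgt_idx else "DEL"
               for i, w in enumerate(tgt_tokens)]
    x_m = [w for i, w in enumerate(src_tokens) if i not in aligned_src_idx]
    return y_prime, x_m
```

The pair (y′, x_1^m) is then fed to the rewriting model, which outputs the refined sentence ŷ.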
Our rewriting model, shown in Figure 3, is a modification of the Transformer (Vaswani et al., 2017). We initialize the embedding layer of the second input part with pre-trained cross-lingual word embeddings, because its content should be independent of languages, and keep it fixed during training. The second part thus acts as a memory recording the semantic information of the words. We concatenate the readout embeddings of both parts with a separator and feed them to the Transformer encoder, so that the attention mechanism attends over both parts together. For model training, due to the lack of references, we need to build training data for the rewriting model from the monolingual corpus D_y. First, we remove 20 to 30 percent of the words from a given sentence y ∈ D_y and replace them with DEL to get y′. Next, we randomly swap contiguous words in y′ with probability 0.2 to introduce some noise. Then we record the removed words as a set s_1^m and randomly drop/add some words from/to this set. We treat y′ and s_1^m as the inputs, and y as the output, to train the model. For inference, we feed the incomplete sentence y′ and the unaligned source words x_1^m into the trained model and generate the refined sentence ŷ. Note that there seems to be a bias between training and inference: s_1^m during training is in the same language as y, while during inference the words come from the source language X. But this bias is eliminated because the second input part of the encoder is the readout cross-lingual embeddings, which are independent of languages.
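The synthetic training-data construction described above can be sketched as a single corruption function. This is a hedged illustration: the 20-30% deletion range and the 0.2 swap probability come from the text, while the drop/add noise rate on the removed-word set (`set_noise_p`) and all function names are assumptions.

```python
import random

def make_training_pair(tokens, del_lo=0.2, del_hi=0.3, swap_p=0.2,
                       set_noise_p=0.1, rng=None):
    """Build one synthetic (y', s, y) example for the rewriting model:
    delete 20-30% of words (-> DEL), swap adjacent words with prob 0.2,
    and noise the removed-word set by randomly dropping/adding words."""
    rng = rng or random.Random()
    n = len(tokens)
    k = max(1, int(n * rng.uniform(del_lo, del_hi)))
    drop = set(rng.sample(range(n), k))
    y_prime = ["DEL" if i in drop else w for i, w in enumerate(tokens)]
    # randomly swap contiguous words to introduce noise
    for i in range(n - 1):
        if rng.random() < swap_p:
            y_prime[i], y_prime[i + 1] = y_prime[i + 1], y_prime[i]
    removed = [tokens[i] for i in sorted(drop)]
    # randomly drop some words from / add some words to the set s_1^m
    noisy_set = [w for w in removed if rng.random() > set_noise_p]
    if n and rng.random() < set_noise_p:
        noisy_set.append(rng.choice(tokens))
    return y_prime, noisy_set, tokens  # inputs (y', s), output y
```

At inference time the same model receives the real y′ and the unaligned source words x_1^m instead of the synthetic set.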

Translation Models Initialization and Training
Once we obtain the pairs {x, ŷ} generated above, we use them to train two initial PBSMT models, and use these SMT models to produce pseudo data to warm up two NMT models, followed by iterative back-translation.

Setup

Dataset
In our experiments, we consider three language pairs: English-French (en-fr), English-German (en-de), and English-Chinese (en-zh). For en, fr, and de, we use 50 million monolingual sentences from NewsCrawl 2007-2017. For zh, we use the Chinese side of the WMT17 en-zh parallel data. For ease of comparison, we use newstest2014 as the test set for en-fr, newstest2016 for en-de, and newstest2017 for en-zh. The data preprocessing is described in Appendix D.

Baselines
Our method is compared with eight unsupervised MT baselines, listed in the upper area of Table 1. The first three baselines are unsupervised NMT models, and the fourth is an unsupervised PBSMT model. The fifth baseline is an extract-and-edit schema for unsupervised neural machine translation. The sixth and seventh baselines are hybrid NMT/PBSMT models, and the last baseline is a pre-training based method.

Overall Results
The comparison results are reported in Table 1. From the table, we find that our method significantly outperforms the best non-pre-training based baseline by an average of 4.63 BLEU points over all pairs. Note that Lample and Conneau (2019) is based on pre-training, which uses much more monolingual data than our method. Even so, we reach comparable results on the en-fr pair.

Comparison of Initial SMT Models
We compare the performance of SMT models initialized with different methods in Table 2. The three SMT-based baselines initialize their SMT models with phrase tables inferred from n-gram embeddings and language models. From the table, we find that our proposed method provides better initialization for SMT models. Even the SMT models trained with only the retrieved sentences achieve higher performance than previous methods, which verifies that the noise within the retrieved sentences is largely random and can be easily eliminated by SMT models, consistent with Khayrallah and Koehn (2018). With the target sentences rewritten by our rewriting model, the quality of the extracted phrases is further improved. We also tried directly training NMT models with the rewritten pseudo data, but only obtained BLEU scores under 10, which indicates there is still much noise in the pseudo pairs for SMT to eliminate.

Discussion of Rewriting Model
We build two test sets to quantify the performance of our rewriting models. The first test set, denoted "in-domain", is drawn from our synthetic training data. As described before, we build training samples from monolingual data according to the rules in §2.2. We select 8M sentences from the monolingual corpus of a given language for model training and randomly sample 8k sentences each as development and test sets. In addition, we also test our rewriting model on newstest2014 (en-fr), denoted "out-domain". We first run GIZA++ on the parallel sentences in the original test set to find the golden alignments between source and target words. Next, we randomly delete up to 30% of the words on the target side and record their aligned source words. Then we feed the incomplete target sentence and the recorded source words into our model to recover the original target. The BLEU scores on both test sets are listed in Table 3, which shows that our rewriting model performs well.

Related Work

Lample and Conneau (2019) propose a pre-training method and achieve state-of-the-art performance on unsupervised en-fr and en-de translation tasks, but they use much more monolingual data from Wikipedia than previous work and this paper. We must also mention the work of Wu et al. (2019), who similarly use a retrieve-and-rewrite framework for unsupervised MT. However, ours differs from theirs in two aspects. First, we efficiently calculate the cross-lingual sentence embeddings via a training-free method, SIF, rather than a pre-trained language model. Second, our rewriting method is based on word alignment information, which is more explicit than their max pooling, and our rewriting model is simpler but effective, so that the rewriting results can be directly used without extra training techniques.
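The "out-domain" test construction described in this section (delete up to 30% of target words, then look up their aligned source words from the gold alignments) can be sketched as follows. The function name and argument layout are hypothetical; `align` is assumed to be a list of (source index, target index) pairs parsed from the GIZA++ output.

```python
import random

def make_eval_example(src_tokens, tgt_tokens, align, max_del=0.3, rng=None):
    """Build one out-domain test example: delete up to max_del of the
    target words (-> DEL) and record the source words aligned to them."""
    rng = rng or random.Random()
    n = len(tgt_tokens)
    k = rng.randint(1, max(1, int(n * max_del)))
    drop = set(rng.sample(range(n), k))
    y_prime = ["DEL" if j in drop else w for j, w in enumerate(tgt_tokens)]
    # source words aligned to the deleted target positions
    hints = [src_tokens[i] for (i, j) in align if j in drop]
    return y_prime, hints, tgt_tokens  # model input, hint words, reference
```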

Conclusion
In this paper, we propose a novel method for unsupervised machine translation with a retrieve-and-rewrite schema. We first retrieve similar sentences from monolingual corpora and then rewrite the targets with a rewriting model. With this pseudo parallel data, we better initialize PBSMT models and, as the experiments show, significantly improve the final iteration performance.

A Examples of Retrieval
Examples retrieved from monolingual English and Chinese corpora are shown in Figure 5. With this method, we can retrieve not only highly similar sentences, as in the first case, but also sentence pairs with rich sentence structures, as in the second. The remaining retrieved pairs, though containing some noise, also provide high-quality alignments after rewriting, according to our observation. The noted pattern is a hierarchical translation rule, an example of a rich sentence structure.

B Examples of Rewriting
We list some rewriting cases from en to zh in this section. Figure 6 shows some retrieved sentence pairs before and after rewriting, demonstrating the effectiveness of our retrieval method and rewriting model. In the first case, the unaligned word "CPSC" is replaced with the right one, "她" (she); the unrelated words "锂离子" (lithium-ion) and "消费者" (consumer) are removed; and "设备" (device) and "爆炸" (explosion) are added to the rewritten sentence. In the second case, the unaligned word "小组" (group) is replaced with the right one, "科学家们" (scientists); the unrelated words "迎来" (welcome) and "天文学" (astronomy) are removed; and "最大" (biggest) and "突破" (breakthrough) are added. The two cases show that our rewriting model can produce target sentences that are better aligned with the given sources.

C Examples of Translation

Figure 7 shows some translation results generated by our unsupervised MT models to exemplify the final performance. The cases verify that our method empowers the models to learn rich sentence structures, such as the hierarchical translation rules "be A that B" → "是 B 的 A" in the first case and "act as if A" → "表现 的 好像 A 一样" in the second. This means that our initialization method can preserve the rich sentence structures of the original monolingual sentences, thus giving better initialization for UMT models.

D Data Preprocessing
We use Moses scripts for tokenization and truecasing. For Chinese tokenization, we use our in-house tool. For SMT, we use the Moses implementation of hierarchical PBSMT systems with Salm