Machine Translation With Weakly Paired Documents

Neural machine translation, which achieves near human-level performance for some language pairs, relies heavily on large amounts of parallel sentences, which hinders its applicability to low-resource language pairs. Recent works explore unsupervised machine translation with monolingual data only, which leads to much lower accuracy than supervised translation. Observing that weakly paired bilingual documents are much easier to collect than bilingual sentences, e.g., from Wikipedia, news websites or books, in this paper we investigate training translation models with weakly paired bilingual documents. Our approach contains two components. 1) We provide a simple approach to mine implicitly aligned bilingual sentence pairs from document pairs, which can then be used as supervised training signals. 2) We leverage the topic consistency of two weakly paired documents and learn the sentence translation model by constraining the word distribution-level alignments. We evaluate our method on weakly paired documents from Wikipedia on six tasks: the widely used WMT16 German↔English, WMT13 Spanish↔English and WMT16 Romanian↔English translation tasks. We obtain 24.1/30.3, 28.1/27.6 and 30.1/27.6 BLEU points respectively, outperforming previous results by more than 5 BLEU points in each direction and reducing the gap between unsupervised translation and supervised translation by up to 50%.


Introduction
Neural machine translation (NMT) is a great success of deep learning for natural language processing (Koehn et al., 2003; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Vaswani et al., 2017; Gehring et al., 2017). It has significantly outperformed statistical machine translation and reached near human-level performance for several language pairs (Wu et al., 2016; Hassan et al., 2018). Such breakthroughs depend heavily on the availability of large-scale bilingual sentence pairs. Taking the WMT14 English→French task as an example, NMT uses 38 million parallel sentence pairs for training (Vaswani et al., 2017). As such bilingual sentence pairs are costly to collect, the success of NMT has not been fully realized for the vast majority of language pairs, especially low-resource ones. Recently, Artetxe et al. (2017), among others, tackle this challenge by training NMT models using only monolingual data, which achieves reasonably good accuracy but still falls far short of state-of-the-art supervised models.
While it is costly to collect bilingual sentence pairs by human translation, we notice that there exist many weakly paired bilingual documents on the Web. For example, for the entity "machine learning", Wikipedia has articles in multiple languages (e.g., the English and German articles). Though the two articles have very similar content, they are not sentence-by-sentence translations, since they may be created independently by different people. Similarly, an English news article from the BBC and a Chinese news article from China Daily may talk about the same event but differ in details. Furthermore, the versions of a popular novel in different languages are usually liberal rather than literal translations. We call such weakly aligned documents weakly paired bilingual documents. In this paper, we explore learning NMT models from such weakly paired documents, which has several advantages. First, weakly paired documents are much easier to obtain than bilingual sentence pairs. We can obtain bilingual document pairs from Wikipedia pages, aligned news articles on the internet, or even books. Second, such weakly paired documents have great coverage of different languages. For example, Wikipedia covers more than 178 languages and most of them have pages paired with English pages. This means that learning translation models from paired bilingual documents is possible for many language pairs. Weakly aligned document pairs can be utilized for NMT training in two ways. First, although two such documents are not exactly sentence-by-sentence translations, it is possible that a specific sentence in one document is the translation of a sentence in the other document. Such a sentence pair can be used as a bilingual signal for training.
For example, in Wikipedia, although the web structure and paragraphs are generally different for the entity "Beijing" in English and "Pekin" in French, we find that the first sentences of the two pages are semantically quite similar: both define Beijing as the capital of China (English: Beijing, formerly romanized as Peking, is the capital of the People's Republic of China. French: Pekin, également appelée Beijing, est la capitale de la République populaire de Chine.) The challenge is how to mine such sentence pairs from those document pairs. Second, although the sentences in two weakly paired documents are not aligned, the topics of the documents are well aligned. Such topic alignment is a strong signal that can be used to train NMT models.
In this paper, we propose a method to train NMT models by leveraging weakly paired bilingual documents from Wikipedia. The key idea contains two aspects: a) Mining implicitly aligned sentence pairs. We provide a simple and efficient method to mine bilingual sentence pairs from weakly aligned document pairs. With cross-lingual word embeddings trained from two monolingual corpora (one for each language) using MUSE (Conneau et al., 2017), we use the weighted average of the word embeddings in a sentence as the sentence embedding. We then select the sentence pairs whose sentence embeddings have large cosine similarity as bilingual sentence pairs to train NMT models. b) Leveraging topic alignment as regularization. Many previous works suggest that the word distribution characterizes the topic of a document well (Petterson et al., 2010; Du et al., 2015; Funatsu et al., 2014; Pedrosa et al., 2016; Chemudugunta et al., 2007). To leverage the topic consistency between weakly paired documents, we minimize the KL-divergence between the word distributions of the ground-truth document and the model-generated document.
Taking the Wikipedia corpus as the training data, we test our method on the widely used WMT16 German↔English, WMT13 Spanish↔English and WMT16 Romanian↔English translation tasks. Our method achieves 24.1/30.3 BLEU points for WMT16 German↔English translation, 28.1/27.6 BLEU points for WMT13 Spanish↔English translation and 30.1/27.6 BLEU points for WMT16 Romanian↔English translation, outperforming the previous strong unsupervised method by more than 5 BLEU points and reducing the gap between unsupervised translation and supervised translation by up to 50%.

Our method
We contribute two ways to leverage weakly paired documents: mining implicitly aligned sentence pairs from the document pairs, and aligning the topic distributions of the two documents in a weakly aligned pair. Before diving into details, we first introduce some notation. Denote $D = \{(d^X_i, d^Y_i);\ i = 1, \ldots, M\}$ as the set of weakly paired documents, in which document $d^X_i$ is aligned to $d^Y_i$. For example, $d^X_i$ and $d^Y_i$ are two cross-lingually linked Wikipedia pages. Denote $n^X_i$ and $n^Y_i$ as the number of sentences in documents $d^X_i$ and $d^Y_i$ respectively. Note that usually $n^X_i \neq n^Y_i$. Where no confusion arises, we denote $x$ as a sentence in language $X$ and $y$ as a sentence in language $Y$. We denote $enc$ as the encoder for languages $X$ and $Y$, which maps a sentence $x$ or $y$ into a sequence of real vectors using parameters $\theta_{enc}$. We use $dec$ with parameters $\theta_{dec}$ as the decoder, which takes the encoded vectors and a target language tag ($X$ or $Y$) as inputs and outputs a probability distribution over sentences in the target language. Let $\theta$ denote all the parameters of the translation model. Similar to Artetxe et al. (2017), such a model can handle both $X \to Y$ and $Y \to X$ translations.

Mining implicitly aligned sentence pairs
Different language versions of Wikipedia pages about the same entity or event are usually created by different people speaking different native languages, and therefore most sentences in two weakly aligned documents are not aligned. Even so, there is still a small chance that some bilingual sentences are aligned. We try to mine such implicitly aligned sentence pairs and use them as supervision for NMT model training.
Many works rely on aligned sentences or bilingual dictionaries to mine sentence pairs (Adafre and De Rijke, 2006; Yasuda and Sumita, 2008; Fung and Cheung, 2004). For example, Imankulova et al. (2017) extract bilingual sentence pairs from Wikipedia using a well-trained translation model learned from supervised sentence pairs. This method does not work for us, since we are in the unsupervised scenario and do not have any aligned bilingual sentence pairs. Instead, our idea is to compute the similarity of two sentences using their cross-lingual sentence embeddings and choose the pairs with large similarity as aligned bilingual sentence pairs. Sentence embeddings are widely used to measure textual similarity in text classification tasks (Arora et al., 2017; Le and Mikolov, 2014; Wieting et al., 2015). Arora et al. (2017) compute the weighted average of the word embeddings in a sentence, where the weight depends on word frequency, and then remove from each weighted average sentence embedding its projection onto the first principal component. This method achieves good performance on a range of monolingual textual similarity tasks. We extend their method from monolingual sentence embeddings to cross-lingual sentence embeddings, given cross-lingual word embeddings pretrained using MUSE (Conneau et al., 2017). The detailed method is described as follows.
For each word $w$, we denote $e_w$ as the word embedding trained with MUSE (Conneau et al., 2017), $p(w)$ as the frequency of $w$ estimated from another document, and $a$ as a predefined parameter used to calculate the weight of each word embedding. We denote $\hat{e}_s$ as the weighted average sentence embedding of sentence $s$, which following Arora et al. (2017) is $\hat{e}_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} e_w$, and $E$ as the embedding matrix of all the sentences over the monolingual corpus. Then we remove the first principal component $u_1$ of $E$ from every weighted average sentence embedding $\hat{e}_s$. Removing the first principal component makes the sentence embeddings more diverse and expressive in the embedding space, and thus the resulting embeddings are more effective. We use the resulting embedding $e_s$ as the final sentence embedding, i.e., $e_s = \hat{e}_s - u_1 u_1^\top \hat{e}_s$.
Based on the sentence embeddings, we estimate the similarity between two sentences in different languages by their cosine similarity $\mathrm{sim}(s^X, s^Y) = \frac{\langle e_{s^X}, e_{s^Y} \rangle}{\|e_{s^X}\| \, \|e_{s^Y}\|}$. For each weakly aligned document pair $(d^X_i, d^Y_i)$, we have $n^X_i \times n^Y_i$ pairs of sentences and form a bipartite graph between the sentences of the two documents, where the weight of an edge between two cross-lingual sentences is their cosine similarity score. The goal is to find the most confident edges (sentence pairs) in this weighted bipartite graph. We adopt a greedy selection approach with two constraints. The first constraint is that the weight of a selected edge must be larger than a threshold $c_1$, which ensures that the two sentences are similar enough. The second constraint is that the weight of a selected edge must exceed the weights of all other edges connected to its two nodes (sentences) by a threshold $c_2$, which ensures that the selected pair is unique enough. Denote $S = \{(s^X_j, s^Y_j)\}$ as the set of selected sentence pairs. We use those pairs as supervision for model training, i.e., we minimize the negative log-likelihood of the selected pairs under the translation model.
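The mining procedure above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sif_embeddings`, `cosine_matrix`, `mine_pairs`) are hypothetical, the first principal component is estimated from the sentences passed in rather than from a full monolingual corpus, and the second constraint is read as requiring a margin larger than $c_2$ over every competing edge.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=0.001):
    """Weighted-average sentence embeddings with the first principal component removed.

    sentences: list of token lists; word_vecs: dict token -> 1-D array;
    word_freq: dict token -> estimated probability p(w).
    """
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        known = [w for w in sent if w in word_vecs]
        if not known:
            continue  # sentences without known words keep an all-zero row
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in known])
        vectors = np.stack([word_vecs[w] for w in known])
        emb[i] = weights @ vectors / len(known)
    # First right-singular vector of the (uncentered) embedding matrix.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u1 = vt[0]
    # Subtract each row's projection onto u1.
    return emb - np.outer(emb @ u1, u1)

def cosine_matrix(ex, ey, eps=1e-8):
    """Pairwise cosine similarity between rows of ex and rows of ey."""
    ex = ex / (np.linalg.norm(ex, axis=1, keepdims=True) + eps)
    ey = ey / (np.linalg.norm(ey, axis=1, keepdims=True) + eps)
    return ex @ ey.T

def mine_pairs(sim, c1=0.7, c2=0.1):
    """Greedily pick (i, j) index pairs from the similarity matrix of one document pair."""
    n_y = sim.shape[1]
    pairs, used_x, used_y = [], set(), set()
    # Visit candidate edges from the highest similarity downwards.
    for flat in np.argsort(-sim, axis=None):
        i, j = divmod(int(flat), n_y)
        score = sim[i, j]
        if score <= c1:
            break  # constraint 1: every remaining edge is too weak
        if i in used_x or j in used_y:
            continue
        # Strongest competing edge touching either of the two sentences.
        row_rest = np.delete(sim[i], j)
        col_rest = np.delete(sim[:, j], i)
        rival = max(row_rest.max(initial=-np.inf), col_rest.max(initial=-np.inf))
        if score - rival > c2:  # constraint 2: margin over all competitors
            pairs.append((i, j))
            used_x.add(i)
            used_y.add(j)
    return pairs
```

For one document pair, `mine_pairs(cosine_matrix(emb_x, emb_y))` would return the indices of the selected sentence pairs, where `emb_x` and `emb_y` are the embeddings of the sentences in the two documents.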

Aligning Topic Distribution
Although cross-lingually linked Wikipedia pages are not aligned at the sentence level, they are usually aligned in topic distribution because they talk about the same event or entity. For example, the English topical words "politician", "United States" and "president" will appear in the English page for "Donald Trump", and similar topical words in French will appear in the corresponding French page. That is, if we translate an article from English to French sentence by sentence, the word distribution of the translated article should be generally similar to the word distribution of the corresponding article in French. Here we leverage this document-level word-distribution alignment to enhance and regularize the training of a translation model. Given an NMT model, we first translate a document $d^X_i$ sentence by sentence and obtain a document $\hat{d}^Y_i$. Then we compare the word distributions of the generated document $\hat{d}^Y_i$ and the ground-truth document $d^Y_i$, and use this signal to optimize the model. However, a straightforward loss design, e.g., the KL-divergence or Wasserstein distance between the word distributions of $\hat{d}^Y_i$ and $d^Y_i$, is not differentiable w.r.t. the NMT model, due to the non-differentiable operation (greedy search or beam search) used while generating $\hat{d}^Y_i$. To address this challenge, we need to design a loss function that is smooth with respect to the model parameters. Our proposal is to treat each generated sentence $\hat{s}^Y_{i,k} \in \hat{d}^Y_i$ as fixed and to "refeed" the pair $(s^X_{i,k}, \hat{s}^Y_{i,k})$ to the model to get the probability distribution over all the words at each position. We calculate the word distribution of the generated document by averaging these word probability distributions over all sentences and positions of the generated document. We simply use the KL-divergence as the objective: the document alignment loss for $X \to Y$ translation is the KL-divergence between the word distribution of the ground-truth document $d^Y_i$ and this averaged distribution. The corresponding document loss for $Y \to X$ translation can be defined in the same way.
The final document alignment loss combines the losses of the two directions and can be optimized using backpropagation thanks to its smoothness.
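The word-distribution comparison can be sketched with NumPy as below. This is only an illustration under stated assumptions: the function names are hypothetical, the per-position distributions are taken as given arrays (in practice they would come from refeeding each source sentence and its fixed translation to the model), the ground-truth distribution is assumed to be the empirical token frequency of the reference document, and the divergence is assumed to be KL(reference || model). A real training setup would compute the same quantities in an automatic differentiation framework so that gradients reach the model parameters.

```python
import numpy as np

def model_word_distribution(step_probs):
    """Average per-position vocabulary distributions over a generated document.

    step_probs: list with one array per generated sentence, each of shape
    (sentence_length, vocab_size) and with rows summing to 1.
    """
    stacked = np.concatenate(step_probs, axis=0)
    return stacked.mean(axis=0)

def reference_word_distribution(doc_token_ids, vocab_size):
    """Empirical word distribution of a ground-truth document (list of token-id lists)."""
    counts = np.bincount(np.concatenate(doc_token_ids), minlength=vocab_size)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); entries where p is zero contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask] + eps))))

def document_alignment_loss(step_probs, ref_token_ids, vocab_size):
    """Document loss for one translation direction (assumed form, see above)."""
    q = model_word_distribution(step_probs)
    p = reference_word_distribution(ref_token_ids, vocab_size)
    return kl_divergence(p, q)
```

Because the averaged distribution is a smooth function of the per-position probabilities, the loss stays differentiable even though the generated sentences themselves are held fixed.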

Overall algorithm
In addition to the above two ways of using Wikipedia data, the sentences in the weakly paired documents can be used as monolingual data to optimize the losses of unsupervised machine translation. Therefore, our proposed losses can be combined with the loss functions of unsupervised machine translation. Here we first recap unsupervised machine translation, and then present our algorithm in Algorithm 1.
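Unsupervised machine translation considers two loss functions: a denoising auto-encoder loss and a reconstruction loss.
Given a monolingual sentence $s$ in language $X$ or $Y$, the denoising auto-encoder loss is defined as $L_{dae} = -E_{s \sim X}[\log P(s \mid c(s); \theta)] - E_{s \sim Y}[\log P(s \mid c(s); \theta)]$, where $c(\cdot)$ is a noise function that drops and swaps words in sentence $s$. As for the reconstruction loss, given $s$ in language $X$ (or $Y$) and its translation $\hat{s}$ in $Y$ (or $X$) produced by the model $P_{X \to Y}$ (or $P_{Y \to X}$), the reconstruction loss is defined as $L_{rec} = -E_{s \sim X}[\log P(s \mid \hat{s}; \theta)] - E_{s \sim Y}[\log P(s \mid \hat{s}; \theta)]$. Denote the combination of the monolingual data in languages $X$ and $Y$ as $M$. We define the overall loss on monolingual data $M$ as $L_m(M; \theta) = L_{dae} + L_{rec}$. Finally, the overall training objective of our algorithm is to minimize a weighted combination of $L_m(M; \theta)$, the negative log-likelihood on the mined sentence pairs $S$ and the document alignment loss on the weakly paired documents $D$, with hyperparameters $\alpha$ and $\beta$ as the weights.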
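The noise function $c(\cdot)$ can be illustrated with a short Python sketch. The text only states that words are dropped and swapped, so everything specific here is an assumption: the function name `add_noise`, the drop probability `drop_prob`, and the local shuffle controlled by `max_shift`, which follows a common recipe where each kept token moves by at most `max_shift` positions.

```python
import random

def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a corrupted copy of `tokens`: random word drops plus a local shuffle.

    drop_prob and max_shift are illustrative values, not settings from the paper.
    """
    rng = rng or random.Random()
    # Drop each word independently with probability drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    # Local shuffle: sort positions by index plus uniform noise, so a token
    # can move by at most max_shift positions.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)
    return [kept[i] for i in order]
```

The denoising auto-encoder is then trained to reconstruct the original sentence `tokens` from `add_noise(tokens)`.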

Algorithm 1 Training Algorithm
Require: Initial translation model with parameters θ; monolingual dataset M, implicitly aligned sentence pairs dataset S, weakly paired documents dataset D; optimizer Opt
while not converged do
    Randomly sample a mini-batch of monolingual sentences from M, implicitly aligned sentence pairs from S and weakly paired documents from D
    Update θ by minimizing the overall objective L using optimizer Opt
end while

Experiments
We test our method on several benchmark translation tasks. We first describe the data preparation and experimental design, and then present the main results, followed by further analyses.

Data preparation
Wikimedia offers free copies of all available content on Wikipedia for multiple languages. We download the language-specific Wikipedia contents and use WikiExtractor to extract and clean the text. Many Wikipedia pages contain external links to pages that describe the same entity in different languages. We extract weakly paired documents using these external links. We filter out a document pair if either document in the pair contains fewer than 5 sentences. We also remove sentences longer than 100 words. We conduct experiments on three language pairs: English-German (En-De), English-Spanish (En-Es) and English-Romanian (En-Ro). The former two are high-resource language pairs, while the last is a low-resource language pair. Statistics of the processed Wikipedia documents are provided in Table 1.
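The two filtering rules can be written down as a small Python sketch. The function name `filter_document_pairs` and the input layout (each document as a list of sentence strings) are hypothetical, words are counted by whitespace splitting, and applying the document-level filter before the sentence-level filter is an assumption, since the text does not state the order.

```python
def filter_document_pairs(pairs, min_sentences=5, max_words=100):
    """Filter weakly paired documents.

    pairs: iterable of (doc_x, doc_y), each document a list of sentence strings.
    Drops a pair if either document has fewer than min_sentences sentences,
    then removes sentences longer than max_words whitespace-separated words.
    """
    kept = []
    for doc_x, doc_y in pairs:
        if len(doc_x) < min_sentences or len(doc_y) < min_sentences:
            continue  # too short to be a useful document pair
        doc_x = [s for s in doc_x if len(s.split()) <= max_words]
        doc_y = [s for s in doc_y if len(s.split()) <= max_words]
        kept.append((doc_x, doc_y))
    return kept
```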
We use the same monolingual data as Lample et al. (2018), together with Wikipedia document pairs, to train NMT models. For the En-De task, we use all available sentences from the WMT monolingual News Crawl datasets from 2007 to 2017, containing about 50 million sentences for each language. For Romanian, there are more than 2 million sentences. For En-Es, we use the News Crawl datasets from 2007 to 2012, containing about 10 million sentences. The translation models are evaluated on the newstest 2016 dataset for En-De and En-Ro, and on the newstest 2013 dataset for En-Es, which are widely used (Koehn and Knowles, 2017).

Experimental Design
To mine implicitly aligned sentences from weakly paired documents, we use the open-source word embeddings trained with fastText (Joulin et al., 2016) and use MUSE (https://github.com/facebookresearch/MUSE) to build cross-lingual word embeddings. We then generate sentence embeddings as the weighted average of cross-lingual word embeddings and further remove the first principal component of the sentence embedding matrix, as introduced in Section 2.1. We set the two thresholds to $c_1 = 0.7$ and $c_2 = 0.1$ when selecting sentence pairs, and set the parameter $a$ used to calculate the weight of each word embedding to 0.001. To translate a document $d^X_i$ for topic alignment, we use greedy search.
For the training of translation models, the monolingual datasets, Wikipedia document pairs and mined sentence pairs are jointly processed by BPE (Sennrich et al., 2016b) with 60,000 codes. For model initialization, we follow Lample et al. (2018) and use cross-lingual BPE embeddings to initialize the shared lookup tables. The cross-lingual BPE embeddings are trained by fastText with embedding dimension 512, a context window of size 5 and 10 negative samples. We adopt the Transformer (Vaswani et al., 2017) architecture in our experiments. We stack 6 layers in both the encoder and the decoder. Following Lample et al. (2018), we share the lookup tables between the encoder and the decoder, and between the source and target languages. The dimension of the hidden state is 512. The weights $\alpha$ and $\beta$ of the loss functions are set to 1 and 0.05. For training, we use the Adam optimizer (Kingma and Ba, 2014) and the same learning rate scheduler as Vaswani et al. (2017). For decoding, we use beam search with beam width 4 and length penalty 0.6, and report the case-sensitive BLEU score.
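For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warmup and then decays it with the inverse square root of the step number. A minimal Python sketch follows; the function name `transformer_lr` is hypothetical, `d_model=512` matches the hidden dimension stated above, and `warmup_steps=4000` is the default from Vaswani et al. (2017), assumed here because this section does not report the value.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps start at 1).

    Linear warmup for warmup_steps steps, then inverse square-root decay.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks at `step == warmup_steps`, where both arguments of `min` are equal.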

Main Results
Our method is compared with several previous works (Lample et al., 2018) in Table 2. We also consider some simple heuristic ways of using the weakly paired documents as baselines. One baseline, referred to as "NMT + First Wikipedia Sentence", directly uses the first sentences of the aligned documents as an aligned sentence pair to train the NMT model. The motivation is that the first sentence of a Wikipedia document usually summarizes the main content of the document and is therefore more likely to be similar across languages. The second baseline, referred to as "NMT + Document Translation", treats the weakly aligned documents as two long sentences and uses them as a bilingual sentence pair to train NMT models.
From Table 2, our approach achieves BLEU scores of 24.1 and 30.3 on En→De and De→En translation respectively. The previous best performance is achieved by Lample et al. (2018): combining a phrase-based approach with neural machine translation, they achieve 20.2 and 25.2 BLEU on En→De and De→En translation with monolingual data only. Our approach outperforms their method by about 4 and 5 BLEU points on En→De and De→En respectively, with limited weakly paired sentences and documents. For En-Es, we achieve 28.1 and 27.6 BLEU on En→Es and Es→En respectively, an improvement of more than 8 and 7 points over the unsupervised Transformer baseline models. Similarly, for the low-resource En-Ro task, we achieve 30.1 and 27.6 BLEU for En→Ro and Ro→En respectively, an improvement of more than 5 and 4 points over previously reported results. One interesting point is that only about 6k aligned sentence pairs are mined by our approach for En-Ro (compared with nearly 200k aligned sentence pairs for En-De, as reported in Section 3.5), but we can still achieve a large BLEU improvement. This result demonstrates the effectiveness of our method in real low-resource settings and indicates that the sentences mined by our approach are of high quality.
For the two heuristic baselines, we find that they hurt the performance of NMT models, decreasing the BLEU score by more than 2 points. The results indicate that careful utilization of Wikipedia data is important, which is also supported by the superior performance of our approach.
Furthermore, we report the supervised result in the last row of Table 2. The supervised setting uses the full training set of WMT16 En-De translation with 4.5 million parallel sentences, and of WMT13 En-Es translation with 3.8 million parallel sentences. As shown in the table, our approach takes a big step towards the supervised result and reduces the gap between unsupervised translation and supervised translation by up to 50%.

Discussions
To make a clear comparison on sentence pair extraction, we adapt existing approaches to our unsupervised scenario.
As introduced before, previous works usually leverage bilingual pairs to train a model or a classifier, or use a bilingual dictionary to mine sentence pairs. In our unsupervised scenario, the most closely related initial model for selecting pairs is the unsupervised NMT model. Therefore, we first train an unsupervised NMT model following Lample et al. (2018) and use this model to mine the sentence pairs. Specifically, we conduct the following experiments: a) Similarly to Adafre and De Rijke (2006) and Yasuda and Sumita (2008), for each x, we generate its translation using the unsupervised NMT model and select the sentence most similar to the translation (w.r.t. BLEU) as paired data. b) We use the translation probability p(y|x) output by the model as the scoring function, as in Smith et al. (2010) and Munteanu and Marcu (2005), and select the sentence pairs with larger probabilities. As the unsupervised NMT model is not good enough, we found that the selected sentence pairs were of poor quality and that training on such poor data does not work well. On the De→En task, the models trained with the pairs from a) and b) only reach BLEU scores of 23.4 and 19.8 (a drop) respectively. Both results show that the trained unsupervised NMT models are not as good as expected.
Besides, following Fung and Cheung (2004), we build a bilingual dictionary from the titles of cross-lingual Wikipedia pages in order to mine sentence pairs, and evaluate its quality using the tools from MUSE. However, the top-1 accuracy of the word translation is poor (< 20%), which makes it impossible to select trustworthy sentence pairs with such a weak lexicon. In contrast, by leveraging our cross-lingual word embeddings and unsupervised sentence representations, the selected sentences are much better (see more in Section 3.5). We believe our findings are important to the field of unsupervised machine translation.

Further Studies
In this subsection, we provide an ablation study of our method, examine how the performance varies with the thresholds for mining implicit sentence pairs, and finally present several cases of mined sentence pairs. The studies are conducted on the En-De language pair.

Ablation Study
Our method leverages weakly paired documents in two ways. To better understand the importance of each, we report results from an ablation study in Table 3. From the table, we can see that removing the topic alignment decreases the accuracy by more than 1 BLEU point, while removing the implicitly aligned sentence pairs decreases the accuracy by about 6 BLEU points. These findings clearly demonstrate that both components are important and that both contribute to the improvement of translation accuracy.

Impact of Sentence Quality
As introduced before, to control the quality of the mined sentence pairs, we set two constraints with thresholds $c_1$ and $c_2$. We present the number of paired sentences and the BLEU scores with respect to these two thresholds in Table 4 and Figure 1. We vary the threshold $c_1$ in {0.70, 0.75} and $c_2$ in {0.0, 0.1, 0.2}. For $c_1$ below 0.70, we observe that the sentence quality is poor, while for $c_1$ above 0.75, we can only extract a few pairs. As shown in the table and the figure, with stricter constraints we obtain fewer sentence pairs of higher quality. There is a clear trade-off between the quality and the quantity of the data. A good configuration is $c_1 = 0.7$ and $c_2 = 0.1$.

Case Study of Mined Sentence Pairs
We present three cases of our selected sentence pairs in Table 5. We can see that our approach can mine high-quality pairs, such as the first case, in which one sentence is a good translation of the other. Besides, our method can select interesting paired sentences with similar, if not exactly the same, semantics. As shown in the second case, the two sentences are almost semantically the same, but the German word "gewählt" ("elected" in English) has no counterpart in the corresponding English sentence. Also, in the last case, the detailed description "15 Meter hoch" (which means "15 meters high" in English) in the German sentence is missing from the English sentence. Although they are not exact translations, such sentence pairs are still very helpful for NMT model training, as demonstrated by the results of our method.

Related Work
Using monolingual data to boost machine translation performance has attracted a lot of attention (Gulcehre et al., 2015; Sennrich et al., 2016a; Zhang and Zong, 2016; Wu et al., 2018; He et al., 2016; Wu et al., 2017), especially when bilingual supervision is limited. Sennrich et al. (2016a) propose the back-translation approach, which is an effective way to augment the training sentence pairs with target-side monolingual data. He et al. (2016) leverage both source-side and target-side monolingual data in a dual learning framework. However, these methods still require a relatively large amount of labeled bilingual data. Recently, Artetxe et al. (2017), among others, make an initial study of unsupervised machine translation, in which the model is trained on monolingual data only. These approaches leverage two key components to learn unsupervised translation models: 1) suitable initialization of the translation model by cross-lingual word embeddings, and 2) a denoising auto-encoder as a language model together with a reconstruction loss based on translation and back-translation. These works only leverage monolingual sentences and do not use the rich weakly paired documents available on the Web.
We study how to leverage document pairs from Wikipedia for training without supervised bilingual sentence pairs. Leveraging the free online Wikipedia database as an additional source to improve natural language processing tasks has attracted interest in recent years. For example, Conneau et al. (2017) show that word translation can be effectively learned based on embeddings trained from Wikipedia corpora. These embeddings have further become one of the key components of unsupervised machine translation. Different from using Wikipedia to train warm-start word embeddings, we aim to leverage more and stronger signals from such weakly paired documents to train translation models. Hálek et al. (2011) use the category information in the Wikipedia corpus to improve the translation of named entities. Drexler et al. (2014) incorporate language models from target-language documents that are comparable to the source documents in Wikipedia pages to improve document translation.
Several works aim at extracting potential sentence pairs from comparable corpora, but most of them rely on a set of bilingual sentence pairs to train a model or obtain a bilingual lexicon, and use this model or lexicon to select sentence pairs. For example, Adafre and De Rijke (2006) and Yasuda and Sumita (2008) use a strong machine translation system to obtain a rough translation of a given page from one language into another, and then calculate the word overlap or BLEU score between sentences as the similarity measure. Smith et al. (2010) and Munteanu and Marcu (2005) use parallel corpora to develop a ranking model or binary classifier that learns how likely a sentence in the target language is to be a translation of a sentence in the source language. Fung and Cheung (2004) use a small parallel corpus to initialize their EM lexical learning and further use this lexicon to iteratively mine sentence pairs. However, in our unsupervised scenario, we have no bilingual sentence pairs with which to train such a model or lexicon to further select new sentence pairs.
Our work is also related to document translation (Tu et al., 2018) but with different goals and settings. The goal of document translation is to enhance sentence translation with stronger signals beyond sentences by using richer inputs (e.g., the topic information from the document that contains the sentence). During training, it takes one sentence as well as the cross-sentence (document-level) information as input and predicts the ground-truth translation of the sentence in the other language. Therefore, training a document translation model requires bilingual sentence pairs and their surrounding contexts. In our scenario, the goal is to learn a sentence translation model with weaker signals than sentence pairs. We aim to extract useful information from weakly paired documents to train a translation model without human-labeled bilingual data.

Conclusion and Future Work
In this work, we propose a general method to train machine translation models using weakly paired bilingual documents from the Web, e.g., Wikipedia. Our approach contains two key components: mining implicitly aligned sentence pairs and aligning topic distributions. Experiments on public benchmarks verify the effectiveness of our method.
For future work, we will apply our method to more language pairs. Besides, we will study using weakly paired documents from other data sources, such as news websites. Furthermore, we will investigate better ways to utilize such weakly paired documents, going beyond mining sentence pairs and aligning topic distributions.