Filtering Back-Translated Data in Unsupervised Neural Machine Translation

Unsupervised neural machine translation (NMT) uses only monolingual data for training. The quality of back-translated data plays an important role in the performance of NMT systems. In back-translation, not all generated pseudo parallel sentence pairs are of the same quality. Taking inspiration from domain adaptation, where in-domain sentences are given more weight in training, we propose an approach to filter back-translated data as part of the training process of unsupervised NMT. Our approach gives more weight to good pseudo parallel sentence pairs in the back-translation phase. We calculate the weight of each pseudo parallel sentence pair using the sentence-wise round-trip BLEU score, normalized batch-wise. We compare our approach with the current state-of-the-art approaches for unsupervised NMT.


Introduction
Back-translation involves generating a set of pseudo parallel sentence pairs using monolingual data in the target language and a target-to-source machine translation model. Back-translation thus makes it possible to use target-side monolingual data for training.
Unsupervised NMT has gained a lot of attention in the last two years. Current state-of-the-art approaches for unsupervised NMT even surpass a supervised baseline for the English-French language pair (Song et al., 2019). Unsupervised NMT has three main components: cross-lingual embeddings, denoising, and back-translation. Training alternates between denoising and back-translation after a good initialization (Lample and Conneau, 2019; Song et al., 2019). In this paper our focus is on improving the finetuning phase: we introduce a weight component in unsupervised NMT training based on the quality of the pseudo parallel sentence pairs generated in the back-translation phase. Similar techniques have been used in domain adaptation to give more weight to in-domain sentences (Wang et al., 2017). Pretraining is also a key component of unsupervised NMT; we use the existing pretraining approaches proposed in Lample and Conneau (2019) and Song et al. (2019).

Related Work
Our work mainly explores the filtering of back-translated data in unsupervised NMT. In this section we briefly describe related concepts in back-translation, unsupervised NMT, and language model pretraining.
Back-translation uses target-side monolingual data to create pseudo parallel sentence pairs with a target-to-source translation system; these pairs are then used to train the source-to-target NMT system (Sennrich et al., 2016). Hoang et al. (2018) show that iteratively generating better synthetic data improves NMT performance.
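As an illustration of the back-translation step described above, the following minimal sketch pairs each real target sentence with a synthetic source sentence. The function names and the dictionary-based toy translator are assumptions made for this example; in practice the target-to-source function would be an NMT model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    translate_tgt_to_src: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic source, real target) pairs from target-side monolingual data."""
    pairs = []
    for tgt in target_sentences:
        synthetic_src = translate_tgt_to_src(tgt)  # model output, may be noisy
        pairs.append((synthetic_src, tgt))         # used to train source -> target
    return pairs


# Hypothetical stand-in for a target-to-source model: word-by-word lookup.
toy_lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}


def toy_fr_to_en(sentence: str) -> str:
    return " ".join(toy_lexicon.get(word, word) for word in sentence.split())


pseudo_pairs = back_translate(["le chat dort"], toy_fr_to_en)
```

The real target sentence stays on the target side of each pair, so errors made by the backward model only affect the source side.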
The quality of back-translated data plays an important role in the performance of NMT systems (Fadaee and Monz, 2018; Poncelas et al., ). Fadaee and Monz (2018) show that target-side words with high prediction loss benefit most from the addition of synthetic data. Filtering pseudo parallel data with a threshold on the round-trip BLEU score helps to improve the performance of NMT systems for low-resource languages (Morishita et al., 2018; Imankulova et al., 2019). In reinforcement-learning-based approaches, rewards for pseudo parallel sentence pairs based on language model score and round-trip reconstruction error help NMT models (He et al., 2016). Caswell et al. (2019) tag synthetic sentence pairs with an identifying mark when training on a mixed set of human-generated and synthetic sentence pairs. Junczys-Dowmunt (2018) show that filtering parallel data based on cross-entropy scores, which measure the agreement between the models of the two directions (source to target and target to source), helps in selecting good pseudo parallel sentence pairs. Dou et al. (2020) show that, for iterative back-translation, selecting and weighting sentences based on the quality of sentence pairs improves the performance of NMT systems. They use a combination of scores such as round-trip BLEU, tf-idf, and language model scores to select the top sentences; for the remaining sentences they use encoder representation similarities and the agreement between forward and backward models to assign weights to the back-translated data. Wang et al. (2019a) proposed uncertainty-based confidence measures to select good pseudo parallel sentence pairs. Wang et al. (2019b) proposed to select in-domain and clean data based on co-curricular learning.
In domain adaptation, different techniques have been applied to give more weight to in-domain sentences in the training process of NMT. These include providing a weight component in the loss function (Junczys-Dowmunt, 2018), using curriculum learning (Zhang et al., 2019), and dynamically selecting data across iterations (van der Wees et al., 2017).
Pretraining in unsupervised NMT generally focuses on language model training of both the encoder and the decoder, so that they capture the properties of the language and provide a good initialization for the finetuning phase. Finetuning uses the approaches proposed in Lample et al. (2018), which involve denoising and back-translation. Artetxe et al. (2019) proposed a good initialization mechanism using statistical machine translation for training unsupervised NMT. Wu et al. (2019) proposed a new architecture for unsupervised NMT which does not use back-translation but instead tries to find the best possible translations in the target corpus and edits them to make pseudo parallel sentence pairs. Yang et al. (2018) proposed to use two independent encoders that share some of their weights. Lample and Conneau (2019) proposed a pretraining mechanism for unsupervised NMT that pretrains the encoder and decoder separately using monolingual data. Song et al. (2019) show that training the encoder and decoder jointly on monolingual data helps in the pretraining of unsupervised NMT.

Approach
Current state-of-the-art approaches for unsupervised NMT (Lample and Conneau, 2019; Song et al., 2019) do not consider the quality of each generated pseudo parallel sentence pair during training: all generated pseudo parallel sentence pairs have the same weight. There exist different methods to filter bad pseudo parallel sentence pairs with a threshold, as discussed in the previous section. In iterative back-translation, however, it is difficult to select a threshold for each batch separately as training progresses, and in the initial iterations of the back-translation phase the quality of the generated pseudo parallel sentence pairs is very poor. To decrease the weights of bad pseudo parallel sentence pairs, we propose to modify back-translation training to include a weight for each pseudo parallel sentence pair based on the sentence-wise round-trip BLEU score, normalized batch-wise. We perform batch-wise normalization because it helps to maintain steady progress in training and to keep denoising and back-translation equally weighted.

The round-trip BLEU score is the BLEU score between the source sentence and the result of translating the source sentence to the target language and then back to the source language. As the systems in both directions (source to target and target to source) are trained simultaneously, we can calculate the round-trip BLEU score with the currently trained systems in both directions. The sentence-wise round-trip BLEU score is added to the cross-entropy loss function as the weight of each pseudo parallel sentence pair. In general, the cross-entropy loss function is given by:

$$L = -\sum_{b=1}^{BS} \sum_{l=1}^{|L|} \sum_{n=1}^{|N|} y_{l,n} \log(\hat{y}_{l,n}) \tag{1}$$

where $BS$ is the batch size, $|N|$ is the size of the vocabulary, $|L|$ is the length of the sentence, $\hat{y}_{l,n}$ is the predicted probability of word $n$ of the vocabulary at position $l$ in the sentence, and $y_{l,n}$ is 1 when word $n$ of the vocabulary is the correct word at position $l$ and 0 otherwise.
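The round-trip BLEU score described above can be sketched as follows. This is a minimal pure-Python illustration, not the tensor-based implementation used in the experiments: the add-one smoothing of the n-gram precisions, the whitespace tokenization, and the toy translation functions are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Callable, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: List[str], hypothesis: List[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing (an assumed smoothing choice)."""
    if not reference or not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        log_precision_sum += math.log((overlap + 1) / (total + 1))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


def round_trip_bleu(
    source: str,
    src_to_tgt: Callable[[str], str],
    tgt_to_src: Callable[[str], str],
) -> float:
    """BLEU between a source sentence and its source -> target -> source translation."""
    reconstructed = tgt_to_src(src_to_tgt(source))
    return sentence_bleu(source.split(), reconstructed.split())


# Hypothetical toy "translators": upper-casing stands in for the forward model.
sentence = "the cat sleeps on the mat"
perfect = round_trip_bleu(sentence, str.upper, str.lower)
lossy = round_trip_bleu(
    sentence, str.upper, lambda s: " ".join(s.lower().split()[:-1])
)
```

Here `perfect` is 1.0 because the round trip reproduces the source exactly, while `lossy` is lower because the backward toy model drops the final word and the brevity penalty applies.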
We utilize the weighted cross-entropy loss function:

$$L = -\sum_{b=1}^{BS} w_b \Big( \sum_{l=1}^{|L|} \sum_{n=1}^{|N|} y_{l,n} \log(\hat{y}_{l,n}) \Big) \tag{2}$$

where $w_b$ is the batch-wise normalized round-trip BLEU score between the source and the round-trip translation of the source. The normalization function is given by:

[Equation (3)]

where $w_b$ is the normalized round-trip BLEU score for the $b$-th sentence in the batch and $w_b^{un}$ is the un-normalized round-trip BLEU score for the $b$-th sentence. BLEU (bilingual evaluation understudy) is a metric commonly used to evaluate machine translation systems (Papineni et al., 2002). Various other methods exist to evaluate machine translation systems, but as a starting point we consider the sentence-wise BLEU score, which is the most popular one.

We provide results for the approaches of Song et al. (2019) and Lample and Conneau (2019), with and without filtering in the back-translation phase. In Song et al. (2019) finetuning is done using iterative back-translation only, and in Lample and Conneau (2019) finetuning is done using denoising and back-translation, similar to Lample et al. (2018).
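The weighted loss can be illustrated with a small sketch. The exact normalization function is an assumption here: the sketch scales the raw round-trip BLEU scores so that they sum to the batch size (mean weight 1), which is one choice that keeps the back-translation loss on the same overall scale as an unweighted loss. Because $y_{l,n}$ is one-hot, the inner sums reduce to the log-probability of the correct word at each position, so each sentence is represented by those probabilities only. All function names are hypothetical.

```python
import math
from typing import List


def normalize_weights(raw_scores: List[float]) -> List[float]:
    """ASSUMED normalization: scale scores so they sum to the batch size."""
    batch_size = len(raw_scores)
    total = sum(raw_scores)
    if total == 0.0:
        return [1.0] * batch_size  # fall back to uniform weights
    return [score * batch_size / total for score in raw_scores]


def sentence_nll(gold_token_probs: List[float]) -> float:
    """Negative log-likelihood of one sentence: -sum_l log p(correct word at l)."""
    return -sum(math.log(p) for p in gold_token_probs)


def weighted_batch_loss(
    batch_gold_probs: List[List[float]], raw_scores: List[float]
) -> float:
    """Weighted cross-entropy: sum over sentences of w_b times the sentence NLL."""
    weights = normalize_weights(raw_scores)
    return sum(w * sentence_nll(p) for w, p in zip(weights, batch_gold_probs))


# Toy batch: predicted probabilities of the correct word at each position.
batch = [[0.9, 0.8], [0.2, 0.1]]
round_trip_scores = [0.8, 0.2]  # hypothetical un-normalized round-trip BLEU

weights = normalize_weights(round_trip_scores)
weighted = weighted_batch_loss(batch, round_trip_scores)
unweighted = sum(sentence_nll(p) for p in batch)
```

With uniform raw scores the weights are all 1 and the weighted loss equals the unweighted one; in the toy batch the second pair has the lower round-trip score, so its loss term is scaled down.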

Experiment and Results
In this section we show the impact of filtering back-translated data with the above approach on two state-of-the-art unsupervised NMT benchmarks for three language pairs.

Data
We use the same BPE codes and vocabulary as Lample and Conneau (2019) and Song et al. (2019) for en-fr, en-de and en-ro. We use the same data for our experiments, with the number of sentences given in Table 1. All of this data is from WMT. We perform all pre-processing (normalization, tokenization and byte pair encoding) in the same way as Song et al. (2019). The validation data is newstest2013 and the test data is newstest2014 of WMT.

Model configuration
We use the pretrained models from Lample and Conneau (2019) and Song et al. (2019) for language model pretraining. For XLM we use masked language model pretraining. We use the transformer architecture with 6 layers, 8 heads, 1024 hidden units, GELU activation units, attention dropout of 0.1, an initial learning rate of $10^{-4}$, and a batch size of 32 sentences. We perform decoding using beam search. BPE codes are learnt with FastBPE, using 60000 BPE codes over the combined data of both languages. The epoch size is set to 200000 sentences. We use the Adam (Kingma and Ba, 2002) optimizer. We perform tokenization using Moses (Koehn et al., 2007). For calculating sentence-wise BLEU scores on tensors we use the AllenNLP (Gardner et al., 2018) toolkit. We choose the best model across iterations according to the BLEU score on the validation set. We evaluate our results using tokenized BLEU scores calculated with multi-bleu.pl.
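For reference, the reported settings can be collected in a single configuration mapping. The key names below are hypothetical and do not correspond to the flags of any particular toolkit; only the values come from the text above.

```python
# Hyperparameters as reported in the text; key names are hypothetical.
finetune_config = {
    "architecture": "transformer",
    "n_layers": 6,
    "n_heads": 8,
    "hidden_units": 1024,
    "activation": "gelu",
    "attention_dropout": 0.1,
    "initial_learning_rate": 1e-4,
    "batch_size_sentences": 32,
    "optimizer": "adam",
    "decoding": "beam_search",
    "bpe_codes": 60000,
    "epoch_size_sentences": 200000,
}

# Number of batches per epoch implied by the epoch size and batch size.
batches_per_epoch = (
    finetune_config["epoch_size_sentences"] // finetune_config["batch_size_sentences"]
)
```
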

Results
We use MASS (Song et al., 2019) as our base implementation and update the back-translation phase to include the weight of the pseudo parallel sentence pairs in the loss function.

[Table: BLEU scores by method. Columns: Method, en-fr, fr-en, en-ro, ro-en, en-de, de-en.]

We also performed an experiment to examine the impact of denoising in the finetuning phase with filtering of back-translated data for the en-fr language pair using masked sequence to sequence pretraining. This gave a BLEU score of 27.02 for en-fr and 26.01 for fr-en, which is an improvement over the baseline. We test statistical significance using paired bootstrap resampling (Koehn, 2004) with a p-value threshold of 0.05.

[Figure: Lample and Conneau (2019) with and without filtering.]

In the initial iterations the model with filtering does not perform well, but as training progresses it starts to perform better than the model without filtering. This happens because we start the filtering process at the beginning of the finetuning phase, when the quality of the generated back-translated data is poorer than in later iterations. We also observe that the model with filtering tends to converge a little earlier than the model without filtering. Because we give less weight to poor pseudo parallel sentence pairs, the system learns more from good data, which helps to improve the performance of unsupervised NMT.
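The significance test mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it compares the mean of per-sentence scores on each resampled test set, whereas Koehn (2004) recomputes the corpus-level BLEU score on each resample. The function name and parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A does not beat system B.

    Both systems are scored on the same resampled sentence indices, which is
    what makes the test "paired".
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(num_samples):
        # Draw n sentence indices with replacement.
        indices = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in indices) / n
        mean_b = sum(scores_b[i] for i in indices) / n
        if mean_a <= mean_b:
            not_better += 1
    return not_better / num_samples


# Hypothetical per-sentence scores for a filtered system (A) and a baseline (B).
filtered_scores = [0.42, 0.55, 0.61, 0.38, 0.70]
baseline_scores = [0.40, 0.50, 0.60, 0.30, 0.65]
p_value = paired_bootstrap_p_value(filtered_scores, baseline_scores)
```

A result below 0.05 would be read as system A being significantly better than system B under this simplified test.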

Conclusion
In this paper, we show that weighting pseudo parallel sentence pairs in the back-translation phase according to their quality, measured by the round-trip BLEU score, helps to improve the performance of unsupervised NMT. In future work, we plan to explore different weighting scores for evaluating the quality of back-translated data, together with different measures of the quality of individual sentences, to improve translation performance and training time.