Facebook FAIR’s WMT19 News Translation Task Submission

This paper describes Facebook FAIR’s submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English↔German and English↔Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the FAIRSEQ sequence modeling toolkit. This year we experiment with different bitext data filtering schemes, as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our system improves on our previous system’s performance by 4.5 BLEU points and achieves the best case-sensitive BLEU score for the translation direction English→Russian.


Introduction
We participate in the WMT19 shared news translation task in two language pairs and four language directions: English→German (En→De), German→English (De→En), English→Russian (En→Ru), and Russian→English (Ru→En). Our methods build on the techniques and approaches used in our submission from last year, including the use of subword models (Sennrich et al., 2016), large-scale back-translation, and model ensembling. We train all models using the FAIRSEQ sequence modeling toolkit (Ott et al., 2019). Although document-level context for En→De is now available, all our systems are pure sentence-level systems. In the future, we expect better results from leveraging this additional context information.
Compared to our WMT18 submission, we additionally compete in the En↔Ru and De→En translation directions. Although all four directions are considered high-resource settings in which large amounts of bitext data are available, we demonstrate that leveraging high-quality monolingual data through back-translation is still very important. For all language directions, we back-translate the Newscrawl dataset using a reverse-direction bitext system. In addition to back-translating the relatively clean Newscrawl dataset, we also experiment with back-translating portions of the much larger and noisier Commoncrawl dataset. For our final models, we apply a domain-specific fine-tuning process and decode using noisy channel model reranking (Anonymous, 2019).
Compared to our WMT18 submission in the En→De direction, we observe substantial improvements of 4.5 BLEU. Some of these gains can be attributed to differences in dataset quality, but we believe most of the improvement comes from larger models, larger scale back-translation, and noisy channel model reranking with strong channel and language models.

Data
For the En↔De language pair we use all available bitext data, including the bicleaner version of Paracrawl. For our monolingual data we use English and German Newscrawl. Although our language models were trained on document-level data, we did not use document boundaries in our final decoding step, so all our systems are purely sentence-level systems.
For the En↔Ru language pair we also use all available bitext data. For our monolingual data we use English and Russian Newscrawl as well as a filtered portion of Russian Commoncrawl. We choose to use Russian Commoncrawl to augment our monolingual data due to the relatively small size of Russian Newscrawl compared to English and German.

Data Preprocessing
Similar to last year's submission for En→De, we normalize punctuation and tokenize all data with the Moses tokenizer (Koehn et al., 2007). For En↔De we use joint byte pair encodings (BPE) with 32K split operations for subword segmentation (Sennrich et al., 2016). For En↔Ru, we learn separate BPE encodings with 24K split operations for each language. Systems trained with this separate BPE encoding performed significantly better than those trained with joint BPE.
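As an illustration, the BPE learning procedure of Sennrich et al. (2016) amounts to repeatedly merging the most frequent adjacent symbol pair. The snippet below is a toy reimplementation of that merge loop, not the actual subword-nmt code used in our pipeline; `num_merges` stands in for the 32K/24K split operations.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with its concatenation
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # words are split into characters, with </w> marking the word end
    vocab = {" ".join(list(w)) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

On the example vocabulary from the BPE paper ({"low": 5, "lower": 2, "newest": 6, "widest": 3}), the first learned merges are ('e', 's') followed by ('es', 't'). A joint encoding is obtained by learning merges on the concatenation of both sides of the bitext; separate encodings are learned per language.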

Bitext
Large datasets crawled from the internet are naturally very noisy and can potentially decrease the performance of a system if they are used in their raw form. Cleaning these datasets is an important step to achieving good performance on any downstream tasks.
We apply language identification filtering (langid; Lui et al., 2012), keeping only sentence pairs with correct languages on both sides. Although not the most accurate method of language identification (Joulin et al., 2016), one side effect of using langid is the removal of very noisy sentences consisting of mostly garbage tokens, which are classified incorrectly and filtered out.
We also remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. In total, we filter out about 30% of the original bitext data. See Table 1 for details on the bitext dataset sizes.
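The filtering rules above (language identification on both sides, a 250-token length cap, and a maximum length ratio of 1.5) can be sketched as follows. The function name `clean_bitext` and the injected `detect_lang` callable are our own illustrative choices; a real pipeline would plug in a classifier such as langid in place of the toy detector shown in the usage note.

```python
def clean_bitext(pairs, src_lang, tgt_lang, detect_lang,
                 max_len=250, max_ratio=1.5):
    # pairs: iterable of (source, target) sentence strings
    # detect_lang: callable mapping a sentence to a language code
    kept = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if not src_toks or not tgt_toks:
            continue  # drop empty sides
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # drop overly long sentences
        ratio = len(src_toks) / len(tgt_toks)
        if ratio > max_ratio or 1 / ratio > max_ratio:
            continue  # drop mismatched-length pairs
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            continue  # drop pairs with wrong or garbage language
        kept.append((src, tgt))
    return kept
```

In practice each rule removes a different kind of noise: the language check catches misaligned and garbage segments, while the length and ratio checks catch truncated or runaway alignments.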

Monolingual
For monolingual Newscrawl data we also apply langid filtering. Since the monolingual Newscrawl corpus for Russian is significantly smaller than that for German or English, we augment our monolingual Russian data with data from the Commoncrawl corpus. Commoncrawl is the largest monolingual corpus available for training but is also very noisy. In order to select a limited amount of high-quality, in-domain sentences from this larger corpus, we adopt the method of Moore and Lewis (2010) for selecting in-domain data (§3.2.1).

Model Architecture
Our base system is based on the big Transformer architecture (Vaswani et al., 2017) as implemented in FAIRSEQ. We experiment with increasing network capacity by increasing the embedding dimension, FFN size, number of heads, and number of layers. We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size. All subsequent models, including ensembles, use this larger-FFN Transformer architecture.
We trained all our models using FAIRSEQ (Ott et al., 2019) on 128 Volta GPUs, following the setup described in

Large-scale Back-translation
Back-translation is an effective and commonly used data augmentation technique to incorporate monolingual data into a translation system. Back-translation first trains an intermediate target-to-source system that is used to translate monolingual target data into additional synthetic parallel data. This data is then used in conjunction with human-translated bitext data to train the desired source-to-target system.
In this work we used back-translations obtained by sampling from an ensemble of three target-to-source models. We found that models trained on data back-translated using an ensemble instead of a single model performed better (Table 2). Previous work also found that upsampling the bitext data can improve back-translation. We adopt this method to tune the amount of bitext and synthetic data the model is trained on, and find a ratio of 1:1 synthetic to bitext data to perform best.
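As an illustration of this ratio tuning, the target ratio can be turned into an integer repeat factor for the bitext corpus. The function below is a hypothetical sketch of that bookkeeping, not code from our pipeline:

```python
def bitext_upsample_factor(bitext_sents, synthetic_sents, target_ratio=1.0):
    """How many times to repeat the bitext so that the ratio of
    synthetic to (repeated) bitext sentences approaches target_ratio."""
    if synthetic_sents <= bitext_sents * target_ratio:
        return 1  # already at or below the target ratio
    return round(synthetic_sents / (target_ratio * bitext_sents))
```

For example, with 5M bitext sentences and 20M synthetic sentences, a 1:1 target ratio means repeating the bitext four times during training.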

Back-translating Commoncrawl
The amount of monolingual Russian data available in the Newscrawl dataset is significantly smaller than that of English and German. In order to increase the amount of monolingual Russian data for back-translation, we experiment with incorporating Commoncrawl data. Commoncrawl is a much larger and noisier dataset compared to Newscrawl, and is also not domain specific. We experiment with methods to identify the subset of Commoncrawl that is most similar to Newscrawl. Specifically, we use the in-domain filtering method described in Moore and Lewis (2010). Given an in-domain corpus I, in this case Newscrawl, and a non-domain-specific corpus N, in this case Commoncrawl, we would like to find the subcorpus N_I of N that is drawn from the same distribution as I. By Bayes' rule, the probability that a given sentence s in N is drawn from N_I is P(N_I | s, N) = P(s | N_I, N) P(N_I | N) / P(s | N). We ignore the P(N_I | N) term, since it is constant for any given I and N, and use P(s | I) in place of P(s | N_I, N), since I and N_I are drawn from the same distribution. Moving into the log domain, we can score a sentence s by log P(s | I) − log P(s | N), which we estimate with language models L_I and L_N trained on I and N respectively. Our corpora are very large and we therefore use an n-gram model (Heafield, 2011) rather than a neural language model, which would be much slower to train and evaluate. We train the two language models L_I and L_N on Newscrawl and Commoncrawl respectively, then score every sentence s in Commoncrawl by H_I(s) − H_N(s), where H_I(s) and H_N(s) are the length-normalized log-probabilities of s under L_I and L_N. We select a cutoff of 0.01 and use all sentences that score higher than this value for back-translation, about 5% of the entire dataset.
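A toy sketch of this selection procedure, with simple add-one-smoothed unigram models standing in for the n-gram models of Heafield (2011); `train_unigram`, `moore_lewis_score`, and the toy corpora below are illustrative stand-ins, not our actual implementation:

```python
import math
from collections import Counter

def train_unigram(corpus):
    # add-one-smoothed unigram LM over a list of sentence strings;
    # returns a per-token log-probability function
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total, vocab_size = sum(counts.values()), len(counts)
    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab_size + 1))
    return logprob

def moore_lewis_score(sentence, lp_in, lp_out):
    # length-normalized difference of in-domain vs. general-domain
    # log-likelihood; higher means the sentence looks more in-domain
    toks = sentence.split()
    return sum(lp_in(t) - lp_out(t) for t in toks) / max(len(toks), 1)
```

Sentences scoring above the chosen cutoff are kept; with real n-gram models the procedure is identical, only the scoring models change.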

Fine-tuning
Fine-tuning with domain-specific data is a common and effective method to improve translation quality for a downstream task. After completing training on the bitext and back-translated data, we train for an additional epoch on a smaller in-domain corpus. For De→En, we fine-tune on test sets from previous years, including newstest2012, newstest2013, newstest2015, and newstest2017. For En→De, we fine-tune on previous test sets as well as the News-Commentary dataset. For En↔Ru we fine-tune on a combination of News-Commentary, newstest2013, newstest2015, and newstest2017. The other test sets are held out for other tuning procedures and for evaluation.

Noisy Channel Model Reranking
N-best reranking is a method of improving translation quality by scoring and selecting a candidate hypothesis from a list of n-best hypotheses generated by a source-to-target, or forward, model. For our submissions, we rerank using a noisy channel model approach (Anonymous, 2019).
Given a target sequence y and a source sequence x, the noisy channel approach applies Bayes' rule to model P(y|x) = P(x|y) P(y) / P(x). Since P(x) is constant for a given source sequence x, we can ignore it. We refer to the remaining terms P(y|x), P(x|y), and P(y) as the forward model, channel model, and language model respectively. To combine these scores for reranking, we calculate for every one of our n-best hypotheses: log P(y|x) + λ1 log P(x|y) + λ2 log P(y). The weights λ1 and λ2 are determined by tuning them with a random search on a validation set and selecting the weights that give the best performance. In addition, we also tune a length penalty.
For all translation directions, our forward models are ensembles of fine-tuned and back-translated models. Since we compete in both directions for both language pairs, for any given translation direction we can use the forward model of the reverse direction as the channel model. Our language models for each of the target languages English, German, and Russian are big Transformer decoder models with an FFN size of 8192. We train the language models on the monolingual Newscrawl dataset, and use document-level context for the English and German models. Perplexity scores for the language models on the target language of each translation direction are shown in Table 4. With a smaller amount of monolingual Russian data available, we observe that our Russian language model performs worse than the German and English language models.
To select the length penalty and the weights λ1 and λ2 for decoding, we use random search, choosing values in the range [0, 2) for the weights and values in the range [0, 1) for the length penalty. For all language directions, we choose the values that give the highest BLEU score on a combined dataset of newstest2014 and newstest2016.
To run our final decoding step, we first use the forward model with beam size 50 to generate an n-best list. We then use the channel and language models to score each of these hypotheses, using the weights and length penalty tuned previously. Finally, we select the hypothesis with the highest score as our output.
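Putting these steps together, the reranking score and the random search over λ1, λ2, and the length penalty can be sketched as below. The hypothesis representation and function names are our own illustrative assumptions: each hypothesis is reduced to its target length and the three model log-probabilities, and sentence-level quality scores stand in for BLEU during tuning.

```python
import random

def channel_score(hyp, lam1, lam2, len_penalty):
    # hyp = (num_target_tokens, forward_lp, channel_lp, lm_lp),
    # where each *_lp is a log-probability from the respective model
    n, fwd, ch, lm = hyp
    return fwd + lam1 * ch + lam2 * lm + len_penalty * n

def rerank(nbest, lam1, lam2, len_penalty):
    # return the index of the highest-scoring hypothesis
    return max(range(len(nbest)),
               key=lambda i: channel_score(nbest[i], lam1, lam2, len_penalty))

def random_search(dev, trials=1000, seed=0):
    # dev: list of (nbest, quality) pairs, where quality[i] is the
    # sentence-level quality (e.g. BLEU) of hypothesis i
    rng = random.Random(seed)
    best_total, best_params = float("-inf"), None
    for _ in range(trials):
        lam1, lam2 = rng.uniform(0, 2), rng.uniform(0, 2)
        lp = rng.uniform(0, 1)
        total = sum(q[rerank(nb, lam1, lam2, lp)] for nb, q in dev)
        if total > best_total:
            best_total, best_params = total, (lam1, lam2, lp)
    return best_params
```

With zero weights the forward model alone decides; nonzero weights let a strong channel or language model overturn the forward model's choice, which is exactly the effect exploited in the results below.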

Results
Results and ablations are shown for En→De in Table 5, De→En in Table 6, En→Ru in Table 7, and Ru→En in Table 8. We report case-sensitive BLEU scores computed with SacreBLEU (Post, 2018), using international tokenization for En→Ru. In the final row of each table we also report the case-sensitive BLEU score of our submitted system on this year's test set. All single models, and the individual models within ensembles, are averages of the last 10 checkpoints of training. Our baseline systems are big Transformers as described in Vaswani et al. (2017). The baselines were trained with minimally filtered data, removing only sentences longer than 250 words or exceeding a source/target length ratio of 1.5. This setup gave us a reasonable baseline against which to evaluate data filtering.

English→German
For En→De, langid filtering, a larger FFN, and ensembling improve our baseline performance on newstest2018 by about 1.5 BLEU. Note that our best bitext-only system already outperforms our system from last year by 1 BLEU point. This is perhaps due to the addition of higher-quality bitext data and improved data filtering techniques. The addition of back-translated (BT) data improves single-model performance by only 0.3 BLEU, but combining this with fine-tuning and ensembling gives a total gain of 3 BLEU. Finally, applying reranking on top of these strong ensembled systems gives another 1.4 BLEU.

German→English
For De→En, as with En→De, we see similar improvements with langid filtering, larger FFN, and ensembling on the order of 1.4 BLEU. Compared to En→De however, we also observe that the addition of back-translated data is much more significant, improving single model performance by over 2.5 BLEU. Fine-tuning, ensembling, and reranking add an additional 2.4 BLEU, with reranking contributing 1.5 BLEU, a majority of the improvement.

English→Russian
For En→Ru, we observe large improvements of 2.4 BLEU over a bitext-only model after applying langid filtering, larger FFN, and ensembling. Since we start with a lower quality initial En↔Ru bitext dataset, we observe a large improvement of 3.5 BLEU by adding back-translated data. Augmenting this back-translated data with Commoncrawl adds an additional 0.2 BLEU. Finally, applying fine-tuning, ensembling, and reranking adds 2.2 BLEU, with reranking contributing 1 BLEU.

Russian→English
For Ru→En, we observe similar trends to En↔De, with langid filtering, a larger FFN, and ensembling improving the performance of a bitext-only system by 1.6 BLEU. Back-translation adds 3 BLEU, again most likely due to the lower quality of the available bitext data. Fine-tuning, ensembling, and reranking add almost 4 BLEU, with reranking contributing 1.2 BLEU.

Reranking
For every language direction, reranking gives a significant improvement, even when applied on top of an ensemble of very strong back-translated models. We also observe that the biggest improvement of 1.5 BLEU comes in the De→En language direction, and the smallest improvement of 1 BLEU in the En→Ru direction. This is perhaps due to the relatively weak Russian language model, which is trained on significantly less data compared to English and German. Improving our language models may lead to even greater improvements with reranking.

Conclusions
This paper describes Facebook FAIR's submission to the WMT19 news translation task. For all four translation directions, En↔De and En↔Ru, we use the same strategy of filtering bitext data, backtranslating monolingual data, then training strong individual models on a combination of this data. Each of these models is fine-tuned and ensembled into a final system that is used for decoding with noisy channel model reranking. We demonstrate the effectiveness of our reranking approach, even when applied on top of very strong systems, and achieve the best case-sensitive BLEU score for En→Ru and competitive results in all other directions.