Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities took part in this task.


Introduction
Training corpora for machine translation come in varying degrees of quality. At one extreme, they are carefully and professionally translated specifically for this purpose, possibly under the instruction to provide fairly literal translations and to adhere to sentence-by-sentence correspondences. At the other extreme are sentence pairs extracted with fully automatic processes from indiscriminate crawling of the World Wide Web.
The Shared Task on Parallel Corpus Filtering targets the second extreme, although the methods developed for this data condition should also carry over to less noisy parallel corpora. In setting this task, we were motivated by our ongoing efforts to create large publicly available parallel corpora from web sources and the recognition that noisy parallel data is especially a concern for neural machine translation (Khayrallah and Koehn, 2018).
This paper gives an overview of the task, presents its results and provides some analysis.

Related Work
Although the idea of crawling the web indiscriminately for parallel data goes back to the 20th century (Resnik, 1999), work in the academic community on extraction of parallel corpora from the web has so far mostly focused on large stashes of multilingual content in homogeneous form, such as the Canadian Hansards, Europarl (Koehn, 2005), the United Nations (Rafalovitch and Dale, 2009; Ziemski et al., 2015), or European Patents (Täger, 2011). A nice collection of the products of these efforts is the OPUS web site (Skadiņš et al., 2014).
We are currently engaged in a large-scale effort to crawl text from the web. This work has been funded by Google Faculty Awards and is also currently funded by the European Union via the Connecting Europe Facility. In 2016, we organized a shared task on document alignment as part of this effort (Buck and Koehn, 2016).
Acquiring parallel corpora from the web typically goes through the stages of identifying web sites with parallel text, downloading the pages of the web site, aligning document pairs, and aligning sentence pairs. A final stage of the processing pipeline filters out bad sentence pairs. These exist either because the original web site did not have any actual parallel data (garbage in, garbage out), or due to failures of earlier processing steps.
In 2016, a shared task on sentence pair filtering was organized, albeit in the context of cleaning translation memories, which tend to be cleaner than the data at the end of a pipeline that starts with web crawls.
There is a robust body of work on filtering out noise in parallel data. For example, Taghipour et al. (2011) use an outlier detection algorithm to filter a parallel corpus; Xu and Koehn (2017) generate synthetic noisy data (inadequate and non-fluent translations) and use this data to train a classifier to identify good sentence pairs in a noisy corpus; and Cui et al. (2013) use a graph-based random walk algorithm and extract phrase pair scores to weight the phrase translation probabilities to bias towards more trustworthy ones.
Most of this work was done in the context of statistical machine translation, but more recent work targets neural models. Carpuat et al. (2017) focus on identifying semantic differences in translation pairs using cross-lingual textual entailment and additional length-based features, and demonstrate that removing such sentences improves neural machine translation performance.
As Rarrick et al. (2011) point out, one type of noise in parallel corpora extracted from the web is translations that have been created by machine translation. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and statistical machine translation output can be detected due to lack of reordering. Belinkov and Bisk (2017) investigate the impact of noise on neural machine translation. They focus on creating systems that can translate the kinds of orthographic errors (typos, misspellings, etc.) that humans can comprehend. In contrast, Khayrallah and Koehn (2018) address noisy training data and focus on types of noise occurring in web-crawled corpora. They carried out a study of how noise that occurs in crawled parallel text impacts statistical and neural machine translation.
There is a rich literature on data selection which aims at sub-sampling parallel data relevant for a task-specific machine translation system (Axelrod et al., 2011). van der Wees et al. (2017) find that the existing data selection methods developed for statistical machine translation are less effective for neural machine translation. This is different from our goal of handling noise, since those methods tend to discard perfectly fine sentence pairs (say, about cooking recipes) that are just not relevant for the targeted domain (say, software manuals). Our task is focused on data quality that is relevant for all domains.

Task
The shared task tackled the problem of filtering parallel corpora. Given a noisy parallel corpus (crawled from the web), participants developed methods to filter it to a smaller size of high-quality sentence pairs. Specifically, we provided a very noisy 1 billion word (English token count) German-English corpus crawled from the web by the Paracrawl project. We asked participants to subselect sentence pairs that amount to (a) 10 million words and (b) 100 million words, counted on the English side. The quality of the resulting subsets was determined by the quality of a statistical machine translation system (Moses, phrase-based) and a neural machine translation system (Marian) trained on this data. The quality of the machine translation systems was measured by BLEU score on (a) the official WMT 2018 news translation test set and (b) other undisclosed test sets.
Note that the task addressed the challenge of data quality and not domain-relatedness of the data for a particular use case. Hence, we discouraged participants from subsampling the corpus for relevance to the news domain. Thus, we place more emphasis on the undisclosed test sets, although we report both scores.
Participants in the shared task submitted a file with quality scores, one per line, corresponding to the sentence pairs. The scores do not have to be meaningful, except that higher scores indicate better quality. The scores were uploaded to a Google Drive folder which remains publicly accessible. Evaluation of the quality scores was done by subsampling 10 million and 100 million word corpora based on these scores, training statistical and neural machine translation systems with the subsampled corpora, and evaluating translation quality on blind test sets using the BLEU score.
For development purposes, we released configuration files and scripts that mirror the official testing procedure with a development test set. The development pack consists of
• a script to subsample corpora based on quality scores
• a Moses configuration file to train and test a statistical machine translation system
• Marian scripts to train and test a neural machine translation system
The web site for the shared task provided detailed instructions on how to use these tools to replicate the official testing environment.

Training Data
The provided raw parallel corpus is the outcome of a processing pipeline that aimed at high recall at the cost of precision, so it is very noisy. It exhibits noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
A cursory inspection of the corpus is given in Table 1. According to analysis by Khayrallah and Koehn (2018), only about 23% of the data is okay, but even that fraction may be flawed in some way. Consider the following sentence pairs that we did count as okay even though they contain mostly untranslated names and numbers. It is an open question if such data is also harmful, merely irrelevant, or maybe even beneficial.
The raw corpus consists of a billion words of English, paired with German on the sentence level. It was deduplicated from a subset of the raw Paracrawl Release 1.

Provided Meta Information
The provided corpus file contains three items per line, separated by a TAB character. The Hunalign scores were obtained from the sentence aligner (Varga et al., 2005). They may be a useful feature for sentence filtering, but they do not by themselves correlate strongly with sentence pair quality. Participants generally did not use this score.
Participants' systems may take the source of the data into account, e.g., by discounting sentence pairs that come from a web domain with generally low quality scores. To this end, we released the URL sources for each sentence pair as an additional data set. Note that due to de-duplication a single sentence pair may have several URL pairs associated with it, since it may appear on multiple web pages.
Participants were also allowed to use existing tools and external training data to build their filtering methods. Specifically, they were permitted to use the WMT 2018 news translation task data for German-English (without the Paracrawl parallel corpus) to train components of their method.
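The idea of discounting by web domain can be illustrated with a minimal sketch. It is not taken from any participant's system: the function name `domain_discounted_scores`, the `weight` parameter, and the simplification of one URL per sentence pair (the released data may list several) are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_discounted_scores(scores, urls, weight=0.5):
    """Blend each pair's own score with the mean score of its web domain.

    Assumption: one URL per sentence pair. With several URLs per pair,
    one would first have to decide how to combine their domains.
    """
    domains = [urlparse(url).netloc.lower() for url in urls]

    # Collect all scores observed for each domain.
    by_domain = defaultdict(list)
    for domain, score in zip(domains, scores):
        by_domain[domain].append(score)
    domain_mean = {d: sum(v) / len(v) for d, v in by_domain.items()}

    # Interpolate: weight=0 keeps the original score, weight=1 uses
    # only the domain average.
    return [
        (1 - weight) * score + weight * domain_mean[domain]
        for domain, score in zip(domains, scores)
    ]
```

With `weight=0.5`, a well-scored pair from a domain whose pairs mostly score poorly is pulled down towards that domain's average, which is the discounting effect described above.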

Test Sets
The goal of the task is to filter down to high-quality sentence pairs, but not to sentence pairs that are most fitting to a specific domain. During the submission period of the task, we only announced that we would use the official news translation test set from the WMT 2018 Shared Task on Machine Translation of News, which had not yet been released at that time.
In total, we used six test sets. For statistics see Table 2. For all the test sets, we checked for overlap with the training data to prevent the test set from being contained in the released noisy parallel data. We originally considered a test set based on the PHP documentation but removed it because it was contained in Paracrawl.
The official scoring of machine translation systems generated from the subsampled data sources is the average of the individual BLEU scores for each test set.
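As a small illustration of this scoring rule, the sketch below computes the unweighted average over per-test-set BLEU scores. The function name `official_score`, the test set labels, and the numbers in the usage note are placeholders, not results from the shared task.

```python
def official_score(bleu_by_test_set):
    """Unweighted mean of the per-test-set BLEU scores."""
    if not bleu_by_test_set:
        raise ValueError("need at least one test set")
    return sum(bleu_by_test_set.values()) / len(bleu_by_test_set)
```

For example, `official_score({"set_a": 20.0, "set_b": 30.0})` gives 25.0. Because the average is taken over test sets rather than over sentences, each test set counts equally regardless of its size.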

Evaluation Protocol
The testing setup mirrors the development environment that we provided to the participants.

Participants
We received submissions from 17 different organizations. See Table 3 for the complete list of participants. The participants' organizations are quite diverse, with 3 participants from Spain, 3 participants from the United States, 2 participants from Germany, and 1 participant each from Canada, Greece, China, Japan, France, Latvia, Estonia, United Kingdom, and Brazil. 9 of the participants are companies, 3 are national research organizations, and 5 are universities. Each participant submitted up to 5 different sets of scores, resulting in a total of 44 different submissions that we scored.

Subset Selection
We provided to the participants a file containing one sentence pair per line.A submission to the shared task consists of a file with the same number of lines, with one score per line corresponding to the quality of the corresponding sentence pair.
Using the score file, we selected subsets of a pre-defined size, defined by the number of English words.We chose the number of English words instead of German words, since the latter would allow selection of sentence pairs with very few German words and many English words which are beneficial for language model training but do not count much towards the German word total.
Subselecting sentence pairs is done by finding a threshold score, so that the sentence pairs included in the subset have a quality score at or above this threshold. In some cases, a submission assigned this threshold score to a large number of sentence pairs. Including all of them would yield a subset that is too large; excluding all of them would yield a subset that is too small. Hence, in this case we randomly included some of the sentence pairs with the threshold score to reach the desired size.
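The selection procedure described above can be sketched as follows. This is an illustrative reimplementation, not the official subsampling script: the name `select_subset`, the `seed` parameter, and counting English words by whitespace splitting are assumptions.

```python
import random


def select_subset(english_sentences, scores, target_words, seed=0):
    """Pick the highest-scoring pairs until the English word budget is met.

    Pairs scoring above the threshold are all included; pairs tied at
    the threshold score are included in random order until the budget
    is reached. Returns (sorted indices of selected pairs, word total).
    """
    assert len(english_sentences) == len(scores)
    word_counts = [len(s.split()) for s in english_sentences]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rng = random.Random(seed)

    selected, total, pos = [], 0, 0
    while pos < len(order) and total < target_words:
        # Gather the block of pairs that share the next-highest score.
        end = pos
        while end < len(order) and scores[order[end]] == scores[order[pos]]:
            end += 1
        block = order[pos:end]
        block_words = sum(word_counts[i] for i in block)

        if total + block_words <= target_words:
            # The whole block fits within the budget.
            selected.extend(block)
            total += block_words
        else:
            # This block holds the threshold score: sample from it.
            rng.shuffle(block)
            for i in block:
                if total >= target_words:
                    break
                selected.append(i)
                total += word_counts[i]
        pos = end
    return sorted(selected), total
```

In this sketch the last sampled sentence may push the total slightly over the budget, by fewer words than that sentence contains. How the official script handles this boundary case is not specified in the text.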

System Training
Given a selected subset of given size for a system submission, we built statistical (SMT) and neural machine translation (NMT) systems to evaluate the quality of the selected sentence pairs.
SMT. For statistical machine translation, we used Moses (Koehn et al., 2007) with fairly basic settings, such as Good-Turing smoothing of phrase table probabilities, maximum phrase length of 5, maximum sentence length of 80, lexicalized reordering (hier-mslr-bidirectional-fe), fastalign for word alignment with grow-diag-final-and symmetrization, tuning with batch-MIRA, no operation sequence model, 5-gram language model trained on the English side of the subset with no additional data, and decoder beam size of 5,000.
NMT. For neural machine translation, we used Marian (Junczys-Dowmunt et al., 2018). It uses the default settings of version 1.5, with 50,000 BPE operations, maximum sentence length of 100, layer normalization, dropout of 0.2 for RNN states, 0.1 for source embeddings and 0.1 for target embeddings, exponential smoothing, and decoding with beam size 12 and length normalization (1). Training a system for the 10 million word subset was limited to 20 epochs and took about 10 hours. Training a system for the 100 million word subset was limited to 10 epochs and took about 2 days.
Scores on the test sets were computed with multi-bleu-detok.perl included in Moses. We report case-insensitive scores.

Core Results
The official results are reported in Table 4. The table contains the average BLEU score over all the 6 test sets for the 4 different setups
• statistical machine translation for 10 million word corpus
• statistical machine translation for 100 million word corpus
• neural machine translation for 10 million word corpus
• neural machine translation for 100 million word corpus
In the table, we highlight cells for the best scores for each of these settings, as well as scores that are close to it.
One striking observation is that the scores differ much more for the 10 million word subset than for the 100 million word subset. Scores also differ more for neural machine translation systems than for statistical machine translation systems. For the 10 million word subset, there are only 2 submissions within 0.5 BLEU of the best system for statistical machine translation, and 0 for neural machine translation. For the 100 million word subset, there are 15 submissions within 0.5 BLEU of the best system for statistical machine translation, and 9 submissions within 0.5 BLEU for neural machine translation. Note that many of these submissions come from the same participants.
For both data sets, scores for neural machine translation are considerably higher. For the 10 million word subsets, the best NMT score is 28.6, while the best SMT score is 24.6. For the 100 million word subsets, the best NMT score is 32.1, while the best SMT score is 26.5. To be fair, statistical machine translation is typically trained with large monolingual corpora for language modelling that are essential for good performance, and no such additional data was used here.

Results by Test Set
Tables 5 and 6 break out the results by test set, for statistical machine translation and neural machine translation, respectively.
The use of multiple test sets was motivated by the objective to discourage participants from filtering sentence pairs for a specific domain instead of filtering for general quality. Some participants used domain-specific data for training some elements of their filtering systems, such as monolingual news data sets to train language models, but argued that these are broad domains that do not lead to domain over-fitting.
The results do not suggest that some systems do better on some domains than others, at least not more than random variance would lead one to expect. The test sets closest to the development sets are NEWSTEST2018, GLOBALVOICES, and maybe IWSLT2018. Only the 10 million word submissions rwth-nn and rwth-nnredundant seem to do much better on these sets than on the others, relative to other submissions.

Additional Subset Sizes
Since we were interested in the shape of the curve of how different corpus sizes impact machine translation performance, we subselected additional subset sizes. Specifically, in addition to the 10 and 100 million word corpora, we also subselected 20, 30, 50, 80, 150, and 200 million words. See Figure 1 for results for neural machine translation systems (also broken down by each individual test set) and Figure 2 for statistical machine translation systems. We only computed results for six systems due to the computational cost involved.
The scoring on additional subset sizes was not announced before the submission deadline for the shared task, so none of the participants optimized for these. In fact, some participants assigned the same low value to almost all sentence pairs that would be ignored when subselecting the 100 million word corpus. So, when subsampling larger corpora (150 and 200 million words, as we have done), the resulting system scores collapse.
The curves for neural machine translation system scores almost always peak at 100 million words, although occasionally at 80 or 150 million words. Since we did not plot these curves when setting up the shared task, we cannot say if 100 million words is just an optimal value for this corpus or if participants overfitted their systems to this value, although we would guess the former.
The performance of the submissions is quite similar across the different test sets. None of the submissions we show in the figures has overly optimized for the news test set.

Methods used by Participants
Not surprisingly, given the large number of submissions, many different approaches were explored for this task. However, most participants built a system with three components: (1) pre-filtering rules, (2) scoring functions for sentence pairs, and (3) a classifier that learned weights for feature functions.
Pre-filtering rules. Some of the training data can be discarded based on simple deterministic filtering rules. These may include rules to remove
• sentence pairs where names, numbers, email addresses, URLs do not match between both sides
• sentence pairs that are too similar, indicating simple copying instead of translating
• sentences where a language identifier does not detect the required language
Scoring functions. Sentence pairs that pass the pre-filtering stage are assessed with scoring functions which provide scores that hopefully correlate with the quality of sentence pairs. Participants used a variety of such scoring functions, including
• n-gram or neural language models on clean data
• language models trained on the provided raw data as contrast
• neural translation models
• bag-of-words lexical translation probabilities
Note that the raw scores provided by these models may also be refined in several ways. For instance, we may desire that the language model perplexities of a German sentence and its paired English sentence are similar. Or, we may contrast the translation model score for a sentence and its given paired sentence with the translation model score for the sentence and its best translation according to the model.
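Two of the pre-filtering rules and the two score refinements mentioned above can be sketched as follows. This is a rough illustration rather than any participant's implementation: all function names, the number-matching heuristic, and the similarity cut-off of 0.9 are assumptions. The language and translation model scores are taken as given inputs, since the models themselves are outside the scope of the sketch.

```python
import re
from difflib import SequenceMatcher

# Digit strings, optionally with "." or "," as internal separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_match(src, tgt):
    """True if both sides contain the same numbers, ignoring separators."""
    def normalized(text):
        return sorted(re.sub(r"[.,]", "", n) for n in NUMBER.findall(text))
    return normalized(src) == normalized(tgt)


def too_similar(src, tgt, max_ratio=0.9):
    """True if the two sides are near-copies of each other."""
    ratio = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
    return ratio > max_ratio


def passes_prefilter(src, tgt):
    """Keep a pair only if its numbers agree and it is not a plain copy."""
    return numbers_match(src, tgt) and not too_similar(src, tgt)


def perplexity_agreement(ppl_src, ppl_tgt):
    """1.0 for equal (positive) perplexities, approaching 0 as they diverge."""
    return min(ppl_src, ppl_tgt) / max(ppl_src, ppl_tgt)


def translation_contrast(logprob_given, logprob_best):
    """Gap between the model score of the given target sentence and the
    score of the model's own best translation (closer to 0 is better)."""
    return logprob_given - logprob_best
```

The number rule is deliberately crude: stripping separators lets "1.000" and "1,000" match, but it would reject correct pairs in which a date or number is written out differently on the two sides.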
Learning weights for scoring functions. Given a large number of scoring functions, simply averaging their resulting scores may be inadequate. Learning weights to optimize machine translation system quality is computationally intractable due to the high cost of training these systems to evaluate different weight settings. A few participants instead used a classifier that learns how to distinguish between good and bad sentence pairs. Good sentence pairs are selected from existing high-quality parallel corpora, while bad sentence pairs are either synthesized by scrambling good sentence pairs or taken from the raw crawled data. Some participants made a distinction between unsupervised methods that did not use existing parallel corpora to train parts of the system, and supervised methods that did. Unsupervised methods have the advantage that they can be readily deployed for language pairs for which no seed parallel corpora exist.
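The classifier approach can be illustrated with a toy sketch: good pairs come from a small clean corpus, bad pairs are synthesized by misaligning it, and a logistic regression learns weights over the scoring functions. Everything here is an assumption made for illustration, including the two simplistic features, the rotation used as scrambling, and the hyperparameters. Participants' systems used much richer scoring functions and classifiers.

```python
import math


def features(src, tgt):
    """Two toy scoring functions: length ratio and shared-token fraction."""
    s, t = src.split(), tgt.split()
    length_ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    shared = len(set(s) & set(t)) / max(len(set(s) | set(t)), 1)
    return [length_ratio, shared]


def scramble(pairs):
    """Synthesize bad pairs: each source gets the next pair's target."""
    targets = [t for _, t in pairs]
    rotated = targets[1:] + targets[:1]
    return [(s, t) for (s, _), t in zip(pairs, rotated)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, steps=1000, lr=1.0):
    """Logistic regression fitted with full-batch gradient descent."""
    X = [features(s, t) for s, t in examples]
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(X, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def quality(model, src, tgt):
    """Classifier probability that (src, tgt) is a good pair."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(src, tgt))) + b)


# Tiny hand-written "clean corpus", for illustration only.
clean = [
    ("Berlin ist groß", "Berlin is big"),
    ("Ich habe 3 Katzen und 2 Hunde zu Hause", "I have 3 cats and 2 dogs at home"),
    ("Anna liest", "Anna reads"),
    ("Der Zug nach Paris fährt um 9 Uhr ab", "The train to Paris leaves at 9 o'clock"),
]
bad = scramble(clean)
model = train(clean + bad, [1] * len(clean) + [0] * len(bad))
```

The resulting `quality(model, src, tgt)` value can serve directly as the per-line score of a submission, since only the ranking of the scores matters for subset selection.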

Figure 1: Additional corpus sizes, with breakdown by individual test set for some high-performing submissions. The charts plot BLEU scores against the size of the subselected corpus (in millions of words). The curves peak around 100 million words.

Figure 2: Version of Figure 1 for statistical machine translation systems built from the subselected data. Note that the curves are flatter, and that several systems score in a narrow band of 1 BLEU point across a wide range of corpus sizes (30-200 million words), indicated in grey.

Table 1: Noise in the raw Paracrawl corpus.

• the test set from the WMT 2016 Shared Task on Machine Translation of News as development set
• the test set from the WMT 2017 Shared Task on Machine Translation of News as development test set

Test sets (Table 2): two of them were taken from existing evaluation campaigns, four were created for this shared task.
EMEA. This test set was extracted from documents of the European Medicines Agency, which consist of public health announcements and descriptions of medications. We only used sentences with 20 to 80 words, and removed any duplicate sentence pairs.

Table 2: Statistics for the test sets used to evaluate the machine translation systems trained on the subsampled data sets. Word counts are obtained with wc on untokenized text.

Table 3: Participants in the shared task.