Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task

The WMT19 Parallel Corpus Filtering For Low-Resource Conditions Task aims to test various methods of filtering noisy parallel corpora to make them useful for training machine translation systems. This year the noisy corpora are for the relatively low-resource language pairs of Nepali-English and Sinhala-English. This paper describes the Air Force Research Laboratory (AFRL) submissions, including preprocessing methods and scoring metrics. Numerical results indicate a benefit over the baseline and show the relative benefits of different options.


Introduction
For this task the participants were provided with a corpus of parallel data in English-Nepali (en-ne) and English-Sinhala (en-si). Both parallel and monolingual training datasets were provided in these languages. The task organizers built statistical machine translation (SMT) and neural machine translation (NMT) systems from the scores produced, based on parallel training sets of 1M (one million) and 5M English words.
Subset selection techniques often strive to reduce a set to its most useful members. For the shared task one should avoid selecting:
• A line with undue repetition of content of other selected lines. This repetition can extend training times and/or skew the translation system to favor this type of line.
• Long lines, which will be ignored in training the MT systems.
In addition to adapting the corpus to the building of a general-purpose MT system, we must also deal with significant noise. The main types of noise present in the given data are:
• Not natural language
• One or both languages are incorrect
• Lines are not translations of each other
In contrast to our WMT18 submission (Erdmann and Gwinnup, 2018), we include a text quality metric in the subcorpus-building process, rather than combining it afterward.

Preprocessing
As a first step, a rough preprocessing filter is applied to the data.
We remove lines where either language text contains more than 80 words, since the test systems use a maximum of 80 words per line. We also remove lines where the language ID probabilities from fastText (Joulin et al., 2016b,a) do not match the expected languages (using the pre-built language ID models of the authors).
This preprocessed text is used to generate the scores that determine a line's usefulness. We note that there are far fewer preprocessing steps than in our previous system (Erdmann and Gwinnup, 2018). We can simplify preprocessing because including a text quality metric during subcorpus-building avoids other forms of noise in the process.
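The rough filter above can be sketched as follows. This is a minimal illustration, not the authors' code: `keep_pair` and `fasttext_lang_prob` are hypothetical names, and treating each language ID threshold as a minimum probability for the expected language is an assumption. The language ID scorer is passed in as a callable so the filter itself has no external dependencies; the helper shows one way a loaded fastText language ID model might back it.

```python
from typing import Callable

MAX_WORDS = 80  # the test systems use at most 80 words per line

LangProb = Callable[[str, str], float]


def keep_pair(en_text: str, other_text: str, lang_prob: LangProb,
              other_lang: str, en_min: float, other_min: float,
              max_words: int = MAX_WORDS) -> bool:
    """Return True if a sentence pair survives the rough pre-filter."""
    # Drop the pair if either side has more than max_words whitespace tokens.
    if len(en_text.split()) > max_words or len(other_text.split()) > max_words:
        return False
    # Assumption: a side "matches" when the probability of the expected
    # language is at least the given threshold.
    return (lang_prob(en_text, "en") >= en_min
            and lang_prob(other_text, other_lang) >= other_min)


def fasttext_lang_prob(model) -> LangProb:
    """Wrap a loaded fastText language ID model as a LangProb callable."""
    def lang_prob(text: str, lang: str) -> float:
        # predict() rejects newlines; k=-1 asks for every label.
        labels, probs = model.predict(text.replace("\n", " "), k=-1)
        return float(dict(zip(labels, probs)).get("__label__" + lang, 0.0))
    return lang_prob
```

With the `fasttext` package installed and a pre-built language ID model file on disk, `fasttext_lang_prob(fasttext.load_model(path))` would supply the `lang_prob` argument.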

Coverage Metric
Our metric for subcorpus-building uses both a coverage metric and a text quality metric.
We first give our coverage metric (Gwinnup et al., 2016). Let us select a subcorpus S from a larger corpus C to maximize its similarity to a representative corpus T. Let our preferred subselected subcorpus size be τ times the size of T. Let V be a set of vocabulary elements of interest. Defining c_v(X) to be the count of occurrences of feature v ∈ V in a given corpus X, the coverage g is given by (1), which includes an oversaturation penalty p_v(S, T, τ). Here f can be any submodular function, but we use f(x) = log(1 + x) exclusively. The scaled count c^τ_v(T) = τ c_v(T) accounts for the preferred size of the selected subcorpus differing from the size of T.
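A minimal sketch of a coverage score in this spirit is given below. The saturating gain f(x) = log(1 + x), the scaled count τ c_v(T), and the existence of an oversaturation penalty come from the description above. The exact way they are combined here (a normalized sum of capped gains, with a linear penalty on counts beyond the scaled target) is an assumption made for illustration and should be checked against (1). The function names are hypothetical, and V is assumed to be the set of features observed in T.

```python
import math
from collections import Counter
from typing import Iterable, Mapping


def f(x: float) -> float:
    """Saturating gain f(x) = log(1 + x)."""
    return math.log1p(x)


def unigram_counts(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens over a corpus (unigram features)."""
    return Counter(tok for line in lines for tok in line.split())


def coverage(sub_counts: Mapping[str, int],
             target_counts: Mapping[str, int],
             tau: float) -> float:
    """Assumed form of the coverage g(S, T, tau); see the caveats above."""
    num = 0.0
    den = 0.0
    for v, c_t in target_counts.items():  # assumption: V = features seen in T
        scaled = tau * c_t                # scaled count, tau * c_v(T)
        c_s = sub_counts.get(v, 0)        # c_v(S)
        gain = f(min(c_s, scaled))        # gain stops growing once saturated
        # Assumed penalty: each occurrence beyond the scaled target costs
        # the marginal gain of one more occurrence at saturation.
        penalty = max(0.0, c_s - scaled) * (f(scaled + 1) - f(scaled))
        num += gain - penalty
        den += f(scaled)
    return num / den if den > 0 else 0.0
```

Under this assumed form, a subcorpus whose counts match the scaled target exactly scores 1, an empty subcorpus scores 0, and repeating an already saturated feature lowers the score.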

Text Quality Metric
To create a text quality metric, we use the given clean parallel data to create an MT system. We use the MT system to translate both pre-filtered noisy parallel corpora into English.
This allows us to compute the Meteor (Denkowski and Lavie, 2014) score of the given English lines, using the translated English as a reference. The Meteor metric was chosen because it uses deeper linguistic information than BLEU. The text quality metric of a subcorpus S is given by the average over its lines:
\[ h(S) = \frac{1}{|S|} \sum_{s \in S} m(s) \tag{2} \]
where m(s) is the text quality metric (e.g., Meteor) score of line s. This corpus metric is defined to be zero for the empty corpus: h(∅) = 0.
The overall score of a subcorpus is given by the product of the coverage metric (1) and the quality metric (2):
\[ F(S) = g(S, T, \tau) \, h(S) \tag{3} \]
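As a small illustration, the quality metric and the overall score can be computed as below. The function names are hypothetical; the coverage value is passed in as a plain number so the sketch does not depend on any particular form of g.

```python
from typing import Sequence


def quality(line_scores: Sequence[float]) -> float:
    """Average per-line quality score m(s); zero for the empty subcorpus."""
    if not line_scores:
        return 0.0
    return sum(line_scores) / len(line_scores)


def overall(coverage_value: float, line_scores: Sequence[float]) -> float:
    """Overall score: product of the coverage and quality metrics."""
    return coverage_value * quality(line_scores)
```

Because the two metrics are multiplied, a subcorpus with good coverage but poor average Meteor (or the reverse) still receives a low overall score.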

Subcorpus-Building Algorithm
To build a subcorpus, we iterate the following two steps until the selected subcorpus is large enough:
1. Add the line that has the best effect on the overall score F from (3).
2. If removal of any line would improve F, find the line with the largest improvement. Remove it, unless infinite cycling would result.
This is a greedy algorithm, with review after each selection.
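The two steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `build_subcorpus` is a hypothetical name, lines are referred to by index, the score F is supplied as a callable on a list of indices, and cycling is avoided by refusing any removal that would return to a previously visited selection (one possible reading of "unless infinite cycling would result"). It re-evaluates the score from scratch for every candidate, so it is only practical for small inputs.

```python
from typing import Callable, List, Optional, Set, FrozenSet

Score = Callable[[List[int]], float]


def build_subcorpus(n_lines: int, score: Score, target_size: int) -> List[int]:
    """Greedy selection with a removal review after each addition."""
    selected: List[int] = []
    pool: List[int] = list(range(n_lines))
    visited: Set[FrozenSet[int]] = {frozenset()}  # selections seen so far

    while len(selected) < target_size and pool:
        # Step 1: add the line that gives the best overall score.
        best = max(pool, key=lambda i: score(selected + [i]))
        pool.remove(best)
        selected.append(best)
        visited.add(frozenset(selected))

        # Step 2: look for the single removal that most improves the score.
        current = score(selected)
        best_gain = 0.0
        to_drop: Optional[int] = None
        for i in selected:
            trial = [j for j in selected if j != i]
            if frozenset(trial) in visited:
                continue  # assumed anti-cycling rule: never revisit a state
            gain = score(trial) - current
            if gain > best_gain:
                best_gain, to_drop = gain, i
        if to_drop is not None:
            selected.remove(to_drop)
            pool.append(to_drop)  # a removed line may be re-added later
            visited.add(frozenset(selected))

    return selected
```

The returned list keeps the surviving lines in the order they were added, which is the order later used for ranking.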

Application
This section outlines the particulars of the method as applied to the given data for this task. Pre-filtering removed a significant percentage of the noisy parallel corpora prior to scoring. The thresholds for language identification were set empirically. For en-ne we used 40% for English and 1% for Nepali. For en-si we used 10% for both English and Sinhala. After filtering for language identification and a maximum of 80 words, 0.9M of the 2.2M lines remained for en-ne and 1.2M of the 3.4M lines remained for en-si.
We trained phrase-based Moses (Koehn et al., 2007) systems with the small amount of "clean" training data provided by the organizers. These training corpora were normalized as necessary to remove systematic representation oddities, mostly in punctuation. The Moses systems employ a hierarchical reordering model (Galley and Manning, 2008) and a 5-gram operation sequence model (Durrani et al., 2011). The 5-gram English language model used by both systems was trained with the constrained monolingual corpus from our WMT15 (Gwinnup et al., 2015) efforts.
These Moses MT systems were used to translate the pre-filtered datasets. The Meteor score of the given English lines was computed, using the translated English as a reference.
The pre-filtered parallel corpora were lowercased and tokenized with tools from Moses. We built a 2000-word-vocabulary SentencePiece (Kudo and Richardson, 2018) model on the given monolingual corpora for each language. The pre-filtered parallel corpora were processed with these models prior to subcorpus-building.
We then followed our subcorpus-building procedure, producing a subcorpus whose lines we ranked by the order in which they were added. This can produce too few scored lines for the 1M-word or 5M-word subcorpora, so we order the remaining lines by their text quality metric (i.e., Meteor) scores alone. We submitted scores generated by two values of τ for each language pair. The smaller value of τ produced a 50k-line subcorpus, and the larger value of τ produced 150k lines. Our expectation was that the smaller subcorpus would be best in the 1M-word case, and the larger subcorpus in the 5M-word case. For these cases the selected corpora were roughly the same size as the training sets.
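The ranking described above can be turned into per-line scores along the following lines. This is a sketch under assumptions: `rank_scores` is a hypothetical name, and mapping ranks linearly onto (0, 1] is an arbitrary illustrative choice. The only property taken from the text is the ordering: lines in the built subcorpus come first, in the order they were added, followed by all other lines in decreasing order of their text quality score.

```python
from typing import List, Sequence


def rank_scores(added_order: Sequence[int],
                quality: Sequence[float]) -> List[float]:
    """Assign each line a score such that a higher score means a better rank."""
    n = len(quality)
    chosen = set(added_order)
    # Lines outside the subcorpus are ordered by quality score alone.
    rest = sorted((i for i in range(n) if i not in chosen),
                  key=lambda i: quality[i], reverse=True)
    ranking = list(added_order) + rest
    scores = [0.0] * n
    for rank, i in enumerate(ranking):
        scores[i] = (n - rank) / n  # assumed linear mapping; best line gets 1.0
    return scores
```

For example, with four lines, `added_order = [2, 0]` and quality scores `[0.2, 0.9, 0.5, 0.7]`, the resulting ranking is lines 2, 0, 1, 3.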

Numerical Results
The official results of the WMT19 Parallel Filtering Task are given by Bojar et al. (2019).
Here we give some general findings by using the given Moses-EMS configuration for the task. Tables 1-2 give numerical results of this test. BLEU scores are uncased and produced during the Moses-EMS run. We see that the parallel filtering methods we expected to be best do in fact improve on the Zipporah (Xu and Koehn, 2017) baseline.
The smaller, 50k-line subcorpus shows increases of 0.24 BLEU for 1M en-ne and 0.15 BLEU for 1M en-si. The larger, 150k-line subcorpus shows increases of 0.11 BLEU for 5M en-ne and 0.32 BLEU for 5M en-si. Picking the best results over all our experiments shows greater improvements over the baseline: 0.48 BLEU for 1M en-ne, 0.46 BLEU for 1M en-si, 0.11 BLEU for 5M en-ne, and 0.44 BLEU for 5M en-si.
The tables show that the subcorpus-building process normally improves over scoring by the text quality metric score alone (the row labelled "quality", which is equivalent to either building an empty subcorpus or choosing F = h in (3)). These improvements are largest and most consistent in the 1M-word tests. We expect that the larger sets might be struggling to find helpful data in the noisy corpora, essentially converging to the text-quality-metric-only score.
We tested excluding the text quality metric from the selection process (i.e., choosing F = g in (3)), and these results are given in the table rows labelled "coverage". As in Erdmann and Gwinnup (2018), we saw great benefit from including the text quality using an MT system, even in this low-resource setting.
Varying the number of grams considered in the subcorpus-building algorithm's vocabulary yielded small and inconsistent changes over unigram selection. We have no insight into which linguistic or corpus features make it beneficial to consider 2-grams in English-Nepali but slightly detrimental in English-Sinhala.

Conclusions
We have presented the techniques we used in our submissions to the WMT19 Parallel Corpus Filtering For Low-Resource Conditions Task. Numerical results show our method to be a fraction of a BLEU point better than the Zipporah baseline for training the SMT system. We expect the optimal choices in our method to vary significantly with language pairs and noisy corpora. This variation might be in the parameters (language ID thresholds, τ, n-gram levels, etc.), the combination of coverage and quality metrics (product, sum, etc.), the design of the MT system(s) used for the text quality metric (e.g., phrase-based or neural, with their myriad design choices), or the text quality metric itself (Meteor, BEER (Stanojević and Sima'an, 2015), chrF (Popović, 2015), etc.).
Building a machine translation system in each direction would provide us with two text quality metric scores to incorporate into the overall score. We expect this would decrease dependence on the language ID thresholds and produce a somewhat better subcorpus.
Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government. Cleared for public release on 12 Jun 2019. Originator reference number RH-19-119920. Case number 88ABW-2019-2964.