Coverage and Cynicism: The AFRL Submission to the WMT 2018 Parallel Corpus Filtering Task

The WMT 2018 Parallel Corpus Filtering Task aims to test various methods of filtering a noisy parallel corpus, to make it useful for training machine translation systems. We describe the AFRL submissions, including their preprocessing methods and quality metrics. Numerical results indicate relative benefits of different options and show where our methods are competitive.


Introduction
For this task the participants were provided with a large corpus of parallel data in English and German. The corpus contains approximately 10^8 lines, with approximately 10^9 words in each language. Hunalign scores (Varga et al., 2005) also were provided for each line. The task organizers built statistical machine translation (SMT) and neural machine translation (NMT) systems from the scores produced, based on parallel training sets of 10^6 and 10^7 words.
Subset selection techniques often strive to reduce a set to its most useful members. In this circumstance, this entails:
• Avoiding lines that unduly repeat the content of other selected lines. Such repetition can extend training times and/or skew the translation system to favor this type of line.
• Avoiding long lines, which will be ignored in training an NMT system.
In addition to adapting the corpus to the building of a general-purpose machine translation system, we must also deal with its significant noise. The main types of noise present in the given data are:
• Not natural language
• One or both languages are incorrect
• Correct languages and natural language, but not translations of each other

Preprocessing
As a first step, a rough preprocessing filter is applied to the data. This entails removing:
• Lines where either language contains more than 80 words
• Lines where either language contains fewer than 4 words
• Lines containing "www", as lines with web addresses tend to provide less useful information
• Lines where the ratio of the number of English words to the number of German words is greater than three or less than one third
• Lines containing characters with the Unicode general category of "other"
• Lines where the English text is identical to the German text, after removing space, period, and numeric characters
• Lines where numeric characters are different (or in a different order) in the two languages
• Lines where the hunalign score is less than 0.5 or greater than 1.5
The first of these criteria is based on limitations of NMT training, where long lines are discarded or truncated. The other criteria are highly empirical, based on indicators of apparent qualitative problems.
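The filter criteria above can be sketched as a single predicate. This is a minimal illustration, not the authors' implementation: the function name `keep_line`, whitespace-based word counting, the use of `str.isdigit` for "numeric characters", and the reading of the "other" category as any Unicode general category beginning with "C" are all assumptions.

```python
import unicodedata

MAX_WORDS = 80  # upper word limit from the text
MIN_WORDS = 4   # lower word limit from the text


def _strip_trivial(text):
    # Drop whitespace, periods, and digits before the identity check.
    return "".join(ch for ch in text
                   if not (ch.isspace() or ch == "." or ch.isdigit()))


def _digits(text):
    # Numeric characters in order of appearance.
    return [ch for ch in text if ch.isdigit()]


def _has_other_category(text):
    # Assumption: "other" means the Unicode "C" group (Cc, Cf, Cs, Co, Cn).
    return any(unicodedata.category(ch).startswith("C") for ch in text)


def keep_line(english, german, hunalign_score):
    """Return True if the sentence pair survives the rough prefilter."""
    n_en, n_de = len(english.split()), len(german.split())
    if max(n_en, n_de) > MAX_WORDS or min(n_en, n_de) < MIN_WORDS:
        return False
    if "www" in english or "www" in german:
        return False
    # Word-count ratio greater than three or less than one third.
    if n_en > 3 * n_de or 3 * n_en < n_de:
        return False
    if _has_other_category(english) or _has_other_category(german):
        return False
    if _strip_trivial(english) == _strip_trivial(german):
        return False
    if _digits(english) != _digits(german):
        return False
    if hunalign_score < 0.5 or hunalign_score > 1.5:
        return False
    return True
```

The sketch assumes each side has already been split from its line, with no trailing newline or tab, since those control characters would themselves trigger the "other" category check.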
The remaining lines are put through further processing prior to scoring:
• Punctuation is normalized
• Words are truncated to 72 characters. The tokenizer attempts to separate German compound words, and long words cause it to hang.
• Language-specific tokenization is performed, using SYSTRAN's Linguistic Development Kit. Subword units are generated via byte-pair encoding (BPE) (Gage, 1994). The BPE models are learned on a per-language basis, trained with 2000 byte-pair encoding merges, over all WMT 2018 news translation task parallel German-English data without the Paracrawl corpus. This small vocabulary was chosen to reduce the number of out-of-vocabulary tokens resulting from morphology and compounding.
• The BPE form is transformed into the format used for character-based processing, with denoted spaces and no subword continuation markers (e.g., stand@@ ard prac@@ tice becomes stand ard _ prac tice)
• Case features are removed, essentially allowing BPE formation using case but scoring lowercased.
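Two of the steps above are simple enough to sketch directly: word truncation and the conversion from BPE output to the space-denoted format. The function names are hypothetical, and the sketch assumes the conventional `@@` continuation suffix shown in the example; it is not the authors' code.

```python
def truncate_words(line, limit=72):
    # Cut every whitespace-delimited word to at most `limit` characters.
    return " ".join(word[:limit] for word in line.split())


def bpe_to_char_format(bpe_line, space_marker="_"):
    # Remove "@@" continuation markers and mark original word boundaries
    # with an explicit space token.
    tokens = bpe_line.split()
    out = []
    for i, token in enumerate(tokens):
        if token.endswith("@@"):
            # Subword continues into the next token: no word boundary here.
            out.append(token[:-2])
        else:
            out.append(token)
            if i < len(tokens) - 1:
                out.append(space_marker)
    return " ".join(out)
```

Applied to the example from the text, `bpe_to_char_format("stand@@ ard prac@@ tice")` yields `stand ard _ prac tice`.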
This preprocessed text is used to generate the scores that determine a line's usefulness.

Coverage Metric
We use two metrics to estimate the relative appropriateness of a selected set to a reference. The first is our own coverage metric (Gwinnup et al., 2016), which we reproduce here. Let us select a subset S from a larger set C to maximize its similarity to a representative set T. Let our preferred subselected set size be τ times the size of T. Let V be a set of vocabulary elements of interest. Define c_v(X) to be the count of occurrences of feature v ∈ V in a given corpus X, and c_v^τ(T) = c_v(T)/τ to be the scaled count that accounts for the preferred size of the selected set. The coverage g is then given by
[Equation (1) missing from source]
where the oversaturation penalty p_v(S, T, τ) is
[equation missing from source]
Here f can be any submodular function, and we choose exclusively f(x) = log(1 + x). The final score reported for a line is the change it makes to the coverage metric on its inclusion. Lines which are not selected are given scores of zero.
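The building blocks named above (feature counts, scaled counts, and the gain function f) can be sketched as follows. This covers only those pieces; it does not attempt the coverage formula or the oversaturation penalty themselves. Treating features as whitespace-token n-grams and the helper names are assumptions for illustration.

```python
import math
from collections import Counter


def feature_counts(lines, max_n=1):
    # c_v(X): occurrences of each n-gram feature v in corpus X.
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def scaled_counts(counts, tau):
    # c_v^tau(T) = c_v(T) / tau, as defined in the text.
    return {v: c / tau for v, c in counts.items()}


def f(x):
    # The gain function used exclusively in the text.
    return math.log(1 + x)


def marginal_gain(x):
    # Extra value of one more occurrence of a feature already seen x times.
    return f(x + 1) - f(x)
```

Because f grows ever more slowly, `marginal_gain` shrinks as a feature accumulates occurrences, which is what lets a metric built on f favor lines that contribute under-represented features.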

Cynical Metric
As another approach we defined a metric based on the cynical selection method (Axelrod, 2017), which seeks to minimize the cross-entropy H. In our terms, this is
[Equation (2) missing from source]
We prefer to maximize metrics, so we define h(S, T) = −H(S, T) as the cynical metric to maximize. Including the scaling factor τ would have no effect on the cross-entropy value.
Note that Axelrod (2017) defines the cross-entropy purely in terms of unigrams, motivated by an unsmoothed unigram language model. We include unigrams through 4-grams in our feature set V. This extension to n-grams was not recommended by Axelrod (2017). However, we found it useful for this task.
The final score reported for a line is the change it makes to the cynical metric on its inclusion, with a maximum score of 1. Lines which are not selected are given scores of zero.
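As a rough illustration of the cynical metric, the sketch below computes h(S, T) = −H(S, T) under an assumed standard form of cross-entropy: the empirical feature distribution of T scored against an unsmoothed model estimated from S. The exact formula, the pooling of all n-gram orders into one distribution, and the function names are assumptions; the capping of line scores at 1 is not modelled.

```python
import math
from collections import Counter


def ngram_counts(lines, max_n=4):
    # Counts of all 1-grams through max_n-grams over whitespace tokens.
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cynical_metric(selected, representative, max_n=4):
    # Assumed form: h = sum_v p_T(v) * log p_S(v), i.e. the negated
    # cross-entropy, so that larger values are better.
    c_s = ngram_counts(selected, max_n)
    c_t = ngram_counts(representative, max_n)
    total_s = sum(c_s.values())
    total_t = sum(c_t.values())
    h = 0.0
    for v, count_t in c_t.items():
        count_s = c_s.get(v, 0)
        if count_s == 0:
            # Unsmoothed model: an unseen feature makes H infinite.
            return -math.inf
        h += (count_t / total_t) * math.log(count_s / total_s)
    return h
```

Both distributions are normalized by their own totals, so rescaling the counts of T by a constant such as τ leaves the value unchanged, in line with the remark in the text.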

Set-building Algorithm
Whether the metric is our coverage metric or our cynical metric, the method of building the set is the same. We iterate the following two steps until the selected set is large enough:
1. Add the line that has the best effect on the metric.
2. Check if removing a line from the selected corpus would improve the metric. If so, remove the line with greatest such improvement, unless it was the most-recently selected or would lead to infinite cycling.
This is a greedy algorithm with review after each selection.
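The two-step loop can be sketched generically over any metric. This is a naive illustration that re-evaluates the metric from scratch for each candidate rather than tracking incremental changes. The function name `build_set`, the callable `metric` taking a list of lines, and the cycling guard (a line removed during review is never added again) are assumptions; the text does not say how cycling is prevented.

```python
def build_set(lines, metric, target_size):
    """Greedy selection with a review step after each addition."""
    selected = []                       # indices, in order of selection
    remaining = set(range(len(lines)))  # indices never selected so far

    def value(indices):
        return metric([lines[j] for j in indices])

    while len(selected) < target_size and remaining:
        # Step 1: add the line with the best effect on the metric.
        best = max(sorted(remaining), key=lambda i: value(selected + [i]))
        selected.append(best)
        remaining.discard(best)

        # Step 2: review earlier selections, sparing the newest line.
        best_drop, best_value = None, value(selected)
        for k in selected[:-1]:
            candidate = value([j for j in selected if j != k])
            if candidate > best_value:
                best_drop, best_value = k, candidate
        if best_drop is not None:
            # Assumed cycling guard: a dropped line is not returned to
            # `remaining`, so it can never be selected again.
            selected.remove(best_drop)

    return [lines[j] for j in selected]
```

Each pass removes one index from `remaining` for good, so the loop terminates even when the review step keeps shrinking the selected set.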

Translation Score
The preceding processes and metrics were designed to remove many sources of error mentioned in the introduction of this paper. However, we have not yet dealt with the case of having both English and German lines being natural and useful, but the lines not being translations of one another. To help mitigate this phenomenon, we created a German-English NMT system using OpenNMT (Klein et al., 2017). It was trained on all WMT 2018 news translation task parallel German-English data, excluding the Paracrawl corpus. This system was a 4-layer bidirectional RNN, with 600-dimensional word embeddings and an RNN dimension of 1024, incorporating case features and a vocabulary from 2000 byte-pair encoding merges. The small vocabulary was chosen to reduce the number of out-of-vocabulary tokens resulting from morphology and compounding.
We translated all the German lines that survived the preprocessing step using this MT system. We computed the sentence-level Meteor scores (Denkowski and Lavie, 2011) of the English output of the MT system, with the English side of the given data as the reference. We simply multiplied positive coverage or cynical scores by their Meteor scores.

Application
This section outlines the particulars of the method applied to the given data for this task. First, the Paracrawl data are preprocessed according to the method in §2. This reduces the set of potential lines from 10^8 to 10^7. This reduced set is divided into 100 parts of 10^5 lines for scoring via batch processing.
Five different scoring methods will be considered. The baseline is cvg-mix, which uses our coverage metric and sums the coverage score for a small set (τ corresponding to 10^6 total lines) and a large set (τ corresponding to 10^7 total lines). Other scores are variants of this. The treatment cvg-large considers only the large set, and cvg-small considers only the small set. Meteor scores of translated lines are considered in cvg-mix-meteor. Finally, cynical scores are considered in cyn-mix.
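The relationship between the coverage-based variants can be sketched for a single line as below. The function name and argument names are hypothetical. The sketch assumes, following the translation-score description, that only positive scores are multiplied by the Meteor score. The cyn-mix variant is omitted because the text does not spell out how its components are combined.

```python
def coverage_variants(cvg_small, cvg_large, meteor):
    """Per-line scores for the four coverage-based submissions.

    cvg_small, cvg_large: coverage scores of the line under the small-set
    and large-set values of tau; meteor: sentence-level Meteor score.
    """
    mix = cvg_small + cvg_large  # cvg-mix sums the two coverage scores
    # Only positive scores are scaled by the translation score.
    mix_meteor = mix * meteor if mix > 0 else mix
    return {
        "cvg-small": cvg_small,
        "cvg-large": cvg_large,
        "cvg-mix": mix,
        "cvg-mix-meteor": mix_meteor,
    }
```

For example, a line with coverage scores of 0.25 under both settings and a Meteor score of 0.5 receives 0.5 for cvg-mix and 0.25 for cvg-mix-meteor.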

Numerical Results
The results of the WMT 2018 Parallel Filtering Task are given by Bojar et al. (2018). BLEU scores for MT systems built from sets selected via our scoring methods are given in Tables 1-4. We do not consider the development set (newstest2017) in any analysis below, but we include it in the tables for completeness.
Several trends are apparent within our five submissions. First, including the Meteor score is always beneficial for the MT systems trained on smaller sets and rarely detrimental for the systems trained on larger sets. The filtering that includes a translation score, cvg-mix-meteor, is our top submission by mean BLEU score for all four MT systems. Second, the filter cvg-small, designed for producing a small training set, is poor at producing a large training set. Third, for the small training set there is almost always (test set EMEA in SMT excepted) a benefit from averaging the small training set method and the large training set method. Fourth, the coverage and cynical measures produce very similar results for SMT, but the cynical score is much better for the NMT system that used a small training set. The fact that selection methods differ in performance for SMT and NMT is known (van der Wees et al., 2017), but it is interesting that it is true for our two scoring methods.
Our best filtering method, cvg-mix-meteor, scores better than the mean performance of all non-AFRL methods in the task, for every test set and every MT system type. This method exhibits relatively better quality on the smaller (10^6-word) training sets, where it also bests the median. It is especially competitive with the top two systems using the 10^6-word training sets on the test sets Acquis and KDE.

Conclusions
We have described a total of five different methods for filtering parallel data, as submitted to the WMT 2018 Parallel Filtering Task. We present numerical results, showing that our methods are especially competitive on certain test sets in the small training set condition.
Our coverage and cynical metrics yield approximately equivalent results in SMT, but the cynical metric is much better for the NMT system built on a small training set. Cynical scoring requires roughly half the computation time of coverage scoring, so it is sometimes a good choice for NMT.
Table 1: BLEU scores of created systems, 10^6-word SMT. Filter mean excludes the development set (newstest2017). The two additional systems listed are the best performing in the task, by mean test set BLEU score. Set score statistics are over the 43 task submissions from other participants.
The ability to specify the size of the selected set is beneficial for our coverage scoring method. Inclusion of a translation metric score such as Meteor is beneficial, and the simplistic version given here produced our best system. Introducing a translation metric score directly in the set-building process would help in avoiding redundancy.
Optimizing the heuristic and empirical prefiltering and preprocessing steps given here could yield substantial benefit. We have doubtless removed some beneficial lines in the prefiltering, which excluded up to 90% of the data. In fact, the prefiltering could conceivably be replaced by moving the application of the machine translation system to before scoring, rather than after. Unfortunately, this change would impose a much greater computational burden, as every line would need to be translated.