The JHU Parallel Corpus Filtering Systems for WMT 2018

This work describes our submission to the WMT18 Parallel Corpus Filtering shared task. We use a slightly modified version of the Zipporah Corpus Filtering toolkit (Xu and Koehn, 2017), which computes an adequacy score and a fluency score for each sentence pair, and use a weighted sum of the scores as the selection criterion. This work differs from Zipporah in that we experiment with using the noisy corpus to be filtered to compute the combination weights, thus avoiding the need to generate synthetic data as in standard Zipporah.


Introduction
Today's machine translation systems require large amounts of training data in the form of sentences paired with their translations, which are often compiled from online sources. This has not changed fundamentally with the move from statistical machine translation to neural machine translation, although we have observed that neural models require more training data (Koehn and Knowles, 2017) and are more sensitive to noise (Khayrallah and Koehn, 2018). Thus both the acquisition of more training data, such as through indiscriminate web crawling, and corpus filtering have a large impact on the quality of state-of-the-art machine translation systems.
The JHU submission to the WMT18 Parallel Corpus Filtering shared task uses a modified version of the Zipporah Corpus Filtering toolkit (Xu and Koehn, 2017). For a sentence pair, Zipporah uses a bag-of-words model to generate an adequacy score, and an n-gram language model to generate a fluency score. The two scores are combined with weights trained to separate clean data from noisy data. The original version of Zipporah generates artificial noisy training data to train this classifier; in this submission we also experiment with treating the Paracrawl corpus itself as the negative examples.

Related Work
Zipporah builds upon prior work in data cleaning and data selection.
For data selection, work has focused on selecting a subset of data based on domain-matching. Moore and Lewis (2010) computed cross-entropy between in-domain and out-of-domain language models to select data for training domain-relevant language models. XenC (Rousseau, 2013), an open-source tool, also selects data based on cross-entropy scores from language models. Axelrod et al. (2015) utilized part-of-speech tags and used a class-based n-gram language model for selecting in-domain data, and Duh et al. (2013) used a neural network based language model trained on a small in-domain corpus to select from a larger mixed-domain data pool. Lü et al. (2007) redistributed weights over sentence pairs and predefined sub-models. Shah and Specia (2014) described quality estimation experiments which, given a source sentence, select the best translation among several options.
For data cleaning, work has focused on removing noisy data. Taghipour et al. (2011) proposed an outlier detection algorithm which leads to improved translation quality when trimming a small portion of the data. Cui et al. (2013) used a graph-based random walk algorithm for bilingual data cleaning. BiTextor (Esplá-Gomis and Forcada, 2009) utilizes sentence alignment scores and source URL information to filter out bad URL pairs and select good sentence pairs. Similar to this work, the qe-clean system (Denkowski et al., 2012; Dyer et al., 2010; Heafield, 2011) uses word alignments and language models to select sentence pairs that are likely to be good translations of one another.
We focus on data cleaning for all purposes, as opposed to data selection for a given domain. We aim to create a corpus of generally valid translations, which could then be filtered to adapt to a particular domain.

Zipporah
We use a slightly modified version of the Zipporah Corpus Filtering toolkit (Xu and Koehn, 2017). Zipporah works as follows: it first maps all sentence pairs into the proposed feature space, and then trains a simple logistic regression model to separate known good data from bad data. Once the model is trained, it is used to score sentence pairs in the noisy data pool.
Zipporah uses two features inspired by adequacy and fluency. The adequacy feature uses bag-of-words translation scores, and the fluency feature uses n-gram language model scores.

Adequacy Score
Zipporah generates probabilistic dictionaries from an aligned corpus, and uses them to generate bag-of-words translation scores for each sentence. This is done in both directions.
Given a sentence pair (s_f, s_e) in the noisy data pool, we represent the two sentences as two sparse word-frequency vectors v_f and v_e. For example, for any French word w_f we have

v_f(w_f) = c(w_f, s_f) / l(s_f),

where c(w_f, s_f) is the number of occurrences of w_f in s_f and l(s_f) is the length of s_f. We do the same for v_e. Then we "translate" v_f into v'_e based on the probabilistic f2e dictionary, where

v'_e(w_e) = sum_{w_f} v_f(w_f) * p(w_e | w_f).

For a French word w that does not appear in the dictionary, we keep it as it is in the translated vector, i.e. we assume there is an entry (w, w, 1.0) in the dictionary. We compute the smoothed cross-entropy between v_e and v'_e,

xent(v_e, v'_e) = sum_w v_e(w) * log( v_e(w) / (v'_e(w) + c) ),

where c is a smoothing constant to prevent the denominator from being zero, which we set to c = 0.0001 for all experiments. We perform the same procedure in the English-to-French direction and compute xent(v_f, v'_f). We define the adequacy score as the sum of the two:

adequacy(s_f, s_e) = xent(v_e, v'_e) + xent(v_f, v'_f).
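As an illustration, the adequacy computation can be sketched in Python as follows. This is a simplified sketch, not the toolkit's actual implementation; in particular, the dictionary format (a mapping from each source word to a list of (target word, probability) pairs) is our assumption.

```python
import math
from collections import Counter

def word_freq_vector(sentence):
    """Sparse word-frequency vector: v(w) = c(w, s) / l(s)."""
    tokens = sentence.split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def translate_vector(v_src, dictionary):
    """'Translate' a vector via a probabilistic dictionary.
    Unknown words pass through unchanged, i.e. we assume an
    implicit entry (w, w, 1.0)."""
    v_out = {}
    for w_src, weight in v_src.items():
        for w_tgt, p in dictionary.get(w_src, [(w_src, 1.0)]):
            v_out[w_tgt] = v_out.get(w_tgt, 0.0) + weight * p
    return v_out

def xent(v1, v2, c=0.0001):
    """Smoothed cross-entropy; c keeps the denominator non-zero."""
    return sum(p * math.log(p / (v2.get(w, 0.0) + c))
               for w, p in v1.items())

def adequacy(s_f, s_e, f2e, e2f):
    """Adequacy score: sum of the two directional cross-entropies."""
    v_f, v_e = word_freq_vector(s_f), word_freq_vector(s_e)
    return (xent(v_e, translate_vector(v_f, f2e)) +
            xent(v_f, translate_vector(v_e, e2f)))
```

With empty dictionaries (pure pass-through), an identical sentence pair scores near zero, while an unrelated pair scores much higher, as intended.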

Fluency Score
Zipporah trains two 5-gram language models on clean French and English corpora, and then for each sentence pair (s_f, s_e) scores each sentence with the corresponding model, F_ngram(s_f) and F_ngram(s_e), each computed as the ratio between the sentence's negative log-likelihood and the sentence length. We define the fluency score as the sum of the two:

fluency(s_f, s_e) = F_ngram(s_f) + F_ngram(s_e).
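A minimal sketch of the fluency computation, assuming a hypothetical `logprob_fn` interface that returns a sentence's total log-likelihood under some trained n-gram model (standing in for, e.g., a real language model toolkit):

```python
import math

def f_ngram(sentence, logprob_fn):
    """Length-normalized negative log-likelihood of a sentence.
    logprob_fn is a stand-in for an n-gram LM scoring function."""
    tokens = sentence.split()
    return -logprob_fn(tokens) / len(tokens)

def fluency(s_f, s_e, lm_f, lm_e):
    """Fluency score: sum of the two per-language LM scores."""
    return f_ngram(s_f, lm_f) + f_ngram(s_e, lm_e)
```

Because each score is normalized by sentence length, long and short sentences are comparable under this feature.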

Classifier
We train a binary classifier to separate a clean corpus from noisy corpora, based on the two features proposed. Higher orders of the features are used in order to achieve a non-linear decision boundary. We implement this using the logistic regression model from scikit-learn (Pedregosa et al., 2011), and use the features in the form of (x^8, y^8).

Training Data
We use clean WMT training data as the examples of clean text. The original version of Zipporah creates synthetic negative training examples by shuffling the clean data set, both at the corpus and sentence levels, in order to generate inadequate and non-fluent text. Since much of the raw Paracrawl data is noisy (Khayrallah and Koehn, 2018), we also train a version where we simply use the portion of Paracrawl released for the shared task as the negative examples to train our classifier, without generating synthetic noisy data. We experiment with using both the full Paracrawl corpus and a 10,000-line subset.
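The corpus- and sentence-level shuffling used to create synthetic negative examples can be sketched as follows (a simplified illustration of the idea, not the toolkit's actual implementation):

```python
import random

def synthetic_noise(pairs, seed=0):
    """Generate synthetic negative examples from clean sentence pairs:
    corpus-level shuffling misaligns translations (inadequate pairs),
    and sentence-level word shuffling produces non-fluent text."""
    rng = random.Random(seed)
    sources, targets = zip(*pairs)

    # Corpus-level: pair each source with a randomly chosen target,
    # which is most likely not its translation.
    mismatched = list(targets)
    rng.shuffle(mismatched)
    inadequate = list(zip(sources, mismatched))

    # Sentence-level: shuffle the words within each sentence.
    def shuffle_words(s):
        words = s.split()
        rng.shuffle(words)
        return " ".join(words)

    nonfluent = [(shuffle_words(f), shuffle_words(e)) for f, e in pairs]
    return inadequate + nonfluent
```

Note that corpus-level shuffling may occasionally leave a pair correctly aligned; in practice this small fraction does not matter for training the classifier.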

Results
We include the results of running the three versions of Zipporah in Table 1. The final column is the average score across the six test sets.
• Zipporah-synthetic denotes the system with synthetic negative examples as in the original version of Zipporah.
• Zipporah-paracrawl denotes the system trained with Paracrawl as the negative examples.
• Zipporah-paracrawl-10000 denotes the system trained with a 10,000-sentence subset of Paracrawl.

In general, our systems lag behind the top performing systems by about 3 BLEU on the average of the six test sets. The different Zipporah systems perform similarly, with a slight edge to the original version with synthetic parallel data. The near-identical performance of the 10,000-sentence system indicates that a subset can be used for faster training of Zipporah.
Zipporah does not require building an initial NMT system to score the data, as some of the top performing systems do. Zipporah also has a very fast run time; the most expensive part is the language model scoring.
Our submissions are more competitive in the SMT experiments, lagging behind the top performing system by less than a BLEU point (averaged across the test sets) for SMT systems trained on 100 million sentences. This may be due to the fact that Zipporah's adequacy and fluency scores directly track the translation and language model components of SMT.

Conclusion
Our submission to the WMT 2018 shared task on parallel corpus filtering was based on our Zipporah toolkit. We varied the method used to generate negative examples for the classifier that detects noisy sentence pairs, with similar results for synthetic noise, the full raw corpus to be filtered, and a subset of it.
We note that our method is quite simple and fast, using only n-gram language model and bag-of-words translation model features.

Table 1 :
Results of our Zipporah variants, compared to the submission with the best average test score.