Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed-quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement while using only 1/5 of the data, compared with using the entire corpus.


Introduction
Statistical machine translation (SMT) systems require parallel corpora for training their internal model parameters. Data quality is vital to the performance of an SMT system (Simard, 2014). To acquire massive parallel corpora, many researchers have turned to the Internet as a resource, but the quality of data acquired from the Internet usually comes with no guarantee, and data cleaning/data selection is needed before the data is used in actual systems. Usually, data cleaning refers to removing a small amount of very noisy data from a large data pool, and data selection refers to selecting a small subset of clean (or in-domain) data from the data pool; both have the objective of improving translation performance. For practical purposes, it is highly desirable to perform data selection in a fast and scalable manner. In this paper we introduce Zipporah,¹ a fast and scalable system which can select an arbitrary amount of good data from a large noisy data pool for use in SMT model training.
¹ https://github.com/hainan-xv/zipporah

Prior Work

Many researchers have studied the data cleaning/selection problem. For data selection, there has been a lot of work on selecting a subset of data based on domain matching. Duh et al. (2013) used a neural-network-based language model trained on a small in-domain corpus to select from a larger data pool. Moore and Lewis (2010) computed cross-entropy between in-domain and out-of-domain language models to select data for training language models. XenC (Rousseau, 2013), an open-source tool, also selects data based on cross-entropy scores from language models. Axelrod et al. (2015) utilized part-of-speech tags and used a class-based n-gram language model for selecting in-domain data. A few works utilize other metrics. Lü et al. (2007) assigned different weights to sentence pairs/predefined sub-models. Shah and Specia (2014) described experiments on quality estimation which, given a source sentence, select the best translation among several options. The qe-clean system (Denkowski et al., 2012; Dyer et al., 2010; Heafield, 2011) uses word alignments and language models to select sentence pairs that are likely to be good translations of one another.
For data cleaning, many researchers have worked on removing noisy data. Taghipour et al. (2011) proposed an outlier detection algorithm which leads to improved translation quality when a small portion of the data is trimmed. Cui et al. (2013) used a graph-based random walk algorithm to perform bilingual data cleaning. BiTextor (Esplá-Gomis and Forcada, 2009) utilizes sentence alignment scores and source URL information to filter out bad URL pairs and select good sentence pairs.
In this paper we propose a novel and efficient way to evaluate the quality of a sentence pair. We do not make a clear distinction between data selection and data cleaning in this work, because under different settings our method can perform either, based on the computed quality scores of sentence pairs.

Method
The method in this paper works as follows: we first map all sentence pairs into the proposed feature space, and then train a simple logistic regression model to separate known good data and (synthetic) bad data. Once the model is trained, it is used to score sentence pairs in the noisy data pool. Sentence pairs with better scores are added to the selected subset until the desired size constraint is met.

Features
Since good adequacy and good fluency are the two major elements that constitute a good parallel sentence pair, we propose separate features to address each of them. For adequacy, we propose bag-of-words translation scores, and for fluency we use n-gram language model scores. For notational simplicity, in this section we assume the sentence pair is French-English when describing the features, and we use subscripts f and e to indicate the languages. In designing the features, we prioritize efficiency as well as performance, since we could be dealing with corpora of huge sizes.

Adequacy scores
We view each sentence as a bag of words, and design a "distance" between the sentences of a pair based on a bag-of-words translation model. To do this, we first generate probabilistic dictionaries from an aligned corpus, and represent each dictionary as a set of triplets (w_f, w_e, p(w_e|w_f)). Given a sentence pair (s_f, s_e) in the noisy data pool, we represent the two sentences as two sparse word-frequency vectors v_f and v_e. For example, for any French word w_f, v_f(w_f) = occ(w_f, s_f) / l(s_f), where occ(w_f, s_f) is the number of occurrences of w_f in s_f and l(s_f) is the length of s_f. We do the same for v_e. Notice that by construction both vectors sum to 1, and each represents a proper probability distribution over its vocabulary. Then we "translate" v_f into a vector v'_e over the English vocabulary, based on the dictionary:

v'_e(w_e) = sum_{w_f} v_f(w_f) p(w_e|w_f)

For a French word w that does not appear in the dictionary, we keep it as it is in the translated vector, i.e. we assume there is an entry (w, w, 1.0) in the dictionary. Since the dictionary is probabilistic, the elements of v'_e also sum to 1, and v'_e represents another probability distribution over the English vocabulary. We compute the (smoothed) cross-entropy between v_e and v'_e:

xent(v_e, v'_e) = sum_w v_e(w) log(1 / (v'_e(w) + c))    (1)

where c is a smoothing constant that prevents the denominator from being zero; we set c = 0.0001 for all experiments in this paper (more on this in Section 4). We perform the same procedure in the English-to-French direction and compute xent(v_f, v'_f). We define the adequacy score as the sum of the two:

adequacy(s_f, s_e) = xent(v_e, v'_e) + xent(v_f, v'_f)
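The adequacy computation above can be sketched in a few lines of Python. This is a minimal illustration, not the released Zipporah code; the dictionary format (a nested dict mapping source words to target-word probabilities) is our assumption for the sketch, and lower scores indicate better adequacy since the score is a cross-entropy.

```python
from collections import Counter
from math import log

def bow_vector(sentence):
    """Sparse word-frequency vector over the sentence; entries sum to 1."""
    words = sentence.split()
    return {w: n / len(words) for w, n in Counter(words).items()}

def translate(v_src, dictionary):
    """Translate a source-side distribution through a probabilistic
    dictionary {w_f: {w_e: p(w_e|w_f)}}; OOV words pass through
    unchanged, i.e. an implicit entry (w, w, 1.0)."""
    v_out = {}
    for w_src, p_src in v_src.items():
        for w_tgt, p in dictionary.get(w_src, {w_src: 1.0}).items():
            v_out[w_tgt] = v_out.get(w_tgt, 0.0) + p_src * p
    return v_out

def xent(v_true, v_trans, c=0.0001):
    """Smoothed cross-entropy, equation (1)."""
    return sum(p * log(1.0 / (v_trans.get(w, 0.0) + c))
               for w, p in v_true.items())

def adequacy(s_f, s_e, dict_f2e, dict_e2f, c=0.0001):
    """Sum of the two directional smoothed cross-entropies."""
    v_f, v_e = bow_vector(s_f), bow_vector(s_e)
    return (xent(v_e, translate(v_f, dict_f2e), c) +
            xent(v_f, translate(v_e, dict_e2f), c))
```

A correctly translated pair scores near zero, while a mismatched pair is penalized heavily because its reference words receive only the smoothing mass c.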

Fluency scores
We train two n-gram language models on clean French and English corpora, and then for each sentence pair (s_f, s_e) we score each sentence with the corresponding model, giving F_ngram(s_f) and F_ngram(s_e), each computed as the ratio between the sentence's negative log-likelihood and the sentence length. We define the fluency score as the sum of the two: fluency(s_f, s_e) = F_ngram(s_f) + F_ngram(s_e)
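To make the length-normalized negative log-likelihood concrete, here is a toy add-one-smoothed bigram model standing in for the paper's 5-gram SRILM models; the class and its smoothing are our illustration only, not the actual setup.

```python
from collections import Counter
from math import log

class BigramLM:
    """Toy add-one-smoothed bigram LM (illustrative stand-in for a
    real 5-gram model trained with SRILM)."""
    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(toks[:-1])          # context counts
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(self.unigrams) + 1

    def neg_logprob(self, sent):
        toks = ["<s>"] + sent.split() + ["</s>"]
        return -sum(log((self.bigrams[(a, b)] + 1) /
                        (self.unigrams[a] + self.vocab))
                    for a, b in zip(toks, toks[1:]))

def fluency_feature(lm, sent):
    """F_ngram: negative log-likelihood divided by sentence length;
    lower values indicate more fluent sentences."""
    return lm.neg_logprob(sent) / len(sent.split())
```

A word-shuffled sentence breaks the bigram statistics and so receives a higher (worse) per-word negative log-likelihood than its original.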

Synthetic noisy data generation
We generate synthetic noisy data from good data, and make sure the generated noisy data includes sentence pairs with a) good fluency but bad adequacy, b) good adequacy but bad fluency, and c) both bad.
Respectively, we generate 3 types of "noisy" sentence pairs from a good corpus: a) shuffle the sentences in the target-language file (each sentence in the source language is then aligned to a random sentence in the target language); b) shuffle the words within each sentence (each sentence becomes bad, but the pairs remain good translations in the "bag-of-words" sense); c) shuffle both the sentences and the words. We emphasize that, while the synthetic data might not represent "real" noisy data, it has the following advantages: 1) each type of noisy data is equally represented, so the classifier has to do well on all of them; 2) the data generated this way is among the hardest to classify, especially types a and b, so if a classifier separates such hard data well, we expect it to also do well in real-world situations.

We plot the newstest09 data (the original and the auto-generated noisy data, as described in Section 3.2) in the proposed feature space in Figure 1. We observe that the clusters are quite separable, though the decision function would not be linear. We map the features into higher-order forms (x^n, y^n) in order for logistic regression to train a non-linear decision boundary.² We use n = 8 in this work since it gives the best classification performance on the newstest09 fr-en corpus.
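The three shuffling schemes described above can be sketched as follows; the function name and the fixed random seed are our choices for reproducibility of the sketch, not part of the paper's setup.

```python
import random

def make_noisy_pairs(src_sents, tgt_sents, seed=0):
    """Generate the three synthetic noise types from a clean parallel
    corpus: a) target sentences shuffled across the corpus (bad
    adequacy), b) words shuffled within each sentence (bad fluency),
    c) both shuffles applied."""
    rng = random.Random(seed)

    def shuffle_words(sents):
        out = []
        for s in sents:
            words = s.split()
            rng.shuffle(words)
            out.append(" ".join(words))
        return out

    misaligned_tgt = list(tgt_sents)
    rng.shuffle(misaligned_tgt)  # break sentence alignment

    type_a = list(zip(src_sents, misaligned_tgt))
    type_b = list(zip(shuffle_words(src_sents), shuffle_words(tgt_sents)))
    type_c = list(zip(shuffle_words(src_sents), shuffle_words(misaligned_tgt)))
    return type_a, type_b, type_c
```

Note that type b keeps each pair aligned, so it stays a good translation in the bag-of-words sense while the word order within each sentence is destroyed.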

Hyper-parameter Tuning
We conduct experiments to determine the value of the constant c in the smoothed cross-entropy computation in equation 1. We choose the newstest09 German-English corpus, shuffle the sentences in the English file, and combine the original (clean) corpus with the shuffled (noisy) corpus into a larger corpus in which half of the pairs are good. We then set different values of c, use the adequacy scores to pick the better half, and compute the retrieval accuracy.

² We avoid using multiple mappings of one feature because we want the scoring function to be monotonic w.r.t. both x and y, which could break if we allowed multiple higher-order mappings of the same feature and they ended up with weights of different signs.
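The retrieval-accuracy metric used in this sweep amounts to ranking the mixed corpus by adequacy and checking how many clean pairs land in the retrieved half; `retrieval_accuracy` is a hypothetical helper we introduce for illustration.

```python
def retrieval_accuracy(scored_pairs):
    """scored_pairs: list of (adequacy_score, is_clean) tuples, where
    lower adequacy (cross-entropy) means a better pair. Retrieve the
    better-scoring half and report the fraction that is truly clean."""
    ranked = sorted(scored_pairs, key=lambda p: p[0])
    half = ranked[:len(ranked) // 2]
    return sum(1 for _, clean in half if clean) / len(half)
```

Sweeping c then reduces to recomputing the adequacy scores for each candidate value and keeping the c that maximizes this accuracy.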

Evaluation
We evaluate Zipporah on 3 language pairs: French-English, German-English and Spanish-English. The noisy web-crawled data comes from an early version of http://statmt.org/paracrawl. The corpora contain 340, 487 and 70 million words, respectively. To generate the dictionaries for computing the adequacy scores, we use fast_align (Dyer et al., 2013) to align the Europarl (Koehn, 2005) corpus and generate probabilistic dictionaries from the alignments. We set the n-gram order to 5 and use SRILM (Stolcke et al., 2011) to train language models on the Europarl corpus and generate the n-gram scores.
For each language pair, we use scikit-learn (Pedregosa et al., 2011) to train a logistic regression model to classify between the original and the synthetic noisy corpus of newstest09, and the trained model is used to score all sentence pairs in the data pool. We keep selecting the best ones until the desired number of words is reached.
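The training and scoring step can be sketched with scikit-learn as below. This is a minimal sketch under stated assumptions: the feature values are random placeholders standing in for real adequacy/fluency scores (lower = better in both cases), and `map_features` is our illustration of the (x^n, y^n) mapping with n = 8, not the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def map_features(adequacy, fluency, n=8):
    """Map (x, y) to (x^n, y^n) so logistic regression can learn a
    non-linear boundary while staying monotonic in each raw feature."""
    return np.column_stack([np.power(adequacy, n), np.power(fluency, n)])

# Placeholder feature values: clean pairs score lower than noisy ones.
rng = np.random.default_rng(0)
X_good = rng.uniform(0.0, 1.0, size=(100, 2))  # clean newstest09 stand-in
X_bad = rng.uniform(1.0, 2.0, size=(100, 2))   # synthetic noisy stand-in

X = np.vstack([map_features(X_good[:, 0], X_good[:, 1]),
               map_features(X_bad[:, 0], X_bad[:, 1])])
y = np.array([1] * len(X_good) + [0] * len(X_bad))  # 1 = clean

clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)  # higher = more likely clean
order = np.argsort(-scores)        # take from the top until the word budget is met
```

Selection then walks down `order`, accumulating sentence pairs until the desired number of words is reached.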
To evaluate the quality of the selected data, we train a Moses (Koehn et al., 2007) SMT system on it, and evaluate each trained SMT system on 3 test corpora: newstest2011, which contains 3003 sentence pairs, and random subsets of the TED-talks corpus and the movie-subtitles corpus from OPUS (Tiedemann, 2012), each of which contains 3000 sentence pairs. Tables 2, 3 and 4 show the BLEU performance of the selected subsets of the Zipporah system compared to the baseline, which selects sentence pairs at random; for comparison, we also give the BLEU performance of systems trained on Europarl. In particular, for the German-English corpus, when selecting less than 2% of the data (10 million words), Zipporah achieves a 5.5 BLEU score improvement over the baseline on the TED-talks dataset; by selecting less than 4% of the data (20 million words), the system gives better performance than using all the data. Peak performance is achieved when selecting 100 million words, where an improvement of 2.1 BLEU score over all data is achieved on the movie-subtitles dataset, despite using less than 1/5 of the data.

We also compare Zipporah with qe-clean (Denkowski et al., 2012; Dyer et al., 2010; Heafield, 2011) and the random baseline. We use the same data when running qe-clean, with Europarl for training and newstest09 for dev. While both perform comparably and better than the baseline, Zipporah achieves a better peak on all the datasets, and the peak is usually achieved when selecting a smaller number of words than qe-clean. Another advantage of Zipporah is that it allows the user to select an arbitrary size from the pool. We also note that the BLEU performance of the selected subsets of the Zipporah system can surpass that of Europarl, although the Europarl corpus acts like an "oracle" in the system, upon which the dictionaries and language models for feature computations are trained.

Figure 4: BLEU performance of Zipporah, qe-clean and random on TED-talks, Spanish-English
We also want to emphasize that, unlike qe-clean, which requires running word alignments for all sentence pairs in the noisy corpus, Zipporah's feature computation is simple, fast, and easily scaled to huge datasets.

Conclusion and Future Work
In this paper we introduced Zipporah, a fast data selection system for noisy parallel corpora. SMT results demonstrate that Zipporah can select a high-quality subset of the data and significantly improve SMT performance.
Zipporah currently selects sentences based on "individual quality" only; in future work we plan to also consider other factors, e.g. encouraging the selection of a subset with better n-gram coverage.