Parallel Corpus Filtering via Pre-trained Language Models

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task and on our own web-crawled Japanese-Chinese parallel corpus, which we make publicly available. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method.


Introduction
Training modern neural machine translation (NMT) systems requires large parallel-text resources. Publicly available parallel corpora are mostly paired with English, such as German-English, French-English, and Chinese-English, and their domains are limited. For building machine translation systems between non-English language pairs, such as Chinese and Japanese, existing parallel corpora are insufficient and often of low quality. To address this problem, system builders have trained NMT systems on web-crawled data and achieved promising results (Xu and Koehn, 2017; Junczys-Dowmunt, 2018; Schwenk, 2018). However, data automatically crawled from the web is extremely noisy. Belinkov and Bisk (2018) show that neural translation models are far more sensitive to noisy parallel training data than statistical machine translation. Data selection methods that can filter noisy parallel sentences from large-scale web-crawled resources are in demand.
In this paper, we study the problem in a real-world scenario where we crawl a large Japanese-Chinese parallel corpus from various websites and build open-domain machine translation systems between Japanese and Chinese by filtering the web-crawled parallel corpus. In addition, a small amount of clean parallel data is available, in the software domain. To confirm our results on public data, we also apply our filter to the WMT 2018 German-English Parallel Corpus Filtering shared task.
Previous work on parallel corpus filtering performs poorly in our scenario as it either requires large clean parallel corpora or dictionaries (Xu and Koehn, 2017; Artetxe and Schwenk, 2019; Junczys-Dowmunt, 2018), or relies on multilingual word embeddings and neglects context when measuring translation parallelism (Hangya and Fraser, 2018).
In this paper, we propose a simple but effective parallel corpus filtering method. Multilingual BERT (Devlin et al., 2019) projects multilingual sentences into a shared space and has shown great potential for cross-lingual model transfer (Pires et al., 2019). We use pre-trained multilingual BERT as prior knowledge and fine-tune it on a synthetic dataset. This multilingual BERT-based classifier forms an acceptability filter that determines whether or not a sentence pair consists of a bona-fide translation.
As the domain of training data largely affects machine translation model performance, we also introduce a domain filter. It uses the pre-trained Generative Pre-training (GPT) model as the in-domain language model and extends the existing cross-entropy difference based domain filter (Moore and Lewis, 2010; Junczys-Dowmunt, 2018).
We evaluate our proposed method on the WMT 2018 German-English Parallel Corpus Filtering shared task and achieve a new state-of-the-art. Our unsupervised method achieves comparable performance to the top system that is trained on millions of clean parallel sentence pairs. Our proposed methods also significantly outperform baselines in our own Japanese-Chinese parallel corpus filtering task.
We make the following contributions:
• We propose a novel approach to filter noisy parallel corpora by using pre-trained language models. Our approach outperforms strong baselines and achieves a new state-of-the-art.
• We devise an unsupervised filtering approach that does not require an identifiable clean subset of parallel segments. Our unsupervised method matches the results of previous supervised methods.
• We release a large web-crawled Japanese-Chinese parallel corpus which can be a useful resource for machine translation research on non-English language pairs.

Related Work
Several recent works address parallel corpus filtering. Denkowski et al. (2012), Dyer et al. (2010) and Heafield (2011) use language models and word alignments to determine how likely sentences are to be good translations of one another. Xu and Koehn (2017) introduce a noise filtering tool, Zipporah, that discriminates parallel and non-parallel sentences based on word-frequency vectors and a dictionary. Junczys-Dowmunt (2018) proposes a dual conditional cross-entropy filtering method, which achieved first place in the WMT 2018 German-English Parallel Corpus Filtering shared task. This method trains two translation models in inverse directions on millions of parallel sentences and scores sentence pairs based on the word-normalized conditional cross-entropy from the translation models. Artetxe and Schwenk (2019) and Schwenk (2018) propose a margin-based scoring method that compares the similarity of the source and target sentence representations. The sentence representations are produced by a sentence encoder trained on clean parallel data via a neural encoder-decoder architecture.
Other works based on sentence embeddings include Hangya and Fraser (2018), as well as work that mines millions of parallel sentences in 1620 language pairs from Wikipedia. These encoder-decoder based methods require large amounts of clean parallel training data and are not applicable in our scenario where available data is noisy. Ondrej Bojar (2020) organize an open domain translation challenge where participants are provided a large, noisy set of Japanese-Chinese segment pairs built from web data, and the task is to clean the noisy data and build an end-to-end machine translation system.
Work on data selection is also related. Moore and Lewis (2010) and Junczys-Dowmunt (2018) select domain-related data by computing the cross-entropy difference between in-domain and out-of-domain language models. Duh et al. (2013) use neural language models for data selection. Axelrod et al. (2011) and Axelrod et al. (2015) expand cross-entropy difference filtering to both sides of the parallel corpus. Since we aim to build a general machine translation system, instead of selecting data that are relevant to a specific domain, we select data whose domains are as general as possible, by using Generative Pre-training (GPT) models trained on large and diverse corpora.

Method
In this section we introduce a language detection filter, a translation-acceptability filter, and a domain filter. Each filter produces a score for every candidate source/target sentence pair. The partial score produced by each filter ranges from 0 to 1. Values beyond this range are normalized by min-max normalization: $\hat{y} = (y - \min)/(\max - \min)$. The final score is the product of the partial scores.
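The score combination described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `to_unit_range` and `final_scores` are hypothetical, normalization is applied per filter over all candidate pairs, and the handling of a constant out-of-range score is our own choice rather than something the method specifies.

```python
import math
from typing import List, Sequence


def to_unit_range(values: Sequence[float]) -> List[float]:
    """Leave scores already in [0, 1] untouched; otherwise min-max normalize."""
    lo, hi = min(values), max(values)
    if lo >= 0.0 and hi <= 1.0:
        return list(values)
    if hi == lo:
        # Degenerate case (assumption): a constant out-of-range score
        # carries no ranking information, so map it to 0.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def final_scores(partial_scores: Sequence[Sequence[float]]) -> List[float]:
    """Multiply the per-filter scores of each sentence pair.

    partial_scores holds one list per filter, each aligned with the
    same list of candidate sentence pairs.
    """
    normalized = [to_unit_range(scores) for scores in partial_scores]
    return [math.prod(per_pair) for per_pair in zip(*normalized)]
```

Because the final score is a product, a pair that any single filter scores at 0 is removed regardless of its other scores.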

Language Detection Filter
Targeting a web crawler at a given language pair still results in many pages written in the wrong language. For example, while a URL pair may clearly indicate translation (e.g., ".jp" and ".zh"), it may happen that the text content is simply copied rather than translated. We observe this in both our Japanese-Chinese data and the German-English Paracrawl data set. It is necessary to filter out sentence pairs with undesired languages.
We adopt the fastText (Joulin et al., 2016, 2017) language identification toolkit in our language detection filter. For each sentence, the toolkit produces a list of language candidates and their corresponding confidence scores. We select the language that has the highest confidence score from fastText as the language of the sentence. Sentence pairs in which both sides are detected as the desired languages are assigned a score of 1, and all others a score of 0. By discarding sentence pairs with undesired language IDs, we filter out 27% of our Chinese-Japanese parallel sentences and nearly 70% of the German-English parallel sentences from the Paracrawl data set.
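As a minimal sketch of this 0/1 scoring rule, the function below takes the language detector as a parameter so that the example stays self-contained. The names `language_score` and `toy_detect` are hypothetical, and `toy_detect` is a crude character-range stand-in, not fastText. In practice the detector could wrap a fastText language identification model, for example by calling `model.predict(sentence, k=1)` and stripping the `__label__` prefix from the returned label.

```python
from typing import Callable, Tuple

# A detector maps a sentence to (language code, confidence).
Detector = Callable[[str], Tuple[str, float]]


def language_score(src: str, tgt: str, src_lang: str, tgt_lang: str,
                   detect: Detector) -> int:
    """Return 1 if both sides are detected as the desired languages, else 0."""
    src_detected, _ = detect(src)
    tgt_detected, _ = detect(tgt)
    return int(src_detected == src_lang and tgt_detected == tgt_lang)


def toy_detect(sentence: str) -> Tuple[str, float]:
    """Toy stand-in detector: kana implies Japanese, CJK ideographs Chinese."""
    if any("\u3040" <= ch <= "\u30ff" for ch in sentence):
        return "ja", 1.0
    if any("\u4e00" <= ch <= "\u9fff" for ch in sentence):
        return "zh", 1.0
    return "en", 1.0
```

With this rule, a page pair whose Chinese text was copied verbatim to the Japanese side scores 0, which is the copied-content case described above.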

Acceptability Filter
In this section, we introduce our translation acceptability filter, one of the main contributions in the paper. It aims to measure the parallelism of sentence pairs and filter out sentence pairs that are not mutual translations.
The pre-trained language model BERT (Devlin et al., 2019) has been shown to be effective in many NLP tasks as it produces better and meaningful contextualized word representations. Multilingual BERT, a transformer Masked Language Model pre-trained on Wikipedia dumps of 104 languages, shows remarkable multilingual capability, given that it is not exposed to any multilingual signals, such as parallel data or dictionaries. A thorough study by Pires et al. (2019) shows the promising zero-shot cross-lingual model transfer ability of multilingual BERT on named entity recognition and part-of-speech tagging tasks. They hypothesize that having language-universal word pieces, such as numbers and URLs, mapped to a shared space forces the co-occurring pieces to also be mapped to a shared space, thus spreading the effect to other word pieces, until different languages are close in the shared space.
We use pre-trained multilingual BERT to encode a sentence pair (s, t) and create the sentence embeddings $v_s$ and $v_t$ by using the representations of the [CLS] token of s and t. We find that the cosine similarity between $v_s$ and $v_t$ does not necessarily reflect the parallelism of sentences s and t. We suspect that the word representations from multilingual BERT are loosely aligned across languages as there is no parallel data or dictionary used during the pre-training. A similar observation was made in Lample et al. (2018), where the cross-lingual word embeddings learned in an unsupervised manner are loosely aligned. However, after fine-tuning on a few anchor pairs (word translations), they become more aligned.
Similarly, we use an unsupervised synthetic training set as anchors to fine-tune multilingual BERT with a binary classification objective. Xu and Koehn (2017) did similar work to train a filtering classifier on synthetic data, but via bag-of-words translation features.
Synthetic Training Set. In cases where a small number of clean parallel sentence pairs are available, we use them as positive training samples for our classifier. In Japanese-Chinese filtering, we use around 300k sentence pairs, mostly from open-source software documentation, as our positive samples. In extreme cases where no identifiable, clean parallel data is available, we sub-select high-quality parallel sentences from the noisy parallel corpus based on the Hunalign (Varga et al., 2007) sentence-alignment score and use them as positive samples. We sample negative instances by simulating the noise produced by web crawling and alignment. Given a positive pair (s, t), we create a negative sample by randomly choosing one of the following options:
• Randomly select a target sentence from its adjacent sentences within a window size of k (where k = 2 in our experiments).
• Swap the order of 30%-70% of the words of the source or target sentence.
To balance the training set, we create the same number of positive instances and sampled negative instances.
Binary Classification Objective. We feed the sentence pair (s, t) into multilingual BERT, which accepts two-sentence input due to its next-sentence prediction objective (Devlin et al., 2019). Instead of using the [CLS] token representation, we use a convolutional neural network (CNN) layer that takes the BERT output and generates the final representation of the pair. Our experiments show that using CNN layer pooling achieves marginal gains over [CLS] pooling. The final layer is a feed-forward network with a softmax activation function to produce label probabilities. We use the softmax probability as the degree of parallelism.
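The negative sampling used to build the synthetic training set can be sketched as follows. Several details are assumptions made for illustration: sentences are taken to be whitespace-tokenized (Japanese and Chinese would need a segmenter first), "swapping the order" is interpreted as permuting a random 30%-70% subset of word positions, the two corruption types are chosen with equal probability, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def misaligned_target(pairs: List[Pair], i: int, k: int,
                      rng: random.Random) -> Pair:
    """Pair source i with the target of a neighbour at most k positions away."""
    lo, hi = max(0, i - k), min(len(pairs), i + k + 1)
    candidates = [j for j in range(lo, hi) if j != i]
    j = rng.choice(candidates)  # requires at least two pairs in the corpus
    return pairs[i][0], pairs[j][1]


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Permute a random 30%-70% subset of the word positions.

    The permutation can occasionally be the identity, so a caller may
    want to retry when the output equals the input.
    """
    words = sentence.split()
    n = round(len(words) * rng.uniform(0.3, 0.7))
    positions = rng.sample(range(len(words)), n)
    permuted = positions[:]
    rng.shuffle(permuted)
    out = words[:]
    for src_pos, dst_pos in zip(positions, permuted):
        out[dst_pos] = words[src_pos]
    return " ".join(out)


def make_negative(pairs: List[Pair], i: int, rng: random.Random,
                  k: int = 2) -> Pair:
    """Corrupt positive pair i with one randomly chosen noise type."""
    src, tgt = pairs[i]
    if rng.random() < 0.5:
        return misaligned_target(pairs, i, k, rng)
    if rng.random() < 0.5:
        return shuffle_words(src, rng), tgt
    return src, shuffle_words(tgt, rng)
```

Calling `make_negative` once per positive pair yields the balanced training set described above.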

Domain Filter
Web-crawled data contains noise of various types, due to the complicated structure of web pages. By inspecting the training data generated by the above methods, we notice much of the content is not well-formed, e.g., concatenated lists of months and dates, randomly mixed content from tables, series of emojis and punctuation marks, etc. These are certainly written in the desired language, thus not filtered out by language detection. The translation acceptability filter also accepts them. However, such malformatted data is not helpful to machine translation models, and we prefer a training corpus to contain meaningful content.
For our domain filter, we adopt the cross-entropy difference scoring method proposed by Moore and Lewis (2010) and Junczys-Dowmunt (2018). More specifically, we treat a general-domain monolingual corpus as our in-domain data set I, and the noisy parallel corpus without any filtering as our non-domain data set N. We train two language models $L_I$ and $L_N$ and measure how much more closely the target sentence t is related to I than to N by a perplexity ratio, which is a transformation of cross-entropy difference:
$$\hat{f}_{\mathrm{dom}}(s, t) = \frac{\mathrm{PPL}_N(t)}{\mathrm{PPL}_I(t)}$$
where $\mathrm{PPL}_M(x)$ is the word-normalized perplexity of the sentence x defined by the language model $L_M$:
$$\mathrm{PPL}_M(x) = \exp\left(-\frac{1}{|x|} \sum_{i=1}^{|x|} \log P_M(x_i \mid x_{<i})\right)$$
The intuition is fairly straightforward: the higher the perplexity of the sentence to the non-domain corpus and the lower the perplexity of the sentence to the in-domain corpus, the more likely the sentence is meaningful.
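To make the perplexity ratio concrete, here is a small sketch that assumes each language model has already produced the total natural-log probability of the target sentence. The function names are hypothetical, and normalizing both models by the same word count is an assumption made so that scores from a subword model and a word-level model stay comparable. The logarithm of the ratio equals the per-word cross-entropy under the non-domain model minus that under the in-domain model, which is why the ratio is a transformation of cross-entropy difference.

```python
import math


def perplexity(total_log_prob: float, num_words: int) -> float:
    """Word-normalized perplexity from a sentence's total natural-log probability."""
    return math.exp(-total_log_prob / num_words)


def domain_ratio(non_domain_log_prob: float, in_domain_log_prob: float,
                 num_words: int) -> float:
    """Perplexity ratio PPL_N(t) / PPL_I(t); larger means more in-domain."""
    return (perplexity(non_domain_log_prob, num_words)
            / perplexity(in_domain_log_prob, num_words))
```

For a four-word sentence with per-word probability 0.5 under the in-domain model and 0.125 under the non-domain model, the perplexities are 2 and 8, giving a ratio of 4.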
Our contribution is to use GPT (Radford et al., 2019) as our in-domain language model, instead of news domain text (Junczys-Dowmunt, 2018).
This minor yet crucial change yields non-trivial performance gains in our experiments for German-English parallel corpus filtering. As GPT is trained on data from various sources, such as Wikipedia, Reddit, news websites, etc., it covers a wide range of domains, so our filtered data is more diverse and performs better on multi-domain test sets, as well as in the real world application.
For our in-domain language model, we use a pre-trained Chinese GPT for Japanese-Chinese and pre-trained GPT-2 for German-English. We randomly sample 4 million sentences from the unfiltered noisy parallel corpus and use KenLM (Heafield, 2011) to train the non-domain language model. Perplexity scores from different language models are compatible.
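Following Junczys-Dowmunt (2018), we introduce two operations, clip and cutoff, to post-process the raw domain filter score $\hat{f}_{\mathrm{dom}}(s, t)$. The clip operation clips the maximum value of the domain score to a threshold $\tau_{\mathrm{clip}}$:
$$\mathrm{clip}(x, \tau_{\mathrm{clip}}) = \min(x, \tau_{\mathrm{clip}})$$
and the cutoff operation modifies scores below a threshold $\tau_{\mathrm{cutoff}}$ and changes them to 0:
$$\mathrm{cutoff}(x, \tau_{\mathrm{cutoff}}) = \begin{cases} x, & \text{if } x > \tau_{\mathrm{cutoff}} \\ 0, & \text{otherwise} \end{cases}$$
$\tau_{\mathrm{clip}}$ prevents a high monolingual in-domain score from overwriting scores from other filters. $\tau_{\mathrm{cutoff}}$ eliminates out-of-domain sentence pairs and ensures that highly parallel sentence pairs are at least somewhat in-domain. We tune $\tau_{\mathrm{clip}}$ and $\tau_{\mathrm{cutoff}}$ on the development set. The scoring method of our final domain filter becomes:
$$f_{\mathrm{dom}}(s, t) = \mathrm{clip}(\mathrm{cutoff}(\hat{f}_{\mathrm{dom}}(s, t), \tau_{\mathrm{cutoff}}), \tau_{\mathrm{clip}})$$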
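The two post-processing operations are simple enough to write down directly. The sketch below uses hypothetical function names and takes its default thresholds from the values reported for the German-English experiments (clip at 5, cutoff at 1.5). The order of composition is an assumption, but whenever the cutoff threshold is below the clip threshold, applying the two operations in either order gives the same result.

```python
def clip(x: float, tau_clip: float) -> float:
    """Cap the score at tau_clip."""
    return min(x, tau_clip)


def cutoff(x: float, tau_cutoff: float) -> float:
    """Zero out scores that do not exceed tau_cutoff."""
    return x if x > tau_cutoff else 0.0


def domain_score(raw: float, tau_clip: float = 5.0,
                 tau_cutoff: float = 1.5) -> float:
    """Post-processed domain score: cutoff first, then clip."""
    return clip(cutoff(raw, tau_cutoff), tau_clip)
```

A raw ratio of 8.0 is capped at 5.0, a ratio of 3.0 passes through unchanged, and a ratio of 1.2 is set to 0, which removes the pair once the partial scores are multiplied.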

WMT 2018 Parallel Corpus Filtering
We use the WMT 2018 Parallel Corpus Filtering shared task as a benchmark to evaluate our methods. Participants in the shared task are provided a very noisy 1 billion word (English token count) German-English corpus crawled from the web by the Paracrawl project. The task is to sub-select clean sentence pairs amounting to (a) 10 million words, and (b) 100 million words, counted on the English side. The quality of the resulting subsets is determined by training a neural machine translation system (Marian) 6 (Junczys-Dowmunt et al., 2018) on this data. The quality of the machine translation system is measured by BLEU score on six test sets from various domains. As the task is to address the challenge of data quality and not the domain-relatedness of the data for a particular use, sub-sampling the corpus for relevance to the news domain is not encouraged by the shared task organizers. All parameters used for training Marian machine translation models are the same as those described for the shared task. We use $\tau_{\mathrm{clip}} = 5$ and $\tau_{\mathrm{cutoff}} = 1.5$ in the experiments. We use 4 GPUs for training.
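Sub-selecting a fixed word budget from scored sentence pairs can be illustrated as follows. This is a simplified sketch and not the shared task's official selection tool: the function name is hypothetical, English words are counted by whitespace splitting, and selection stops at the first pair that would exceed the budget.

```python
from typing import Iterable, List, Tuple

ScoredPair = Tuple[float, str, str]  # (score, German sentence, English sentence)


def select_by_word_budget(scored: Iterable[ScoredPair],
                          budget: int) -> List[Tuple[str, str]]:
    """Take the highest-scoring pairs until the English word budget is reached."""
    selected: List[Tuple[str, str]] = []
    used = 0
    for _score, de, en in sorted(scored, key=lambda item: item[0], reverse=True):
        n_words = len(en.split())
        if used + n_words > budget:
            break
        selected.append((de, en))
        used += n_words
    return selected
```

For the two settings of the task, `budget` would be 10,000,000 and 100,000,000 respectively.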

Web-Crawled Japanese-Chinese Parallel Corpus Filtering
Due to the lack of publicly available Japanese-Chinese parallel corpora, we build a data harvesting pipeline to fetch Japanese-Chinese parallel text from the Internet. The crawled bitext is extremely noisy, but we rely on the proposed parallel corpus filtering method to clean up the data and eventually train a satisfactory machine translation system. In this paper, we use the crawled data as another test bed to evaluate our proposed method. A single run of the data harvesting pipeline is the following. We first identify Japanese-Chinese parallel webpages by programmatically analyzing the URL structure of the 5 billion URLs from CommonCrawl. 7 For example, https://www.gotokyo.org/jp/ and https://www.gotokyo.org/cn/ differ only by jp and cn. Then we download the webpages and apply a series of cascaded data cleaning steps, including removing HTML markup and sentence segmentation. Finally we perform segment alignment and filtering. Our workflow consists of several runs of the data harvesting pipeline with entry points at different modules (for instance, a more targeted crawling of higher quality material from a previous run).
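The URL-structure matching step can be sketched with the standard library alone. The sketch makes several assumptions: language markers appear as whole path segments, the code sets `JA_CODES` and `ZH_CODES` are illustrative (the text only mentions jp and cn), and the function names are hypothetical. The actual pipeline may use richer patterns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

JA_CODES = {"jp", "ja"}  # assumed markers for Japanese pages
ZH_CODES = {"cn", "zh"}  # assumed markers for Chinese pages


def language_neutral_key(url: str) -> Optional[Tuple[str, str]]:
    """Replace the first language path segment with a placeholder.

    Returns (key, side) where side is "ja" or "zh", or None if the URL
    contains no recognized language segment.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for idx, segment in enumerate(segments):
        code = segment.lower()
        if code in JA_CODES or code in ZH_CODES:
            side = "ja" if code in JA_CODES else "zh"
            segments[idx] = "{lang}"
            key = urlunsplit((parts.scheme, parts.netloc,
                              "/".join(segments), parts.query, ""))
            return key, side
    return None


def pair_urls(urls: Iterable[str]) -> List[Tuple[str, str]]:
    """Group URLs that differ only in the language segment into (ja, zh) pairs."""
    buckets: Dict[str, Dict[str, str]] = defaultdict(dict)
    for url in urls:
        hit = language_neutral_key(url)
        if hit is not None:
            key, side = hit
            buckets[key].setdefault(side, url)
    return [(b["ja"], b["zh"]) for b in buckets.values()
            if "ja" in b and "zh" in b]
```

Applied to the gotokyo.org example above, the two URLs share the same key and are returned as one candidate page pair.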
We also integrate existing Japanese-Chinese parallel datasets from other publicly available sources for a final parallel data size of 527M characters in 20.9M parallel segments.
We include all details of our data harvesting pipeline, as well as the statistics of the obtained dataset, in Appendix A.
6 https://github.com/marian-nmt/marian (We do not evaluate our method using Moses, the statistical machine translation system provided by WMT, as neural machine translation better fits our real-world scenario.)
7 https://commoncrawl.org/
Test and Development Dataset. We curate two parallel test sets by manually processing web data involving daily expressions (337 parallel segments) and news (437 parallel segments). For our development set, we use 5304 Japanese-Chinese basic expressions.

Results and Analysis
WMT 2018 Parallel Corpus Filtering. Table 1 presents the BLEU scores of neural machine translation systems trained on 10 million and 100 million words of training data, selected by different filtering methods. In the table, we list the top three performers from the shared task, as well as two other works that are similar to ours. Junczys-Dowmunt (2018) has a dual conditional cross-entropy adequacy filter and a domain filter trained on news corpora. Hangya and Fraser (2018) generate sentence embeddings by using unsupervised word embedding alignment and measure parallelism via multilingual sentence embedding similarity. A further comparison system leverages massive publicly available English-German parallel corpora to train multilingual sentence embeddings via a bidirectional Long Short-Term Memory (LSTM) encoder-decoder network.
We replicate the adequacy and domain-news filters from Junczys-Dowmunt (2018) and obtain similar results. By replacing the domain-news filter with our domain-GPT filter, we achieve new state-of-the-art scores on the 10M and 100M word data sets (bold scores in the table). Given the very compact score range in the shared task, we consider this gain substantial. The shared task states that the test sets are from multiple domains. The domain-news filter in Junczys-Dowmunt (2018) tends to select sentence pairs from the news domain, as the filter is trained on news-domain data, and this leads to a biased parallel corpus for training machine translation systems. Our proposed domain-GPT filter is trained on text from various sources and thus covers a wide range of domains, so our filtered data is more diverse and performs better on multi-domain test sets.
For our supervised acceptability filter, we train a multilingual BERT classifier on clean parallel sentences as positive examples and randomly sampled negative instances, using the method described in Section 3.2. For our unsupervised acceptability filter, we rank noisy parallel sentences by (a) the alignment score from Hunalign, and (b) the GPT domain filter score. We then select the top 10M words (counted on the English side) worth of sentence pairs as positive examples. This makes the method completely unsupervised, not requiring any identifiable clean parallel data. By fine-tuning multilingual BERT on sentence pairs aligned by Hunalign, the unsupervised acceptability filter already achieves performance comparable to the system that uses massive public parallel data. After applying the unsupervised domain-GPT filter, we achieve a surprisingly good result (underlined scores in the table), comparable to the best supervised method.
Table 2: BLEU scores of Japanese-Chinese and Chinese-Japanese MT systems trained on data sets generated by various filtering methods. We rank sentence pairs by filtering scores and train an MT system on N percent of the top ranked data. N is selected based on the development set and we report the best BLEU score. domain-GPT is the domain filter whose in-domain language model is the pre-trained GPT language model; note that for ZH-JA, we do not have access to pre-trained Japanese GPT. (* percentage of raw parallel sentences used for MT training.)
Japanese-Chinese Parallel Corpus Filtering. In Table 2, we evaluate machine translation systems trained on data generated by different filtering methods. Unfiltered refers to data generated by Hunalign without any filtering. We also compare against LASER, the top-performing filtering system in the WMT 2019 Parallel Corpus Filtering shared task. We use the pre-trained 93-language LASER model to generate sentence pair scores. The model is trained on a large parallel corpus that contains 3.2M English-Japanese and 8.2M English-Chinese sentence pairs (English is used as a pivot to connect Japanese and Chinese during training). Adequacy refers to the dual conditional cross-entropy filtering method that we replicate from Junczys-Dowmunt (2018). It is trained on around 300k high-quality software-domain parallel sentences from Microsoft Developer Network (MSDN) and Ubuntu. The GPT domain filter uses a pre-trained Chinese GPT as the in-domain language model and trains a four-gram KenLM (Heafield, 2011) language model on the Chinese side of our 4 million unfiltered noisy parallel sentences as a non-domain language model. Acceptability is our proposed multilingual BERT based filtering method, which is trained on a synthetic dataset, where we use 300k high-quality software-domain parallel sentences as positive examples and sample an equal number of negative sentence pairs, using the sampling methods described in Section 3.2.
LASER trains a multilingual sentence encoder on various English-foreign-language parallel corpora and demonstrates zero-shot cross-lingual transfer between non-English pairs, such as Japanese and Chinese. However, when English is used as the pivot, the distance between Japanese and Chinese becomes larger, so the correlation between them is not captured effectively. The conditional cross-entropy metric in adequacy relies on the quality of the machine translation system.
Due to the difficulty of training high-quality machine translation systems on 300k sentence pairs, the adequacy filter cannot produce accurate conditional cross-entropy scores. The GPT domain filter assigns higher scores to sentences that are more like natural human language and downgrades malformatted sentence pairs. It is effective in the German-English filtering task, where a fixed-size subset is selected and we want to fill the subset with as much domain-relevant data as possible. However, to best fit the real-world scenario where the goal is to have the best machine translation system, we do not limit the amount of data selected for training and instead choose the amount separately for each filtering method. We rank sentence pairs by their filtering scores and train an MT system on the top N percent of the ranked data. N is selected based on the development set and we report the best BLEU score. Under this setting, adding a domain filter makes the model use less data (N = 50% vs N = 75%), but we do not observe any performance gain. We suspect that the malformatted but parallel sentence pairs are neither harmful nor helpful to the model, so filtering them out makes no difference in performance.
High Precision Parallel Corpus Filtering. For analysis purposes, we manually annotate a small set of 320 sentence pairs randomly selected from our original web-crawled Japanese-Chinese data set. 24% of the sentence pairs are labeled "not mutual translations." As noted in the introduction, neural machine translation models are more sensitive to noise than statistical machine translation models, so high-precision filtering of the training data is necessary. In Figure 1, we show precision and recall curves for our proposed filtering method on this labeled test set, under different threshold settings. The threshold is applied to the classifier probability produced by the softmax layer.
By setting the threshold to 0.9, we obtain parallel sentences at 97.7% precision while still retaining 66.9% recall.
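The precision and recall values at a given threshold can be computed as in the sketch below. The function name is hypothetical, and the convention of returning a precision of 1.0 when nothing is kept is our own choice. Here `probs` are the classifier probabilities and `labels` mark whether annotators judged each pair to be a mutual translation.

```python
from typing import Sequence, Tuple


def precision_recall(probs: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'keep' decisions at a probability threshold."""
    kept = [p >= threshold for p in probs]
    tp = sum(1 for k, y in zip(kept, labels) if k and y)
    fp = sum(1 for k, y in zip(kept, labels) if k and not y)
    fn = sum(1 for k, y in zip(kept, labels) if not k and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `threshold` over a grid of values and plotting the resulting pairs gives precision and recall curves of the kind shown in Figure 1.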

Conclusions
In this paper, we address the parallel corpus filtering problem in machine translation. We propose a novel filtering method using pre-trained language models. Our method outperforms strong baselines and achieves a new state-of-the-art. We release a large Japanese-Chinese web-crawled parallel corpus for research purposes. Because it is artificial to use synthetic data for training a filter classifier, future work can focus on a better objective that models parallelism more smoothly. Future work also includes extending the method to low-resource languages not covered by multilingual BERT.
A Web-Crawled Parallel Data for Japanese-Chinese
This appendix describes our pipeline to extract parallel Japanese-Chinese sentence fragments from the Internet (Figure 2). We start with 5 billion URLs from CommonCrawl. We identify Japanese-Chinese parallel webpages by looking at URL structure (step 2). For example, https://www.gotokyo.org/jp/ and https://www.gotokyo.org/cn/ differ only by jp and cn. We download these potentially parallel page pairs (step 3), remove HTML and other markup metadata (step 4), and split into sentence segments. We use off-the-shelf Hunalign for segment alignment (step 5). We filter segment pairs by rough language ID and length ratio (step 6). We obtain 227k URL pairs, 1.4M segment pairs, and 28.7M characters of parallel data (measured on the Chinese side).
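The length-ratio check in step 6 might look like the following sketch. The function name and the `max_ratio` value of 3.0 are hypothetical, since the appendix does not state the threshold, and lengths are measured in characters because neither language is whitespace-delimited.

```python
def length_ratio_ok(ja_segment: str, zh_segment: str,
                    max_ratio: float = 3.0) -> bool:
    """Keep a segment pair only if neither side is far longer than the other.

    max_ratio is an illustrative value, not the one used in the pipeline.
    """
    a, b = len(ja_segment), len(zh_segment)
    if min(a, b) == 0:
        return False  # an empty side cannot be a translation
    return max(a, b) / min(a, b) <= max_ratio
```

A check like this is cheap enough to run before the more expensive model-based filters.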
From the 227k URL pairs above, we trace which site pairs yielded the most parallel data. We then run a deep-crawling module on each of the 6000 most promising sites, and we process the resulting URLs using the rest of the pipeline. Concatenating parallel data from all runs (step 7) and running a simple post-processing filter to remove objectionable content in the text gathered, we obtain around 494M characters of parallel data (measured on the Chinese side).
We also integrate existing Japanese-Chinese parallel datasets from other publicly available sources for a final parallel data size of 527M characters in 20.9M parallel segments. Table 3 describes the various components of this dataset.