Noisy Parallel Corpus Filtering through Projected Word Embeddings

We present a very simple method for parallel text cleaning for low-resource languages, based on projecting word embeddings trained on large monolingual corpora of a high-resource language. In spite of its simplicity, our method approaches the strong baseline system in the downstream machine translation evaluation.


Introduction
With the advent of web-scale parallel text mining, quality estimation and filtering are becoming increasingly important steps in multilingual NLP. Existing methods focus on languages with relatively large amounts of parallel text available (Schwenk, 2018; Artetxe and Schwenk, 2018), but scaling down to languages with limited amounts of parallel text poses new challenges. We present a method based on projecting word embeddings learned from a monolingual corpus in a high-resource language to the target low-resource language through whatever parallel text is available.
The goal of participants in the WMT 2019 parallel corpus filtering shared task is to select the 5 million words of parallel sentences producing the highest-quality machine translation system, given a set of automatically crawled sentence candidates of varying quality. It is the continuation of last year's task (Koehn et al., 2018), except that this year two low-resource languages are used: Nepali and Sinhala.

Related Work
We refer readers to Koehn et al. (2018) for a more thorough review of the methods used in the WMT 2018 parallel corpus filtering shared task, and here review only a few studies of particular relevance to our model.
* Authors contributed equally.
The Zipporah model of Xu and Koehn (2017) is used as a (strong) baseline in this year's shared task. It aims to find sentence pairs with high adequacy, according to dictionaries generated from aligned corpora, and fluency modeled by n-gram language models. Zariņa et al. (2015) use existing parallel corpora to learn word alignments and identify parallel sentences on the assumption that non-parallel sentences have few or no word alignments. In preliminary experiments we also evaluated a variant of this method, but found the resulting machine translation system to produce worse results than the simple approach described below.
Similarly to our model, Bouamor and Sajjad (2018) perform parallel sentence mining through sentence representations obtained by averaging bilingual word embeddings. Based on the cosine similarity, they create a candidate translation list for each sentence on the source side. Then, finding the correct translation is modelled as either a machine translation or a binary classification task.

Data
In this section, we summarize the target noisy data and the allowed third-party resources on which we train our model.

Target Noisy Corpora
The target noisy parallel corpora provided by the WMT 2019 organizers come from the Paracrawl project, and are released before the standard filtering step to ensure high-recall, low-precision retrieval of parallel sentences.
The noisy corpora have 40.6 million words on the English side for English-Nepali and 59.6 million words for English-Sinhala. The task is thus to select the approximately 10% highest-quality parallel text.

Training Data
Participants are allowed to use only the resources provided by the organizers to train their systems. The permissible resources include supposedly clean parallel data, consisting of Bible translations, Ubuntu localization files, and movie subtitles. Larger monolingual corpora based on Wikipedia and Common Crawl data were also provided. To train our model, we use all the parallel data available for the English-Sinhala and English-Nepali pairs (summarized in Table 1) and the English Wikipedia dump, which contains about 2 billion words. We modified the Nepali-English dictionary so that multiple translations were split into separate lines. As manual inspection revealed some problems in this data as well, we ran the same pre-filtering pipeline on it as we used for the noisy evaluation data (see Section 4.1).

Method
In this section, we present the components of our model used to score the noisy parallel data.

Pre-filtering Methods
As many types of poor sentence pairs are easy to detect with simple heuristics, we begin by applying a series of pre-filters. Before pre-filtering, the corpus is normalized through punctuation removal and lowercasing. We pre-filter all parallel data, both the (supposedly) clean and the noisy evaluation sets, using a set of heuristics based heavily on the work of Pinnis (2018):
• Empty sentence filter: Remove pairs where either sentence is empty after normalization.
• Numeral filter: Remove pairs where either sentence contains 25% or more numerals.
• Sentence length filter: Remove pairs where sentence lengths differ by 15 or more words.
• Foreign writing filter: Remove pairs where either sentence contains 10% or more words written in the wrong writing system.
• Long string filter: Remove pairs containing any token longer than 30 characters.
• Word length filter: Remove pairs where either sentence has an average word length of less than 2.
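The filters above can be sketched as a single predicate. This is a simplified stand-in, not the authors' code: tokenization is plain whitespace splitting, and the foreign writing filter is omitted because proper script detection (Sinhala/Devanagari vs. Latin Unicode blocks) is beyond a short sketch.

```python
def numeral_ratio(tokens):
    """Fraction of tokens that are purely numeric."""
    return sum(t.isdigit() for t in tokens) / max(len(tokens), 1)

def keep_pair(src, tgt):
    """Apply the pre-filters to a normalized sentence pair.

    Thresholds follow the paper's list; the foreign writing filter
    is omitted here for brevity.
    """
    s, t = src.split(), tgt.split()
    if not s or not t:                                        # empty sentence filter
        return False
    if numeral_ratio(s) >= 0.25 or numeral_ratio(t) >= 0.25:  # numeral filter
        return False
    if abs(len(s) - len(t)) >= 15:                            # sentence length filter
        return False
    if any(len(tok) > 30 for tok in s + t):                   # long string filter
        return False
    if (sum(len(tok) for tok in s) / len(s) < 2 or
            sum(len(tok) for tok in t) / len(t) < 2):         # word length filter
        return False
    return True
```

A pair passes only if it survives every filter, so the order of the checks does not affect the result.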
The statistics of each individual filter on the training data and the noisy data are provided in Table 2 and Table 3. In total, the pre-filtering step removed 2,790,557 pairs for the English-Sinhala data and 1,778,339 pairs for English-Nepali. Of all filters, the foreign writing and numeral filters seem to be the most useful in terms of removing poor data.
Although almost 150 thousand sentence pairs are filtered out of the training data, the rate is considerably lower than that of the raw noisy data, suggesting that our pre-filters have a low rate of false positives. We further tested our pre-filters on the development data for the MT system evaluation (discarding the result), and found that less than 3% is removed.

Multilingual word vectors
We first train 300-dimensional FASTTEXT vectors (Bojanowski et al., 2017) with its default parameters using the provided English Wikipedia data.
Our first goal is now to create word vectors for the low-resource languages Sinhala and Nepali, in the same space as the English vectors.
After pre-filtering, we perform word alignment of the provided parallel text using the EFLOMAL tool (Östling and Tiedemann, 2016) with default parameters. Alignment is performed in both directions, and the intersection of the two alignments is used. The vector $v^f_i$ for word type $i$ in the non-English language $f$ is computed as

$$v^f_i = \frac{\sum_j c(i, j)\, v^e_j}{\sum_j c(i, j)}$$

that is, the weighted sum of the vectors $v^e_j$ of all English word types $j$ which have been aligned to the non-English type $i$ with frequency $c(i, j)$. Word types which are aligned less than 20% as often as the most commonly aligned type are not counted, to compensate for potentially noisy word alignments. In other words, we let $c(i, j) = 0$ if the actual count is less than $0.2 \max_{j'} c(i, j')$. On average, the vector of each Sinhala word type is projected from 1.66 English word types, and each Nepali word type from 1.83 English word types.
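The projection of a single foreign word type can be sketched as follows. This is a simplified reimplementation under stated assumptions: `align_counts` maps each aligned English type to its count $c(i, j)$ (in the paper these counts come from the intersected EFLOMAL alignments), and `eng_vecs` maps English types to their fastText vectors.

```python
import numpy as np

def project_vector(align_counts, eng_vecs, threshold=0.2):
    """Weighted average of the English vectors aligned to one foreign type.

    align_counts: dict, English word -> alignment count c(i, j)
    eng_vecs:     dict, English word -> embedding (np.ndarray)
    Counts below threshold * max count are zeroed to suppress noisy links.
    """
    cmax = max(align_counts.values())
    num, denom = 0.0, 0.0
    for word, count in align_counts.items():
        if count < threshold * cmax:   # drop rarely aligned types
            continue
        num = num + count * eng_vecs[word]
        denom += count
    return num / denom
```

Applied over the whole vocabulary, this yields foreign-language vectors living in the same space as the English ones, which is what the sentence similarity step below relies on.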

Sentence similarity
Given a sentence pair $x$ and $y$, our task is to assign a score of translation equivalence. The multilingual word vectors learned in Section 4.2 provide a measure of word-level translational equivalence, by using the cosine similarity between the vectors of two words. Since sentence-level equivalence correlates strongly with word-level equivalence, we can approximate the former by looking at pairwise cosine similarities between the words in the sentence pair: $\cos(v^e_i, v^f_j)$. A good translation should tend to have a high value of $\max_j \cos(v^e_i, v^f_j)$, since most English words $w^e_i$ (with vector $v^e_i$) should have a translationally equivalent word $w^f_j$ (with vector $v^f_j$) in the other language, and these vectors should be similar.
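The naive word-level score can be sketched as below (an illustrative reimplementation, assuming the word vectors are L2-normalized so that dot products equal cosine similarities):

```python
import numpy as np

def naive_score(eng_vecs, frn_vecs):
    """Mean over English words of their best cosine match in the
    foreign sentence; vectors are assumed unit-normalized."""
    E = np.stack(eng_vecs)          # (n_eng, dim)
    F = np.stack(frn_vecs)          # (n_frn, dim)
    sims = E @ F.T                  # pairwise cosine similarities
    return sims.max(axis=1).mean()  # mean of max_j cos(v_e_i, v_f_j)
```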
However, this naive approach suffers from the so-called hubness problem in high-dimensional spaces (Radovanović et al., 2010), where some words tend to have high similarity to a large number of other words. This can be compensated for by taking the distribution of vector similarities for each word into account (as done in similar contexts by e.g. Conneau et al., 2017; Artetxe and Schwenk, 2018). We use this information in two ways. First, all words which have an average cosine similarity higher than 0.6 to the words in the English sentence are removed, since they are unlikely to be informative. We then use as our score the ratio between the highest and the second highest similarity within the sentence, averaged over all remaining words in the sentence.

Table 4: BLEU scores of the NMT system trained on the released development sets. Numbers in parentheses refer to the baseline scores.

          1 Million    5 Million
Sinhala   3.59 (4.65)  0.53 (3.74)
Nepali    4.55 (5.23)  1.21 (1.85)

Table 5: Word and sentence counts in the 1 million and 5 million sub-samples according to our model. Numbers in parentheses refer to the counts of the baseline system (Xu and Koehn, 2017), which are available only for the 5 million sub-sample.
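The hubness-corrected score described above can be sketched as follows. This is one plausible reading of the method, not the authors' implementation: here the 0.6 average-similarity cutoff is applied per English word, and the ratio is best over second-best similarity within the foreign sentence.

```python
import numpy as np

def ratio_score(eng_vecs, frn_vecs, hub_threshold=0.6):
    """Ratio of best to second-best cosine similarity, averaged over
    non-hub English words; vectors are assumed unit-normalized."""
    sims = np.stack(eng_vecs) @ np.stack(frn_vecs).T
    ratios = []
    for row in sims:
        if row.mean() > hub_threshold:   # likely a hub word: skip it
            continue
        top2 = np.sort(row)[-2:]         # [second-best, best]
        if top2[0] > 0:
            ratios.append(top2[1] / top2[0])
    return float(np.mean(ratios)) if ratios else 0.0
```

A score well above 1 means the best match clearly stands out from the runner-up, which is the intended signal of a genuine translation pair.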

Results
The quality of the sub-sampled data is assessed according to the BLEU scores of the statistical and neural machine translation systems trained on them.
Here, we present BLEU scores, on the released development sets, of the NMT system (Guzmán et al., 2019) that will be used in the official evaluation. We evaluate our model via two different sub-samples, one with 1 million and one with 5 million words on the English side. See Table 5 for statistics on the filtered data. Table 4 presents our results using the NMT system. For Nepali, the performance of our model approaches the strong baseline on both the 1 million and 5 million sub-samples, whereas the NMT system fails completely on the 5 million word Sinhala sub-sample. All BLEU scores are below 6, for our system as well as for the baseline, indicating that there is insufficient data for the NMT system to learn a useful translation model.

Conclusion
We have described our submission to the WMT 2019 parallel corpus filtering shared task. Our submission explored the use of multilingual word embeddings for the task of parallel corpus filtering. The embeddings were projected from a high-resource language to a low-resource language lacking sufficiently large monolingual corpora, making the approach suitable for a wide range of languages.