ParaCrawl: Web-Scale Acquisition of Parallel Corpora

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for creating machine translation systems.


Introduction
Parallel corpora are essential for building high-quality machine translation systems and have found uses in many other natural language applications, such as learning paraphrases (Bannard and Callison-Burch, 2005; Hu et al., 2019) or cross-lingual projection of language tools (Yarowsky et al., 2001).
We report on work to create the largest publicly available parallel corpora by crawling hundreds of thousands of web sites, using open source tools. The processing pipeline consists of the following steps: crawling, text extraction, document alignment, sentence alignment, and sentence pair filtering. We describe these steps in detail in Sections 4-8. For some of these steps we evaluate several methods empirically in terms of their impact on machine translation quality. We provide the data resources used in these evaluations as benchmarks for future research.
As part of this effort, several open source components have been developed. These are integrated into the open-source tool Bitextor,1 a highly modular pipeline that allows harvesting parallel corpora from multilingual websites or from pre-existing or historical web crawls such as the one available as part of the Internet Archive.2 The execution of the pipeline has focused on official European Union languages, but has also targeted Russian, Sinhala, Nepali, Tagalog, Swahili, and Somali. We show that the obtained parallel corpora improve state-of-the-art results on common benchmarks, such as the WMT Shared Task on News Translation.

Related Work
While the idea of mining the web for parallel data was already pursued in the 20th century (Resnik, 1999), the most serious efforts have been limited to large companies such as Google (Uszkoreit et al., 2010) and Microsoft (Rarrick et al., 2011), or to targeted efforts on specific domains such as the Canadian Hansards and Europarl (Koehn, 2005). The book Bitext Alignment (Tiedemann, 2011) describes some of the challenges in greater detail.

Acquisition Efforts
Most publicly available parallel corpora are the result of targeted efforts to extract the translations from a specific source. The French-English Canadian Hansards 3 were used in the earliest work on statistical machine translation. A similar popular corpus is Europarl (Koehn, 2005), used throughout the WMT evaluation campaign.
Multi-lingual web sites are attractive targets. Rafalovitch and Dale (2009) and Ziemski et al. (2015) extract data from the United Nations, Täger (2011) from European patents, and Lison and Tiedemann (2016) from a collection of TV and movie subtitles. Cettolo et al. (2012) describe the creation of a multilingual parallel corpus of subtitles from the TED Talks website, which is popular due to its use in the IWSLT evaluation campaign.
There are also various efforts targeted at a single language pair. Martin et al. (2003) build a parallel corpus for Inuktitut-English. Utiyama and Isahara (2003) and Fukushima et al. (2006) worked on creating Japanese-English corpora. Uchiyama and Isahara (2007) report on efforts to build a Japanese-English patent corpus, and Macken et al. (2007) on efforts to build a broad-based Dutch-English corpus. Li and Liu (2008) mine the web for a Chinese-English corpus. A large Czech-English corpus from various sources was collected (Bojar et al., 2010), linguistically annotated (Bojar et al., 2012), and has been continuously extended to over 300 million words (Bojar et al., 2016).
All these efforts rely on methods and implementations that are quite specific to each use case, not documented in great detail, and not publicly available. A discussion of the pitfalls during the construction of parallel corpora is given by Kaalep and Veskis (2007). A large collection of corpora is maintained at the OPUS web site (http://opus.lingfil.uu.se/; Tiedemann, 2012).

Document Alignment
Document alignment can be defined as a matching task that takes a pair of documents and computes a score reflecting the likelihood that they are translations of each other. The task is typically limited to a single web domain (all web pages from www.aaa.com and aaa.com, possibly aaa.de, but not bbb.com) for efficiency.
Matching may take the HTML structure into account, or rely purely on the textual content. Examples of structural matching are the use of edit distance between linearized documents (Resnik and Smith, 2003) and the alignment probability of a probabilistic DOM-tree alignment model (Shi et al., 2006). The URL is a very powerful matching indicator for some domains, typically exploited with a predefined set of patterns for language marking or with simple Levenshtein distance (Le et al., 2016).
Content matching requires crossing the language barrier at some point, typically by using bilingual dictionaries or by translating one of the documents into the other document's language (Uszkoreit et al., 2010). Documents may be represented by vectors over word frequencies, typically tf-idf-weighted. Vectors may also be constructed over bigrams (Dara and Lin, 2016) or even higher-order n-grams (Uszkoreit et al., 2010). The vectors are then typically matched with cosine similarity (Buck and Koehn, 2016a). The raw vectors may be recentered around the mean vector for a web domain (Germann, 2016).

Document alignment quality can be improved with additional features such as the ratio of shared links, the similarity of link URLs, the ratio of shared images, a binary feature indicating whether the documents are linked, DOM structure similarity (Esplà-Gomis et al., 2016), matching numbers (Papavassiliou et al., 2016), or matching named entities (Lohar et al., 2016). Guo et al. (2019) introduce the use of document embeddings, constructed from sentence embeddings, for the document alignment task.
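As a minimal sketch of this content-matching approach (assuming the foreign documents have already been machine-translated into English; the documents below are toy examples), tf-idf vectors can be built over a shared vocabulary and paired by cosine similarity:

# Sketch of content-based document matching: translated documents are
# represented as tf-idf vectors and paired by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

translated_docs = ["the company sells shoes", "contact us by email"]
english_docs = ["we sell shoes and boots", "contact us via e-mail", "privacy policy"]

# Fit a shared vocabulary so both sides live in the same vector space.
vectorizer = TfidfVectorizer().fit(translated_docs + english_docs)
src = vectorizer.transform(translated_docs)
tgt = vectorizer.transform(english_docs)

# similarity[i, j] scores translated document i against English document j.
similarity = cosine_similarity(src, tgt)
for i, row in enumerate(similarity):
    print(f"doc {i} best match: English doc {row.argmax()} (score {row.max():.2f})")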

Sentence Alignment
Early sentence aligners (Brown et al., 1991; Gale and Church, 1993) use scoring functions based only on the number of words or characters in each sentence, and alignment algorithms based on dynamic programming. Europarl, for example, used metadata to align paragraphs, typically consisting of 2-5 sentences, and Gale and Church (1993)'s method to align sentences within corresponding paragraphs. Later work added lexical features and heuristics to speed up search, such as limiting the search space to be near the diagonal (Moore, 2002; Varga et al., 2005).
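The core of these length-based methods can be sketched in a few lines of Python. The following is a simplified illustration rather than Gale and Church's exact cost function: it uses a plain log-length-ratio cost with a fixed insertion/deletion penalty, and searches over 1-1, 1-0, 0-1, 2-1, and 1-2 alignment beads with dynamic programming.

import math

def length_cost(src_len, tgt_len):
    # Simplified stand-in for Gale & Church's Gaussian length-ratio cost.
    if src_len == 0 or tgt_len == 0:
        return 5.0                      # fixed penalty for insertions/deletions
    return abs(math.log(src_len / tgt_len))

def align(src_lens, tgt_lens):
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in moves:
                if i >= di and j >= dj and cost[i - di][j - dj] < INF:
                    c = cost[i - di][j - dj] + length_cost(
                        sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]))
                    if c < cost[i][j]:
                        cost[i][j], back[i][j] = c, (di, dj)
    # Trace back the best path of alignment beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]

print(align([20, 21, 8], [19, 22, 9]))  # -> three 1-1 beads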
More recent work introduced scoring methods that use MT to get both documents into the same language (Sennrich and Volk, 2010) or that use pruned phrase tables from a statistical MT system (Gomes and Lopes, 2016). Both methods "anchor" high-probability 1-1 alignments in the search space and then fill in and refine alignments. Sennrich and Volk (2011) later propose an extension in which an SMT system is bootstrapped from an initial alignment and then used in Bleualign.
Vecalign (Thompson and Koehn, 2019) is a sentence alignment method that relies on bilingual sentence embeddings and achieves linear run time with a coarse-to-fine dynamic programming algorithm.

Sentence Pair Filtering
Parallel corpora that have been crawled from unverified web sites and processed by error-prone extraction and alignment methods are likely to contain noise, such as random text fragments, text in the wrong language, translations produced by machine translation tools or bad translators, and misaligned sentence pairs. Such noise is especially harmful for neural machine translation (Khayrallah and Koehn, 2018), so filtering it out is an essential processing step.
There is a robust body of work on filtering out noise in parallel data, but the topic has recently gained a lot of momentum, partly due to the lack of robustness of neural models, and fostered by recent shared tasks on parallel corpus filtering under high-resource (Koehn et al., 2018) and low-resource data conditions.
Most participants in these shared tasks used three components: pre-filtering rules, scoring functions for sentence pairs, and a classifier that learned weights for feature functions.
Pre-filtering rules. Some of the training data can be discarded based on simple deterministic filtering rules. This may remove over 80% of the data (Kurfalı and Östling, 2019; Soares and Costa-jussà, 2019). Such rules remove sentences that are too short or too long; sentences that have too few words (tokens with letters instead of just special characters), either in absolute terms or relative to the total number of tokens; sentences whose average token length is too short or too long; sentence pairs with mismatched lengths in terms of number of tokens; sentence pairs where names, numbers, dates, email addresses, or URLs do not match between both sides; sentence pairs that are too similar, indicating simple copying instead of translating; and sentences for which a language identifier does not detect the required language.
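A minimal sketch of such rules in Python follows; the thresholds are illustrative rather than the ones used in any ParaCrawl release, and detect_lang stands in for an external language identifier (e.g. a fastText model) that we assume is available.

import re

MAX_TOKENS = 100
MAX_LEN_RATIO = 2.0

def keep_pair(src, tgt, detect_lang, src_lang="de", tgt_lang="en"):
    src_toks, tgt_toks = src.split(), tgt.split()
    if not (1 <= len(src_toks) <= MAX_TOKENS and 1 <= len(tgt_toks) <= MAX_TOKENS):
        return False                                  # too short or too long
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > MAX_LEN_RATIO or ratio < 1 / MAX_LEN_RATIO:
        return False                                  # mismatched pair lengths
    if sum(any(c.isalpha() for c in t) for t in src_toks) < len(src_toks) / 2:
        return False                                  # too few tokens with letters
    if re.findall(r"\d+", src) != re.findall(r"\d+", tgt):
        return False                                  # numbers do not match
    if src.lower() == tgt.lower():
        return False                                  # copy rather than translation
    return detect_lang(src) == src_lang and detect_lang(tgt) == tgt_lang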
Scoring functions. Sentence pairs that pass the pre-filtering stage are assessed with scoring functions which provide scores that hopefully correlate with the quality of sentence pairs. Participants used a variety of such scoring functions, including n-gram or neural language models on clean data (Rossenbach et al., 2018), language models trained on the provided raw data as contrast, neural translation models (Junczys-Dowmunt, 2018), bag-of-words lexical translation probabilities (González-Rubio, 2019), or even existing off-the-shelf tools like Zipporah and Bicleaner.
Learning weights for scoring functions. Given a large number of scoring functions, simply averaging their resulting scores may be inadequate. Learning weights to optimize machine translation system quality is computationally intractable due to the high cost of training these systems to evaluate different weight settings. A few participants instead used a classifier that learns how to distinguish between good and bad sentence pairs (where bad sentence pairs are either synthesized by scrambling good sentence pairs or selected from the raw crawled data).
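The following sketch illustrates this classifier-based approach under simple assumptions: score_functions is a list of feature functions assumed to exist upstream, and negative examples are synthesized by shuffling the target side of clean pairs.

import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(pairs, score_functions):
    # One row of feature-function scores per sentence pair.
    return np.array([[f(s, t) for f in score_functions] for s, t in pairs])

def train_filter(clean_pairs, score_functions, seed=0):
    rng = random.Random(seed)
    tgts = [t for _, t in clean_pairs]
    rng.shuffle(tgts)  # mismatched targets act as synthetic "bad" pairs
    bad_pairs = [(s, t) for (s, _), t in zip(clean_pairs, tgts)]
    X = featurize(clean_pairs + bad_pairs, score_functions)
    y = np.array([1] * len(clean_pairs) + [0] * len(bad_pairs))
    return LogisticRegression().fit(X, y)

# model.predict_proba(featurize(candidates, score_functions))[:, 1]
# then serves as a single quality score per candidate pair.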
A novel method that was central to the best-performing submission in WMT 2019 was the use of cross-lingual sentence embeddings that were directly trained from parallel sentence pairs. Other submissions used monolingual word embeddings (Soares and Costa-jussà, 2019; Kurfalı and Östling, 2019; Bernier-Colborne and Lo, 2019).
Another approach is to first train a translation system on the clean data, then use it to translate the non-English side into English, and use monolingual matching methods to compare the result against the English side of the parallel corpus. Different matching metrics were used: METEOR (Erdmann and Gwinnup, 2019), Levenshtein distance (Sen et al., 2019), or BLEU (Parcheta et al., 2019).

As Rarrick et al. (2011) point out, one type of noise in parallel corpora extracted from the web is translations that have been created by machine translation. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction, with a negligible loss of quality. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and statistical machine translation output due to its lack of reordering. Rarrick et al. (2011) train a classifier to learn the distinction and show that removing such data leads to better translation quality.
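As an illustration of the round-trip matching idea above, the following sketch scores a candidate pair with sentence-level BLEU using sacrebleu; translate is a stand-in for an MT system trained on clean data, which we assume is available.

from sacrebleu.metrics import BLEU

# Effective n-gram order avoids zero scores on short sentences.
bleu = BLEU(effective_order=True)

def roundtrip_score(foreign_sentence, english_sentence, translate):
    hypothesis = translate(foreign_sentence)      # MT into English
    # Low scores indicate likely misalignments or noise.
    return bleu.sentence_score(hypothesis, [english_sentence]).score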

Comparable Corpus Mining
Our work exploits web sites that provide roughly the same content in multiple languages, leading to the assumption that we can find pairs of web pages which are translations of each other, with translated sentences following the same order. This assumption does not hold for less consistently translated web content such as Wikipedia, or for the accidental parallel sentences found in news stories about the same subject matter written in multiple languages.
There have been increasing efforts to mine sentence pairs from large pools of multi-lingual text, which are treated as unstructured bags of sentences. Munteanu and Marcu (2005) use document retrieval and a maximum entropy classifier to identify parallel sentence pairs in a multi-lingual collection of news stories.
Bilingual sentence embeddings (Guo et al., 2018) and multilingual sentence embeddings (Artetxe and Schwenk, 2018) were tested on their ability to reconstruct parallel corpora. This led to work constructing WikiMatrix, a large corpus of parallel sentences from Wikipedia, based on the cosine distance of their cross-lingual sentence embeddings.
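A minimal sketch of this kind of embedding-based mining follows, assuming the sentence embeddings (e.g. from LASER) are precomputed and L2-normalized; WikiMatrix uses a margin-based variant of this plain cosine threshold.

import numpy as np

def mine(src_emb, tgt_emb, threshold=0.8):
    # For unit-length vectors the dot product is the cosine similarity.
    sim = src_emb @ tgt_emb.T
    best = sim.argmax(axis=1)                 # best target for each source
    return [(i, int(j), float(sim[i, j]))
            for i, j in enumerate(best) if sim[i, j] >= threshold]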

Identifying Multi-Lingual Web Sites
Since the start of the collection effort in 2015, we identified potential web sites to crawl in various ways, but mainly by exploiting statistics from CommonCrawl. By splitting this large collection of crawled web pages by web domain and running text extraction and language identification (Buck et al., 2014), we can extract statistics on how much content in each language exists on each domain. Web domains with sufficient content in a targeted language and English are selected for crawling.
The thresholds of what constitutes sufficient content varied depending on language. Typically, we require minimum amounts of content in the targeted language and English (measured in bytes of text), and consider the ratio between the two. For instance, we identified 19,616 web domains with at least 100KB of content in German and English (max ratio 10), but only 438 web domains with at least 20KB of content in Maltese and English (max ratio 10).
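The selection rule can be sketched as follows; the per-domain statistics are assumed to have been computed from CommonCrawl as described above, and the default thresholds shown are the illustrative German ones.

def select_domains(stats, lang, min_bytes=100_000, max_ratio=10.0):
    """stats maps domain -> {language: bytes of extracted text}."""
    selected = []
    for domain, by_lang in stats.items():
        lang_bytes, en_bytes = by_lang.get(lang, 0), by_lang.get("en", 0)
        if min(lang_bytes, en_bytes) < min_bytes:
            continue        # not enough content in one of the two languages
        if max(lang_bytes, en_bytes) > max_ratio * min(lang_bytes, en_bytes):
            continue        # too imbalanced to be a consistently translated site
        selected.append(domain)
    return selected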
It is worth noting that by targeted crawling of web sites we are able to collect many more web pages than are present in CommonCrawl. In an exploratory study, only 5% of a collection of web pages with useful content were found in CommonCrawl. This may have improved with recent more extensive crawls by CommonCrawl, but there is still a strong argument for targeted crawling.

Crawling
Crawling is the initial step of the pipeline. It entails downloading documents from a number of websites and looking for any documents that contain text. These documents are stored as single-domain or multi-domain Web ARChive (WARC) files. WARC is an archiving format for crawled data originally proposed by the Internet Archive. Four different crawling tools are currently supported in Bitextor:

HTTrack 5 A well-known multi-platform crawling tool. It has been supported in Bitextor for a long time, even though it is now deprecated as support for the tool has been discontinued.

Heritrix 6 The Internet Archive's web crawler; it is fully compatible with the WARC format and supports a variety of options that make it one of the most suitable options for large-scale data crawling.

Creepy 7 A Python library with basic resources for crawling. A crawler has been implemented on top of it, and is currently experimental.

Wget One of the most popular tools for retrieving files over HTTP and HTTPS on Unix systems. It is fully compatible with the WARC format.
Most of our crawling in ParaCrawl has been done using HTTrack. To deal with the I/O-intensive process of writing many small files at high frequency, data is first stored on local SSD drives and then transferred to a network file system for subsequent processing.

Text Extraction
After crawling, all documents are pre-processed to extract and normalize the text and identify their language. The resulting cleaned and sorted text is the input for the subsequent steps of document and segment alignment (see Sections 6 and 7).
Conversion to HTML WARC files contain one web-crawled document per record. The documents can be in a variety of formats that contain text: plain text, HTML, Open Document Format 8 (".odt"), Office Open XML 9 (".docx") or PDF files containing text. With the exception of the small number of documents that are already in plain text format, the bitextor-warc2htmlwarc.py module converts any of these formats to HTML (see fig. 1) and produces WARC files containing only HTML or plain text documents.
Text extraction from HTML Given WARC files containing HTML, we extract the text content. We preserve sentence breaks indicated by HTML tags such as <p> or <br> (paragraph and line break), but remove formatting tags such as <b> (for bold text) without a trace.
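A minimal sketch of such break-preserving extraction, using Python's standard html.parser; a production pipeline additionally needs to handle scripts, styles, and character encodings.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Block-level tags become sentence/line breaks; inline formatting
    # tags such as <b> are dropped without a trace.
    BREAK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        lines = "".join(self.chunks).split("\n")
        return "\n".join(l.strip() for l in lines if l.strip())

parser = TextExtractor()
parser.feed("<p>Hello <b>world</b></p><p>Second paragraph</p>")
print(parser.text())   # -> "Hello world" and "Second paragraph" on two lines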

Document Alignment
There are two main workflows for document alignment.
Using bilingual lexica The traditional workflow in Bitextor, up to version 5, used bilingual lexica. The module bitextor-buildidx.py builds indexes of documents, listing, for each word in the lexicon of each language, the documents containing it. Then bitextor-idx2ridx uses the bilingual lexica to translate these words and builds reverse indexes in which each document is paired with a list of documents in the other language, together with bag-of-words-based overlap scores. A series of modules (bitextor-urlscomparison.py, bitextor-urlsetoverlap.py, bitextor-imagestooverlap.py, etc.) compute additional features for each language direction, based on mutual linking and on the comparison of document URLs, the sets of outgoing URLs, HTML structure, and image content; these features are integrated by bitextor-rank.py into two new reverse-index files with new scores, which are used to obtain the final document alignment.
Using machine translation This workflow uses machine translation to decide whether two documents have to be aligned, and is the one that has been used for the parallel data releases of the project (Buck and Koehn, 2016b).
After extract-lett.py extracts the plain-text documents in each language, a machine translation system translates each document from language A to language B. We then generate a (sparse) matrix of tf-idf scores between the machine-translated versions of the documents in language A and the documents in language B. These scores are used by compute_matches.py to compute a list of document pairs (score, source URL, target URL). Document pairs are stored in a file in which each line contains the URLs of both documents and their plain-text content encoded in base64.
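A sketch of how such a score matrix can be turned into document pairs: scores are visited from highest to lowest and each document is used at most once, yielding a greedy one-to-one matching (compute_matches.py follows the same general idea; a dense numpy matrix is used here for simplicity).

import numpy as np

def greedy_pairs(scores, src_urls, tgt_urls, min_score=0.1):
    pairs, used_src, used_tgt = [], set(), set()
    # Visit all (i, j) cells in order of decreasing score.
    order = np.dstack(np.unravel_index(
        np.argsort(-scores, axis=None), scores.shape))[0]
    for i, j in order:
        if scores[i, j] < min_score:
            break                         # remaining scores are too low
        if i not in used_src and j not in used_tgt:
            pairs.append((float(scores[i, j]), src_urls[i], tgt_urls[j]))
            used_src.add(i)
            used_tgt.add(j)
    return pairs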

Sentence Alignment
During the ParaCrawl project, we made use of several sentence alignment tools. In this paper, we compare their performance on five language pairs. The sentence aligners are the following:

Hunalign (Varga et al., 2005) is a widely used tool that relies on a bilingual dictionary, which we generated from the Europarl corpus or other available parallel corpora.
Bleualign (Sennrich and Volk, 2010) aligns an English translation of the foreign sentences with the English sentences based on their similarity, as measured by a variant of the BLEU score. We implemented a faster version of Bleualign in C++.
Vecalign (Thompson and Koehn, 2019) is a new sentence aligner based on sentence embeddings, using an efficient coarse-to-fine algorithm with linear run time. We used pre-trained LASER embeddings 10 which cover all the languages of ParaCrawl, except for Irish.
We compared the quality of the sentence pairs extracted from document pairs for these tools. To our knowledge, this is the first evaluation of sentence aligners on large-scale, real-world, web-crawled data. We selected five languages, ranging from low-resource (Maltese) through mid-resource (Estonian, Hungarian) to high-resource (Czech, German). We selected a subset of web domains; for details, see Table 1.
The data is provided as document pairs from the usual upstream ParaCrawl processing. The text of the web pages needs to be further split into sentences and then aligned using the different sentence aligners. The resulting sentence pairs are deduplicated and assessed for quality using Bicleaner (more on sentence pair filtering in the next section).
Since different sentence aligners generate different amounts of data (for instance, Bleualign filters quite aggressively for noise), we selected differently sized subsets of the data for evaluation by selecting the best sentence pairs according to Bicleaner quality scores. We built neural machine translation models on these subsets using Fairseq and evaluated them on test sets drawn from the WMT news translation task (newstest2018 for German, Czech, and Estonian; newstest2009 for Hungarian) and the EU Bookshop 11 corpus (Maltese). See Table 2 for the BLEU scores and corpus sizes of the best-performing subset for each sentence aligner and language. Vecalign gives the best results for four of the languages and is slightly behind Hunalign for Estonian.
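A sketch of this score-based subset selection; scored_pairs is assumed to hold (score, source, target) tuples after deduplication, and the subset size is measured in English words.

def best_subset(scored_pairs, max_english_words):
    subset, words = [], 0
    # Highest-scoring pairs first, until the word budget is exhausted.
    for score, src, tgt in sorted(scored_pairs, reverse=True):
        words += len(tgt.split())
        if words > max_english_words:
            break
        subset.append((src, tgt))
    return subset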
We published the document pairs to be aligned, as well as the testing environment 12 to promote the evaluation of novel sentence alignment methods.

Sentence Pair Filtering
Our processing pipeline aims for high recall at the cost of precision, thus creating large but very noisy corpora. As a last processing step, we therefore aim to filter out sentence pairs that are not useful as training data for machine translation or any other purpose. This is especially important since training on noisy corpora is a challenge for neural machine translation, which motivated the organization of two shared tasks in 2018 and 2019, on the high-resource language pair German-English and on the low-resource languages Sinhala and Nepali, respectively. Here, we extend this evaluation to European languages with medium-sized resources.
Building on the data sets generated by the sentence alignment evaluation of the previous section, we compared three sentence pair filtering methods used in the ParaCrawl effort: Zipporah (Xu and Koehn, 2017), Bicleaner (Sánchez-Cartagena et al., 2018), and LASER.
We carried out the evaluation (see Table 3) in the same fashion as in the previous section. Filtering by LASER scores gives the best results, except for Maltese (for which the publicly available LASER model has not been trained). Moreover, in almost all settings, Bicleaner achieves better results than Zipporah.

Released Corpora
Overall, the ParaCrawl corpus release v5.0 contains a total of 223 million filtered,13 unique sentence pairs from around 150k website domains, across 23 EU languages paired with English (see Table 5). However, the data release is highly imbalanced, with 73% of the sentence pairs coming from just five languages: French, German, Spanish, Italian, and Portuguese. The average (untokenised) English sentence length (over all languages) is 22.9 words, with some notable anomalies. For example, for the low-resource Irish-English pair (27.6 words), over 50% of the sentence pairs originate from the legal domain, where sentences are longer than usual. Furthermore, we noticed that filtered sentences which had been aligned using Hunalign were significantly longer than those aligned by Bleualign (26.1 and 20.1 words respectively), although we are unsure of the exact reason for this discrepancy.
Our main motivation for creating the ParaCrawl corpus is to improve the quality of machine translation systems. To test this, we trained neural machine translation models in which we added the corpus to the existing data sets for language pairs that were tackled in the shared task on news translation at the Conference on Machine Translation (WMT), which we consider a strong baseline. We trained Transformer-Base models with Marian, using SentencePiece. See Table 4 for results. For most language pairs, we see gains of several BLEU points (up to 6 BLEU points for English-Romanian). We even see gains for English-Czech, where ParaCrawl is quite a bit smaller than the existing data sets (+0.7 BLEU when adding 5.3m sentence pairs to the existing set of 52m sentence pairs).

13 Sentence pairs with a Bicleaner score of less than 0.7 were discarded, but remain in the RAW release.

Computational Costs Concerns
Several of the steps involved in producing and evaluating the ParaCrawl corpora are computationally expensive. While some of the steps are embarrassingly parallel and amenable to processing in a high-performance computing setting, even the pre-processing of 100TB of source data to produce candidate documents consumes on the order of 50,000 CPU-hours, equivalent to an estimated 720kWh of power. Training a neural network model for translating one of the more resource-rich languages such as German may take a week on a dozen GPUs, again consuming about 750kWh. Translating 500 million German sentences into English for evaluation consumed roughly 7MWh. In practice, these computations are not simply performed once; they are performed many times as parameters are changed and different strategies are tried.
This energy cost is significant. The Typical Domestic Consumption Values published by Ofgem,16 the UK energy regulator, indicate that a high-consuming household with electric heating is expected to consume 7.1MWh/year. Does an increase of one or two BLEU points justify this cost? For ParaCrawl, we argue that yes, it does, because we are producing an enabling data set whose cost will, we hope, be amortised across many future experiments. But there is a more general point to be made here: it is not currently the practice in the machine translation community to publish figures about the cost involved in achieving an increase in performance as measured with the standard metrics. It is not straightforward to evaluate when or if we, as a community, have reached a point of diminishing returns, where small changes to a family of methods consume an ever-increasing amount of resources while yielding only marginal improvements. We therefore suggest adopting the practice of disclosing the energy use of machine translation experiments alongside BLEU scores, to make the cost-benefit trade-off explicit.

Table 5: Size of corpus release 5. The corpus is released in two versions: Raw is very noisy data before the sentence pair filtering step; Clean has been proven to be useful for training machine translation systems. We release the raw corpus to allow the use of other filtering methods, or different thresholds for quality cutoffs.

16 https://www.ofgem.gov.uk/electricity/retail-market/monitoring-data-and-statistics/typical-domestic-consumption-values

Conclusions
We released the largest publicly available parallel corpora for many language pairs and demonstrated their benefit for training machine translation systems. Going beyond providing data, the goals of this project include the creation of publicly available infrastructure for exploring new research directions in parallel corpus mining, by releasing open source code for the entire pipeline and public benchmarks for individual processing steps.
Each of the processing steps described here still has great potential for improvement, and we hope that our work contributes to the development of novel methods, both in terms of better processing of raw parallel data sources and in terms of increasing the robustness of neural machine translation training when faced with noisy data.
We are especially interested in further extending this work to low-resource languages, where resources tend to be noisier and the underlying models supporting data mining are less reliable.

Acknowledgement
This work has been supported in part by three projects funded by the