A Massive Collection of Cross-Lingual Web-Document Pairs

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs, such as English-German, or on limited comparable collections such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models that have been trained on parallel sentence pairs mined from this aligned corpus, and introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of low-, mid-, and high-resource languages.


Introduction
Document alignment is the task of pairing documents such that they are translations or near-translations of each other. A variety of tasks in natural language processing require or benefit from parallel cross-lingual data. Traditionally, machine translation approaches have leveraged parallel sentences as training data for sequence-to-sequence models. Other tasks include cross-lingual information retrieval and cross-lingual document classification. Additionally, cross-lingual data facilitates cross-lingual representations, such as in the work of Lample and Conneau (2019), which has direct applications to zero-shot NLP tasks and internationalization of models. The availability of high-quality datasets is necessary to both train and evaluate models across these many tasks.
While it is possible to manually identify and label aligned documents across languages, the process is costly and time-consuming due to the quadratic search space for document pairs. Additionally, for low-resource languages, identifying these cross-lingual document pairs is more difficult due to their relative scarcity. Furthermore, the lack of access to qualified human annotators makes additional quality control necessary in low-resource scenarios.
In this paper, we present a dataset consisting of pairs of translated documents, represented by URLs, extracted from a massive collection of web crawls. Our dataset is based on a simple yet powerful approach to automatically extract cross-lingual documents using high-precision, hand-crafted rules that leverage language specifications in URLs. These rules are coupled with a majority-voted language identification algorithm, which reduces the prevalence of false positives and ensures these web documents truly represent the languages they claim. Given this rule-based expert system, we mine a massive collection of 13 billion web documents and identify 54 million cross-lingual parallel documents in 92 language pairs.
We evaluate the quality of our automatic-annotation setup using two approaches: (1) by comparing it to a human-annotated ground-truth set, and (2) by leveraging the mined documents as training data for a downstream machine translation task.
Finally, we also introduce a simple baseline that effectively aligns cross-lingual document pairs using solely textual content, even in the presence of distractor documents that may not have any parallel counterpart. We hope that the size, diversity, and quality of this dataset spur its use not only as a benchmark for document alignment, but also as supervision for a variety of cross-lingual tasks.

Related Work
The concept of crawling and mining the web to identify sources of parallel data has been previously explored (Resnik, 1999). A large body of this work has focused on identifying parallel text from multilingual data obtained from a single source: for example, the United Nations General Assembly Resolutions (Rafalovitch et al.; Ziemski et al., 2016) or the European Parliament parallel corpus (Koehn, 2005). These parallel corpora were curated from specific, homogeneous sources by examining the content and deriving domain-specific rules for aligning documents.
Other approaches have identified parallel documents in unstructured web corpora by relying on metadata. Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents (Munteanu and Marcu, 2005, 2006; Udupa et al., 2009; Do et al., 2009; Abdul-Rauf and Schwenk, 2009). However, temporal features can be sparse, noisy, and unreliable. A different class of alignment methods relies on document structure (Resnik and Smith, 2003; Chen and Nie, 2000).
In the WMT-2016 bilingual document alignment shared task (Buck and Koehn, 2016a), many techniques applied retrieval and matching on translated 5-grams (Dara and Lin, 2016) to query, retrieve, and align documents. Similar methods that generate candidates by retrieving matches based on the least frequent bilingual 5-grams have also been proposed (Gomes and Lopes, 2016), with the insight that rare snippets are more informative. Both of these approaches rely on high-quality translation systems to translate either the source or the target. Such models may not exist, especially for low-resource language directions. The application of alignment to a variety of languages was not explored in WMT-2016, which only considered English-French document alignment, a high-resource direction.
Recently, the use of neural embedding methods has been explored for bilingual alignment of text at the sentence and document level. Guo et al. (2019) propose using hierarchical document embeddings, constructed from sentence embeddings, for bilingual document alignment.

Dataset Creation and Description
In this section, we describe the data preparation process detailing how the dataset was created as well as the statistics on the resultant dataset.

Common Crawl
The Common Crawl corpus is a publicly available crawl of the web. With a new snapshot uploaded each month, and over 2 billion pages released in each snapshot, this data is a vast resource with content across a large number of domains and languages. Previous works have leveraged Common Crawl data for mining n-gram counts to perform language modeling (Buck et al., 2014). Other works (Smith et al., 2013) have mined Common Crawl for bitexts for machine translation; however, this mining was performed on a small scale. For our dataset, we use the 12 snapshots published from January to December 2018, a vastly larger collection than those used in previous works.

Dataset Preparation
In this section, we describe the pipeline used to create the cross-lingual document pair dataset.

Preprocessing
Extracting the textual content of Common Crawl web documents is a relatively challenging task that involves removing all tables, pictures, hyperlinks, and formatting markup. As such, the first preprocessing step is to remove all HTML tags and boilerplate markup.
After content cleaning, the next step in preprocessing the data is deduplication. While investigating combining many Common Crawl snapshots, we found duplicate URLs both within an individual snapshot and, almost always, across snapshots. As our data curation method relies on unique URLs for each web document, we apply a heuristic to ensure each URL appears once within the final cleaned data. The first step is to normalize each URL; we do this by simply removing the protocol and the "www" host prefix (e.g., https://www.aaa.com → aaa.com). Upon normalization, for each URL that appears more than once, we select the instance that possesses the longest document content. This heuristic assumes that, occasionally, content is (1) deleted and gets shorter or (2) amended and gets longer; in this case, it is preferable to operate on the larger content. Starting from 12 Common Crawl snapshots with a raw count of 35.7 billion documents, deduplication yields approximately 13.3 billion web documents from 64.8 million distinct web domains, a 63% reduction from the raw corpus.
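The normalization-and-deduplication heuristic just described can be sketched as follows; the normalize_url helper and the (url, content) record format are illustrative assumptions, not the exact production implementation:

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Drop the protocol and a leading 'www.' host prefix
    (e.g. https://www.aaa.com/page -> aaa.com/page)."""
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parsed.path

def deduplicate(records):
    """records: iterable of (url, content) pairs.
    For each normalized URL, keep the instance with the longest
    content, assuming longer content is the amended, fuller version."""
    best = {}
    for url, content in records:
        key = normalize_url(url)
        if key not in best or len(content) > len(best[key][1]):
            best[key] = (url, content)
    return list(best.values())
```

Applied to duplicates of the same page across snapshots, this keeps exactly one record per normalized URL, preferring the longest content.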

Language Identification
The next step in the pipeline is to tag each document with its dominant language identifier. We utilize FastText (Joulin et al., 2017), a lightweight text classifier that has been trained to detect more than 170 languages. Because mixed-language content is common, and boilerplate (often at the beginning of a document) can add noise, language identification may incorrectly tag documents. Performing language identification on different segments of a document can mitigate this noise and correctly identify the dominant language. To address this, we ensemble by predicting the language of a number of contiguous subsets of the document content; majority voting is then used to tag the document with the predominant predicted language.
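A sketch of this majority-voting ensemble; the chunking scheme (chunk size and number of chunks) is an assumption, and predict stands in for a FastText-based classifier returning a language code:

```python
from collections import Counter

def majority_language(text, predict, chunk_size=1000, num_chunks=5):
    """Tag `text` with its dominant language by running a language
    identifier over several contiguous chunks and majority-voting.

    `predict` is any callable mapping a string to a language code
    (e.g. a thin wrapper around a FastText classifier)."""
    chunks = [text[i:i + chunk_size]
              for i in range(0, len(text), chunk_size)][:num_chunks]
    votes = Counter(predict(chunk) for chunk in chunks if chunk.strip())
    if not votes:
        return None
    # The most common predicted language wins the vote.
    return votes.most_common(1)[0][0]
```

A chunk of noisy boilerplate at the start of the document thus contributes only one vote and is outvoted by the dominant language of the body.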

Language ID URL Matching
To identify pairs of cross-lingual documents, we apply a high-precision, low-recall heuristic to assess whether two URLs represent web pages that are translations of each other. This heuristic presumes that two URLs, with high probability, refer to pages that are translations of each other if both can be transformed into the same string after stripping language identifiers. To improve recall, we allow matches where only one of the pair of URLs contains a language identifier, e.g., https://facebook.com would match https://fr-fr.facebook.com. We further ensure that these matches are high-precision by verifying that the language identifier stripped from the URL reflects the language of the web document as predicted by the language identifier. Table 1 shows a few examples of pairs of aligned URLs. Alignment is performed by normalizing each URL, stripping any language identifiers present. Extra care is taken to ensure relevant separators such as /, &, and ? are stripped as well to ensure proper alignment between URLs.
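A minimal sketch of this URL-matching heuristic; the list of language identifiers and the three stripping patterns (subdomain, path segment, query parameter) are a small illustrative subset of the actual rule set:

```python
import re

# Illustrative subset of language identifiers; the real rule set
# covers many more codes and URL patterns.
LANG_CODES = r"(?:en|fr|de|es|pt|it|fr-fr|en-us|de-de)"

def strip_lang_id(url):
    """Normalize a URL by stripping language identifiers appearing as
    subdomains (fr-fr.site.com), path segments (/fr/), or query
    parameters (?lang=fr), together with their separators."""
    url = re.sub(r"^(https?://)" + LANG_CODES + r"\.", r"\1", url)
    url = re.sub(r"/" + LANG_CODES + r"(?=/|$)", "", url)
    url = re.sub(r"[?&]lang=" + LANG_CODES + r"(?=&|$)", "", url)
    return url

def urls_aligned(url_a, url_b):
    """Two URLs are candidate translations if they collapse to the
    same string after stripping language identifiers."""
    return strip_lang_id(url_a) == strip_lang_id(url_b)
```

In the full pipeline, a candidate pair is kept only if the stripped identifier also agrees with the document's predicted language.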
For simplicity of implementation and to reduce the volume of aligned documents, we restrict the source URL to English documents and allow the target URL to vary among the 92 target languages. Given these rules and restrictions, we mined 54 million aligned documents across 12 Common Crawl snapshots. See Figure 1 for a detailed breakdown per language. We assess the efficacy of this rule-based alignment in the next section.

Dataset Evaluation
In this section, we analyze the quality of our cross-lingual URL-aligned dataset. The first evaluation assesses quality by measuring the precision of a representative sample of the URL-aligned data against human-annotated alignment judgments. The second evaluation assesses the data via a downstream task: by first mining the aligned documents for parallel bitexts and then using these bitexts as training data for massively multilingual machine translation, we can assess the overall quality of machine translation models trained solely on these mined bitexts.

Dataset Quality Evaluation
To assess the effectiveness of stripping language identifiers as a method for identifying web document pairs that are cross-lingual translations, we recruit human annotators to evaluate the alignments. We first select 6 languages from various language families, scripts, and levels of resource availability. For each language, we identify 30 pairs of URLs, for a total of 180 pairs from the aligned dataset. To gather pairs from a diverse set of websites, each URL pair is selected from a distinct web domain. Twelve human annotators were tasked with evaluating URL pairs by loading the two webpages corresponding to each URL pair side by side and assessing whether the rendered content is both comparable and in the correctly tagged languages. We ensure that each URL pair is evaluated by multiple annotators. Table 2 shows the precision of the URL-aligned documents when compared to the human-annotated ground truth. In addition, we report the agreement among annotators as measured by Krippendorff's alpha (Krippendorff, 2011). Overall, the URL pairs appear to adhere to human standards of comparability, with a majority of measured directions achieving precision over 80%. After reviewing annotator comments and analyzing the misaligned documents, most mis-classified pairs appear to be due to one of two reasons: (1) the majority of dynamic content within a document pair is in the same language, so that only boilerplate text such as columns and titles are translations; many annotators therefore do not consider the document pairs to be translations of each other. (2) The content in one of the parallel documents is much shorter than the document in the original (dominant) language; in this case, annotators judge that the two web documents are not translations, as one is a shorter paraphrase of the other.
In addition to the URL-mined document pairs, we release the human-annotated sub-sampled pairs.

Machine Translation Evaluation
To assess the quality of the aligned document corpus, we propose a downstream task that leverages the aligned document data as a source of supervision for a massively multilingual machine translation task.
The first step is to decompose and mine the aligned document corpus for parallel sentences. We segment each document into sentences, then apply the Moses tokenizer (without true-casing) to tokenize each sentence. Given each document pair's decomposition into tokenized sentences, we seek to align sentences within each pair of documents. We can then aggregate these parallel sentences across all document pairs to form a parallel sentence dataset suitable for training machine translation models.
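This decomposition can be sketched as below; the regex sentence splitter and the whitespace-and-punctuation tokenizer are simplified stand-ins for a proper language-aware segmenter and the Moses tokenizer (e.g. via the sacremoses package):

```python
import re

def split_sentences(document):
    """Naive sentence segmentation on terminal punctuation; a
    stand-in for a proper language-aware segmenter."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Simplified tokenizer splitting words from punctuation; a
    stand-in for the Moses tokenizer."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def decompose(document):
    """Turn a document into a list of token lists, one per sentence,
    ready for sentence-level alignment."""
    return [tokenize(s) for s in split_sentences(document)]
```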
We apply a recent approach for mining parallel cross-lingual texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018). This method has been shown to accurately align and filter sentences across a variety of low-, mid-, and high-resource directions.
We apply the open-source LASER toolkit (Artetxe and Schwenk, 2018) which provides a language agnostic sentence encoder and use the margin-based filtering criterion.
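The margin-based filtering criterion scores a candidate sentence pair by its cosine similarity relative to the average similarity of each sentence's nearest neighbors. A minimal pure-Python sketch of the ratio-margin variant, over toy vectors; in practice the vectors are LASER embeddings and the neighborhood size k is larger:

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def avg_knn_sim(v, others, k):
    """Average cosine similarity of v to its k nearest neighbors."""
    sims = sorted((cos(v, o) for o in others), reverse=True)
    return sum(sims[:k]) / k

def margin(x, y, xs, ys, k=4):
    """Ratio margin: cos(x, y) divided by the mean of the average
    k-NN similarities of x among ys and of y among xs."""
    denom = (avg_knn_sim(x, ys, k) + avg_knn_sim(y, xs, k)) / 2.0
    return cos(x, y) / denom
```

Pairs whose margin exceeds a threshold are retained, which penalizes sentences that are generically similar to many candidates rather than specifically similar to one.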
After mining parallel sentences from the aligned documents, we perform large-scale neural machine translation training on the extracted bitexts. First, the data is processed to induce a 5000-subword vocabulary using SentencePiece (Kudo and Richardson, 2018). The model is a fairseq transformer with embeddings shared between the encoder and decoder, 5 encoder and 5 decoder layers of dimensionality 512, feed-forward networks of dimension 2048 with 2 attention heads, and encoder and decoder layer normalization. Dropout of 0.4, attention dropout of 0.2, and ReLU dropout of 0.2 are applied. The Adam optimizer is used to train the model for 100 epochs by optimizing a smoothed cross-entropy loss with 0.2 label smoothing. After training models for each direction, we evaluate the quality of the learned NMT models on a publicly available dataset consisting of transcribed and translated TED talks in 50 languages (Qi et al., 2018). This helps assess the quality of aligned data from our corpus, as all parallel sentences must be extracted from aligned document pairs.
In Table 3, we report the BLEU scores of models trained on bitexts mined from the aligned documents, evaluated on the TED talk dataset, as well as the number of distinct aligned sentence pairs (reported in millions). Based on these results, it appears that European languages yield higher-quality sentences than non-European ones, regardless of the resource level of the direction. Additionally, documents aligned across high-resource directions yield enough high-quality aligned data to learn high-quality models. While these BLEU scores should be taken in the context of the volume of aligned bitexts, one can get an intuition as to the quality of the underlying URL-aligned documents the sentences were mined from via the resultant test-set BLEU scores.
Finally, we compare the test-set BLEU scores to those of a dataset mined from Wikipedia using LASER sentence embeddings and margin-based sentence alignment. All preprocessing and experimental conditions, including model hyper-parameters, were held constant between these two NMT experiments, making the BLEU scores directly comparable. As seen in Figure 2, sentences mined from the URL-aligned Common Crawl corpus are of quality comparable to the Wikipedia-mined data, resulting in higher BLEU scores for 39 out of the 54 evaluated language directions (72.2%). This demonstrates that, although we restrict parallel sentence alignment to documents that have been aligned by our URL rule set, the mined sentences are of high quality, indicating adequately aligned documents.

Baselines & Evaluation
In Section 4, we verified the quality of the URL-aligned dataset through human evaluation and evaluation on a downstream task. In this section, we treat the URL-aligned dataset as a high-precision, low-recall dataset and evaluate baselines that score document pairs based on content rather than URL information. The scored document pairs are then aligned via a greedy bipartite matching algorithm. The resultant alignments are evaluated on a subset of the URL-aligned dataset, which is treated as ground truth.

Problem Definition
Given a set of source documents D_s and a set of target documents D_t, there exist |D_s| × |D_t| potential pairs of documents, where each document pair is of the form (d_s, d_t) s.t. d_s ∈ D_s and d_t ∈ D_t. Let P be the set of all candidate pairs (D_s × D_t). Cross-lingual document alignment then aims to find the largest mapping P' ⊆ P such that the mapping is injective from D_s to D_t: that is, the largest set of pairs of documents from source to target such that each source document and each target document is used in at most a single pair.
In the remainder of this section, we introduce some document pair scoring functions that attempt to capture the notion of cross-lingual document similarity. We then describe a simple alignment process that leverages the similarity scores to align documents between source and target.

Document Embedding Similarity
To guide the alignment algorithm, a notion of cross-lingual document similarity is necessary. This score should capture the fact that two documents are semantically similar despite having some or all of their content in different languages.

Sentence Averaging Embedding (SAE). The LASER encoder is used to encode each sentence s_i into a dense vector v_{s_i}. After embedding each sentence in a document, a document embedding v_d is computed by averaging the N sentence vectors:

v_d = (1/N) Σ_{i=1}^{N} v_{s_i}    (1)

Scoring. Using the dense document representations for each document from the source and target sets, the next step is to score pairs to evaluate how semantically similar two documents are. Given two documents a and b with embeddings v_a and v_b, we compute their semantic similarity using a cosine similarity score:

score(a, b) = (v_a · v_b) / (‖v_a‖ ‖v_b‖)    (2)
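The sentence-averaging embedding and cosine scoring described above can be sketched in plain Python; real sentence vectors would come from the LASER encoder:

```python
import math

def doc_embedding(sentence_vectors):
    """Sentence averaging embedding (SAE): average the per-sentence
    vectors (e.g. from LASER) into a single document vector."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(v[i] for v in sentence_vectors) / n for i in range(dim)]

def score(doc_a, doc_b):
    """Cosine similarity between two document vectors."""
    dot = sum(x * y for x, y in zip(doc_a, doc_b))
    na = math.sqrt(sum(x * x for x in doc_a))
    nb = math.sqrt(sum(x * x for x in doc_b))
    return dot / (na * nb)
```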

Greedy Alignment
Using the baseline scoring function, we score all document pairs within the same web domain that belong to the source and target languages respectively. As such, for any given domain, each document in the source document set D_s is paired with each document in the target set D_t, yielding |D_s| × |D_t| scored pairs, a fully connected bipartite graph. As in (Buck and Koehn, 2016b), the expected output assumes that each page in the non-dominant language has a translated or comparable counterpart, yielding an expected min(|D_s|, |D_t|) aligned pairs. While an optimal matching that maximizes the total score can be found with the Hungarian algorithm (Munkres, 1957), its O(max(|D_s|, |D_t|)^3) complexity is intractable for even moderately sized web domains. As such, similar to the work in (Buck and Koehn, 2016b), a one-to-one matching between English and non-English documents is enforced by applying a greedy bipartite matching algorithm. In Algorithm 1, the algorithm first scores each candidate document pair using the document similarity scoring function. These candidates are then sorted from most similar to least similar by their numerical score. The algorithm then iteratively chooses the document pair with the highest score, as long as neither the d_s nor the d_t of the pair has been used in a previous (higher-scoring) pair. The algorithm terminates when min(|D_s|, |D_t|) pairs have been selected. Unlike the Hungarian algorithm, the runtime complexity is a more tractable O(|D_s||D_t| log(|D_s||D_t|)), dominated by the cost of sorting all candidate pairs.
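The greedy matching just described can be sketched as follows; the score dictionary and index-based interface are illustrative assumptions:

```python
def greedy_align(scores, n_src, n_tgt):
    """Greedy one-to-one bipartite matching.

    scores: dict mapping (src_index, tgt_index) -> similarity score.
    Returns up to min(n_src, n_tgt) pairs, repeatedly taking the
    highest-scoring pair whose documents are both still unused."""
    used_src, used_tgt, pairs = set(), set(), []
    budget = min(n_src, n_tgt)
    # Sorting all candidates from most to least similar dominates the
    # running time: O(|Ds||Dt| log(|Ds||Dt|)).
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s in used_src or t in used_tgt:
            continue
        used_src.add(s)
        used_tgt.add(t)
        pairs.append((s, t))
        if len(pairs) == budget:
            break
    return pairs
```

Because each source and target index is consumed at most once, the result is an injective matching, as required by the problem definition.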

Baseline Results
We evaluate the baseline scoring functions by aligning documents from a subset of the 12 Common Crawl snapshots. We score document pairs from the source and target languages within the same web domain, then apply the greedy document alignment procedure of Algorithm 1 to ensure the technique cannot simply align all pairs. Recall (i.e., the percentage of the aligned pages in the test set that are found) is computed on a test set consisting of pairs from the URL-aligned documents, which we verified to have high precision and which we treat as the ground-truth test set. We show the alignment results in Table 4. Comparing DDE, which directly applies LASER to the entirety of the document content, we see that performance is significantly lower than SAE, which averages the individual sentence embeddings. We suspect this is the case for two reasons: (1) sentence encoders may struggle to represent the semantic meaning of long documents; (2) there may be noisy boilerplate content at the beginning of each web document that is less useful semantically but dominates the representation. Intuitively, higher-resource directions appear to align better than lower-resource directions. We believe this is a byproduct of the data LASER was trained on, which is predominantly high-resource.

Conclusion
In this paper, we apply URL-matching rules to curate a high-quality cross-lingual document dataset from the Common Crawl corpus. Our dataset contains document pairs in 92 different languages aligned with English. We first directly evaluate the quality of the URL-aligned pairs using human annotators. We further evaluate the URL-aligned documents on a downstream machine translation task by decomposing the aligned documents into aligned sentences, then training machine translation models across all 92 directions. Finally, we introduce and evaluate a new, general embedding-based baseline technique for aligning documents based on content rather than meta-information such as URLs.