Findings of the WMT 2016 Bilingual Document Alignment Shared Task

This paper presents the results of the WMT16 Bilingual Document Alignment Shared Task. Given crawls of web sites, we asked participants to align documents that are translations of each other. 11 research groups submitted 19 systems, with a top performance of 95.0%.


Introduction
Parallel corpora are especially important for training statistical machine translation systems, but so far the collection of such data within the academic research community has been ad hoc and limited in scale. To promote this research problem we organized a shared task on one of the core processing steps in acquiring parallel corpora from the web: aligning bilingual documents from crawled web sites.
The task is to identify pairs of English and French documents from a given collection of documents such that one document is the translation of the other. As possible pairs we consider all pairs of documents from the same webdomain for which the source side has been identified as (mostly) English and the target side as (mostly) French.
Lack of data in some cases has held back research. To give an example, there are significant research efforts on various Indic languages (Post et al., 2012; Joshi et al., 2013; Singh, 2013), but this work has been severely hampered, since it uses very small amounts of data. But even for the language pairs tackled in high-profile evaluation campaigns, such as the ones organized around WMT, IWSLT, and even NIST, we use orders of magnitude less data than what has been reported to be used in the large-scale efforts of Google or Microsoft. This diminishes the value of research findings: reported improvements for methods may not hold up once more data is used. Work in reduced data settings may also distract from efforts to tackle problems that do not go away with more data, but are inherent limitations of current models.

Related Work
Although the idea of crawling the web indiscriminately for parallel data goes back to the 20th century (Resnik, 1999), work in the academic community on extraction of parallel corpora from the web has so far mostly focused on large stashes of multilingual content in homogeneous form, such as the Canadian Hansards, Europarl (Koehn, 2005), the United Nations (Rafalovitch and Dale, 2009; Ziemski et al., 2015), or European Patents (Täger, 2011). A nice collection of the products of these efforts is the OPUS web site (Skadiņš et al., 2014).
Because these efforts focus on individual web sites, they allow for writing specific rules for aligning documents as well as extracting and aligning content. Scaling these manual efforts to thousands or millions of web sites is not practical.
A typical processing pipeline breaks up parallel corpus extraction into five steps:
• Identifying web sites with bilingual content
• Crawling web sites
• Document alignment
• Sentence alignment
• Sentence pair filtering
For each of these steps, there has been a varying amount of prior work, and for some, tools are readily available. Since there has been comparatively little work on document alignment, we picked this problem as the subject for the shared task this year, but other steps are valid candidates for future tasks.

Web Crawling
Web crawling is a topic that has not received much attention from a specific natural language processing perspective. There are a number of challenges, such as identifying web sites with multilingual content, avoiding crawling web pages with identical textual content, learning how often to recrawl web sites based on the frequency of newly appearing content, avoiding crawling large sites that have content in different languages that is not parallel, and so on.
For the preparation of this shared task we used the tool HTTrack, a general web crawler that can be configured in various ways. Papavassiliou et al. (2013) present the focused crawler ILSP-FC, which integrates crawling more closely with subsequent processing steps like text normalization and deduplication.

Document Alignment
Document alignment can be defined as a matching task that takes a pair of documents and computes a score that reflects the likelihood that they are translations of each other. Common choices include edit distance between linearized documents (Resnik and Smith, 2003), cosine distance of idf-weighted bigram vectors (Uszkoreit et al., 2010), and probability of a probabilistic DOM-tree alignment model (Shi et al., 2006).
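As a rough illustration of one such scoring function, the sketch below computes the cosine similarity of idf-weighted word-bigram vectors. It is not the implementation of any cited system: the tokenization (lowercased whitespace splitting), the idf formula log(N/df), and all function names are assumptions, and both documents are assumed to be in the same language already (for instance after machine translation).

```python
import math
from collections import Counter


def bigrams(text):
    """Word bigrams of a lowercased, whitespace-tokenized text (assumed tokenization)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def idf_weights(documents):
    """idf = log(N / df) for every bigram seen in the collection (assumed formula)."""
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(bigrams(doc)))
    n_docs = len(documents)
    return {g: math.log(n_docs / df) for g, df in document_frequency.items()}


def weighted_vector(text, idf):
    """Sparse vector mapping each bigram to count * idf."""
    counts = Counter(bigrams(text))
    return {g: c * idf.get(g, 0.0) for g, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A score of this kind is computed for every candidate pair within a webdomain; how pairs are then selected from the score matrix is a separate design decision.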

Sentence Alignment
The topic of sentence alignment has received a lot of attention, dating back to the early 1990s with the influential Church and Gale algorithm that is language-independent and easy to implement. It relies on relative sentence lengths for alignment decisions and hence is not tolerant to noisy input.
It is not clear which of these tools fares best with the noisy parallel text that we can expect from web crawls, which may have spurious content and misleading boilerplate.

Filtering
A final stage of the processing pipeline filters out bad sentence pairs. These exist either because the original web site did not have any actual parallel data (garbage in, garbage out), or due to failures of earlier processing steps.
As Rarrick et al. (2011) point out, a key problem for parallel corpora extracted from the web is filtering out translations that have been created by machine translation. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and machine translation output due to lack of reordering.
This year, a shared task on sentence pair filtering was organized, albeit in the context of cleaning translation memories, which tend to be cleaner than the data at the end of a pipeline that starts with web crawls.

Comprehensive Tools
For a few language pairs, there have been individual efforts to cast a wider net, such as the billion word French-English corpus collected by Callison-Burch et al. (2009), or a 200 million word Czech-English corpus collected by Bojar et al. (2010). Smith et al. (2013) present a set of fairly basic tools to extract parallel data from the publicly available web crawl CommonCrawl.
In all these cases, the corpus collection effort reinvented the wheel and wrote dedicated scripts to download web pages, extract text, and align sentences, with hardly any description of the methods used.
Our data preparation for the shared task builds partly on Bitextor, which is a comprehensive pipeline from corpus crawling to sentence pair cleaning (Esplà-Gomis, 2009).

Training and Test Data
We made available crawls of web sites (defined as pages under the same webdomain) that have translated content. We also annotated some document pairs to provide supervised training data to the participants of the shared task.

Terminology
A quick note on terminology: Unfortunately, the notion of domain is ambiguous in NLP applications, and we use an unusual meaning of the word in this report. To avoid confusion we will instead use the term webdomain to refer to content from a specific website, e.g., "This page is from the statmt.org webdomain." We distinguish between webdomains using their Fully Qualified Domain Name (FQDN). Thus, www.example.com and example.com are considered to be different webdomains.
We will use source to denote English pages and target for French ones. This does not imply that translation was performed in that direction. In fact, we cannot know if translation from one side to the other was performed at all: both sides could be translations of a document in a third language.
The task was organized as part of the First Conference on Machine Translation (WMT), and all data can be downloaded from its web page.

Data Preparation
We crawled full web sites with the web site copier HTTrack, from the homepage down, restricted to HTML content. Web sites differed significantly in their size, from a few hundred pages to almost 100,000.
In the test data we removed all duplicates from the crawl. Duplicates are defined as web pages whose text content is identical; they may differ in markup and URL. We extracted the text with a Python implementation of an HTML5 parser, so that it matches what a browser would display. As the text is free of formatting, determining whitespace is important. While generally following the standard, e.g. inserting line breaks after block-level elements, we found that inserting spaces around <span> tags helps tokenization, as these are often visually separated using CSS.
We restricted the task to the alignment of French and English documents, so we filtered out all web pages that are not in these two languages. However, we did not expect that participants would develop language-specific approaches. To detect the language of a document we feed the extracted text into an automatic language detector. We note that language detection is a noisy process and many pages contain mixed-language content, for example English boilerplate but French content. We take the overall majority language per page as the document language.
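The following sketch illustrates the idea of text-based duplicate removal with the whitespace handling described above. It is an assumption-laden approximation: it uses the standard-library `html.parser` rather than a full HTML5 parser, the `BLOCK_TAGS` set is a hypothetical short list, and the function names are illustrative only.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of block-level elements.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text, adding line breaks at block tags and spaces at <span>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def _boundary(self, tag):
        if tag == "span":
            self.parts.append(" ")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        else:
            self._boundary(tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        else:
            self._boundary(tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the text of a page, one non-empty line per block, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def remove_duplicates(pages):
    """Keep the first (url, text) for each distinct extracted text; pages is (url, html) pairs."""
    seen, unique = set(), []
    for url, html in pages:
        text = extract_text(html)
        if text not in seen:
            seen.add(text)
            unique.append((url, text))
    return unique
```

Because duplicates are keyed on the extracted text alone, two pages with different markup or URLs but identical text collapse into one entry.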
We decided to have a large collection of web sites, to encourage methods that can cope with various types of web sites, such as differing in size, balance in the number of French and English pages, and so on.
Given the large number of correct document pairs, we did not even attempt to annotate all of them, but instead randomly selected a subset of pages and identified their corresponding translated page. We augmented this effort with aligned document pairs that are indicated at the web site Linguee, a searchable collection of parallel corpora, in which each retrieved sentence is annotated with its source web page.
The task then is to find these document pairs. Since this is essentially a recall measure, which can be gamed by returning all possible document pairs, we enforce a 1-1 rule, so that participants may align each web page only once.

Training Data
As training data we provide a set of 1,624 EN-FR pairs from 49 webdomains. The number of annotated document pairs per webdomain varies between 4 and over 200. All pairs are from within a single webdomain; possible matches between two different webdomains, e.g. siemens.de and siemens.com, are not considered in this task.
Table 1 lists all webdomains in the training data. Webdomains range in size from 33×29 pages (schackportalen.nu) to 24,325×43,045 pages (www.nauticnews.com).

Test Data
For testing, we provide 203 additional crawls of new webdomains, distinct from the ones in the training data, in the same format. No aligned pairs are provided for any of these webdomains. We removed exact duplicates of pages, keeping only one instance. Otherwise, we processed the data in the same way as the training data.

Data Format
The training document pairs are specified as one pair per line:
Source URL<TAB>Target URL
For the crawled data we provide one file per webdomain in .lett format, adapted from Bitextor. This is a plain text format with one line per page. Each line consists of 6 tab-separated values:
• Language ID (e.g. en)
• Mime type (always text/html)
• Encoding (always charset=utf-8)
• URL
• HTML in Base64 encoding
• Text in Base64 encoding
To facilitate use of the .lett files we provide a simple reader class in Python. We make sure that the language ID is reliable, at least for the documents in the train and test pairs.
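The six-column layout above can be parsed in a few lines. The sketch below is not the reader class distributed with the task; the names `Page` and `read_lett` are hypothetical, and the input is assumed to be an iterable of already decoded text lines (any decompression or file opening is left to the caller).

```python
import base64
from collections import namedtuple

# Hypothetical record type mirroring the six tab-separated columns.
Page = namedtuple("Page", "lang mime encoding url html text")


def read_lett(lines):
    """Yield one Page per non-empty line of a .lett file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        lang, mime, encoding, url, html_b64, text_b64 = line.split("\t")
        yield Page(
            lang,
            mime,
            encoding,
            url,
            base64.b64decode(html_b64).decode("utf-8"),  # original HTML
            base64.b64decode(text_b64).decode("utf-8"),  # extracted text
        )
```

Typical use would be to iterate over an open file handle, for example `for page in read_lett(handle)`, and split the pages into source and target sets by `page.lang`.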
Text extraction was performed using an HTML5 parser. As the original HTML pages are available, participants are welcome to implement their own text extraction, for example to remove boilerplate.
Additionally, we have identified spans of French text in French documents for which we produced English translations using MT. We use a basic Moses statistical machine translation engine (Koehn et al., 2007) trained on Europarl and News Commentary with decoding settings geared towards speed (no lexicalized reordering model, no additional language model, cube pruning with pop limit 500).
These translations are not part of the .lett files but are provided separately. The format for the source segments and target segments is URL<TAB>Text, where the same URL may occur multiple times if several lines or spans of French text were found. The URLs can be used to identify the corresponding documents in the .lett files.
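Since a URL can occur on several lines, a natural first step is to group the segments by URL. The sketch below assumes an iterable of decoded text lines in the URL<TAB>Text layout; the function name `read_segments` is hypothetical.

```python
from collections import defaultdict


def read_segments(lines):
    """Group URL<TAB>Text lines into a dict mapping each URL to its list of segments."""
    by_url = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, in case the text itself contains tabs.
        url, text = line.split("\t", 1)
        by_url[url].append(text)
    return dict(by_url)
```

The resulting keys can then be looked up against the URL column of the .lett files.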

Baseline Method
We provide a baseline system that relies on the URL matching heuristic used by Smith et al. (2013). Here two URLs are considered a pair if both can be transformed into the same string through stripping of language identifiers. Strings indicating languages are found by splitting a large number of randomly sampled URLs into components and manually picking substrings that correlate with the detected language.
We further improve the approach by allowing matches where only one URL contains a strippable language identifier, e.g. we match x.com/index.htm and x.com/fr_index.htm. If a URL has several matching candidates we pick the one that requires the fewest rewrites, i.e. we prefer the pair above over the pair x.com/en/index.htm and x.com/fr_index.htm.
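The heuristic can be sketched as follows. This is only an illustration under stated assumptions: the patterns in `LANG_PATTERNS` are a hypothetical stand-in for the manually picked language identifiers, which are not reproduced here, and the greedy selection by rewrite count is one plausible reading of "fewest rewrites".

```python
import re
from collections import defaultdict

# Hypothetical language markers; the real list was picked manually from sampled URLs.
LANG_PATTERNS = [
    re.compile(r"/(?:en|fr)(?=/)"),        # path segment such as /en/ or /fr/
    re.compile(r"(?<=/)(?:en|fr)_"),       # file name prefix such as fr_
    re.compile(r"[?&]lang=(?:en|fr)\b"),   # query parameter such as ?lang=fr
]


def strip_language(url):
    """Return (URL with language markers removed, number of rewrites applied)."""
    rewrites = 0
    for pattern in LANG_PATTERNS:
        url, n = pattern.subn("", url)
        rewrites += n
    return url, rewrites


def match_urls(source_urls, target_urls):
    """Pair source and target URLs that normalize to the same string, fewest rewrites first."""
    targets_by_key = defaultdict(list)
    for t in target_urls:
        key, n = strip_language(t)
        targets_by_key[key].append((n, t))

    candidates = []
    for s in source_urls:
        key, n_s = strip_language(s)
        for n_t, t in targets_by_key.get(key, []):
            if n_s + n_t > 0:  # at least one URL must carry a language marker
                candidates.append((n_s + n_t, s, t))

    used, pairs = set(), []
    for _, s, t in sorted(candidates):
        if s not in used and t not in used:
            used.update((s, t))
            pairs.append((s, t))
    return pairs
```

With the hypothetical inputs `["example.org/index.htm", "example.org/en/index.htm"]` and `["example.org/fr/index.htm"]`, the first source URL wins because pairing it requires one rewrite instead of two.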
The baseline achieves roughly 60% recall, compared to 95.0% of the best submission.

Evaluation
Our main evaluation metric is recall of the known pairs, i.e. what percentage of the aligned pages in the test set are found. We strictly enforce the rule that every page may only be aligned once, so that participants cannot just align everything. After a URL has been seen as part of a submitted pair, all later occurrences are ignored.
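A minimal sketch of this scoring procedure is given below. It is not the official scoring script: the function names are hypothetical, and it assumes that only URLs of accepted pairs are marked as seen, which is one possible reading of the rule.

```python
def enforce_one_to_one(pairs):
    """Keep a pair only if neither of its URLs appeared in an earlier kept pair."""
    seen = set()
    kept = []
    for source_url, target_url in pairs:
        if source_url in seen or target_url in seen:
            continue  # a later occurrence of an already used URL is ignored
        seen.update((source_url, target_url))
        kept.append((source_url, target_url))
    return kept


def recall(predicted, gold):
    """Fraction of gold pairs found among the predictions after the 1-1 filter."""
    kept = set(enforce_one_to_one(predicted))
    gold = set(gold)
    return len(kept & gold) / len(gold)
```

Because the filter processes pairs in submission order, the order in which a system lists its predictions affects which of several conflicting pairs is scored.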
After we released the gold standard alignments, a number of participants pointed out that some predicted document pairs were unfairly counted as wrong, even though their content differed only insignificantly from the gold standard.
To give an example, the web pages www.taize.fr/fr_article10921.html?chooselang=1 and www.taize.fr/fr_article10921.html are almost identical, but the first offers a checkbox to select a language, while the second does not. Since the text on the pages differs slightly, these were not detected as (exact) duplicates.
To address this problem, we also included a soft scoring metric which counts such near-matches as correct. We define two pages as close duplicates if the edit distance between their texts, normalized by the maximum of their lengths (in characters), does not exceed 5%.
If we observe a predicted pair (s, t) that is not in the gold set, but (s, t') is and dist(t, t') ≤ 5%, then this pair is still counted as correct. The same applies for a close duplicate s' of s, but not both, as we still follow the 1-1 rule.
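The soft matching criterion can be sketched as follows, assuming character-level Levenshtein distance as the edit distance. The function names and the `text` mapping from URL to extracted page text are hypothetical, and the sketch checks a single predicted pair only; bookkeeping to avoid crediting the same gold pair twice is left out.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer text."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_soft_match(pair, gold_pairs, text, threshold=0.05):
    """True if the pair is in the gold set, or differs from a gold pair on one
    side only, by a close duplicate within the threshold."""
    s, t = pair
    if pair in gold_pairs:
        return True
    for gold_s, gold_t in gold_pairs:
        if gold_s == s and normalized_distance(text[t], text[gold_t]) <= threshold:
            return True
        if gold_t == t and normalized_distance(text[s], text[gold_s]) <= threshold:
            return True
    return False
```

Requiring one side of the pair to match the gold pair exactly reflects the "but not both" condition: a prediction in which both pages are merely near-duplicates of a gold pair is not accepted.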

Results
11 research groups participated in the shared task, some with multiple submissions. The list of participants is shown in Table 2, with a citation of their system descriptions, which are included in these conference proceedings. Each participant submitted one or more collections of document pairs. We enforced the 1-1 rule on the collections, and scored them against the gold standard. Results are summarized in Table 3. Almost all systems outperformed the baseline by a wide margin. The best system is NOVALINCS-URL-COVERAGE with 2,281 correct pairs, 95.0% of the total.
Note that the submissions varied in the number of document pairs, but after enforcing the 1-1 rule, most submissions comprise about 200,000-300,000 document pairs. Table 4 displays the results with soft scoring. Essentially, every system improved, mostly by around 3%. The top two performers swapped places, with YODA now having the best showing with 96.0%. We also experimented with a tighter threshold of 1% which gave almost identical results.

System Descriptions
NOVALINCS (Gomes and Pereira Lopes, 2016) submitted 3 systems that use a phrase table from a phrase-based statistical machine translation system to compute coverage scores, based on the ratio of phrase pairs covered by a document pair. In addition to the purely coverage-based system, NOVALINCS-COVERAGE (88.6%), they also submitted a system that uses coverage-based matching as a preference over URL matching, NOVALINCS-COVERAGE-URL (85.8%), and the converse system that prefers URL matching over coverage-based matching, NOVALINCS-URL-COVERAGE (95.0%).
YODA (Dara and Lin, 2016) submitted one system (93.9%) that uses the machine translation of the French document, and finds the corresponding English document based on bigram and 5-gram matches, assisted by a heuristic based on document length ratio.
UEDIN1 (Buck and Koehn, 2016) submitted one system (89.1%) that uses cosine similarity between tf/idf weighted vectors, extracted by collecting n-grams from the English and machine translated French text. They compare many hyperparameters such as weighting schemes and two pair selection algorithms.
DOCAL (Azpeitia and Etchegoyhen, 2016) submitted one system (88.6%) that used word translation lexicons to compute document similarity scores based on bag-of-word representations. They expand a basic translation lexicon by adding all capitalized tokens, numbers, and longest common prefixes of known vocabulary items.
UEDIN2 (Germann, 2016) submitted 2 systems based on word vector space representations of documents using latent semantic indexing and URL matching, UEDIN LSI (85.8%) and UEDIN LSI (87.6%). In addition to a global cosine similarity score, a local similarity score is computed by re-centering the vector around the mean vector for a webdomain.
Papavassiliou et al. (2016) submitted one system (84.9%), which uses boilerplate removal and carries out document alignment based on features such as links to documents in the same webdomain, URLs, digits, image filenames and HTML structure. Their paper also describes in detail the open source ILSP Focused Crawler.
YSDA (Shchukin et al., 2016) submitted one system (84.1%) that uses n-gram matches between the machine translation of the French document and the English document. They cluster French and English words into bilingual clusters of up to 90 words, starting with word pairs with high translation probability in both directions, and then adding words that translated well into existing words in a cluster.
UA PROMPSIT (Esplà-Gomis et al., 2016) submitted 2 systems based on Bitextor and describe improvements to the Bitextor toolkit. Their submissions contrast the old version of the tool, UA PROMPSIT BITEXTOR 4.1 (31.1%), with the recent release, UA PROMPSIT BITEXTOR 5.0 (83.3%). Improved document alignment quality is based on various new features: ratio of shared links, similarity of link URLs, ratio of shared images, binary feature indicating if the documents are linked, and similarity of URLs, in addition to the old features bag of words similarity using a translation dictionary and DOM structure similarity.
MEDVED (Medved et al., 2016) submitted one system (79.4%), which determines the top 100 keywords based on tf/idf scores for each document and uses word translation dictionaries to match them.
BADLUC (Jakubina and Langlais, 2016) submitted one system (79.3%) that uses the information retrieval tool Apache Lucene to create two indexes, on URLs and text content, and retrieves the most similar documents based on variants of tf/idf scores. Both monolingual queries and bilingual queries based on a word translation dictionary are performed.
ADAPT (Lohar et al., 2016) submitted one system (and a revision) that combines similarity metrics computed on ratio of number of sentences in documents, ratio of number of words in the documents, and matched named entities.
JIS (Mahata et al., 2016) submitted one system (2.0%), which uses text matching based on sentence alignment and word dictionaries. Their paper also describes improvements over the original submission.