Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance

This work describes our submission to the WMT16 Bilingual Document Alignment task. We show that a very simple distance metric, the cosine distance between tf/idf-weighted document vectors, provides a quick and reliable way to align documents. We compare many possible variants for constructing the document vectors. We also introduce a greedy algorithm that runs faster and performs better in practice than the optimal solution to bipartite graph matching. Our approach shows competitive performance and can be improved even further by combining it with URL-based pair matching.


Related Work
The process of finding bilingual data online has been investigated since the early days of the world wide web (Resnik, 1999). In this work we are concerned with the problem of finding pairs of documents, a problem that can be structured into the following steps:

1. Candidate generation. The naive approach of considering all possible pairs of websites is often not applicable on a web scale, even when limiting the scope to a single webdomain. To overcome this computational complexity, previous work has focused on (i) matching pairs of URLs (Resnik and Smith, 2003) by removing language identifiers such as &lang=en or /fr/ from URLs, (ii) considering only documents that either link to each other or that share a parent page (Resnik, 1999) that links to them, (iii) following links on already aligned documents (Shi et al., 2006), (iv) querying a search engine for possible translations (Ruopp and Xia, 2008), and (v) rephrasing the task as near-duplicate detection after translating all non-English content to English (Uszkoreit et al., 2010). Ture et al. (2011) map all document vectors into a target language space and use an approximation of cosine distance based on locality-sensitive hashing (LSH) together with a sliding window algorithm to efficiently collect similar pairs.

2. Document alignment. Once possible pairings have been generated, any distance function that compares two documents can be used to remove unlikely candidates. Common choices include (i) edit distance between linearized documents (Resnik and Smith, 2003), (ii) cosine distance of idf-weighted bigram vectors (Uszkoreit et al., 2010), and (iii) the probability assigned by a probabilistic DOM-tree alignment model (Shi et al., 2006).

Approach
In this work we deal mainly with the second problem, document alignment, and simply allow all possible source/target pairings. The task can thus be formalized as follows: we are given a set of candidate pairings C = {(d_s, d_t) : d_s ∈ D_s, d_t ∈ D_t}, where d_s ∈ D_s are source language documents and d_t ∈ D_t are target language documents. The task is to find a subset C' = {(d_{s,i}, d_{t,i}), …} ⊆ C such that each d_{s,i} is a translation of d_{t,i} (and vice versa) and the number |C'| of such pairings is maximized. We consider all source/target pairings that come from the same webdomain, so that C = D_s × D_t. This yields a fully connected bipartite graph with source and target pages as the two partitions; a scoring function defined on the edges makes the graph weighted. We allow every page to occur at most once in C', i.e. we do not allow 1:n or m:1 connections: for any two distinct pairs (d_{s,i}, d_{t,i}), (d_{s,j}, d_{t,j}) ∈ C' we require d_{s,i} ≠ d_{s,j} and d_{t,i} ≠ d_{t,j}.

Selecting pairs
After computing a score for every edge of the bipartite graph, a matching of maximum weight can be found in O(max(|D_s|, |D_t|)^3) time by solving the assignment problem with the Kuhn-Munkres algorithm (Munkres, 1957). We expect every page of the non-dominant language to have a translated counterpart, so min(|D_s|, |D_t|) pairs are generated.
In Section 3.3, we compare the optimal assignment to a greedy solution that incrementally chooses the edge with the highest score and removes all other edges incident to the two matched vertices. The greedy algorithm stops once no edges are left; it produces the same number of pairs as the optimal solution but only requires O(|D_s||D_t| × log(|D_s||D_t|)) time to sort the score matrix.
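The greedy selection described above can be sketched as follows; this is a minimal illustration, not the submission code, and the function name and input format are our own choices:

```python
def greedy_select(scores):
    """Greedily select document pairs from the edge scores of the
    bipartite graph.

    scores: dict mapping (source_index, target_index) -> similarity.
    Returns a list of (source_index, target_index) pairs in which
    every page occurs at most once (no 1:n or m:1 connections).
    """
    # Sort all edges once by descending score:
    # O(|Ds||Dt| log(|Ds||Dt|)) time, dominating the overall cost.
    edges = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    used_src, used_tgt, pairs = set(), set(), []
    for (s, t), _score in edges:
        if s in used_src or t in used_tgt:
            continue  # a higher-scoring edge already claimed this page
        used_src.add(s)
        used_tgt.add(t)
        pairs.append((s, t))
    return pairs
```

Skipping already-claimed vertices implements the "remove all other edges incident to the matched vertices" step without mutating the edge list.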

Experiments
In total, the training dataset consists of 1624 document pairs from 49 web domains. The number of annotated aligned document pairs per web domain ranges from 4 to over 200.
Our experiments that led to the selection of the method used on the evaluation data are all based on a fixed, random split of this data into training and development (dev) portions: 998 document pairs in 24 web domains for training and 626 document pairs in 25 web domains for dev. The former is used for extensive experimentation, the latter to select the best approach for our shared task submission.

Performance considerations
Our approach requires producing a dense matrix of feature values, which seems prohibitively expensive given the high number of possible pairings. In practice, even for the largest webdomains in our data, which require scoring roughly 1B possible pairs, we are able to produce all values quickly enough that the run-time is dominated by I/O and preprocessing steps such as tokenization. Speed and, more importantly, memory consumption can be further improved by pruning all n-grams that occur fewer times than a set threshold in the corpus. We find empirically that keeping this minimum count cutoff very low, somewhere below 10, is crucial for maintaining high recall, as shown in Figure 1.
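The count-based pruning could look like the following sketch; the helper name and interface are illustrative, and the threshold default is arbitrary:

```python
from collections import Counter

def pruned_ngram_counts(docs, n=1, min_count=2):
    """Count word n-grams over tokenized documents, dropping rare ones.

    docs: iterable of token lists (one list per document).
    Returns a Counter containing only n-grams that occur at least
    `min_count` times in the whole corpus; pruning shrinks the
    vocabulary and thus the document vectors held in memory.
    """
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_count})
```

As noted above, the threshold must stay low (below 10 in our experiments) or recall suffers, since rare n-grams such as names and numbers carry strong alignment cues.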

TF-IDF weighting
In the literature (Manning et al., 2008) a number of different weighting schemes based on tf/idf have been proposed with the overall goal to assign lower scores to terms (or n-grams) that are less discriminatory for document comparison.
However, these approaches usually aim at document retrieval, i.e. finding relevant documents given a large (in comparison to the overall document size) number of search terms. In the setting of near duplicate detection, our query is a complete document and other weighting schemes may apply.
To empirically evaluate the fitness of different approaches we implement several weighting schemes for the term frequency tf(·) and, analogously, for the inverse document frequency idf(w_1^n, D_s, D_t), where D = D_s ∪ D_t. Slight variations of these definitions can be found in the wild; for example the search engine Apache Lucene uses tf6 and idf6, but with 1 + |D_s ∪ D_t| in the numerator since version 6. We evaluate the cross product of weighting schemes using the train and dev splits described above. Looking at the results in Tables 2 and 3, a number of interesting observations can be made:

1. Performance differs between train and dev data, with results on the training portion several percent better. This indicates a skew in the data distribution, which is surprising given that the webdomains were selected beforehand. We know from the training data that about 1/4 of the known pairs, 236 of 998, are found in a single webdomain, tsb.gc.ca, which could explain the skew. However, the difference remains even if that large webdomain is removed.
Further investigation reveals that the poor performance on the dev set can be attributed to three webdomains that contain near duplicates, i.e. pages with the same main content but interface elements in a different language.
2. When choosing the length of the scoring n-grams, shorter is better. Good recall can be achieved using 1-grams for the monolingual case, where no machine translated (MT) data is used, and 1-grams or 2-grams for the case where all French data is translated to English beforehand.
3. In tf-idf weighting the inverse document frequency acts as an indicator of a term's importance. This matters in information retrieval, where query words differ in utility. In the duplicate detection setting idf weights play a less important role, and a common choice such as idf3, defined in Equation 9, can be used throughout.
4. Results produced using only the untranslated text, a configuration that requires no bilingual resources and few computational resources, are better than we expected: only between 5% (for train) and 8% (for dev) below the recall achieved using machine translated texts. In this case we simply ignore that two pages are written in different languages and rely only on untranslated parts such as boilerplate, names, and numbers to provide sufficient cues.
For our submission we used the machine translated text provided by the organizers and chose n = 2, tf4 (Equation 4), and idf3 (Equation 9).
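The scoring pipeline can be sketched as follows. This is a generic illustration using a raw term frequency and one common smoothed idf variant as placeholders, not necessarily the tf4/idf3 combination chosen above; all function names are our own:

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Build a tf/idf-weighted sparse vector (dict) for one document.

    Illustrative weighting only: raw term frequency times a smoothed
    inverse document frequency log(|D| / (1 + df)).  The paper
    evaluates a cross product of tf and idf variants; this is just
    one common choice.
    """
    tf = Counter(tokens)
    return {w: c * math.log(num_docs / (1.0 + doc_freq.get(w, 0)))
            for w, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Storing vectors as dicts keeps only non-zero entries, which combines naturally with the n-gram pruning discussed earlier.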

Greedy vs. optimal solution
We found that producing the optimal solution for the assignment problem using the Kuhn-Munkres algorithm (Munkres, 1957) was slightly worse in almost all cases. We hypothesize that by maximizing the aggregate score over all selected pairs, the low-scoring pairs for which no matching document exists are over-emphasized. To test this hypothesis we compare the scores of the selected pairs for both algorithms. Let s(d_s, d_t) be our scoring function, in this case cosine similarity. For each webdomain, let (d_{s,g_1}, d_{t,g_1}), …, (d_{s,g_N}, d_{t,g_N}) be the document pairs selected by the greedy algorithm and (d_{s,o_1}, d_{t,o_1}), …, (d_{s,o_N}, d_{t,o_N}) those selected by the optimal algorithm, each sorted by score such that s(d_{s,g_i}, d_{t,g_i}) ≥ s(d_{s,g_{i+1}}, d_{t,g_{i+1}}) for all i, and likewise for the optimal pairs. Let ∆(n) be the accumulated difference of scores for the first n pairs:

∆(n) = Σ_{i=1}^{n} [ s(d_{s,o_i}, d_{t,o_i}) − s(d_{s,g_i}, d_{t,g_i}) ]

Since the greedy algorithm is not necessarily optimal, we know that ∆(N) ≥ 0. However, as can be seen from Figure 2, the greedy selection of the best scoring pairs outperforms the Kuhn-Munkres algorithm for the top-scoring half, confirming our assumption that lower scoring pairs are selected in order to find better scoring matches for the documents without a counterpart.

Figure 2: Difference in accumulated cosine distances between the greedy and the optimal algorithm. For more than the first half of the selected pairs, the greedy algorithm overall outperforms the optimal one, indicated by a negative ∆(n).
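Computing the accumulated difference is straightforward once both selections are scored; a minimal sketch, assuming the two score lists are already sorted in descending order (the function name is illustrative):

```python
def accumulated_delta(optimal_scores, greedy_scores):
    """Accumulated score difference Delta(n) between the algorithms.

    Both inputs are the similarity scores of the selected pairs,
    sorted in descending order.  A negative Delta(n) means the greedy
    algorithm's top-n pairs score higher in total than the optimal
    algorithm's top-n pairs.
    """
    deltas, acc = [], 0.0
    for opt, greedy in zip(optimal_scores, greedy_scores):
        acc += opt - greedy
        deltas.append(acc)
    return deltas
```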
We note that even after selecting 10 000 pairs, the accumulated difference remains comparatively small, hinting that very similar sets have been selected. Figure 3 shows the Jaccard similarity between the top n pairs of both algorithms and confirms that either approach selects virtually the same set of pairs for low n. Thus, the globally optimal solution is not only expensive to compute but also very similar to the greedy selection, and it outperforms the greedy algorithm mostly for pairs in the tail that are likely misaligned anyway, because no translated page exists. Hence, all our reported results use the greedy selection introduced in Section 2.1.

Results
The test data for the shared task consists of 203 crawled websites, all distinct from the training set. No additional known pairs are provided for these webdomains, but the organizers offer translations of French text into English, as for the training data. As above, performance is evaluated via recall under the condition that every document can only be part of a single pair. The number of pages per domain varies wildly, between 9 and almost 100k. In the latter case, 50k pairs need to be picked from roughly 2.5B possibilities. After preprocessing steps such as tokenization, we produce 368 260 pairs using greedy selection and cosine distance as explained above. For all webdomains this takes less than 4 hours on a single machine.
In total, 13 research teams contributed 21 submissions to the shared task. The official results can be found in Table 4. Our submission ranks in 3rd place. We would like to point out that, apart from selecting the best performing tf/idf weighting method, the training data is not used at all. Thus, besides a baseline machine translation system, no additional resources are needed, which makes our approach widely applicable.
A baseline system based on matching URL patterns such as site.com/home-fr/ and site.com/home/en/, as used in previous work (Resnik and Smith, 2003; Smith et al., 2013), is provided by the organizers. We combine our approach with the baseline by simply selecting all 148 537 baseline pairs first. While not an official submission, Table 4 shows that this combination outperforms all other systems.

Table 4: Official results on the shared task test data. Results described in this work are in bold. Across all webdomains, a total of 2402 known pairs were to be found. (†) indicates a non-official result that was produced post-submission.
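The combination scheme amounts to trusting the URL-matched pairs first and filling the remainder by score; a minimal sketch with illustrative names, where the constraint that each page occurs in at most one pair is preserved:

```python
def combine(baseline_pairs, ranked_candidates):
    """Combine URL-baseline pairs with similarity-ranked candidates.

    baseline_pairs: list of (source, target) pairs from URL matching,
    selected unconditionally first.  ranked_candidates: remaining
    (source, target) candidates sorted by descending cosine score,
    used to fill in pages the baseline did not cover.
    """
    used_src = {s for s, _ in baseline_pairs}
    used_tgt = {t for _, t in baseline_pairs}
    result = list(baseline_pairs)
    for s, t in ranked_candidates:
        if s not in used_src and t not in used_tgt:
            used_src.add(s)
            used_tgt.add(t)
            result.append((s, t))
    return result
```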

Conclusion
We present a comparison of tf/idf weighting schemes for comparing original and translated documents via cosine distance. We find that the right choice of term frequency (tf) weighting is crucial in this setting, along with the inclusion of low-frequency words. We compare a greedy selection algorithm to a computationally more expensive algorithm that yields a slightly better global solution, and show that the former often outperforms the latter in practical settings where a tail of un-pairable documents exists.
Our best results are based on machine translated documents. However, even when ignoring the fact that two documents are written, at least partially, in different languages, we are still able to discover a substantial number of parallel pages.
Results of the shared task show that our approach, which only uses the website's text, yields competitive results. Results improve further when our predictions are combined with pairs found via URL matching.