WMT2016: A Hybrid Approach to Bilingual Document Alignment

Large aligned corpora are required for any computer-aided translation system to become effective. Bilingual document alignment has therefore gained considerable importance in recent years. We attempt a simple yet effective approach to align the URLs (Uniform Resource Locators) of documents in two languages as part of the WMT2016 Bilingual Document Alignment Shared Task. Our approach processes the URLs and their embedded texts, which serve as the main matching criterion. To align the text, we use the Gale-Church algorithm, dictionary-based translation and Cosine Similarity, which together help us achieve better results in the alignment task.


Introduction
Bilingual document alignment has gained considerable importance in recent years [Brown et al.1991, Warwick et al.1990, Gale and Church1991, Kay and Röscheisen1993, Simard et al.1992, Kupiec1993, Matsumoto et al.1993, Dagan et al.1999]. Research on calculating the similarity of bilingual comparable corpora is attracting more attention [Vu et al.2009, Pal et al.2014]. The growth of monolingual data in different languages has made the task of aligning documents difficult [Jagarlamudi et al.2011]. What makes the problem more critical is the fact that one sentence in one language can correspond to many sentences in a different language [Wu1994]. For any translation system to work correctly and efficiently, it has to be fed with a large parallel corpus. Such corpora are very hard to find [Smith et al.2010], since building them involves serious manual labour and cost. To eliminate the high cost, computer-aided sentence alignment of two different corpora has become very desirable. The presence of such computer-aligned corpora aids many Natural Language Processing (NLP) tasks such as Machine Translation, Word Sense Disambiguation and Cross-Lingual Information Extraction [Patry and Langlais2005].
In our current task, we have worked on the data provided by the WMT shared task, which consisted of web crawls of 203 websites with pages in both English and French. The task was to extract 1-1 pairings of English and French URLs that have the same content in their respective languages. The data contained URLs, each followed by the text of that URL. We extracted the text from both the English and French URLs and aligned them using our alignment algorithm. After alignment of the text, the URLs to which the texts belong were also aligned. Our algorithm makes use of the concepts given by [Gale and Church1991], translation of words using a dictionary created by the Anymalign package [Lardilleux et al.2012] and the concept of Cosine Similarity. The following section documents the algorithm. The working of the algorithm is shown in Section 2, followed by the results in Section 3.

Text and URL Extraction
The given .lett files are opened and the URLs and their texts are extracted. The extracted texts and URLs are given IDs so that the URLs can easily be aligned once the texts have been aligned. The process is shown in Figure 1.
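The extraction step can be sketched as follows. This is only a minimal illustration, assuming that each line of a .lett file holds six tab-separated fields (language code, MIME type, encoding, URL, Base64-encoded HTML, Base64-encoded text). That field layout, and the names `Page` and `parse_lett_lines`, are assumptions made for this sketch.

```python
import base64
from typing import Iterable, List, NamedTuple


class Page(NamedTuple):
    page_id: int  # running ID, used later to map an aligned text back to its URL
    lang: str
    url: str
    text: str


def parse_lett_lines(lines: Iterable[str]) -> List[Page]:
    """Extract (ID, language, URL, text) records from .lett-style lines.

    Assumed layout per line (hypothetical for this sketch):
    lang <TAB> mime <TAB> encoding <TAB> url <TAB> html_base64 <TAB> text_base64
    """
    pages: List[Page] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip lines that do not follow the assumed layout
        lang, _mime, _encoding, url, _html_b64, text_b64 = fields
        text = base64.b64decode(text_b64).decode("utf-8", errors="replace")
        pages.append(Page(len(pages) + 1, lang, url, text))
    return pages
```

The running `page_id` plays the role of the IDs described above: every later step can carry the ID along with the text and recover the URL at the end.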

Text Selection using Algorithm proposed by Gale-Church
In their paper [Gale and Church1991], Gale and Church suggested that the source sentence and its translation tend to have similar lengths. This idea forms the basis of our proposed system. For each source English sentence extracted from a URL pair, we compute its length and look for matches among all the target French sentences extracted from the same URL pair. This results in a one-to-many relationship between the English and French sentences. The length tolerance in this step is set to 2: a French sentence whose length exceeds or falls short of the length of the source English sentence by up to 2 is also included as a match. This step is shown in Figure 2, where the first sentence is the source English sentence and the corresponding French sentences are those whose length is the same as, or within 2 of, the length of the source English sentence.
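A minimal sketch of this length-based selection is given below. It assumes that sentence length is measured in whitespace-separated words; the function name `length_candidates` and the `tolerance` parameter are hypothetical names chosen for illustration.

```python
from typing import List


def length_candidates(english: str, french_sentences: List[str],
                      tolerance: int = 2) -> List[str]:
    """Keep French sentences whose length is within `tolerance` of the English one.

    Length is counted in whitespace-separated words (an assumption of this sketch).
    """
    en_len = len(english.split())
    return [
        fr for fr in french_sentences
        if abs(len(fr.split()) - en_len) <= tolerance
    ]
```

One English sentence can yield several French candidates, which is the one-to-many relationship mentioned above; the later steps narrow the candidates down.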

Dictionary creation using Anymalign Algorithm
WMT2016 provided us with a large English-French parallel corpus. We executed the Anymalign algorithm on this corpus to obtain word alignments. Alignments with a matching probability greater than or equal to 0.75 were kept, since a higher probability indicates a better translation. The rest of the alignments were discarded. This data served as our dictionary. A snapshot of the dictionary, with the source English words in the left column and the target French words in the right column, is shown in Figure 3.
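The thresholding step can be sketched as follows. The sketch assumes the aligner output has already been parsed into (English word, French word, probability) triples; it does not parse Anymalign's own output format, and `build_dictionary` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_dictionary(alignments: Iterable[Tuple[str, str, float]],
                     threshold: float = 0.75) -> Dict[str, Set[str]]:
    """Map each English word to the French words aligned with it.

    Only alignments with probability >= threshold are kept; the rest are dropped.
    Words are lowercased so that later lookups are case-insensitive.
    """
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for english, french, probability in alignments:
        if probability >= threshold:
            dictionary[english.lower()].add(french.lower())
    return dict(dictionary)
```

Storing a set of translations per English word keeps the lookup in the next step simple: a word counts as matched if any of its translations occurs in the French sentence.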

Sentence matching using dictionary
For each word in the source English sentence, its translations are looked up in the dictionary produced in the previous step. These translations are then matched against the words of the candidate French sentences obtained using the length criterion of Gale and Church. French sentences whose number of matched words equals the length of the source English sentence, or falls short of it by up to 2, are kept and the rest are discarded. For example, for an English sentence of 10 words, the French translations of each English word are looked up in the dictionary. A French sentence in which all 10 words match the translated words is kept. A French sentence of 10 words in which only 8 words match the translated words is also kept. French sentences with fewer matches are discarded. This process is shown in Figure 4.
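The filtering rule can be sketched as below. It assumes a dictionary that maps a lowercased English word to a set of French words, and it counts an English word as matched when at least one of its translations occurs in the French sentence; that counting rule and the names `count_matches`, `dictionary_filter` and `slack` are assumptions of this sketch.

```python
from typing import Dict, List, Set


def count_matches(english: str, french: str,
                  dictionary: Dict[str, Set[str]]) -> int:
    """Count English words that have at least one translation in the French sentence."""
    french_words = set(french.lower().split())
    matches = 0
    for word in english.lower().split():
        if dictionary.get(word, set()) & french_words:
            matches += 1
    return matches


def dictionary_filter(english: str, candidates: List[str],
                      dictionary: Dict[str, Set[str]],
                      slack: int = 2) -> List[str]:
    """Keep candidates whose match count is at most `slack` below the English length."""
    en_len = len(english.split())
    return [
        fr for fr in candidates
        if count_matches(english, fr, dictionary) >= en_len - slack
    ]
```

With `slack = 2`, an English sentence of 10 words keeps French candidates with 8 or more matched words, in line with the example above.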

Exact Text Translation finding with Cosine Similarity
For each of the French sentences retained in the previous step, the Cosine Similarity with respect to the source English sentence is computed. The French sentence with the highest Cosine Similarity score is selected as the exact translation of the source English sentence. This process is shown in Figure 5.

Cosine Similarity
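Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]. The formula used in our approach is as follows.
\[ \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \]
where A and B are the vector representations of the source English sentence and one of the target French sentences, respectively.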
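A minimal sketch of the scoring and selection is shown below. It represents each sentence as a bag-of-words count vector and assumes that the English sentence has already been mapped to French tokens through the dictionary; how the two languages are brought into a common vector space is not specified here, so that mapping, and the names `cosine_similarity` and `best_match`, are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a_tokens: List[str], b_tokens: List[str]) -> float:
    """Cosine similarity between two token lists, using word-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence is treated as having no similarity
    return dot / (norm_a * norm_b)


def best_match(source_tokens: List[str], candidates: List[str]) -> str:
    """Return the candidate sentence with the highest cosine similarity."""
    return max(
        candidates,
        key=lambda sentence: cosine_similarity(source_tokens, sentence.lower().split()),
    )
```

Because word counts are non-negative, the score stays within [0, 1], which matches the bound stated above.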

URL matching
The URL of the source English sentence is then matched to the URL of the selected French sentence using the IDs assigned in the first step. We can see from Figure 5 that the French sentences compared with the source English sentence "Message to Gilbert Blin" have their cosine similarity scores appended to them. From the figure we see that "Présenté par Gilbert Blin" has the highest cosine similarity score, so it can be treated as the exact translation of the source English sentence. We also see that the ID "(1)" is appended to the English sentence and the ID "(6)" is appended to the French sentence. From Figure 1, we find that the English sentence with ID "(1)" belongs to the webpage "http://academiedesprez.org/mailgb.php" and the French sentence with ID "(6)" belongs to the webpage "http://academiedesprez.org/eng/musicales5eng.htm". Thus, we can mark this pair of URLs as the exact alignment.
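The ID-based lookup, together with the 1-1 rule of the shared task, can be sketched as follows. The greedy strategy (best score first, each URL used at most once) is an assumption of this sketch rather than a description of our implementation, and `align_urls` is a hypothetical name. Input triples are (English ID, French ID, similarity score).

```python
from typing import Dict, Iterable, List, Tuple


def align_urls(scored_pairs: Iterable[Tuple[int, int, float]],
               id_to_url: Dict[int, str]) -> List[Tuple[str, str]]:
    """Turn scored (english_id, french_id, score) triples into 1-1 URL pairs.

    Pairs are visited from the highest score down, and a pair is accepted
    only if neither of its IDs has been used already (greedy 1-1 matching).
    """
    used_english, used_french = set(), set()
    aligned: List[Tuple[str, str]] = []
    for en_id, fr_id, _score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if en_id in used_english or fr_id in used_french:
            continue
        used_english.add(en_id)
        used_french.add(fr_id)
        aligned.append((id_to_url[en_id], id_to_url[fr_id]))
    return aligned
```

For the example above, a triple with English ID 1 and French ID 6 resolves to the two academiedesprez.org URLs through the `id_to_url` table built during extraction.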

Evaluation
WMT2016 provided us with a baseline system that extracts 119,979 pairs after enforcing the 1-1 rule. Our proposed system, when executed on the test data, extracted 48 pairs of URLs after enforcing the 1-1 rule. This gave our proposed system a percent recall of 1.998335.
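The recall figure can be checked with a short calculation. The sketch assumes that percent recall is the number of extracted pairs counted against the reference set, divided by the number of reference pairs, times 100. The size of the reference set is not stated here; the value used below is back-computed from the two reported numbers and should be read as an inference, not as a reported figure.

```python
def percent_recall(found_pairs: int, reference_pairs: int) -> float:
    """Percent recall: share of the reference pairs that were found, times 100."""
    return 100.0 * found_pairs / reference_pairs


# Back-compute the reference-set size implied by 48 pairs at 1.998335 percent.
implied_reference_pairs = round(48 / (1.998335 / 100))

print(implied_reference_pairs)
print(round(percent_recall(48, implied_reference_pairs), 6))
```

Under these assumptions the implied reference set contains about 2,402 pairs, and 48 of 2,402 reproduces the reported 1.998335.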

Systems             Extracted pairs
WMT2016 Baseline    119,979
Proposed System     48
Percent Recall      1.998335

Table 1: Evaluation of the proposed system against the baseline system provided by WMT2016.

Conclusion
This paper presents a hybrid approach to bilingual document alignment for the shared task proposed by WMT2016. We have developed an approach that uses the concept given by Gale and Church regarding the lengths of source and translated sentences, translation of words using a dictionary created by Anymalign, and the concept of Cosine Similarity. Our approach was able to extract 48 pairs of URLs with a percent recall of 1.998335.