BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora

A Statistical Machine Translation (SMT) system is always trained using a large parallel corpus to produce effective translations. Not only is such a corpus scarce, building one also involves a lot of manual labor and cost. A parallel corpus can be prepared by employing comparable corpora, where a pair of corpora in two different languages points to the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We propose a system that first translates the sentences, relying heavily on Moses, and then groups the sentences based on sentence-length similarity. Finally, one-to-one sentence selection is done based on the Cosine Similarity algorithm.


Introduction
Statistical Machine Translation (SMT) analyzes the output of human translators using statistical methods and extracts information about the translation process from corpora of translated texts. SMT has shown good results for many language pairs and is responsible for the recent surge in popularity of Machine Translation among the research communities. But, for an SMT system to work efficiently, it has to be fed with a large parallel corpus for producing a high-quality phrase table and translation models (Brown et al., 1991; Church et al., 1993; Dagan et al., 1999). Since the availability of a large parallel corpus is an issue for low-resourced languages, and building one from scratch involves high manual labor and cost (Tan and Pal, 2014; Mahata et al., 2016), a lot of research has gone into the concept of building parallel corpora from comparable corpora (Jagarlamudi et al., 2011; Kay and Roscheisen, 1993; Kupiec, 1993; Lardilleux et al., 2012). A comparable corpus is a pair of monolingual corpora in the same domain, where the sentences in the two corpora are not aligned. The proposed work deals with identifying parallel sentences from such a comparable corpus, provided by the BUCC 2017 shared task. The sample, training and test data contain monolingual corpora split into sentences, in the format "utf-8 text, with UNIX end-of-lines; identifiers are made of a two-letter language code + 9 digits, separated by a dash '-'". The algorithm of the proposed work has been constructed primarily using the Moses toolkit (Koehn, 2015), which has been fed with a parallel corpus from Europarl, with French as the source language and English as the target language. Also, similarity based on sentence length has been used for the preliminary alignment, because equivalent sentences in a comparable corpus may roughly correspond with respect to length. The Cosine Similarity algorithm was used for the final alignment.
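The corpus format described above (an identifier made of a two-letter language code plus nine digits, followed by the sentence) can be parsed with a few lines of Python. This is a minimal sketch, assuming a tab separator between the identifier and the sentence text, which the format description does not spell out:

```python
import re

# Match a BUCC-style line: "xx-ddddddddd<TAB>sentence".
# The tab separator is an assumption, not stated in the format description.
ID_PATTERN = re.compile(r"^([a-z]{2}-\d{9})\t(.*)$")

def read_corpus(lines):
    """Return a dict mapping sentence_id -> sentence text."""
    corpus = {}
    for line in lines:
        match = ID_PATTERN.match(line.rstrip("\n"))
        if match:
            sent_id, text = match.groups()
            corpus[sent_id] = text
    return corpus

sample = ["fr-000000001\tBonjour le monde .",
          "fr-000000002\tDeuxième phrase ."]
corpus = read_corpus(sample)
```

Keeping the corpus as a dictionary keyed by sentence_id makes it easy to drop the identifiers before translation and restore them afterwards.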
Section 2 discusses the proposed algorithm in detail, followed by results and discussion in Sections 3 and 4, respectively.

Building baseline Statistical Translation Model
Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair, when trained with a large collection of translated texts (parallel corpus). Once the model has been trained, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. For the given system, Moses was trained with French (Fr) as the source language and English (En) as the target language. The En-Fr parallel corpus that was used to train Moses was downloaded from the Europarl Corpus. The language model training of Moses was done by concatenating the English corpus of Europarl and the English text of the test data provided by BUCC 2017. The French corpus from the given test data was taken and the sentences were extracted, barring the sentence_id's. The extracted French sentences were then fed to Moses to get translated English sentences as output. An example of this process is shown in Figure 1. The segregated sentence_id's from the previous step were then appended to the translated English sentences. An example of this process is shown in Figure 2.
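The bookkeeping around the Moses run can be sketched as follows. This is a minimal illustration, assuming id/sentence pairs as input; the `translate` step, which in the actual system is the trained Moses decoder, is represented here by a stand-in:

```python
# Strip the sentence_id's before translation, then re-append them
# to the translated output, as described in the section above.
def strip_ids(id_sentence_pairs):
    ids = [sid for sid, _ in id_sentence_pairs]
    sentences = [text for _, text in id_sentence_pairs]
    return ids, sentences

def reattach_ids(ids, translated):
    return list(zip(ids, translated))

pairs = [("fr-000000001", "Bonjour ."), ("fr-000000002", "Merci .")]
ids, sents = strip_ids(pairs)
# Stand-in for the Moses decoder; the real system pipes `sents` through Moses.
translated = [s.upper() for s in sents]
result = reattach_ids(ids, translated)
```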

Sentence similarity based on sentence length
Gale and Church (1991) in their paper proposed a system for aligning corresponding sentences in parallel corpora, based on the principle that equivalent sentences should roughly correspond in length; that is, longer sentences in one language should correspond to longer sentences in the other language. This idea forms the basis of our preliminary alignment system, which tries to align sentence pairs based on their length. We found the length of each translated English sentence and searched for matches among the sentences of the English text from the test data. This results in a one-to-many relationship between the translated English sentences and the English sentences of the test data. The variance in this step is kept as 4, which means that if the length of an English sentence of the test data exceeds or falls behind that of the translated sentence by up to 4, it is also included. This step is done to reduce the time complexity of the Cosine Similarity search algorithm. An example of this step is shown in Figure 4.
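The length-based candidate filtering can be sketched as below. This is a minimal sketch under two assumptions not stated in the text: length is counted in whitespace tokens (the paper does not say whether words or characters are used), and the variance of 4 is an absolute difference in that length:

```python
VARIANCE = 4  # maximum allowed length difference (assumption: token count)

def length(sentence):
    return len(sentence.split())

def candidates(translated_sentence, english_sentences):
    """Keep test-data sentences within VARIANCE of the translated length."""
    t_len = length(translated_sentence)
    return [s for s in english_sentences
            if abs(length(s) - t_len) <= VARIANCE]

english = ["a b c",
           "a b c d e f g h i j k l",
           "one two three four"]
cands = candidates("x y z w", english)  # length 4; keeps lengths 0..8
```

Only the surviving candidates are passed to the Cosine Similarity stage, which is what reduces its search time.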

Final alignment using Cosine Similarity Algorithm
Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]. The formula used in our approach is as follows:

cos(θ) = (A · B) / (||A|| ||B||)    (1)

where "A" and "B" are the translated English sentence and one of the English sentences from the test data found using the preliminary alignment system, respectively. One sentence from the translated English corpus is taken and matched with the selected sentences of the English corpus from the test data, using the Cosine Similarity algorithm. The sentence pair with the highest Cosine Similarity value is considered as the final alignment. The sentence_id's of the selected sentence pair are extracted and given as output. An example of the output format is shown in Figure 3.
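The final alignment step can be sketched as follows. This is a minimal illustration, assuming each sentence is represented as a bag-of-words count vector; the paper does not specify the vectorization used:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences as token-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(translated, candidates):
    """Pick the candidate with the highest cosine similarity."""
    return max(candidates, key=lambda s: cosine(translated, s))

best = best_match("the cat sat",
                  ["a dog ran", "the cat sat down", "hello"])
```

Since token counts are non-negative, every similarity falls in [0, 1], matching the bounded positive space mentioned above.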

Evaluation
BUCC 2017 provided us with an evaluation script and gold standard data to calculate the Precision, Recall and F-Score. This is shown in Figure 5. The calculation was done using the values TP, FP and FN, where TP (true positive) is a pair of sentences that is present in the gold standard, FP (false positive) is a pair of sentences that is not present in the gold standard, and FN (false negative) is a pair of sentences that is present in the gold standard but absent from the system output. We submitted 38,736 sentence pair alignments. Table 1 shows the Evaluation Results.
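The metric computation described above can be sketched as set operations over alignment pairs. This is a minimal sketch of the standard definitions, not the BUCC evaluation script itself:

```python
def evaluate(submitted, gold):
    """Precision, Recall and F-Score from TP, FP and FN pair counts."""
    submitted, gold = set(submitted), set(gold)
    tp = len(submitted & gold)  # submitted pairs present in the gold standard
    fp = len(submitted - gold)  # submitted pairs absent from the gold standard
    fn = len(gold - submitted)  # gold pairs the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

gold = [("fr-000000001", "en-000000007"), ("fr-000000002", "en-000000003")]
submitted = [("fr-000000001", "en-000000007"), ("fr-000000009", "en-000000001")]
p, r, f = evaluate(submitted, gold)
```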

Discussion
We tested the proposed approach by training Moses for translating English to French as well. The English data from the test data corpus was translated to French. After preliminary alignment, Cosine Similarity was sought between the translated French sentences and the French corpus of the test data. After testing the system with the gold standard, we found only one match among the 20,779 pairs we submitted. Table 2 shows the Second Evaluation Results. As a future prospect, we would like to align the sentences based on Named-Entity and Edit distance approaches.


Conclusion
The paper proposes a hybrid approach for sentence alignment in comparable corpora. The Moses toolkit was used for building the baseline translation system, along with sentence-length similarity and the Cosine Similarity algorithm. The evaluation of the proposed method yielded a Precision of 0.0261, Recall of 0.1118 and F-Score of 0.0423.