Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced from Comparable Corpora

Resources for non-English languages are scarce. This paper addresses the problem in the context of machine translation by automatically extracting parallel sentence pairs from the multilingual articles available on the Internet. We use an end-to-end Siamese bidirectional recurrent neural network to extract parallel sentences from comparable multilingual articles in Wikipedia. We then show that using the harvested dataset improves BLEU scores on both NMT and phrase-based SMT systems for the low-resource language pairs English–Hindi and English–Tamil, compared to training exclusively on the limited bilingual corpora collected for these language pairs.


Introduction
Both neural and statistical machine translation approaches rely heavily on the availability of large amounts of data and are known to perform poorly in low-resource settings. Recent crowdsourcing efforts and workshops on machine translation have produced small amounts of parallel text for building viable machine translation systems for low-resource pairs (Post et al., 2012). However, these systems have been shown to suffer from low accuracy (incorrect translations) and low coverage (high out-of-vocabulary rates) due to insufficient training data. In this paper, we try to address the high out-of-vocabulary (OOV) rates in low-resource machine translation systems by leveraging the increasing amount of multilingual content available on the Internet to enrich the bilingual lexicon.
Comparable corpora such as Wikipedia are collections of topic-aligned but non-sentence-aligned multilingual documents, and they are rich resources for extracting parallel sentences. For example, Figure 1 shows that there are equivalent sentences on the pages about Donald Trump in Tamil and English, and the phrase alignment for an example sentence is shown in Table 2. Table 1 shows that there are at least tens of thousands of bilingual articles on Wikipedia, which could potentially contain at least as many parallel sentences. These could be mined to address the scarcity of parallel sentences indicated in column 2 of the table, which shows the number of sentence pairs in the largest available bilingual corpora for xx-en. As shown by Irvine and Callison-Burch (2013), this data sparsity can be addressed by extending the scarce parallel sentence pairs with those automatically extracted from Wikipedia, thereby improving the performance of statistical machine translation systems.
In this paper, we propose a neural approach to parallel sentence extraction and compare the BLEU scores of machine translation systems with and without the extracted sentence pairs to demonstrate the effectiveness of this method. Compared to previous approaches, which require specialized metadata from document structure or a significant number of hand-engineered features, the neural model for extracting parallel sentences is learned end-to-end using only a small bootstrap set of parallel sentence pairs.

Related Work
A lot of work has been done on the problem of automatic sentence alignment from comparable corpora. The majority of these approaches (Abdul-Rauf and Schwenk, 2009; Irvine and Callison-Burch, 2013; Yasuda and Sumita, 2008) use a pre-existing translation system as a precursor to ranking the candidate sentence pairs, which low-resource language pairs do not have the luxury of having. Others use statistical machine learning approaches, in which a maximum entropy classifier relies on surface-level features such as word overlap to obtain parallel sentence pairs (Munteanu and Marcu, 2005). In contrast, the deep neural network model used in our paper is probably the first of its kind to need neither feature engineering nor a pre-existing translation system.
Munteanu and Marcu (2005) proposed a parallel sentence extraction system that used comparable corpora from newspaper articles to extract parallel sentence pairs. In this procedure, a maximum entropy classifier is trained to score the sentence pairs drawn from the Cartesian product of a pair of documents, after they have been passed through a sentence-length ratio filter to obtain candidate sentence pairs. SMT systems were trained on the extracted sentence pairs using additional features from the comparable corpora, such as distortion and the position of the current and previously aligned sentences. This resulted in a state-of-the-art approach with respect to the translation performance of low-resource languages.
Similar to our proposed approach, Barrón-Cedeño et al. (2015) showed how using parallel documents from Wikipedia for domain-specific alignment would improve the translation quality of SMT systems on in-domain data. In this method, the similarity between all pairs of cross-language sentences is estimated with different text similarity measures. The issue of domain definition is overcome by using IR techniques, which use the characteristic vocabulary of the domain to query a Lucene search engine over the entire corpus. The candidate sentences are defined based on word overlap, and the decision whether a sentence pair is parallel or not is made using the maximum entropy classifier. The difference in BLEU scores between out-of-domain and domain-specific translation is shown clearly using the word embeddings from the characteristic vocabulary extracted using the additional extracted bitexts.
Abdul-Rauf and Schwenk (2009) extract parallel sentences without the use of a classifier. Target-language candidate sentences are found using the translation of the source-side comparable corpora. Sentence tail removal is used to strip the tail parts of sentence pairs that differ only at the end. Whether the translated source sentence and the candidate target sentence are parallel is determined by measuring the word error rate and the translation error rate, and adding the extracted parallel sentences enhanced the BLEU score. This method succeeds in eliminating the need for domain-specific text by using the target side as a source of candidate sentences. However, this approach is not feasible if there is no good source-side translation system to begin with, as in our case.
Yet another approach that uses an existing translation system to extract parallel sentences from comparable documents was proposed by Yasuda and Sumita (2008). They describe a framework for machine translation using multilingual Wikipedia articles. The parallel corpus is assembled iteratively, by using a statistical machine translation system trained on a preliminary sentence-aligned corpus to compute sentence-level en-jp BLEU scores. After filtering out the unaligned pairs based on the MT evaluation metric, the SMT system is retrained on the filtered pairs.

Approach
In this section, we describe the entire pipeline, depicted in Figure 2, that is used to train a parallel sentence extraction system and to infer and decode high-precision, nearly parallel sentence pairs from bilingual article pages collected from Wikipedia.

Bootstrap Dataset
The parallel sentence extraction system needs a curated sentence-aligned corpus. These sentences were used as the ground-truth pairs when we trained the model to distinguish parallel sentence pairs from non-parallel pairs.

Negative Sampling
The binary classifier described in the next section assigns a translation probability score to a given sentence pair, after learning from positive examples of translations and negative examples of non-translation pairs. For this, we make the simplifying assumption that the parallel sentence pairs found in the bootstrap dataset are unique combinations, so that a source sentence and a target sentence picked at random from the two sides are not translations of each other. Because the negative labels are generated by unsupervised random sampling, there may be some false negatives.
At the beginning of every epoch, we randomly sample m negative target-language sentences for every source sentence. Based on a few experiments and on the literature, we settled on m = 7 as the best-performing value given our compute constraints.
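The sampling step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sample_negatives`, the choice to draw the m negatives without repetition, and the final shuffle are assumptions made for the example. It would be called once per epoch so that each epoch sees fresh negatives.

```python
import random

def sample_negatives(pairs, m=7, rng=None):
    """Build labeled examples: each true pair (label 1) plus m random negatives (label 0).

    pairs: list of (source_sentence, target_sentence) from the bootstrap corpus.
    """
    rng = rng or random.Random()
    n = len(pairs)
    examples = []
    for idx, (src, tgt) in enumerate(pairs):
        examples.append((src, tgt, 1))
        k = min(m, n - 1)
        # Draw k distinct target positions, skipping the aligned one at `idx`.
        for j in rng.sample(range(n - 1), k):
            if j >= idx:
                j += 1
            # May still be a false negative if the corpus has duplicate sentences.
            examples.append((src, pairs[j][1], 0))
    rng.shuffle(examples)
    return examples
```

With n source sentences this yields n(1 + m) labeled pairs per epoch, which is the number of examples the cross-entropy loss is computed over.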

Model
Here, we describe the neural network architecture of Grégoire and Langlais (2017), in which the network learns to estimate the probability that the sentences in a given sentence pair are translations of each other, $p(y_i = 1 \mid s_i^S, s_i^T)$, where $s_i^S$ is the candidate source sentence in the given pair and $s_i^T$ is the candidate target sentence.

Training
As illustrated in Figure 2(d), the architecture uses a Siamese network (Bromley et al., 1994) consisting of a bidirectional RNN (Schuster and Paliwal, 1997) sentence encoder with recurrent units such as long short-term memory units (LSTMs; Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs; Cho et al., 2014). The network learns a vector representation for the source and target sentences, together with the probability that any given pair of sentences are translations of each other. For seq2seq architectures, especially in translation, we have found that the recommended recurrent unit is the GRU, and all our experiments use it rather than the LSTM. The forward RNN reads the variable-length sentence and updates its recurrent state from the first token to the last to create a fixed-size continuous vector representation of the sentence. The backward RNN processes the sentence in reverse.
In our experiments, we use the concatenation of the last recurrent states in both directions as the final sentence representation h, where the recurrence function φ is the gated recurrent unit (GRU). After both source and target sentences have been encoded, we capture their matching information by using their element-wise product and absolute element-wise difference. We estimate the probability that the sentences are translations of each other by feeding the matching vectors into fully connected layers, where σ is the sigmoid function and $W^{(1)}$, $W^{(2)}$, $W^{(3)}$, b and c are model parameters. The model is trained by minimizing the cross entropy of our labeled sentence pairs, where n is the number of source sentences and m is the number of candidate target sentences being considered.
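The matching and scoring steps can be illustrated with a small pure-Python sketch. It is not the authors' implementation and rests on several assumptions: a tanh hidden layer, i.e. $h_i = \tanh(W^{(1)}(h_i^S \odot h_i^T) + W^{(2)}|h_i^S - h_i^T| + b)$ followed by $p = \sigma(W^{(3)} h_i + c)$; $W^{(3)}$ represented as a single weight vector `w3` because the output is a scalar; and a loss averaged over all labeled pairs. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def match_probability(h_src, h_tgt, W1, W2, b, w3, c):
    """Probability that two encoded sentences are translations of each other."""
    prod = [s * t for s, t in zip(h_src, h_tgt)]       # element-wise product
    diff = [abs(s - t) for s, t in zip(h_src, h_tgt)]  # absolute element-wise difference
    # Assumed tanh hidden layer combining both matching vectors.
    hidden = [math.tanh(p + q + bias)
              for p, q, bias in zip(matvec(W1, prod), matvec(W2, diff), b)]
    return sigmoid(sum(w * h for w, h in zip(w3, hidden)) + c)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over the labeled pairs (averaging is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)
```

In a real system the encoder states `h_src` and `h_tgt` would come from the bidirectional GRU and all parameters would be learned jointly by backpropagation; the sketch only shows how a score is computed once the encodings exist.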

Inference
For prediction, a sentence pair is classified as parallel if the probability score is greater than or equal to a decision threshold ρ that we need to fix. We found that to get high precision sentence pairs, we …

Experiments

Dataset
We experimented with two language pairs: English–Hindi (en-hi) and English–Tamil (en-ta). The parallel sentence extraction systems for both en-ta and en-hi were trained using the architecture described in Section 3.2 on the following bootstrap set of parallel corpora:
• An English-Tamil parallel corpus (Ramasamy et al., 2014) containing a total of 169,871 sentence pairs, composed of 3,984,038 English tokens and 2,776,397 Tamil tokens.
• An English-Hindi parallel corpus (Kunchukuttan et al., 2017) containing a total of 1,492,827 sentence pairs, from which a set of 200,000 sentence pairs was picked randomly.
Subsequently, we used the trained model to extract parallel sentences from parallel articles collected from Wikipedia. There were 67,449 bilingual English-Tamil and 58,802 English-Hindi titles in the Wikimedia dumps collected in December 2017.

Evaluation Metrics
To evaluate our sentence extraction models, we manually examined a sample of extracted sentences and performed a qualitative analysis, as there was no gold-standard evaluation set for sentences extracted from Wikipedia. Table 3 shows the qualitative accuracy for some parallel sentences extracted from Tamil. The sentences extracted from Tamil have been translated to English using Google Translate to facilitate comparison with the sentences extracted from English.
For the statistical machine translation and neural machine translation evaluation we use the BLEU score (Papineni et al., 2002) as an evaluation metric, computed using the multi-bleu script from Moses (Koehn et al., 2007).
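For readers unfamiliar with the metric, the following is a simplified sketch of corpus-level BLEU. It is not the Moses multi-bleu script: it assumes whitespace tokenization, exactly one reference per hypothesis, no smoothing, and a score of zero whenever any n-gram order has no matches. The function names are chosen for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Because the n-gram statistics are pooled over the whole test set before the precisions are computed, the result is a corpus-level score and not an average of sentence-level scores.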

Sentence Alignment
Figure 3a shows the number of high-precision sentences that were extracted at ρ = 0.99 without greedy decoding. Greedy decoding can be thought of as sampling without replacement, where a sentence that has already been extracted on one side of the extraction system is precluded from being considered again. Hence, the number of sentences extracted without greedy decoding is an order of magnitude higher than with greedy decoding, as can be seen in Figure 3b.
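The difference between the two extraction modes can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names, the dictionary input format, and the choice to visit candidates in order of decreasing probability are assumptions.

```python
def threshold_extract(scores, rho=0.99):
    """Keep every candidate pair whose probability is at least rho (no greedy decoding).

    scores: dict mapping (source_index, target_index) -> probability.
    """
    return [(s, t, p) for (s, t), p in scores.items() if p >= rho]

def greedy_extract(scores, rho=0.99):
    """Keep at most one pair per source sentence and per target sentence."""
    # Visit candidates from most to least probable (assumed ordering).
    candidates = sorted(
        ((p, s, t) for (s, t), p in scores.items() if p >= rho),
        reverse=True,
    )
    used_src, used_tgt, selected = set(), set(), []
    for p, s, t in candidates:
        if s in used_src or t in used_tgt:
            continue  # one side was already extracted, so skip this pair
        used_src.add(s)
        used_tgt.add(t)
        selected.append((s, t, p))
    return selected
```

Since a single sentence can exceed the threshold with several candidates on the other side, the thresholded set can be much larger than the greedy one, which is consistent with the gap between Figures 3a and 3b.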

Machine Translation
We evaluated the quality of the extracted parallel sentence pairs by performing machine translation experiments on the augmented parallel corpus.

SMT
To build the training data for the machine translation systems, we ranked the sentence pairs by their translation probabilities and used the high-precision sentences extracted with greedy decoding. Phrase-based SMT systems were trained using Moses (Koehn et al., 2007). We used the grow-diag-final-and heuristic for extracting phrases, lexicalised reordering, and Batch MIRA (Cherry and Foster, 2012) for tuning (the default parameters in Moses). We trained 5-gram language models with Kneser-Ney smoothing using KenLM (Heafield et al., 2013). With these parameters, we trained SMT systems for the en-ta and en-hi language pairs, with and without the extracted parallel sentence pairs.

NMT
For training neural machine translation models, we used the TensorFlow (Abadi et al., 2016) implementation of OpenNMT (Klein et al.) with the attention-based Transformer architecture (Vaswani et al., 2017). The BLEU scores for the NMT models were higher than for the SMT models for both the en-ta and en-hi pairs, as can be seen in Table 4.

Conclusion
In this paper, we evaluated the benefits of using a neural network procedure to extract parallel sentences. Unlike traditional systems, which make use of multi-step classification procedures, this method requires just a parallel corpus to extract parallel sentence pairs, using a Siamese BiRNN encoder with GRUs as the recurrent unit.
This method is extremely beneficial for translating language pairs with very little parallel data. As our results show, the extracted parallel sentences facilitate a significant improvement in machine translation quality compared to a generic system.
The experiments cover the English-Tamil and English-Hindi language pairs. Using parallel sentence pairs extracted from comparable corpora with the neural architecture, we demonstrated percentage increases in BLEU scores of 11.03% and 14.7% for the en-ta and en-hi pairs, respectively.
As a follow-up to this work, we plan to compare our framework against the other sentence alignment methods described in Resnik and Smith (2003), Ayan and Dorr (2006), Rosti et al. (2007) and Smith et al. (2010). It is also interesting to note that the 2018 edition of the Workshop on Machine Translation (WMT) has released a new shared task called Parallel Corpus Filtering, where participants develop methods to filter a given noisy parallel corpus (crawled from the web) down to a smaller set of high-quality sentence pairs. This would be a fitting avenue to test the efficacy of our neural network based approach to extracting parallel sentences from unaligned corpora.