JU-USAAR: A Domain Adaptive MT System



Introduction
Statistical Machine Translation (SMT) is the currently dominant MT technology. The underlying statistical models in SMT tend to closely approximate the empirical distributions of the bilingual training data and the monolingual target-language text. However, the performance of SMT systems quickly degrades when testing conditions deviate from training conditions. To achieve optimal performance, an SMT system should be trained on data from the same domain as the test data. Domain adaptation has therefore gained interest in SMT as a way to cope with this performance drop. The basic aim of domain adaptation is to maintain the identity of the in-domain data while using the best of the out-domain data. However, a large amount of additional out-domain data may bias the resultant distribution towards the out-domain. In practice, it is often difficult to obtain a sufficient amount of in-domain parallel data to train a system that performs well in a specific domain. The performance of an in-domain model can be improved by selecting a subset of the out-domain data that is very similar to the in-domain data (Matsoukas et al., 2009; Moore and Lewis, 2010), or by re-weighting the probability distributions (Foster et al., 2006; Sennrich et al., 2013) in favor of the in-domain data.
In this task, the information technology (IT) domain English-German parallel corpus released in the WMT-2016 IT-domain shared task serves as the in-domain data, and the Europarl, News and Common Crawl English-German parallel corpora released in the Translation Task are treated as out-domain data.
In this paper we describe the joint submission of Jadavpur University (JU) and Saarland University (USAAR), the JU-USAAR English-German machine translation (MT) system, to the shared task on IT domain translation organized in WMT-2016. In our approach, we first applied a data selection method in which we directly measured cross-entropy on the source side of the text; we then applied the data selection method of Moore and Lewis (2010) and ranked the out-domain bilingual parallel data according to cross-entropy difference. Finally, we built domain-specific language models on both the in-domain and the selected out-domain target-language monolingual corpora and linearly interpolated them, choosing weights that minimize perplexity on a held-out in-domain development set. In addition, we interpolated the translation models trained on the in-domain and selected out-domain parallel corpora. For the data selection itself, instead of using bilingual cross-entropy difference, we applied bilingual cross-perplexity difference. Koehn (2004; 2005) first proposed domain adaptation in SMT by integrating terminological lexicons in the translation model, which resulted in a significant reduction in word error rate (WER). Over the last decade, many researchers (Foster and Kuhn, 2007; Duh et al., 2010; Banerjee et al., 2011; Bisazza and Federico, 2012; Sennrich, 2012; Sennrich et al., 2013; Haddow and Koehn, 2012) have investigated the problem of combining multi-domain datasets.

Related Work
To construct a good domain-specific language model, sentences which are similar to the target domain should be included (Sethy et al., 2006) in the monolingual target-language corpus on which the language model is trained. Lü et al. (2007) identified such sentences using the tf/idf method and increased their counts.
Domain adaptation in MT has been explored in many different directions, ranging from adapting language models and translation models to alignment adaptation approaches that improve domain-specific word alignment. Koehn et al. (2007) used multiple decoding paths for combining multiple domain-specific translation tables in the state-of-the-art PB-SMT decoder MOSES. Banerjee et al. (2013) combined an in-domain model (translation and reordering model) with an out-of-domain model in MOSES; they derived log-linear features to distinguish between phrases of multiple domains by applying data-source indicator features and showed a modest improvement in translation quality. Bach et al. (2008) suggested that sentences may be weighted by how closely they match the target domain. A comparison of different domain adaptation methods for different subject matters in patent translation was carried out by Ceauşu et al. (2011), which led to a small gain over the baseline.
In order to select supplementary out-of-domain data relevant to the target domain, a variety of criteria have been explored ranging from information retrieval techniques to perplexity on in-domain datasets. Banerjee et al. (2011) proposed a prediction based data selection technique using an incremental translation model merging approach.

Data Selection Approach
Among the different approaches proposed for data selection, the two most popular and successful methodologies are based on monolingual cross-entropy difference (Moore and Lewis, 2010) and bilingual cross-entropy difference (Axelrod et al., 2011). The data selection approach taken in the present work is also motivated by bilingual cross-entropy difference (Axelrod et al., 2011) based data selection. However, instead of using bilingual cross-entropy difference, we applied bilingual cross-perplexity difference to model our data selection process. The difference in cross-entropy is computed on two language models (LM): the domain-specific LM (lm_in) is estimated from the entire in-domain corpus and the second LM (lm_o) is estimated from the out-domain corpus. Mathematically, the cross-entropy H(P_lm) of language model probability P_lm is defined as in Equation 1, considering a k-gram language model over a text w_1 ... w_N of N words.
$$H(P_{lm}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_{lm}(w_i \mid w_{i-k+1}, \ldots, w_{i-1}) \qquad (1)$$
We calculated the perplexity (PP = 2^H) of individual sentences of the out-domain corpus with respect to the in-domain LM and the out-domain LM for both the source (sl) and target (tl) languages.
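The score, i.e., the sum of the two cross-perplexity differences, for the j-th sentence pair [s_j − t_j] is calculated based on Equation 2.
$$score_j = \left[ PP_{in}^{sl}(s_j) - PP_{o}^{sl}(s_j) \right] + \left[ PP_{in}^{tl}(t_j) - PP_{o}^{tl}(t_j) \right] \qquad (2)$$
Subsequently, sentence pairs [s − t] from the out-domain corpus (o) are ranked based on this score.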
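The scoring and ranking step can be sketched as follows. This is a minimal illustration, not the system's actual code: `UnigramLM` is a hypothetical add-one-smoothed unigram model standing in for the 5-gram models used in the experiments, `rank_pairs` is an illustrative name, and sorting in ascending order (lower score meaning closer to the in-domain data) is an assumption that follows the usual Moore and Lewis convention.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram LM (assumption: stand-in for a real k-gram LM)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves probability mass for unseen tokens

    def perplexity(self, sentence):
        """Sentence-level perplexity PP = 2^H, with H the per-token cross-entropy in bits."""
        toks = sentence.split()
        if not toks:
            return float("inf")
        log2_sum = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab)) for t in toks
        )
        return 2 ** (-log2_sum / len(toks))


def rank_pairs(pairs, lm_in_src, lm_out_src, lm_in_tgt, lm_out_tgt):
    """Rank out-domain (source, target) pairs by the sum of the two
    cross-perplexity differences; most in-domain-like pairs come first
    (assumed ordering)."""

    def score(pair):
        s, t = pair
        return (lm_in_src.perplexity(s) - lm_out_src.perplexity(s)) + (
            lm_in_tgt.perplexity(t) - lm_out_tgt.perplexity(t)
        )

    return sorted(pairs, key=score)
```

In practice the four language models would be trained on the full in-domain and out-domain corpora for each language side, and the top of the ranked list would be taken as additional training material.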

Interpolation Approach
To combine multiple translation and language models, a common approach is to linearly interpolate them. The language model interpolation weights are learnt automatically by minimizing the perplexity on the development set. For interpolating the translation models, we use the Moses training pipeline, which selects the interpolation weights that optimize performance on the development set. These weights are subsequently used to combine the individual feature values for every phrase pair from the two phrase tables (i.e., the in-domain phrase table p_in(e|f) and the out-domain phrase table p_o(e|f)) using the formula in Equation 3, where f and e are source and target phrases respectively and the value of λ ranges between 0 and 1.
$$p(e \mid f) = \lambda \, p_{in}(e \mid f) + (1 - \lambda) \, p_{o}(e \mid f) \qquad (3)$$
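A minimal sketch of linear interpolation with a weight chosen to minimize development-set perplexity is given below. It is illustrative only and does not reproduce the Moses pipeline: the function names are hypothetical, the weight is found by a simple grid search (a swapped-in technique chosen for brevity), and all per-token probabilities are assumed to be strictly positive, as they would be for smoothed models.

```python
import math


def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * p_in + (1 - lam) * p_out."""
    return lam * p_in + (1.0 - lam) * p_out


def dev_perplexity(probs_in, probs_out, lam):
    """Perplexity of the interpolated model, given the probability each
    component model assigns to every token of the development set."""
    log2_sum = sum(
        math.log2(interpolate(a, b, lam)) for a, b in zip(probs_in, probs_out)
    )
    return 2 ** (-log2_sum / len(probs_in))


def best_lambda(probs_in, probs_out, steps=100):
    """Grid search over lam in [0, 1] for the lowest development perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda lam: dev_perplexity(probs_in, probs_out, lam))


def interpolate_phrase_tables(table_in, table_out, lam):
    """Combine two phrase tables mapping (source, target) -> probability.
    A phrase pair missing from one table contributes probability 0 there."""
    keys = set(table_in) | set(table_out)
    return {
        k: interpolate(table_in.get(k, 0.0), table_out.get(k, 0.0), lam)
        for k in keys
    }
```

For example, with `lam = 0.75` a phrase pair scored 0.8 in the in-domain table and 0.4 in the out-domain table receives 0.75 × 0.8 + 0.25 × 0.4 = 0.7.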

Experiments and Results
We first collect all the domain-specific corpora and clean them. We also use out-of-domain data to improve the performance of the in-domain MT system. The following subsections describe the datasets used for the experiments, the detailed experimental settings and a systematic evaluation on both the development set and the test set.

Datasets
In-domain Data: The detailed statistics of the in-domain data are reported in Table 1. We considered all the data provided by the WMT-2016 organizers for the IT translation task. We combined all data and performed cleaning in two steps: (i) Cleaning-1: following the cleaning process described in (Pal et al., 2015), and (ii) Cleaning-2: using the Moses (Koehn et al., 2007) corpus cleaning scripts with the minimum and maximum number of tokens set to 1 and 80 respectively. Additionally, 1000 sentences are used as the development set ('Batch1' in Table 3) and another 1000 sentences are used as the development test set ('Batch2' in Table 3).
Out-domain Data: We utilized all the parallel training data provided by the WMT-2016 shared task organizers for the English-German translation task. The out-of-domain training data includes Europarl, News Commentary and Common Crawl. This corpus is noisy and contains some non-German as well as non-English words and sentences. Therefore, we applied a language identifier (Shuyo, 2010) to both the bilingual English-German parallel data and the monolingual German corpora. We discarded those parallel sentences from the bilingual training data which the language identifier detected as belonging to a different language. The same method was also applied to the monolingual data. Subsequently, the corpus cleaning process was carried out, first by calculating the global mean ratio of the number of characters in a source sentence to that in the corresponding target sentence and then by filtering out sentence pairs that exceed or fall below 20% of the global ratio (Tan and Pal, 2014). Tokenization and punctuation normalization were performed using Moses scripts. In the final step of cleaning, we filtered the parallel training data on a maximum allowable sentence length of 80 and a sentence length ratio of 1:2 (either direction). Approximately 36% of the sentences were removed from the total training data during the cleaning process. Table 2 shows the out-domain data statistics after filtering.
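The character-ratio filter can be sketched as below. The description leaves the exact criterion open, so this sketch adopts one reading as an assumption: a sentence pair is kept only if its source-to-target character ratio lies within ±20% of the global mean ratio. The function name and the `tolerance` parameter are hypothetical.

```python
def filter_by_char_ratio(pairs, tolerance=0.2):
    """Keep (source, target) pairs whose character-length ratio is within
    +/- tolerance of the global mean ratio (assumed reading of the filter)."""
    pairs = [(s, t) for s, t in pairs if s and t]  # drop pairs with an empty side
    if not pairs:
        return []
    # Global mean of len(source) / len(target) over the whole corpus.
    mean_ratio = sum(len(s) / len(t) for s, t in pairs) / len(pairs)
    low = mean_ratio * (1.0 - tolerance)
    high = mean_ratio * (1.0 + tolerance)
    return [(s, t) for s, t in pairs if low <= len(s) / len(t) <= high]
```

Because the threshold is relative to a corpus-wide mean, a few extreme pairs can shift the mean itself; on large corpora this effect is small, but it is worth keeping in mind for small samples.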

Experimental Settings
We used the standard log-linear PB-SMT model for our experiments. All the experiments were carried out using a maximum phrase length of 7 for the translation model and 5-gram language models. Word alignment between EN and DE was trained with the Berkeley Aligner (Liang et al., 2006). The phrase-extraction heuristics of Koehn et al. (2003) were used to build the phrase-based SMT systems. The reordering model was trained with the hierarchical, monotone, swap, left to right bidirectional (hier-mslr-bidirectional) method (Galley and Manning, 2008) and conditioned on both the source and target languages. The 5-gram language models were built using KenLM (Heafield, 2011). Phrase pairs that occur only once in the training data are assigned an unduly high probability mass (i.e., 1). To alleviate this shortcoming, we smoothed the phrase table using the Good-Turing smoothing technique (Foster et al., 2006). System tuning was carried out using Minimum Error Rate Training (MERT) (Och, 2003) on a held-out development set (Batch1 in Table 3) of 1,000 sentences provided by the WMT-2016 task organizers. After the parameters were tuned, decoding was carried out on the held-out development test set (Batch2 in Table 3) as well as on the test set released by the shared task organizers. We evaluated the systems using three well-known automatic MT evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and TER (Snover et al., 2006). The evaluation results of our baseline systems trained on in-domain and out-domain data are reported in Table 3.

Result and Analysis
We made several attempts to enhance the quality of translation for the English-German IT domain translation task. Figure 1 shows how the data selection method helps to improve the in-domain baseline system by incrementally adding a subset of data from the out-domain corpus as additional training material.
We applied the bilingual cross-perplexity difference based method (cf. Section 3.1) to rank the out-domain sentences according to their proximity to the in-domain data; from this ranking we incrementally selected the top-ranking sentence pairs and added them as additional training material to our in-domain training set. We trained the incremental in-domain PB-SMT models iteratively, each iteration adding a batch of the 100K top-ranked parallel sentences from the remaining ranked out-domain data. The iterative process is stopped when the learning curve falls in two successive iterations. BLEU is the objective function for the learning curve experiment. Finally, we selected 400K sentence pairs as additional training material from the entire out-domain data, as this provided the best BLEU result on the development test set. The rest of our experiments are carried out with this 400K additional training data. Therefore, our submitted JU-USAAR system is built on 440,780 in-domain sentence pairs as well as 400K additional sentence pairs selected from the out-domain parallel corpus.
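The stopping rule of the learning curve experiment can be sketched as follows. This is an illustration under stated assumptions: `evaluate` is a hypothetical callback that trains a system with `n` batches of additional data and returns its BLEU score on the development test set, and `select_batch_count` and `patience` are illustrative names.

```python
def select_batch_count(evaluate, max_batches, patience=2):
    """Add ranked batches one at a time and stop once BLEU has fallen in
    `patience` successive iterations; return the best batch count seen
    and its BLEU score."""
    best_n, best_bleu = 0, evaluate(0)  # n = 0 is the in-domain-only baseline
    prev, drops = best_bleu, 0
    for n in range(1, max_batches + 1):
        bleu = evaluate(n)
        if bleu > best_bleu:
            best_n, best_bleu = n, bleu
        # Count consecutive iterations in which the curve went down.
        drops = drops + 1 if bleu < prev else 0
        if drops >= patience:
            break
        prev = bleu
    return best_n, best_bleu
```

With a batch size of 100K sentence pairs, a returned batch count of 4 would correspond to 400K additional sentence pairs.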
We made use of the out-domain data selected by the data selection method (Moore and Lewis, 2010; Axelrod et al., 2011) using simple merging as well as an interpolation technique (Sennrich, 2012).
Linear interpolation with instant weighting (Sennrich, 2012) was used for interpolating the translation and language models.
Our baseline system was trained on the in-domain English-German parallel corpus containing 440,780 sentence pairs. As reported in Table 4, the baseline system obtained a BLEU score of 20 and a TER of 68.7 on the test set. We developed two different systems.
System1: System1 is trained on the 440,780 in-domain sentence pairs combined with the additional 400K parallel sentences selected from the out-domain dataset. This system produced a BLEU score of 31.9 and a TER of 66.6 on the test set, which are far better than the baseline scores.
System2: System2 uses exactly the same amount of training data as System1; however, instead of simply merging the two datasets (440,780 in-domain and 400K selected out-domain sentence pairs), separate translation models and language models are built on each dataset and they are interpolated based on instant weighting. Before decoding, we forced the decoder to avoid translating URLs. System2 resulted in 34.5 BLEU (14.5 absolute and 72.5% relative improvement over the baseline) and 54.0 TER (14.7 absolute and 21.4% relative improvement over the baseline). System2 represents our primary submission.
Table 4: Systematic evaluation on test set
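The absolute and relative improvements quoted for System2 follow directly from the baseline and System2 scores. The short check below recomputes them; the helper name `improvements` is illustrative, and TER is treated as an error rate for which lower is better.

```python
def improvements(baseline, system, lower_is_better=False):
    """Return (absolute, relative %) improvement over the baseline,
    each rounded to one decimal place."""
    absolute = (baseline - system) if lower_is_better else (system - baseline)
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# BLEU: baseline 20, System2 34.5 (higher is better)
bleu_gain = improvements(20.0, 34.5)
# TER: baseline 68.7, System2 54.0 (lower is better)
ter_gain = improvements(68.7, 54.0, lower_is_better=True)
print(bleu_gain, ter_gain)  # (14.5, 72.5) (14.7, 21.4)
```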

Conclusions and Future Work
The JU-USAAR system employs two techniques for improving the performance of MT in the English-German translation task for the IT domain. We used a bilingual cross-perplexity difference based data selection method and carried out learning curve experiments to identify additional "in-domain like" training material from the out-domain dataset. We made use of the selected additional training data using both simple merging and interpolation. Simple merging yielded significant improvements over the baseline, while linear interpolation of the translation and language models with instant weighting produced further improvements. Our primary submission (data selection and interpolation based model combination) resulted in 14.5 absolute and 72.5% relative improvement in BLEU and 14.7 absolute and 21.4% relative improvement in TER over the baseline system trained on just the in-domain dataset.