ParFDA for Instance Selection for Statistical Machine Translation



Abstract
We build parallel feature decay algorithms (ParFDA) Moses statistical machine translation (SMT) systems for all language pairs in the translation task at the first conference on statistical machine translation (WMT16) (Bojar et al., 2016a). ParFDA obtains results close to the top constrained phrase-based SMT systems, with an average difference of 2.52 BLEU points, using significantly less computation for building SMT systems than would be spent using all available corpora. We obtain BLEU bounds based on target coverage and show that ParFDA results can be improved by 12.6 BLEU points on average. Similar bounds show that the top constrained SMT results at WMT16 can be improved by 8 BLEU points on average, while the German to English and Romanian to English translation results are already close to the bounds.

ParFDA
ParFDA is a parallel implementation of feature decay algorithms (FDA), a class of instance selection algorithms that use feature decay, developed for fast deployment of accurate SMT systems. We use ParFDA for selecting parallel training data and language model (LM) data for building SMT systems. ParFDA runs separate FDA5 (Biçici and Yuret, 2015) models on randomized subsets of the available data and combines the selections afterwards. ParFDA allows rapid prototyping of SMT systems for a given target domain or task. FDA pseudocode is in Figure 1.

Figure 1: The Feature Decay Algorithm: inputs are a sentence pool U, test set features F, and the number of instances to select N; a priority queue Q stores sentences S with scores that sum feature values fval.

This year, we have kept a record of which 1-grams or 2-grams of the test set have already been included, so that an instance is included only if it covers features not found so far, and we also use numeric expression identification with regular expressions to replace numeric expressions with a label (Biçici, 2016) before instance selection. We run ParFDA SMT experiments using Moses (Koehn et al., 2007) for all language pairs in both directions in the WMT16 translation task (Bojar et al., 2016a): English-Czech (en-cs), English-German (en-de), English-Finnish (en-fi), English-Romanian (en-ro), English-Russian (en-ru), and English-Turkish (en-tr).
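The selection procedure described above can be sketched in code. This is a minimal, self-contained sketch of greedy FDA selection with lazy priority-queue re-scoring and the numeric-label preprocessing step; the decay factor of 0.5, the length normalization, and the `<NUM>` pattern are illustrative assumptions, not the exact FDA5 settings or the regular expressions of (Biçici, 2016):

```python
import re
from collections import defaultdict
from heapq import heappush, heappop

# Hypothetical numeric-expression pattern; the exact regular expressions
# used in (Bicici, 2016) are not given in this section.
NUM = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(sentence):
    """Replace numeric expressions with a single label before selection."""
    return NUM.sub("<NUM>", sentence)

def features(tokens, max_order=3):
    """All contiguous n-grams of the sentence up to max_order."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_order + 1)
            for i in range(len(tokens) - n + 1)}

def fda_select(pool, test_set, n_select, max_order=3, decay=0.5):
    """Greedy FDA-style selection with lazy priority-queue updates."""
    test_feats = set()
    for sent in test_set:
        test_feats |= features(mask_numbers(sent).split(), max_order)
    cnt = defaultdict(int)  # times each feature occurs in selected data

    def score(toks):
        # Each feature's value decays by `decay` per prior selection;
        # the sum is length-normalized (an assumed normalization).
        s = sum(decay ** cnt[f]
                for f in features(toks, max_order) if f in test_feats)
        return s / max(len(toks), 1)

    heap = []
    for i, sent in enumerate(pool):
        toks = mask_numbers(sent).split()
        heappush(heap, (-score(toks), i, toks))

    selected = []
    while heap and len(selected) < n_select:
        neg, i, toks = heappop(heap)
        cur = score(toks)  # re-score: decayed features may have lowered it
        if heap and cur < -heap[0][0]:
            heappush(heap, (-cur, i, toks))  # stale score, re-queue
            continue
        selected.append(pool[i])
        for f in features(toks, max_order):
            if f in test_feats:
                cnt[f] += 1
    return selected
```

The lazy re-queueing mirrors the priority queue Q in Figure 1: a sentence is only re-scored when it reaches the top of the queue, so most of the pool is never touched after initialization.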

ParFDA Moses SMT Experiments
The importance of ParFDA increases with the proliferation of training resources available for building SMT systems. Compared with WMT15 (Bojar et al., 2015), WMT16 saw a significant increase in the monolingual and parallel training data made available. Table 1 presents the statistics of the available training and LM corpora for the constrained (C) systems in WMT16 (Bojar et al., 2016a) as well as the statistics of the ParFDA-selected subsets of the training and LM data from C. TCOV lists the target coverage in terms of the 2-grams of the test set. Compared with last year, this year we do not use the Common Crawl parallel corpus except for en-ru. We use the Common Crawl monolingual corpus for fi, ro, and tr, and we extend the LM corpora with previous years' corpora. We also use CzEng16pre (Bojar et al., 2016b) for en-cs.
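TCOV as used above is the fraction of test-set target-side n-grams (2-grams in Table 1) that also occur in the selected data. A minimal sketch of that computation, assuming whitespace tokenization:

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def tcov(test_sents, selected_sents, n=2):
    """Target coverage: fraction of test-set n-grams (2-grams here)
    that also occur in the selected corpus."""
    test_feats = set().union(*(ngrams(s.split(), n) for s in test_sents))
    sel_feats = set().union(*(ngrams(s.split(), n) for s in selected_sents))
    return len(test_feats & sel_feats) / len(test_feats)
```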
We have increased the size of the selected training data to about 1.6 million instances to help reduce out-of-vocabulary (OOV) items. Except for translation directions involving Romanian and Turkish, this is a larger training set than in the ParFDA experiments in 2015, where we obtained the top translation error rate (TER) performance in French to English translation using 1.261 million training sentences. Due to the presence of peaks in SMT performance with increasing training set size (Biçici and Yuret, 2015), increasing the training set size need not improve the performance. We select about 15 million sentences for each LM, not including the selected training set, which is added later. Table 1 shows the significant size differences between the constrained dataset (C) and the ParFDA-selected data. We use 3-grams for selecting training data and 2-grams for LM corpus selection. Task-specific data selection also improves the LM perplexity; the performance of the selected LM can be observed in Table 4.
We truecase all of the corpora, set the maximum sentence length to 126, use 150-best lists during tuning, set the LM order to 6 for all language pairs, and train the LM using KENLM (Heafield et al., 2013). For word alignment, we use mgiza (Gao and Vogel, 2008), where the GIZA++ (Och and Ney, 2003) parameters set max-fertility to 10 and the number of iterations to 7, 3, 5, 5, 7 for IBM models 1, 2, 3, 4 and the HMM model, and we learn 50 word classes in three iterations with the mkcls tool during training. The development set contains up to 5000 sentences randomly sampled from previous years' development sets (2011-2015); the remaining sentences come from the development set for WMT16.
ParFDA Moses SMT results for each translation direction at WMT16 are in Table 2, using BLEU over cased text and F1 (Biçici, 2011). We compare ParFDA results with the top constrained submissions at WMT16 in Table 3. The average difference to the top constrained (TopC) submission in WMT16 is 5.26 BLEU points, whereas the difference was 3.2 BLEU points in WMT15. Performance compared with the TopC phrase-based SMT improved over the WMT15 results, with a 2.52 BLEU points difference on average, which is likely due to selecting more training data.
We observe that various systems in TopC used character-level split and merge operations (referred to as BPE, or byte pair encoding) combined with neural networks (Sennrich et al., 2016); see for instance the en-de translation results at matrix.statmt.org/matrix/systems_list/1840. We also compare ParFDA results with the TopC BPE systems; the average difference is 5.86 BLEU points, although some translation directions did not contain BPE results. WMT15 did not contain any submission with BPE. The average difference between TopC BPE and TopC phrase-based results hints that the majority of the increased performance difference is due to improvements obtained by BPE in the TopC BPE results.

Table 4 compares the perplexity of the ParFDA-selected LM with a LM trained on the ParFDA-selected training data and a LM trained using all of the available training corpora, and shows reductions in the number of OOV tokens of up to 45% and in the perplexity of up to 45%. Table 4 also presents the average log probability of tokens and the log probability that KENLM assigns to the <unk> token. The increase in the ratio between them in the last column shows that OOV tokens under the ParFDA LM are not only fewer but also less likely at the same time.

Table 4: Perplexity comparison of the LM built from the training corpus (train), the ParFDA-selected training data (FDA5 train), and the ParFDA-selected LM data (FDA5 LM). %red is the proportion of reduction and prob. is short for probability.
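The %red columns of Table 4 report relative reduction with respect to the baseline LM trained on all available corpora; a one-line sketch of that quantity, assuming the standard definition of relative reduction:

```python
def pct_reduction(baseline, value):
    """%red: relative reduction of `value` w.r.t. `baseline`, in percent.
    E.g. a perplexity drop from 200 to 110 is a 45% reduction."""
    return 100.0 * (baseline - value) / baseline
```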

Translation Upper Bounds with TCOV
In this section, we obtain upper bounds on the translation performance based on the target coverage (TCOV) of the n-grams of the test set found in the selected ParFDA training data. We obtain translations based on TCOV by randomly replacing some number of tokens from a given sentence with a fixed OOV label, proportional to TCOV, starting from 1-grams. After OOVs for 1-grams are identified, OOV tokens for n-grams up to 5-grams are identified and BLEU is calculated with respect to the original. If the overall number of OOVs obtained before i-grams is enough to obtain the i-gram TCOV, then OOV identification for i-grams is skipped. The number of OOV tokens for a given sentence T is identified by one of two possible functions, a floor-based OOV_f and a rounding-based OOV_r:

OOV_f(T) = ⌊(1 − TCOV) |T|⌋,    OOV_r(T) = round((1 − TCOV) |T|),

where |T| denotes the length of the sentence in the number of tokens.
We obtain each bound using 10000 such instances and repeat this 10 times. This TCOV BLEU bound is optimistic since it does not consider reorderings in the translation or differences in sentence length. Each plot in Tables 6 and 7 locates the TCOV BLEU bound obtained from each n-gram order and from n-gram orders combined up to and including n, and locates the ParFDA Moses SMT performance. Table 5 compares the TCOV BLEU bounds with the ParFDA results and TopC from Table 3 and shows potential improvements in the translation performance for all translation directions at WMT16 and overall on average. Results in bold are close to the OOV_r TCOV BLEU bound, which indicates that the TopC translation results for the de-en and ro-en directions are able to obtain results close to this bound.
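The masking step behind the bound can be sketched as follows. This assumes the rounding-based OOV_r variant for the number of replaced tokens (the exact functional forms were partly lost in this copy of the text); BLEU of many such masked instances against the originals would then yield the TCOV BLEU bound:

```python
import random

def mask_oov(tokens, tcov, rng, label="<OOV>"):
    """Replace round((1 - TCOV) * |T|) randomly chosen tokens with a fixed
    OOV label (the rounding-based OOV_r variant; OOV_f would use floor)."""
    k = round((1.0 - tcov) * len(tokens))
    out = list(tokens)
    for i in rng.sample(range(len(out)), k):
        out[i] = label
    return out

def bound_sample(sentence, tcov, seed=0):
    """One masked instance; averaging BLEU over many such instances
    (10000 instances, repeated 10 times in the paper) gives the bound."""
    rng = random.Random(seed)
    return mask_oov(sentence.split(), tcov, rng)
```

Because only token identity changes, not order or length, BLEU against the original is an optimistic ceiling, as noted above.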

Conclusion
We use ParFDA for selecting instances for building SMT systems using less computation overall than would be spent using all available corpora while still achieving SMT performance that is close to the top performing phrase-based SMT systems. The ParFDA results at WMT16 provide new results using the current phrase-based SMT technology towards rapid SMT system development in budgeted training scenarios. ParFDA works towards the development of task- or data-adaptive SMT solutions using specially moulded data rather than general purpose SMT systems built with a patchwork approach combining various sources of information and several processing steps. We obtain BLEU bounds based on target coverage and show that the top constrained results can be improved by 8 BLEU points on average, with results close to the bound for the de-en and ro-en translation directions. Similar bounds show that ParFDA results can be improved by 12.6 BLEU points on average.