Machine Translation with parfda, Moses, kenlm, nplm, and PRO

We build parfda Moses statistical machine translation (SMT) models for most language pairs in the news translation task. We experiment with a hybrid approach that integrates neural language models into Moses. We report statistics of the constrained data for the machine translation task, the coverage of the test sets, and upper bounds on the translation results. We also contribute a new testsuite for the German-English language pair and a new automated key phrase extraction technique for evaluating the testsuite translations.


Introduction
Parallel feature decay algorithms (parfda) (Biçici, 2018) is an instance selection tool we use to select training and language model instances for building Moses (Koehn et al., 2007) phrase-based machine translation (MT) systems to translate the test sets in the news translation task at WMT19 (Bojar et al., 2019). The importance of parfda increases with the growing size of the parallel and monolingual data available for building SMT systems. Last year, parfda phrase-based SMT obtained the 2nd best results on a testsuite in the English-Turkish language pair (Biçici, 2018) when generating translations of key phrases that are important for conveying the meaning. In light of this evidence, we obtain phrase-based Moses results and extend the system with a neural LM in addition to the n-gram LM that we use. We experiment with the neural probabilistic LM (NPLM) (Vaswani et al., 2013). We record the statistics of the data and the resources used.
Our contributions are:
• a test suite for machine translation that is outside the domain of the news task, offering a closer look at the current status of the SMT technology used by the task participants when translating 38 sentences about international relations concerning cultural artifacts,
• parfda Moses phrase-based MT results and data statistics for the following translation directions:
  - English-Czech (en-cs),
  - English-Finnish (en-fi), Finnish-English (fi-en),
  - English-German (en-de), German-English (de-en),
  - English-Kazakh (en-kk), Kazakh-English (kk-en),
  - English-Lithuanian (en-lt), Lithuanian-English (lt-en),
  - English-Russian (en-ru), Russian-English (ru-en),
• upper bounds on the translation performance using lowercased coverage to identify which models used data in addition to the parallel corpus.
The sections that follow discuss the instance selection model (Section 2), the machine translation model (Section 3), the testsuite used for evaluating MT in en-de and de-en, and the results.

Table 1: Statistics for the training and LM corpora in the constrained (C) setting compared with the parfda selected data. #words is in millions (M) and #sents in thousands (K). tcov is target 2-gram coverage.

Table 2: Constrained training data lowercased source feature coverage (scov) and target feature coverage (tcov) of the test set for n-grams.
Instance Selection with parfda

parfda parallelizes feature decay algorithms (FDA) (Biçici and Yuret, 2015), a class of instance selection algorithms that decay feature weights, for fast deployment of accurate SMT systems. Figure 1 depicts the parfda Moses SMT workflow.
We use the test set source sentences to select the training data and the target side of the selected training data to select the LM data. We decay the weights of both the source features of the test set and the target features that we have already selected in order to increase diversity. We select about 2.2 million instances for the training data and about 12 million sentences for each LM data, not including the selected training set, which is added later. Table 1 shows the size differences with the constrained dataset (C); the selected data is available at https://github.com/bicici/parfdaWMT2019. We use 3-grams to select the training data and 2-grams to select the LM data, and we split hyphenated words using the "-a" option of the tokenizer used in Moses (Sennrich et al., 2017). tcov lists the target coverage in terms of the 2-grams of the test set. The maximum sentence length is set to 126. Table 2 lists the lowercased coverage of the test set by the constrained training data of WMT19.
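As a rough illustration of the selection idea (not parfda's actual parallelized implementation), the sketch below greedily scores candidate sentences by the summed weights of the test-set features they contain and decays each feature's weight once a selected sentence covers it, so later selections favor features not yet covered. The function name `fda_select`, the candidate format, and the decay factor 0.5 are illustrative assumptions.

```python
def fda_select(test_features, candidates, n_select, decay=0.5):
    """Greedy feature-decay instance selection (toy sketch).

    test_features: set of n-grams from the test set source side
    candidates: list of (sentence, feature_set) training pairs
    Covered features have their weights multiplied by `decay`,
    pushing later picks toward diverse coverage of the test set.
    """
    weights = {f: 1.0 for f in test_features}
    pool = list(candidates)
    selected = []
    for _ in range(min(n_select, len(pool))):
        # score each remaining candidate by summed feature weights
        best_i, best_score = None, -1.0
        for i, (sent, feats) in enumerate(pool):
            score = sum(weights.get(f, 0.0) for f in feats)
            if score > best_score:
                best_i, best_score = i, score
        sent, feats = pool.pop(best_i)
        selected.append(sent)
        for f in feats:  # decay the weights of the covered features
            if f in weights:
                weights[f] *= decay
    return selected
```

In this toy setting, a sentence covering many still-unseen test-set features is picked first, after which its features contribute half as much to subsequent scores.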

Machine Translation with Moses, kenlm, nplm, and PRO
We train a 6-gram LM using kenlm (Heafield et al., 2013). For word alignment, we use mgiza (Gao and Vogel, 2008), where the GIZA++ (Och and Ney, 2003) parameters set max-fertility to 10 and the number of iterations to 7,5,5,5,7 for IBM models 1, 2, 3, 4, and the HMM model, and we learn 50 word classes in three iterations with the mkcls tool during training. We use the "-mbr" option when decoding the test set. The development set [...] (Biçici, 2018). This allows us to find parameters whose tuning score reaches within 1% of the best tuning parameter set score in only 4 iterations, but we still run tuning for 21 iterations. Truecasing updates the casing of words according to their most common form. We truecase the text before building the SMT model as well as after decoding and then detruecase before preparing the translation, which provided better results than simply detruecasing after decoding (Biçici, 2018). We trained the nplm LM for 10 epochs. We also experimented with bilingual nplm, which uses nplm in a bilingual setting with both the source and the target context and builds a LM on the training set (Devlin et al., 2014). Both nplm and bilingual nplm can be used with Moses as a feature within its configuration file. On average, the results in Table 3 show that using only nplm decreases the scores and that improvements are obtained when both nplm and kenlm are used. However, the gain from splitting hyphenated words is larger, and it is a less computationally demanding option: building the kenlm model takes about 20 minutes, whereas building a single nplm model took us 11.5 to 14.25 days, about 1000 times longer, and the model takes about 56 GB of disk space.
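The "most common form" rule behind truecasing can be sketched in a few lines. This is a minimal illustration of the idea, not the Moses truecaser itself; `train_truecaser` and `truecase` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def train_truecaser(corpus):
    """Learn the most frequent surface form for each lowercased word."""
    forms = defaultdict(Counter)
    for sent in corpus:
        for tok in sent.split():
            forms[tok.lower()][tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in forms.items()}

def truecase(sentence, model):
    """Map each token to its most common casing; keep unknown tokens as-is."""
    return " ".join(model.get(t.lower(), t) for t in sentence.split())
```

Applying the learned model after decoding restores, for example, the capitalization of proper nouns that the SMT pipeline lowercased.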

Translation Upper Bounds with tcov
We obtain upper bounds on the translation performance based on the target coverage (tcov) of the n-grams of the test set found in the selected parfda training data using lowercased text. For a given sentence T, we identify the number of OOV tokens, where |T| is the number of tokens in the sentence. We obtain each bound using 500 such instances and repeat 10 times. The tcov BLEU bound is optimistic since it does not consider reorderings in the translation or differences in sentence length. Each plot in Figure 2 locates the tcov BLEU bound obtained from each n-gram and from the n-gram tcovs combined up to and including n, together with the parfda result and the top constrained result. Based on the distance between the top BLEU result and the bound, we obtain a sorting of the difficulty of the translation directions in Table 5.
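A minimal sketch of such a coverage-based bound, restricted to unigrams for simplicity: every token found in the selected training data is assumed to be translated correctly, so only OOV tokens cost anything. The actual bound uses n-gram tcov, BLEU, and resampling over 500-instance sets, so `oov_count` and `tcov_bound` here are only illustrative simplifications.

```python
def oov_count(sentence, covered_tokens):
    """Number of tokens in sentence T not found in the training data."""
    return sum(1 for t in sentence.split() if t not in covered_tokens)

def tcov_bound(sentences, covered_tokens):
    """Optimistic unigram coverage bound over a sample of sentences:
    the fraction of tokens covered by the selected training data."""
    tokens = [t for s in sentences for t in s.split()]
    hits = sum(1 for t in tokens if t in covered_tokens)
    return hits / len(tokens)
```

Because reordering and length differences are ignored, this number can only overestimate what any model trained on the same data could reach.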

German-English Testsuite
We prepared a MT test suite that is outside the domain of the news translation task to take a closer look [...]. Table 10 lists the results in terms of BLEU (Papineni et al., 2002) and F1 (Biçici, 2011) scores. However, such automatic evaluation metrics treat the features or n-grams equivalently or group them based on their length, without knowledge about their frequency in use or their significance in conveying the meaning. Word order within a sentence does not contain the majority of the information (Landauer, 2002) for vocabulary size |V| ≥ n, where n is the average sentence length. A sentence of n = 25 words with |V| = 10^5, or equivalently n = 10 phrases with |V| = 10^7, n = 50 BPE tokens with |V| = 10^4, or n = 125 characters with |V| = 25, receives differing contributions to its information content in bits from token order versus token choice (Table 6). If we use keyword subsequences for F1-based evaluation, we would cover about 91% of the information in a sentence. Key phrase identification is important since, when scores are averaged, an important phrase that is missing decreases the score by only 1/(|p| N_{|p|}) in the BLEU calculation for a phrase of length |p| over N_{|p|} phrases of length |p|. We extend our evaluation of the testsuite translations using keywords (Biçici, 2018).
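The order-versus-choice comparison can be reproduced with a short calculation: the information from token choice is n·log2|V| bits, while the information from token order is at most log2(n!) bits. The script below computes both for the four representations mentioned; it is an illustrative calculation under these two standard formulas, not the exact accounting behind Table 6.

```python
import math

def choice_bits(n, vocab):
    """Information from token choice: n independent picks from |V| symbols."""
    return n * math.log2(vocab)

def order_bits(n):
    """Information from token order: log2(n!) possible permutations."""
    return math.log2(math.factorial(n))

for n, v in [(25, 10**5), (10, 10**7), (50, 10**4), (125, 25)]:
    total = choice_bits(n, v) + order_bits(n)
    print(f"n={n:3d} |V|={v:>8}  choice={choice_bits(n, v):7.1f}  "
          f"order={order_bits(n):7.1f}  order share={order_bits(n) / total:.0%}")
```

For the word-level setting (n = 25, |V| = 10^5), choice contributes about 415 bits while order contributes only about 84, consistent with the claim that word order carries a minority of the information when |V| ≥ n.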
We automate key phrase identification within a reference set of N sentences by selecting among N_X candidate n-grams that:
• are representative and few,
• cover a significant portion of the text,
• are frequent (X_c for the counts of phrases),
• are less likely to be found (X_p for the probability of phrases),
and we formulate the task as a linear program in Table 7 (the objective combines the terms α X_p · X_l, 1/(β X_c + 1), and 1/N_X). We use up to 6-grams and set the minimum coverage of each sentence to 0.5. We removed the stop words 'of', 'the', 'and', 'of the', 'a', and 'an' from the phrases, replaced those parts with '.*?', and obtained regular expressions. The key phrases we obtain are listed in Table 9 and are used for evaluation with the F1 score (Table 10). We plan to extend this work towards more objective key phrase evaluation methods.
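The stop-word-to-wildcard substitution and the resulting phrase matching can be sketched as follows. `phrase_to_regex` and `phrase_hits` are hypothetical helpers, and the actual evaluation uses the F1 formulation rather than a raw hit count; the stop-word list here keeps only the single-word entries from the list above.

```python
import re

STOP = {"of", "the", "and", "a", "an"}  # single-word stop words from the list above

def phrase_to_regex(phrase):
    """Turn a key phrase into a regex, replacing each stop word with a
    non-greedy wildcard '.*?' as described in the text."""
    parts = [r".*?" if w.lower() in STOP else re.escape(w)
             for w in phrase.split()]
    return re.compile(r"\s+".join(parts))

def phrase_hits(key_phrases, translation):
    """Count key phrases whose regex matches the translation; a phrase-level
    F1-style score would be built on top of such matches."""
    return sum(1 for p in key_phrases
               if phrase_to_regex(p).search(translation))
```

This lets a translation such as "return all stolen artifacts" still match the key phrase "return of the artifacts", since the stop words are treated as free slots rather than required tokens.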

Conclusion
We use parfda for building task-specific MT systems that use less computation overall, and we release our engineered data for training MT systems. We also contribute a new testsuite for the German-English language pair.