Robust parfda Statistical Machine Translation Results

We build parallel feature decay algorithms (parfda) Moses statistical machine translation (SMT) models for language pairs in the translation task. parfda obtains results close to the top constrained phrase-based SMT systems, with an average difference of 2.252 BLEU points on WMT 2017 datasets, while using significantly less computation to build the SMT systems than would be spent using all available corpora. We obtain BLEU upper bounds based on target coverage to identify which systems used additional data. We use PRO for tuning to decrease fluctuations in the results and postprocess translation outputs to decrease translation errors due to the casing of words. F1 scores on the key phrases of the English to Turkish test suite that we prepared reveal that parfda achieves the 2nd best results. Truecasing translations before scoring obtained the best results overall.


Introduction
Statistical machine translation is prone to a wide range of errors, including errors in the text such as encoding, tokenization, and morphological variations and the forms they take, as well as errors due to the size of the training and language model datasets used, and model errors. parfda is an instance selection tool based on feature decay algorithms (Biçici and Yuret, 2015) that we use to select training and language model instances to build Moses phrase-based SMT systems to translate the test sets in the news translation task at WMT18 (WMT, 2018). As we work towards tools that can be used for multiple languages at the same time, we aim to obtain robust results for comparison and record the statistics of the data and the resources used. Our contributions are:
• a test suite for machine translation from outside the domain of the news task, giving us the chance to take a closer look at the current status of the SMT technology used by the task participants when translating 10 sentences taken from a literary context in Turkish, which shows that parfda phrase-based SMT can obtain the 2nd best results on this test set,
• parfda results for language pairs in the translation task and data statistics,
• a comparison of processing alternatives for translation outputs to obtain better results,
• upper bounds on the translation performance using lowercased coverage to identify which models used data in addition to the parallel corpus,
• a set of rules that fix tokenization errors that arise in Turkish when using Moses' (Koehn et al., 2007) tokenization scripts.
We obtain parfda Moses phrase-based SMT (Koehn et al., 2007) results for the language pairs in both directions in the WMT18 news translation task, which include English-Czech (en-cs), English-Estonian (en-et), English-German (en-de), English-Finnish (en-fi), English-Russian (en-ru), and English-Turkish (en-tr). Building a language-independent system that performs well across translation tasks is challenging, and the SMT systems participating at WMT18 have largely been built dependent on the translation direction.

parfda
Parallel feature decay algorithms (parfda) (Bicici, 2016) parallelize feature decay algorithms (FDA), a class of instance selection algorithms that use feature decay, for fast deployment of accurate SMT systems. We use parfda to select parallel training data and language model (LM) data for building SMT systems. parfda runs separate FDA5 (Biçici and Yuret, 2015) models on randomized subsets of the available data and combines the selections afterwards. Figure 1 depicts the parfda Moses SMT workflow. The approach has also obtained improvements with NMT (Poncelas et al., 2018).
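The parfda idea can be sketched as follows. This is a simplified illustration only: it uses unigram token features and a fixed decay factor, whereas FDA5 scores weighted n-gram features; the function names and the shard-and-merge strategy shown here are our own reading of the approach.

```python
import random
from collections import Counter

def fda_select(pool, test_features, k, decay=0.5):
    """Greedy feature-decay selection: score each candidate sentence by
    the total weight of its test-set features, then decay the weight of
    every feature covered by the chosen sentence so later picks favor
    still-uncovered features (increasing diversity of the selection)."""
    weights = Counter({f: 1.0 for f in test_features})
    pool, selected = list(pool), []
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: sum(weights[w] for w in set(s.split())))
        pool.remove(best)
        selected.append(best)
        for w in set(best.split()):
            if w in weights:
                weights[w] *= decay  # decay covered features
    return selected

def parfda_select(pool, test_features, k, shards=4, seed=0):
    """parfda-style parallelization: split the pool into randomized
    shards, run FDA on each shard independently (parallelizable), and
    merge the per-shard selections afterwards."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    merged = []
    for i in range(shards):
        merged.extend(fda_select(pool[i::shards], test_features, k // shards))
    return merged
```

Because each shard's FDA run is independent, the selections can be computed concurrently and only the cheap merge step is sequential, which is what makes the deployment fast.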
We obtain transductive learning results since we use the source sentences of the test set to select data. However, decaying only on the source test set features does not necessarily increase diversity on the target side, so we also decay the target-side features of the instances we have already selected. With the new parfda model, we select about 1.7 million instances for the training data and about 15 million sentences for each LM dataset, not including the selected training set, which is added later. Table 1 shows the size differences compared with the constrained dataset (C). We use 3-grams to select training data and 2-grams for LM data. TCOV lists the target coverage in terms of the 2-grams of the test set. We also use CzEng17 (Bojar et al., 2016) for en-cs and SETIMES2 (Tiedemann, 2009) for en-tr.
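Target coverage of the kind listed in Table 1 can be computed as below. This is a minimal sketch under the assumption that TCOV counts test-set n-gram types covered by the selected data; whether the actual measure weights by token frequency is not specified here.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tcov(test_sents, selected_sents, n=2):
    """Target n-gram coverage: the fraction of the test set's n-gram
    types that also occur in the selected data's target side."""
    test_set = {g for s in test_sents for g in ngrams(s.split(), n)}
    sel_set = {g for s in selected_sents for g in ngrams(s.split(), n)}
    return len(test_set & sel_set) / len(test_set) if test_set else 0.0
```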
We set the maximum sentence length to 126 and train a 6-gram LM using kenlm (Heafield et al., 2013). To increase the robustness of the optimization results, we use PRO (Section 2.1) with a varying n-best list size. For word alignment, we use mgiza (Gao and Vogel, 2008), where the GIZA++ (Och and Ney, 2003) parameters set max-fertility to 10 and the number of iterations to 7,3,5,5,7 for IBM models 1, 2, 3, 4, and the HMM model, and we learn 50 word classes in three iterations with the mkcls tool during training. The development set contains up to 4000 sentences randomly sampled from previous years' development sets (2011-2017), and the remaining sentences come from the development set for WMT18. Table 2 lists the coverage of the test set.

Robust Optimization Results with PRO
Pairwise ranking optimization (PRO) (Hopkins and May, 2011) is found to obtain scores that monotonically increase, with results that are at least as good as MERT (Och, 2003) and with a standard deviation three times lower than MERT's. We use PRO for tuning to obtain robust results, given the fluctuating scores we observe with MERT. The PRO tuning performance graph is compared with the MERT performance plot in Figure 2. We used a monotonically increasing n-best list size at the start to increase robustness, using multiples of 50 until the 8th iteration, 350 at every 10th iteration, and 150 in the remaining iterations. We need only 4 iterations to find parameters whose tuning score comes within 1% of the best tuning parameter set score (Figure 3).
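The n-best list size schedule described above can be written out as follows. This is our reading of the schedule (e.g. that "multiples of 50" means 50 times the iteration index); the actual implementation may differ.

```python
def nbest_size(iteration):
    """n-best list size per tuning iteration: multiples of 50 up to the
    8th iteration, 350 on every 10th iteration, 150 otherwise."""
    if iteration <= 8:
        return 50 * iteration   # grows: 50, 100, ..., 400
    if iteration % 10 == 0:
        return 350              # periodic larger list
    return 150                  # steady-state size
```

Growing the list early exposes the optimizer to more hypotheses while the weights are still far from convergence, which is the stated robustness motivation.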

Testsuite for en-tr and tr-en
We prepared an SMT test suite from outside the domain of the news translation task to take a closer look at the current status of the SMT technology used by the task participants when translating 10 sentences taken from a literary context in Turkish. The sentences and their translations are provided in Appendix A. Table 3 details the testsuite results on en-tr and tr-en, where the best translations of parfda are selected based on their BLEU (Papineni et al., 2002) and F1 (Biçici, 2011) scores:
en-tr: lctc 1 align
en-tr ts: lctc 1 align
tr-en: tc 2 align
tr-en ts: tc 1 align
where tc and lctc are defined in Section 2.3.
We count a token of the translation as a non-translation when it is found in the test source, is not a number or punctuation, and is considered by the SMT model's phrase table or lexical translation table to be a token whose translation differs from the source token. We have access to the lexical tables of the parfda SMT models, and among the tr-en lctc entries (Table 4), 2.7% contain a translation identical to the source. According to the testsuite results using translations from the task participants, only RWTH and parfda contained non-translations, and RWTH had only a single non-translated token. The scores for up to n-grams in Table 12 show that alibaba.5744 achieves the best results in en-tr and online-B achieves the best results in tr-en in all scores. When we look at some of the OOV tokens in en-tr, we observe that lowercasing and then truecasing might help.
We identified 5 key phrases for both en-tr and tr-en that we would like to see translated correctly (Table 5). Some are trimmed to bring them closer to their root form so that suffixes can be added without decreasing identification rates. Appendix A presents F1 scores based on the identification of these phrases in the translations. We see that even though parfda achieves the lowest BLEU scores, on the key phrases it provides the 2nd best results in en-tr among 9 models and the 4th best among 6 in tr-en. Key phrase identification is important since, when scores are averaged, a missing important phrase decreases the score by only 1/(|p| N_|p|) in the BLEU calculation, for a phrase of length |p| over N_|p| phrases of length |p|.

Table 1: Statistics for the training and LM corpora in the constrained (C) setting compared with the parfda selected data. #words is in millions (M) and #sents in thousands (K). TCOV is target 2-gram coverage.

Comparing Text Processing Settings for SMT
The experiment management system (EMS) (Koehn, 2010) of Moses prepares translations as follows: truecase input → translate input → clean output (XML tags) → detruecase output. Truecasing updates the casing of words according to the most common form observed in the whole training corpus. EMS does not truecase the translations of an SMT model when the training data are already truecased. However, each casing of a word is a different entry in the phrase table, and the casing we are interested in might be missing from the translations. Therefore, truecasing (tc) before detruecasing makes sense.
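The core of truecasing as described above can be sketched as follows. This is a simplified illustration, not Moses' truecase.perl, which additionally conditions on sentence-initial position and handles delayed sentence starts.

```python
from collections import Counter, defaultdict

def train_truecaser(corpus_sents):
    """Learn each word's most frequent surface casing from the corpus,
    keyed by the word's lowercased form."""
    counts = defaultdict(Counter)
    for sent in corpus_sents:
        for tok in sent.split():
            counts[tok.lower()][tok] += 1
    return {lc: forms.most_common(1)[0][0] for lc, forms in counts.items()}

def truecase(sent, model):
    """Replace each token with its most common training-corpus form;
    unseen tokens are left unchanged."""
    return " ".join(model.get(tok.lower(), tok) for tok in sent.split())
```

Because the model maps from lowercased keys, applying it after lowercasing (the lctc setting below) collapses all casing variants of a token to one canonical form.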
The casing of the text affects the number of tokens in the datasets. One casing of a token might appear in the phrase table but not its lowercased (lc) version. In EMS, truecasing is applied to the input. We experiment with truecasing lowercased text (lctc) to decrease the number of out-of-vocabulary words in the translations and to reduce the number of unique n-grams, the dataset sizes, and the binary LM size by about 2%.
We process tokenized Turkish text using a set of rules since Moses' (Koehn et al., 2007) tokenization scripts can encounter tokenization errors in Turkish. A simpler approach was also tried for fixing the tokenization of Turkish by removing the space for unbalanced single quotes (Ding et al., 2016). Additionally, we retain the casing of the test source sentences using the word alignment information (Ding et al., 2016). Using alignment information is more complicated since not all alignments are 1-to-1. We also experiment with finding the casing of the input words in the development and test sets according to the form found in the translation tables and replacing them before decoding. Figure 4 compares the tc and lctc approaches to text processing for SMT. Both can use the alignment information for casing words. Table 6 compares the results using translations that contain the alignment information and the unknown words, where tc 0 is the baseline. The additional Moses decoder parameter is --print-alignment-info. We obtain the highest en-tr score using the alignments for casing, but scores decrease for en-de and de-en. The translation directions for which it helps can be seen in the lctc 0 row. The difference between the base and the lowercased results is the gain we can achieve if we fix the casing accordingly. Starting from the tc translation, the gain on average is about 1.1 BLEU points (0.011 BLEU). The best setting overall is tc 2. The largest room for improvement with the lctc lc BLEU results is for cs-en and tr-en.

Table 6: parfda tokenized and cased results with different text processing settings. The baseline is tc 0 (in italics). Bold marks the best result for a translation direction.

parfda results at WMT18 are in Table 7, using BLEU over tokenized text. We compare with the top constrained submissions at WMT18 in Table 7 and at WMT17 in Table 8.
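As an illustration of the kind of tokenization repair involved (our own example; the paper's actual rule set is not reproduced here), an apostrophe that the tokenizer has split off from a Turkish proper-noun suffix can be rejoined:

```python
import re

def fix_turkish_apostrophe(tokenized):
    """Rejoin a standalone apostrophe between two word tokens, e.g.
    "Ankara ' da" -> "Ankara'da", so the suffixed proper noun is one
    token again. Illustrative rule, not the paper's exact rule set."""
    return re.sub(r"(\w) ' (\w)", r"\1'\2", tokenized)
```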
Performance compared with the top constrained (TopC) phrase-based SMT systems improved to a 2.252 BLEU point difference in 2017, from a 3 BLEU point average difference for the WMT16 results, which is likely due to the new parfda model and phrase-based SMT being less common in 2017. The parfda Moses SMT system comes within 0.6 BLEU points of the top result in Finnish to English translation in 2017. All top models use NMT in 2018 and most use backtranslations, which means that their TCOV is upper bounded by the LM TCOV.

Translation Upper Bounds with TCOV
We obtain upper bounds on the translation performance based on the target coverage (TCOV) of n-grams of the test set found in the selected parfda training data (Bicici, 2016), but using lowercased text this time. For a given sentence T, the number of OOV tokens is identified as:

OOV_r = round((1 − TCOV) * |T|)    (1)

where |T| is the number of tokens in the sentence. We obtain each bound using 500 such instances and repeat 10 times. The TCOV BLEU bound is optimistic since it does not consider reorderings in the translation or differences in sentence length. Each plot in Table 9 locates the TCOV BLEU bound obtained from each n-gram order and from n-gram TCOVs combined up to and including n, together with the parfda result and the top constrained result. In en-de and en-tr, the top model achieves a higher score than the TCOV BLEU bound, which indicates that data additional to the constrained training data was used. In both, backtranslations were used.

2 Due to different tokenization rules used by mteval-v14.pl in matrix.statmt.org, parfda BLEU scores are higher than the scores in Table 6.
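Equation (1) and the construction of an instance for the bound can be sketched as below. The masking function is our illustrative reading of how an "optimistic translation" is built: copy the reference and replace OOV_r randomly chosen tokens with an unknown-word placeholder before scoring.

```python
import random

def oov_count(tokens, tcov):
    """Equation (1): number of OOV tokens implied by target coverage."""
    return round((1.0 - tcov) * len(tokens))

def mask_oov(tokens, tcov, rng=random):
    """Optimistic bound instance: the reference itself with OOV_r
    randomly chosen tokens replaced by <unk>. Ignores reordering and
    length differences, hence the bound is optimistic."""
    masked = list(tokens)
    for i in rng.sample(range(len(tokens)), oov_count(tokens, tcov)):
        masked[i] = "<unk>"
    return masked
```

Scoring 500 such masked copies against their references with BLEU, repeated 10 times, yields the TCOV BLEU bound plotted in Table 9.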

Conclusion
We use parfda for selecting instances for building SMT systems using less computation overall, and the results at WMT18 provide new data on using the current phrase-based SMT technology towards rapid SMT system development. Our data processing experiments show that lowercasing and then truecasing data can improve SMT models and translation results, provided that we can find the casing correctly, and that truecasing translations before scoring can improve the results. Our method of tuning with PRO provides robust results, and the BLEU bounds we obtain show which systems used additional training data. We are often interested in conserving the semantic content in the translations, and parfda Moses phrase-based SMT achieves the 2nd best results on the en-tr testsuite in our evaluations with key phrases.

A en-tr and tr-en Testsuite Sentences

After using Latin alphabet for more than eighty years we can say that Turkish writing has been traditionalized against some Latin sourced problems.

7. Without a doubt, Turkish Language Institution's work for 80 years has played an important role in this.

8. The task of preparing, writing, and distributing writing manual is given to Turkish Language Institution according to the ç subitem of the 10th item in 664 numbered decree law based on the 134th item in the constitution.

9. Since its establishment, Turkish Language Institution has been trying to fulfill its duty on determining writing rules and publication of writing manuals.

10. Turkish Language Institution Turkish Dictionary and Writing Manual Working Group has diligently worked on each writing rule and writing to end discussions in writing and to spread writing rules and styles that everybody will accept and use.