Predicting Translation Performance with Referential Translation Machines

Referential translation machines (RTMs) achieve top performance in both bilingual and monolingual settings without accessing any task- or domain-specific information or resource. RTMs achieve the 3rd system results for German to English sentence-level prediction of translation quality and the 2nd system results according to root mean squared error. In addition to the new features about substring distances, punctuation tokens, character n-grams, and alignment crossings, and additional learning models, we average prediction scores from different models using weights based on their training performance for improved results.


Introduction
The quality estimation task (QET) at WMT17 (Bojar et al., 2017) (QET17) is about predicting the quality of machine translation output at the sentence level (Task 1), word level (Task 2), and phrase level (Task 3) in the IT and pharmaceutical domains without using reference translations. Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators if needed. RTMs can model different domains and tasks while achieving top performance in both monolingual (Biçici and Way, 2015) and bilingual settings (Biçici, 2016b). We develop RTM models for all three subtasks of QET17, which include the English to German (en-de) and German to English (de-en) translation directions. Task 1 is about predicting HTER (human-targeted translation edit rate) scores (Snover et al., 2006), Task 2 is about binary classification of word-level quality, and Task 3 is about binary classification of phrase-level quality.
Figure 1: RTM depiction: ParFDA selects interpretants close to the training and test data using a parallel corpus in bilingual settings and a monolingual corpus in the target language, or just the monolingual target corpus in monolingual settings; one MTPPS uses interpretants and training data to generate training features and another uses interpretants and test data to generate test features in the same feature space; learning and prediction take place with these features as input.

Referential Translation Machines
Referential translation machine (RTM) models predict data translation between the instances in the training set and the test set. RTMs use interpretants, data close to the task instances, to derive features measuring the closeness of the test sentences to the training data and the difficulty of translating them, and to identify translation acts between any two data sets for building prediction models. RTMs are applicable in different domains and tasks and in both monolingual and bilingual settings. Figure 1 depicts RTMs and explains the model building process. RTMs use ParFDA (Biçici, 2016a) for instance selection and the machine translation performance prediction system (MTPPS) (Biçici and Way, 2015) for generating features. The total number of features is 514, a count that grows with the order of n-grams used; at QET17 we used up to 5-grams for translation features and 7-grams for the language model (LM). We use ridge regression (RR), k-nearest neighbors (KNN), support vector regression (SVR), AdaBoost (Freund and Schapire, 1997), and extremely randomized trees (TREE) (Geurts et al., 2006) as learning models in combination with feature selection (FS) (Guyon et al., 2002) and partial least squares (PLS) (Wold et al., 1984). We use scikit-learn¹ for most of these models. The following parameters are optimized: λ for RR, k for KNN, γ, C, and ε for SVR, the minimum number of samples for leaf nodes and for splitting an internal node for TREE, the number of features for FS, and the number of dimensions for PLS. For AdaBoost, we do not optimize parameters but use exponential loss and 500 estimators, the number we also use with the TREE model. We use grid search for SVR. The evaluation metrics we use are Pearson's correlation (r), mean absolute error (MAE), relative absolute error (RAE), MAER (mean absolute error relative), and MRAER (mean relative absolute error relative) (Biçici and Way, 2015).
DeltaAvg (Callison-Burch et al., 2012) calculates the average quality difference between the top n − 1 quartiles and the overall quality for the test set. Official evaluation metrics include r, MAE, and DeltaAvg.
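The error metrics named above can be illustrated with a minimal sketch. The exact definitions of MAER and MRAER are given in Biçici and Way (2015) and are not restated in this section, so the forms below are assumptions: each absolute error is divided by |y| (MAER) or by |ȳ − y| (MRAER), and the denominator is floored at ε = MAE/2 to avoid division by very small values. The function names are hypothetical.

```python
from statistics import mean


def mae(y_pred, y_true):
    """Mean absolute error."""
    return mean(abs(p - t) for p, t in zip(y_pred, y_true))


def rae(y_pred, y_true):
    """Relative absolute error: total error relative to always predicting the mean.

    Assumes y_true is not constant (otherwise the denominator is zero).
    """
    y_bar = mean(y_true)
    num = sum(abs(p - t) for p, t in zip(y_pred, y_true))
    den = sum(abs(y_bar - t) for t in y_true)
    return num / den


def maer(y_pred, y_true):
    """Mean absolute error relative (assumed form): |error| / max(|y|, eps)."""
    eps = mae(y_pred, y_true) / 2  # assumed floor on the denominator
    if eps == 0:  # perfect predictions
        return 0.0
    return mean(abs(p - t) / max(abs(t), eps) for p, t in zip(y_pred, y_true))


def mraer(y_pred, y_true):
    """Mean relative absolute error relative (assumed form): |error| / max(|y_bar - y|, eps)."""
    eps = mae(y_pred, y_true) / 2  # assumed floor on the denominator
    if eps == 0:  # perfect predictions
        return 0.0
    y_bar = mean(y_true)
    return mean(abs(p - t) / max(abs(y_bar - t), eps) for p, t in zip(y_pred, y_true))
```

Because both relative metrics are normalized per instance, they can be compared across tasks whose target ranges differ, which is why MRAER is convenient for comparing task difficulty.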
¹ http://scikit-learn.org/
We improved RTM models (Biçici, 2016b) with additional features:
• normalized Levenshtein distance between the source sentence and its translation and their longest common prefix, suffix, and substring (Tian et al., 2017), normalized by the minimum length of the compared sentences.
• number of punctuation tokens in the source sentence and the translation (Kozlova et al., 2016) and the cosine between them.
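The two feature groups listed above can be sketched as follows. This is an illustration under stated assumptions, not the MTPPS implementation: distances are computed over characters, the Levenshtein distance is normalized by the longer string, punctuation tokens are whitespace-separated tokens made up only of ASCII punctuation, and the cosine is taken between the punctuation count vectors. All function and feature names are hypothetical.

```python
import math
import string
from collections import Counter


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def common_prefix_len(a, b):
    """Length of the longest common prefix."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def common_substring_len(a, b):
    """Length of the longest common contiguous substring."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
        best = max(best, max(curr))
        prev = curr
    return best


def punctuation_counts(tokens):
    """Count tokens consisting only of ASCII punctuation (an assumed definition)."""
    return Counter(t for t in tokens if t and all(c in string.punctuation for c in t))


def cosine(c1, c2):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(c1[k] * c2[k] for k in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def string_features(src, tgt):
    """Substring-distance and punctuation features for one sentence pair."""
    shortest = max(1, min(len(src), len(tgt)))
    longest = max(1, len(src), len(tgt))
    src_p = punctuation_counts(src.split())
    tgt_p = punctuation_counts(tgt.split())
    return {
        "lev": levenshtein(src, tgt) / longest,  # normalization is an assumption
        "prefix": common_prefix_len(src, tgt) / shortest,
        "suffix": common_prefix_len(src[::-1], tgt[::-1]) / shortest,
        "substring": common_substring_len(src, tgt) / shortest,
        "src_punct": sum(src_p.values()),
        "tgt_punct": sum(tgt_p.values()),
        "punct_cos": cosine(src_p, tgt_p),
    }
```

For a tokenized pair such as `string_features("click OK .", "klicken Sie auf OK .")`, the shared ending " OK ." contributes to the suffix and substring features, and both sentences contain a single punctuation token.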
We also use prediction averaging (Biçici, 2017), where the performance on the training set is used to obtain a weighted average of the top k predictions, ŷ, with evaluation metrics indexed by j ∈ J. MAER is used to select the predictions, and the weights are inverted so that predictions with lower error receive higher weight.
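A simplified sketch of this kind of averaging is shown below. It is not the formulation of Biçici (2017): it assumes a single training metric (for example MAER) rather than a set of metrics J, selects the k models with the lowest training error, and weights each by the inverse of that error. The function name and the dictionary-based inputs are hypothetical.

```python
def weighted_average(predictions, train_errors, k):
    """Average the predictions of the k models with the lowest training error.

    predictions:  dict mapping model name -> list of test predictions
    train_errors: dict mapping model name -> training error (lower is better)
    Weights are inverse errors, so more accurate models contribute more.
    """
    # Select the top k models by training error (ascending).
    ranked = sorted(predictions, key=lambda name: train_errors[name])[:k]
    # Invert the errors to obtain weights; guard against a zero error.
    weights = {name: 1.0 / max(train_errors[name], 1e-12) for name in ranked}
    total = sum(weights.values())
    n = len(predictions[ranked[0]])
    return [
        sum(weights[name] * predictions[name][i] for name in ranked) / total
        for i in range(n)
    ]
```

With three models whose training errors are 0.5, 1.0, and 2.0 and k = 2, the third model is discarded and the first receives twice the weight of the second.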
We use Global Linear Models (GLM) (Collins, 2002) with dynamic learning (GLMd). Table 2 lists the number of sentences in the training and test sets for each task and the number of instances used as interpretants in the RTM models (M for million). We tokenize and truecase all of the corpora using the Moses (Koehn et al., 2007) processing tools.² LMs are built using KENLM (Heafield et al., 2013).

QET 2017 Results
The results on the Task 1 test set are listed in Table 1.³ For Task 2 and Task 3, for coherent presentation, we list the results we obtain after the challenge on the training sets in Table 3 and on the test set in Table 4. The results we obtained in the challenge are similar. Ranks for Task 1 are out of 14 submissions and 9 systems. The top RTM models that competed in Task 1 were MIX 4, which combines the top 4 predictions, PLS GBR, and TREE. RTM ranks 2nd among systems according to RMSE, 3rd in de-en, and 6th in en-de.
² https://github.com/moses-smt/mosesdecoder/tree/master/scripts
³ We calculate r_S using scipy.stats.

Recomputing QET 2016 Results
QET17 also compares results on the QET16 test sets. The domain of the QET16 test set differed from that of QET17, overlapping on the IT domain. We use the RTM models built for QET17 to obtain results on the QET16 test sets, which is categorized as transductive transfer learning. Transfer learning attempts to re-use and transfer knowledge from models developed in different domains or for different tasks, such as using models developed for handwritten digit recognition for handwritten character recognition (Guyon et al., 2012). The results are in Table 5 for Task 1, which does not show improvement, and in Table 7, which shows improvements with RTM models built for QET17.

Comparison with Previous Results
We compare the difficulty of tasks according to MRAER levels achieved. In Table 6

Conclusion
Referential translation machines achieve top performance in automatic, accurate, and language-independent prediction of translation performance and rank as the 2nd system according to RMSE when predicting translation performance from German to English. RTMs pioneer a language-independent approach for predicting translation performance and remove the need to access any task- or domain-specific information or resource.