Referential Translation Machines for Predicting Translation Performance

Referential translation machines (RTMs) pioneer a language-independent approach to predicting translation performance and to all similarity tasks, achieving top performance in both bilingual and monolingual settings while removing the need to access any task- or domain-specific information or resource. In the quality estimation task, RTMs rank 1st in document-level prediction, 4th at sentence-level prediction according to mean absolute error, and 4th in phrase-level prediction of translation quality.


Referential Translation Machines
Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators if needed. Referential translation machines achieve top performance in the automatic and accurate prediction of machine translation performance, independent of the language or domain of the prediction task. Each referential translation machine (RTM) model is a data translation prediction model between the instances in the training set and the test set, where translation acts are indicators of the data transformation and translation. RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance in both monolingual (Biçici and Way, 2015) and bilingual settings (Biçici et al., 2015b). Figure 1 depicts RTMs and the model building process (Biçici, 2016).
RTMs use ParFDA (Biçici et al., 2015a) for selecting instances and interpretants, data close to the task instances, for building prediction models, and the machine translation performance prediction system (MTPPS) (Biçici and Way, 2015) for generating features. We improve our RTM models (Biçici et al., 2015b) with numeric expression identification using regular expressions, replacing each numeric expression with a label (Biçici, 2016).

Figure 1: RTM depiction: ParFDA selects interpretants close to the training and test data using a parallel corpus in bilingual settings and a monolingual corpus in the target language, or just the monolingual target corpus, in monolingual settings; one MTPPS instance uses interpretants and training data to generate training features and another uses interpretants and test data to generate test features in the same feature space; learning and prediction take place taking these features as input.
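The numeric expression identification step can be sketched as follows. The exact patterns and the replacement label used by RTM are not specified in the paper, so the regular expression and the NUM label below are illustrative assumptions, not the actual implementation.

```python
import re

# Hypothetical pattern for numeric expressions: integers, decimals or
# thousands-separated numbers, with an optional percent sign.
NUMERIC_RE = re.compile(r"""
    \d+(?:[.,]\d+)*      # digits with optional . or , groups
    (?:\s*%)?            # optional percent sign
""", re.VERBOSE)

def label_numeric_expressions(text, label="NUM"):
    """Replace every matched numeric expression in `text` with a label."""
    return NUMERIC_RE.sub(label, text)

print(label_numeric_expressions("The BLEU score rose from 23.4 to 25.1 (+1.7)."))
# → The BLEU score rose from NUM to NUM (+NUM).
```

Replacing all numeric expressions with a single label prevents the learner from treating each distinct number as a separate token.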

RTM in the Quality Estimation Task
We develop RTM models for all four subtasks of the quality estimation task (QET) in WMT16 (Bojar et al., 2016) (QET16), which cover the English to Spanish (en-es), English to German (en-de), and German to English (de-en) translation directions. The subtasks are sentence-level prediction (Task 1), word-level prediction (Task 2), phrase-level prediction (Task 2p), and document-level prediction (Task 3). Task 1 is about predicting the HTER (human-targeted translation edit rate) (Snover et al., 2006) scores of sentence translations, Task 2 is about binary classification of word-level quality, Task 2p is about binary classification of phrase-level quality, and Task 3 is about predicting document-level translation performance. Separate MTPPS instances are run for each training and test document to obtain the corresponding feature representations, which are filtered and processed before learning and prediction.
We tokenize and truecase all of the corpora using the code released with Moses (Koehn et al., 2007). Table 1 lists the number of sentences in the training and test sets for each task. We also list the size of the interpretants used by the corresponding RTM models (K for thousand, M for million). We use the same number of interpretants for training as last year in Task 1 and increase the number of instances used for the LM to 10M. This year, we did not include features from a backward LM in MTPPS, and we used numeric expression identification in Task 1 and Task 3.

RTM Prediction Models
We present results using support vector regression (SVR) with the RBF (radial basis function) kernel (Smola and Schölkopf, 2004) and extremely randomized trees (TREE) (Geurts et al., 2006) for the sentence- and document-level translation prediction tasks. We also use them after a feature subset selection (FS) step with recursive feature elimination (RFE) (Guyon et al., 2002), a dimensionality reduction and mapping step using partial least squares (PLS) (Specia et al., 2009), or PLS after FS (FS+PLS). We use global linear models (GLM) (Collins, 2002) with dynamic learning (GLMd) (Biçici et al., 2015b) for word-level translation performance prediction. GLMd uses weights in a range [a, b] to update the learning rate dynamically according to the error rate, as shown in Figure 3. As depicted in Figure 2, we run separate MTPPS instances for each document and obtain the corresponding features (depicted with a green or salmon colored sphere); we obtain an RTM representation vector instance from each of these by using only the document-level features from MTPPS and the min, max, and average of the sentence-level features.

Training Results
We use mean absolute error (MAE), relative absolute error (RAE), root mean squared error (RMSE), Pearson's correlation (r_P), and Spearman's correlation (r_S), as well as relative MAE (MAER) and relative RAE (MRAER), for evaluation (Biçici and Way, 2015). MAER and MRAER consider both the predictor's error and the fluctuations of the target scores at the instance level. RTM test performance on various tasks sorted according to MRAER can help identify which tasks and subtasks may require more work. DeltaAvg (Callison-Burch et al., 2012) calculates the average quality difference between the top n − 1 quartiles and the overall quality for the test set. Table 2 presents the training results for Task 1 and Task 3. Table 3 presents the Task 2 training results obtained after the challenge.
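The error metrics above can be computed as follows. MAE, RAE, and RMSE are standard; the MAER and MRAER formulas below reflect our reading of the instance-level normalization in Biçici and Way (2015), dividing each absolute error by the target's magnitude (MAER) or by its deviation from the target mean (MRAER), and the eps floor is an assumption to avoid division by zero.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred, eps=1e-8):
    """Compute MAE, RAE, RMSE, and instance-normalized MAER and MRAER."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_pred - y_true)                  # per-instance absolute error
    dev = np.abs(y_true - y_true.mean())           # target deviation from mean
    return {
        "MAE": err.mean(),
        "RAE": err.sum() / max(dev.sum(), eps),
        "RMSE": np.sqrt(((y_pred - y_true) ** 2).mean()),
        "MAER": (err / np.maximum(np.abs(y_true), eps)).mean(),
        "MRAER": (err / np.maximum(dev, eps)).mean(),
    }
```

Because MAER and MRAER normalize each error at the instance level, they allow comparing prediction difficulty across tasks whose target scores live on different scales.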

Test Results
The results on the test set are listed in Table 4 and Table 5. Ranks are out of 9, 8, 6, and 5 system submissions in Task 1, Task 2, Task 2p, and Task 3, respectively. RTM with FS SVR achieves the 6th rank in Task 1 according to r_P and the 4th according to MAE; the top MAE is 12.3, and RTM's MAE is 9% higher. RTM with FS+PLS TREE achieves the 1st rank in Task 3. Table 6 lists the RTM results obtained after the challenge when optimizing the target evaluation metric, r.

Target Optimized Results
The results show that numeric expression identification did not improve the test results for QET Task 1, but we observed improvements in semantic textual similarity in English (Biçici, 2016).

Comparison with Previous Results
We compare the difficulty of tasks according to the MRAER levels achieved. In Table 7, we list the RTM test results for tasks and subtasks that predict HTER or METEOR from QET16, QET15 (Biçici et al., 2015b), QET14 (Biçici and Way, 2014), and QET13 (Biçici, 2013). Compared with QET15 Task 1 performance, MAER improved in QET16, and we obtained the top MAER performance in sentence-level prediction. Compared with QET15 Task 2 performance, both F1 OK and F1 BAD improved even though the training error tripled. The wF1 calculation in QET16 is different from the calculation used in QET15.

Conclusion
Referential translation machines achieve top performance in the automatic, accurate, and language-independent prediction of translation performance. RTMs pioneer a language-independent approach to predicting translation performance and to all similarity tasks, removing the need to access any task- or domain-specific information or resource.