RTM at SemEval-2017 Task 1: Referential Translation Machines for Predicting Semantic Similarity

We use referential translation machines (RTMs) to predict the semantic similarity of text in all STS tasks, which this year contain Arabic, English, Spanish, and Turkish. RTMs pioneer a language-independent approach to semantic similarity and remove the need to access any task- or domain-specific information or resource. RTMs rank 6th out of 52 submissions in Spanish to English STS. We average prediction scores using weights based on training performance to improve the overall performance.


Referential Translation Machines (RTMs)
The semantic textual similarity (STS) task at SemEval-2017 (Bethard et al., 2017) is about quantifying the degree of similarity between two given sentences S1 and S2 in the same language or in different languages. RTMs use interpretants, data close to the task instances, to derive features measuring the closeness of the test sentences to the training data and the difficulty of translating them, and to identify translation acts between any two data sets for building prediction models. RTMs are applicable in different domains and tasks and in both monolingual and bilingual settings. Figure 1 depicts RTMs and explains the model building process. RTMs use ParFDA (Biçici, 2016a) for instance selection and the machine translation performance prediction system (MTPPS) (Biçici and Way, 2015) for generating features for the training and the test sets, mapping both to the same space, where the total number of features in each task becomes 368. The new features we include are about punctuation: number of tokens about punc-

Figure 1: RTM depiction: ParFDA selects interpretants close to the training and test data using a parallel corpus in bilingual settings and a monolingual corpus in the target language, or just the monolingual target corpus in monolingual settings; one MTPPS uses interpretants and training data to generate training features and another uses interpretants and test data to generate test features in the same feature space; learning and prediction take place with these features as input.
RTMs provide a language-independent text processing and machine learning model able to use predictions from different predictors. We use ridge regression (RR), k-nearest neighbors (KNN), support vector regression (SVR), AdaBoost (Freund and Schapire, 1997), and extremely randomized trees (TREE) (Geurts et al., 2006) as learning models in combination with feature selection (FS) (Guyon et al., 2002) and partial least squares (PLS) (Wold et al., 1984). For most of the models, we use scikit-learn. We optimize the models using a subset of the training data for the following parameters: λ for RR, k for KNN, γ, C, and ε for SVR, minimum number of samples for leaf nodes and for splitting an in-

The evaluation metrics we use are Pearson's correlation (r), mean absolute error (MAE), relative absolute error (RAE), MAER (mean absolute error relative), and MRAER (mean relative absolute error relative) (Biçici and Way, 2015). The official evaluation metric is r.
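The parameter optimization described above could be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the RTM implementation: the data are synthetic, recursive feature elimination with a ridge estimator stands in for the feature selection step, cross-validated grid search stands in for tuning on a subset of the training data, and the grid values are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in for RTM features (assumption: 20 features for speed;
# the feature space described in the text has 368 dimensions).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 20))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(scale=0.1, size=120)

# Feature selection followed by support vector regression.
# RFE with a ridge estimator is an assumed choice for the FS step.
model = Pipeline([
    ("fs", RFE(estimator=Ridge(alpha=1.0), n_features_to_select=10)),
    ("svr", SVR(kernel="rbf")),
])

# Hypothetical search ranges for the SVR parameters C, gamma, and epsilon.
param_grid = {
    "svr__C": [1.0, 10.0, 100.0],
    "svr__gamma": [0.01, 0.1],
    "svr__epsilon": [0.01, 0.1],
}

# 3-fold cross-validation replaces the held-out training subset here.
search = GridSearchCV(model, param_grid, scoring="neg_mean_absolute_error", cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
train_predictions = search.predict(X_train)
```

The same pattern would apply to the other learners by swapping the final pipeline step and its parameter grid, for example `alpha` for `Ridge` or `n_neighbors` for `KNeighborsRegressor`.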
This year, we experiment with averaging scores from different models. The predictions, ŷ, are sorted according to their performance on the training set, and either the mean of the top k predictions (equally weighted averaging) or their weighted average according to their performance is used. The weights are inverted since we are trying to decrease MAER, and they are normalized by their sum. We use MAER for sorting and selecting predictions.
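One way to read the averaging scheme is sketched below. It assumes that "inverted" means each weight is the reciprocal of the model's training MAER before normalization; the function name `average_top_k` and the example model scores are hypothetical.

```python
def average_top_k(predictions, train_maer, k, weighted=False):
    """Combine the k models with the lowest training MAER.

    predictions: dict mapping model name -> list of predicted scores
    train_maer:  dict mapping model name -> MAER on the training set
    weighted:    False for the plain mean, True for inverse-MAER weights
                 (an assumed reading of the inverted, normalized weights)
    """
    # Lower MAER is better, so sort ascending and keep the top k models.
    ranked = sorted(predictions, key=lambda name: train_maer[name])[:k]
    if weighted:
        raw = [1.0 / train_maer[name] for name in ranked]
    else:
        raw = [1.0] * len(ranked)
    total = sum(raw)
    weights = [w / total for w in raw]  # normalize by the sum
    n = len(predictions[ranked[0]])
    return [
        sum(w * predictions[name][i] for w, name in zip(weights, ranked))
        for i in range(n)
    ]


# Toy example with made-up predictions and training MAER values.
toy_predictions = {"svr": [1.0, 2.0], "rr": [3.0, 4.0], "knn": [5.0, 6.0]}
toy_maer = {"svr": 0.5, "rr": 1.0, "knn": 2.0}
mean_2 = average_top_k(toy_predictions, toy_maer, k=2)
weight_2 = average_top_k(toy_predictions, toy_maer, k=2, weighted=True)
```

With k = 3, the two modes would correspond to the mean 3 and weight 3 settings.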

SemEval-17 STS Results
SemEval-2017 STS contains sentence pairs from the languages listed in Table 1, which also lists the top r among our officially submitted results. The submissions contain a weight averaged, a mean averaged, and a top prediction, corresponding to weight 3, mean 3, and SVR model predictions, respectively.
These results do not contain AdaBoost results and they are optimized less. We build individual RTM models for each subtask under the RTM team name. Interpretants are selected from the corpora distributed by the translation task of WMT17 (Bojar et al., 2017) and they consist of monolingual sentences used to build the LM and parallel sentence pair instances used by MTPPS to derive the features. For monolingual STS, we use the corresponding monolingual corpora. We built RTM models using:
• 275 thousand sentences for en-en, 200 thousand sentences for en-tr, and 250 thousand sentences for others for training data
• 7 million sentences for the language model
These are close to the fixed training set size setting in Biçici and Way (2015).

We identified numeric expressions using regular expressions as a pre-processing step, which replaces them with a label. Identification of numerics improves the performance on the test set (Biçici, 2016b). For en-es or es-en, we did not use any language identification tool and separated sentences based on left/right difference rather than using the mixed format that was made available to the participants, even though identification of the language increases r on the test set from 0.5375 to 0.6066 while decreasing error (Biçici, 2016b). For en-tr, we were not provided any training data; therefore, we used the training data from other subtasks.

The results warn us that ar-ar, ar-en, en-es, and es-en obtain MRAER larger than 1, suggesting that more work is needed on these tasks. en-en has an MRAER slightly above 1, which is worse than the 0.719 MRAER obtained by RTMs in STS in 2016. For es-es, we obtain slightly lower results compared with the 0.729 MRAER of RTMs in STS in 2016, where we used language identification.
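The numeric pre-processing step mentioned above can be illustrated with a short sketch. The pattern and the label `NUM` are assumptions made for illustration; the actual regular expressions and label used by RTMs are not given here.

```python
import re

# Assumed pattern: a run of digits with optional comma or dot separated
# groups, not embedded inside a larger alphanumeric token such as "B2B".
NUMERIC_PATTERN = re.compile(r"(?<!\w)\d+(?:[.,]\d+)*(?!\w)")

# Hypothetical label that replaces every numeric expression.
NUMERIC_LABEL = "NUM"


def replace_numerics(sentence):
    """Replace each numeric expression in the sentence with a fixed label."""
    return NUMERIC_PATTERN.sub(NUMERIC_LABEL, sentence)


example = replace_numerics("It cost 1,250.50 dollars in 2017.")
```

Mapping all numbers to one label means that sentence pairs differing only in their numeric values produce identical token sequences, which reduces sparsity in the derived features.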
The test set domain is different this year: the Stanford Natural Language Inference corpus (Bowman et al., 2015) focuses on inference and entailment tasks. Entailment assumes a direction, whereas the goal in STS is the bidirectional grading of equivalence (Agirre et al., 2015). Table 3 lists the ranks we can obtain with these new RTM results. Figure 3 plots the performance on the test set where instances are sorted according to the magnitude of the target scores.

Experiments After the Challenge
In this section, we also present results about transfer of learning. Transfer learning attempts to re-use and transfer knowledge from models developed in different domains or for different tasks, such as using models developed for handwritten digit recognition for handwritten character recognition (Guyon et al., 2012). We cross use RTM SVR models developed for different tasks as cross-task TL and present the results in Table 4, with #train listing the size of the training set used for each task. Cross use of the RTM es-es model increases r for en-en from 0.71 to 0.75 and for en-ar from 0.19 to 0.50 while bringing all tasks except 4b en-es below the MRAER threshold of 1, which we seek as evidence that prediction performance is better than that of a predictor knowing and using the mean of the target scores on the test set.
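To make the MRAER threshold concrete, the sketch below implements MAER and MRAER as we read their definitions in Biçici and Way (2015): the absolute error of each instance is divided by the magnitude of the target (MAER) or by the error of a predictor that always outputs the target mean (MRAER). Flooring each denominator at half of the MAE is an assumption of this sketch, as are the toy scores.

```python
def mae(pred, gold):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def _mean_relative_error(pred, gold, denominators, eps=None):
    # Assumed floor for the denominators: half of the MAE.
    # If passed explicitly, eps should be positive.
    if eps is None:
        eps = mae(pred, gold) / 2.0
    total = 0.0
    for p, g, d in zip(pred, gold, denominators):
        err = abs(p - g)
        if err > 0.0:  # exact predictions add nothing and would risk 0/0
            total += err / max(d, eps)
    return total / len(gold)


def maer(pred, gold, eps=None):
    """Mean absolute error relative to the magnitude of the target."""
    return _mean_relative_error(pred, gold, [abs(g) for g in gold], eps)


def mraer(pred, gold, eps=None):
    """Mean absolute error relative to the error of the target-mean predictor."""
    mean_gold = sum(gold) / len(gold)
    return _mean_relative_error(pred, gold, [abs(mean_gold - g) for g in gold], eps)


# Toy STS-like scores on a 0 to 5 scale (made up for illustration).
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
mean_predictor = [sum(gold) / len(gold)] * len(gold)
better_predictor = [0.2, 1.1, 1.8, 3.3, 3.9, 4.6]

baseline_mraer = mraer(mean_predictor, gold)
model_mraer = mraer(better_predictor, gold)
```

Under these definitions, the predictor that always outputs the test-set mean scores at most 1 in MRAER, so a model with MRAER below 1 has smaller relative errors on average than that baseline.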

Conclusion
Referential translation machines pioneer a clean and intuitive computational model for automatic prediction of semantic similarity by measuring the acts of translation involved. Averaging predictions improves the correlation on the test set.