RTM at SemEval-2016 Task 1: Predicting Semantic Similarity with Referential Translation Machines and Related Statistics

We use referential translation machines (RTMs) for predicting the semantic similarity of text in both STS Core and Cross-lingual STS. RTMs pioneer a language-independent approach to all similarity tasks and remove the need to access any task- or domain-specific information or resource. RTMs ranked 14th out of 26 submissions in Cross-lingual STS. We also present rankings of various prediction tasks using the performance of RTMs in terms of MRAER, a normalized relative absolute error metric.


Semantic Agreement
We participated in the Semantic Textual Similarity task at SemEval-2016 (Bethard et al., 2016) with RTMs. RTMs identify translation acts between any two datasets with respect to interpretants, data close to the task instances, effectively judging monolingual and bilingual similarity. We use RTMs for predicting the semantic similarity of text. Interpretants are used to derive features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and the presence of the acts of translation, which may ubiquitously be observed in communication.
The Semantic Web's dream is to allow machines to share, exploit, and understand knowledge on the web (Berners-Lee et al., 2001). As more and more shared conceptualizations of domains emerge, we get closer to this goal. The semantic textual similarity (STS) task (Agirre et al., 2016) at SemEval-2016 (Bethard et al., 2016) is about quantifying the degree of similarity between two given sentences S1 and S2, in the same language (English) in STS Core (STS English) or in different languages (English or Spanish) in Cross-lingual STS (STS Spanish), with a real number in [0, 5]. S1 and S2 may be constructed using different models, with different conceptualizations of the world, different ontologies, and different vocabulary. Even if two instances are categorized as the same, they may have different implications for commonsense reasoning (both an albatross and a penguin are birds) (Biçici, 2002).
The existence of a single ontology that can cover all the conceptual information required for reaching semantic understanding is questionable because it would presume an agreement among all ontology experts. Yet semantic agreement using heterogeneous ontologies may not be possible either since, in the most extreme case, they would not use the same tokens. Therefore, semantic textual similarity is harder than the Chinese room thought experiment (Internet Encyclopedia of Philosophy, 2016) since we are not given any instructions about how to answer queries. Our goal is to quantify the level of semantic agreement between S1 and S2, and RTMs use interpretants, data close to the task instances, for building prediction models for semantic similarity.

Referential Translation Machine
Each RTM model is a data translation prediction model between the instances in the training set and the test set, and translation acts are indicators of the data transformation and translation. RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance in both monolingual (Biçici and Way, 2015) and bilingual settings (Biçici et al., 2015b). Our encouraging results in the semantic similarity tasks increase our understanding of the acts of translation we ubiquitously use when communicating and how they can be used to predict semantic similarity. Figure 1 depicts RTMs and explains the model building process. Given a training set train, a test set test, and some corpus C, preferably in the same domain, RTMs use ParFDA (Biçici et al., 2015a) for instance selection and the machine translation performance prediction system (MTPPS) (Biçici and Way, 2015) for generating features.
We use support vector regression (SVR) for building the predictor in combination with feature selection (FS) and partial least squares (PLS). Assuming that ŷ, y ∈ R^n are the prediction and the target respectively, the evaluation metrics we use are defined in Equation (1): Pearson's correlation (r), mean absolute error (MAE), relative absolute error (RAE), relative Pearson's correlation (r_R), MAER (mean absolute error relative), and MRAER (mean relative absolute error relative).
We use MAER and MRAER for easier replication and comparability. MAER is the mean absolute error relative to the magnitude of the target and MRAER is the mean absolute error relative to the absolute error of a predictor always predicting the target mean assuming that target mean is known (Biçici and Way, 2015).
⌊·⌋_ε caps its argument from below to ε, where ε = MAE(ŷ, y)/2, which represents half of the score step with which a decision about a change in a measurement's value can be made.
We compare different tasks in Table 9 with evaluation results that are calculated relative to the magnitude of each target score instance. r multiplies the distances of ŷ_i and y_i to their own means (Equation (1)). We obtain the normalized correlation, r_R, using ε = σ(y)/2.
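Following the definitions above, MAER and MRAER can be computed as in this sketch: MAER divides each absolute error by the magnitude of its target, MRAER divides it by the absolute error of the mean predictor on that instance, and both denominators are capped from below by ε = MAE(ŷ, y)/2. The toy scores are made up for illustration.

```python
def mae(y_hat, y):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def maer(y_hat, y):
    """Mean absolute error relative to the magnitude of each target."""
    eps = mae(y_hat, y) / 2.0  # floor capping the denominator from below
    return sum(abs(p - t) / max(abs(t), eps)
               for p, t in zip(y_hat, y)) / len(y)

def mraer(y_hat, y):
    """Mean absolute error relative to the error of the mean predictor."""
    eps = mae(y_hat, y) / 2.0
    mean_y = sum(y) / len(y)
    return sum(abs(p - t) / max(abs(t - mean_y), eps)
               for p, t in zip(y_hat, y)) / len(y)

y     = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # made-up similarity targets
y_hat = [0.5, 1.0, 2.5, 2.5, 4.5, 4.5]  # made-up predictions
print(round(maer(y_hat, y), 3), round(mraer(y_hat, y), 3))  # 0.507 0.456
```

An MRAER below 1 means the predictor beats the always-predict-the-mean baseline on average, which is the threshold used to separate tasks in Table 9.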

SemEval-16 STS Results
SemEval-2016 STS contains sentence pairs from different domains: answer-answer, headlines, plagiarism, postediting, and question-question for English, and multisource and newswire for Spanish. The official evaluation metric in STS is Pearson's correlation. Table 1 lists the number of instances in the test set, where only some of the instances are actually evaluated.
We build individual RTM models for each subtask. Our team name is RTM. Interpretants are selected from the corpora distributed by the translation task of WMT16 (Bojar et al., 2016); they consist of monolingual sentences used to build the LM and parallel sentence pair instances used by MTPPS to derive features, including word alignment features. We use English monolingual corpora to select interpretants for STS English and also for the STS Spanish shuffled dataset, which is the official format made available to the participants.
We used an English-Spanish parallel corpus and English and Spanish monolingual corpora for our STS Spanish experiments in Section 4 after the challenge, using the language identified version. We built RTM models using 200 thousand sentences as training data and 5 million sentences for the language model, which corresponds to the fixed training set size setting in (Biçici and Way, 2015). As a preprocessing step, we identified numeric expressions using regular expressions and replaced them with a label. For training RTM models for STS Spanish, we use STS English training data and STS Spanish data from SemEval-2015 after scaling the scores to the range [0, 5]. Table 2 and Table 3 list the results on the test set. Ranks are out of 26 submissions in STS Spanish. We also observe that r computed over the whole test set, which is not the weighted average of the per-domain r scores according to the number of instances in each domain, can differ from the weighted r scores.
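The gap between pooled and weighted r can be made concrete with a small sketch. The two domains, their sizes, and the scores below are entirely made up; the point is only that a weighted average of per-domain correlations and the correlation over the pooled test set are different quantities and can even differ in sign.

```python
def pearson_r(x, y):
    """Pearson's correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical domains: (gold scores, predicted scores).
domains = {
    "headlines":  ([1, 2, 3, 4], [1.1, 2.3, 2.9, 4.2]),
    "plagiarism": ([0, 5], [4.8, 0.2]),
}

# Weighted average of per-domain r, by number of instances.
total = sum(len(gold) for gold, _ in domains.values())
weighted = sum(len(gold) / total * pearson_r(pred, gold)
               for gold, pred in domains.values())

# r over the pooled (unsplit) test set.
all_gold = [v for gold, _ in domains.values() for v in gold]
all_pred = [v for _, pred in domains.values() for v in pred]
overall = pearson_r(all_pred, all_gold)
print(round(weighted, 3), round(overall, 3))
```

Here the weighted score is positive while the pooled score is negative, because the badly predicted domain dominates the pooled covariance despite contributing only a third of the weight.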

Experiments After the Challenge
In this section, we detail the training performance of our models based on major modeling differences with our previous RTM models on SemEval tasks (Biçici and Way, 2015). This year, we identified numeric expressions using regular expressions as a preprocessing step, which mainly identifies integers and real numbers that can have exponents. After submitting the test results, we further worked on numeric expression identification to expand the types of identified expressions. We also experimented with language identification for STS Spanish. Language identification is done using the manually corrected results starting from the output of the automatic language identification tool mguesser.¹ After language identification, the corpora were split into English and Spanish rather than the shuffled format that was made available to the participants. We compare the performance after identification of numeric expressions and identification of the language using SVR. Both STS Spanish models use previous years' training data from both STS English and STS Spanish, which totals 13823 instances. Table 4 and Table 5 present the results before and after identification of numerics on STS English. We observe that identification of numerics improves the performance on the test set (bolded results). Table 6 presents the results on STS Spanish with the default shuffled setting and with the setting where we model the prediction as machine translation performance prediction from English to Spanish after identifying the language of each sentence in the training set. The STS Spanish training dataset contains a majority of English sentences, and the shuffled +numerics setting only uses English corpora even though Spanish sentences are shuffled into the test set; eventually, the shuffled +numerics setting obtains better results than the language identified +numerics setting on the training set.
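The exact regular expressions used are not published, so the following is a hypothetical sketch of such a preprocessing step: a single pattern for optionally signed integers and real numbers with optional exponents, replaced by a label token (the pattern and the NUM label are both illustrative assumptions).

```python
import re

# Matches optionally signed integers and real numbers with an optional
# exponent, e.g. 42, -3.14, 1.5e3, 1e-5. Illustrative only; the exact
# patterns used for the submitted systems are not published.
NUMERIC = re.compile(r"[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def replace_numerics(sentence, label="NUM"):
    """Replace each numeric expression with a single label token."""
    return NUMERIC.sub(label, sentence)

print(replace_numerics("The rover traveled 1.5e3 meters in 42 minutes."))
# -> The rover traveled NUM meters in NUM minutes.
```

Mapping surface-distinct numbers to one token lets the similarity model treat sentence pairs that differ only in numeric values consistently instead of seeing each number as an unseen word.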
Even so, we observe that identification of the language improves the performance on the test set (bolded results). Training results in the language identified +numerics setting are lower, which may be due to the RTM model using a language identified test corpus together with the same training corpus as the shuffled +numerics setting. Table 8 presents the performance on the test set where instances are sorted according to the magnitude of the target scores. For STS English, we observe decreasing AER and a valley of absolute errors, which may be due to SVR preferring predictions close to the mean of the training score distribution.

RTMs Across Tasks and Years
We compare the difficulty of various prediction tasks where RTMs participated (Biçici and Way, 2015) according to MRAER in Table 9. MAER and MRAER consider both the predictor's error and the fluctuations of the target scores at the instance level, which is the sentence level in STS 2016. The best results are obtained for the CLSS 2014 paragraph-to-sentence subtask, which may be due to the larger contextual information that paragraphs can provide for the RTM models. We observe that the performance in STS improved in 2016 compared to STS in previous years. Table 9 can be used to evaluate the difficulty of various tasks and domains based on RTM. We separated the results having MRAER greater than 1 since in these tasks and subtasks RTM does not perform significantly better than the mean predictor, and fluctuations render these as tasks that may require more work. Our findings caution against re-using those datasets and results without further work. STS Spanish is able to achieve MRAER less than 1 in 2016. We also note that RTMs achieve the top result both in CLSS 2014 (Jurgens et al., 2014) and in all QET tasks in Table 9, including the QET 2015 German-English METEOR task (Bojar et al., 2015).

Contributions
Referential translation machines pioneer a clean and intuitive computational model for automatically measuring semantic similarity by measuring the acts of translation involved. We show that identification of numeric expressions in STS English and identification of the language in STS Spanish improve the performance on the test set. RTM test performance on various tasks sorted according to MRAER can identify which tasks, subtasks, and provided datasets are mature enough to build further results on.