USAAR-SHEFFIELD: Semantic Textual Similarity with Deep Regression and Machine Translation Evaluation Metrics

This paper describes the USAAR-SHEFFIELD systems that participated in the Semantic Textual Similarity (STS) English task of SemEval-2015. We extend the work on using machine translation evaluation metrics in the STS task. Unlike previous approaches, we draw on the metrics' robustness across different text types and conflate the training data across the different subcorpora. In addition, we introduce a novel deep regressor architecture and evaluate its efficiency in the STS task.


Introduction
Semantic Textual Similarity (STS) is the task of measuring the degree to which two text snippets have the same meaning (Agirre et al., 2014). For instance, given the two texts, "a dog sprints across the water" and "a dog jumps through water", participating systems are required to predict a real number similarity score on a scale of 0 (no relation) to 5 (semantic equivalence). This paper presents a collaborative submission between Saarland University and University of Sheffield to the STS English shared task at SemEval-2015. We have submitted three models that use Machine Translation (MT) evaluation metrics as features to build supervised regressors that predict the similarity scores for the STS task. We introduce two variants of a novel deep regressor architecture and a classical baseline regression system that uses MT evaluation metrics as input features.

Related Work
Previously, research teams have applied MT evaluation metrics to the STS task with increasingly better results (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014). Rios et al. (2012) trained a Support Vector Regressor that scored a mean Pearson correlation of 0.3825 (Baseline: 0.4356). Barrón-Cedeño et al. (2013) also used a Support Vector Regressor and outperformed the baseline with a mean score of 0.4037 (Baseline: 0.3639). Huang and Chang (2014) used a linear regressor and scored 0.792, beating the baseline system (Baseline: 0.613).
Another notable mention of MT technology in the STS task is the use of referential translation machines to predict and derive features instead of using MT evaluation metrics (Biçici and van Genabith, 2013; Biçici and Way, 2014).
These previous approaches have trained a different system for each subcorpus provided by the task organizers. We have chosen to combine the different subcorpora since MT evaluation metrics are expected to be robust against text types and domains (Han et al., 2012; Padó et al., 2009).
Much of the previous work on using MT evaluation metrics is based on improving the regressors through algorithm choice, feature selection and parameter tuning. We introduce a novel hybrid supervised machine learning architecture, Deep Regression, which attempts to combine different regressors and to automate feature selection by means of dimensionality reduction.

Deep Regression Architecture
Ensemble learning constructs a set of models based on different algorithms and then labels new data points by taking a (weighted) vote from the algorithms' predictions (Dietterich, 2000). A typical single-layer feed-forward neural network creates a layer of perceptrons that receive the inputs and produce a series of outputs transformed by an activation function; these outputs then enter a final layer consisting of a single classifier that provides the final prediction (Auer et al., 2008). We propose a deep regression architecture that combines a single-layer feed-forward neural net architecture with ensemble-like supervised learning. Figure 1 presents the Deep Regression architecture: the inputs are fed into the different hidden regressors and, unlike in a traditional neural network, where every node applies the same activation function, each regressor produces a discrete output using its own cost function. Unlike in ensemble learning, the voting/selection step is replaced by a last layer consisting of a single regressor that takes the latent layer as input and produces the final STS score.
Designing the architecture in this way reduces the feature space of the input to the number of hidden regressors, and the input for the last-layer regressor is a latent layer in the higher dimensional space. Within a standard neural net, every node in the latent layer is influenced by all the perceptrons in the previous layer. In contrast, each latent dimension here depends on only one regressor; in this respect it resembles ensemble learning, where the regressors/classifiers are trained independently.
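The two-layer layout described above can be sketched in plain Python. This is only an illustration under stated assumptions: every hidden regressor is taken to be a linear model that differs solely in its L2 penalty (ordinary least squares and two ridge penalties), which stands in for the "different cost functions"; the regressors actually used in the submitted systems are not named here. The helper names `solve`, `fit_ridge`, `predict_linear` and `DeepRegressor` are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X: Sequence[Sequence[float]], y: Sequence[float], alpha: float) -> List[float]:
    """Weights minimising ||Xb w - y||^2 + alpha * ||w||^2 (Xb = X plus a bias column).

    With alpha = 0 this is ordinary least squares and needs Xb to have full column rank.
    """
    xb = [list(row) + [1.0] for row in X]
    d = len(xb[0])
    gram = [[sum(r[i] * r[j] for r in xb) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(xb, y)) for i in range(d)]
    return solve(gram, rhs)


def predict_linear(w: Sequence[float], X: Sequence[Sequence[float]]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(w, list(row) + [1.0])) for row in X]


class DeepRegressor:
    """Hidden layer of independently trained regressors plus one final regressor."""

    def __init__(self, hidden_alphas=(0.0, 1.0, 10.0), final_alpha=1e-3):
        self.hidden_alphas = hidden_alphas  # assumption: one penalty per hidden regressor
        self.final_alpha = final_alpha

    def _latent(self, X):
        # One latent dimension per hidden regressor: its prediction for each row.
        cols = [predict_linear(w, X) for w in self.hidden_]
        return [list(row) for row in zip(*cols)]

    def fit(self, X, y):
        # Each hidden regressor sees the full input and is trained on its own.
        self.hidden_ = [fit_ridge(X, y, a) for a in self.hidden_alphas]
        # The last layer replaces ensemble voting: it is trained on the latent layer.
        self.final_ = fit_ridge(self._latent(X), y, self.final_alpha)
        return self

    def predict(self, X):
        return predict_linear(self.final_, self._latent(X))
```

In this sketch the last-layer regressor receives as many inputs as there are hidden regressors (three by default), regardless of how many MT metric features enter the first layer.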

Feature Matrix
Machine Translation evaluation metrics consider varying degrees of information at the lexical, syntactic and semantic levels. Each metric comprises several features that compute the translation quality by comparing every translation against one or several reference translations. We consider three sets of features: n-gram overlaps, Shallow Parsing metrics and METEOR. These metrics correspond to the lexical, syntactic and semantic levels respectively.

N-gram Overlaps
Gonzàlez et al. (2014) reintroduce the notion of language-independent metrics relying on n-gram overlaps. This is similar to the BLEU metric, which calculates the geometric mean of n-gram precisions by comparing the translation against its reference(s) (Papineni et al., 2002), but without the brevity penalty.
Unlike in BLEU, the n-gram overlaps are computed as similarity coefficients instead of taking the crude proportion of overlapping n-grams.
\[ \text{$n$-gram overlap} = \mathrm{sim}\left(\text{$n$-gram}_{\text{trans}} \cap \text{$n$-gram}_{\text{ref}}\right) \]
We use 16 features of n-gram overlap by considering both the cosine similarity and Jaccard Index in calculating the n-gram overlaps for character and token n-gram from the order of bigrams to 5-grams. In addition, we use the ratio of n-gram lengths and the Jaccard similarity of pseudo-cognates (Simard et al., 1992) as the 17th and 18th n-gram overlap features.
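The count of 16 features follows from two similarity coefficients, two n-gram levels (character and token) and four orders (2 to 5). The sketch below enumerates them in plain Python. It assumes that the Jaccard Index is taken over sets of n-grams and the cosine similarity over n-gram count vectors; the exact definitions used by the toolkit may differ, and the function and feature names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(items, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def jaccard(a, b):
    """Jaccard Index over the sets of n-grams (assumed definition)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity over n-gram count vectors (assumed definition)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def ngram_overlap_features(trans, ref):
    """2 levels x 4 orders x 2 coefficients = 16 overlap features."""
    feats = {}
    for level, split in (("char", list), ("token", str.split)):
        t_items, r_items = split(trans), split(ref)
        for n in range(2, 6):  # bigrams to 5-grams
            t, r = ngrams(t_items, n), ngrams(r_items, n)
            feats[f"{level}-{n}gram-cosine"] = cosine(t, r)
            feats[f"{level}-{n}gram-jaccard"] = jaccard(t, r)
    return feats


feats = ngram_overlap_features("a dog sprints across the water",
                               "a dog jumps through water")
print(len(feats))  # 16
```

For this sentence pair the only shared token bigram is ("a", "dog") out of eight distinct bigrams, so the token-bigram Jaccard feature is 1/8.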

Shallow Parsing
The Shallow Parsing (SP) metric measures syntactic similarity by computing the overlaps between the translation and the reference translation at the level of parts of speech (POS), word lemmas and base phrase chunks. The purpose of the SP metric is to capture the proportion of lexical items correctly translated according to their shallow syntactic realization.
The base phrase chunks are tagged using the BIOS toolkit (Surdeanu et al., 2005), and POS tagging and lemmatization are performed using SVMTool (Giménez and Màrquez, 2004). For instance, given a pair of sentences in the format (word/POS/lemma/chunk):
We consider the overlap proportions for the POS features, lemma features, IOB features and shallow chunks. The Inside, Outside, Begin (IOB) features refer to the shallow parsing tags at the lexical level, e.g. B-NP represents the beginning of a noun phrase (Sang et al., 2000). The IOB features are measured lexically by considering each IOB tag, while the shallow chunk features only consider the number of bracketed chunks.
For instance, the POS tag DT occurs twice in the first sentence and once in the second sentence, thus we extract the feature SP-POS(DT) = 1/2 = 0.5. For SP-POS, SP-LEMMA and SP-IOB, we use the NIST-like measure, where we consider not only the individual POS, LEMMA or IOB tags but also an accumulated score over sequences of 1 to 5 n-grams, e.g. SP-POS(DT+NN, DT+NN+VBZ, ...) or SP-LEMMA(a+dog, a+dog+jump, ...).
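The single-tag case, SP-POS(DT) = 1/2, can be reproduced with a few lines of Python. The sketch assumes the overlap for one tag is the smaller count divided by the larger count, which matches the worked value above; the NIST-like accumulation over tag sequences is not covered. The tag sequences are assigned by hand for illustration, and `tag_overlap` is a hypothetical name.

```python
from collections import Counter


def tag_overlap(tags_a, tags_b):
    """Per-tag overlap: min(count) / max(count) over the two sentences (assumed definition)."""
    ca, cb = Counter(tags_a), Counter(tags_b)
    return {tag: min(ca[tag], cb[tag]) / max(ca[tag], cb[tag])
            for tag in ca.keys() | cb.keys()}


# Hand-assigned POS tags, for illustration only.
first = ["DT", "NN", "VBZ", "IN", "DT", "NN"]   # a dog sprints across the water
second = ["DT", "NN", "VBZ", "IN", "NN"]        # a dog jumps through water

overlap = tag_overlap(first, second)
print(overlap["DT"])  # 0.5
```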

METEOR
METEOR first aligns the translation to a reference translation and then uses unigram mapping to match words by their surface forms, word stems, synonyms and paraphrases (Banerjee and Lavie, 2005; Denkowski and Lavie, 2010).
Unlike the n-gram and shallow parsing features, METEOR distinguishes between content words and function words, and precision and recall are measured by weighting them differently.
It also accounts for word order differences by penalizing chunks from the translation that do not appear in the reference.
We use the METEOR 1.5 system with weights and penalty tuned on the WMT12 data. For the STS experiment, we use all four variants of METEOR: exact matches, stem matches, synonym matches and paraphrase matches.

Training Data
We conflated all training and test data of various text types from previous SemEval STS shared tasks into a single training set with 10597 paragraph/sentence/caption pairs. The MT metrics for each text pair were computed with the Asiya toolkit (Giménez and Màrquez, 2010). Tokenization and preprocessing operations, such as lemmatization, POS tagging, parsing and n-gram extraction, are performed by the Asiya toolkit.

Models
We submitted three models to the SemEval-2015 STS English Task:
• ModelX: Deep Regression framework with the full feature set from n-gram overlaps, Shallow Parsing and METEOR.
• ModelY: Classical baseline regressor with MT evaluation metrics as input features.
• ModelZ: Deep Regression framework with only the METEOR features.
Our baseline model achieved modest results, ranking 24th out of 73 submissions; however, our deep regressors failed to perform on par with the simple baseline regressor. We note that the deep regressor with the full feature set (ModelX) scored lower than the deep regressor with only the METEOR features (ModelZ). This reiterates the effectiveness of the semantically motivated METEOR features in determining similarity, as previously indicated by Huang and Chang (2014). Interestingly, the conflation of datasets has no obvious detrimental effect on performance in any specific domain. Figure 2 presents a comparison of results between ModelY, the top system from DLSU and the organizers' baseline system (TokenCos). It shows that the distribution of Spearman's correlation for our model is as well-balanced as that of the best system.

Conclusion
In this paper, we have described our submissions to the STS English task for SemEval-2015. We have introduced a novel deep regression architecture with MT evaluation metrics to measure semantic similarity. Although our deep regressors performed poorly, our baseline system has achieved promising results amongst the participating systems, and we showed that conflating datasets of different genres has negligible effects on a semantic similarity system based on MT evaluation metrics.
The results also confirm the good performance of METEOR, a traditional MT evaluation metric, for the STS task.