DLS@CU at SemEval-2016 Task 1: Supervised Models of Sentence Similarity

We describe a set of systems submitted to the SemEval-2016 English Semantic Textual Similarity (STS) task. Given two English sentences, the task is to compute the degree of their semantic similarity. Each of our systems uses the SemEval 2012–2015 STS datasets to train a ridge regression model that combines different measures of similarity. Our best system demonstrates 73.6% correlation with average human annotations across five test sets.


Introduction
Identification of short-text semantic similarity is an important research problem with applications in a multitude of NLP tasks: question answering (Yao et al., 2013; Severyn and Moschitti, 2013), short answer grading (Mohler et al., 2011; Ramachandran et al., 2015), text summarization (Dasgupta et al., 2013; Wang et al., 2013), evaluation of machine translation (Chan and Ng, 2008; Liu et al., 2011), and so on. The SemEval Semantic Textual Similarity (STS) task series (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015) is a core platform for the task: a publicly available corpus of more than 14,000 sentence pairs has been developed over a span of four years, with human annotations of similarity for each pair, and about 300 system runs have been evaluated.
In this article, we describe a set of systems that participated in the SemEval-2016 English Semantic Textual Similarity (STS) task. Given two English sentences, the objective is to compute their semantic similarity in the range [0, 5], where the score increases with similarity (i.e., 0 indicates no similarity and 5 indicates identical meanings). The official evaluation metric is the Pearson product-moment correlation coefficient with human annotations. Our systems leverage different measures of sentence similarity and train ridge regression models that learn to combine predictions from these different sources using past SemEval data. The best of our three system runs achieves 73.6% correlation with human annotations on five test sets (containing a total of 1186 test pairs).
Early work in sentence similarity (Mihalcea et al., 2006; Li et al., 2006; Islam and Inkpen, 2008) established the basic procedural framework under which most modern algorithms operate: computing sentence similarity as a mean of word similarities across the two input sentences. With no human-annotated STS dataset available, these algorithms were unsupervised and were evaluated extrinsically on tasks like paraphrase detection and textual entailment recognition. The SemEval STS task series has made an important contribution through the large annotated dataset, enabling intrinsic evaluation of STS systems and making supervised STS systems a reality.
At SemEval 2012–2015, most of the top-performing STS systems used a regression algorithm to combine different measures of similarity (Bär et al., 2012; Šarić et al., 2012; Wu et al., 2013; Lynum et al., 2014; Sultan et al., 2015), with the notable exception of a couple of unsupervised systems that relied primarily on alignment of related words in the two sentences (Han et al., 2013; Sultan et al., 2014b).
Our models are based on the successful linear regression architecture of past SemEval systems in general, and the winning system of SemEval-2015 (Sultan et al., 2015) in particular. We use the features in the latter system unchanged in one of our runs and augment them with simple word and character n-gram overlap features in the other two runs.

System Description
Our system employs a ridge regression model (linear regression with L2 error and L2 regularization) to combine a set of similarity measures. The model is trained on SemEval 2012–2015 data. Our three runs differ in the subset of features drawn from the feature pool. We describe the feature set in this section; the individual runs are discussed in Section 4.
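For reference, the ridge regression objective can be written out as below. The symbols are our own notation for this sketch and do not appear elsewhere in the paper: x_i is the feature vector of training pair i, y_i its human similarity score, w the weight vector, b the intercept, N the number of training pairs, and α the regularization strength. We assume the intercept is left unpenalized, as is common practice.

```latex
(\hat{w}, \hat{b}) = \arg\min_{w,\, b} \; \sum_{i=1}^{N} \left( y_i - w^{\top} x_i - b \right)^2 \; + \; \alpha \, \lVert w \rVert_2^2
```

Larger values of α shrink the weights more strongly toward zero; α = 0 recovers ordinary least squares.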

Features
Word Alignment Proportion. This feature operationalizes the hypothesis that highly semantically similar sentences should also have a high degree of conceptual alignment between their semantic units, i.e., words and phrases. To that end, we apply the monolingual word aligner developed by Sultan et al. (2014a) to input sentence pairs. This aligner aligns words based on their semantic similarity and the similarity between their local semantic contexts in the two sentences. It uses the paraphrase database PPDB (Ganitkevitch et al., 2013) to identify semantically similar words, and relies on dependencies and surface-form neighbors of the two words to determine their contextual similarity. Word pairs are aligned in decreasing order of a weighted sum of their semantic and contextual similarity. Figure 1 shows an example set of alignments.
Additionally, we consider a Levenshtein distance of 1 between a misspelled word and a correctly spelled word (of length > 2) to be a match.
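The spelling-tolerant match can be sketched as follows. This is a minimal illustration, not the aligner's actual code: the function names are hypothetical, and because the sketch cannot tell which word is the misspelled one, it assumes the length condition applies to both words.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_spelling_match(word_a: str, word_b: str, min_len: int = 3) -> bool:
    """Treat two words as a match if they are exactly one edit apart.

    Assumption: both words must have length > 2 (i.e. >= min_len).
    """
    if len(word_a) < min_len or len(word_b) < min_len:
        return False
    return levenshtein(word_a.lower(), word_b.lower()) == 1
```

For instance, `is_spelling_match("similarty", "similarity")` holds because a single insertion repairs the misspelling, while two-letter words are never matched this way.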
Given sentences $S^{(1)}$ and $S^{(2)}$, the alignment-based similarity measure is computed as follows:

$$sim(S^{(1)}, S^{(2)}) = \frac{n_c^a(S^{(1)}) + n_c^a(S^{(2)})}{n_c(S^{(1)}) + n_c(S^{(2)})}$$

where $n_c(S^{(i)})$ and $n_c^a(S^{(i)})$ are the number of content words and the number of aligned content words in $S^{(i)}$, respectively.
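A minimal sketch of this feature is given below. It assumes the measure is the proportion of aligned content words over both sentences, as the feature name suggests, and that the aligner's output is available as index pairs over the content words; the function name and input format are hypothetical.

```python
def alignment_proportion(content_1, content_2, alignments):
    """Proportion of content words that are aligned, over both sentences.

    content_1, content_2: lists of content words of the two sentences.
    alignments: iterable of (i, j) pairs, meaning content_1[i] is aligned
                with content_2[j] (assumed output format of the aligner).
    """
    pairs = list(alignments)
    aligned_1 = {i for i, _ in pairs}  # aligned content words in sentence 1
    aligned_2 = {j for _, j in pairs}  # aligned content words in sentence 2
    total = len(content_1) + len(content_2)
    if total == 0:
        return 0.0  # assumption: no content words means no evidence
    return (len(aligned_1) + len(aligned_2)) / total
```

With four content words in one sentence, five in the other, and three aligned pairs, the feature value is (3 + 3) / (4 + 5).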
Sentence Embedding. A fundamental limitation of the above feature is that it only relies on PPDB to identify semantically similar words; consequently, similar word pairs are limited to only lexical paraphrases. Hence it fails to utilize semantic similarity or relatedness between non-paraphrase word pairs (e.g., sister and related). In the current feature, we leverage neural word embeddings to overcome this limitation. We use the 400-dimensional vectors developed by Baroni et al. (2014). They used the word2vec toolkit to extract these vectors from a corpus of about 2.8 billion tokens. These vectors perform well across different word similarity datasets in their experiments. Details on their approach and findings can be found in Baroni et al. (2014).
Instead of comparing word vectors across the two input sentences, we adopt a simple vector composition scheme to construct a vector representation of each input sentence and then take the cosine similarity between the two sentence vectors as our second feature. The vector representing a sentence is simply the sum of its content lemma vectors.
Word n-gram Overlap. This feature computes the proportion of word n-grams (lemmatized) that are in both $S^{(1)}$ and $S^{(2)}$. We employ separate instances of this feature for n = 1, 2, 3. The goal is to identify high local similarities in the two snippets and learn the influence that such local similarities might have on human judgment of sentence similarity.
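The sentence embedding feature (sum of content lemma vectors, followed by cosine similarity) can be sketched as below. The toy dictionary of vectors stands in for the 400-dimensional embeddings; the function names and the choice to skip lemmas without a vector are assumptions of this sketch.

```python
import math


def sentence_vector(content_lemmas, embeddings, dim):
    """Sum the vectors of the content lemmas of one sentence.

    embeddings: dict mapping lemma -> list of floats of length dim.
    Assumption: lemmas without a vector are skipped.
    """
    total = [0.0] * dim
    for lemma in content_lemmas:
        vec = embeddings.get(lemma)
        if vec is None:
            continue
        for k, value in enumerate(vec):
            total[k] += value
    return total


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_similarity(lemmas_1, lemmas_2, embeddings, dim):
    return cosine(sentence_vector(lemmas_1, embeddings, dim),
                  sentence_vector(lemmas_2, embeddings, dim))
```

Summation keeps the composition independent of word order, so the feature captures topical relatedness rather than sentence structure.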
Character n-gram Overlap. This feature computes the proportion of character n-grams that are in both $S^{(1)}$ and $S^{(2)}$ in their surface form. We employ separate instances of this feature for n = 3, 4. The goal is to identify and correct for spelling errors as well as incorrect lemmatizations.
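Both overlap features can be sketched with one helper. The text does not spell out the denominator of the "proportion", so this sketch assumes a Jaccard-style ratio (shared n-grams over all distinct n-grams in either sentence); the function names are hypothetical.

```python
def ngrams(sequence, n):
    """Set of contiguous n-grams of a list of tokens or a string."""
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def overlap(grams_1, grams_2):
    """Assumed definition: |intersection| / |union| of the two n-gram sets."""
    union = grams_1 | grams_2
    if not union:
        return 0.0
    return len(grams_1 & grams_2) / len(union)


def word_ngram_overlap(lemmas_1, lemmas_2, n):
    """Overlap of lemmatized word n-grams (n = 1, 2, 3 in our runs)."""
    return overlap(ngrams(lemmas_1, n), ngrams(lemmas_2, n))


def char_ngram_overlap(text_1, text_2, n):
    """Overlap of surface-form character n-grams (n = 3, 4 in our runs)."""
    return overlap(ngrams(text_1, n), ngrams(text_2, n))
```

Each value of n yields a separate feature, so a sentence pair contributes three word-level and two character-level overlap values to the model.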
Soft Cardinality. Soft Cardinality (Jimenez et al., 2012) is a measure of set cardinality where similar items in a set contribute less to its cardinality than dissimilar items. Jimenez et al. (2012) propose a parameterized measure of semantic similarity based on soft cardinality that computes sentence similarity from word similarity and the latter from character n-gram similarity. This measure was highly successful at SemEval-2012 (Agirre et al., 2012). We employ this measure with untuned parameter values as a feature for our model: p = 1, bias = 0, α = 0.5, bias_sim = 0, α_sim = 0.5, q_1 = 2, and q_2 = 4. (Please see the original article for a detailed description of these parameters as well as the similarity measure.)
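The core idea of soft cardinality can be illustrated with the simplified sketch below. It is not the parameterized measure of Jimenez et al. (2012): word similarity here is a plain Dice coefficient over character q-grams (q from 2 to 4), all words have unit weight, and the sentence-level score is a Dice-style ratio. These are stand-in choices made for this sketch, and all function names are hypothetical.

```python
def char_qgrams(word, q_min=2, q_max=4):
    """All character q-grams of a word for q_min <= q <= q_max."""
    grams = set()
    for q in range(q_min, q_max + 1):
        grams.update(word[i:i + q] for i in range(len(word) - q + 1))
    return grams


def word_similarity(a, b):
    """Stand-in word similarity: Dice coefficient over character q-grams."""
    if a == b:
        return 1.0
    grams_a, grams_b = char_qgrams(a), char_qgrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def soft_cardinality(words, p=1.0):
    """Each word contributes 1 / (sum of its similarities to all words),
    so near-duplicates add less than one unit each."""
    return sum(1.0 / sum(word_similarity(a, b) ** p for b in words)
               for a in words)


def soft_cardinality_similarity(words_1, words_2, p=1.0):
    """Dice-style sentence similarity from soft cardinalities (assumed form)."""
    card_1 = soft_cardinality(words_1, p)
    card_2 = soft_cardinality(words_2, p)
    if card_1 + card_2 == 0.0:
        return 0.0
    card_union = soft_cardinality(list(words_1) + list(words_2), p)
    card_inter = max(0.0, card_1 + card_2 - card_union)
    return 2 * card_inter / (card_1 + card_2)
```

Two unrelated words give a soft cardinality of 2, an exact duplicate pair gives 1, and similar words such as "sunday" and "sundays" fall in between.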

Data
The 1186 test sentence pairs at SemEval-2016 are divided into five sets, each consisting of pairs from a particular domain. Each pair is assigned a similarity score in the range [0, 5] by human annotators (0: no similarity, 5: identical meanings). The test sets are described briefly in Table 1.
We train our supervised systems using data from the past four years of SemEval STS (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015). The selections vary by test set, which we discuss in the next section.

Runs
We submit three runs at SemEval-2016. Each run employs a ridge regression model; we use Scikit-learn (Pedregosa et al., 2011) [...] (2012) pairs. These selections are based on the similarity between the source and the target domains: news data for headlines, machine translation data for postediting. For the other three test sets, all past annotations (except those for fnwn (2013)) are used, as we did not find any close matches for these test sets in the SemEval 2012–2015 data.

Run 1
Run 1 is a ridge regression model based only on the first two features: alignment and embeddings. The regularization strength parameter α is set using cross-validation on training data.
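Selecting α by cross-validation can be sketched as follows. Our runs use Scikit-learn; the dependency-free code below is only a stand-in that solves the ridge normal equations directly. The candidate grid, the number of folds, the fold assignment, and the use of mean squared error as the selection criterion are all assumptions of this sketch.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Ridge fit with an unpenalized intercept (alpha must be > 0)."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[k] for row in X) / n for k in range(d)]
    mean_y = sum(y) / n
    Xc = [[row[k] - mean_x[k] for k in range(d)] for row in X]  # centered
    yc = [t - mean_y for t in y]
    # Normal equations: (Xc^T Xc + alpha I) w = Xc^T yc
    A = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve_linear_system(A, b)
    intercept = mean_y - sum(wk * mk for wk, mk in zip(w, mean_x))
    return w, intercept


def predict(w, intercept, row):
    return intercept + sum(wk * xk for wk, xk in zip(w, row))


def cv_mse(X, y, alpha, k=5):
    """Mean squared error over k folds (example i goes to fold i % k)."""
    errors = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        w, b = fit_ridge([X[i] for i in train], [y[i] for i in train], alpha)
        errors += [(predict(w, b, X[i]) - y[i]) ** 2 for i in test]
    return sum(errors) / len(errors)


def select_alpha(X, y, candidates, k=5):
    """Return the candidate alpha with the lowest cross-validated error."""
    return min(candidates, key=lambda a: cv_mse(X, y, a, k))
```

Here each row of `X` would hold the feature values of one training pair (alignment and embedding similarity for run 1) and `y` the corresponding human scores.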

Run 2
Run 2 employs a model similar to the run 1 model, but uses the entire feature set described in Section 2.1. The same training sets are used for each test set and the model parameter α is again set using cross-validation on training data.

Run 3
Run 3 is identical to run 2, except that it assigns a lower value to the regularization parameter α (100 as opposed to 500 in run 2).
Table 2 shows the performances of the three runs (measured by Pearson's r, the official evaluation metric at SemEval STS) alongside the score for the [...]. Runs 1 and 3 have very similar overall performances, slightly better than that of run 2. Among the different test sets, the models perform well on news headlines, plagiarism and machine translation data, but poorly on the Q&A forums data.

Ablation Study
From the overall performances in Table 2, it is clear that the three new features added to the Sultan et al. (2015) model do not improve performance. Therefore, we run a feature ablation study only on the run 1 model. Table 3 shows the results. Similar to the findings reported in Sultan et al. (2015), the alignment-based feature performs better than the embedding feature across test sets. However, the addition of the embedding feature improves performance on almost all test sets.

Relation between the Runs
We compute pairwise correlations between the predictions of our three runs to see how different they are. As Table 4 shows, the predictions are highly correlated, which is expected given the results in Table 2.
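The pairwise comparison can be sketched as below, using the same Pearson product-moment correlation that serves as the official evaluation metric. The function names and the dictionary layout of the predictions are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


def pairwise_correlations(predictions):
    """predictions: dict mapping run name -> list of predicted scores.

    Returns a dict mapping each unordered pair of run names to Pearson's r.
    """
    return {(a, b): pearson_r(predictions[a], predictions[b])
            for a, b in combinations(sorted(predictions), 2)}
```

The same `pearson_r` function, applied to a run's predictions and the human scores of a test set, gives that run's evaluation score on the set.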

Conclusions and Future Work
We present three supervised models of sentence similarity based on the winning system at SemEval-2015 (Sultan et al., 2015). Our additional features do not improve performance, and results show similar influences of alignment and embedding features as in SemEval-2015.
Table 4: Pairwise correlations between the three runs.
Besides high performance, the run 1 model has the key advantage of simplicity and high replicability. All the major design components are also available for free download (links provided in Section 2).
A key limitation of the system is its inability to model semantics of units larger than words (phrasal verbs, idioms, and so on). This is an important future direction not only for our system but also for STS and text comparison tasks in general. Incorporation of stop word semantics is key to identifying similarities and differences in subtle aspects of sentential semantics like polarity and modality. Domain-specific learning of the word vectors can also improve results.