DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition

We describe a set of top-performing systems at the SemEval 2015 English Semantic Textual Similarity (STS) task. Given two English sentences, each system outputs the degree of their semantic similarity. Our unsupervised system, which is based on word alignments across the two input sentences, ranked 5th among 73 submitted system runs with a mean correlation of 79.19% with human annotations. We also submitted two runs of a supervised system which uses word alignments and similarities between compositional sentence vectors as its features. Our best supervised run ranked 1st with a mean correlation of 80.15%.


Introduction
Identification of short text similarity is an important research problem with applications in many areas: natural language processing (machine translation, text summarization), information retrieval (question answering), education (short answer scoring), and so on. The SemEval Semantic Textual Similarity (STS) task series (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015) has become a central platform for the task: a publicly available corpus of more than 14,000 sentence pairs has been developed over the past four years with human annotations of similarity for each pair, and a total of 290 system runs have been evaluated.
In this article, we describe a set of systems that were submitted at the SemEval 2015 English STS task (Agirre et al., 2015). Given two English sentences, the objective is to compute their semantic similarity in the range [0,5], where the score increases with similarity (i.e., 0 indicates no similarity and 5 indicates identicality). The official evaluation metric was the Pearson correlation coefficient with human annotations. The best of our three system runs achieved the highest mean correlation (80.15%) with human annotations among all submitted systems on five test sets (containing a total of 3000 test pairs).
Early work on sentence similarity (Corley and Mihalcea, 2005; Mihalcea et al., 2006; Li et al., 2006; Islam and Inkpen, 2008) established the basic procedural framework under which most modern algorithms operate: computing sentence similarity as a mean of word similarities across the two input sentences. With no human annotated STS data set available, these algorithms were unsupervised and were evaluated extrinsically on tasks like paraphrase detection and textual entailment recognition. The SemEval STS task series has made an important contribution through the large annotated data set, enabling intrinsic evaluation of STS systems and making supervised STS systems a reality.
At SemEval 2012, domain-specific training data was provided for most of the test pairs (Agirre et al., 2012) and consequently, supervised systems were the most successful (Bär et al., 2012; Šarić et al., 2012). These systems combined different similarity measures, e.g., lexico-semantic, syntactic and string similarity, using regression models. However, at the 2013 and 2014 STS events, no such training data was provided; instead, participants were allowed to use all past data to train their systems. Interestingly, the best systems at these two events were unsupervised (Han et al., 2013; Sultan et al., 2014b); some supervised systems did well, too (Wu et al., 2013; Lynum et al., 2014). The core component of a typical unsupervised system is term alignment: semantically related terms across the two sentences are aligned first, and the semantic similarity of the sentences is then computed as a monotonically increasing function of the degree of alignment. At SemEval 2015, we submitted an unsupervised system based on word alignments which is almost identical to our winning system at SemEval 2014 (Sultan et al., 2014b). We also submitted a supervised ridge regression model that uses (1) the output of our unsupervised system, and (2) the cosine similarity between the vector representations of the two sentences (derived from neural word embeddings of their content words (Baroni et al., 2014)) as its features. Our unsupervised system ranked 5th and the two supervised runs ranked 1st and 3rd. Evaluation also shows that our best run outperforms the winning systems at all past SemEval STS events.
[Figure: example word alignment for a sentence pair that includes "Robin Warren was awarded a Nobel Prize." (We show only part of the second sentence.) Besides exact word/lemma matches, it identifies and aligns semantically similar word pairs using PPDB (awarded - received in this example).]

System Overview
We describe our three system runs in this section in order of their complexity: new capabilities and/or features are added with each run.

Run 1: U
This is an unsupervised system that first aligns related words across the two input sentences and then outputs the proportion of aligned content words as their semantic similarity. It is similar to our system from last year (Sultan et al., 2014b), which is based on the word aligner described in (Sultan et al., 2014a). However, where last year's system computed a separate proportion for each sentence and then took their harmonic mean, this year's system computes a single proportion over all words in the two sentences. In other words, given sentences $S^{(1)}$ and $S^{(2)}$,

$$sim(S^{(1)}, S^{(2)}) = \frac{n_c^a(S^{(1)}) + n_c^a(S^{(2)})}{n_c(S^{(1)}) + n_c(S^{(2)})}$$

where $n_c(S^{(i)})$ and $n_c^a(S^{(i)})$ are the number of content words and the number of aligned content words in $S^{(i)}$, respectively. This is a conceptually simpler step and yielded better experimental results on data from past STS events.
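As a minimal sketch of the scoring step only (the alignment itself is assumed to have been computed already), the single-proportion score can be written as below. The function name, its arguments, and the zero-content-word fallback are illustrative assumptions, not part of the described system.

```python
def alignment_similarity(n_content_1, n_aligned_1, n_content_2, n_aligned_2):
    """Proportion of aligned content words, pooled over both sentences.

    n_content_i: number of content words in sentence i
    n_aligned_i: number of aligned content words in sentence i
    """
    total_content = n_content_1 + n_content_2
    if total_content == 0:
        # Assumption: with no content words at all, report zero similarity.
        return 0.0
    return (n_aligned_1 + n_aligned_2) / total_content


# Sentence 1: 4 content words, 3 aligned; sentence 2: 6 content words, 3 aligned.
score = alignment_similarity(4, 3, 6, 3)
```

In this toy case 6 of the 10 content words are aligned, so the score is 0.6.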
The aligner aligns words based on their semantic similarity and the similarity between their local semantic contexts in the two sentences. It uses the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) to identify semantically similar words, and relies on dependencies and surface-form neighbors of the two words to determine their contextual similarity. Word pairs are aligned in decreasing order of a weighted sum of their semantic and contextual similarity. We also consider a Levenshtein distance of 1 between a misspelled word and a correctly spelled word (of length > 2) to be a match. In all runs, we truncate the output at the extremes to keep the score in [0, 5].
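The misspelling rule and the final truncation can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and because the sketch has no dictionary to tell which word is the correctly spelled one, it applies the length > 2 condition to both words.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_spelling_match(word_1, word_2):
    """Treat two words as a match if they are exactly one edit apart.

    Assumption: both words must be longer than 2 characters.
    """
    if len(word_1) <= 2 or len(word_2) <= 2:
        return False
    return levenshtein(word_1, word_2) == 1


def clamp_score(score, low=0.0, high=5.0):
    """Truncate a similarity score at the extremes of the [0, 5] range."""
    return max(low, min(high, score))
```

Note that plain Levenshtein distance counts a transposition of adjacent letters as two edits, so a pair such as "nobel" / "noble" would not be matched by this rule.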

Run 2: S1
A fundamental limitation of our unsupervised system is that it relies only on PPDB to identify semantically similar words; consequently, similar word pairs are limited to lexical paraphrases. Hence it fails to utilize semantic similarity or relatedness between non-paraphrase word pairs (e.g., 'sister' and 'related'). In this run, we leverage neural word embeddings to overcome this limitation. We use the 400-dimensional vectors developed by Baroni et al. (2014). They used the word2vec toolkit to extract these vectors from a corpus of about 2.8 billion tokens. These vectors performed exceedingly well across different word similarity data sets in their experiments. Details on their approach and findings can be found in (Baroni et al., 2014).
Instead of comparing word vectors across the two input sentences, we adopt a simple vector composition scheme to construct a vector representation of each input sentence and then take the cosine similarity between the two sentence vectors as our second feature for this run. The vector representing a sentence is the centroid (i.e., the componentwise average) of its content lemma vectors.
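The composition scheme described above can be sketched as follows. The 3-dimensional toy vectors stand in for the 400-dimensional embeddings, and the function names, the handling of out-of-vocabulary lemmas (skipped), and the fallback of 0.0 for empty or zero vectors are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Componentwise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector_similarity(lemmas_1, lemmas_2, embeddings):
    """Cosine similarity between the centroids of two lemma lists."""
    # Assumption: lemmas without an embedding are simply skipped.
    vecs_1 = [embeddings[w] for w in lemmas_1 if w in embeddings]
    vecs_2 = [embeddings[w] for w in lemmas_2 if w in embeddings]
    if not vecs_1 or not vecs_2:
        return 0.0
    return cosine(centroid(vecs_1), centroid(vecs_2))


# Toy embeddings, invented for this example only.
toy_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.9, 0.1, 0.0],
    "run": [0.0, 1.0, 0.0],
    "stock": [0.0, 0.0, 1.0],
}
related = sentence_vector_similarity(["dog", "run"], ["cat", "run"], toy_embeddings)
unrelated = sentence_vector_similarity(["dog"], ["stock"], toy_embeddings)
```

Here `related` is close to 1 and `unrelated` is 0, since the two toy vectors in the second pair are orthogonal.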
Finally, we combine the two features (the output of our unsupervised run U and the sentence vectors' cosine similarity) using a ridge regression model (implemented in scikit-learn (Pedregosa et al., 2011), with α = 1.0 and the 'auto' solver that automatically selects a feature weight learning algorithm from a pool depending on the type of the data). The model is trained using annotations from SemEval 2012-2014 (details in Section 3).
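To make the regression step concrete, the sketch below solves a two-feature ridge regression in closed form using only the standard library. It is not the scikit-learn implementation the system used; it assumes the usual objective ||y - Xw - b||^2 + α||w||^2 with an unpenalized intercept, which is our understanding of what `sklearn.linear_model.Ridge(alpha=1.0, solver='auto')` fits with default settings. The function names and the toy data are hypothetical.

```python
def fit_ridge_two_features(X, y, alpha=1.0):
    """Closed-form ridge regression for rows X = [(f1, f2), ...] and targets y.

    Features and targets are centered so that the intercept is not penalized.
    Returns (w1, w2, intercept).
    """
    n = len(X)
    m1 = sum(r[0] for r in X) / n
    m2 = sum(r[1] for r in X) / n
    my = sum(y) / n
    # Entries of the regularized normal equations (Xc^T Xc + alpha * I) w = Xc^T yc.
    a11 = sum((r[0] - m1) ** 2 for r in X) + alpha
    a22 = sum((r[1] - m2) ** 2 for r in X) + alpha
    a12 = sum((r[0] - m1) * (r[1] - m2) for r in X)
    b1 = sum((r[0] - m1) * (t - my) for r, t in zip(X, y))
    b2 = sum((r[1] - m2) * (t - my) for r, t in zip(X, y))
    det = a11 * a22 - a12 * a12  # strictly positive whenever alpha > 0
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2, my - w1 * m1 - w2 * m2


def predict(model, alignment_score, vector_cosine):
    """Predict a similarity score and truncate it to the [0, 5] range."""
    w1, w2, intercept = model
    raw = w1 * alignment_score + w2 * vector_cosine + intercept
    return max(0.0, min(5.0, raw))


# Toy training data: (alignment feature, cosine feature) -> gold score.
toy_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_y = [1.0, 3.0, 4.0, 6.0]
model = fit_ridge_two_features(toy_X, toy_y, alpha=1.0)
```

On this toy data the unregularized fit would have weights (2, 3); with α = 1.0 they shrink to (1.0, 1.5), which illustrates the effect of the penalty.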

Run 3: S2
The aligner used in our previous two runs aligns content words even if there are no similarities between their contexts in the two sentences. In this run, we use an alignment-based feature (in addition to our two features in S1) where content words are aligned only if they have some contextual similarity, i.e., a common word either in their dependencies or in a neighborhood of 3 words to the left and 3 words to the right (considering only content words for the latter).
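The surface-neighborhood half of this condition can be sketched as below; the dependency-based half needs a parser and is omitted. Several details are assumptions: the window is taken as 3 tokens on each side and then filtered to content words, "common word" is read as an exact lowercase match, the stop word list is a toy one, and the second and third example sentences are invented.

```python
TOY_STOP_WORDS = {"a", "an", "the", "was", "is", "in", "of", "to", "for"}


def content_neighbors(tokens, index, window=3, stop_words=TOY_STOP_WORDS):
    """Content words within `window` tokens to the left and right of `index`."""
    low = max(0, index - window)
    high = min(len(tokens), index + window + 1)
    return {
        tokens[k].lower()
        for k in range(low, high)
        if k != index and tokens[k].lower() not in stop_words
    }


def share_neighbor(tokens_1, i, tokens_2, j, window=3):
    """True if the words at positions i and j have a common content neighbor."""
    left = content_neighbors(tokens_1, i, window)
    right = content_neighbors(tokens_2, j, window)
    return bool(left & right)


s1 = "Robin Warren was awarded a Nobel Prize".split()
s2 = "Warren received the Nobel Prize in 2005".split()  # hypothetical sentence
s3 = "He received a package yesterday".split()          # hypothetical sentence

has_context = share_neighbor(s1, 3, s2, 1)  # 'awarded' vs 'received'
no_context = share_neighbor(s1, 3, s3, 1)   # 'awarded' vs 'received'
```

In the first pair the two words share the neighbors "warren", "nobel" and "prize", so they would remain eligible for alignment under this feature; in the second pair they share none.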

Data
The 3000 test sentence pairs at SemEval 2015 were divided into five sets, each consisting of pairs from a different domain. Each pair was assigned similarity scores in the range [0, 5] by multiple human annotators (0: no similarity, 5: identicality) and the average of the annotations was taken as their final similarity score. We describe each data set briefly in Table 1. We trained our supervised systems using data from the past three years of SemEval STS (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014). For answers-forums, answers-students and belief, we used all past annotations. For headlines, we used all headlines (2013), headlines (2014), deft-news (2014) and smtnews (2012) pairs. For images, we used all msrpar (2012; train and test), msrvid (2012; train and test) and images (2014) pairs. The specific training corpus selections for the two latter data sets were based on our experiments with past headlines and images data, where these subsets yielded better results than an all-inclusive training set (seemingly because they were drawn from similar domains and were still large enough to provide the model with effective supervision).

Evaluation
In addition to the official evaluation at SemEval 2015, we report evaluation results on past STS (2012-2014) test data. For all these evaluations, the performance metric is the Pearson correlation coefficient between system output and average human annotations. Correlation is computed for each individual test set, and a weighted sum of all correlations (i.e., over all test sets) is used as the final evaluation metric. The weight of a test set is proportional to the number of sentence pairs it contains.
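The evaluation metric described above can be sketched as follows. The function names and the toy scores are illustrative; the sketch assumes the weights are normalized to sum to 1 and that no test set has constant scores (which would make Pearson's r undefined).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def weighted_mean_correlation(test_sets):
    """Weighted sum of per-set correlations.

    test_sets: list of (system_scores, gold_scores) pairs, one per test set.
    Each set's weight is its share of the total number of sentence pairs.
    """
    total_pairs = sum(len(gold) for _, gold in test_sets)
    return sum(
        (len(gold) / total_pairs) * pearson(system, gold)
        for system, gold in test_sets
    )


# Two toy test sets: perfect agreement on 3 pairs, perfect disagreement on 2.
toy_sets = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([1.0, 2.0], [2.0, 1.0]),
]
final_metric = weighted_mean_correlation(toy_sets)
```

For the toy sets the per-set correlations are 1 and -1 with weights 3/5 and 2/5, giving a final value of about 0.2.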
Before presenting the results, we describe a preprocessing step for one of the 2015 test sets. Identifying the right stop words (some of which can be domain-specific) proved key in our past investigation of STS (Sultan et al., 2014b); therefore we consider it very important to manually examine individual domains to ensure proper categorization of words. An inspection of the trial data for the answers-students set indicated that the expressions in the following pairs are semantically equivalent for the given domain: {'battery terminal', 'terminal'} and {'electrical state', 'state'}. Therefore, we treated the two words 'battery' and 'electrical' as special stop words when these pairs occurred across the input sentences.
Table 2: Performance on STS 2015 data. Each number in rows 1-5 is the correlation between system output and human annotations for the corresponding data set. The rightmost column shows the best score by any system. The last two rows show the value of the final evaluation metric and the system rank, respectively, for each run.

STS 2015 Results
The performance of our three runs on each of the STS 2015 test sets is shown in Table 2. Each bold number represents the best score by any system on the corresponding test set and each italic number shows the best score among our runs. The weighted mean of correlations and the rank for each run are also shown. Our best run (S1) did not perform the best on all test sets (in fact it did so on only one test set), but it maintained the best balance across all test sets. The second best overall system run (ExBThemisthemisexp) had a mean correlation of 79.42%. We found the difference of 0.73% between this system and S1 to be statistically significant at p < 0.0001 in a two-sample one-tailed z-test (unlike last year's 0.05% (Agirre et al., 2014)).
The third feature in S2 did not prove useful, as S2 performed worse than S1 on almost all test sets. This result falls in line with our observation reported in (Sultan et al., 2014a): "more often than not content words are inherently sufficiently meaningful to be aligned even in the absence of contextual evidence when there are no competing pairs."
Table 3: Performance of our top system (S1) on past STS test sets (mean correlation with human annotations). The score of the winning system at each event is shown in column 3. S1 outperforms all past winning systems. (Columns: Year, S1, Winning.)
Contrary to our findings from past years' data, the special stop words for the answers-students test set (discussed in the previous section) did not improve performance: considering these words as content words, we observed a slightly higher correlation of 0.7895 for our unsupervised system U.

Results on Past Test Sets
Table 3 shows the performance of our best system S1 on test data from SemEval 2012-2014. To ensure fair comparison with other systems, for years 2013 and 2014, we used only past data to train our model. For year 2012, we used the designated training data for test sets msrpar, msrvid and smteuroparl, and all 2012 training pairs for the other two test sets.
S1 outperformed all winning systems from 2012 through 2014. Without any domain-specific training data, the top systems at SemEval 2013 and 2014 were unsupervised. S1 achieved the best performance on both despite its supervised nature.

Ablation Study
We performed a feature ablation study for S1 on STS 2015 data to determine the relative importance of its two features. Table 4 shows the results. Columns 2 and 4 show the performance of our U and S1 systems. (Remember that the former is used as a feature by the latter.) Column 3 shows the performance of the second feature of S1 (i.e., cosine similarity between the sentence vectors) as a measure of STS.
On four of the five test sets, U outperformed sentence vector similarity. However, combining the two features improved system performance on four out of five test sets, and overall. These results indicate that each feature captures aspects of STS that the other does not and consequently the two complement each other when used together.

Conclusions and Future Work
At SemEval 2014, we reported a top-performing unsupervised STS system (Sultan et al., 2014b) that relied only on word alignment. This year, we present a supervised system that is statistically significantly better than last year's system. Combining a vector similarity feature derived from word embeddings with alignment-based similarity, it outperforms all past and current STS systems. Since it makes use of only off-the-shelf software and data, it is easily replicable as well.
The primary limitation of our system is the inability to model semantics of units larger than words (phrasal verbs, idioms, and so on). This is an important future direction not only for our system but also for STS and text comparison tasks in general. Incorporation of stop word semantics is key to identifying similarities and differences in subtle aspects of sentential semantics like polarity and modality. Finally, rather than studying STS as a standalone problem, the time has come to develop algorithms that can adapt to requirements posed by different data domains and applications.