SemEval-2016 Task 1: Inferring sentence-level semantic similarity from an ensemble of complementary lexical and sentence-level features

We present a description of the system submitted to the Semantic Textual Similarity (STS) shared task at SemEval 2016. The task is to assess the degree to which two sentences carry the same meaning. We have designed two different methods to automatically compute a similarity score between sentences. The first method combines a variety of semantic similarity measures as features in a machine learning model. In our second approach, we employ training data from the Interpretable Similarity subtask to create a combined word-similarity measure and assess the importance of both aligned and unaligned words. Finally, we combine the two methods into a single hybrid model. Our best-performing run attains a score of 0.7732 on the 2015 STS evaluation data and 0.7488 on the 2016 STS evaluation data.


Introduction
If you ask a computer if two sentences are identical, it will return an accurate decision in a split-second. Ask it to do the same for a million sentence pairs and it will take a few seconds, far quicker than any human. But similarity has many dimensions. Ask a computer if two sentences mean the same and it will stall, yet a human can answer instantly. Now we have the edge. But what metrics do we use in our personal cognitive similarity-scoring systems? The answer is at the heart of the semantic similarity task. In our solution, we have incorporated several categories of features from a variety of sources. Our approach covers both low-level visual features such as length and edit-distance and high-level semantic features such as topic models and alignment quality measures.
Throughout our approach, we have found it easier to consider similarity at the lexical level. This is unsurprising, as each lexeme may be seen as a semantic unit, with the sentence's meaning built directly from a specific combination of lexemes. We have explored several methods for abstracting our lexical-level features to the sentence level, as explained in Section 4. Our methods are built from both intuitive features and the results of error analyses on our trial runs. We have combined multiple feature sources to build a powerful semantic-similarity classifier, discarding non-performing features as necessary.

Related Work
The presented system has been submitted to the Semantic Textual Similarity (STS) shared task at SemEval 2016, which has been organised since 2012 (Agirre et al., 2016). Existing approaches to the problem adopt a plethora of similarity measures including string-based, content-based and knowledge-based methods. String-based methods (Bär et al., 2012; Malakasiotis and Androutsopoulos, 2007; Jimenez et al., 2012) exploit surface features (e.g., character n-grams, lemmas) to compute a semantic similarity score between two sentences. Bär et al. (2012) showed that string-based features improve performance when using machine learning. Knowledge-based features (Mihalcea et al., 2006; Gabrilovich and Markovitch, 2007) estimate the semantic similarity of textual units using external knowledge resources (e.g., WordNet). As an example, Liu et al. (2015) used the shortest path that links two words in the WordNet taxonomy as a similarity measure between words. To calculate a similarity score between sentences, the authors used the sum of the similarity scores of the constituent words. Content-based features are based upon the distributional similarity of words and sentences. Distributional semantics methods (Mikolov et al., 2013; Baroni et al., 2014) encode the lexical context of words into a vector representation. A vector representation of a sentence may be estimated as a function of the vectors of the constituent words (Mitchell and Lapata, 2010). The semantic relatedness between sentences is measured using the cosine of the angle between the composed sentence vectors.

Feature Sources
We have collected a variety of resources to help us create semantic similarity features. Some of these (Subsections 3.2 and 3.4) give features at the level of the sentence itself. Other resources (Subsections 3.1, 3.3, 3.5, 3.6 and 3.7) give features at the level of the individual words. We explain how we adapt these features to sentence-level metrics in Section 4.

Distributional semantics
We use a count-based distributional semantics model (Turney and Pantel, 2010) and the Continuous Bag-Of-Words (CBOW) model (Mikolov et al., 2013) to learn word vectors. The training corpus, about 20 GB in size, is a combination of all monolingual texts provided by the organiser of the 2014 Machine Translation Shared Task. Before training, we tokenised the corpus and converted it to lowercase. The size of the context window is 5 for both models. The resulting vectors have 150,000 dimensions for the count-based model and 300 dimensions for the CBOW model.

Machine translation
A pair of input sentences can be considered as the input and output of a machine translation system. Therefore, we can apply machine translation (MT) metrics to estimate the semantic relatedness of the input pair. Specifically, we used three popular MT metrics: BLEU (Papineni et al., 2002), Translation Edit Rate (TER) (Snover et al., 2006), and METEOR (Denkowski and Lavie, 2014).

Lexical paraphrase scores
Another promising resource for similarity estimation is the lexical paraphrase database (PPDB) by Ganitkevitch et al. (2013). Each of the word pairs in the PPDB has a set of 31 different features (Ganitkevitch and Callison-Burch, 2014). In this work, we calculate the similarity of a word pair by using the formula that Ganitkevitch and Callison-Burch (2014) recommended to measure paraphrases' quality. However, for word pairs that have been seen very rarely, i.e., their rarity penalty score in PPDB is higher than 0.1, we simply set the similarity score at 0 instead of applying the formula.

Topic modelling
We induce a topic-based vector representation of sentences by applying the Latent Dirichlet Allocation (LDA) method (Blei et al., 2003). We hypothesise that varying the granularity of the vector representations provides complementary information to the machine learning system. Based upon this, we extract 26 different topic-based vector representations by varying the number of topics, starting from a small number of 5 topics, which results in a coarse-grained topic-based representation, and going up to 800 topics, which produces a more fine-grained representation. In our experiments, we used the freely available MALLET toolkit (McCallum, 2002). Additionally, we performed hyper-parameter optimisation every 10 Gibbs sampling iterations and set the total number of iterations to 2,000.

WordNet
For WordNet-based similarity between a pair of words we have chosen Jiang-Conrath (Jiang and Conrath, 1997) similarity, based on an evaluation by Budanitsky and Hirst (2006). To compute the score, we lemmatise the words using Stanford CoreNLP (Manning et al., 2014), find the corresponding synsets in Princeton WordNet (Fellbaum, 1998) and obtain the Jiang-Conrath value using the WS4J library.

Character string
Sometimes semantically related words are very similar as sequences of characters, e.g. cooperate and co-operate or recover and recovery. To handle such cases we compute the Levenshtein distance (Levenshtein, 1966) between words. To keep the similarity score x_l in the [0, 1] range, we adjust the obtained distance d by computing x_l = (l − d)/l, where l denotes the length of the longer word.
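The adjusted character-string score can be illustrated with a minimal Python sketch. The function names are our own choice for illustration, and the handling of two empty strings is an assumption that the text above does not cover.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """x_l = (l - d) / l, where l is the length of the longer word."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longer - levenshtein(a, b)) / longer


print(string_similarity("cooperate", "co-operate"))
print(string_similarity("recover", "recovery"))
```

For the two example pairs above the distance is a single edit, so the score stays close to 1.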

Word importance measures
We calculated the combined probability of each word occurring in a given context. We used smoothed unigram and bigram probabilities taken from the Google Web1T data (Brants and Franz, 2006). We multiplied the smoothed unigram probability of a word together with the smoothed bigram probability of both the word itself and the word immediately before it to give the 2-word contextual probability of a word's appearance. Whilst this model could theoretically be extended to longer sequences, we found that memory resources limited our capacity to a sequence of 2 words.
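As a rough illustration of the 2-word contextual probability, the sketch below multiplies a smoothed unigram probability by a smoothed bigram probability. The add-alpha smoothing, the toy counts and the fallback for sentence-initial words are assumptions made for the example; the text above does not specify the smoothing scheme applied to the Web1T counts.

```python
from collections import Counter


def contextual_probability(tokens, index, unigrams, bigrams, alpha=1.0):
    """P(w_i) * P(w_{i-1}, w_i), both add-alpha smoothed (assumed scheme)."""
    vocab = len(unigrams)
    word = tokens[index]
    p_unigram = (unigrams.get(word, 0) + alpha) / (
        sum(unigrams.values()) + alpha * vocab)
    if index == 0:
        # Assumption: with no preceding word, fall back to the unigram only.
        return p_unigram
    pair = (tokens[index - 1], word)
    p_bigram = (bigrams.get(pair, 0) + alpha) / (
        sum(bigrams.values()) + alpha * vocab * vocab)
    return p_unigram * p_bigram


# Toy counts standing in for the Web1T n-gram tables.
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

print(contextual_probability(["the", "cat"], 1, unigrams, bigrams))
print(contextual_probability(["on", "cat"], 1, unigrams, bigrams))
```

A word preceded by a context in which it has been observed receives a higher score than the same word in an unseen context.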
We also investigated psycholinguistic properties as measures of word importance. We used the MRC Psycholinguistic norms (Wilson, 1988) to obtain values for the 'familiarity', 'imagery' and 'concreteness' of each word. These metrics can be defined as follows:
Familiarity: This indicates how likely a word is to be recognised by a user. Words which occur very often such as cat are likely to be more familiar than less common words such as feline.
Concreteness: This indicates whether a reader perceives a word as referring to a physical entity. A conceptual term such as truth will have a lower value for concreteness than an object which is more easily relatable such as house or man.
Imagery: This metric indicates how easy it is to call-up images associated with a word. This is related to concreteness, but may differ in some cases. For example, some actions (jumping, flying) or common emotions (joy, sadness, fear) may have high imagery, but low concreteness.

Feature Generation
We found that the greatest challenge of the task was to combine different existing word-level relatedness cues into a sentence-level similarity measure. We have submitted three permutations of our system, as allowed by the task. The first system 'Aggregation (Macro)' passes a set of 36 features through a random forest classifier. The second system 'Alignment (Micro)' first aligns the sentences using word-level metrics and then calculates 49 features describing this alignment. Our final system 'Hybrid' is simply the combination of the former two approaches.

Aggregation (Macro)
In this approach feature sources are used separately, each applied as a sentence similarity measure that is further represented as a single feature.
• Compositional semantic vectors. A sentence vector is simply calculated by cumulatively adding its component word vectors. The similarity of a sentence pair is then estimated by the cosine of the angle between the corresponding vectors.
• Average maximum cosine. Given a sentence pair (s, t) whose numbers of words are m and n respectively, the similarity score is calculated as the average of the maximum cosine similarity of word pairs.
• MT metrics. We apply MT metrics at the sentence level. We used smoothed BLEU-4 and TER as implemented in Stanford Phrasal, while the METEOR score was computed using the techniques of Denkowski and Lavie (2014).
• Paraphrase DB. To compute sentence similarity using PPDB, we find the most similar counterpart for each of the words and return an average of obtained similarity scores, weighted by word length.
• Topic modelling metric. Given that each sentence is represented by a topic-based vector, we can compute a similarity score for a sentence pair using the cosine similarity.
This leads to 36 features, each expected to be positively correlated with sentence similarity.
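The two vector-based aggregation measures listed above can be sketched as follows. This is an illustration under stated assumptions rather than the submitted implementation: the toy vectors are invented, words without a vector are skipped, and the symmetric normalisation in average_max_cosine (mean of the two directions) is our assumption.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is empty or zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def additive_sentence_vector(words, vectors):
    """Sum the vectors of the words in a sentence; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    return [sum(column) for column in zip(*known)] if known else []


def average_max_cosine(s, t, vectors):
    """Average, over words, of the best cosine match in the other sentence."""
    def one_direction(source, target):
        src = [vectors[w] for w in source if w in vectors]
        tgt = [vectors[w] for w in target if w in vectors]
        if not src or not tgt:
            return 0.0
        return sum(max(cosine(u, v) for v in tgt) for u in src) / len(src)
    # Assumption: symmetrise by averaging the two directions.
    return (one_direction(s, t) + one_direction(t, s)) / 2


# Invented 2-dimensional vectors, for illustration only.
vectors = {"cat": [1.0, 0.0], "feline": [0.9, 0.1], "dog": [0.0, 1.0]}
s, t = ["the", "cat"], ["a", "feline"]
print(cosine(additive_sentence_vector(s, vectors),
             additive_sentence_vector(t, vectors)))
print(average_max_cosine(s, t, vectors))
```

The same cosine function applies unchanged to the topic-based sentence vectors.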

Alignment (Micro)
In this approach we combine different techniques for assessing relatedness to create a single word similarity measure, use it to align the compared sentences, and compute features describing the quality of the alignment.

Word-level similarity
We employ a machine learning model to create a measure of semantic similarity between words. The model uses four measures:
1. adjusted Levenshtein distance x_l,
2. word-vector cosine similarity x_wv,
3. WordNet Jiang-Conrath similarity x_wn,
4. paraphrase database score x_p.
Each measure returns values between 0 (no similarity) and 1 (identical words). As a training set, we have used all the data from the previous year's interpretable subtask (Agirre et al., 2015), which includes pairs of chunks with an assigned similarity between 0 and 5. As this set contains pairs of chunks, not words, we extended these measures to multiword arguments. This has been done by (1) using the whole chunk for Levenshtein distance, (2) finding the best pair for WordNet similarity and (3) using the solutions for sentence-level aggregation (described in the previous section) for word vectors and the paraphrase database. Negative examples have been created by selecting unaligned pairs and assigning them a score of 0. In that way we obtain 19,673 training cases, from which a linear regression model has been deduced.
The coefficients of this model show that the word vectors and WordNet features have a lower influence on the final score output by the model, whereas the paraphrase database score has a greater influence. Although the final model was trained on all the data from last year's interpretable similarity subtask, we performed a separate evaluation on this data in which we partitioned the data into train and test subsets. The resulting correlation on the test subset was 0.8964, which we consider to be very reasonable.
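To make the fitting step concrete, the sketch below fits an ordinary least-squares model to the four word-level measures using the normal equations. It is only an illustration: the training rows and the coefficients used to generate them are made-up toy values, not the coefficients of the model described above, and the function names are hypothetical.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares with an intercept, solving (X^T X) w = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(x[i] * x[j] for x in design) for j in range(k)]
           + [sum(x[i] * y for x, y in zip(design, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(weights, features):
    """Score on the 0-5 scale from (x_l, x_wv, x_wn, x_p)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


# Toy training data generated from invented coefficients (illustration only).
toy_weights = [0.5, 1.0, 0.5, 0.25, 2.0]
rows = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0),
        (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1)]
targets = [predict(toy_weights, row) for row in rows]

weights = fit_linear_regression(rows, targets)
print([round(w, 6) for w in weights])
```

In practice a statistics package would be used for this step; the hand-written solver only keeps the example self-contained.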

Finding alignment
Having a universal similarity measure between words, we can compute an alignment of sentences. To do this, we tokenise each sentence and compute a similarity value between every pair of words. Then we find an alignment in a greedy way, by pairing free words in order of decreasing pair similarity. The process stops when the similarity of the remaining pairs drops to 1 (on the 0-5 scale, see previous section), which usually leaves some of the words unaligned.
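The greedy procedure can be sketched in a few lines of Python. The similarity function passed in is a stand-in for the combined word-level measure, the toy scores are invented, and treating a score of exactly 1 as below the cut-off is our reading of the stopping rule.

```python
def greedy_align(words_a, words_b, similarity, threshold=1.0):
    """Pair free words in order of decreasing similarity (0-5 scale)."""
    candidates = sorted(
        ((similarity(a, b), i, j)
         for i, a in enumerate(words_a)
         for j, b in enumerate(words_b)),
        key=lambda candidate: -candidate[0])
    used_a, used_b, alignment = set(), set(), []
    for score, i, j in candidates:
        if score <= threshold:
            break  # assumption: pairs scoring 1 or less are left unaligned
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            alignment.append((i, j, score))
    return alignment


def toy_similarity(a, b):
    """Invented stand-in for the learned word similarity measure."""
    if a == b:
        return 5.0
    if {a, b} == {"a", "the"}:
        return 2.0
    return 0.0


print(greedy_align(["a", "cat", "sat"], ["the", "cat", "sat", "down"],
                   toy_similarity))
```

In this toy example the word down remains unaligned because no candidate pair involving it scores above the threshold.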

Features
The features generated in this approach describe the quality of the alignment of a pair of sentences. The simple measures are: the mean similarity between aligned pairs, the length of the aligned parts as a proportion of sentence length (averaged over the two sentences) and the number of crossings in the permutation defined by the alignment, i.e. Kendall's tau (Kendall, 1955). Secondly, we also include sums of lengths of aligned and unaligned words with particular tags, using Stanford CoreNLP (Manning et al., 2014) with a slightly simplified Brown tagset (19 tags). This follows our intuition that the significance of success or failure in aligning a particular word differs between parts of speech (e.g. proper nouns vs. determiners). Finally, we measure the importance of matched and unmatched words by their probability (sum of logarithms) and by psycholinguistic metrics: familiarity, concreteness and imagery (sum of values). As a result of this process we get 49 features.
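The three simple alignment measures can be illustrated as follows. The sketch assumes an alignment given as (index in first sentence, index in second sentence, score) triples, measures length in words rather than characters, and counts crossings as discordant pairs, the quantity underlying Kendall's tau; all three choices are assumptions made for the example.

```python
def alignment_features(alignment, len_a, len_b):
    """Mean similarity, aligned proportion and number of crossings."""
    if not alignment:
        return {"mean_similarity": 0.0, "aligned_proportion": 0.0,
                "crossings": 0}
    mean_similarity = sum(score for _, _, score in alignment) / len(alignment)
    # Assumption: length is counted in words, averaged over both sentences.
    aligned_proportion = (len(alignment) / len_a + len(alignment) / len_b) / 2
    # Order pairs by position in the first sentence, then count inversions
    # among the corresponding positions in the second sentence.
    targets = [j for _, j, _ in sorted(alignment)]
    crossings = sum(1
                    for x in range(len(targets))
                    for y in range(x + 1, len(targets))
                    if targets[x] > targets[y])
    return {"mean_similarity": mean_similarity,
            "aligned_proportion": aligned_proportion,
            "crossings": crossings}


print(alignment_features([(0, 1, 4.0), (1, 0, 2.0), (2, 2, 3.0)], 3, 4))
```

A monotone alignment yields zero crossings, while swapped word order increases the count.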

Hybrid
The hybrid approach simply combines outputs from the two feature generation solutions described above. This leads to 85 features, some of which may be redundant.

Learning
Each of the three feature sets described above has been used to create a single regression model. For this purpose we explored different methods available in the R environment (R Core Team, 2013): linear regression, decision trees (Breiman et al., 1984), multivariate adaptive regression splines (Friedman, 1991) and generalised additive models (Hastie and Tibshirani, 1990), but internal evaluation has shown that random forests (Breiman, 2001) perform the best in this task. Since the dimensionality of the task is not very high, we have not performed any feature selection, leaving this to the random forests. They offer a useful measure of variable importance, namely the decrease in residual sum of squares averaged over all trees. Tables 1 and 2 show the ten most useful variables for this purpose from the aggregation and alignment approaches, respectively.

Evaluation
As explained in previous sections, we have used three feature sets to create three regression models, called Macro (from aggregation-based features), Micro (from alignment-based features) and Hybrid (including both feature sets). According to the shared task guidelines, the performance has been measured by computing a correlation between predicted and gold standard (assigned by humans) scores in each data set, and then obtaining their average.
We have performed two main experiments. Firstly, we have used the data that were available for training and testing in the 2015 competition (Agirre et al., 2015) to select the best approach. Then, we have created three final models on all available 2015 data and used them to label the test data provided by the organisers of the 2016 task. Table 3 shows the results of both experiments.
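The evaluation measure described above (a correlation per data set, then their average) can be sketched as below. We assume Pearson correlation and an unweighted mean; the data set names and scores are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def mean_correlation(datasets):
    """datasets maps a name to a (predicted, gold) pair of score lists."""
    per_set = {name: pearson(predicted, gold)
               for name, (predicted, gold) in datasets.items()}
    # Assumption: unweighted average over data sets.
    return sum(per_set.values()) / len(per_set), per_set


# Invented scores on the 0-5 scale, for illustration only.
toy = {
    "set-a": ([4.1, 2.0, 0.5, 3.3], [4.0, 2.5, 0.0, 3.0]),
    "set-b": ([1.0, 3.5, 2.2, 4.8], [0.5, 3.0, 3.1, 5.0]),
}
overall, per_set = mean_correlation(toy)
print(per_set, overall)
```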

Discussion
What may seem most surprising in the results is their diversity: some of the methods achieve a correlation well over 0.8 for one data set and below 0.6 for another. In our opinion, that not only shows that there is room for improvement but also reveals a fundamental problem in this task, namely that training and test sets come from different sources. The distribution of features also differs. This situation simulates many real-world scenarios, but also impedes any ML-based approach. In this light, our drop in performance between the development (2015) and evaluation (2016) sets seems satisfactorily small.
The data sets related to questions and answers have turned out to cause the most difficulties for our system. We have performed a post-hoc error analysis to understand why. We found that these sets expose weaknesses of our reliance on abstracting from word-level similarity to the sentence level. For example, consider the following two questions: What is the difference between Erebor and Moria? and What is the difference between splicing and superimposition? As you can see, they have the same structure and a lot of identical words, but the two remaining words create a wholly different meaning. This means that our approach, confirmed by the ranking of features (see Table 2), deserves further work.
We may be able to boost our performance on the 2016 task by a few simple measures. Firstly, we could include semantic features from a wider variety of sources, especially for the word-importance metrics. Secondly, we could consider pre-filtering the sentences against a list of stop-words, which typically do not contribute much to the meaning of a sentence. Finally, we could attempt to weight our word-level measures based on the semantic importance of each word in a sentence. For example, verbs and nouns are probably more important than adjectives and adverbs, which in turn are likely to be more important than conjunctions.