BIT at SemEval-2016 Task 1: Sentence Similarity Based on Alignments and Vector with the Weight of Information Content

This paper describes three unsupervised systems, submitted to SemEval 2016 Task 1, for determining the semantic similarity between two short texts or sentences. All three use only off-the-shelf software and data, which makes them easy to replicate. Two systems achieved similar Pearson correlation coefficients (0.64661 for the simple vector method, 0.65319 for word alignments). We also report experiments applying our alignment-based system to evaluation data from the 2014 and 2015 STS shared tasks. The results suggest that, beyond the core similarity algorithm, other factors such as data preprocessing and the use of domain-specific knowledge are also important to similarity prediction performance.


Introduction
Given two short texts or sentences, similarity systems or models should output a score that reflects how similar the two texts are in meaning. Semantic textual similarity (STS) formalizes an operation that is an important component of many natural language processing systems and has generated substantial interest within the research community (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015). STS methods can be applied in example-based machine translation, machine translation evaluation, information retrieval, text summarization, question answering, and recommendation systems.

System Overview
In STS 2016, we submitted three system runs, all of which were unsupervised. They fall into two kinds: vector based and alignment based.

Run 1: Simple Vector Method
In this run, we use a sentence vector derived from word embeddings obtained from word2vec (Mikolov et al., 2013). Using these sentence level vector representations, the similarity between two texts can be computed using the cosine operation.
We train word embeddings by running the word2vec toolkit over the fifth edition of the Gigaword corpus (LDC2011T07). We preprocess the Gigaword data with the following tools from the Moses machine translation toolkit (Koehn et al., 2007): the data is tokenized using tokenizer.perl, and truecase.perl is used to standardize capitalization.
As illustrated in Equation (1), we construct the sentence vector $s$ by simply summing the word embeddings $t_i$ associated with each token in the sentence:
$$s = \sum_{i=1}^{|s|} t_i \tag{1}$$
Here $|s|$ is the number of tokens that the sentence contains.
The similarity between a pair of sentences is computed as the cosine of their associated sentence level embedding vectors.
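The summation and cosine steps of Run 1 can be sketched as follows. This is a minimal illustration, not the submitted system: the toy two-dimensional embeddings stand in for trained word2vec vectors, the function names are hypothetical, and skipping tokens that have no embedding is an assumption the paper does not state.

```python
import math

def sentence_vector(tokens, embeddings, dim):
    """Sum the embeddings of all tokens in a sentence.

    Assumption: tokens without an embedding are skipped.
    """
    vec = [0.0] * dim
    for tok in tokens:
        emb = embeddings.get(tok)
        if emb is None:
            continue
        for k in range(dim):
            vec[k] += emb[k]
    return vec

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy embeddings (hypothetical values, for illustration only).
toy = {"a": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.0, 1.0], "runs": [1.0, 1.0]}
s1 = sentence_vector(["a", "cat"], toy, 2)
s2 = sentence_vector(["a", "dog"], toy, 2)
similarity = cosine(s1, s2)
```

Because the vectors are summed rather than averaged, sentence length changes the magnitude of the sentence vector but not its direction, so the cosine is unaffected by that choice.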

Run 2: Weighted Vector Method
The above method weights all word embeddings equally. We submitted an alternative run that weights the word embeddings by the information content (IC) of the concepts referenced by their word-sense-tagged tokens (Resnik, 1995). Word sense disambiguation is performed using BabelNet (Navigli and Ponzetto, 2012) with the WordNet (Miller, 1995) sense inventory. NLTK (Bird, 2006) is used to obtain the frequencies of the words belonging to each WordNet synset. The probability associated with each concept is estimated over the BNC using add-one smoothing. Following Resnik (1995), we then compute the information content of each concept as follows:
$$IC(c) = -\log P(c) \tag{2}$$
Here $P(c)$ is the probability of concept $c$, estimated from its frequency.
This method allows us to compute IC based weights only for the nouns and verbs covered by WordNet. We heuristically set the weight of adjectives and adverbs to 5 and other words to 2.
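The weighting scheme can be sketched as below, under several assumptions that the paper does not spell out: the sentence vector is taken to be the weighted sum of the word embeddings, add-one smoothing is taken over the set of observed concepts, and nouns or verbs without a WordNet concept fall back to the weight for other words. The part-of-speech labels, concept identifiers, and function names are hypothetical.

```python
import math

def information_content(concept_counts, concept):
    """IC(c) = -log P(c), with P(c) estimated using add-one smoothing.

    Assumption: the smoothing denominator is the total count plus the
    number of distinct observed concepts.
    """
    total = sum(concept_counts.values())
    vocab = len(concept_counts)
    p = (concept_counts.get(concept, 0) + 1) / (total + vocab)
    return -math.log(p)

def token_weight(pos, concept, concept_counts):
    """IC weight for sense-tagged nouns and verbs, heuristic weights otherwise."""
    if pos in ("NOUN", "VERB") and concept is not None:
        return information_content(concept_counts, concept)
    if pos in ("ADJ", "ADV"):
        return 5.0
    # Assumption: everything else, including nouns and verbs that are not
    # covered by WordNet, receives the weight for "other words".
    return 2.0

def weighted_sentence_vector(tagged_tokens, embeddings, concept_counts, dim):
    """Weighted sum of embeddings; tagged_tokens holds (token, pos, concept)."""
    vec = [0.0] * dim
    for token, pos, concept in tagged_tokens:
        emb = embeddings.get(token)
        if emb is None:
            continue
        w = token_weight(pos, concept, concept_counts)
        for k in range(dim):
            vec[k] += w * emb[k]
    return vec

# Hypothetical concept counts, for illustration only.
counts = {"dog.n.01": 3, "run.v.01": 1}
ic_frequent = information_content(counts, "dog.n.01")
ic_rare = information_content(counts, "run.v.01")
```

With these counts the rarer concept receives the higher weight, which is the intended effect: less frequent, more informative concepts contribute more to the sentence vector.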

Run 3: Word Alignment Method
Our final run differs from the vector-based methods described above and follows a popular alternative approach that assesses sentence similarity through word alignments. We make use of the word aligner of Sultan et al. As shown in Equation (3), similarity is computed as
$$sim(S^{(1)}, S^{(2)}) = \frac{n_c^a(S^{(1)}) + n_c^a(S^{(2)})}{n_c(S^{(1)}) + n_c(S^{(2)})} \tag{3}$$
Here $n_c^a(S^{(i)})$ and $n_c(S^{(i)})$ are the number of aligned content words and the number of content words in sentence $S^{(i)}$, respectively.
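As a rough sketch, and assuming that Equation (3) is the proportion of aligned content words across both sentences, the score can be computed from four counts. The function name and the zero-denominator behaviour are assumptions; the word aligner that produces the counts is not shown.

```python
def alignment_similarity(aligned_1, content_1, aligned_2, content_2):
    """Proportion of content words in both sentences that are aligned.

    aligned_i: number of aligned content words in sentence i
    content_i: number of content words in sentence i
    Assumption: returns 0.0 when neither sentence has any content words.
    """
    denom = content_1 + content_2
    if denom == 0:
        return 0.0
    return (aligned_1 + aligned_2) / denom

# Example: 3 of 4 content words aligned in one sentence, 3 of 5 in the other.
score = alignment_similarity(3, 4, 3, 5)
```

The score lies in [0, 1]. Since the evaluation uses Pearson correlation, which is invariant to positive linear rescaling, it does not need to be mapped onto the 0 to 5 gold standard range.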

Data
As shown in Table 1, the 2016 STS shared task included 5 distinct datasets. Systems were required to annotate between 1,498 and 3,287 pairs per dataset. System performance was evaluated on a subset of each dataset consisting of between 209 and 255 gold standard (GS) pairs. The GS similarity scores for each pair range from 0 to 5, with the following interpretations: 5 indicates complete equivalence; 4 means mostly equivalent, with differences only in some unimportant details; 3 means roughly equivalent, but with differences in some important details; 2 means not equivalent, but sharing some details; 1 means the pair only shares the same topic; and 0 means no overlap in meaning.
We note that there is a big gap between 0 and 1 on the GS scale. Intuitively, within the range [1,5], scores linearly represent the similarity between two texts. However, the step from 1 to 0 covers a much larger conceptual range of topical similarity, spanning from pairs on exactly the same topic to pairs that are completely dissimilar.

Evaluation
The evaluation metric is the Pearson correlation coefficient (PCC) (Brownlee, 1965) between system output and the gold standard. PCC is computed for each individual test set, and the final score is the weighted mean of the PCC over all datasets (Agirre et al., 2012).
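The evaluation procedure can be sketched as follows. This sketch assumes that each dataset is weighted by its number of gold standard pairs; the function names and toy scores are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores
    return cov / math.sqrt(var_x * var_y)

def weighted_mean_pcc(datasets):
    """Mean PCC over datasets, weighted by the number of pairs in each.

    datasets: list of (system_scores, gold_scores) tuples.
    """
    total_pairs = sum(len(gold) for _, gold in datasets)
    weighted = sum(pearson(pred, gold) * len(gold) for pred, gold in datasets)
    return weighted / total_pairs

# Toy data: one perfectly correlated set and one perfectly anti-correlated set.
toy_sets = [
    ([1, 2, 3], [2, 4, 6]),
    ([1, 2, 3, 4], [4, 3, 2, 1]),
]
overall = weighted_mean_pcc(toy_sets)
```

Weighting by dataset size means that larger test sets influence the final score more than smaller ones.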

STS 2016 Results
The performance of our three systems on each of the STS 2016 test sets is shown in Table 2. The last two columns show the results of the following modified versions of Run 2 and Run 3.
Run 2': Word embedding vectors are normalized to unit length, and the heuristic IC weights are adjusted as follows: 6 for adjectives and adverbs and 3 for other words.
Run 3': If no content words are aligned, we use the longest common substring algorithm to obtain the longest common consecutive words (LCCW) of the compared sentences. Similarity is then computed as in Equation (4), where $LCCW(S^{(1)}, S^{(2)})$ is the number of words present in the LCCW of $S^{(1)}$ and $S^{(2)}$.
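The LCCW length used by Run 3' can be obtained with the standard longest common substring dynamic program applied to tokens rather than characters. The sketch below computes only the length of the LCCW. It does not compute the final similarity score, and whitespace tokenization and the function name are assumptions made for illustration.

```python
def lccw_length(tokens_1, tokens_2):
    """Length of the longest run of consecutive words shared by both sentences."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # token of tokens_1 and at tokens_2[j - 1].
    prev = [0] * (len(tokens_2) + 1)
    for i in range(1, len(tokens_1) + 1):
        curr = [0] * (len(tokens_2) + 1)
        for j in range(1, len(tokens_2) + 1):
            if tokens_1[i - 1] == tokens_2[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Illustration on a question pair quoted in this paper (whitespace tokens).
q1 = "What's the best way to store asparagus?".split()
q2 = "What's the best way to store unused sushi rice?".split()
shared = lccw_length(q1, q2)
```

The dynamic program runs in time proportional to the product of the two sentence lengths, which is negligible for short texts.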
Words are classified as content words if they are either nouns, verbs, adjectives or adverbs with a small number of exceptions. We elected to classify think, know, want and act as non-content words based on their IDF scores.
From Table 2, we make the following observations:  word level embedding vectors needs to account for differences in the magnitude of the raw embeddings.
3. The best performance of all of our systems is achieved by Run 3', which included additional logic to handle pairs with no aligned content words. However, both Run 3 and Run 3' performed particularly badly on the question-question dataset. Inspecting the data reveals that some sentence pairs have a GS score of 0 even when there is some level of similarity between what is being asked, such as "What's the best way to store asparagus?" vs "What's the best way to store unused sushi rice?". We also observe that many pairs in this dataset have similarly structured sentences in which particular core words play a decisive role.

Results on Past Test Sets
To better frame the performance of our systems, we examined the performance of Run 3 and Run 3', our word alignment based systems, on the STS shared task evaluation sets from 2014 and 2015. Recall that our method is unsupervised and most comparable to the unsupervised system of Sultan et al. (2015), except for differences in text preprocessing. We observe that our performance may have been diminished by not performing the following preparation steps:
1. Our systems did not use a spelling correction module, for example one that maps a misspelt word to a correctly spelt word at a Levenshtein distance of 1, before running the aligner or looking up word vectors.
2. Knowledge of domain-specific stop words was not taken into account in the submitted systems.
We suspect these contributed to the performance gap between our system and even the very similar Sultan et al. (2014b) submission.
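A spelling correction step of the kind described in the first item could look like the sketch below. It is not part of the submitted systems: the vocabulary, the function names, and the alphabetical tie-break among candidates are all assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Map an unknown word to a vocabulary word at edit distance 1, if any.

    Assumption: ties are broken alphabetically; a real module would more
    likely prefer the most frequent candidate.
    """
    if word in vocabulary:
        return word
    candidates = sorted(v for v in vocabulary if levenshtein(word, v) == 1)
    return candidates[0] if candidates else word

# Hypothetical vocabulary, for illustration only.
vocab = {"asparagus", "rice", "store"}
fixed = correct("asparagos", vocab)
```

Applying such a step before alignment or embedding lookup would let a misspelt token match its intended word instead of being treated as unknown.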

Conclusions and Future Work
At SemEval 2016, we submitted three unsupervised STS systems: the simple vector method, the weighted vector method and the word alignment method. Two make use of sentence level embedding vectors, and the other applies a known, well-performing method for calculating STS scores that is based on monolingual word alignments. We observe that both types of systems are able to achieve a similar PCC. Based on observations obtained by running our system on evaluation sets from earlier years, we believe our system could have been improved by including more of the text preprocessing steps performed in prior work.
First, our systems should introduce a spelling correction module to deal with misspelt words, which is a good way to increase the recall of the input. Second, domain-specific knowledge should be taken into account, such as domain-specific stop words, which can adapt to requirements posed by different data domains and applications. In future work, we hope to investigate the use of domain-specific weights for words as well as other methods for term weighting such as TF-IDF.