DalGTM at SemEval-2016 Task 1: Importance-Aware Compositional Approach to Short Text Similarity

This paper describes our system submission to the SemEval 2016 English Semantic Textual Similarity (STS) shared task. The proposed system is based on the compositional text similarity model, which aggregates pairwise word similarities to compute the semantic similarity between texts. In addition, our system combines word importance and word similarity to build an importance-similarity matrix. Three different word similarity measures are used in our three submitted runs. The evaluation results show that taking context dependent word importance into account improves performance. However, the performance of the system varies drastically between the different evaluation subsets. The best of our submitted runs ranks 60th, with a weighted mean Pearson correlation of 0.6892 with human judgements.


Introduction
Semantic Textual Similarity (STS) measures the degree of equivalence in the underlying semantics of paired natural language texts. It is an extensively researched problem with applications in many research areas, including natural language processing, information retrieval, and text mining. The STS task has been held annually since 2012 (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015) to encourage research into understanding sentence-level semantics. Systems for this task compute semantic similarity scores for paired text snippets. Performance is evaluated by the Pearson correlation between the system scores and human judgements. This paper describes our system submission to the SemEval 2016 STS shared task (Agirre et al., 2016). The proposed system is based on the compositional text similarity model, which has been broadly researched in the literature (Mihalcea et al., 2006; Li et al., 2006; Islam and Inkpen, 2006; Ho et al., 2010; Islam et al., 2012; Bär, 2013).
The compositional text similarity model uses word-level, context independent similarity values as the building blocks for computing sentence-level semantics. Computing textual similarity with this approach proceeds as follows: tokenize the input text, compute pairwise word similarities between all words, and aggregate the resulting scores into a sentence-level textual similarity score. State-of-the-art word similarity measures can be used in this model to provide context independent word relatedness. However, words can be more or less important depending on the contexts in which they appear. Our measure therefore also takes context dependent word importance into account: it first computes a word importance value using Eq. 3, and folds it into every entry of the importance-similarity matrix using Eq. 4.
We extend traditional compositional models with an importance term for each word. Our three submitted runs use this extended model in combination with three different word similarity measures: the Google Trigram Method (Islam et al., 2012), Skip-gram word embeddings (Mikolov et al., 2013), and GloVe word embeddings (Pennington et al., 2014). The evaluation results show that including matching importance information improves the performance of compositional models on most of the STS 2015 and 2016 evaluation sets. However, the relative performance of our systems varies dramatically when compared against other systems submitted to the shared task. The best of our submitted runs ranks 60th, with a weighted mean Pearson correlation of 0.6892 with human judgements.
The rest of the paper is organized as follows: Section 2 describes the details of the submitted systems. Section 3 shows the experimental results for our three runs using evaluation data from SemEval 2015 and 2016. Section 4 summarizes our observations and concludes.

System Description
Our proposed approach takes advantage of the compositional model while also taking the importance of words into consideration. It first computes the matching importance, which characterizes the importance of a word in a particular text pair similarity computation. Instead of building the similarity matrix used by the traditional compositional approach (Fig. 1), we construct an alternative importance weighted similarity matrix. The importance weighted similarity values are then used when computing the overall textual similarity score.

Text Preprocessing
The input texts are tokenized using the Penn Treebank tokenization with additional rules from Google. Punctuation and the 33 most common English words are filtered out because these tokens contribute little to the semantic meaning of a text. We then lemmatize the remaining words, taking their POS tags into account. The preprocessed input text is represented by these lemmas. POS tagging and lemmatization use the NLTK toolkit (http://www.nltk.org/).
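The pipeline above can be sketched as follows. This is a simplified stand-in, not the submitted system: it uses a regex tokenizer and a small illustrative list of common words in place of Penn Treebank tokenization, the 33-word list, and NLTK's POS-aware lemmatization.

```python
import re

# Illustrative subset only -- the paper filters the 33 most common
# English words; this toy list is an assumption for the sketch.
COMMON_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and"}

def preprocess(text):
    """Tokenize, lowercase, and drop punctuation and very common words.

    The real system additionally lemmatizes the surviving tokens using
    NLTK's POS tagger and WordNet lemmatizer; that step is omitted here.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())  # drops punctuation
    return [t for t in tokens if t not in COMMON_WORDS]

print(preprocess("A dog is running in the park."))  # ['dog', 'running', 'park']
```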

Word Similarity Computation
Word similarities are the core building blocks in compositional text similarity measures. Our three different runs each explore using a different word similarity algorithm.
Google Trigram Method (Islam et al., 2012) is an unsupervised statistical similarity measure that can be applied to word pairs. This word similarity method characterizes co-occurrence using the frequencies of trigrams starting and ending with the words of a pair. Let f_n(w_1, w_2) denote the total frequency of n-grams starting with w_1 and ending with w_2, f(w) the word (i.e. unigram) frequency of w, and f_max the largest word frequency in the corpus; the similarity is computed from these statistics (Eq. 1), and a normalization function (Eq. 2) is then applied to bound the word similarity values in the range [0, 1]. We use the efficient implementation of this method described in Mei et al. (2015).
Skip-gram (Mikolov et al., 2013) is a neural network model for learning word embeddings. The embeddings are trained with a model that discriminatively predicts word co-occurrences within a fixed context window. The resulting word embedding vectors have been shown to be effective at capturing word-level semantic information. We use the pre-trained vectors learned on part of the Google News dataset (about 100 billion words).
GloVe (Pennington et al., 2014) is an unsupervised learning algorithm for word embeddings. The method learns word embedding vectors with a model that predicts global word co-occurrence statistics extracted from a corpus. We use the pre-trained vectors built from the Wikipedia 2014 dump and the English Gigaword Fifth Edition.
For Skip-gram and GloVe, we use cosine similarity to compute pairwise word similarity values.
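For the two embedding-based runs, the pairwise word similarity is standard cosine similarity between embedding vectors. A minimal pure-Python sketch (the toy vectors are illustrative, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 for a zero vector, a common convention when an
    out-of-vocabulary word has no embedding.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy 2-d "embeddings" for illustration only.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```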

Matching Importance Computation
We define matching importance as a function that characterizes the importance of a word in a particular textual similarity computation. Given w in one text and w_1, w_2, ..., w_n in the other text, let S = {sim(w, w_1), ..., sim(w, w_n)} be the set of pairwise similarities of w; the matching importance of w is computed as a weighted sum of the mean µ(S) and the standard deviation ρ(S) of this set (Eq. 3). This expression is used in Islam et al. (2012) for selecting important matchings. The mean of the similarities is an indicator of semantic relatedness, whereas the standard deviation indicates distinctiveness. We take the weighted sum of both features as the final importance score.
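A sketch of this computation, assuming the two feature weights are free parameters (the paper's actual weight values for Eq. 3 are not reproduced in this text):

```python
from statistics import mean, pstdev

def matching_importance(sims, alpha=0.5, beta=0.5):
    """Weighted sum of the mean (semantic relatedness) and the standard
    deviation (distinctiveness) of a word's similarities to every word
    in the other text.

    sims: list of pairwise similarity values sim(w, w_i).
    alpha, beta: illustrative weights -- an assumption of this sketch.
    """
    return alpha * mean(sims) + beta * pstdev(sims)

# A word similar to everything (low variance) vs. a word with one
# strong, distinctive match (high variance).
print(matching_importance([0.5, 0.5, 0.5]))  # mean 0.5, deviation 0.0
print(matching_importance([0.9, 0.1, 0.1]))
```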

Matching Importance Adaptation
To incorporate the importance information, we rescale each pairwise word-level similarity score by the minimum of the context dependent importance scores of the two words being compared (Eq. 4), i.e., sim'(w_i, w_j) = min(imp(w_i), imp(w_j)) * sim(w_i, w_j).
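Assuming "rescale by" means multiplication (consistent with how Eq. 4 is described), the adaptation step reduces to a one-liner:

```python
def weighted_similarity(sim, imp_a, imp_b):
    """Importance-weighted similarity for one matrix entry.

    Takes the raw pairwise similarity and the two words'
    context-dependent importance scores; the smaller importance
    dominates, so a match involving an unimportant word is damped.
    """
    return min(imp_a, imp_b) * sim

# A strong match (0.8) is damped when one word has low importance (0.5).
print(weighted_similarity(0.8, 0.5, 0.9))  # 0.4
```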

Textual Similarity Computation
Given a preprocessed text pair, we count (δ) and remove the words that appear identically in both texts. Let the remaining words be T_1 = {w_11, ..., w_1n} and T_2 = {w_21, ..., w_2m}; we construct an importance weighted similarity matrix M of size n × m, whose entries are the importance weighted similarities from Eq. 4. Prior work suggests that using only the most important entries in the matrix suppresses interference during semantic analysis (Mihalcea et al., 2006; Islam and Inkpen, 2008; Islam et al., 2012). Thus, we apply a threshold t_i to filter out the less important matchings in the i-th row of the matrix (Eq. 5); let S_i denote the surviving entries of row i. The textual similarity between the two texts is then computed as (Eq. 6): t_sim = ((δ + Σ_{1≤i≤n} µ(S_i)) (n + m + 2δ)) / (2(n + δ)(m + δ)), where n + δ and m + δ are the lengths of the two preprocessed texts. The textual similarity score ranges within [0, 1].
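The aggregation step can be sketched as below. One simplification is an assumption of this sketch: the per-row threshold t_i of Eq. 5 (whose exact definition is not reproduced in this text) is replaced by the row mean.

```python
from statistics import mean

def text_similarity(matrix, delta, n, m):
    """Aggregate an importance-weighted similarity matrix into a
    sentence-level score, following the structure of Eq. 6.

    matrix: n x m importance-weighted similarities for the
            non-identical words of the two texts.
    delta:  number of identical words removed before matching.
    n, m:   number of remaining words in each text.
    """
    row_means = []
    for row in matrix:
        t = mean(row)                       # simplified threshold t_i
        kept = [s for s in row if s >= t]   # filter weak matchings (Eq. 5)
        row_means.append(mean(kept))
    total = delta + sum(row_means)
    return (total * (n + m + 2 * delta)) / (2 * (n + delta) * (m + delta))

# Identical texts: everything removed as exact matches (delta = 3),
# empty matrix, similarity 1.0.
print(text_similarity([], delta=3, n=0, m=0))  # 1.0
```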

Evaluation
We evaluated our three system submissions using the STS 2015 and 2016 evaluation datasets. The SemEval 2015 and 2016 datasets contain test sentence pairs distributed across nine domains. Each pair was assigned a similarity score in the range [0, 5] by multiple human annotators. The performance of our three system submissions is shown in Tables 2 and 3. Recall that our three systems differ only in the method they use for assessing lexical similarity: the Google Trigram Method (GTM), Word2vec (W2V), and GloVe. Systems that make use of matching importance are tagged with +IAC; otherwise, the system directly uses the pairwise similarity values to compute the aggregate similarity score using Eq. 6. Note that systems with the proposed matching importance approach perform consistently better than the original compositional model in most of the domain subsets. This shows that adding an importance feature can effectively improve the performance of the compositional model. However, when compared against the average system performance in each domain, our submitted systems vary dramatically in their relative performance to systems submitted by other participating teams. For example, our systems perform well on the postediting dataset and dramatically worse, even relative to other systems, on the question-question data. This suggests that the proposed system may have an implicit domain specific bias.
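For reference, the evaluation metric is the Pearson correlation between the system's similarity scores and the gold human judgements, which can be computed directly (the score lists here are toy values, not task data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly rank-preserving system scores correlate at 1.0 with gold.
print(pearson([0.2, 0.5, 0.9], [1.0, 2.5, 4.5]))
```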

Conclusions and Future Work
In this paper, we present an Importance-Aware Compositional Approach to STS and its evaluation in the SemEval 2016 STS shared task. Experimental results show that the proposed approach performs consistently better than matched compositional similarity models that do not take importance into account. In future work, it would be useful to investigate a more robust weighting scheme for word importance, to incorporate syntactic analysis of the texts, and to use external knowledge bases for word sense disambiguation.