MayoNLP at SemEval-2016 Task 1: Semantic Textual Similarity based on Lexical Semantic Net and Deep Learning Semantic Model

Given two sentences, participating systems assign a semantic similarity score in the range of 0 - 5. We applied two different techniques for the task: one is based on lexical semantic net (corresponding to run 1) and the other is based on deep learning semantic model (corresponding to run 2). We also combined these two runs linearly (cor responding to run 3). Our results indicate that the two techniques perform comparably while the combination outperforms the individual ones on four out of five datasets , namely answer - answer, headlines, plagiarism, and question - question , and on the overall weighted mean of STS 2016 and 2015 datasets .


Introduction
There is a growing need for an effective method to compute semantic similarity between two text snippets. Many natural language processing (NLP) applications can benefit from effective semantic textual similarity (STS) techniques such as paraphrase recognition (Dolan et al., 2004), textual summarization (Aliguliyev, 2009), automatic machine translation evaluation (Kauchak and Barzilay, 2006), tweets search (Sriram et al., 2010), and student answer assessment (Rus and Lintean, 2012;Niraula et al., 2013).
The SemEval STS task series (Agirre et al., 2012;Agirre et al., 2013;Agirre et al., 2014;Agirre et al., 2015) have provided a vital platform for this task by making available a huge collection of sentence pairs with manual annotation for each sentence pair. The objective of the task is to compute semantic similarity between a pair of sentences in the range [0,5], and where 0 indicates no similarity and 5 indicates complete similarity.
The approaches for computing semantic similarity between text snippets can be broadly classified into three approaches: information retrieval vector space model (e.g. Meadow, 1992;Sahami and Heilman, 2006), text alignment on the basis of semantic equivalence (e.g. Mihalcea et al., 2006) and machine learning models on the basis of lexical, syntactic and semantic features (e.g. Saric et al., 2012).
In this paper, we describe two techniques for the SemEval 2016 English STS task: the first one adopts the text alignment technique on the basis of semantic equivalence by leveraging a semantic net and corpus statistics; and the second technique is a machine learning models where we utilize a deep learning semantic model.
In the following, we present our techniques in detail and provide the evaluation results of our systems on STS 2016 and 2015 datasets. Conclusions and future directions are also provided.

System Description
This section describes the data and our systems in detail.

Data
There are five datasets in the SemEval 2016 English STS task: answer-answer, headlines, plagiarism, postediting and question-question collected form different sources in this year's English STS task. Each dataset consists of sentence pairs with human-assigned similarity score in the range of 0-5.

Preprocessing
In the preprocessing phase, for each sentence, we use regular expressions in Python 1 to find and replace all text contractions with their respective complete text. For example, "wouldn't" is replaced with "would not" All the URLs, email addresses, and parenthetical expressions (in the cases for abbreviation definition) are removed. Then we normalize common abbreviations with their expansions by using synonyms of those abbreviations in a lexical database, WordNet 2 , (for example, "USA" is replaced with "United States of America") and a manually crafted abbreviation list based on STS 2015 data (for example, "govt" is replaced with "government", "1.6lb" and "5m" are replaced with "1.6 pound" and "5 million"). Moreover, we replace all the negations such as 'not present' with 'absent' using WordNet antonyms, and correct misspelled words using an English dictionary provided by a dictionary module pyenchant 3 in Python. Stopwords are removed for datasets headlines, plagiarism and postediting, according to the Stanford stopwords list. For the datasets of answer-answer and question-question, stopwords are not removed since many pairs only contain stopwords, such as sentence pair "Can you do this?" "You can do it, too", etc. Finally, the remaining words in each sentence are lemmatized to their base forms using the lemmatizer provided by natural language toolkit (NLTK) in Python. The entire preprocessing workflow is illustrated in Figure 1.

Run 1
Our first technique (official run 1), illustrated in Figure 2, is inspired by the work of Li et al. (Li et al., 2006) where the textual similarity computation is based on both syntactic and semantic similarities.
The main differences of our technique comparing to theirs include preprocessing and synsets selection. We select the most similar pair of synsets from WordNet instead of just picking the first noun synset for each word.
After the preprocessing phase, the textual similarity is derived from the combination of semantic (semantic vectors) and syntactic information (syntactic vectors) present in a sentence pair as detailed in the following. We first form a joint word vector (JWV) by collecting unique words that occur in the sentence pair. The semantic similarity is computed by comparing the semantic meaning hidden in the sentences and the syntactic similarity is computed according to the word order in the two sentences.
To compute semantic similarity of the pair, each sentence is mapped to a vector: if ith word in JWV is present in the sentence, we assign a value 1 to the ith entry; otherwise, we calculate the semantic similarity score between the ith word and each word in the sentence and choose the highest score to be the ith entry with the aid of the WordNet using the following strategy. For every two words in the WordNet, we use the path length and depth of two words in the WordNet to calculate ith entry: where is the length of the shortest path between words ! and ! , ℎ returns a measure of depth in the WordNet, and , are constants (0.2 and 0.45 respectively in our experiment). and are dependent on the knowledge base and for WordNet we have used the values for alpha and beta found to be best by Li et al., (2003). The reason behind using depth along with the length of the shortest path is that in the WordNet, words at upper layers are more generic comparing to words at lower layers. If ! , ! is greater than a threshold (0.2 in our experiment) then we set the value to be the score, otherwise we set the value to 0.
The similarity score of the ith entry is then normalized by multiplying information content score of the ith word. The information content of word is defined by (Resnik, 1995): where N is the total number of words in the Brown corpus, n is the frequency of the word w in the corpus. According to this definition, the information content score is between 0 and 1. In our system, the information content of each word was computed using the Brown corpus (Francis and Kučera, 1979). The semantic similarity is then defined as !"# , which is the cosine similarity between the two vectors corresponding to the sentence pair.
Similarly, to compute syntactic similarity, each sentence is mapped to a syntactic vector. The dimension of the syntactic vector is identical as the size of the JWV. If the ith word in the JWV occurs at the jth position in the sentence, then the value of the ith entry in the vector is j. If the ith word does not exist in the sentence, the value of the ith entry is the position of the most similar word obtained from WordNet using 1, 2 . Different from the form of semantic similarity measure, syntactic similarity should take word order information into consideration. Therefore, the syntactic similarity is defined by: where ! and ! represent syntactic vector for sentences s1 and s2 respectively. From the definition of syntactic vector, we see that it contains the basic structural information of a sentence and measure the word order similarity between two sentences.
Since semantic similarity represents the lexical similarity while syntactic similarity contains information about the orders between words, both contribute in conveying the similarity of the sentences. Therefore, the final sentence pair similarity is defined by combining semantic similarity and syntactic similarity, i.e. ! , ! = Ω !"# + 1 − Ω !"# Since syntax plays a subordinate role for semantic processing of text we used Ω (Ω = 0.85) in the final similarity.

Run 2
Our second system (run 2) is built upon a Deep Structured Semantic Model (DSSM) proposed by Huang et al., (2013). DSSM is a deep learning based technique that is proposed for semantic understanding of textual data. It maps short textual strings, such as sentences, to feature vectors in a low-dimensional semantic space. Then the vector representations are utilized for document retrieval by comparing the similarity between documents and queries. DSSM is reported to outperform other semantic models applying to document retrieval (Huang et al., 2013). However, the performance has not been evaluated to measure the degree of similarity in the underlying semantics of paired snippets of text.
DSSM uses the typical deep neural network (DNN) architecture to represent a sentence (document) in the semantic vector space. Unlike other DNN where bag-of-word (BoW) term vectors are used as input, DSSM employ a novel word hashing method to reduce the dimensionality of the BoW vectors. Specifically, each word (for example, 'girl') is attached with a word starting mark and an ending mark (i.e., '#girl#') and split into letter trigrams (i.e., #gi, gir, irl, rl#). By doing so, the dimension is reduced to the number of trigrams, which is 3073 in our system. Then, each word representing by a vector of letter trigrams is used as input to the DNN.
The layers in DNN consists of three parts, namely, word hashing layer, hidden layers, and top layer, where the layer functions are: where is the input term vector, the output vector, ! , = 1, … , − 1 the hidden layers, ! the ith weight matrix, ! the ith bias, and (•) the tanh activation function. The feature vectors generated by the word-hashing layer are projected through the hidden layers, and formed as semantic feature vectors in the top layer.
After obtaining the semantic feature vectors for each paired snippets of text, cosine similarity is utilized to measure the semantic similarity between the pair. Since the elements of semantic vector are nonnegative and the range of cosine similarity is [0, 1], we used the simplest linear transformation to map the range into [0,5].
Regarding the implementation of DSSM, we exploit the Sent2Vec tool. Sent2Vec provides DSSM and convolutional DSSM (C-DSSM) . Based on our experiments on the SemEval 2015 English STS datasets, the performance of DSSM is slightly superior to C-DSSM without statistical significance. Therefore, DSSM is utilized in the official run 2. In the DSSM, we used one hidden layer with 1000 hidden nodes in the neural networks. A larger number of hidden layers or hidden nodes may be applied though we used the simplest one due to the computational complexity.

Evaluation
We submitted 3 runs at SemEval 2016 English STS task based on the Section 2.

Dataset
Run 1 Table 1 shows the performance of each official run. This is reported as the Pearson correlation between the scores produced by our systems and the human annotations. The last row shows the value of weighted mean that is the sum of all datasets correlation scores. The weight assigned to a dataset is proportional to its number of pairs. Our run 1 performed better than run 2 on the following datasets: answer-answer and postediting while run 2 performed better than run 1 on the following datasets: headlines, plagiarism and question-question. Our run 3 that is a linear combination of run 1 and run 2 is the best performing system in terms of weighted mean.

Performance on STS 2015 Data
We also applied our systems on the SemEval 2015 English STS task data. Table 2 shows the evaluation results on STS 2015 data. We can observe that mean run 3 outperforms run 1 and run 2 in terms of overall weighted mean. However, for answers-students dataset the performance of run 1 is better than other runs, and for datasets belief and images run 2 perform better than other two runs. The reason might be that run 1 considers word order information, which is crucial information when describing math or chemistry problems in dataset answers-students.

Conclusions and Future Work
In this paper, we presented three systems for measuring the semantic similarity between two sentences. The systems will be released in the future at GitHub. 1 Our systems give promising results on both English STS datasets 2015 and 2016. Our first system relies on a lexical database and corpus statistics. A lexical database represents common human knowledge about a natural language while a corpus reflects the actual usage of language and words. Our second system utilizes deep learning semantic model while our third system is an ensemble of system 1 and 2. In our future work, we would like to investigate the different contributions of syntactic similarity and semantic similarity in run 1 and improve the performance of run 1 by adding new corpora to increase the impact of corpus statistics. For the second system, we would like to test whether adding additional hidden layers into the deep neural net improves the performance. Currently, our third system is a linear combination of system 1 and 2 by using heuristic weights since no training data were given in the task and we concerned about over-fitting if the data from previous years were used to train our system. In the future we would like to investigate whether the performance could be improved by applying supervised or unsupervised machine learning algorithms to calculate the optimal weights based on the previous data sets. Moreover, we would like to extend our systems into the biomedical domain with the use of biomedical lexical databases and biomedical corpus.