DTSim at SemEval-2016 Task 1: Semantic Similarity Model Including Multi-Level Alignment and Vector-Based Compositional Semantics

In this paper we describe our system (DTSim) submitted to SemEval-2016 Task 1: Semantic Textual Similarity (STS Core). We developed a Support Vector Regression model with various features, including similarity scores calculated using alignment-based methods and semantic composition-based methods. The correlations between our system output and the human ratings were above 0.8 on three datasets.


Introduction
The task of measuring Semantic Textual Similarity (STS) is to quantify the degree of semantic similarity between a given pair of texts. For example, a similarity score of 0 means that the texts are not similar at all, and 5 means that they have the same meaning (Agirre et al., 2015; Banjade et al., 2015). In this paper, we describe our system DTSim and the three runs we submitted to this year's SemEval shared task on Semantic Textual Similarity, English track (STS Core; Agirre et al., 2016). We applied Support Vector Regression (SVR) with various features to predict the similarity score for the given sentence pairs. The features of the model included semantic similarity scores calculated using individual methods (described in Section 3) and other general features. The pipeline of components in DTSim is shown in Figure 1.

Preprocessing
Hyphens were replaced with whitespace unless they occurred in composite verbs (e.g. video-gamed). Composite verbs were detected based on the tag assigned by the POS tagger. Words starting with co-, pre-, meta-, multi-, re-, pro-, al-, anti-, ex-, and non- were also left intact. The hyphen-removed texts were then tokenized, lemmatized, POS-tagged, and annotated with Named Entity tags using the Stanford CoreNLP Toolkit. We also marked whether each word was a stop word. We created chunks using our own Conditional Random Fields (CRF) based chunking tool (Maharjan et al., 2016), which outperforms the OpenNLP chunker when evaluated against the human-annotated chunks provided in the interpretable similarity shared task in 2015. We also normalized texts using mapping data; for example, pct and % were changed to percent.
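The hyphen handling and normalization steps can be sketched as follows. This is a minimal illustration, not the DTSim implementation: the function names, the two-entry mapping table, and the is_verb flag (standing in for the POS tagger's decision) are assumptions made for the example.

```python
# Prefixes listed in the text; hyphenated words starting with these stay intact.
KEEP_PREFIXES = ("co-", "pre-", "meta-", "multi-", "re-", "pro-",
                 "al-", "anti-", "ex-", "non-")

# Hypothetical mapping table; the text only gives pct and % as examples.
NORMALIZE = {"pct": "percent", "%": "percent"}


def split_hyphens(token, is_verb=False):
    """Split a token on hyphens unless it is a composite verb or has a kept prefix."""
    if "-" not in token or is_verb or token.lower().startswith(KEEP_PREFIXES):
        return [token]
    return [part for part in token.split("-") if part]


def preprocess(tagged_tokens):
    """tagged_tokens: list of (token, is_verb) pairs; returns normalized tokens."""
    out = []
    for token, is_verb in tagged_tokens:
        for piece in split_hyphens(token, is_verb):
            out.append(NORMALIZE.get(piece.lower(), piece))
    return out
```

In a full pipeline, the is_verb flag would come from the POS tagger, and tokenization, lemmatization, and chunking would follow this step.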

Feature Extraction
We used various features in our regression models including semantic similarity scores generated using individual methods. Before describing those individual methods, we present word similarity methods which were used for sentence similarity calculation.

Word-to-Word Similarity
We used vector-based word representation models, the PPDB 2.0 database (Pavlick et al., 2015), and WordNet (Miller, 1995) to measure the similarity between words, as described below. Here m ∈ {ppdb, LSAwiki, word2vec, GloVe} denotes the selected model, and X1 and X2 are the vector representations of words w1 and w2, respectively.
We first checked for synonyms and antonyms in WordNet 3.0. If the word pair was neither a synonym pair nor an antonym pair, we calculated the similarity score based on the model selected. The word representation models used were word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and LSA Wiki (Ştefănescu et al., 2014a). The cosine similarity was calculated between the word representation vectors. We also used the similarity score found in the PPDB database.
Handling missing words: We looked up the representation of a word in its raw form as well as in its base (lemma) form. If neither was found, we used the vector representation of one of its synonyms in WordNet for the given POS category. The same strategy was used when retrieving similarity scores from PPDB.
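The lookup order described above can be sketched as follows. This is an illustrative sketch only: the dictionary-based stand-ins for WordNet and the vector model, the function names, and the scores of 1.0 for synonyms and 0.0 for antonyms are assumptions, since the text does not state the values used.

```python
import math

SYN_SCORE = 1.0  # assumed score for WordNet synonyms (not stated in the text)
ANT_SCORE = 0.0  # assumed score for WordNet antonyms (not stated in the text)


def cosine(x1, x2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def lookup(raw, lemma, vectors, synonyms):
    """Try the raw form, then the lemma, then any synonym of the lemma."""
    for candidate in (raw, lemma, *sorted(synonyms.get(lemma, ()))):
        if candidate in vectors:
            return vectors[candidate]
    return None


def word_sim(w1, w2, vectors, synonyms, antonyms):
    """w1, w2: (raw form, lemma) pairs; synonyms/antonyms map lemma -> set of lemmas."""
    raw1, lemma1 = w1
    raw2, lemma2 = w2
    if lemma1 == lemma2 or lemma2 in synonyms.get(lemma1, ()):
        return SYN_SCORE
    if lemma2 in antonyms.get(lemma1, ()):
        return ANT_SCORE
    v1 = lookup(raw1, lemma1, vectors, synonyms)
    v2 = lookup(raw2, lemma2, vectors, synonyms)
    if v1 is None or v2 is None:
        return 0.0  # no representation found for at least one word
    return cosine(v1, v2)
```

For m = ppdb, the cosine call would be replaced by a lookup of the paraphrase score for the word pair, with the same fallback order.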

Word Alignment Based Method
In this approach, all the content words (in lemma form) in the two sentences (S1 and S2) were aligned optimally (OA) using the Hungarian algorithm (Kuhn, 1955), as described by Rus and Lintean (2012) and implemented in the SEMILAR Toolkit (Rus et al., 2013). The process is the same as finding the maximum weight matching in a weighted bipartite graph, where the nodes are words and the weights are the similarity scores between the word pairs. The sentence similarity is calculated from the similarity scores of the optimally aligned word pairs. To avoid noisy alignments, we reset similarity scores below 0.5 (an empirically set threshold) to 0.
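The optimal alignment step can be illustrated with a small sketch. Two things here are assumptions rather than the system's actual code: the matching is found by exhaustive search over one-to-one assignments (which yields the same optimum as the Hungarian algorithm but is only practical for short sentences), and the normalization 2 * sum / (|S1| + |S2|) is assumed because the formula is not reproduced in this text. The sim argument is any symmetric word-to-word similarity function.

```python
from itertools import permutations


def optimal_alignment_score(s1, s2, sim, threshold=0.5):
    """Sentence similarity from a maximum-weight one-to-one word alignment.

    s1, s2: lists of content-word lemmas; sim: symmetric word similarity function.
    """
    if not s1 or not s2:
        return 0.0
    short, long_ = (s1, s2) if len(s1) <= len(s2) else (s2, s1)

    # Pairwise weights; scores below the threshold are reset to 0.
    weights = []
    for a in short:
        row = []
        for b in long_:
            score = sim(a, b)
            row.append(score if score >= threshold else 0.0)
        weights.append(row)

    # Exhaustive search over assignments of each word in `short` to a
    # distinct word in `long_` (stand-in for the Hungarian algorithm).
    best = 0.0
    for cols in permutations(range(len(long_)), len(short)):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        best = max(best, total)

    # Assumed normalization by the combined sentence length.
    return 2.0 * best / (len(s1) + len(s2))
```

For realistic sentence lengths, the inner search would be replaced by a polynomial-time assignment solver.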

Chunk Alignment Based Method
We chunked texts (see Section 2) and aligned chunks optimally as described by Ştefănescu et al. (2014b). The difference is that chunks containing Named Entities were aligned using rules: (a) the chunks were treated as equivalent if both were named entities and at least one of their content words matched, and (b) they were treated as equivalent if one was the acronym of the other. In other cases, chunk-to-chunk similarity was calculated using the optimal word alignment method. The process is the same as in the word alignment based method: first, the words in chunks were aligned to calculate chunk-to-chunk similarity; then, the chunks in the two sentences were aligned optimally to obtain the sentence-level similarity. To avoid noisy alignments, we reset similarity scores below 0.5 to 0 for word alignment and scores below 0.6 to 0 for chunk alignment. These thresholds were set empirically.
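The two named-entity rules can be sketched as below. This is a hedged illustration: the function names are hypothetical, and the acronym test (comparing a single token against the initials of a multi-word chunk) is an assumed heuristic, since the text does not say how acronyms were detected.

```python
def is_acronym(short, long_words):
    """True if `short` matches the initials of a multi-word chunk (assumed heuristic)."""
    initials = "".join(w[0] for w in long_words if w)
    return len(long_words) > 1 and short.replace(".", "").lower() == initials.lower()


def ne_chunks_equivalent(chunk1, chunk2, stop_words=frozenset()):
    """chunk1, chunk2: token lists of two chunks already tagged as named entities."""
    content1 = {t.lower() for t in chunk1 if t.lower() not in stop_words}
    content2 = {t.lower() for t in chunk2 if t.lower() not in stop_words}
    # Rule (a): at least one matching content word.
    if content1 & content2:
        return True
    # Rule (b): one chunk is the acronym of the other.
    if len(chunk1) == 1 and is_acronym(chunk1[0], chunk2):
        return True
    if len(chunk2) == 1 and is_acronym(chunk2[0], chunk1):
        return True
    return False
```

Chunk pairs that fail both rules would fall through to the optimal word alignment score, with the 0.5 and 0.6 thresholds applied at the word and chunk levels.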

Interpretable Feature Based Method
We aligned chunks from one sentence to another and assigned a semantic relation and a similarity score to each alignment. The semantic labels were EQUI, OPPO, SIMI, REL, SPE1, SPE2, and NOALI. For example, the semantic relation EQUI was assigned if the two chunks were equivalent. The similarity scores range from 0 (no similarity) to 5 (equivalent). We aligned chunks and assigned labels as described by Maharjan et al. (2016). Once the chunks were aligned and the semantic relation types and similarity scores were assigned, a sentence-level score was calculated for each relation type, and an overall score was calculated using all alignment types.

Vector Algebra Based Method
In this approach, we combined vector-based word representations to obtain sentence-level representations through vector algebra as:
$$RV(S) = \sum_{w \in W} V_w$$
where W is the set of content words in sentence S and $V_w$ is the vector representation of word w. The cosine similarity was calculated between the resultant vectors RV(S1) and RV(S2). Word representations from the LSA Wiki, word2vec, and GloVe models were used.
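A minimal sketch of the resultant-vector computation follows, assuming the sentence representation is the plain sum of the content-word vectors. The toy dictionary of vectors and the choice to skip words without a vector are assumptions made for the example.

```python
import math


def resultant_vector(words, vectors):
    """Sum the vectors of the given content words; words without a vector are skipped."""
    dims = len(next(iter(vectors.values())))
    rv = [0.0] * dims
    for word in words:
        vec = vectors.get(word)
        if vec is not None:
            rv = [a + b for a, b in zip(rv, vec)]
    return rv


def cosine(x1, x2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def vector_algebra_similarity(s1, s2, vectors):
    """Cosine similarity between RV(S1) and RV(S2)."""
    return cosine(resultant_vector(s1, vectors), resultant_vector(s2, vectors))
```

The same function can be called once per embedding model to obtain one feature per model.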

Similarity Matrix Based Method
This approach is similar to the word alignment based method in that similarity scores are calculated for all pairs of words from the two sentences. The key difference is that all word-to-word similarities are taken into account, not just those of the maximally aligned words, as described by Fernando and Stevenson (2008).
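The idea can be sketched using the formulation of Fernando and Stevenson (2008), in which the sentences are binary vectors a and b over the vocabulary, W holds the word-to-word similarities, and the score is a W b^T / (|a| |b|). Whether DTSim used exactly this normalization is an assumption; the function name and the toy similarity function in the test are also illustrative.

```python
import math


def similarity_matrix_score(s1, s2, sim):
    """Sum of all pairwise word similarities, normalized by the binary vector norms.

    s1, s2: lists of words; sim: word-to-word similarity function.
    With binary sentence vectors, |a| is the square root of the number of
    distinct words in the sentence.
    """
    types1, types2 = sorted(set(s1)), sorted(set(s2))
    if not types1 or not types2:
        return 0.0
    total = sum(sim(a, b) for a in types1 for b in types2)
    return total / (math.sqrt(len(types1)) * math.sqrt(len(types2)))
```

Unlike the alignment-based score, this value is not bounded by 1 when many word pairs are partially similar, which is one reason to treat it as a feature for the regressor rather than as a final score.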

Features
All or a subset of the following features were used in the three runs, as described in Section 4. Unless stated otherwise, word similarity was computed using word2vec representations together with WordNet synonym and antonym checks.
1. Similarity scores generated using word alignment based methods where word-to-word similarity was calculated using methods described in Section 3.1.
2. Similarity score using optimal alignment of chunks where word-to-word similarity scores were calculated using representation from word2vec model.
3. Similarity scores using similarity matrix based methods. The similarities between words were calculated using different word similarity methods discussed in Section 3.1.
4. Similarity scores using chunk alignment types and alignment scores (interpretable features).
5. Similarity scores using the resultant vector based method using word representations from word2vec, GloVe, and LSA Wiki models.
6. Noun-Noun, Adjective-Adjective, Adverb-Adverb, and Verb-Verb similarity scores and similarity score for other types of words using word alignment based method.
7. Multiplication of noun-noun similarity scores and verb-verb similarity scores.
8. A count-based feature computed from C_i1 and C_i2, where C_i1 and C_i2 are the counts of i ∈ {all tokens, adjectives, adverbs, nouns, verbs} for sentence 1 and sentence 2, respectively.
9. Presence of adjectives and adverbs in first sentence, and in the second sentence.
10. Unigram overlap with synonym check, bigram overlap and BLEU score.
11. Number of EQUI, OPPO, REL, SIMI, and SPE relations in aligning chunks between texts relative to the total number of alignments.
12. Presence of antonym pair among all word pairs between given two sentences.
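A few of the simpler features in this list (7, 9, and 11) can be sketched as follows. The function names are hypothetical, the POS flags assume Penn Treebank tag prefixes (JJ for adjectives, RB for adverbs), and folding SPE1 and SPE2 into a single SPE count and counting NOALI entries in the total are assumptions about details the text leaves open.

```python
from collections import Counter

RELATIONS = ("EQUI", "OPPO", "REL", "SIMI", "SPE")


def noun_verb_product(noun_sim, verb_sim):
    """Feature 7: product of the noun-noun and verb-verb similarity scores."""
    return noun_sim * verb_sim


def pos_presence(pos_tags):
    """Feature 9: (has adjective, has adverb) flags for one sentence."""
    has_adj = any(tag.startswith("JJ") for tag in pos_tags)
    has_adv = any(tag.startswith("RB") for tag in pos_tags)
    return has_adj, has_adv


def relation_ratios(alignment_labels):
    """Feature 11: share of each relation type among all chunk alignments."""
    total = len(alignment_labels)
    counts = Counter("SPE" if label.startswith("SPE") else label
                     for label in alignment_labels)
    return {rel: (counts[rel] / total if total else 0.0) for rel in RELATIONS}
```

Each function returns plain numbers or flags, so the outputs can be concatenated with the similarity scores into a single feature vector per sentence pair.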

Building Models
Training Data: For building models, we used data released in previous shared tasks (summarized in Table 1). We selected datasets that included texts from different genres. Some datasets, such as Tweet-news and MSRPar, were not included; the Tweet-news data, for instance, were quite different from most other texts.
Models and Runs: Using combinations of the features described in Section 3.3, we built three different Support Vector Regression (SVR) models corresponding to the three submitted runs (R1-R3). In Run 1 (R1), all of the features except the chunk alignment based features were used, together with the XL version of PPDB 2.0. In Run 2, we selected the features using Weka's correlation based feature selection tool (Hall and Smith, 1998); the selected features also included the chunk alignment based similarity score. In Run 3, we took representative features from all of the features described in Section 3.3. For example, alignment based similarity scores generated using the word2vec model were selected because it performed relatively better on the training set than the GloVe and LSA Wiki models. We also used the XXXL version of the PPDB 2.0 database (its precision may be lower, but its coverage is higher than that of the smaller versions of the database). We used the LibSVM library (Chang and Lin, 2011) in Weka 3.6.8 to develop the SVR models. We evaluated our models on the training data using 10-fold cross validation. The correlation scores on the training set were 0.791, 0.773, and 0.800 for R1, R2, and R3, respectively. The best results on the training set were obtained using the RBF kernel. All other parameters were set to Weka's defaults.
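The evaluation protocol (10-fold cross validation scored by Pearson correlation) can be sketched independently of the learner. This sketch does not use Weka or LibSVM: the regressor is passed in as a hypothetical fit callable that returns a prediction function, so an RBF-kernel SVR from any library could be plugged in. Pooling the held-out predictions before computing a single correlation is an assumption about how the reported scores were obtained.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def cross_validate(X, y, fit, k=10, seed=0):
    """Pooled Pearson correlation of held-out predictions over k folds.

    fit(X_train, y_train) must return a function mapping one feature row
    to a predicted similarity score.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [0.0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = predict(X[i])
    return pearson(predictions, y)
```

Keeping the learner behind a callable makes it easy to compare feature subsets, such as those of the three runs, under the same folds by fixing the seed.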

Results
The test data contained 1186 sentence pairs: Headlines (249), Plagiarism (230), Postediting (244), Question-Question (209), and Answer-Answer (254). Further details about the test data can be found in Agirre et al. (2016). Table 2 shows the Pearson correlation of our system outputs with human ratings. The correlation scores of all three runs are 0.8 or above for three datasets: Headlines, Plagiarism, and Postediting. However, the correlations are comparatively lower for the Question-question and Answer-answer datasets. One of the reasons is that these two datasets are quite different from the texts we used for training (we could not include them, as datasets of this type were not available during model building). [...] workout plan? have high lexical overlap, but they ask very different things. Analyzing the focus of the questions may be needed in order to distinguish them, i.e. the similarity between such pairs may need to be modeled differently. The release of this type of dataset will foster the development of similarity models for text pairs that consist of questions.
It should be noted that we used a single set of training data in all models without tailoring our models to specific test data. Another interesting observation is that the results of the three runs are similar to each other. The most predictive feature was the word alignment based similarity using the word2vec model; its correlation on the full training set was 0.725. This is not surprising, considering that alignment based systems were also the top performing systems in past shared tasks (Han et al., 2013; Sultan et al., 2015; Agirre et al., 2015). Selecting a smaller set of features that best predict the similarity scores should be considered in the future, as it would reduce the complexity of the model and the potential for overfitting.

Conclusion
This paper presented the DTSim system and the three runs submitted to the SemEval 2016 task on STS, English track. We developed support vector regression models with various features to predict the similarity score for a given pair of texts. The correlations of our system output with human ratings were up to 0.83. However, the relatively lower scores for the two datasets of new types (such as question-question) indicate that different datasets may need to be treated differently.