ITNLP-AiKF at SemEval-2017 Task 1: Rich Features Based SVR for Semantic Textual Similarity Computing

Semantic Textual Similarity (STS) measures the degree of semantic equivalence between a pair of sentences. We propose a new system, ITNLP-AiKF, which participated in SemEval 2017 Task 1, Track 5 (English monolingual pairs). Our system combines rich features, including Ontology based, Word Embedding based, Corpus based, Alignment based and Literal based features. We fed these features to a Support Vector Regression (SVR) model to predict sentence-pair similarity. Our system achieved a Pearson Correlation of 0.8231, a competitive result on this track.


Introduction
The Semantic Evaluation (SemEval) contest is devoted to advancing research on semantic analysis; it attracts many participants and has promoted a number of groundbreaking achievements in the natural language processing (NLP) field. The Semantic Textual Similarity (STS) task focuses on computing word and text semantics, and has attracted wide attention from researchers and the NLP community since SemEval 2012 (Agirre et al., 2012).
In STS 2017, the organizers added monolingual Arabic and cross-lingual Arabic-English tracks in order to increase the difficulty of the contest. The task is defined as follows: given two sentences, participating systems are asked to predict a continuous similarity score on a scale from 0 to 5, where 0 indicates that the sentences are semantically completely independent and 5 indicates semantic equivalence. The evaluation criterion is the Pearson Correlation Coefficient between the gold-standard scores and the scores predicted by the systems.
In our system, to predict the similarity score of two sentences, we trained a Support Vector Regression (SVR) model with abundant features, including Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. All the English training, trial and evaluation data sets released by previous STS tasks in SemEval were used to construct our system. Our best system achieved a Pearson Correlation coefficient of 0.8231 on the SemEval 2017 evaluation data set, while the winning system achieved 0.8547.

Feature Engineering
In our system, many features were tried to improve performance. Five kinds of features are used: Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. The following is a detailed description of these five kinds of features.

Ontology Based Features
WordNet (Miller, 1995) is used to exploit Ontology based features.
WordNet is a large lexical database of English.
In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. WordNet measures the similarity of two words using Path similarity, Res similarity, Lin similarity, Wup similarity, Lch similarity and so on. In our system, we directly use the WordNet APIs provided by the NLTK toolkit (Bird, 2006) to calculate the similarity of two words.
Path similarity is based on the shortest-path similarity measure. The Path similarity formula is defined as Eq 1:

Path_Sim(c1, c2) = 2 · deep_max − len(c1, c2)   (1)

where c1 and c2 are concepts, deep_max is a fixed value for the WordNet hierarchy, and len(c1, c2) is the length of the shortest path between concepts c1 and c2 in WordNet.
Lch similarity (Leacock et al., 1998) measures the similarity of two words using the depth of concepts in the WordNet hierarchy tree. The Lch similarity formula is as Eq 2:

Lch_Sim(c1, c2) = −log( len(c1, c2) / (2 · deep_max) )   (2)

Res similarity (Resnik's measure) calculates similarity based on the common information content of the two concepts in the taxonomy.
The Res similarity formula is defined as Eq 3:

Res_Sim(c1, c2) = −log P(lso(c1, c2))   (3)

where lso(c1, c2) is the lowest common subsumer of concepts c1 and c2 in the taxonomy and P(c) is the probability of encountering concept c. The values of Lch similarity and Res similarity do not lie in [0, 1], so we scale these features into [0, 1]. Lin similarity (Lin, 1998) considers the similarity depending on the commonality and differences of the information contained in the two concepts. The Lin similarity formula is defined as Eq 4:

Lin_Sim(c1, c2) = 2 · log P(lso(c1, c2)) / (log P(c1) + log P(c2))   (4)

Wup similarity (Wu and Palmer, 1994) measures similarity based on the paths between the concept nodes, their shared parent node and the root node. The Wup similarity formula is defined as Eq 5:

Wup_Sim(c1, c2) = 2 · depth(lso(c1, c2)) / (depth(c1) + depth(c2))   (5)

We represent the two sentences as two word sets S1 and S2. For each word in S1 (or S2), we search for the most similar word in the other sentence using the measures above. For S1, we sum these maximum similarities and divide by the length of S1 to obtain the value V1; the same calculation for S2 gives V2. The similarity of the two sentences is then the harmonic mean of V1 and V2, defined as Eq 6:

Sim(S1, S2) = 2 · V1 · V2 / (V1 + V2)   (6)
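As a concrete sketch of the aggregation described above, the following minimal Python example combines any word-level similarity measure into a sentence score via the harmonic mean of the two directional averages. The toy_sim lookup is a hypothetical stand-in for a real WordNet measure such as NLTK's path_similarity.

```python
from statistics import mean

def sentence_similarity(s1, s2, word_sim):
    """Aggregate word-level similarities into a sentence score (Eq 6).

    word_sim(w1, w2) is any word-level measure in [0, 1], e.g. a
    WordNet Path/Lin/Wup similarity; a toy lookup is used below.
    """
    def directional(src, tgt):
        # For each word in src, take its best match in tgt, then average.
        return mean(max(word_sim(w, t) for t in tgt) for w in src)

    v1, v2 = directional(s1, s2), directional(s2, s1)
    if v1 + v2 == 0:
        return 0.0
    return 2 * v1 * v2 / (v1 + v2)  # harmonic mean of V1 and V2

# Toy word similarity standing in for a WordNet measure.
def toy_sim(w1, w2):
    return 1.0 if w1 == w2 else 0.0

print(sentence_similarity(["a", "b"], ["a", "c"], toy_sim))  # 0.5
```

In the real system the toy function would be replaced by one of the five WordNet measures, applied per word pair via NLTK.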

Word Embedding Based Features
Word Embedding maps words or phrases from a vocabulary to dense vectors of real values; such embeddings have been applied as features in document classification (Sebastiani, 2002), question answering (Tellex et al., 2003) and named entity recognition (Turian et al., 2010). In our system, we obtained word vectors using two kinds of unsupervised models, Word2Vec (Mikolov et al., 2013) and Global Vectors (GloVe) (Pennington et al., 2014), which can produce high-quality word vectors from large corpora. With the obtained word vectors, the following sentence similarities are calculated: W2V similarity, IDFV similarity, S2V similarity, Text similarity and WFSV similarity.
To obtain better word vectors, we trained Word2Vec vectors (400 dimensions) on the full English Wikipedia corpus and used the pre-trained Twitter vectors (200 dimensions) provided by GloVe.
W2V similarity measures the similarity of two sentences using word vectors. A sentence vector is the average of its word vectors, and the W2V similarity is the cosine similarity of the two sentence vectors, defined as Eq 7:

W2V_Sim(S1, S2) = Cos_Dis( (1/len(S1)) Σ_{w∈S1} W2V(w), (1/len(S2)) Σ_{w∈S2} W2V(w) )   (7)

where W2V(w) is the word embedding vector of w, and len(S1) and len(S2) are the lengths of the sentences. The cosine similarity is defined as Eq 8:

Cos_Dis(v1, v2) = (v1 · v2) / (|v1| · |v2|)   (8)

S2V similarity is another method that measures the similarity of two sentences directly, using the following formula as Eq 9:

S2V_Sim(S1, S2) = (1/2) [ (1/len(S1)) Σ_{w∈S1} maxSim(w, S2) + (1/len(S2)) Σ_{w∈S2} maxSim(w, S1) ]   (9)

maxSim(w, S) finds the maximum similarity between one word of a sentence and all words of the other sentence, and is defined as Eq 10:

maxSim(w, S) = max{ Cos_Dis(W2V(w), W2V(ws)) : ws ∈ S }   (10)

Text similarity (Mihalcea et al., 2006) combines the maxSim method with tf-idf weights to score the sentence pair, using the following formula as Eq 11:

Text_Sim(S1, S2) = (1/2) [ Σ_{w∈S1} maxSim(w, S2) · IDF(w) / Σ_{w∈S1} IDF(w) + Σ_{w∈S2} maxSim(w, S1) · IDF(w) / Σ_{w∈S2} IDF(w) ]   (11)

IDF_W2V similarity and Freq_W2V similarity represent a sentence vector with word embeddings weighted by word tf-idf and word frequency, respectively, as Eq 12 and Eq 13:

V_IDF(S) = Σ_{w∈S} IDF(w) · W2V(w)   (12)

V_Freq(S) = Σ_{w∈S} WF(w) · W2V(w)   (13)

where IDF(w) and WF(w) are the tf-idf weight and frequency of w computed on the full English Wikipedia corpus. After obtaining the sentence vectors, we compute the cosine similarity between the two vectors and use this value as a feature of the sentence pair.
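The maxSim and S2V computations can be sketched in a few lines of Python. The tiny 2-dimensional emb dictionary below is a hypothetical stand-in for real Word2Vec or GloVe vectors.

```python
import math

def cos_dis(v1, v2):
    # Cosine similarity (Eq 8).
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def max_sim(w, sent, emb):
    # Best cosine match of word w against every word of sent (Eq 10).
    return max(cos_dis(emb[w], emb[ws]) for ws in sent)

def s2v_sim(s1, s2, emb):
    # Average of the two directional best-match scores (Eq 9).
    d12 = sum(max_sim(w, s2, emb) for w in s1) / len(s1)
    d21 = sum(max_sim(w, s1, emb) for w in s2) / len(s2)
    return (d12 + d21) / 2

# Toy 2-d "embeddings" standing in for Word2Vec/GloVe vectors.
emb = {"cat": (1.0, 0.0), "dog": (1.0, 0.1), "car": (0.0, 1.0)}
print(round(s2v_sim(["cat"], ["dog"], emb), 3))  # 0.995
```

The tf-idf-weighted variants (Eq 11 to Eq 13) follow the same pattern with an extra per-word weight.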

Corpus Based Features
Latent Semantic Analysis (LSA) is a global matrix factorization technique for analysing the relationships between a set of documents and the words they contain. Based on an optimal vector space structure, the LSA method can leverage statistical information efficiently and is often used to measure word-to-word similarity.
Several publicly available tools can be used to construct LSA models; for example, the SemanticVectors Package (Widdows and Ferraro, 2008) and the S-Space Package (Jurgens and Stevens, 2010) can generate LSA space vectors. For this part, we directly use the word vectors provided by SEMILAR (Stefanescu et al., 2014) to calculate the features: W2V LSI similarity, S2V LSI similarity, Text LSI similarity, IDF LSI similarity and WFSV LSI similarity.
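As a minimal sketch of how an LSA space can be built, assuming scikit-learn is available (the system itself uses pre-built SEMILAR vectors, not this code): a tf-idf term-document matrix is reduced with truncated SVD and documents are then compared by cosine similarity in the low-rank space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a dog sat on the rug",
    "stock markets fell sharply today",
]
# Term-document matrix, then a low-rank SVD projection (the LSA space).
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
# Pairwise document similarities in the reduced space.
sims = cosine_similarity(lsa)
```

Word-to-word similarity works the same way on the transposed (word-by-document) matrix.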

Alignment Based Features
Alignment similarity measures sentence similarity based on monolingual alignment, which tries to discover word pairs with similar meanings by exploiting semantic and contextual similarities. In our work, we directly use the monolingual word aligner provided by Sultan et al. (2014a,b). Alignment similarity uses the following formula as Eq 14:

sts(S1, S2) = ( n_c^a(S1) + n_c^a(S2) ) / ( n_c(S1) + n_c(S2) )   (14)

where n_c^a(S1) and n_c^a(S2) are the numbers of aligned words in the two sentences, and n_c(S1) and n_c(S2) are the sentence lengths.
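Eq 14 reduces to a simple proportion once the aligner's output is available. In the sketch below, the index pairs are hypothetical aligner output, not the actual output of the Sultan et al. aligner.

```python
def alignment_sim(aligned_pairs, s1, s2):
    """Proportion of aligned words (Eq 14), assuming aligned_pairs is a
    list of (index in s1, index in s2) produced by a monolingual aligner."""
    n_a1 = len({i for i, _ in aligned_pairs})  # aligned words in S1
    n_a2 = len({j for _, j in aligned_pairs})  # aligned words in S2
    return (n_a1 + n_a2) / (len(s1) + len(s2))

s1 = ["a", "man", "plays", "guitar"]
s2 = ["a", "man", "is", "playing", "a", "guitar"]
pairs = [(0, 0), (1, 1), (2, 3), (3, 5)]  # hypothetical alignments
print(alignment_sim(pairs, s1, s2))  # 0.8
```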

Literal Based Features
For literal similarity, we use edit distance and Jaccard distance to calculate sentence similarity. Edit distance, also known as Levenshtein distance, is the minimum number of editing operations needed to transform one sentence into the other.
For the Jaccard measure, we first extract the part-of-speech tag of each word in a sentence using the NLTK toolkit. We then calculate the Jaccard similarity using the formula defined by Eq 15:

Jaccard_Sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|   (15)

where S1 and S2 are the sets of tags of the words in the two sentences, ignoring word order.
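Both literal measures are straightforward to implement directly; a minimal sketch (the POS tags below are illustrative, not NLTK output):

```python
def edit_distance(a, b):
    # Levenshtein distance via single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # min of deletion, insertion, substitution/match.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def jaccard_sim(tags1, tags2):
    # Jaccard similarity over the (unordered) POS-tag sets (Eq 15).
    s1, s2 = set(tags1), set(tags2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

print(edit_distance("kitten", "sitting"))  # 3
print(jaccard_sim(["DT", "NN", "VB"], ["DT", "NN", "JJ"]))  # 0.5
```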

Experiments and Results
We build our data set by collecting all the off-the-shelf English data sets released by prior STS evaluations (except the evaluation data set of STS 2016). After that, 80% of the data are used for training and 20% for validation. We trained an SVR model whose parameters are set as in Table 2.
Table 2: SVR parameters.

parameter | kernel | C   | gamma | epsilon
value     | rbf    | 0.1 | auto  | 0.0

Ontology based, Word Embedding based, Corpus based, Alignment based and Literal based features were fed to the SVR model separately, in order to explore the effect of each kind of feature. We used the SemEval 2016 evaluation data set to test the performance of the different feature sets; the resulting Pearson Correlation coefficients are shown in Table 1.
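A minimal sketch of training the SVR with the Table 2 parameters, assuming scikit-learn and synthetic feature data (the real system feeds in the similarity features described above and gold scores from prior STS data):

```python
import numpy as np
from sklearn.svm import SVR

# Toy feature matrix: each row holds the similarity features of one
# sentence pair; y is the gold 0-5 score (synthetic here, for illustration).
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = 5 * X.mean(axis=1)  # stand-in gold scores

# Parameters from Table 2: rbf kernel, C=0.1, gamma=auto, epsilon=0.0.
model = SVR(kernel="rbf", C=0.1, gamma="auto", epsilon=0.0)
model.fit(X, y)
pred = model.predict(X[:3])
```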
Table 1 indicates that Word2Vec performed better on the HDL, Postediting and Plagiarism data sets, while WordNet performed better on the Ans-Ans and Qus-Qus data sets. The reason may be that Word2Vec was trained on the full English Wikipedia corpus and can therefore learn better word vectors, while WordNet can make full use of lexical information to match synonyms between the two sentences.
We also used the SemEval 2017 evaluation data to test our system, adding each kind of feature one by one. The resulting Pearson Correlation coefficients are shown in Table 3.

Conclusion and Future Works
In this paper, we describe our system for the Semantic Textual Similarity task, Track 5 (English monolingual similarity), at SemEval 2017. We used five kinds of features and an SVR model to build the final system. We find that the Ontology based, Word Embedding based and Alignment based features performed better on some aspects of semantic similarity calculation. Due to time limitations, we did not try other methods. In future work, we plan to attempt tree-structured LSTM methods to calculate sentence similarity.