WSL: Sentence Similarity Using Semantic Distance Between Words

A typical social networking service contains huge amounts of data, and analyzing this data at the sentence level is important. In this paper, we describe our system for the SemEval2015 semantic textual similarity task (Task 2). We present our approach, which uses edit distance to take word order into account, and introduce word appearance in context. We report our results from SemEval2015.


Introduction
The Internet, particularly sites related to social networking services (SNS), contains a vast array of information used for a variety of purposes. The vector space model is conventionally used for natural language processing. This model creates vectors on the basis of frequency of word appearance and co-occurring words, without taking word order into account. When it comes to short texts, word co-occurrence is rare (or even non-existent), and the number of words is often less than in a typical newspaper article. Because the average SNS contains data consisting mostly of short sentences, the vector space model is not the best choice.
In this work, we describe a system we developed and submitted to SemEval2015. In the proposed system, we compute sentence similarity using edit distance to consider word order along with the semantic distance between words. We also introduce word appearance in context.
The rest of this paper is organized as follows. Section 2 reviews related work, and Section 3 presents the three systems we submitted to SemEval2015. In Section 4, we discuss the results of our evaluation at SemEval2015. We conclude in Section 5 with a brief summary.

Related Work
Recent research has introduced lexical databases as dictionaries for analyzing short texts (Aziz et al., 2010). Aziz et al. use sets of similar noun phrases, similar verb phrases, and common words to compute sentence similarity. Li et al. combine the semantic similarity between words, drawn from a hierarchical semantic knowledge base, with word order (Li et al., 2006). A few hierarchical semantic knowledge bases are currently available, one of which is WordNet (Miller, 1995). As of 2012, WordNet contained 155,287 words and 117,659 synsets, organized into the lexical categories of nouns, verbs, adjectives, and adverbs (WordNet Statistics, 2014). Every synset has semantic relations to other synsets. An example for nouns is shown in Fig. 1, where l is the shortest path length between w1 and w2 and h is the depth of the subsumer of w1 and w2 in WordNet. For example, consider the path between "boy" and "girl" in Fig. 1. The shortest path is boy-male-person-female-girl, of length 4, so l = 4. The subsumer of "boy" and "girl" is "person, human...", and the depth of this synset is h. In hierarchical semantic nets, words at the upper layers have general meanings and are less similar to one another than words at the lower layers. Li et al. set α = 0.2 and β = 0.45. Not only the similarity between words but also word order is important. For example, the two sentences "a dog bites Mike" and "Mike bites a dog" consist of the same words, but their meanings are very different. In this case, we use vectors such that when the two vectors match completely, the sentence similarity is high. Our approach is based on edit distance, which takes word order into account, combined with the semantic similarity between words.
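Li et al.'s word-similarity function combines an exponential decay in the path length l with a hyperbolic-tangent scaling in the subsumer depth h. A minimal sketch of that function follows; note that the depth h = 5 used for the "boy"/"girl" example is an assumed value, since the actual depth of "person" depends on the WordNet version:

```python
import math

def li_word_similarity(l, h, alpha=0.2, beta=0.45):
    """Word similarity from Li et al. (2006): e^(-alpha*l) damped by path
    length l, times tanh(beta*h), which grows with subsumer depth h."""
    return math.exp(-alpha * l) * math.tanh(beta * h)

# "boy"/"girl" example from Fig. 1: shortest path length l = 4;
# h = 5 is an assumed depth for the subsumer "person".
sim = li_word_similarity(l=4, h=5)
```

A longer path lowers the score, while a deeper (more specific) common subsumer raises it, matching the intuition that words in the lower layers of the hierarchy are more similar.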

System Details
The proposed system uses edit distance to take word order into account. It also uses the impact of word appearance in each context.
In this paper, we denote sentence S1 as S1 = {a1, a2, ..., an} and sentence S2 as S2 = {b1, b2, ..., bm}. S1 consists of n words and S2 consists of m words; ai is the i-th word of S1 and bj is the j-th word of S2. The similarity Sim(S1,S2) between S1 and S2 lies in the range of 0 (no relation) to 1 (semantic equivalence).

Edit Distance
Edit distance is a way of computing the dissimilarity between two strings. Conventionally, the distance is computed over a sequence of characters with three kinds of operations (substitution, insertion, and deletion). Our approach, however, operates on word sequences. Here, we describe the two kinds of edit distance extended in our system.

Jaro-Winkler Distance
The Jaro distance dj between S1 and S2 (|S1| = n, |S2| = m) is

dj = (1/3) (q/n + q/m + (q − t)/q),

where q is the number of matching words between S1 and S2. We consider two words as matching when they are identical and not farther apart than ⌊max(n, m)/2⌋ − 1 positions, and t is half the number of transpositions. The Jaro-Winkler distance dw is

dw = dj + kp(1 − dj),

where k is the length of the common sequence of words at the start of the two sentences and p is a constant, usually set to p = 0.1.
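The standard Jaro-Winkler computation, adapted from characters to word tokens as described above, can be sketched as follows (the prefix cap of 4 is the conventional choice for Jaro-Winkler and is an assumption here, not something stated in the paper):

```python
def jaro(s1, s2):
    """Jaro distance over word tokens rather than characters."""
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return 0.0
    window = max(max(n, m) // 2 - 1, 0)
    matched1, matched2 = [False] * n, [False] * m
    q = 0  # number of matching words within the window
    for i in range(n):
        for j in range(max(0, i - window), min(m, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                q += 1
                break
    if q == 0:
        return 0.0
    # t: half the number of transpositions among the matched words
    t, k = 0, 0
    for i in range(n):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (q / n + q / m + (q - t) / q) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro distance by the common word prefix of the sentences."""
    dj = jaro(s1, s2)
    k = 0
    for a, b in zip(s1, s2):
        if a != b or k >= max_prefix:
            break
        k += 1
    return dj + k * p * (1.0 - dj)

w1 = "a dog bites Mike".split()
w2 = "Mike bites a dog".split()
```

For this word pair, only "bites" falls inside the matching window, so the two reorderings score well below 1 despite sharing all their words.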

Semantic Distance
We borrow our approach to computing similarity between words from Li (Li et al., 2006) (Eq. (1)):

s(w1, w2) = e^(−αl) · (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh))     (1)

This measure can be used for both nouns and verbs because both are organized into hierarchies. However, it is not applicable to adjectives and adverbs, which are not organized into hierarchies. Therefore, in addition to Eq. (1), when w1 ∈ synset A and w2 ∈ synset B, we define the semantic similarity s(w1, w2) between adjectives or adverbs to be 1 if w1 and w2 are in the same synset, and 0 otherwise.
Conventionally, edit distance is calculated on the basis of an exact match or mismatch between words, ignoring how similar two words are. With this approach, if two different words have the same meaning (e.g., "fall" and "autumn"), edit distance treats them as a mismatch. We address this issue by using the semantic similarity between words as the substitution distance.

(a) Levenshtein distance
We rewrite the Levenshtein distance (Eq. (3)) so that the cost of substituting word ai for word bj reflects their semantic similarity rather than requiring an exact match, and on this basis we propose a measure of the sentence similarity Sim(S1,S2) between S1 and S2 (Eq. (8)).

(b) Jaro-Winkler distance
We also rewrite the Jaro distance dj defined by Eq. (4), replacing the match count q with q' (Eq. (10)), where q' is the sum of the semantic similarities between the words of S1 and S2 (q' = Σ c2(ai,bj) for 1 ≤ i ≤ n, 1 ≤ j ≤ m). c2 in Eq. (10) is defined by Eq. (11) and denotes the semantic similarity of a word pair. Further, whereas t was originally counted only when two words matched exactly (ai = bj), in our proposed method we count a transposition pair when s(ai,bj) > 0.5, so that the semantic similarity of words is taken into account. Using this rewritten Jaro-Winkler distance, we propose a second measure of the sentence similarity Sim(S1,S2) (Eq. (12)).
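The Levenshtein variant can be sketched as follows. This is a generic illustration, not the paper's exact Eq. (8): the normalization by the longer sentence length and the toy synonym table are our assumptions, standing in for the paper's formula and for the WordNet-based similarity of Eq. (1).

```python
def semantic_levenshtein_sim(s1, s2, word_sim):
    """Levenshtein distance over word tokens, with substitution cost
    1 - word_sim(a, b) so that near-synonyms cost almost nothing.
    Normalizing by the longer sentence length (an assumption) maps
    the distance into a similarity in [0, 1]."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - word_sim(s1[i - 1], s2[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + sub)    # semantic substitution
    return 1.0 - d[n][m] / max(n, m, 1)

# Toy word similarity standing in for Eq. (1): exact match or a listed
# synonym pair scores 1, anything else 0.
synonyms = {frozenset({"fall", "autumn"})}
def toy_sim(a, b):
    return 1.0 if a == b or frozenset({a, b}) in synonyms else 0.0
```

With this cost, substituting "fall" for "autumn" is free, so the two phrasings come out identical, whereas plain edit distance would charge them as a full mismatch.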

The Impact of Word Appearance in Context
There is one remaining issue when we compute Sim(S1,S2). Consider the two sentences "I ate an apple" and "I hate an apple", which have opposite meanings. Except for "ate" and "hate", however, the two sentences consist of the same words in the same order, so the method described above (Eq. (8)) computes a high Sim(S1,S2). Yet we judge that the two sentences have opposite meanings precisely because of "ate" and "hate". For this reason, we introduce conditional probability to estimate word appearance in each context, extracting the probabilities from a corpus used as training data. We then apply this word appearance to the semantic similarity (Eq. (1)) as a weight.
Let us show an example. P(I|S2), P(ate|S2), P(an|S2), and P(apple|S2) are the appearance probabilities of the words of S1 in context S2. We define S* as the set of nouns, verbs, adjectives, and adverbs in S (e.g., when sentence S is "It is a dog", S* is {"is", "dog"}). We measure the appearance weight weight(w) of each word w in context S as a function of doc_{w,S*}, the number of documents that contain both w and S*, and doc_{S*}, the number of documents that contain S*. We set the constant in this weight to 5.0.
We take the impact of words in context into account and apply it to the Levenshtein distance, rewriting Eq. (7) accordingly (Eq. (15)). When a word in one sentence frequently co-occurs with the words of the other sentence, its impact is low; when it co-occurs less frequently, its impact is high. We use Eq. (15) when ai and bj are nouns or verbs and s(ai,bj) < 0.7.
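One possible shape for such a weight can be sketched from document counts. The exponential decay below is purely illustrative, an assumption of ours (the paper's exact weighting formula was not reproduced here); it only mirrors the stated behavior that rare co-occurrence yields high impact and frequent co-occurrence yields low impact, with the constant 5.0 from the text controlling the decay:

```python
import math

def appearance_weight(doc_w_s, doc_s, c=5.0):
    """Illustrative context weight from document counts (an assumed form).
    p estimates the conditional co-occurrence probability of word w with
    context S*; the exponential decay gives rarely co-occurring words a
    weight near 1 (high impact) and frequently co-occurring words a
    weight near 0 (low impact)."""
    p = doc_w_s / doc_s if doc_s else 0.0
    return math.exp(-c * p)
```

Under this sketch, "hate" in the context of "I ... an apple" would receive a large weight if it rarely co-occurs with those words in the training corpus, amplifying its effect on the sentence score.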

Results
STS systems at SemEval 2015 were evaluated on five data sets. Each data set contained a number of sentence pairs that have a gold-standard score in the range of 0-5 as correct answers. The STS systems were evaluated by Pearson correlation between the system output and the gold-standard score. We used the Reuters Corpus as training data.
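The Pearson correlation used for scoring follows directly from its definition; a minimal sketch (the gold and system score lists are invented toy values on the 0-5 STS scale, for illustration only):

```python
def pearson(xs, ys):
    """Pearson correlation between system scores and gold-standard scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy scores on the 0-5 STS scale, for illustration only.
gold = [4.8, 2.0, 0.5, 3.2]
system = [4.5, 2.5, 1.0, 3.0]
```

A value of 1 means the system ranks and scales the pairs exactly as the gold standard does; the official ranking compares systems by this correlation on each data set.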

Submissions
We submitted the outputs of three runs of our system. In the STS task, the similarity score(S1,S2) between two sentences needed to be in the range of 0-5, so we set score(S1,S2) = 5 × Sim(S1,S2). For pre-processing, we used the Stanford NLP tools for tokenization and POS tagging, and we removed punctuation marks. We used JWNL to measure the similarity between words (Eq. (1)).
- run1: Levenshtein distance approach (Eq. (8))
- run2: Jaro-Winkler distance approach (Eq. (12))
- run3: run1 (Eq. (8)) in conjunction with word appearance in context (Eq. (15))
Table 1 shows the results (Pearson correlation) of each of our three runs evaluated on the five data sets. Our best system was run3, which was ranked 64th out of 74 systems.

Evaluation on STS 2015 Data
The weighted-mean scores of run1 and run2 were almost the same. When we compare the scores of run1 and run3, run3 performed better on four datasets (the exception was "answersforums"). Overall, the best performance in terms of weighted-mean score was by run3.

Conclusion
In this paper, we proposed methods for determining sentence similarity. We incorporated the semantic distance between words into edit distance, along with word appearance in context. The evaluation results suggest that word appearance in context is an effective element for determining sentence similarity.