SCIR-QA at SemEval-2017 Task 3: CNN Model Based on Similar and Dissimilar Information between Keywords for Question Similarity

We describe a method for calculating the similarity of questions in community question answering (cQA). Questions in cQA are usually very long and contain a lot of information that is useless for calculating question similarity. Therefore, we implement a CNN model based on the similar and dissimilar information between the questions' keywords. We extract the keywords of the questions, model the similar and dissimilar information between the keywords, and use the CNN model to calculate the similarity.


Introduction
We participate in SemEval-2017 Task 3 Subtask B (Nakov et al., 2017) on Community Question Answering. In this task, we are given a question from a community forum (the original question) and 10 related questions. We need to re-rank the related questions according to their similarity to the original question.
Both the original question and the related questions have a question subject and a question body. The subject is short. The body is long and contains a lot of useless information. In our system, we use keywords in place of the full questions to locate the more important information in each question, and we use a keyword extraction algorithm that incorporates syntactic information to obtain more accurate keywords. We then use a CNN model based on the similar and dissimilar information between questions to calculate their similarity. The model makes use of both kinds of information to obtain better results.
The paper is organized as follows: Section 2 introduces our system, Section 3 describes the experiment, and Section 4 presents the conclusions.

Model
In this section we describe our system in detail. Section 2.1 shows how we extract keywords from the subject and body, and Section 2.2 describes how we construct the CNN model based on the similar and dissimilar information between question keywords.

Keyword extraction
First, we split the question subject and question body into sub-sentences. Then, we extract keywords from each sub-sentence. We combine all the extracted keywords as the result.
We use an unsupervised keyword extraction method based on dependency analysis. The method uses the syntactic dependency relations between words as clues. For a given question, we not only use statistical information and word vector information, but also construct a dependency graph to calculate the correlation intensity between words. We then construct a weighted graph according to the dependency degree and use the TextRank algorithm (Mihalcea and Tarau, 2004) to iteratively calculate the word importance scores. The main steps are preprocessing, construction of the undirected weighted graph, graph ranking, and selection of the t words with the highest scores as the keywords of the question, as shown in Figure 1.
Preprocess: Preprocessing includes word segmentation and stop-word removal. We use the remaining words as the candidate keywords.
Construct the undirected weighted graph: After preprocessing, all candidate words are represented as vertices of the graph. If two words co-occur in a sentence, there is an edge between the two vertices. The weight of the edge is calculated from the statistical information on the words, the word vector information, and the dependency syntax analysis information.
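The preprocessing and graph construction steps can be sketched as follows. This is a minimal illustration, not our implementation: the stop-word list, the whitespace tokenizer, and the function names are hypothetical, and the edges are left unweighted here.

```python
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "in", "to", "i", "of", "and", "can"}

def candidate_words(sentence):
    """Lower-case, split on whitespace, strip punctuation, drop stop words."""
    tokens = [tok.strip(".,?!;:").lower() for tok in sentence.split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def cooccurrence_edges(sentences):
    """Return the undirected edges between candidate words sharing a sentence."""
    edges = set()
    for sentence in sentences:
        words = sorted(set(candidate_words(sentence)))
        # Each pair (u, v) with u < v stands for one undirected edge.
        for u, v in combinations(words, 2):
            edges.add((u, v))
    return edges

edges = cooccurrence_edges([
    "Where can I buy a cheap car in Doha?",
    "Is the car market open?",
])
```

Words from different sentences, such as "buy" and "market" above, are not connected unless they also co-occur elsewhere.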
Methods that can be used to calculate the correlation between two words include Pointwise Mutual Information (PMI) and Average Mutual Information (AMI) (Terra and Clarke, 2004). However, these methods only consider the statistical information between words and do not consider the syntactic dependencies. The syntactic dependency between words has a positive effect on measuring the importance of words.
The result of the dependency syntax analysis resembles a tree structure. If we remove its root node and ignore the direction of the arcs, we get an undirected dependency graph $G = (V, E)$, with $V = \{w_1, w_2, \ldots, w_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$, where $w_i$ denotes a word and $e_j$ denotes an undirected relationship between two words. The undirected dependency graph guarantees that there is a dependency path between any two words in the question, and the length of the dependency path reflects the intensity of the dependency relationship. Therefore, we introduce the concept of dependency degree according to the length of the dependency path (Zhang et al., 2012), as shown in Equation (1), where $\mathrm{dr\_path\_len}(w_i, w_j)$ represents the dependency path length between words $w_i$ and $w_j$, and $b$ is a hyperparameter.
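As an illustration, the dependency path length can be computed with a breadth-first search over the undirected dependency graph. The sketch below uses hypothetical arcs and function names. The decay form used in `dependency_degree` is only an assumption made for this example (the degree shrinks as the path grows, controlled by b); it is not a restatement of Equation (1).

```python
from collections import deque

def dependency_path_length(arcs, source, target):
    """Shortest path length between two words in an undirected dependency graph."""
    neighbours = {}
    for u, v in arcs:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two words are not connected

def dependency_degree(path_len, b=1.4):
    """ASSUMED decay form (1 / b ** path_len), for illustration only."""
    return 1.0 / (b ** path_len)

# Hypothetical arcs for "I want to buy a cheap car", root node already removed.
arcs = [("want", "I"), ("want", "buy"), ("buy", "to"),
        ("buy", "car"), ("car", "a"), ("car", "cheap")]
length = dependency_path_length(arcs, "want", "cheap")
```

Under this assumed form, directly linked words receive a larger dependency degree than words several arcs apart.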
The degree of correlation between two words, that is, the weight of the edge between them, is obtained by multiplying the gravitational value of the two words by their dependency degree, which is derived from the length of the dependency path, as shown in Equation (2).
The concept of gravitational value was proposed by Wang et al. (2015), inspired by gravitation. The word frequency is regarded as the mass of an object, and the distance between the words as the distance between the objects. The gravitational value $f(w_i, w_j)$ of the two words is given by Equation (3).
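A rough sketch of the edge weight computation is given below. It assumes a Newton-style form for the gravitational value (product of frequencies over squared distance, with the gravitational constant dropped) and the Euclidean distance between word vectors as the distance between words. Both are assumptions made for this illustration, and the function names are hypothetical.

```python
def gravitational_value(freq_i, freq_j, vec_i, vec_j):
    """Product of the 'masses' (word frequencies) over the squared distance.

    ASSUMPTIONS: Euclidean distance between word vectors, no constant factor.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(vec_i, vec_j))
    if dist_sq == 0.0:
        raise ValueError("identical vectors: the distance is zero")
    return freq_i * freq_j / dist_sq

def edge_weight(dep_degree, gravity):
    """Edge weight as the product of dependency degree and gravitational value."""
    return dep_degree * gravity

# Two words with frequencies 2 and 3 whose vectors are at distance 5.
g = gravitational_value(2, 3, [0.0, 0.0], [3.0, 4.0])
w = edge_weight(0.5, g)
```

Frequent words that are close in the embedding space and close in the dependency graph thus receive the heaviest edges.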
Graph ranking: We use the weighted TextRank algorithm to rank the vertices of the graph. In the undirected graph $G = (V, E)$, $V$ is the set of vertices, $E$ is the set of edges, and $C(v_i)$ is the set of vertices connected to the vertex $v_i$. The score of the vertex $v_i$ is calculated by Equation (4), where $\mathrm{weight}(w_i, w_j)$ is the edge weight calculated by Equation (2) and $d$ is the damping coefficient.
Then we select the t words with the highest score as the keywords.
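The ranking and selection steps can be sketched with the standard weighted TextRank update of Mihalcea and Tarau (2004). The edge weights below are hypothetical, and the fixed iteration count is a simplification of a proper convergence check.

```python
def weighted_textrank(weights, d=0.8, iterations=50):
    """weights: dict mapping an undirected edge (u, v) to a positive weight."""
    neighbours = {}
    for (u, v), w in weights.items():
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w
    scores = {v: 1.0 for v in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for v in neighbours:
            rank = 0.0
            for u, w_uv in neighbours[v].items():
                # u shares its score among its neighbours in proportion to edge weight.
                rank += w_uv / sum(neighbours[u].values()) * scores[u]
            new_scores[v] = (1 - d) + d * rank
        scores = new_scores
    return scores

def top_keywords(scores, t):
    """The t highest-scoring words, ties broken alphabetically."""
    return sorted(scores, key=lambda word: (-scores[word], word))[:t]

# Hypothetical edge weights for a four-word question.
weights = {("cheap", "car"): 1.0, ("car", "buy"): 2.0, ("buy", "doha"): 0.5}
scores = weighted_textrank(weights)
keywords = top_keywords(scores, 2)
```

In this toy graph, "car" and "buy" sit on the heaviest edge and are selected as the two keywords.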

CNN model based on similar and dissimilar information
We use a CNN model based on the similar and dissimilar parts of two sentences to compute sentence similarity. This model was proposed by Wang et al. (2016); we introduce it briefly here. Figure 2 shows the structure of the model.
Given a sentence pair, the model represents each keyword as a vector and calculates a semantic matching vector for each keyword based on part of the keywords in the other sentence. Each word vector is then decomposed into two components based on the semantic matching vector: a similar component and a dissimilar component. After this, we use a two-channel CNN to compose the similar and dissimilar components into a feature vector. Finally, a fully connected neural network predicts the sentence similarity from the composed feature vector.
First, using the word embeddings pre-trained by Stanford with the GloVe model (Pennington et al., 2014), we transform the keywords of questions S and T into the matrices $S = [s_1, s_2, \ldots, s_m]$ and $T = [t_1, t_2, \ldots, t_n]$, where $s_i$ and $t_j$ are the 300-dimensional vectors of the corresponding keywords, and $m$ and $n$ are the numbers of keywords of S and T.
Second, to judge the similarity between two sentences, we check whether each keyword in one sentence can be covered by the other sentence. For a sentence pair S and T, we first calculate a similarity matrix $A_{m \times n}$, where each element $a_{i,j} \in A_{m \times n}$ is the cosine similarity between words $s_i$ and $t_j$, that is, $a_{i,j} = s_i^{\top} t_j / (\lVert s_i \rVert \, \lVert t_j \rVert)$. We calculate a semantic matching vector $\hat{s}_i$ for each word $s_i$ by composing part of the word vectors in the other sentence T. In this way, we can match a keyword $s_i$ to some keywords in T. Similarly, we calculate all semantic matching vectors $\hat{t}_j$ in T. We define a semantic matching function over a local window of the other sentence, where $k = \arg\max_j a_{i,j}$ is the position of the most similar word and $w$ is the size of the window centered at $k$. The semantic matching vector is thus a weighted average of the vectors from $t_{k-w}$ to $t_{k+w}$.
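The similarity matrix and the windowed semantic matching can be sketched as follows. This is a plain-Python illustration with toy two-dimensional vectors. It assumes that the weights of the weighted average are the cosine similarities themselves, as in the local matching function of Wang et al. (2016), and that the window is clipped at the sentence boundaries; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(S, T):
    """A[i][j] is the cosine similarity between keyword i of S and keyword j of T."""
    return [[cosine(s, t) for t in T] for s in S]

def semantic_match(i, A, T, w):
    """Weighted average of T's vectors in a window of radius w around the best match.

    ASSUMPTION: the averaging weights are the similarities A[i][j].
    """
    row = A[i]
    k = max(range(len(row)), key=row.__getitem__)  # most similar position
    lo, hi = max(0, k - w), min(len(T), k + w + 1)  # clip at the boundaries
    total = sum(row[j] for j in range(lo, hi))
    dim = len(T[0])
    if total == 0:
        return [0.0] * dim
    return [sum(row[j] * T[j][c] for j in range(lo, hi)) / total
            for c in range(dim)]

S = [[1.0, 0.0]]
T = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
A = similarity_matrix(S, T)
s_hat = semantic_match(0, A, T, w=1)
```

Here the best match for the single keyword of S is the second keyword of T, and the window of radius 1 covers all three keywords of T.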
Third, after semantic matching, we have the semantic matching vectors $\hat{s}_i$ and $\hat{t}_j$. Take S as an example. We interpret $\hat{s}_i$ as the semantic coverage of word $s_i$ by the sentence T. However, there must be some difference between $s_i$ and $\hat{s}_i$. So, based on its semantic matching vector $\hat{s}_i$, our model further decomposes word $s_i$ into two components: a similar component $\hat{s}_i^{+}$ and a dissimilar component $\hat{s}_i^{-}$. We choose a linear decomposition method. The motivation is that the more similar $s_i$ and $\hat{s}_i$ are, the higher the proportion of $s_i$ that should be assigned to the similar component. First, we calculate the cosine similarity $\alpha$ between $s_i$ and $\hat{s}_i$. Then, we decompose $s_i$ linearly based on $\alpha$; Eq. (7) gives the corresponding definition.
Finally, because the dissimilar and similar components are strongly connected, we use a two-channel CNN model (Kim, 2014) to compose them together. The CNN model has three layers. The first is a convolution layer. We define a list of filters $w_o$. The shape of each filter is $d \times h$, where $d$ is the dimension of the word vectors and $h$ is the window size. Each filter is applied to two patches (windows of $h$ vectors) from the similar and dissimilar channels and generates a feature; Eq. (8) expresses this process. The second layer is a pooling layer. We choose max-pooling to deal with the variable feature size. The last layer is a fully connected layer. We use a sigmoid function to constrain the result to the range [0, 1].
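The decomposition and the convolution with max-pooling can be illustrated with the sketch below. Several points are assumptions made for this example rather than details stated above: the similar part is taken as alpha times the word vector and the dissimilar part as the remainder, each filter holds a separate h × d weight slice per channel (as in a standard multi-channel convolution), and tanh is used as the activation. All names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def decompose(s, s_hat):
    """ASSUMED linear split: similar = alpha * s, dissimilar = (1 - alpha) * s."""
    alpha = cosine(s, s_hat)
    similar = [alpha * x for x in s]
    dissimilar = [(1 - alpha) * x for x in s]
    return similar, dissimilar

def conv_max_feature(S_plus, S_minus, filt_plus, filt_minus, bias, h):
    """Apply one two-channel filter at every window position, then max-pool.

    S_plus and S_minus are lists of d-dimensional vectors (one per keyword);
    filt_plus and filt_minus are h x d weight lists, one per channel.
    """
    features = []
    for i in range(len(S_plus) - h + 1):
        z = bias
        for r in range(h):
            for c in range(len(S_plus[0])):
                z += filt_plus[r][c] * S_plus[i + r][c]
                z += filt_minus[r][c] * S_minus[i + r][c]
        features.append(math.tanh(z))  # ASSUMED activation
    return max(features)  # max-pooling copes with the variable number of windows

sim, dis = decompose([1.0, 0.0], [1.0, 1.0])

S_plus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S_minus = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
feature = conv_max_feature(S_plus, S_minus,
                           [[1.0, 0.0], [0.0, 1.0]],
                           [[0.0, 0.0], [0.0, 0.0]],
                           bias=0.0, h=2)
```

With 500 filters, repeating this for every filter would give a 500-dimensional feature vector that is passed to the fully connected layer.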

Experiment
We experimented with the corpus provided by SemEval-2017 Task 3. The training set has 267 questions, each with 10 related questions, for a total of 2670 question pairs. The development set has 50 questions (500 question pairs). The test set has 88 questions (880 question pairs). We run the experiment without preprocessing. We use the Stanford Parser (De Marneffe and Manning, 2008) to parse sentences. We use the keyword extraction algorithm described in Section 2.1; for each sub-sentence we extract 1/3 of the words as keywords and set b = 1.4 and d = 0.8. In the CNN model, we set the filter shape to 3×300 and the number of filters to 500. We set the similarity threshold to 0.5, that is, a score greater than 0.5 is considered a positive case. We set the learning rate to 0.001. After 20 rounds of training, we obtained the results on the development set and the test set.
The results on the test set are shown in Table 1: the first two lines are the baselines, the next two lines are the best results, and the last line is our result. The results on the development set are shown in Table 2. On the test set, our results are better than the baselines, but there is still some distance from the best results. On the development set, our results are not as good.
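For illustration, the final re-ranking and thresholding step might look as follows. The question identifiers and scores are made up, and the function name is hypothetical.

```python
def rerank(scored, threshold=0.5):
    """Sort related questions by similarity and flag those above the threshold.

    scored: list of (question_id, similarity) pairs with similarity in [0, 1].
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(qid, score, score > threshold) for qid, score in ranked]

# Hypothetical model outputs for three related questions.
result = rerank([("R1", 0.31), ("R2", 0.87), ("R3", 0.50)])
```

A score of exactly 0.5 is not counted as positive, since only scores greater than the threshold are.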
We think that because we run the experiment without preprocessing, too many words are missing from the word embeddings, which results in poor system performance. In addition, because the training corpus is too small, the neural network cannot be trained well and cannot find meaningful features. Therefore, in future work, we will add manually extracted features to the neural network to improve performance.

Result and Future Work
We implement a CNN model based on the similar and dissimilar information between question keywords and experiment on the SemEval-2017 corpus. The experimental results show that our method is better than the baseline: extracting the key information from long sentences lets us model the questions better, which helps us calculate question similarity. We think that keyword extraction is important in this task, and in the future we will try other keyword extraction methods to achieve better results.