ITNLP-AiKF at SemEval-2016 Task 3: a question answering system using a community QA repository

Community Question Answering (CQA) systems play an important role in people's lives because of the large amount of knowledge accumulated in them. To take full advantage of this knowledge, SemEval-2016 Task 3 aims to find the best answers to a new question in CQA. This work proposes rich semantic text similarity (STS) features to complete the task. We address the task as a ranking problem, and a Support Vector Regression (SVR) model is chosen to combine the rich semantic similarity features with context features. Finally, we use a genetic algorithm for feature selection. Our method achieves a MAP (mean average precision) of 71.52%, 71.43% and 48.49% in subtasks A, B and C respectively, ranking 8th in subtasks A and B, and 7th in subtask C.


Introduction
With its interactive and open character, a CQA system can better adapt to the diverse needs of users. As the number of users has grown, community question answering systems have accumulated large archives of QA pairs, which presents new challenges in analyzing users' requirements and recommending high-quality answers to them.
In response to this problem, SemEval-2015 Task 3, "Answer Selection in Community Question Answering" (Nakov et al., 2015, http://alt.qcri.org/semeval2015/task3/), proposed to divide the answers into three levels in accordance with their relevance to the question. However, such a classification does not fully satisfy the questioner's needs, as it does not implement a recommendation function.
SemEval-2016 Task 3, "Community Question Answering" (Nakov et al., 2016), builds on SemEval-2015 Task 3 and puts forward a new requirement: to automate the process of finding good answers to new questions in a community-created discussion forum. The task is divided into three parts: subtask A, "Question-Comment Similarity"; subtask B, "Question-Question Similarity"; and subtask C, "Question-External Comment Similarity".
In our work, we focus on features that employ STS knowledge, such as text similarity features extracted from word vectors, structured resources and topic models, to deal with the task. Word vectors have been used by (Liu, Sun, Lin, Zhao, & Wang, 2015) and (Nicosia et al., 2015) to compute STS, and (Jin, Sun, Lin, & Wang, 2014) evaluated word-phrase semantic similarity with a structured resource.

Feature
The main idea of our method is to estimate sentence similarity from the similarities between the most similar words in the two sentences. Our features include the following categories: WordNet-based features, vector features, word matching features, topic features and answer features.

Vector Features
We apply three approaches to measuring sentence similarity with word vectors.
The first one uses the sum of the vectors of all words in sentence $s$ as the representation of $s$, and calculates the cosine similarity of the two sentence vectors:

$$sim(s_1, s_2) = c\_sim\Big(\sum_{w \in s_1} vec(w),\ \sum_{w \in s_2} vec(w)\Big)$$

where $s$ is a sentence, $vec(w)$ is the vector of word $w$, and $c\_sim(v, u)$ is the cosine similarity

$$c\_sim(v, u) = \frac{\sum_{i=1}^{N} v_i u_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\ \sqrt{\sum_{i=1}^{N} u_i^2}}$$

where $v$ and $u$ are two $N$-dimensional vectors and $v_i$ is the $i$-th element of $v$.
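For illustration, the first approach can be sketched in a few lines of Python. This is a minimal sketch assuming a trained Gensim word2vec model; the model path and helper names are ours, not from the original system:

```python
import numpy as np
from gensim.models import Word2Vec

def c_sim(v, u):
    """Cosine similarity between two N-dimensional vectors."""
    denom = np.linalg.norm(v) * np.linalg.norm(u)
    return float(np.dot(v, u) / denom) if denom else 0.0

def sentence_vector(wv, tokens):
    """Sum of the word vectors of all in-vocabulary tokens."""
    vecs = [wv[w] for w in tokens if w in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# model = Word2Vec.load("word2vec_wiki.model")  # hypothetical model path
# sim = c_sim(sentence_vector(model.wv, s1_tokens),
#             sentence_vector(model.wv, s2_tokens))
```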
The second and the third approaches are similar to each other. Computing the similarity of a sentence pair involves the following three steps.
First, given two sentences $s_1$ and $s_2$, for each word $v$ in sentence $s_1$ we find the word $u$ in sentence $s_2$ that is most similar to $v$, and we do the same for sentence $s_2$. Second, we calculate the similarity of the sentence pair from the perspective of each sentence. The second approach uses the plain average of the alignment scores,

$$sim_{s_1}(s_1, s_2) = \frac{1}{l} \sum_{v \in s_1} \max_{u \in s_2} c\_sim(vec(v), vec(u)),$$

and the third weights them by inverse document frequency,

$$sim_{s_1}(s_1, s_2) = \frac{\sum_{v \in s_1} idf(v) \cdot \max_{u \in s_2} c\_sim(vec(v), vec(u))}{\sum_{v \in s_1} idf(v)},$$

where $l$ is the number of words in sentence $s$ after stopword removal, and $idf(w)$ is the inverse document frequency (Sparck Jones, 1972) of word $w$ in the Wikipedia data.
Third, the value is averaged over the two sentences:

$$sim(s_1, s_2) = \frac{sim_{s_1}(s_1, s_2) + sim_{s_2}(s_1, s_2)}{2}$$

We trained two word2vec models using the Gensim toolkit (Řehůřek & Sojka, 2010): the first on the training data, and the second on Wikipedia data. Only the latter two approaches are used with both models.
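The alignment-based similarity can be sketched as follows; this is a minimal Python sketch assuming `wv` is a Gensim keyed-vectors object and `idf` a word-to-IDF dictionary precomputed from Wikipedia (both hypothetical names of ours):

```python
import numpy as np

def c_sim(v, u):
    """Cosine similarity, as defined in the vector-features section."""
    denom = np.linalg.norm(v) * np.linalg.norm(u)
    return float(np.dot(v, u) / denom) if denom else 0.0

def directed_sim(wv, idf, s1, s2):
    """Align each word of s1 with its most similar word in s2, then take the
    idf-weighted average of the alignment scores (the third approach;
    use uniform weights for the second)."""
    scores, weights = [], []
    for v in s1:
        if v not in wv:
            continue
        cands = [c_sim(wv[v], wv[u]) for u in s2 if u in wv]
        if not cands:
            continue
        scores.append(max(cands))
        weights.append(idf.get(v, 1.0))
    return float(np.average(scores, weights=weights)) if scores else 0.0

def pair_sim(wv, idf, s1, s2):
    """Step three: average the two directed similarities."""
    return 0.5 * (directed_sim(wv, idf, s1, s2) + directed_sim(wv, idf, s2, s1))
```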
In our system, all six methods are used to measure word similarity. The WordNet-based features are computed using the same formulas as the last two methods of the vector features.

Word Match Features
Longest Common Subsequence (Allison & Dix, 1986) retains the words' position information when computing sentence similarity:

$$sim(s_1, s_2) = \frac{2 \cdot lcs(s_1, s_2)}{l_1 + l_2}$$

where $lcs(s_1, s_2)$ is the length of the longest common subsequence, and $l_1$ and $l_2$ are the numbers of words in $s_1$ and $s_2$.
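A standard dynamic-programming implementation of this feature, written as a short self-contained sketch:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i-1][j-1] + 1 if a[i-1] == b[j-1] else max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def lcs_sim(s1, s2):
    """Dice-style normalization of the LCS length, as in the formula above."""
    return 2.0 * lcs_len(s1, s2) / (len(s1) + len(s2)) if (s1 or s2) else 0.0
```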
We also use a bag-of-words representation to capture the hidden relationship between words and sentences, with cosine similarity as the measure of vector similarity.
In addition, we use the Stanford CoreNLP toolkit (Finkel, Grenager, & Manning, 2005) to extract the nouns of the two sentences and measure their similarity with the same bag-of-words representation.
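The bag-of-words cosine feature itself is straightforward; a minimal sketch using scikit-learn (which the system already uses for SVR) might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_sim(text1, text2):
    """Cosine similarity between bag-of-words vectors of two texts."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform([text1, text2])
    return float(cosine_similarity(X[0], X[1])[0, 0])
```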

Topic Features
All the features mentioned above are based on lexical similarity. In order to overcome the limitation of lexical features, we build Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) and Latent Semantic Analysis (Hofmann, 2001) models using the Gensim toolkit (Řehůřek & Sojka, 2010), both trained on Wikipedia data.
The topic models produce a sentence vector directly, and we compute the distance between vectors by cosine similarity.
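As a sketch of the inference step, the following assumes an LDA model and dictionary previously trained on Wikipedia with Gensim; the file paths are hypothetical:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary.load("wiki.dict")   # hypothetical path
lda = models.LdaModel.load("wiki_lda.model")        # hypothetical path

def topic_vector(tokens, num_topics):
    """Dense topic distribution of a tokenized sentence under the LDA model."""
    bow = dictionary.doc2bow(tokens)
    dense = [0.0] * num_topics
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = float(prob)
    return dense

# v1 = topic_vector(s1_tokens, lda.num_topics)
# v2 = topic_vector(s2_tokens, lda.num_topics)
# feature = c_sim(v1, v2)  # cosine similarity, as in the vector features
```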

Answer Features
Closely analyzing the training data, we noticed that many "Good" comments tend to suggest that the questioner visit a web site or ask further questions by email, and many "Good" comments contain pictures or numbers to explain themselves more clearly. Moreover, "Good" comments are much longer on average than "PotentiallyUseful" and "Bad" comments.
In addition, the respondents themselves have a great influence on the quality of the answers. A comment is likely to be "Bad" if the respondent is also the questioner, or if the respondent is not the questioner but asks a question in turn. Conversely, if a respondent is accustomed to submitting high-quality comments, he or she is likely to offer a "Good" suggestion for the current question. We therefore computed the accuracy and error rates of comments for all users.
The answer features are only applied in subtask A.
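These cues translate directly into surface features. The following is a sketch under our own naming; `user_good_rate` stands in for the per-user comment quality statistics described above:

```python
import re

def answer_features(comment_text, comment_user, question_user, user_good_rate):
    """Surface features of a comment, following the cues described above.
    user_good_rate is a hypothetical dict mapping a user to the fraction
    of their past comments labeled "Good"."""
    return [
        1.0 if re.search(r"https?://|www\.", comment_text) else 0.0,  # suggests a web site
        1.0 if re.search(r"[\w.]+@[\w.]+", comment_text) else 0.0,    # mentions an email address
        1.0 if re.search(r"\d", comment_text) else 0.0,               # contains numbers
        float(len(comment_text.split())),                              # comment length in words
        1.0 if comment_user == question_user else 0.0,                 # respondent is the questioner
        1.0 if "?" in comment_text else 0.0,                           # respondent asks a question
        user_good_rate.get(comment_user, 0.5),                         # user's historical quality
    ]
```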

Feature generation
Each question has a brief description and a detailed description. Take the following question as an example:

OrgQSubject: What is the purpose of heaven?

OrgQBody: What is the point? What is in it for the ones that get there? Let's leave the purpose of hell for another thread. I invite you to ponder. You can quote scripture or Sura's etc if you want but you must expand upon them with your own thoughts.
As we can see, people can get a broad understanding of the question by reading the brief description, and experiments show that features of the brief description lead to a better result. We assume that if features perform well in a classification model, they will also do a good job in a ranking model. So, for subtask B we extracted feature values from both the brief description and the detailed description. For subtask A, models are trained with all the features mentioned above. For subtask C, we multiply subtask A's results by subtask B's. Eventually we obtained 38 features for subtask A and 56 features for subtask B. Table 2 lists all the features.

Feature Selection
Considering that some subset of the features may perform better than the full feature set, we designed a genetic algorithm (Renna, 2000) to find the best one. The genetic algorithm (GA) can be described as follows. Encoding: assuming there are n features, an n-bit binary string encodes a chromosome. The process of feature selection is shown in Figure 1. Individual creation: relying on the hypothesis that a feature can make a subset work better when added to the current subset, we increase the probability that each feature is selected (to 0.75).

Fitness: we employ SVR as the evaluation function of the feature selection. Selection: the reproduction operator simply keeps the individuals with the highest fitness as part of the next generation, instead of adopting a probabilistic selection algorithm. Crossover: we use single-point crossover. Mutation: we draw a random probability, and if the value is below the preset threshold, an individual is selected and one bit of it is flipped at random.
In order to retain the best feature subset, all the operations mentioned above are applied among the superior individuals, and the mutation rate is set to a relatively large value (0.3) to escape local minima. Figure 2 is the flowchart of the GA, where n is the quorum, m is the selection scale, and thresh is the fitness threshold. Since the GA can produce different results on each run, we run the selection function several times and choose the best result. Table 3 and Table 4 show the results of subtasks A and B over 20 runs of the GA, respectively. All experiments below train models on the training dataset and test them on the development dataset.
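The GA loop can be sketched as follows; the population size, generation count and elite size are illustrative assumptions of ours, while the 0.75 inclusion probability and 0.3 mutation rate are the values stated above. `fitness(mask)` stands for scoring a feature subset, e.g. the MAP of an SVR model on the development data:

```python
import random

def ga_select(n_features, fitness, pop_size=30, generations=50,
              p_select=0.75, p_mutate=0.3, elite=10):
    """Genetic-algorithm feature selection following the description above."""
    pop = [[1 if random.random() < p_select else 0 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:elite]  # keep the top individuals (no probabilistic selection)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            point = random.randrange(1, n_features)      # single-point crossover
            child = a[:point] + b[point:]
            if random.random() < p_mutate:               # flip one random bit
                i = random.randrange(n_features)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```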

Training Model
We trained a Maximum Entropy model using the maxent toolkit (Le, 2004) and an SVR model (Smola & Schölkopf, 2004) using the scikit-learn toolkit (Pedregosa et al., 2011). Table 5 and Table 6 show the results of the different models.
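The SVR side reduces to a few scikit-learn calls. In this minimal sketch the toy data and the mapping of "Good"/"PotentiallyUseful"/"Bad" labels to 2/1/0 are our own assumptions:

```python
import numpy as np
from sklearn.svm import SVR

# Toy example: rows are feature vectors of question-comment pairs;
# labels map "Good"/"PotentiallyUseful"/"Bad" to 2/1/0 (an assumption).
X_train = np.array([[0.9, 0.8, 0.7], [0.2, 0.3, 0.1], [0.5, 0.6, 0.4]])
y_train = np.array([2.0, 0.0, 1.0])

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Candidate comments of each question are ranked by the predicted score.
X_candidates = np.array([[0.8, 0.7, 0.9], [0.1, 0.2, 0.2]])
ranking = np.argsort(-model.predict(X_candidates))
```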

Result
We submitted only once, and our system performs better in subtasks A and C than in subtask B.

Conclusion and Future Work
We tested the system by taking part in the English subtasks of SemEval-2016 Task 3, and our system works better on subtasks A and C than the IR baseline provided by the organizers. Aware of our method's shortcoming that the features center on lexical similarity, we will pay attention to processing long-sentence similarity in future work.