EICA Team at SemEval-2017 Task 3: Semantic and Metadata-based Features for Community Question Answering

We describe our system for SemEval-2017 Task 3 on Community Question Answering. Our approach relies on combining a rich set of features of two types: semantic and metadata-based. The most important groups turned out to be the metadata features and the semantic vectors trained on QatarLiving data. In the main Subtask C, our primary submission was ranked fourth, with a MAP of 13.48 and an accuracy of 97.08. In Subtask A, our primary submission was ranked in the top 50%.


Introduction
SemEval-2017 Task 3 on Community Question Answering (Nakov et al., 2017) targets a real-life application. The main subtask, Subtask C (Question-External Comment Similarity), asks systems to find an answer in the forum that is appropriate as a response to a newly posted question. This is achieved by retrieving similar questions and ranking their answers with respect to the new question. Three subtasks are defined in total: Subtask A (Question-Comment Similarity): Given a question from a question-comment thread, rank the comments within the thread by their relevance to the question. The comments in a question-comment thread are annotated as Good, PotentiallyUseful, and Bad. A good ranking is one that places all Good comments above the PotentiallyUseful and Bad ones.
Subtask B (Question-Question Similarity): Given a new question, re-rank the similar questions retrieved by a search engine with respect to that question. The potentially relevant questions are annotated as PerfectMatch, Relevant, and Irrelevant with respect to the original question. A good ranking is one in which the PerfectMatch and Relevant questions are both ranked above the Irrelevant ones.
Subtask C (Question-External Comment Similarity): Given a new question and the first 10 related questions retrieved by a search engine, each associated with the first 10 comments in its thread, re-rank the 100 comments (10 questions × 10 comments) by their relevance to the original question.

Related Work
This year's SemEval-2017 Task 3 is a follow-up of SemEval-2016 Task 3 on Answer Reranking in Community Question Answering. There are three reranking subtasks associated with the English dataset. Subtask A is the same as Subtask A at SemEval-2015 Task 3 (Joty et al., 2015), but with slightly different annotation and a different evaluation measure.
Research on reranking falls into two categories: traditional feature engineering and, more recently, deep neural networks. The first line of work focuses on exploiting textual features, which have been studied extensively, including lexical features (e.g., n-grams), syntactic features (e.g., parse trees), and semantic features (e.g., WordNet-based ones). Several systems explore various feature extraction approaches and demonstrate the importance of feature selection for the reranking task (Filice et al., 2016; Franco-Salvador et al., 2016; Mihaylova et al., 2016). However, these methods all face the problem of feature merging, since many features may interact with each other.
Most recently, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been employed for text reranking (Wu and Lan, 2016; Qiu and Huang, 2015). Wu's team use both a convolutional neural network and a long short-term memory network (Wu and Lan, 2016) to train their model. Qiu's model (Qiu and Huang, 2015) integrates sentence modeling and semantic matching into a single architecture, which can not only capture useful information with convolutional and pooling layers, but also learn the matching metric between the question and its answer. However, these methods suffer from having many parameters, and choosing the best parameter settings is hard. We build our system on top of the framework developed by . In addition, we extract more kinds of features. To address the problem of feature merging, we try different combinations of features and choose the best one on the development set.

Data
There are 6,398 questions and 40,288 comments for Subtask A; 317 original + 3,169 related questions for Subtask B; and 317 original questions + 3,169 related questions + 31,690 comments for Subtask C.
We also used semantic vectors pretrained on the Qatar Living forum: 200-dimensional vectors, available for 472,100 words and phrases.

Method
We formulate all three tasks as classification problems.
We use various features, such as question and comment metadata, distance measures between the question and the comment, and lexical semantic vectors for the question and for the comment.

Features
We use several groups of semantic vector similarity and metadata features. For the similarity measures mentioned below, we use cosine similarity (Nguyen and Bai, 2010). Semantic Word Embeddings. We use semantic word embeddings obtained from Word2Vec models trained on different unannotated data sources, including QatarLiving and DohaNews (Abbar et al., 2016). For each piece of text, such as the comment text, the question body, and the question subject, we construct a centroid vector from the vectors of all words in that text.
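The centroid-plus-cosine computation above can be sketched as follows. The embedding table here is a tiny hypothetical stand-in for the pretrained 200-dimensional QatarLiving vectors:

```python
import numpy as np

# Toy embedding table standing in for the pretrained Word2Vec vectors
# (hypothetical values; the real system uses 200-dimensional vectors).
EMB = {
    "visa":   np.array([0.9, 0.1, 0.0]),
    "renew":  np.array([0.8, 0.2, 0.1]),
    "permit": np.array([0.7, 0.3, 0.0]),
    "car":    np.array([0.0, 0.9, 0.4]),
}

def centroid(text):
    """Average the vectors of all in-vocabulary words in the text."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(u, v):
    """Cosine similarity; 0.0 for degenerate (all-zero) vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

q = centroid("how to renew visa")
a = centroid("visa permit renew")   # topically related comment
b = centroid("car")                 # unrelated comment
assert cosine(q, a) > cosine(q, b)
```

The same centroid/cosine pair is reused for the question body, question subject, and comment text similarities described below.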

Semantic Features
We use various similarity features calculated with the centroid word vectors of the question body, the question subject, and the comment text, as well as of parts thereof: Question to answer similarity. We assume that a relevant answer should have a centroid vector close to that of the question (Min et al., 2017). We use the question body to comment text similarity and the question subject to comment text similarity.
Maximized similarity. We rank each word in the answer text by its similarity to the question body centroid vector, and we take the average similarity of the top N words (Fu and Murata, 2016). We use N = 1, 2, and 3 as features.
The assumption here is that if the average similarity of the top N most similar words is high, the answer might be relevant.
Aligned similarity. For each word in the question body, we choose the most similar word from the comment text, and we take the average of all best word-pair similarities, as suggested in (Tran et al., 2015). Dependency-tree-based word vector similarities. We obtain the dependency parse with the Stanford parser (De Marneffe and Manning, 2008), and we compute similarities between the centroid vectors of the noun phrases from the comment text and the centroid vector of the noun phrases from the question body. The assumption is that corresponding parts of the dependency trees of the question and the comment may be closer to each other than other parts.
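The aligned-similarity feature can be sketched as below; the toy embedding table is hypothetical:

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def aligned_similarity(question_words, comment_words, emb):
    """For each question word, find the most similar comment word,
    then average these best-pair similarities (Tran et al., 2015)."""
    best = [max((cosine(emb[q], emb[c]) for c in comment_words if c in emb),
                default=0.0)
            for q in question_words if q in emb]
    return sum(best) / len(best) if best else 0.0

# Hypothetical 2-d embeddings for illustration.
emb = {"visa": np.array([1.0, 0.0]),
       "permit": np.array([0.9, 0.1]),
       "car": np.array([0.0, 1.0])}
score = aligned_similarity(["visa"], ["permit", "car"], emb)
```

Out-of-vocabulary question words are skipped, and a question with no in-vocabulary words yields 0.0.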
Word clusters (WC) similarity. We cluster the word vectors from the Word2Vec vocabulary into 500 clusters (with 400 words per cluster on average) using K-means clustering (Basu et al., 2002). We then calculate the cluster similarity between the question body word clusters and the answer text word clusters. For all experiments, we use clusters obtained from a Word2Vec model trained on the QatarLiving forum with vector size 100, window size 10, minimum word frequency 5, and skip-gram 1.
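The paper does not spell out the exact cluster-similarity formula; one plausible realization, sketched below on a toy vocabulary, is the Jaccard overlap of the cluster IDs assigned to the two texts' words:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy random word vectors standing in for the Word2Vec vocabulary
# (the real system clusters the full vocabulary into 500 clusters).
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(40)]
vectors = rng.normal(size=(40, 8))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(vectors)
word2cluster = dict(zip(vocab, km.labels_))

def cluster_jaccard(words_a, words_b):
    """Hypothetical cluster-similarity measure: Jaccard overlap of
    the cluster IDs covered by each word list."""
    ca = {word2cluster[w] for w in words_a if w in word2cluster}
    cb = {word2cluster[w] for w in words_b if w in word2cluster}
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

sim = cluster_jaccard(vocab[:10], vocab[5:15])
assert 0.0 <= sim <= 1.0
```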
LDA topic similarity. We perform topic clustering using Latent Dirichlet Allocation (LDA), as implemented in the gensim toolkit (Rehurek and Sojka, 2010), on the Train1+Train2 questions and comments. We build topic models with 150 topics. For each question body and comment text, we infer the corresponding topic distribution and calculate the similarity between the two distributions.
The assumption here is that if the question and the comment share similar topics, they are more likely to be relevant to each other.
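Once the topic distributions are inferred (the system trains a 150-topic gensim LDA model; here we assume the distributions are already given and use toy 5-topic vectors), the similarity step is just:

```python
import numpy as np

def topic_similarity(q_topics, c_topics):
    """Cosine similarity between two LDA topic distributions.
    Assumes the distributions were already inferred by the topic model."""
    q, c = np.asarray(q_topics), np.asarray(c_topics)
    denom = np.linalg.norm(q) * np.linalg.norm(c)
    return float(np.dot(q, c) / denom) if denom else 0.0

# Toy 5-topic distributions for illustration (the real model uses 150).
question  = [0.70, 0.10, 0.10, 0.05, 0.05]
on_topic  = [0.60, 0.20, 0.10, 0.05, 0.05]
off_topic = [0.05, 0.05, 0.10, 0.20, 0.60]
assert topic_similarity(question, on_topic) > topic_similarity(question, off_topic)
```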
Together, the semantic features above capture the similarity between the question and the comment, which is central to the classification step that follows.

Metadata-based Features
Metadata-based features provide clues about the social aspects of the community (Kıcıman, 2010). Thus, in addition to the semantic features described above, we also use some common-sense metadata features: Answer contains a question mark. If the comment contains a question mark, it may itself be another question, which might indicate a bad answer (Katzman et al., 2017).
The presence and the number of links in the question and in the comment. We count both inbound and outbound links. Our hypothesis is that the presence of a reference to another resource is indicative of a relevant comment (Newton et al., 2017).
Answer length.
The assumption here is that longer answers could bring more useful detail (Yang et al., 2017).
Question length. If the question is longer, it may be clearer, which may help users give a more relevant answer (Figueroa, 2017).
Question to comment length ratio. If the question is long and the answer is short, the answer may be less relevant.
The comment is written by the author of the question. If the answer is posted by the same user who posted the question and it is relevant, why did he/she ask the question in the first place?
Answer rank in the thread. Earlier answers could be posted by users who visit the forum more often, and they may have read more similar questions and answers. Moreover, discussion in the forum tends to diverge from the question over time.
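The metadata features listed above can be computed with a few lines of plain Python. The field names below are illustrative, not the system's actual schema:

```python
import re

def metadata_features(question, comment, q_author, c_author, rank):
    """Sketch of the metadata features: question mark, link counts,
    lengths, length ratio, same-author flag, and answer rank."""
    links = lambda t: len(re.findall(r"https?://\S+", t))
    q_len, c_len = len(question.split()), len(comment.split())
    return {
        "has_question_mark": int("?" in comment),
        "q_links": links(question),
        "c_links": links(comment),
        "answer_length": c_len,
        "question_length": q_len,
        "len_ratio": q_len / max(c_len, 1),
        "same_author": int(q_author == c_author),
        "answer_rank": rank,
    }

feats = metadata_features("How do I renew my visa?",
                          "See http://example.com for the form.",
                          "alice", "bob", 3)
assert feats["c_links"] == 1 and feats["same_author"] == 0
```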

Other-extra Features
Some features belong neither to the semantic nor to the metadata-based group; we call them extra features. They are also useful for the reranking task.
Special symbols. Whether the comment text contains smileys, e-mail addresses, phone numbers, only laughter, "thank you" phrases, personal opinions, or expressions of disagreement is an important feature (Toba et al., 2014).
Numbers of particular parts of speech. We extract statistics about the number of verbs, nouns, pronouns, and adjectives in the question and in the comment, as well as the number of numerals.
Numbers of misspelled words. We compute spelling-related features, including the number of misspelled words within edit distance 2 of a word in our vocabulary and the number of offensive words from a predefined list (Agichtein et al., 2008).
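The misspelling count can be sketched with a standard Levenshtein distance against a toy vocabulary (the real system uses its full vocabulary and a predefined offensive-word list, omitted here):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def count_misspelled(words, vocab):
    """Count words not in the vocabulary that lie within edit
    distance 2 of some vocabulary word."""
    return sum(1 for w in words
               if w not in vocab
               and any(edit_distance(w, v) <= 2 for v in vocab))

vocab = {"visa", "renew", "permit"}
assert count_misspelled(["vissa", "renwe", "car"], vocab) == 2
```

"car" is excluded because it is farther than edit distance 2 from every vocabulary word, so it counts as out-of-vocabulary rather than misspelled.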

Classifier
For each question-comment pair, we first extract the features described above from the question body and the comment text. We then concatenate the extracted features into a single feature vector and normalize it so that the values are mapped to the interval [-1, 1]. Finally, we feed the vector into a classifier. In our experiments, we use an L2-regularized logistic regression classifier (Buitinck et al., 2013) and an SVM classifier (Zweigenbaum and Lavergne, 2016). For the logistic regression classifier, we tune the C (cost) parameter (Aono et al., 2016) and take the value that yields the best accuracy under 10-fold cross-validation on the training set. For the SVM classifier, we try different kernels (Moreno et al., 2003) and achieve the best results with the RBF kernel. We report only the better of the two classifiers in the next section. We use binary classification, Good vs. Bad, where Bad merges the original Bad and PotentiallyUseful labels. The output for each test example is a label, Good or Bad, together with the probability of being Good in the [0, 1] range. We then use this probability as the relevance score for each comment in the question thread.
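The pipeline above (normalize to [-1, 1], tune C by cross-validation, rank by P(Good)) can be sketched with scikit-learn on synthetic features; the data, C grid, and 5-fold setting here are illustrative (the paper uses 10 folds):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature vectors standing in for the concatenated
# semantic/metadata features: 2,000 pairs, 20 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = Good, 0 = Bad

# Normalize feature values to [-1, 1], as in the paper.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Tune C by cross-validated accuracy (5 folds here to keep the sketch fast).
best_c = max((0.01, 0.1, 1.0, 10.0),
             key=lambda c: cross_val_score(
                 LogisticRegression(C=c, max_iter=1000), X, y, cv=5).mean())

clf = LogisticRegression(C=best_c, max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # P(Good): used as the relevance score
```

Comments in a thread are then sorted by `scores` in descending order to produce the ranking.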

Experiments and Evaluation
This section presents the evaluation on SemEval-2017 Task 3 CQA data (Nakov et al., 2017). Note that for our system, EICA, we did not use data from SemEval-2015 CQA. The best result in each partition and subtask is highlighted; all percentage comparisons use absolute values. Table 1 shows the results for Subtask A (question-comment similarity ranking). In terms of ranking measures, our system outperforms both the random and the search engine baselines, with a MAP improvement of 18.15% over the search engine results; it ranks second on the SemEval-2016 test set. The performance of our approach is similarly strong on Subtask B (question-question similarity ranking). As shown in Table 3, on the 2016 test set the improvements in MAP and AvgRec are 1.59% and 2.37%, respectively, over the search engine baseline. In this case, the performance gains are somewhat smaller, and our system again ranks second on SemEval-2016.
For Subtask C, the results are shown in Table 5. On the 2016 test set, the improvements in MAP and AvgRec are 8.21% and 0.93%, respectively, over the search engine baseline.

SemEval-2017 Task 3 Results
Table 2 shows the results for Subtask A (question-comment similarity ranking). In terms of ranking measures, our system again outperforms both the random and the search engine baselines. On the 2017 test set (Nakov et al., 2017), we observe a MAP improvement of 13.92% over the search engine results.
As on Subtask A, our approach also performs well on Subtask B (question-question similarity ranking). As shown in Table 4, on the 2017 test set (Nakov et al., 2017) we obtain a MAP of 41.11 and an AvgRec of 77.45.
For Subtask C, the results are shown in Table 6. On the 2017 test set (Nakov et al., 2017), the improvements in MAP and AvgRec are 4.3% and 2.72%, respectively, over the search engine baseline.
The results on both SemEval-2016 and SemEval-2017 (Nakov et al., 2017) show that the features we use are quite effective for ranking comments with respect to a given question (Subtasks A and C), but they do not achieve comparable results when ranking questions with respect to other questions (Subtask B).

Conclusion
We have described our system for SemEval-2017 Task 3 on Community Question Answering. Our approach relies on semantic and metadata-based features. In the main Subtask C, our primary submission was ranked fourth, with a MAP of 13.48 and an accuracy of 97.08, the highest accuracy among the submissions. In Subtask A, our primary submission was sixth, with a MAP of 86.53 and an accuracy of 61.64.
In future work, we plan to use our best feature combinations in a deep learning architecture, as in Qiu's system (Qiu and Huang, 2015), which outperforms other methods on two matching tasks. We also want to use information from entire threads (Joty et al., 2015) to make better predictions. How to combine these efficiently in one system is an interesting research question.