SUper Team at SemEval-2016 Task 3: Building a Feature-Rich System for Community Question Answering

We present the system we built for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.


Introduction
SemEval-2016 Task 3 on Community Question Answering 1  aims to solve a reallife application problem. The main subtask C (Question-External Comment Similarity) asks to find an answer in the forum that will be appropriate as a response to a newly posted question. This is achieved by retrieving similar questions and ranking their answers with respect to the new question. Two additional supporting subtasks are defined: Subtask A (Question-Comment Similarity): Given a question from a question-comment thread, rank the comments within the thread based on their relevance with respect to the question.
Subtask B (Question-Question Similarity): Given a new question, re-rank the similar questions retrieved by a search engine with respect to that question. 1 http://alt.qcri.org/semeval2016/task3/

Related Work
We build our preprocessing and feature extraction pipeline based on the system of Zamanov et al. (2015), which was developed by a subset of our 2016 team for SemEval-2015 Task 3 on Answer Selection in Community Question Answering . The task in 2015 was to classify comments in a thread as relevant, potentially useful, or bad with respect to the thread question. This year's Community Question Answering subtask A is similar to subtask A in 2015, but now it is a ranking task, asking to rank the answers in a thread based on their relevance with respect to the thread's question. Given this similarity, most of the techniques used by participants in the 2015 subtask A are potentially valuable for this year's subtask A as well. Below we mention just the few most relevant among them. In their system, Belinkov et al. (2015 used vectors of the question and of the comment, metadata features, and text-based similarities. Nicosia et al. (2015) used similarity measures, URLs in the comment text and statistics about the user profile: number of good, bad, and potentially useful comments. Similarly, we use the number of posts by the same user in the thread, the ID of the question's author, topic model-based feature, special words, etc.
Determing the overall sentiment of the question can also be useful, and it was used in 2015 (Nicosia et al., 2015). One way to do it is to build a sufficiently large question taxonomy as described in (Li and Roth, 2006). This may help determine the quality of the answer, but it requires significant ef-forts in order to build such a taxonomy.

Data
For training and testing, we used data provided by the SemEval-2016 Task 3 organizers. The datasets consist of 6,398 questions and 40,288 comments for Subtask A, 317 original + 3,169 related questions for Subtask B, and 317 original questions + 3,169 related questions + 31,690 comments for Subtask C.
For subtask A, the comments in a question-answer thread are annotated as Good, PotentiallyUseful and Bad. A good ranking is one that ranks all Good comments above PotentiallyUseful and Bad ones (without distinguishing between the latter two).
For subtask B, the potentially relevant questions are annotated as PerfectMatch, Relevant and Irrelevant with respect to the original question. A good ranking is one where the PerfectMatch and the Relevant questions (without distinguishing between them) are both ranked above the Irrelevant ones.
We also used semantic vectors (Mikolov et al., 2013a) pretrained on Google News data: 300-dimensional vectors, available for three million words and phrases.
For all subtasks, we further trained semantic vectors using Gensim (Řehůřek and Sojka, 2010) on 200,000 questions and two million comments from the Qatar Living Forum, 2 , which were provided by the task organizers.
Finally, using this same data, and following (Mihaylov et al., 2015a;Mihaylov et al., 2015b), we scraped information about the users from the forum and we extracted for each of them the time in the forum, the active period, the number of questions, the comments in the forum, etc.

Method
We build our system on top of the framework developed by our colleagues (Zamanov et al., 2015). In particular, we approach the task as a classification problem similarly to the approach we took for Se-mEval 2015 Task 3 . However, unlike 2015, this year we have a ranking problem for all subtasks, e.g., for subtask A we have to rank the comments depending on how likely the classifier thinks they are to be Good vs. them being Bad or PotentiallyUseful.
We use variety of features like question and comment metadata; question and comment lexical features; distance measures between the question and the comment; text readability measures applied to the question and to the comment; lexical semantics vectors for the question and for the comment; features modeling the likelihood of a user being a troll.
These features proved quite useful for ranking comments with respect to a given question (Subtask A and C), but they did not achieve as high results when ranking questions with respect to other questions (Subtask B).

Metadata Features
These features are based on surface observations of the thread's structure and properties. From the comments' attributes we extract whether the comment is written by the author of the question. We further extract the comment's position in the thread, and the ID of the author of the comment. Next, we tokenize the text and we calculate the ratio of the comment length and of the question length (in terms of number of tokens). In terms of the threads we measure, the number of comments from the same user in a particular thread and the order in which they are written by the user, i.e., first, second, etc. comment by the same user. In terms of the whole QatarLiving forum, we calculate the number of questions in a category.
Another family of metadata features explores the presence and the number of links in the question and in the comment. We counted both inbound (i.e., to qatarliving.com) and outbound links. Our hypothesis was that the presence of a reference to another resource is indicative of a relevant comment. Investigations on the training set showed that a relevant comment was more likely to contain such a link. Unfortunately, less than 10% of the comments had links, and ultimately these features did not have a very high impact on the results.

Lexical Features
These features represent the lexical content of questions and comments. They are obtained with the help of the GATE (Cunningham et al., 2011;Cunningham et al., 2002) preprocessing pipeline with some hand-crafted rules and various statistics.
We use token-, NP-, and sentence-based features as well as features based on the following entities: Person, Location, Organisation and Address. The latter ones are used to mark whether the comment contains an answer to a wh-question (where, who, what, etc.), e.g., if the question contains the word "where". We further add a boolean feature modeling whether the comment contains a Location or an Address. We tagged the named entities using the highquality named entity recognition pipeline of Ontotext. 3 We further extracted statistics about the number of verbs, nouns, pronouns, and adjectives in the question and in the comment, as well as the number of question marks in the comments, and the number of question words in the question and in the comment.
Another group of lexical features are extracted from the comment text only and show whether it contains smileys, currency units, e-mails, phone numbers, only laughter, "thank you" phrases, personal opinions, or disagreement.
Other lexical features relate to spelling and include number of misspelled words that are within edit distance of 1 from a word in our vocabulary and number of offensive words from a predefined list.
We also borrow a dictionary from the PMI-cool system (Balchev et al., 2016), which is based on unigram and bigram occurrences across the classes. We use it to compute the Pointwise Mutual Information (PMI) between a dictionary entry and a class. Based on it, we add features that sum the PMI for all tokens in a given comment. This family of features are weighed most heavily by the classifier.
We further computed lexical similarity between a question and a comment using SimHash (Sadowski and Levin, 2007), which is a near-duplicate similarity measure but it did not help much.
We also apply a set of statistical scores to measure the level of readability and complexity of the text (Aluisio et al., 2010). The standard readability measures include Automated Readability Index, Coleman-Liau Index, Flesch Reading Ease, Gunning Fog Index, Flesch-Kincaid Grade Level, LIX, SMOG grade. We also employ statistics about the average number of words per sentence in the comment or question, and type-to-token ratios.

Semantic Features
Our semantic features try to capture the proximity between the meanings encoded in the word sequences of the questions and of the answers.
One of the semantic features uses Mallet topic modelling (McCallum, 2002). We build a topic model with 100 topics. Then we measure the cosine distance between the topics in the first text and the topics in the second text, i.e., in the question and in the answer (Subtasks A and C), or in the original and in the related question (Subtask B).
We also used semantic vectors trained with word2vec (Mikolov et al., 2013b). We performed experiments to select the best vectors for the task. We tried pre-trained vectors from Google News. We further trained vectors from the unannotated data from the QatarLiving forum. We used the latter vectors in our system as they yielded better results.
We experimented with training vectors of different sizes and different minimal word counts. Because of the many common misspelled words, the smaller word count yields better results. We tried different tokenizations for the words. To capture the specifics of the forum language, we added identifiers for numbers, smileys, URLs and images. For each question-comment pair, we calculated the centroid vectors of the question and of the comment and we used the components of the resulting vectors as features for the classifier. We used Gensim (Řehůřek and Sojka, 2010) for building the vectors.

User Features
We downloaded and used characteristics about the users from the QatarLiving forum, such as number of questions, comments, classifieds; time since registration, time since last activity in the forum, time of the day in which the user was active, etc. We also added as user characteristics the number of good and bad comments from the annotated training data. However, the user features did not improve the results. We noticed that over time, the number of both good and bad comments for a user in the forum grew, and the number of good and bad comments for most of the users was similar.
We also used troll user features, e.g., number of mentions of the user as troll and troll behavior characteristics as described in (Mihaylov et al., 2015a;Mihaylov et al., 2015b;Mihaylov and Nakov, 2016a).

Credibility Features
We further added some of the credibility features described in (Castillo et al., 2013). We trained a prediction model on tweets from the dataset described in that paper. We used linear Support Vector Machines (linear SVMs) with Stochastic Gradient Descent (SGD) and L2 regularization. We used an off-the-shelf implementation of SVM, provided in the Apache Spark library (Apache Software Foundation Team, 2015).
For each answer we extracted the following features: length of the answer (characters); does the answer contain special punctuation, like question marks, exclamation marks, etc.; is there an emoticon in the text; is there a first person singular (I, me, my, mine, we) or plural pronoun (our, ours, we, us); is there a third person singular (he, she, it, his, hers, him, her) or plural pronoun (they, them, their); does the answer contain a user mention (@user); does the text contain URLs in it.
Based on these features we trained an SVM model to classify items as credible or not. For our submis-sion, we used all the features used to train the credibility module as well as the predicted label and the probability it is predicted with.

Classifier
Using the above features, we formed vectors associated with each question-answer pair. Those vectors are a concatenation of the extracted features, including the centroid of the semantic vectors for the question and for the comment.
We then used an SVM classifier as implemented in LibSVM (Chang and Lin, 2011) for classification.
We experimented with different kernels (Hsu et al., 2003), and we achieved the best results with the RBF kernel, which we use to train the model for our submissions. The ranking score for a question-comment pair in Subtask A is the calculated probability of the pair to be classified as Good.
For Subtask C, we used the same approach as in Subtask A. We first ranked the comments with respect to the relevant question. For the final ranking, we multiplied the probability of the pair "relevant question -comment" being Good by the reciprocal rank of the related question as given by Google.
For subtask B, we passed to the classifier characteristics of the pair "original question -relevant   question". For ranking, we used the probability of the pair to be classified as Good.

Experiments and Evaluation
We grouped our features in several groups and we ran experiments by excluding some of them in order to identify the most important types of features.
In particular, we used the LibSVM fselect script for feature selection. We achieved the best results by combining the features with the semantic vectors of the question and of the comment trained on Qatar-Living data.
In Table 1, we present the results for Semeval-2016 Task 3, Subtask A using all features, as well as when excluding individual feature groups. Our primary submission includes the top-rated features and semantic vectors. We selected our primary submission as it achieved higher score on Dev than Con-trastive1 and Contrastive2.
Compared to Contrastive-1, our Primary has some additional features: number of user comments in the thread, cosine between the comment text with question subject and category, more locations and organizations. Our Contrastive-1 submission included the top-rated features and semantic vectors, and our Contrastive-2 submission included the same features as our Primary submission, but used Dev-2016 as additional training set.
In  Tables 1 and 2 have shown that the most important feature groups are the metadata characteristics, the distance measures between the question and the comment, and the semantic vectors. Other features that fselect scored highly are the credibility score, text readability measures and the number of tokens of some parts of speech in the comment text (namely, number of adjectives and nouns). The least useful features are statistics about the forum users and characteristics of the question: question length and number of tokens of different parts of speech in the question text. In all above reported results, we used vectors for which we achieved the best results on the development dataset.
In Table 4, we present the results from experiments with semantic vectors. We experimented with pre-trained vectors from Google News and we also trained vectors with word2vec on the unannotated Qatar Living forum data. When training vectors on Qatar Living data, we experimented with different vector sizes and minimum word frequencies. We also added the following entities as words: numbers, images, URLs, smileys (referred to in the table as "special symbols"). We achieved best results with vectors from QatarLiving, vector size 100, and minimum word frequency of 5. Including special symbols as words also improved the results. We experimented with calculating the centroids of the question and of the comment using specific parts of speech only; however, ultimately we found that using all words from all parts of speech worked best.
For Subtask B, we used a similar approach as for Subtask A: we passed to the classifier the semantic vectors of the original and of the related question, some metadata and distance features. However, we could not experiment much with this subtask, and thus our results are not as strong, as Table 3 shows.

Conclusion
We have presented the system developed by our team for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.
In future work, we would like to experiment with new, interesting features, e.g., based on various word embeddings as in the SemanticZ system (Mihaylov and Nakov, 2016b). We also want to use our features in a deep learning architecture, e.g., as in the MTE-NN system (Guzmán et al., 2016a;Guzmán et al., 2016b), which borrowed an entire neural network framework and architecture from previous work on machine translation evaluation (Guzmán et al., 2015).
We further plan to use information from entire threads to make better predictions, as using threadlevel information for answer classification has already been shown useful for SemEval-2015 Task 3, subtask A, e.g., by using features modeling the thread structure and dialogue (Nicosia et al., 2015;Barrón-Cedeño et al., 2015), or by applying threadlevel inference using the predictions of local classifiers Joty et al., 2016). How to use such models efficiently in the ranking setup of 2016 is an interesting research question.
Finally, we would like to address subtask C in a more solid way, making good use of the data, the gold annotations, the features, the models, and the predictions for subtasks A and B.