Thread-Level Information for Comment Classification in Community Question Answering

Community Question Answering (cQA) is a new application of QA in social contexts (e.g., fora). It presents interesting new challenges and research directions, e.g., exploiting the dependencies between the different comments of a thread to select the best answer for a given question. In this paper, we explore two ways of modeling such dependencies: (i) by designing specific features that look globally at the thread; and (ii) by applying structured prediction models. We trained and evaluated our models on data from SemEval-2015 Task 3 on Answer Selection in cQA. Our experiments show that: (i) the thread-level features consistently improve the performance for a variety of machine learning models, yielding state-of-the-art results; and (ii) sequential dependencies between the answer labels captured by structured prediction models are not enough to improve the results, indicating that more information is needed in the joint model.


Introduction
Community Question Answering (cQA) is an evolution of the typical QA setting, placed in a Web forum context where user interaction is enabled and there are few restrictions on who can post and who can answer a question. This is a powerful mechanism, which allows users to freely ask questions and expect some good, honest answers.
Unfortunately, a user has to go through all possible answers and to make sense of them. It is often the case that many answers are only loosely related to the actual question, and some even change the topic. This is especially common for long threads where, as the thread progresses, users start talking to each other, instead of trying to answer the initial question. This is a real problem, as a question can have hundreds of answers, the vast majority of which would not satisfy the users' information needs. Thus, finding the desired information in a long list of answers might be very time-consuming.
The problem of selecting the relevant text passages (i.e., those containing good answers) has been tackled in QA research, either for non-factoid QA or for passage reranking. Usually, automatic classifiers are applied to the answer passages retrieved by a search engine to derive a relative order; see (Radlinski and Joachims, 2005; Jeon et al., 2005; Shen and Lapata, 2007; Moschitti et al., 2007; Surdeanu et al., 2008; Heilman and Smith, 2010; Wang and Manning, 2010; Severyn and Moschitti, 2012; Yao et al., 2013) for details.
To the best of our knowledge, there is no QA work that effectively identifies good answers based on the selection of the other answers retrieved for a question. This is mainly due to the loose dependencies between the different answer passages in standard QA. In contrast, we postulate that in a cQA setting, the answers from different users in a common thread are strongly interconnected and, thus, a joint answer selection model should be adopted to achieve higher accuracy.
To test our hypothesis about the usefulness of thread-level information, we used a publicly available dataset, recently developed for SemEval-2015 Task 3. Subtask A in that challenge asks participants to identify the posts in the answer thread that answer the question well vs. those that can be potentially useful to the user vs. those that are just bad or useless.
We model the thread-level dependencies in two different ways: (i) by designing specific features that are able to capture the dependencies between the answers in the same thread; and (ii) by exploiting the sequential organization of the output labels for the complete thread. For the latter, we used the usual extensions of logistic regression and SVM to linear-chain models, namely CRF and SVM^hmm.
The results clearly show that the thread-level features are important, providing consistent improvement for all our learning models. In contrast, the linear-chain models fail to exploit the sequential dependencies between nearby answer labels to improve the results significantly: although the labels from the neighboring answers can affect the label of the current answer, this dependency is too loose to have impact on the selection accuracy. In other words, labels should be used together with answers' content to account for stronger and more effective dependencies.

The Task
We use the CQA-QL corpus, which was used for Subtask A of SemEval-2015 Task 3 on Answer Selection in cQA. The corpus contains data from the Qatar Living forum, and is publicly available on the task's website. The dataset consists of questions and a list of the answers for each question, i.e., the question-answer thread. Each question, and also each answer, consists of a short title and a more detailed description. Moreover, there is some meta information associated with both, e.g., ID of the user asking/answering the question, timestamp, question category, etc.
The task asks to determine for each answer in the thread whether it is good, bad, or potentially useful. A simplified example is shown in Figure 1, where answers 2 and 4 are good, answer 1 is potentially useful, and answer 3 is bad.
Below, we start with the original definition of Subtask A, as described above. Then, we switch to a binary classification setting (i.e., identifying good vs. bad answers), which is much closer to a real cQA application (see Section 4.3).

Basic and Thread-Level Features
Section 3.1 summarizes the basic features we used to implement the baseline systems. More importantly, Section 3.2 describes the set of thread-level features we designed in order to test our working hypothesis. Below we use the following notation: q is a question posted by user u_q, c is a comment, and C is the comment thread.
We designed a set of heuristic features that might suggest whether c is good or not. Forty-four Boolean features express whether c (i) includes URLs or emails (2 feats.); (ii) contains the words "yes", "sure", "no", "can", "neither", "okay", and "sorry", as well as the symbols '?' and '@' (9 feats.); (iii) starts with "yes" (1 feat.); (iv) includes a sequence of three or more repeated characters or a word longer than fifteen characters (2 feats.); (v) belongs to one of the categories of the forum (Socialising, Life in Qatar, etc.) (26 feats.); and (vi) has been posted by u_q, in which case the comment can include a question (i.e., contain a question mark), an acknowledgment (e.g., contain thank*, acknowl*), or neither (4 feats.). An extra feature captures the length of c (as longer good comments usually contain detailed information to answer a question).
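As an illustration of the heuristics above, the following is a minimal sketch of how a subset of them could be extracted from the text of a comment. It is not our actual implementation: the function name, the regular expressions, and the choice to measure length in whitespace-separated tokens are assumptions made for this example, and the forum-category and same-author features are omitted because they require thread metadata.

```python
import re

# Cue words listed in the text; all regular expressions below are assumptions.
CUE_WORDS = ["yes", "sure", "no", "can", "neither", "okay", "sorry"]
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # same character three or more times in a row


def basic_features(comment: str) -> dict:
    """Return a subset of the Boolean heuristics plus the length feature."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    feats = {
        "has_url": bool(URL_RE.search(comment)),
        "has_email": bool(EMAIL_RE.search(comment)),
        "starts_with_yes": bool(tokens) and tokens[0] == "yes",
        "has_repeated_chars": bool(REPEAT_RE.search(comment)),
        "has_long_word": any(len(w) > 15 for w in comment.split()),
        "has_question_mark": "?" in comment,
        "has_at_sign": "@" in comment,
        # Assumption: length is counted in whitespace-separated tokens.
        "length": len(comment.split()),
    }
    for word in CUE_WORDS:
        feats[f"has_{word}"] = word in tokens
    return feats
```

For example, `basic_features("Yes, sure!!! Check www.example.com")` fires the URL, starts-with-yes, and repeated-character features, while the e-mail feature stays off.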

Thread-Level Global Features
Comments are organized sequentially according to the timeline of the comment thread. Our first four features indicate whether c appears in the proximity of a comment by u_q.
Table 1: Macro-averaged precision, recall, F1-measure, and accuracy on the multi-class (good, bad, potential) setting on the official SemEval-2015 Task 3 test set. The top-2 systems are included for comparison. QCRI refers to our official results, using an older version of our system.
The assumption is that an acknowledgment or further questions by u_q in the thread could signal a good answer. More specifically, they test if among the comments following c there is one by u_q (i) containing an acknowledgment, (ii) not containing an acknowledgment, (iii) containing a question, and (iv) if among the comments preceding c there is one by u_q containing a question. The value of these four features (a propagation of the information captured by some of the heuristics described in Section 3.1) depends on the distance k, in terms of the number of comments, between c and the closest comment by u_q: the closer c is to that comment, the higher the value assigned to the feature. We also try to model potential dialogues, which in the end tend to consist of bad comments, by identifying interlacing comments between two users. Our dialogue features identify such conversation chains; comments by other users can appear in between the nodes of this "pseudo-conversation" chain. We consider three features: whether a comment is at the beginning, in the middle, or at the end of such a chain. Three more features are used for those cases in which u_q is one of the participants in these pseudo-conversations.
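The four proximity features can be sketched as follows. This is only an illustration under stated assumptions: the mapping from the distance k to a feature value is not spelled out in the text above, so the linear `decay` function and its `horizon` parameter are placeholders, as are the function names and the acknowledgment pattern (built from the thank*/acknowl* cues).

```python
import re

ACK_RE = re.compile(r"\b(thank|acknowl)\w*", re.IGNORECASE)


def decay(k: int, horizon: int = 10) -> float:
    """Hypothetical mapping: 1.0 for an adjacent comment, falling linearly to 0."""
    return max(0.0, 1.0 - (k - 1) / horizon)


def asker_proximity_features(thread, i, asker):
    """thread: chronological list of (author, text); i: index of comment c."""
    feats = {"ack_after": 0.0, "no_ack_after": 0.0,
             "question_after": 0.0, "question_before": 0.0}
    for j, (author, text) in enumerate(thread):
        if author != asker or j == i:
            continue
        value = decay(abs(j - i))  # closer comments by the asker weigh more
        if j > i:
            key = "ack_after" if ACK_RE.search(text) else "no_ack_after"
            feats[key] = max(feats[key], value)
            if "?" in text:
                feats["question_after"] = max(feats["question_after"], value)
        elif "?" in text:
            feats["question_before"] = max(feats["question_before"], value)
    return feats
```

Taking the maximum over all matching comments by u_q is equivalent to using the closest one, since the placeholder decay is monotonically decreasing in k.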
Another Boolean feature for a comment c written by user u_i is set to true if u_i wrote more than one comment in the current thread. Three more features identify the first, the middle, and the last comments by u_i. One extra feature counts the total number of comments written by u_i in the thread up to that moment.
Table 2: Performance of the binary (good vs. bad) classifiers on the official SemEval-2015 Task 3 test dataset. Precision, recall, F1-measure, and accuracy are calculated at the comment level, while F1,ta and A_ta are averaged at the thread level.
Moreover, we empirically observed that the likelihood of some comment being good decreases with its position in the thread. Therefore, we also included another real-valued feature: max(20, i)/20, where i represents the position of the comment in the thread.
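The per-author features described above (repeat commenter; first, middle, or last comment by that author; running count) can be derived from the chronological list of comment authors alone. The sketch below is illustrative: the function and feature names are hypothetical, and it assumes that the first/middle/last indicators fire only for authors with more than one comment in the thread, which the text does not state explicitly.

```python
def author_features(authors, i):
    """authors: chronological list of comment authors; i: index of the comment."""
    user = authors[i]
    positions = [j for j, a in enumerate(authors) if a == user]
    multiple = len(positions) > 1
    return {
        "multiple_comments": multiple,
        # Assumption: positional indicators apply only to repeat commenters.
        "is_first_by_user": multiple and i == positions[0],
        "is_middle_by_user": multiple and positions[0] < i < positions[-1],
        "is_last_by_user": multiple and i == positions[-1],
        # Number of comments by this author up to and including comment i.
        "comments_so_far": sum(1 for j in positions if j <= i),
    }
```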
Finally, we perform a pseudo-ranking of the comments. The relevance of c is computed as its similarity to q (using word n-grams), normalized by the maximum similarity among all the comments in the thread. The resulting relative scores are mapped into three binary features depending on the range they fall into: [0, 0.2], (0.2, 0.8), or [0.8, 1] (the intervals resemble the three-class setting and were set empirically on the training data).
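A minimal sketch of the pseudo-ranking features follows. The similarity measure is an assumption: the text only says that word n-grams are used, so Jaccard overlap of unigrams and bigrams stands in here, and all function and feature names are hypothetical. The normalization by the thread maximum and the three intervals follow the description above.

```python
def ngrams(text, n_max=2):
    """Set of word n-grams (n = 1..n_max) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Assumed similarity measure; the text does not name a specific one."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pseudo_ranking_features(question, comments):
    q = ngrams(question)
    sims = [jaccard(q, ngrams(c)) for c in comments]
    top = max(sims, default=0.0)
    feats = []
    for s in sims:
        rel = s / top if top > 0 else 0.0  # relative score in [0, 1]
        feats.append({"low": rel <= 0.2,
                      "mid": 0.2 < rel < 0.8,
                      "high": rel >= 0.8})
    return feats
```

Because the scores are normalized per thread, the most similar comment always falls into the top interval whenever at least one comment overlaps with the question.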

Experiments
Below we first describe the data we used, then we introduce the experimental setup, and finally we present and discuss the results of our experiments.

Data
The original CQA-QL corpus consists of 3,229 questions: 2,600 for training, 300 for development, and 329 for testing. The total number of comments is 20,162, with an average of 6.24 comments per question. The class labels for the comments are distributed as follows: 9,941 good (49.31%), 2,013 potential (9.98%), and 8,208 bad (40.71%) comments.
Since a typical answer selection setting only considers correct and incorrect answers, we also experiment with potential labelled as bad.
Table 3: Precision, recall, F1, and accuracy computed at the comment level; F1,ta and A_ta are averaged at the thread level. Precision, recall, F1, and F1,ta are computed with respect to the good class on 5-fold cross-validation (mean ± standard deviation).

Experimental Setup
Our local classifiers are support vector machines (SVM) with C = 1 (Joachims, 1999), logistic regression with a Gaussian prior with variance 10, and logistic ordinal regression (McCullagh, 1980). In order to capture long-range sequential dependencies, we use a second-order SVM^hmm (Yu and Joachims, 2008) (with C = 500 and epsilon = 0.01) and a second-order linear-chain CRF, which considers dependencies between three neighboring labels in a sequence (Lafferty et al., 2001; Cuong et al., 2014). In CRF, we perform two kinds of inference to find the most probable labels for the comments in a sequence. (i) We compute the maximum a posteriori (MAP) sequence, i.e., the jointly most probable sequence of labels, using the Viterbi algorithm. Specifically, it computes $y^{*} = \arg\max_{y_{1:T}} P(y_{1:T} \mid x_{1:T})$, where $T$ is the number of comments in the thread. (ii) We use the forward-backward algorithm to find the labels by maximizing the individual posterior marginals (MPM). More formally, we compute $\hat{y} = \left[\arg\max_{y_1} P(y_1 \mid x_{1:T}), \ldots, \arg\max_{y_T} P(y_T \mid x_{1:T})\right]$. While MAP yields a globally consistent sequence of labels, MPM can be more robust in many cases; see (Murphy, 2012, p. 613) for details. CRF also uses a Gaussian prior with variance 10.
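To make the difference between MAP and MPM decoding concrete, the following sketch implements both on a toy chain. It deliberately simplifies the setting described above: it uses a first-order chain with hand-set probabilities rather than a trained second-order CRF, and the function names and the toy numbers are assumptions chosen only so that the two decoders disagree.

```python
def viterbi(init, trans, emit):
    """MAP decoding: the jointly most probable label sequence."""
    T, K = len(emit), len(init)
    delta = [init[y] * emit[0][y] for y in range(K)]
    back = []
    for t in range(1, T):
        prev = delta
        # Best predecessor label for each current label y.
        ptr = [max(range(K), key=lambda yp: prev[yp] * trans[yp][y])
               for y in range(K)]
        delta = [prev[ptr[y]] * trans[ptr[y]][y] * emit[t][y] for y in range(K)]
        back.append(ptr)
    y = max(range(K), key=lambda k: delta[k])
    path = [y]
    for ptr in reversed(back):  # follow back-pointers to recover the sequence
        y = ptr[y]
        path.append(y)
    return path[::-1]


def posterior_marginals(init, trans, emit):
    """Forward-backward: P(y_t | x_{1:T}) for every position t."""
    T, K = len(emit), len(init)
    alpha = [[init[y] * emit[0][y] for y in range(K)]]
    for t in range(1, T):
        alpha.append([emit[t][y] * sum(alpha[-1][yp] * trans[yp][y]
                                       for yp in range(K)) for y in range(K)])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[y][yn] * emit[t + 1][yn] * beta[t + 1][yn]
                       for yn in range(K)) for y in range(K)]
    z = sum(alpha[-1])  # normalization constant
    return [[alpha[t][y] * beta[t][y] / z for y in range(K)] for t in range(T)]


def mpm(init, trans, emit):
    """MPM decoding: maximize each posterior marginal independently."""
    return [max(range(len(init)), key=lambda y: m[y])
            for m in posterior_marginals(init, trans, emit)]


# Toy thread with two comments; label 0 = bad, 1 = good (made-up numbers).
init = [0.4, 0.6]
trans = [[0.0, 1.0], [0.5, 0.5]]
emit = [[1.0, 1.0], [1.0, 1.0]]
map_labels = viterbi(init, trans, emit)  # [0, 1], joint probability 0.4
mpm_labels = mpm(init, trans, emit)      # [1, 1], joint probability 0.3
```

In this toy case the MPM labels are individually the most probable ones (marginals 0.6 and 0.7 for good), yet the sequence they form is less probable as a whole than the MAP sequence.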

Experiment results
In order to compare the quality of our features to the existing state of the art, we performed a first experiment aligned to the multi-class setting of the SemEval-2015 Task 3 competition. Table 1 shows our results on the official test dataset.
As in the competition, the results are macro-averaged at class level. The results of the top 3 systems are reported for comparison: JAIST (Tran et al., 2015), HITSZ (Hou et al., 2015) and QCRI (Nicosia et al., 2015), where the latter refers to our old system that we used for the competition. The two main observations are (i) using thread-level features helps significantly; and (ii) the ordinal regression model, which captures the idea that potential lies between good and bad, achieves results at least as good as those of the top system at SemEval, namely JAIST.
For the remaining experiments, we reduce the multi-class problem to a binary one (cf. Section 2). Table 2 shows the results obtained on the official test dataset. Note that ordinal regression is not applicable in this binary setting. The F1 values for the baseline features suggest that using the labels in the thread sequence yields better performance with SVM^hmm and CRF. When thread-level features are used, the models using sequence labels no longer outperform SVM and logistic regression. Regarding the two variations of CRF, the posterior marginals maximization is slightly better: maximizing on each comment pays off more than maximizing on the entire thread.
Since the task consists of identifying good answers for a given question, further figures at the question level are necessary, i.e., we compute the target performance measure over all comments of each question and then we average the results over all threads (ta). Table 2 shows these results using two measures, F1 and accuracy, i.e., F1,ta and A_ta, for which long threads have less impact on the final outcome. The impact of the thread-level features is modest in terms of these measures, and they sometimes even negatively affect some of the models.
Figure 2: Two real question-comments threads (simplified; ID in CQA-QL: Q770 and Q752). The sub-indexes stand for the position in the thread and the author of the comment. The class label corresponds to the prediction before and after considering thread-level information. The right-hand label matches the gold one in all cases.
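The thread-averaged measures F1,ta and A_ta can be sketched as follows: each measure is computed per thread and the per-thread values are then averaged, so every thread counts equally regardless of its length. The function names are hypothetical, and returning an F1 of 0 for a thread with no true positives is an assumption of this sketch; the text does not say how such threads are handled.

```python
def f1_good(gold, pred, positive="good"):
    """F1 with respect to the positive class for one list of comments."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # assumption: undefined or zero F1 is scored as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(gold, pred):
    """Fraction of comments whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def thread_averaged(threads, metric):
    """threads: list of (gold_labels, predicted_labels), one pair per question."""
    return sum(metric(g, p) for g, p in threads) / len(threads)


# Two toy threads: a short one with one error and a longer, fully correct one.
threads = [
    (["good", "bad"], ["good", "good"]),
    (["good", "bad", "bad", "bad"], ["good", "bad", "bad", "bad"]),
]
a_ta = thread_averaged(threads, accuracy)   # (0.5 + 1.0) / 2 = 0.75
f1_ta = thread_averaged(threads, f1_good)   # (2/3 + 1.0) / 2
```

On these toy threads the comment-level accuracy is 5/6, whereas A_ta is 0.75: the single error weighs more because it occurs in the short thread.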
Cross validation. In order to better understand the mixed results obtained on the single official test set, we performed 5-fold cross validation over the entire dataset. The results are shown in Table 3. When looking at the performance of the different models with the same set of features, no statistically significant differences are observed on F1 or F1,ta (t-test with confidence level 95%). The sequence of predicted labels in CRF or SVM^hmm does not impact the final result. In contrast, an important difference is observed when thread-level features come into play: the performance of all the models improves by approximately two F1 points absolute, and statistically significant differences are observed for SVM and logistic regression (t-test, 95%). Moreover, while on the test dataset the thread-level features do not always improve F1,ta and A_ta, on the 5-fold cross-validation using them is always beneficial, although for F1,ta a statistically significant difference is observed for SVM only (t-test, 90%).
Qualitative results. In order to get an intuition about the effect of the thread-level features, we show two example comment threads in Figure 2. These comments are classified correctly when thread features are used in the classifier, and incorrectly when only basic features are used.
In the first case (Q_u1), the third comment is classified as good by models that only use basic features. In contrast, thanks to the thread-level features, the classifier can consider that there is a dialogue between u_1 and u_2, causing all the comments to be assigned to the correct class: bad.
In the second example (Q_u4), the first two comments are classified as bad when using the basic features. However, the third comment (written by the same user who asked Q_u4) includes an acknowledgment. The latter is propagated to the previous comments in terms of a thread feature, which indicates that such comments are more likely to be good answers. This feature provides the classifier with enough information to properly label the first two comments as good.

Conclusions
We presented a study on using dependencies between the different answers in the same question thread in the context of answer selection in cQA. Our experiments with different classifiers, features, and experimental conditions reveal that answer dependencies help to improve results on the task. Such dependencies are best exploited by means of carefully designed thread-level features, whereas sequence label information alone, e.g., as used in CRF or SVM^hmm, is not effective.
In future work, we plan to (i) experiment with more sophisticated thread-level features, as well as with other features that model context in general; (ii) try data from other cQA websites, e.g., where dialogue between users is marked explicitly; and finally, (iii) integrate sequence, precedence, and dependency information with global thread-level features in a unified framework.