Joint Learning with Global Inference for Comment Classification in Community Question Answering

This paper addresses the problem of comment classification in community Question Answering. Following the state of the art, we approach the task with a global inference process to exploit the information of all comments in the answer-thread in the form of a fully connected graph. Our contribution comprises two novel joint learning models that are on-line and integrate inference within learning. The first one jointly learns two node- and edge-level MaxEnt classifiers with stochastic gradient descent and integrates the inference step with loopy belief propagation. The second model is an instance of fully connected pairwise CRFs (FCCRF). The FCCRF model significantly outperforms all other approaches and yields the best results on the task to date. Crucial elements for its success are the global normalization and an Ising-like edge potential.


Introduction
Online community fora have been gaining a lot of popularity in recent years. Many of them, such as Stack Exchange, are quite open, allowing anybody to ask and anybody to answer a question, which makes them very valuable sources of information. Yet, this same democratic nature has resulted in some questions accumulating a large number of answers, many of which are of low quality. While nowadays online fora are typically searched using standard search engines that index entire threads, this is not optimal, as it can be very time-consuming for a user to go through and make sense of a long thread. Thus, the creation of automatic systems for Community Question Answering (cQA), which could provide efficient and effective ways to find good answers in a forum, has received a lot of research attention recently (Duan et al., 2008; Li and Manandhar, 2011; dos Santos et al., 2015; Zhou et al., 2015a; Wang and Ittycheriah, 2015; Feng et al., 2015; Nicosia et al., 2015). There have also been related shared tasks at SemEval-2015 (footnote 2) and SemEval-2016 (footnote 3) (Nakov et al., 2016).
In this paper, we focus on the particular problem of classifying comments in the answer-thread of a given question as good or bad answers. Figure 1 presents an excerpt of a real example from the Qatar-Living dataset from SemEval-2015 Task 3. There is a question on top (Q) followed by eight comments ($A_1, A_2, \dots, A_8$). According to the human annotations ('Human'), all comments but 3 and 5 are good answers to Q. The comments also contain the predictions of a good-vs-bad binary classifier trained with state-of-the-art features on this dataset (Nicosia et al., 2015); its errors are highlighted in red. Many comments are short, making it difficult for the classifier to make the right decisions, but some errors could be corrected using information in the other comments. For instance, comments 4 and 7 are similar to each other, but also to comments 2 and 8 ('Hobby shop', 'City Center', etc.). It seems reasonable to think that similar comments should have the same labels, so comments 2, 4, 7 and 8 should all be labeled consistently as good comments.
Indeed, recent work has shown the benefit of using varied thread-level information for answer classification, either by developing features modeling the thread structure and dialogue, or by applying global inference mechanisms at the thread level using the predictions of local classifiers. We follow the second approach, assuming a graph representation of the answer-thread, where nodes are comments and edges represent pairwise (similarity) relations between them. Classification decisions are at the level of nodes and edges, and global inference is used to get the best label assignment to all comments.
Our main contribution is to propose online models for learning the decisions jointly, incorporating the inference inside the joint learning algorithm. Building on the ideas from papers coupling learning and inference for NLP structure prediction problems (Punyakanok et al., 2005; Carreras et al., 2005), we propose joint learning of two MaxEnt classifiers with stochastic gradient descent, integrating global inference based on loopy belief propagation. We also propose a joint model with global normalization, which is an instance of Fully Connected Conditional Random Fields (Murphy, 2012). We compare our joint models with the previous state of the art for the comment classification problem. We find that the coupled learning-and-inference model is not competitive, probably due to the label bias problem. In contrast, the fully connected CRF model improves results significantly over all rival models, yielding the best results on the task to date.
Footnote 2: http://alt.qcri.org/semeval2015/task3/
Footnote 3: http://alt.qcri.org/semeval2016/task3/
In the remainder of this paper, after discussing related work in Section 2, we introduce our joint models in Section 3. We then describe our experimental settings in Section 4. The experiments and analysis of results are presented in Section 5. Finally, we summarize our contributions and outline future directions in Section 6.

Related Work
The idea of using global inference based on locally learned classifiers has been tried in various settings. In the family of graph-based inference, Pang and Lee (2004) used local classification scores with proximity information as edge weights in a graph-cut inference to collectively identify subjective sentences in a review. Thomas et al. (2006) used the same framework to classify transcribed congressional speeches. They applied a classifier to provide edge weights that reflect the degree of agreement between speakers. Burfoot et al. (2011) extended the framework by including other inference algorithms such as loopy belief propagation and mean-field.
"Learning and inference under structural and linguistic constraints" is a framework to combine the predictions of local classifiers in a global decision process solved by Integer Linear Programming. The framework has been applied to many NLP structure prediction problems, including shallow parsing (Punyakanok and Roth, 2000), semantic role labeling (Punyakanok et al., 2004), and joint learning of entities and relations. Further work explored the possibility of coupling learning and inference in the previous setting. For instance, Carreras et al. (2005) presented a model for parsing that jointly trains several local decisions with a perceptron-like algorithm that gets feedback after inference. Punyakanok et al. (2005) studied empirically and theoretically the cases in which this inference-based learning strategy is superior to the decoupled approach.
On the particular problem of comment classification in cQA, some work has exploited thread-level information. Hou et al. (2015) used features about the position of the comment in the thread. Other work developed more elaborate global features to model thread structure and the interaction among users. Further work exploited global inference algorithms at the thread level. For instance, Zhou et al. (2015c) and Zhou et al. (2015b) treated the task as sequential classification, using a variety of machine learning algorithms to label the sequence of time-sorted comments: LSTMs, CRFs, $\text{SVM}^{hmm}$, etc. Finally, it has been shown that exploiting the pairwise relations between comments (at any distance) is more effective than using the sequential information; these results are the best on this task to date. In this paper, we assume the same setting (cf. Section 3) and we experiment with new models that do learning jointly with inference, in the same manner as Punyakanok et al. (2005), and also with fully connected pairwise CRFs.

Our Model
Given a forum question $Q$ and a thread of answers $T = \{A_1, A_2, \dots, A_n\}$, the task is to classify each answer $A_i$ in the thread into one of $K$ possible classes based on its relevance to the question. We represent each thread as a fully-connected graph, where each node represents an answer in the thread.
Given this setting, there exist at least three fundamentally different approaches to learn classification functions.
The first is the traditional approach of learning a local classifier that ignores the structure in the output, and using it to predict the label of each node $A_i$ separately. This approach only considers correlations between the label of $A_i$ and features extracted from $A_i$.
The second approach, adopted in previous work, is to first learn two local classifiers separately: (i) a node-level classifier to predict the label for each individual node, and (ii) an edge-level classifier to predict whether the two nodes connected by an edge should have the same label or not (assuming a fully connected graph). The predictions of the local classifiers are then used in a global inference algorithm (e.g., graph-cut) to perform collective classification by maintaining structural constraints in the output. There are two issues with this model: (i) the local classifiers are trained separately; (ii) by decoupling learning from inference, this approach can lead to suboptimal solutions, as Punyakanok et al. (2005) pointed out.
The third approach, which we adopt in this paper, is to model the dependencies between the output variables while learning the classification functions jointly by optimizing a global performance criterion. The dependencies are captured using node-level and edge-level factors defined over a fully connected graph. The idea is that incorporating structural constraints in the form of all-pair relations during training can yield a better solution that directly optimizes an objective function for the target task.
Before we present our models in subsections 3.1 and 3.2, let us first introduce the notation that we will use. Each thread $T = \{A_1, A_2, \dots, A_n\}$ is represented by a complete graph $G = (V, E)$. Each node $i \in V$ in the graph is associated with an input vector $x_i$, which represents the features of an answer $A_i$, and an output variable $y_i \in \{1, 2, \dots, K\}$, representing the class label. Similarly, each edge $(i, j) \in E$ is associated with an input feature vector $\phi(x_i, x_j)$, derived from the node-level features, and an output variable $y_{i,j} \in \{1, 2, \dots, L\}$, representing the labels for the pair of nodes. We use $\psi_n(y_i \mid x_i, v)$ and $\psi_e(y_{i,j} \mid \phi(x_i, x_j), w)$ to denote the node-level and the edge-level classification functions, respectively. We call $\psi_n$ and $\psi_e$ factors, which can be either normalized (e.g., probabilities) or unnormalized quantities. The model parameters $\theta = [v, w]$ are to be learned during training.

Joint Learning of Two Classifiers with Global Thread-Level Inference
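Our aim is to train the local classifiers so that they produce correct global classification. To this end, in our first model we train the node- and the edge-level classifiers jointly based on global feedback provided by a global inference algorithm. The global feedback determines how much to adjust the local classifiers so that the classifiers and the inference together produce the desired result. We use log-linear models (aka maximum entropy) for both classifiers:

$$\psi_n(y_i = k \mid x_i, v) = \frac{\exp(v_k^\top x_i)}{Z(x_i, v)} \tag{1}$$

$$\psi_e(y_{i,j} = l \mid \phi(x_i, x_j), w) = \frac{\exp(w_l^\top \phi(x_i, x_j))}{Z(\phi(x_i, x_j), w)} \tag{2}$$

where $v_k$ and $w_l$ are the weight vectors for node class $k$ and edge class $l$. The log likelihood (LL) for one data point $(x, y)$ (i.e., a thread) can be written as follows:

$$f(\theta) = \sum_{i \in V} \sum_{k=1}^{K} y_i^k \left[ v_k^\top x_i - \log Z(x_i, v) \right] + \sum_{(i,j) \in E} \sum_{l=1}^{L} y_{i,j}^l \left[ w_l^\top \phi(x_i, x_j) - \log Z(\phi(x_i, x_j), w) \right] \tag{3}$$

where $y_i^k$ and $y_{i,j}^l$ are the gold labels for the $i$-th node and the $(i, j)$-th edge expressed in 1-of-$K$ (or 1-of-$L$) encoding, respectively, and the $Z(\cdot)$ terms are the local normalization constants.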
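To make the locally normalized classifiers and the per-thread log likelihood concrete, here is a minimal pure-Python sketch. The function names, the dictionary layout for edges, and the toy feature values are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def local_probs(weights, features):
    """Locally normalized log-linear (MaxEnt) distribution over classes.

    weights: one weight vector per class; features: one feature vector.
    """
    logits = [sum(w_d * f_d for w_d, f_d in zip(w_k, features)) for w_k in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    z_local = sum(exps)  # local normalization constant Z(.)
    return [e / z_local for e in exps]


def thread_log_likelihood(v, w, node_feats, edge_feats, node_gold, edge_gold):
    """Sum of node and edge log-probabilities of the gold labels for one thread."""
    ll = 0.0
    for i, x_i in enumerate(node_feats):
        ll += math.log(local_probs(v, x_i)[node_gold[i]])
    for edge, phi_ij in edge_feats.items():
        ll += math.log(local_probs(w, phi_ij)[edge_gold[edge]])
    return ll


# Toy thread (hypothetical numbers): 3 comments, K = 2 node classes, L = 2 edge classes.
node_feats = [[1.0, 0.5], [0.2, 0.1], [0.9, 0.4]]
edge_feats = {(0, 1): [0.3], (0, 2): [0.9], (1, 2): [0.2]}
node_gold = [0, 1, 0]
edge_gold = {(0, 1): 1, (0, 2): 0, (1, 2): 1}
v = [[0.0, 0.0], [0.0, 0.0]]  # zero weights give uniform predictions
w = [[0.0], [0.0]]
print(thread_log_likelihood(v, w, node_feats, edge_feats, node_gold, edge_gold))
```

With all-zero weights every node and edge prediction is uniform, so the log likelihood is simply the number of nodes plus edges times log(1/2), which is a convenient sanity check.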
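Algorithm 1 gives pseudocode that trains this model in an online fashion using feedback from the loopy belief propagation (LBP) inference algorithm (described in Section 3.1.1). Specifically, the marginals from LBP are used in a stochastic gradient descent (SGD) algorithm, which has the following (minibatch) update rule:

$$\theta_{t+1} = \theta_t - \eta_t \frac{1}{N} f'(\theta_t) \tag{4}$$

where $\theta_t$ and $\eta_t$ are the model parameters and the learning rate at step $t$, respectively, and $\frac{1}{N} f'(\theta_t)$ is the mean gradient for the minibatch (a thread). For our maximum entropy models, the gradients become

$$f'(v) = \sum_{i \in V} \left[ \beta_n(y_i) - y_i \right] x_i^\top \tag{5}$$

$$f'(w) = \sum_{(i,j) \in E} \left[ \beta_e(y_{i,j}) - y_{i,j} \right] \phi(x_i, x_j)^\top \tag{6}$$

In the above equations, $\beta$ and $y$ are the marginals and the gold labels, respectively. Since Equation 4 is a descent step, Equations 5 and 6 are gradients of the negative log likelihood, with the inferred marginals in place of the local probabilities.

Algorithm 1: Joint learning of local classifiers with global thread-level inference
1. Initialize the model parameters $v$ and $w$;
2. repeat
     for each thread $G = (V, E)$ do
       a. Compute node and edge probabilities $\psi_n(y_i \mid x_i, v)$ and $\psi_e(y_{i,j} \mid \phi(x_i, x_j), w)$;
       b. Infer node and edge marginals $\beta_n(y_i)$ and $\beta_e(y_{i,j})$ using sum-product LBP;
       c. Update $v$ and $w$ with Equation 4, using the gradients in Equations 5 and 6;
     end
   until convergence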
Note that when applying the model to the test threads, we need to perform the same global inference to get the best label assignments.
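The parameter update driven by the inferred marginals can be sketched as follows. This is a minimal illustration that takes the marginals as given (however they were inferred); the function name `sgd_update` and the toy numbers are assumptions made for the example.

```python
def sgd_update(weights, feats, marginals, gold, lr):
    """Return the weights after one thread-level (minibatch) step.

    For each class k:
        w_k <- w_k - lr * (1/N) * sum_i (beta_i[k] - 1[gold_i == k]) * x_i
    where beta_i are the marginals obtained from global inference.
    """
    n = len(feats)
    updated = [list(w_k) for w_k in weights]  # copy, leave the input untouched
    for x_i, beta_i, y_i in zip(feats, marginals, gold):
        for k, w_k in enumerate(updated):
            error = beta_i[k] - (1.0 if y_i == k else 0.0)
            for d, x_d in enumerate(x_i):
                w_k[d] -= lr * error * x_d / n
    return updated


# Toy example (hypothetical numbers): one comment, two classes, gold class 1.
v = [[0.0, 0.0], [0.0, 0.0]]
new_v = sgd_update(v, feats=[[1.0, 2.0]], marginals=[[0.5, 0.5]], gold=[1], lr=1.0)
print(new_v)  # prints [[-0.5, -1.0], [0.5, 1.0]]
```

The same function can be applied to the edge classifier by passing edge features, edge marginals, and edge gold labels.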

Inference Using Belief Propagation
Belief Propagation or BP (Pearl, 1988) is a message passing algorithm for inference in probabilistic graphical models. It supports (i) sum-product, to compute the marginal distribution for each unobserved variable, i.e., $p(y_i \mid x, \theta)$; and (ii) max-product, to compute the most likely label configuration, i.e., $\arg\max_y p(y \mid x, \theta)$. We describe here the variant that operates on undirected graphs (aka Markov random fields) with pairwise factors, which uses the following equations:

$$\mu_{i \to j}(y_j) = \sum_{y_i} \psi_n(y_i)\, \psi_e(y_{i,j}) \prod_{k \in N(i) \setminus j} \mu_{k \to i}(y_i) \tag{7}$$

$$\beta_n(y_i) \approx \psi_n(y_i) \prod_{j \in N(i)} \mu_{j \to i}(y_i) \tag{8}$$

where $\mu_{i \to j}$ is a message from node $i$ to node $j$, $N(i)$ are the nodes neighbouring $i$, and $\psi_n(y_i)$ and $\psi_e(y_{i,j})$ are the node and the edge factors. The algorithm proceeds by sending messages on each edge until the node beliefs $\beta_n(y_i)$ stabilize. The edge beliefs can be written as follows:

$$\beta_e(y_{i,j}) \approx \psi_e(y_{i,j})\, \psi_n(y_i)\, \psi_n(y_j) \prod_{k \in N(i) \setminus j} \mu_{k \to i}(y_i) \prod_{l \in N(j) \setminus i} \mu_{l \to j}(y_j) \tag{9}$$

The node and the edge marginals are then computed by normalizing the node and the edge beliefs, respectively. By replacing the summation with a max operation in Equation 7, we can get the most likely label configuration (i.e., argmax over labels).
BP is guaranteed to converge to an exact solution if the graph is a tree. However, exact inference is intractable for general graphs, i.e., graphs with loops. Despite this, it has been advocated by Pearl (1988) to use BP in loopy graphs as an approximation scheme; see also (Murphy, 2012), page 768. The algorithm is then called "loopy" BP, or LBP. Although LBP gives approximate solutions for general graphs, it often works well in practice (Murphy et al., 1999), outperforming other methods such as mean field (Weiss, 2001) and graph-cut (Burfoot et al., 2011).
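The message passing described above can be sketched in a few lines of pure Python. This is a minimal illustration for a fully connected pairwise graph; the synchronous update schedule, the message normalization, the iteration cap, and all names and toy potentials are assumptions, since the text does not specify these details.

```python
def loopy_bp(node_pot, edge_pot, max_iters=50, tol=1e-9):
    """Sum-product loopy BP on a fully connected pairwise graph.

    node_pot[i][a]: node factor for node i taking label a.
    edge_pot[(i, j)][a][b] with i < j: edge factor for labels (a, b).
    Returns normalized node beliefs (approximate marginals).
    """
    n, num_labels = len(node_pot), len(node_pot[0])

    def pair(i, j, a, b):
        # Edge factor for node i = a and node j = b, whatever the key order.
        return edge_pot[(i, j)][a][b] if i < j else edge_pot[(j, i)][b][a]

    # msgs[(i, j)][b]: message from node i to node j about label b of node j.
    msgs = {(i, j): [1.0 / num_labels] * num_labels
            for i in range(n) for j in range(n) if i != j}

    for _ in range(max_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for b in range(num_labels):
                total = 0.0
                for a in range(num_labels):
                    incoming = 1.0
                    for k in range(n):  # all neighbours of i except j
                        if k != i and k != j:
                            incoming *= msgs[(k, i)][a]
                    total += node_pot[i][a] * pair(i, j, a, b) * incoming
                out.append(total)
            norm = sum(out)
            new_msgs[(i, j)] = [m / norm for m in out]
        change = max((abs(old - new)
                      for key in msgs
                      for old, new in zip(msgs[key], new_msgs[key])),
                     default=0.0)
        msgs = new_msgs
        if change < tol:  # messages have stabilized
            break

    beliefs = []
    for i in range(n):
        belief = list(node_pot[i])
        for k in range(n):
            if k != i:
                belief = [belief[a] * msgs[(k, i)][a] for a in range(num_labels)]
        norm = sum(belief)
        beliefs.append([b / norm for b in belief])
    return beliefs


# Toy thread (hypothetical numbers): comment 0 is confidently "good" (label 0),
# comments 1 and 2 are uncertain; every edge favours equal labels.
node_pot = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
same_pref = [[2.0, 1.0], [1.0, 2.0]]
edge_pot = {(0, 1): same_pref, (0, 2): same_pref, (1, 2): same_pref}
for i, belief in enumerate(loopy_bp(node_pot, edge_pot)):
    print(i, [round(p, 3) for p in belief])
```

In this toy thread the uncertain comments are pulled toward the label of the confident one, which is the kind of correction motivated in the introduction. On a two-node graph (a tree) the same routine returns the exact marginals.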
It is important to mention that the approach presented above (i.e., subsection 3.1) is similar in spirit to the approach of Collins (2002), Carreras and Màrquez (2003) and Punyakanok et al. (2005). The main difference is that they use a Perceptron-like online algorithm, where the updates are done based on the best label configuration (i.e., argmax y p(y|x, θ)) rather than the marginals.
One can use graph-cut (applicable only for binary output variables) or max-product LBP for the decoding task. However, this yields a discontinuous estimate (even with averaged perceptron) for the gradient (see Section 5). For the same reason, we use sum-product LBP rather than max-product LBP.

A Joint Model with Global Normalization
Although the approach of updating the parameters of the local classifiers based on the global inference might seem like a natural extension to train the classifiers jointly, it suffers from at least two problems. First, since the node and the edge scores are normalized locally (see Equations 1 and 2), this approach leads to the so-called label bias problem, previously discussed by Lafferty et al. (2001). Namely, due to the local normalization, local features at any node do not influence states of other nodes in the graph. Second, the two classifiers use their own feature sets. However, the same feature sets that give optimal results locally (i.e., when trained on local objectives), may not work well when the models are trained jointly based on the global feedback. In order to address these issues, below we propose a different model.
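In our second approach, we seek to build a joint model with global normalization. We define the following conditional joint distribution:

$$p(y \mid x, \theta) = \frac{1}{Z(x, \theta)} \prod_{i \in V} \psi_n(y_i \mid x, v) \prod_{(i,j) \in E} \psi_e(y_{i,j} \mid x, w) \tag{10}$$

where $\psi_n$ and $\psi_e$ are the node and edge factors, and $Z(\cdot)$ is the global normalization constant that ensures a valid probability distribution.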
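This model is essentially a fully connected conditional random field or FCCRF (Murphy, 2012). Figure 2 shows the differences between the two models with the standard graphical model representation. The global normalization allows CRFs to take long-range interactions into account. Similar to our previous model, we use a log-linear representation for the factors:

$$\psi_n(y_i \mid x, v) = \exp\left(v^\top \phi(y_i, x)\right) \tag{11}$$

$$\psi_e(y_{i,j} \mid x, w) = \exp\left(w^\top \phi(y_{i,j}, x)\right) \tag{12}$$

where $\phi(\cdot)$ is a feature vector derived from the inputs and the labels. The LL for one data point becomes

$$f(\theta) = \sum_{i \in V} v^\top \phi(y_i, x) + \sum_{(i,j) \in E} w^\top \phi(y_{i,j}, x) - \log Z(x, \theta) \tag{13}$$

This objective is concave (equivalently, its negative is convex), so we can use gradient-based methods to find the global optimum. The gradients have the following form:

$$f'(v) = \sum_{i \in V} \left( \phi(y_i, x) - E[\phi(y_i, x)] \right) \tag{14}$$

$$f'(w) = \sum_{(i,j) \in E} \left( \phi(y_{i,j}, x) - E[\phi(y_{i,j}, x)] \right) \tag{15}$$

where the $E[\phi(\cdot)]$ terms denote the expected feature vector under the model distribution. Traditionally, CRFs have been trained using offline methods like limited-memory BFGS. Online training of CRFs using SGD was proposed by Vishwanathan et al. (2006). To compare our two methods, we use SGD to train our CRF models. The pseudocode is very similar to Algorithm 1.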
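The effect of global normalization can be illustrated on a tiny thread by enumerating all label assignments. This is a minimal sketch: brute-force enumeration is used here only because it is exact and easy to read for three comments, whereas real threads require approximate inference. All names and scores are hypothetical.

```python
import math
from itertools import combinations, product


def log_score(labels, node_scores, edge_scores):
    """Unnormalized log-score: sum of node and edge log-factors."""
    total = sum(node_scores[i][y_i] for i, y_i in enumerate(labels))
    for i, j in combinations(range(len(labels)), 2):
        total += edge_scores[(i, j)][labels[i]][labels[j]]
    return total


def joint_distribution(node_scores, edge_scores):
    """Globally normalized distribution over all label assignments."""
    n, num_labels = len(node_scores), len(node_scores[0])
    configs = list(product(range(num_labels), repeat=n))
    logs = [log_score(y, node_scores, edge_scores) for y in configs]
    top = max(logs)
    log_z = top + math.log(sum(math.exp(s - top) for s in logs))  # global log Z
    return {y: math.exp(s - log_z) for y, s in zip(configs, logs)}


def node_marginals(dist, n, num_labels):
    """Exact node marginals obtained by summing the joint distribution."""
    marg = [[0.0] * num_labels for _ in range(n)]
    for labels, prob in dist.items():
        for i, y_i in enumerate(labels):
            marg[i][y_i] += prob
    return marg


# Toy thread (hypothetical scores): only comment 0 has node evidence (for label 0);
# every edge adds a bonus when the two labels agree.
node_scores = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
agree = [[1.0, 0.0], [0.0, 1.0]]
edge_scores = {(0, 1): agree, (0, 2): agree, (1, 2): agree}
dist = joint_distribution(node_scores, edge_scores)
for i, m in enumerate(node_marginals(dist, 3, 2)):
    print(i, [round(p, 3) for p in m])
```

Although comments 1 and 2 have no node-level evidence of their own, their marginals shift toward label 0, because the single normalization constant couples all nodes in the thread.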

Modeling Edge Factors
One crucial aspect in the joint models described above is the modeling of edge factors. The traditional way is to define edge factors, where $y_{i,j}$ spans over all possible state transitions, that is $K^2$ different transitions, each of which is associated with a weight vector. This method has the advantage that it models transitions in a fine-grained way, but, in doing so, it also increases the number of model parameters, which may result in overfitting.
Alternatively, one can define Ising-like edge factors, where we only distinguish between two transitions: (i) same, when $y_i = y_j$ and (ii) different, when $y_i \neq y_j$. This modeling involves tying one set of parameters for all same transitions, and another set for all different transitions.
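The difference between the two edge parameterizations can be sketched as follows. This is a minimal illustration; the function names and the toy weights and features are assumptions, not the authors' code.

```python
def dot(weights, features):
    """Inner product of a weight vector and a feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def full_edge_scores(w_full, phi, num_labels):
    """Fine-grained factors: one weight vector per (label_i, label_j) transition."""
    return [[dot(w_full[a][b], phi) for b in range(num_labels)]
            for a in range(num_labels)]


def ising_edge_scores(w_same, w_diff, phi, num_labels):
    """Ising-like factors: one tied vector for equal labels, one for different labels."""
    same, diff = dot(w_same, phi), dot(w_diff, phi)
    return [[same if a == b else diff for b in range(num_labels)]
            for a in range(num_labels)]


def count_edge_parameters(num_labels, num_features, ising):
    """Number of edge weights under each parameterization."""
    return (2 if ising else num_labels ** 2) * num_features


# Toy edge (hypothetical numbers): two edge features, binary labels.
phi = [1.0, 0.8]
table = ising_edge_scores(w_same=[0.5, 1.0], w_diff=[-0.5, 0.0], phi=phi, num_labels=2)
print(table)
print(count_edge_parameters(2, len(phi), ising=True),
      count_edge_parameters(2, len(phi), ising=False))
```

With binary labels the tied version halves the number of edge weights, and the saving grows quadratically with the number of labels.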

Experimental Setting
In this section, we describe our experimental setting. We first introduce the dataset we use, then we present the features and the models that we compare.

Datasets and Evaluation
We experimented with the dataset from SemEval-2015 Task 3 on Answer Selection for Community Question Answering. The dataset contains question-answer threads from the Qatar Living forum. Each thread consists of a question followed by one or more (up to 143) comments. The dataset is split into training, development and test sets, with 2,600, 300, and 329 questions, and 16,541, 1,645, and 1,976 answers, respectively.
Each comment in the dataset is annotated with one of the following labels, reflecting how well it answers the question: Good, Potential, Bad, Dialogue, Not English, and Other. At SemEval-2015 Task 3, the latter four classes were merged into BAD at testing time, and the evaluation measure was a macro-averaged $F_1$ over the three classes: Good, Potential, and BAD. Unfortunately, the Potential class was both the smallest (covering about 10% of the data), and also the noisiest and the hardest to predict; yet, its impact was magnified by the macro-averaged $F_1$. Thus, subsequent work has further merged Potential under BAD, and has used for evaluation $F_1$ with respect to the Good category (or just accuracy). For our experiments below, we also report $F_1$ for the Good class and the overall accuracy. We further perform statistical significance tests using an approximate randomization test based on accuracy. We used SIGF V.2 (Padó, 2006) with 10,000 iterations.
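The evaluation measures and the idea behind the significance test can be sketched as follows. This is a generic paired approximate randomization test written for illustration; it is not the SIGF tool itself, and the function names, label strings, and toy data are assumptions.

```python
import random


def accuracy(gold, pred):
    """Fraction of comments whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def prf_good(gold, pred, positive="Good"):
    """Precision, recall, and F1 with respect to the positive (Good) class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def approx_randomization(gold, pred_a, pred_b, iterations=10000, seed=0):
    """Paired approximate randomization test on the accuracy difference.

    Randomly swaps the two systems' outputs per comment and counts how often
    the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(accuracy(gold, pred_a) - accuracy(gold, pred_b))
    at_least = 0
    for _ in range(iterations):
        shuf_a, shuf_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(accuracy(gold, shuf_a) - accuracy(gold, shuf_b)) >= observed:
            at_least += 1
    return (at_least + 1) / (iterations + 1)


# Toy data (hypothetical labels) for two systems on four comments.
gold = ["Good", "Good", "Bad", "Bad"]
system_a = ["Good", "Bad", "Good", "Bad"]
system_b = ["Good", "Good", "Bad", "Bad"]
print(accuracy(gold, system_a), prf_good(gold, system_a))
print(approx_randomization(gold, system_a, system_b, iterations=1000))
```

With only four comments the test has essentially no power; the toy data is meant to show the mechanics, not a meaningful p-value.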

Features
For comparison, we use the features from our previous work to implement all classifiers in our models and baselines. There are two sets of features, corresponding to the two main classification problems in the models: node-level (i.e., classifying a comment as good or bad) and edge-level (i.e., classifying a pair of comments as having the same or different labels).
The features for node-level classification include three types of information: (i) a variety of textual similarity measures computed between the question and the comment, (ii) several boolean features capturing the presence of certain relevant words or patterns, e.g., URLs, emails, positive/negative words, acknowledgements, forum categories, presence of long words, etc., and (iii) a set of global features modeling dialogue and user interactions in the answer-thread. The features in the last two categories are manually engineered (Nicosia et al., 2015). The features we use for edge-level classification include (i) all features from the node classification problem coded as the absolute value of the difference between the two comments, (ii) a variety of text similarity features between the two comments, and (iii) the good/bad predictions of the node-level classifier on the two comments involved in the edge decision. See our previous work for a detailed description of the features.

Methods Compared
We experimentally compare our above-described joint models to some baselines and to the state of the art for this problem. We briefly describe all models below, together with the names used in the tables of results.

Independent Comment Classification (ICC)
These are binary classifiers to label thread comments independently into good and bad categories. The simplest baseline (Majority) classifies all examples with the most frequent category. We also train a MaxEnt classifier with stochastic gradient descent (SGD) and a voted perceptron (ICC_ME and ICC_Perc, respectively).

Learning-and-Inference Models (LI)
This is the approach presented in previous work, which reports the best results on the task. The model is explained in Section 3. We experiment with MaxEnt classifiers trained on-line with SGD and two different inference strategies, graph-cut and loopy BP (LI_ME-GC and LI_ME-LBP, in our notation).

Joint Learning and Inference Models
These are our new models. First, we experiment with the model for joint learning of two classifiers coupled with thread-level inference (Section 3.1). We have two versions, one using MaxEnt classifiers and the other using averaged Perceptron. The inference algorithm is loopy BP in both cases. We call these methods Joint_ME-LBP and Joint_Perc-LBP, respectively. Second, we experiment with the joint model with global normalization (cf. Section 3.2). We call it FCCRF, for fully connected CRF. We use the Ising-like edge factors defined in Section 3.2.1.

Evaluation
All results we report below are calculated on the test set, using parameters tuned on the development set.
Our main results are shown in Table 1, where we report accuracy (Acc) as well as precision (P), recall (R) and $F_1$-score ($F_1$) for the good class.
The models are organized in four blocks. On top, we see that the majority class baseline achieves an accuracy of 50.5%, as the dataset is very well balanced between the classes.
In block II, we find the results for the local classifiers, ICC_ME and ICC_Perc, which achieve very similar results. They are comparable to MaxEnt in Table 2, where we report the best published results on this dataset; yet, our classifiers are trained on-line.
Block III in the table reports results for models that train two local MaxEnt classifiers and then perform thread-level inference using either graph-cut (LI_ME-GC) or loopy BP (LI_ME-LBP). The thread-level inference yields improvements over the ICC models in block II, which is consistent with previous findings; however, the difference in terms of accuracy is not statistically significant (p-value = 0.09).
Comparing our LI_ME-GC to MaxEnt+GraphCut in Table 2, we see that we are slightly worse: -0.2 in $F_1$-score, and -0.4 in accuracy. It turns out that this is due to our on-line MaxEnt classifier for the pairwise classification being slightly worse (-0.4 accuracy points absolute), which could explain the lower performance after the graph-cut inference.
Next, block IV shows that the fully connected CRF model (FCCRF) improves over the models in block III by more than one point absolute in both $F_1$ and accuracy. The improvement is statistically significant (p-value = 0.04); especially noticeable is the increase in recall (+2.6 points). This result is also an improvement over the state of the art, as Table 2 shows.
Again in block IV, we can see that the two models that perform joint training of two classifiers and integrate inference in the training loop, Joint_ME-LBP and Joint_Perc-LBP, do not work well and fall below the learning and inference models from block III. As we explained above, these models have two major disadvantages compared to FCCRF: (i) the local normalization of node and edge scores is prone to label bias issues; (ii) each of the two classifiers uses its own feature set, which might not be optimal when they are trained jointly based on the global feedback.
Notice that the version using Perceptron, Joint_Perc-LBP, works poorly in this setting. Since updates are done after each thread-level inference, we could not use a voted perceptron and used an averaged one instead (Collins, 2002). Moreover, it did not yield probabilities but real-valued scores, which we had to remap to the [0, 1] interval using a sigmoid.
Table 3 compares different variants of CRF. The first two rows show the results for the commonly used linear-chain CRF (LCCRF) of order 1 and 2. We can see that these models fall two accuracy (and $F_1$) points below FCCRF, which indicates that the pairwise relations between non-consecutive comments provide additional relevant information for the task. The fourth row shows the results when we eliminate the edge-level features and we consider state transitions using the bias features only: the decrease in performance is tiny, which means that what matters is to model the interaction in the first place; the particular features used are less important. More noticeable is the effect of using Ising-like modeling of the edge factors in our FCCRF model. If we use finer-grained edge factors for each of the four combinations (Good-Good, Good-Bad, Bad-Good, and Bad-Bad), the performance decreases significantly, mostly due to a drop in recall (see 'FCCRF (4C)').

Error Analysis
Next, we take a closer look at the predictions made by our best Local (ICC_ME), Inference (LI_ME-GC), and Joint (FCCRF) models. We focus on questions for which there are at least two comments. There were 280 such test questions (out of 329), with a total of 1,927 comments. The Local, the Inference, and the Joint models made correct predictions for 78.7%, 79.1% and 80.4% of the comments, respectively. We can see that the Inference model behaves more like Local, and not so much like Joint. This is further confirmed when we look at the disagreement between each pair of models: Local vs. Inference has 6.0% disagreement, for Local vs. Joint it is 9.9%, and for Inference vs. Joint it is 8.8%. Figure 3 compares the three models vs. the gold human labels on a particular test question (ID=Q2908; some long comments are truncated and the four omitted answers were classified correctly by all three classifiers). We can see that the Joint model is more robust than the Local one: while Joint corrects two of the three wrong classifications of Local, Inference makes two further errors instead.

Conclusion
We have proposed two learning methods for comment classification in community Question Answering. We start from the established finding that exploiting the interrelations between all the comments in the answer-thread is beneficial for the task. Thus, we take as our baseline the learning and inference model from previous work, in which the answer-thread is modeled as a fully connected graph. Our contribution consists of moving the framework to on-line learning and proposing two models for coupling learning with inference.
Our first model learns jointly the two MaxEnt classifiers with SGD and incorporates the graph inference at every step with loopy belief propagation. This model, due to its local normalization, suffers from the label bias problem. The alternative we proposed is to use an instance of a Fully Connected CRF that operates on the same graph and considers the node and edge factors with a shared set of features. One of its main advantages is that the normalization is global. We experimented with the SemEval-2015 Task 3 dataset and we confirmed the advantage of the FCCRF model, which outperforms all baselines and achieves better results than the state of the art.
In the near future, we plan to apply the FCCRF model to the full cQA task, i.e., finding good answers to newly asked questions using previously asked questions and their answer threads. In this setting, we want to experiment with (i) ranking comments (instead of classifying them), and (ii) exploiting the similarities between the new question and the questions in the database and also the relations between comments across different answer-threads.

Figure 3: Sample test question with a thread of comments and, for each comment, decisions by the local (Loc), the global inference (Inf), and the global joint (Jnt) classifiers, as well as by the human annotators.