Joint Multitask Learning for Community Question Answering Using Task-Specific Embeddings

We address jointly two important tasks for Question Answering in community forums: given a new question, (i) find related existing questions, and (ii) find relevant answers to this new question. We further use an auxiliary task to complement the previous two, i.e., (iii) find good answers with respect to the thread question in a question-comment thread. We use deep neural networks (DNNs) to learn meaningful task-specific embeddings, which we then incorporate into a conditional random field (CRF) model for the multitask setting, performing joint learning over a complex graph structure. While DNNs alone achieve competitive results when trained to produce the embeddings, the CRF, which makes use of the embeddings and the dependencies between the tasks, improves the results significantly and consistently across a variety of evaluation metrics, thus showing the complementarity of DNNs and structured learning.


Introduction and Motivation
Question answering web forums such as Stack Overflow, Quora, and Yahoo! Answers usually organize their content in topically-defined forums containing multiple question-comment threads, where a question posed by a user is often followed by a possibly very long list of comments by other users, supposedly intended to answer the question. Many forums are not moderated, which often results in noisy and redundant content.
Within community Question Answering (cQA) forums, two subtasks are of special relevance when a user poses a new question to the website (Hoogeveen et al., 2018; Lai et al., 2018): (i) finding similar questions (question-question relatedness), and (ii) finding relevant answers to the new question, if they already exist (answer selection).
* Work conducted while this author was at QCRI, HBKU.
Both subtasks have been the focus of recent research as they result in end-user applications. The former is interesting for a user who wants to explore the space of similar questions in the forum and to decide whether to post a new question. It can also be relevant for the forum owners as it can help detect redundancy, eliminate question duplicates, and improve the overall forum structure. Subtask (ii) on the other hand is useful for a user who just wants a quick answer to a specific question, without the need of digging through the long answer threads and winnowing good from bad comments or without having to post a question and then wait for an answer.
Obviously, the two subtasks are interrelated as the information needed to answer a new question is usually found in the threads of highly related questions. Here, we focus on jointly solving the two subtasks with the help of yet another related subtask, i.e., determining whether a comment within a question-comment thread is a good answer to the question heading that thread.
An example is shown in Figure 1. A new question $q$ is posed for which several potentially related questions are identified in the forum (e.g., by using an information retrieval system); $q_i$ in the example is one of these existing questions. Each retrieved question comes with an associated thread of comments; $c^i_m$ represents one comment from the thread of question $q_i$. Here, $c^i_m$ is a good answer for $q_i$, $q_i$ is indeed a question related to $q$, and consequently $c^i_m$ is a relevant answer for the new question $q$. This is the setting of SemEval-2016 Task 3, and we use its benchmark datasets.
Our approach has two steps. First, a deep neural network (DNN) in the form of a feed-forward neural network is trained to solve each of the three subtasks separately, and the subtask-specific hidden layer activations are taken as embedded feature representations to be used in the second step. Then, a conditional random field (CRF) model uses these embeddings and performs joint learning with global inference to exploit the dependencies between the subtasks.

$q$: "How can I extend a family visit visa?"
$q_i$: "Dear All; I wonder if anyone knows the procedure how I can extend the family visit visa for my wife beyond 6 months. I already extended it for 5 months and is 6 months running. I would like to get it extended for couple of months more.Any suggestion is highly appreciable.Thanks"
$c^i_m$: "You can get just another month's extension before she completes 6 months by presenting to immigration office a confirmed booking of her return ticket which must not exceed 7 months."
Figure 1: Example of the three pieces of information in the cQA problems addressed in this paper.
A key strength of DNNs is their ability to learn nonlinear interactions between underlying features through specifically-designed hidden layers, and also to learn the features (e.g., vectors for words and documents) automatically. This capability has led to gains in many unstructured output problems. DNNs are also powerful for structured output problems. Previous work has mostly relied on recurrent or recursive architectures to propagate information through hidden layers, but has been disregarding the modeling strength of structured conditional models, which use global inference to model consistency in the output structure (i.e., the class labels of all nodes in a graph). In this work, we explore the idea that combining simple DNNs with structured conditional models can be an effective and efficient approach for cQA subtasks that offers the best of both worlds.
Our experimental results show that: (i) DNNs already perform very well on the question-question similarity and answer selection subtasks; (ii) strong dependencies exist between the subtasks under study; in particular, answer goodness and question-question relatedness influence answer selection significantly; (iii) the CRFs exploit the dependencies between subtasks, providing sizeably better results that are on par with or above the state of the art. In summary, we demonstrate the effectiveness of this marriage of DNNs and structured conditional models for cQA subtasks, where a feed-forward DNN is first used to build vectors for each individual subtask, which are then "reconciled" in a multitask CRF.

Related Work
Various neural models have been applied to cQA tasks such as question-question similarity (dos Santos et al., 2015; Lei et al., 2016) and answer selection (Wang and Nyberg, 2015; Qiu and Huang, 2015; Tan et al., 2015; Chen and Bunescu, 2017; Wu et al., 2018). Most of this work used advanced neural network architectures based on convolutional neural networks (CNN), long short-term memory (LSTM) units, attention mechanisms, etc. For instance, dos Santos et al. (2015) combined CNN and bag of words for comparing questions. Tan et al. (2015) adopted an attention mechanism over bidirectional LSTMs to generate better answer representations, and Lei et al. (2016) combined recurrent and CNN models for question representation. In contrast, here we use a simple DNN model, i.e., a feed-forward neural network, which we only use to generate task-specific embeddings, and we defer the joint learning with global inference to the structured model.
From the perspective of modeling cQA subtasks as structured learning problems, there is a lot of research trying to exploit the correlations between the comments in a question-comment thread. This has been done from a feature engineering perspective, by modeling a comment in the context of the entire thread, but more interestingly by considering a thread as a structured object, where comments are to be classified as good or bad answers collectively. For example, one line of work treated the answer selection task as a sequence labeling problem and used recurrent convolutional neural networks and LSTMs. Another modeled the relations between pairs of comments at any distance in the thread, and combined the predictions of local classifiers using graph-cut and Integer Linear Programming. A follow-up to the latter also modeled the relations between all pairs of comments in a thread, but using a fully-connected pairwise CRF model, which is a joint model that integrates inference within the learning process using global normalization. Unlike these models, we use DNNs to induce task-specific embeddings, and, more importantly, we perform multitask learning of three different cQA subtasks, thus enriching the relational structure of the graphical model.
We solve the three cQA subtasks jointly, in a multitask learning framework. We do this using the datasets from the SemEval-2016 Task 3 on Community Question Answering, which are annotated for the three subtasks, and we compare against the systems that participated in that competition. In fact, most of these systems did not try to exploit the interaction between the subtasks or did so only as a pipeline. For example, the top two systems, SUPER TEAM (Mihaylova et al., 2016) and KELP (Filice et al., 2016), stacked the predicted labels from two subtasks in order to solve the main answer selection subtask using SVMs. In contrast, our approach is neural, it is based on joint learning and task-specific embeddings, and it is also lighter in terms of features.
In work following the competition, Nakov et al. (2016a) used a triangulation approach to answer ranking in cQA, modeling the three types of similarities occurring in the triangle formed by the original question, the related question, and an answer to the related question. However, theirs is a pairwise ranking model, while we have a joint model. Moreover, they focus on one task only, while we use multitask learning. Bonadiman et al. (2017) proposed a multitask neural architecture where the three tasks are trained together with the same representation. However, they do not model comment-comment interactions in the same question-comment thread, nor do they train task-specific embeddings, as we do.
The general idea of combining DNNs and structured models has been explored recently for other NLP tasks. Collobert et al. (2011) used Viterbi inference to train their DNN models to capture dependencies between word-level tags for a number of sequence labeling tasks: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. An LSTM-CRF framework has also been proposed for such tasks. Ma and Hovy (2016) included a CNN in the framework to compute word representations from character-level embeddings. While these studies consider tasks related to constituents in a sentence, e.g., words and phrases, we focus on methods to represent comments and to model dependencies between comment-level tags. We also experiment with arbitrary graph structures in our CRF model to model dependencies at different levels.

Learning Approach
Let $q$ be a newly-posed question, and let $c^i_m$ denote the $m$-th comment ($m \in \{1, 2, \ldots, M\}$) in the answer thread for the $i$-th potentially related question $q_i$ ($i \in \{1, 2, \ldots, I\}$) retrieved from the forum. We can define three cQA subtasks: (A) classify each comment $c^i_m$ in the thread for question $q_i$ as Good vs. Bad with respect to $q_i$; (B) determine, for each retrieved question $q_i$, whether it is Related to the new question $q$ in the sense that a good answer to $q_i$ might also be a good answer to $q$; and finally, (C) classify each comment $c^i_m$ in each answer thread as either Relevant or Irrelevant with respect to the new question $q$.
Let $y^a_{i,m} \in \{Good, Bad\}$, $y^b_i \in \{Related, Not\text{-}related\}$, and $y^c_{i,m} \in \{Relevant, Irrelevant\}$ denote the corresponding output labels for subtasks A, B, and C, respectively. As argued before, subtask C depends on the other two subtasks. Intuitively, if $c^i_m$ is a good comment with respect to the existing question $q_i$ (subtask A), and $q_i$ is related to the new question $q$ (subtask B), then $c^i_m$ is likely to be a relevant answer to $q$. Similarly, subtask B can benefit from subtask C: if comment $c^i_m$ in the answer thread of $q_i$ is relevant with respect to $q$, then $q_i$ is likely to be related to $q$.
We propose to exploit these inherent correlations between the cQA subtasks as follows: (i) by modeling their interactions in the input representations, i.e., in the feature space of $(q, q_i, c^i_m)$, and more importantly, (ii) by capturing the dependencies between the output variables $(y^a_{i,m}, y^b_i, y^c_{i,m})$. Moreover, we cast each cQA subtask as a structured prediction problem in order to model the dependencies between output variables of the same type. Our intuition is that if two comments $c^i_m$ and $c^i_n$ in the same thread are similar, then they are likely to have the same labels for both subtask A and subtask C, i.e., $y^a_{i,m} \approx y^a_{i,n}$ and $y^c_{i,m} \approx y^c_{i,n}$. Similarly, if two pre-existing questions $q_i$ and $q_j$ are similar, they are also likely to have the same labels, i.e., $y^b_i \approx y^b_j$.
Our framework works in two steps. First, we use a DNN, specifically, a feed-forward NN, to learn task-specific embeddings for the three subtasks, i.e., output embeddings $x^a_{i,m}$, $x^b_i$ and $x^c_{i,m}$ for subtasks A, B and C (Figure 2a). The DNN uses syntactic and semantic embeddings of the input elements, their interactions, and other similarity features between them and, as a by-product, learns the output embeddings for each subtask.
In the second step, a structured conditional model operates on subtask-specific embeddings from the DNNs and captures the dependencies between the subtasks, between existing questions, and between comments for an existing question (Figure 2b). Below, we describe the two steps in detail.
Figure 2: On the left (a), we have three feed-forward neural networks to learn task-specific embeddings for the three cQA subtasks. On the right (b), a global conditional random field (CRF) models intra- and inter-subtask dependencies.

Neural Models for cQA Subtasks
Figure 2a depicts our complete neural framework for the three subtasks. The input is a tuple $(q, q_i, c^i_m)$ consisting of a new question $q$, a retrieved question $q_i$, and a comment $c^i_m$ from $q_i$'s answer thread. We first map the input elements to fixed-length vectors $(z_q, z_{q_i}, z_{c^i_m})$ using their syntactic and semantic embeddings. Depending on the requirements of the subtasks, the network then models the interactions between the inputs by passing their embeddings through non-linear hidden layers $\nu(\cdot)$. Additionally, the network considers pairwise similarity features $\phi(\cdot)$ between two input elements, which go directly to the output layer and also through the last hidden layer. The pairwise features together with the activations at the final hidden layer constitute the task-specific embedding for each subtask $t$:
$$x^t_i = [\nu^t(z_q, z_{q_i}, z_{c^i_m}), \phi^t(q, q_i, c^i_m)]$$
The final layer defines a Bernoulli distribution for each subtask $t \in \{a, b, c\}$:
$$p(y^t_i \mid q, q_i, c^i_m, \theta) = \mathrm{Ber}(y^t_i \mid \mathrm{sig}(w_t^T x^t_i))$$
where $x^t_i$, $w_t$, and $y^t_i$ are the task-specific embedding, the output layer weights, and the prediction variable for subtask $t$, respectively, and $\mathrm{sig}(\cdot)$ refers to the sigmoid function. We train the models by minimizing the cross-entropy between the predicted distribution and the gold labels. The main difference between the models is how they compute the task-specific embeddings $x^t_i$ for subtask $t$.
Neural Model for Subtask A. The feed-forward network for subtask A is shown in the lower part of Figure 2a. To determine whether a comment $c^i_m$ is good with respect to the thread question $q_i$, we model the interactions between $c^i_m$ and $q_i$ by merging their embeddings $z_{c^i_m}$ and $z_{q_i}$, and passing them through a hidden layer:
$$h^a_1 = f(U^a [z_{q_i}, z_{c^i_m}])$$
where $U^a$ is the weight matrix from the inputs to the first hidden units, and $f$ is a non-linear activation function. The activations are then fed to a final subtask-specific hidden layer, which combines these signals with the pairwise similarity features $\phi^a(q_i, c^i_m)$:
$$h^a_2 = f(V^a [h^a_1, \phi^a(q_i, c^i_m)])$$
where $V^a$ is the weight matrix. The task-specific output embedding is formed by merging $h^a_2$ and $\phi^a(q_i, c^i_m)$:
$$x^a_{i,m} = [h^a_2, \phi^a(q_i, c^i_m)]$$
Neural Model for Subtask B. To determine whether an existing question $q_i$ is related to the new question $q$, we model the interactions between $q$ and $q_i$ using their embeddings and pairwise similarity features similarly to subtask A. The upper part of Figure 2a shows the network. The transformation is defined as follows:
$$h^b_1 = f(U^b [z_q, z_{q_i}]), \qquad h^b_2 = f(V^b [h^b_1, \phi^b(q, q_i)])$$
where $U^b$ and $V^b$ are the weight matrices in the first and second hidden layer. The task-specific embedding is formed by
$$x^b_i = [h^b_2, \phi^b(q, q_i)]$$
Neural Model for Subtask C. The network for subtask C is shown in the middle of Figure 2a. To decide if a comment $c^i_m$ in the thread of $q_i$ is relevant to $q$, we consider how related $q_i$ is to $q$, and how useful $c^i_m$ is to answer $q_i$. Again, we model the direct interactions between $q$ and $c^i_m$ using pairwise features $\phi^c(q, c^i_m)$ and a hidden layer
$$h^c_1 = f(U^c [z_q, z_{c^i_m}])$$
where $U^c$ is a weight matrix. We then include a second hidden layer to combine the activations from different inputs and pairwise similarity features. Formally,
$$h^c_2 = f(V^c [h^a_1, h^b_1, h^c_1, \phi^a(q_i, c^i_m), \phi^b(q, q_i), \phi^c(q, c^i_m)])$$
The final task-specific embedding for subtask C is formed as
$$x^c_{i,m} = [h^c_2, \phi^c(q, c^i_m)]$$
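As an illustration of how a task-specific embedding is produced, the following is a minimal sketch of the subtask A network in plain Python. It assumes that "merging" means vector concatenation, that the activation is ReLU, and that bias terms are omitted; the function names, dimensions, and random weights are hypothetical and chosen only to make the example run.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def subtask_a_forward(z_q, z_c, phi, U, V, w):
    """Return (task-specific embedding, P(Good)) for one (question, comment) pair."""
    h1 = relu(matvec(U, z_q + z_c))   # first hidden layer over the merged input embeddings
    h2 = relu(matvec(V, h1 + phi))    # second hidden layer also sees the pairwise features
    x = h2 + phi                      # embedding: hidden activations plus pairwise features
    p_good = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return x, p_good

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions for illustration only).
rng = random.Random(0)
d, n_phi, n_h1, n_h2 = 4, 3, 5, 2
z_q = [rng.random() for _ in range(d)]
z_c = [rng.random() for _ in range(d)]
phi = [rng.random() for _ in range(n_phi)]
U = random_matrix(n_h1, 2 * d, rng)
V = random_matrix(n_h2, n_h1 + n_phi, rng)
w = [rng.uniform(-0.1, 0.1) for _ in range(n_h2 + n_phi)]

x, p_good = subtask_a_forward(z_q, z_c, phi, U, V, w)
```

The vector `x` is what would be handed to the structured model in the second step; the probability `p_good` is only needed when the network is trained or used on its own.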

Joint Learning with Global Inference
One simple way to exploit the interdependencies between the subtask-specific embeddings $(x^a_{i,m}, x^b_i, x^c_{i,m})$ is to precompute the predictions for some subtasks (A and B), and then to use the predictions as features for the other subtask (C). However, as shown later in Section 6, such a pipeline approach propagates errors from one subtask to the subsequent ones. A more robust way is to build a joint model for all subtasks.
We could use the full DNN network in Figure 2a to learn the classification functions for the three subtasks jointly as follows:
$$p(y^a_{i,m}, y^b_i, y^c_{i,m} \mid \theta) = p(y^a_{i,m} \mid \theta_a)\, p(y^b_i \mid \theta_b)\, p(y^c_{i,m} \mid \theta_c) \tag{4}$$
where $\theta = [\theta_a, \theta_b, \theta_c]$ are the model parameters.
However, this has two key limitations: (i) it assumes conditional independence between the subtasks given the parameters; (ii) the scores are normalized locally, which leads to the so-called label bias problem (Lafferty et al., 2001), i.e., the features for one subtask would have no influence on the other subtasks. Thus, we model the dependencies between the output variables by learning (globally normalized) node and edge factor functions that jointly optimize a global performance criterion. In particular, we represent the cQA setting as a large undirected graph $G=(V, E)=(V_a \cup V_b \cup V_c, E_{aa} \cup E_{bb} \cup E_{cc} \cup E_{ac} \cup E_{bc} \cup E_{ab})$. As shown in Figure 2b, the graph contains six subgraphs: $G_a=(V_a, E_{aa})$, $G_b=(V_b, E_{bb})$ and $G_c=(V_c, E_{cc})$ are associated with the three subtasks, while the bipartite subgraphs $G_{ac}=(V_a \cup V_c, E_{ac})$, $G_{bc}=(V_b \cup V_c, E_{bc})$ and $G_{ab}=(V_a \cup V_b, E_{ab})$ connect nodes across tasks.
We associate each node $u \in V_t$ with an input vector $x_u$, representing the embedding for subtask $t$, and an output variable $y_u$, representing the class label for subtask $t$. Similarly, each edge $(u, v) \in E_{st}$ is associated with an input feature vector $\mu(x_u, x_v)$, derived from the node-level features, and an output variable $y_{uv} \in \{1, 2, \ldots, L\}$, representing the state transitions for the pair of nodes. (For notational simplicity, here we do not distinguish between comment and question nodes; rather, we use $u$ and $v$ as general indices.) We define the following joint conditional distribution:
$$p(y \mid \theta, x) = \frac{1}{Z(\theta, x)} \prod_{t \in \tau} \Big[ \prod_{u \in V_t} \psi_n(y_u \mid x, w^t_n) \Big] \prod_{(s,t) \in \tau \times \tau} \Big[ \prod_{(u,v) \in E_{st}} \psi_e(y_{uv} \mid x, w^{st}_e) \Big] \tag{5}$$
where $\tau = \{a, b, c\}$, $\psi_n(\cdot)$ and $\psi_e(\cdot)$ are node and edge factors, respectively, and $Z(\cdot)$ is a global normalization constant. We use log-linear factors:
$$\psi_n(y_u \mid x, w^t_n) = \exp(\sigma(y_u, x)^T w^t_n), \qquad \psi_e(y_{uv} \mid x, w^{st}_e) = \exp(\sigma(y_{uv}, x)^T w^{st}_e)$$
where $\sigma(\cdot)$ is a feature vector derived from the inputs and the labels. This model is essentially a pairwise conditional random field (Murphy, 2012). The global normalization allows CRFs to surmount the label bias problem, allowing them to take long-range interactions into account. The negative log-likelihood of the model in Equation 5 is a convex function of the weights, and thus we can use gradient-based methods to find the global optimum. The gradients of the log-likelihood have the following form:
$$f'(w^t_n) = \sum_{u \in V_t} \sigma(y_u, x) - E[\sigma(y_u, x)], \qquad f'(w^{st}_e) = \sum_{(u,v) \in E_{st}} \sigma(y_{uv}, x) - E[\sigma(y_{uv}, x)]$$
where $E[\sigma(\cdot)]$ is the expected feature vector.
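To make the effect of global normalization concrete, here is a small sketch of a pairwise CRF over binary labels whose partition function is computed by brute-force enumeration. This is only feasible for toy graphs and is not the training procedure used in the paper; the scores, the three-node graph, and the helper name `crf_distribution` are assumptions made for illustration.

```python
import itertools
import math

def crf_distribution(node_scores, edge_scores):
    """Globally normalized distribution over all joint labelings.

    node_scores[u][y] is the log node factor of node u taking label y;
    edge_scores[(u, v)][yu][yv] is the log edge factor for the label pair.
    """
    n = len(node_scores)
    unnorm = {}
    for labels in itertools.product((0, 1), repeat=n):
        s = sum(node_scores[u][labels[u]] for u in range(n))
        s += sum(tab[labels[u]][labels[v]] for (u, v), tab in edge_scores.items())
        unnorm[labels] = math.exp(s)
    Z = sum(unnorm.values())  # a single normalizer shared by all labelings
    return {labels: val / Z for labels, val in unnorm.items()}

# Node 0 plays the role of a subtask A node, node 1 of a subtask B node,
# and node 2 of a subtask C node that has no local evidence of its own.
node_scores = [[0.0, 1.5], [0.0, 1.0], [0.0, 0.0]]
agree = [[0.8, -0.8], [-0.8, 0.8]]            # edges reward label agreement
edge_scores = {(0, 2): agree, (1, 2): agree}  # A-C and B-C connections

joint = crf_distribution(node_scores, edge_scores)
p_c_positive = sum(p for labels, p in joint.items() if labels[2] == 1)
```

Although the C node has uniform node scores, its marginal is pulled toward the positive label by its A and B neighbours through the edge factors, which is exactly the kind of cross-subtask influence that a product of independent, locally normalized classifiers cannot express.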
Training and Inference. Traditionally, CRFs have been trained using offline methods like LBFGS (Murphy, 2012). Online training using first-order methods such as stochastic gradient descent was proposed by Vishwanathan et al. (2006). Since our DNNs are trained with the RMSprop online adaptive algorithm (Tieleman and Hinton, 2012), in order to compare our two models, we use RMSprop to train our CRFs as well. For our CRF models, we use Belief Propagation, or BP, (Pearl, 1988) for inference. BP converges to an exact solution for trees. However, exact inference is intractable for graphs with loops. Despite this, Pearl (1988) advocated for the use of BP in loopy graphs as an approximation. Even though BP only gives approximate solutions, it often works well in practice for loopy graphs (Murphy et al., 1999), outperforming other methods such as mean field (Weiss, 2001).
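The behaviour of BP on trees versus loopy graphs can be illustrated with a short sum-product sketch. It uses a synchronous message schedule and a fixed number of iterations, both of which are assumptions for this example rather than details from the paper; the potentials and graphs are toy values.

```python
import itertools
import math

def loopy_bp(node_pot, edge_pot, n_iters=50):
    """Sum-product BP; node_pot[u][y] and edge_pot[(u, v)][yu][yv] are positive potentials."""
    n_labels = len(node_pot[0])
    neighbors = {u: [] for u in range(len(node_pot))}
    for (u, v) in edge_pot:
        neighbors[u].append(v)
        neighbors[v].append(u)

    def pairwise(u, v, yu, yv):
        # Look up the potential regardless of the direction in which the edge was stored.
        return edge_pot[(u, v)][yu][yv] if (u, v) in edge_pot else edge_pot[(v, u)][yv][yu]

    # msgs[(u, v)][yv] is the message from u to v, initialised uniformly.
    msgs = {(u, v): [1.0] * n_labels for u in neighbors for v in neighbors[u]}
    for _ in range(n_iters):
        new_msgs = {}
        for (u, v) in msgs:
            out = []
            for yv in range(n_labels):
                total = 0.0
                for yu in range(n_labels):
                    incoming = math.prod(msgs[(k, u)][yu] for k in neighbors[u] if k != v)
                    total += node_pot[u][yu] * pairwise(u, v, yu, yv) * incoming
                out.append(total)
            norm = sum(out)
            new_msgs[(u, v)] = [o / norm for o in out]
        msgs = new_msgs

    beliefs = []
    for u in neighbors:
        b = [node_pot[u][y] * math.prod(msgs[(k, u)][y] for k in neighbors[u])
             for y in range(n_labels)]
        total = sum(b)
        beliefs.append([bi / total for bi in b])
    return beliefs

def exact_marginals(node_pot, edge_pot):
    """Brute-force node marginals, usable only for very small graphs."""
    n, n_labels = len(node_pot), len(node_pot[0])
    marg = [[0.0] * n_labels for _ in range(n)]
    for labels in itertools.product(range(n_labels), repeat=n):
        w = math.prod(node_pot[u][labels[u]] for u in range(n))
        w *= math.prod(tab[labels[u]][labels[v]] for (u, v), tab in edge_pot.items())
        for u in range(n):
            marg[u][labels[u]] += w
    return [[m / sum(row) for m in row] for row in marg]

node_pot = [[1.0, 3.0], [1.0, 1.0], [2.0, 1.0]]
agree = [[2.0, 1.0], [1.0, 2.0]]
chain = {(0, 1): agree, (1, 2): agree}   # a tree: BP is exact here
triangle = dict(chain)
triangle[(0, 2)] = agree                 # adds a loop: BP is only approximate

bp_chain = loopy_bp(node_pot, chain)
exact_chain = exact_marginals(node_pot, chain)
bp_triangle = loopy_bp(node_pot, triangle)
```

On the chain, `bp_chain` agrees with `exact_chain` up to floating-point error, while on the triangle the beliefs are valid distributions that only approximate the true marginals.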

Variations of Graph Structures.
A crucial advantage of our CRFs is that we can use arbitrary graph structures, which allows us to capture dependencies between different types of variables: (i) intra-subtask, for variables of the same subtask, e.g., $y^b_i$ and $y^b_j$ in Figure 2b, and (ii) across-subtask, for variables of different subtasks.
For intra-subtask, we explore null (i.e., no connection between nodes) and fully-connected relations. For subtasks A and C, the intra-subtask connections are restricted to the nodes inside a thread, e.g., we do not connect $y^c_{i,m}$ and $y^c_{j,m}$ in Figure 2b. For across-subtask, we explored three types of connections depending on the subtasks involved: (i) null or no connection between subtasks, (ii) 1:1 connection for A-C, where the corresponding nodes of the two subtasks in a thread are connected, e.g., $y^a_{i,m}$ and $y^c_{i,m}$ in Figure 2b, and (iii) M:1 connection to B, where we connect all the nodes of C or A to the thread-level B node. Each configuration of intra- and across-connections yields a different CRF model. Figure 2b shows one such model for two threads each containing two comments, where all subtasks have fully-connected intra-subtask links, 1:1 connection for A-C, and M:1 for C-B and A-B.
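The graph configurations described above can be enumerated programmatically. The sketch below builds the edge sets for one new question; the node encoding as tuples and the function name `build_edges` are hypothetical, and the three boolean switches are a simplification of the configurations explored in the paper.

```python
from itertools import combinations

def build_edges(n_threads, n_comments, intra_full=True, a_c=True, to_b=True):
    """Return edge lists keyed by edge type for one new question.

    Nodes are tuples: ('a', i, m) and ('c', i, m) for comment-level variables,
    ('b', i) for the thread-level relatedness variable.
    """
    edges = {"aa": [], "bb": [], "cc": [], "ac": [], "ab": [], "bc": []}
    for i in range(n_threads):
        a_nodes = [("a", i, m) for m in range(n_comments)]
        c_nodes = [("c", i, m) for m in range(n_comments)]
        if intra_full:
            # Fully connected, but only among nodes of the same thread.
            edges["aa"] += list(combinations(a_nodes, 2))
            edges["cc"] += list(combinations(c_nodes, 2))
        if a_c:
            # 1:1 links between the A and C variables of the same comment.
            edges["ac"] += list(zip(a_nodes, c_nodes))
        if to_b:
            # M:1 links from every comment-level node to the thread's B node.
            edges["ab"] += [(a, ("b", i)) for a in a_nodes]
            edges["bc"] += [(("b", i), c) for c in c_nodes]
    if intra_full:
        edges["bb"] = list(combinations([("b", i) for i in range(n_threads)], 2))
    return edges

# The configuration of Figure 2b: two threads with two comments each.
edges = build_edges(n_threads=2, n_comments=2)
edge_counts = {kind: len(pairs) for kind, pairs in edges.items()}
```

Switching off `intra_full` while keeping `a_c` and `to_b` gives a graph with across-subtask connections only.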

Features for the DNN Models
We have two types of features: (i) input embeddings, for $q$, $q_i$ and $c^i_m$, and (ii) pairwise features, for $(q, q_i)$, $(q, c^i_m)$, and $(q_i, c^i_m)$; see Figure 2a.

Input Embeddings
We use three types of pre-trained vectors to represent a question ($q$ or $q_i$) or a comment ($c^i_m$):
GOOGLE VECTORS. 300-dimensional embedding vectors, trained on 100 billion words from Google News (Mikolov et al., 2013). The embedding for a question (or comment) is the average of the word embeddings it is composed of.
SYNTAX. We parse the question (or comment) using the Stanford neural parser (Socher et al., 2013), and we use the final 25-dimensional vector produced internally as a by-product of parsing.
QL VECTORS. We use fine-tuned word embeddings pretrained on all the available in-domain Qatar Living data.
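Averaging word vectors into a single text embedding is simple enough to sketch directly. The vocabulary below is a two-dimensional toy stand-in for the real pre-trained vectors, and skipping out-of-vocabulary tokens is an assumption, since the paper does not say how they are handled.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return a zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical toy vocabulary, used only for illustration.
toy_vectors = {"visa": [1.0, 0.0], "extend": [0.0, 1.0], "family": [1.0, 1.0]}
z_q = average_embedding("how can i extend a family visit visa".split(), toy_vectors, dim=2)
```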

Pairwise Features
BLEU COMPONENTS. We further use various components involved in the computation of BLEU: n-gram precisions, n-gram matches, total number of n-grams (n=1,2,3,4), lengths of the hypotheses and of the reference, length ratio between them, and BLEU's brevity penalty.
QUESTION-COMMENT RATIO. (1) question-to-comment count ratio in terms of sentences/tokens/nouns/verbs/adjectives/adverbs/pronouns; (2) question-to-comment count ratio of words that are not in WORD2VEC's Google News vocabulary.
META FEATURES. (1) is the person answering the question the one who asked it; (2) reciprocal rank of comment $c^i_m$ in the thread of $q_i$, i.e., $1/m$; (3) reciprocal rank of $c^i_m$ in the list of comments for $q$, i.e., $1/[m + 10 \times (i - 1)]$; and (4) reciprocal rank of question $q_i$ in the list for $q$, i.e., $1/i$.
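Two of the feature groups above lend themselves to a short sketch: the BLEU components and the reciprocal-rank meta features. The code uses the standard clipped n-gram counts and brevity penalty; which text plays the role of hypothesis and which of reference is not specified here, so that pairing is an assumption, as are the function and feature names.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def bleu_components(hyp, ref, max_n=4):
    """Per-order n-gram matches, totals and precisions, plus length-based features."""
    feats = {}
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum((h & r).values())   # clipped n-gram matches
        total = sum(h.values())
        feats[f"match_{n}"] = matches
        feats[f"total_{n}"] = total
        feats[f"prec_{n}"] = matches / total if total else 0.0
    feats["hyp_len"], feats["ref_len"] = len(hyp), len(ref)
    feats["len_ratio"] = len(hyp) / len(ref) if ref else 0.0
    if not hyp:
        feats["brevity_penalty"] = 0.0
    elif len(hyp) > len(ref):
        feats["brevity_penalty"] = 1.0
    else:
        feats["brevity_penalty"] = math.exp(1 - len(ref) / len(hyp))
    return feats

def rank_features(i, m, thread_size=10):
    """Reciprocal ranks for the m-th comment of the i-th retrieved question (both 1-based)."""
    return {
        "rr_in_thread": 1.0 / m,
        "rr_in_list": 1.0 / (m + thread_size * (i - 1)),
        "rr_question": 1.0 / i,
    }

components = bleu_components("a b c d".split(), "a b c e".split())
ranks = rank_features(i=2, m=3)
```

The default `thread_size=10` mirrors the factor of 10 in the third meta feature, which corresponds to ten comments per retrieved thread.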

Data and Settings
We experiment with the data from SemEval-2016 Task 3. Consistent with our notation from Section 3, it features three subtasks: subtask A (i.e., whether a comment $c^i_m$ is a good answer to the question $q_i$ in the thread), subtask B (i.e., whether the retrieved question $q_i$ is related to the new question $q$), and subtask C (i.e., whether the comment $c^i_m$ is a relevant answer for the new question $q$). Note that the two main subtasks we are interested in are B and C.
DNN Setting. We preprocess the data using min-max scaling. We use RMSprop for learning, with parameters set to the values suggested by Tieleman and Hinton (2012). We use up to 100 epochs with patience of 25, rectified linear units (ReLU) as activation functions, $\ell_2$ regularization on weights, and dropout (Srivastava et al., 2014) of hidden units. See Table 1 for more detail.
CRF Setting. For the CRF model, we initialize the node-level weights from the output layer weights of the DNNs, and we set the edge-level weights to 0. Then, we train using RMSprop with loopy BP. We regularize the node parameters according to the best settings of the DNN: 0.001, 0.05, and 0.0001 for A, B, and C, respectively.
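For readers unfamiliar with the two preprocessing and optimization ingredients mentioned above, here is a minimal sketch of min-max scaling and of a single RMSprop update. The learning rate, decay rate, and epsilon are commonly used defaults assumed for this example, not values reported in the paper, and the quadratic objective is a toy check.

```python
import math

def min_max_scale(column):
    """Rescale one feature column to [0, 1]; constant columns map to 0."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update; returns the new weights and the new squared-gradient average."""
    new_cache = [rho * c + (1 - rho) * g * g for c, g in zip(cache, grad)]
    new_w = [wi - lr * g / (math.sqrt(c) + eps) for wi, g, c in zip(w, grad, new_cache)]
    return new_w, new_cache

scaled = min_max_scale([2.0, 4.0, 6.0])

# Toy check: minimise f(w) = (w - 3)^2 starting from w = 0.
w, cache = [0.0], [0.0]
for _ in range(5000):
    grad = [2 * (w[0] - 3.0)]
    w, cache = rmsprop_step(w, grad, cache, lr=0.01)
```

Because each step is divided by a running root-mean-square of recent gradients, the effective step size stays close to the learning rate, so the iterate ends up hovering near the optimum rather than converging exactly.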

Results and Discussion
Below, we first present the evaluation results using DNN models (Section 6.1). Then, we discuss the performance of the joint models (Section 6.2).

Results for the DNN Models
Table 2 shows the results for our individual DNN models (rows in boldface) for subtasks A, B and C on the TEST set. We report three ranking-based measures that are commonly accepted in the IR community: mean average precision (MAP), which was the official evaluation measure of SemEval-2016, average recall (AvgRec), and mean reciprocal rank (MRR).
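As a reminder of how two of the reported ranking measures are computed, the following sketch implements MAP and MRR over binary relevance labels listed in ranked order. It is a textbook version and may differ in details from the official SemEval scorer (for instance, in how queries without any relevant item are treated); AvgRec is omitted because its exact definition is not given here.

```python
def average_precision(labels):
    """labels: 0/1 relevance judgments in ranked order for one query."""
    hits, total = 0, 0.0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision at each rank where a relevant item occurs
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    for k, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def mean_over_queries(metric, rankings):
    return sum(metric(r) for r in rankings) / len(rankings)

# Two toy queries with three ranked candidates each.
rankings = [[1, 0, 1], [0, 1, 0]]
map_score = mean_over_queries(average_precision, rankings)
mrr_score = mean_over_queries(reciprocal_rank, rankings)
```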
For each subtask, we show two baselines and the results of the top-2 systems at SemEval. The first baseline is a random ordering of the questions/comments, assuming no knowledge about the subtask. The second baseline keeps the chronological order of the comments for subtask A, of the question ranking from the IR engine for subtask B, and both for subtask C.
We can see that the individual DNN models for subtasks B and C are very competitive, falling between the first and the second best at SemEval-2016. For subtask A, our model is weaker, but, as we will see below, it can help improve the results for subtasks B and C, which are our focus here.
Looking at the results for subtask C, we can see that sizeable gains are possible when using gold labels for subtasks A and B as features to DNN$_C$, e.g., adding gold A labels yields +6.90 MAP points. Similarly, using gold labels for subtask B adds +2.05 MAP points absolute. Moreover, the gain is cumulative: using the two gold labels together yields +9.25 MAP points. The same behavior is observed for the other evaluation measures. Of course, as we use gold labels, this is an upper bound on performance, but it justifies our efforts towards a joint multitask learning model.

Results for the Joint Model
Below we discuss the evaluation results for the joint model. We focus on subtasks B and C, which are the main target of our study.
Results for Subtask C. Table 3 compares several variants of the CRF model for joint learning, which we described in Section 3.2 above.
Row 1 shows the results for our individual DNN$_C$ model. The following rows 2-4 present a pipeline approach, where we first predict labels for subtasks A and B and then we add these predictions as features to DNN$_C$. This is prone to error propagation, and improvements are moderate and inconsistent across the evaluation measures.
The remaining rows correspond to variants of our CRF model with different graph structures. Overall, the improvements over DNN$_C$ are more sizeable than for the pipeline approach (with one single exception out of 24 cases); they are also more consistent across the evaluation measures, and the improvements in MAP over the baseline range from +0.96 to +1.76 points absolute.
Rows 5-8 show the impact of adding connections to subtasks A and B when solving subtask C (see Figure 2b). Interestingly, we observe the same pattern as with the gold labels: the A-C and B-C connections help individually and in combination, with A-C being more helpful. Yet, further adding A-B does not improve the results (row 8).
Note that the locally normalized joint model in Eq. 4 yields much lower results than the globally normalized CRF$_{all}$ (row 8): 54.32, 59.87, and 61.76 in MAP, AvgRec and MRR (figures not included in the table for brevity). This evinces the problems with the conditional independence assumption and the local normalization in the model.
Finally, rows 9-12 explore variants of the best system from the previous set (row 7), which has connections between subtasks only. These rows show the results when using subgraphs for A, B and C that are fully connected (i.e., for all pairs). We can see that none of these variants yields improvements over the model from row 7, i.e., the fine-grained relations between comments in the threads and between the different related questions do not seem to help solve subtask C in the joint model. Note that our scores from row 7 are better than the best results achieved by a system at SemEval-2016 Task 3 subtask C: 56.00 vs. 55.41 on MAP, and 63.25 vs. 61.48 on MRR.
Results for Subtask B. Next, we present in Table 4 similar experiments, but this time with subtask B as the target, and we show some more measures (accuracy, precision, recall, and F$_1$).
Given the insights from Table 2 (where we used gold labels), we did not expect to see much improvement for subtask B. Indeed, as rows 2-4 show, using the pipeline approach, the IR measures are basically unaltered. However, classification accuracy improves by almost one point absolute, recall is also higher (trading for lower precision), and F$_1$ is better by a sizeable margin.
Coming to the joint models (rows 6-9), we can see that the IR measures improve consistently over the pipeline approach, even though not by much. The effect on accuracy-P-R-F$_1$ is the same as observed with the pipeline approach but with larger differences. In particular, accuracy improves by more than two points absolute, and recall increases, which boosts F$_1$ to almost 60.
Row 5 is a special case where we only consider subtask B, but we do the learning and the inference over the set of ten related questions, exploiting their relations. This yields a slight increase in all measures; more importantly, it is crucial for obtaining better results with the joint models.
Rows 6-9 show results for several variants of the A-C and B-C architecture with fully connected B nodes, varying the fine-grained connections of the A and C nodes. The best results are in this block, with increases over DNN$_B$ in MAP (+0.61), AvgRec (+0.69) and MRR (+1.05), and especially in accuracy (+2.18) and F$_1$ (+11.25 points). This is remarkable given the low expectation we had about improving subtask B.
Note that the best architecture for subtask C from Table 3 (A-C and B-C with no fully connected B layer) does not yield good results for subtask B. We speculate that subtask B is overlooked by the architecture, which has many more connections and parameters on the nodes for subtasks A and C (ten comments are to be classified for each of subtasks A and C, while only one decision is to be made for the related question B). Finally, note that our best results for subtask B are also slightly better than those for the best system at SemEval-2016 Task 3, especially on MRR.
Table 4: Performance of the pipeline and of the joint models on subtask B (best results in boldface).

Conclusion
We have presented a framework for multitask learning of two community Question Answering problems: question-question relatedness and answer selection. We further used a third, auxiliary one, i.e., finding the good comments in a question-comment thread. We proposed a two-step framework based on deep neural networks and structured conditional models, with a feed-forward neural network to learn task-specific embeddings, which are then used in a pairwise CRF as part of a multitask model for all three subtasks.
The DNN model has its strength in generating compact embedded representations for the subtasks by modeling interactions between different input elements. On the other hand, the CRF is able to perform global inference over arbitrary graph structures accounting for the dependencies between subtasks to provide globally good solutions. The experimental results have demonstrated the suitability of combining the two approaches. The DNNs alone already yielded competitive results, but the CRF was able to exploit the task-specific embeddings and the dependencies between subtasks to improve the results consistently across a variety of evaluation metrics, yielding state-of-the-art results.
In future work, we plan to model text complexity (Mihaylova et al., 2016), veracity, speech act (Joty and Hoque, 2016), user profile (Mihaylov et al., 2015), trollness (Mihaylov et al., 2018), and goodness polarity (Mihaylov et al., 2017). From a modeling perspective, we want to strongly couple CRF and DNN, so that the global errors are backpropagated from the CRF down to the DNN layers. It would also be interesting to extend the framework to a cross-domain (Shah et al., 2018) or a cross-language setting (Da San Martino et al., 2017). Trying an ensemble of neural networks with different initial seeds is another possible research direction.