Multi-task Attention-based Neural Networks for Implicit Discourse Relationship Representation and Identification

We present a novel multi-task attention-based neural network model to address implicit discourse relationship representation and identification through two types of representation learning: an attention-based neural network for learning the discourse relationship representation of two arguments, and a multi-task framework for learning knowledge from annotated and unannotated corpora. Extensive experiments were performed on two benchmark corpora (i.e., the PDTB and CoNLL-2016 datasets). Experimental results show that our proposed model outperforms the state-of-the-art systems on both benchmark corpora.


Introduction
The task of implicit discourse relation (or rhetorical relation) identification is to recognize how two adjacent text spans without an explicit discourse marker between them (i.e., a connective such as because or but) are logically connected to one another (e.g., by cause or contrast). It is considered a crucial step for discourse analysis and language generation, and it is helpful to many downstream NLP applications, e.g., QA, MT, sentiment analysis, and machine comprehension.
With the release of PDTB 2.0 (Prasad et al., 2008), much work has been done on discourse relation identification on natural (i.e., genuine) discourse data (Pitler et al., 2009; Lin et al., 2009; Wang et al., 2010; Zhou et al., 2010; Braud and Denis, 2015; Fisher and Simmons, 2015) using traditional, linguistically informed NLP features and machine learning algorithms. Recently, more and more researchers have turned to neural networks for implicit discourse recognition (Zhang et al., 2015; Qin et al., 2016a; Liu and Li, 2016; Braud and Denis, 2016; Wu et al., 2016). Meanwhile, to alleviate the shortage of labeled data, researchers have explored multi-task learning with the aid of unannotated data for implicit discourse recognition, either in a traditional machine learning framework (Collobert and Weston, 2008; Lan et al., 2013) or, more recently, in a neural network framework (Wu et al., 2016).
In this work, we present a novel multi-task attention-based neural network to address implicit discourse relationship representation and recognition. It performs two types of representation learning at the same time. An attention-based neural network learns the discourse relationship representation through interaction between the two discourse arguments. Meanwhile, a multi-task learning framework leverages knowledge from an auxiliary task to enhance the performance of the main task. These two types of learning are integrated into one neural network framework and work together to maximize overall performance.
The contributions of this work are listed as follows.
• We propose a multi-task attention-based neural network model to address implicit discourse relationship representation and recognition, which benefits from both the interaction between discourse arguments and the interaction between different learning tasks;
• Our method achieves the best results on two benchmark corpora in comparison with the state-of-the-art systems so far.
The organization of this work is as follows. Section 2 describes the proposed multi-task neural network. Section 3 introduces the experimental settings in detail. Section 4 reports the comprehensive experimental results on two benchmark corpora. Section 5 summarizes related work. Finally, Section 6 concludes this work.

Multi-task Attention-based Neural Network Models

Motivation
The idea of learning two types of interactive knowledge, from arguments and from multiple tasks, is motivated by the following observations and analysis.
On the one hand, to recognize discourse relationships, our system needs to understand the meaning of each argument and infer the discourse sense conveyed between the two arguments (denoted as Arg-1 and Arg-2). Learning the semantic representation of each argument (sentence) has been studied with many neural network models and their variants (e.g., CNN, RNN, LSTM, Bi-LSTM, etc.). However, the complicated and varied types of discourse relationships between arguments may not be captured by simply summing up or concatenating two argument representations. We analyzed discourse with the contrast relationship and found that the contrast information may come from different parts of a sentence, e.g., tenses (previous vs. now), entities (their vs. our), or even the whole arguments. Therefore, in order to learn the relationship representation between two arguments, we propose an attention mechanism that can select the most important parts of the two arguments and perform information interaction between them.
On the other hand, one common issue in implicit discourse relationship identification is the lack of labeled data. In this work, we argue that relevant information from unlabelled data can be helpful, and we present a novel multi-task learning framework. In contrast with previous multi-task learning frameworks in traditional machine learning, we improve the multi-task learning framework with representation learning for better discourse relationship representation.
Inspired by the above considerations, we present a novel multi-task attention-based neural network model by integrating attention mechanism with multi-task learning for information interaction between arguments and between tasks.

Learning Discourse Representation
To learn the semantic representation of each argument in a discourse, many neural network models and their variants have been proposed, such as the convolutional neural network (CNN) and the recurrent neural network (RNN). As a variant of the RNN, the long short-term memory (LSTM) network specifically addresses the issue of learning long-term dependencies and is good at modeling a sequence of words while taking contextual information into account. Therefore, in this work we adopt LSTM to model each discourse argument. First, through the embedding layer, we associate each word $w$ in the vocabulary with a vector representation $x_w \in \mathbb{R}^{d_w}$. Let $x^1_i$ ($x^2_i$) be the $i$-th word vector in Arg-1 (Arg-2); then the two discourse arguments are represented as:

$$\text{Arg-1}: [x^1_1, x^1_2, \cdots, x^1_{L_1}] \qquad (1)$$
$$\text{Arg-2}: [x^2_1, x^2_2, \cdots, x^2_{L_2}] \qquad (2)$$

where Arg-1 (Arg-2) has $L_1$ ($L_2$) words.

LSTM for Argument Representation
Given the word representations of the argument $[x_1, x_2, \cdots, x_L]$ as the input sequence, an LSTM computes the state sequence $[h_1, h_2, \cdots, h_L]$ for each time step $i$ using the following formulation:

$$
\begin{aligned}
i_i &= \sigma(W_i [x_i, h_{i-1}] + b_i) \\
f_i &= \sigma(W_f [x_i, h_{i-1}] + b_f) \\
o_i &= \sigma(W_o [x_i, h_{i-1}] + b_o) \\
\tilde{c}_i &= \tanh(W_c [x_i, h_{i-1}] + b_c) \\
c_i &= i_i \odot \tilde{c}_i + f_i \odot c_{i-1} \\
h_i &= o_i \odot \tanh(c_i)
\end{aligned}
\qquad (3)
$$

where $[\,\cdot\,]$ denotes the concatenation of vectors, $\sigma$ denotes the sigmoid function, and $\odot$ denotes the element-wise product. The $W$ and $b$ terms are trainable weights and biases, and $i_i$, $f_i$, $o_i$ and $c_i$ denote the input gate, forget gate, output gate and memory cell, respectively.

Moreover, we also use a bidirectional LSTM (Bi-LSTM), which captures the context from both the past and the future, whereas an LSTM only considers the context from the past. Therefore, at each position $i$ of the sequence, we obtain two states, $\overrightarrow{h}_i$ from the forward pass and $\overleftarrow{h}_i$ from the backward pass, which together form the state $h_i$. After that, we sum up the sequence states $[h_1, h_2, \cdots, h_L]$ to get the representations of Arg-1 and Arg-2, respectively:

$$R_{Arg_1} = \sum_{i=1}^{L_1} h^1_i, \qquad R_{Arg_2} = \sum_{i=1}^{L_2} h^2_i$$

Finally, we concatenate the two argument representations $R_{Arg_1}$ and $R_{Arg_2}$ as the argument pair representation, i.e., $R_{pair} = [R_{Arg_1}, R_{Arg_2}]$.
Clearly, in this way, there is no correlation or interaction between the two arguments. That is, whatever type of discourse relationship they hold, each of the two halves of the argument pair representation $R_{pair}$ is computed from one argument alone, independently of the other.
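The sum-and-concatenate baseline described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the hidden states are given as toy lists of floats rather than produced by a trained (Bi-)LSTM, and the function names `sum_states` and `pair_representation` are hypothetical, not taken from the paper.

```python
from typing import List

Vector = List[float]


def sum_states(states: List[Vector]) -> Vector:
    """Sum a sequence of hidden states element-wise (R_Arg = sum_i h_i)."""
    dim = len(states[0])
    return [sum(h[k] for h in states) for k in range(dim)]


def pair_representation(states_arg1: List[Vector],
                        states_arg2: List[Vector]) -> Vector:
    """Concatenate the two summed representations: R_pair = [R_Arg1, R_Arg2]."""
    return sum_states(states_arg1) + sum_states(states_arg2)


# Toy hidden states (assumed values, 2-dimensional for readability).
h_arg1 = [[1.0, 0.0], [0.5, 2.0]]
h_arg2 = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

r_pair = pair_representation(h_arg1, h_arg2)
```

Changing `h_arg2` leaves the first half of `r_pair` untouched, which is the lack of interaction between arguments that the attention mechanism is meant to address.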

Attention Neural Network for Relationship Representation
In order to effectively capture the complicated and varied types of relationships between arguments, we propose a novel attention-based neural network model, shown in Figure 2.
To do so, we first compute the match between $R_{Arg_1}$ ($R_{Arg_2}$) and each state $h^2_i$ ($h^1_i$) of Arg-2 (Arg-1) by taking the inner product followed by a softmax:

$$p^1_i = \mathrm{Softmax}(R_{Arg_2}^{\top} h^1_i), \qquad p^2_i = \mathrm{Softmax}(R_{Arg_1}^{\top} h^2_i)$$

where $\mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. Here $p$ is an attention (probability) vector over the inputs and can be viewed as a set of word weights measuring how much attention our model should pay to each word. It is worth noting that $p^1$ and $p^2$ are determined by $R_{Arg_2}$ and $R_{Arg_1}$ respectively, which means the representation of one argument depends on the representation of the other.

Next, we sum over the states $h_i$ weighted by the attention vector $p$ to compute the new representations for Arg-1 and Arg-2, respectively:

$$R'_{Arg_1} = \sum_{i=1}^{L_1} p^1_i h^1_i, \qquad R'_{Arg_2} = \sum_{i=1}^{L_2} p^2_i h^2_i$$

The representation of Arg-2 ($R_{Arg_2}$) is used to compute the weights of the words in Arg-1 (i.e., $p^1$), and $R_{Arg_1}$ is used to compute the weights of the words in Arg-2 (i.e., $p^2$). In this way, the new representations of the two arguments interact with each other. This attention mechanism therefore enables our model to focus on specific spans in the two arguments, which is crucial for recognizing discourse relations.

We then concatenate $R'_{Arg_1}$ and $R'_{Arg_2}$ to get the argument pair representation $R_{pair} = [R'_{Arg_1}, R'_{Arg_2}]$. Finally, we feed the argument pair vector $R_{pair}$ to a fully-connected softmax layer, which outputs the probabilities of the different classes for the classification task. We choose the cross-entropy loss between the outputs of the softmax layer and the ground-truth class labels as our loss function.
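The cross-argument attention described above can be illustrated with a small standard-library sketch. It is not the authors' implementation: the hidden states are toy values instead of Bi-LSTM outputs, and the helper names (`attend`, `attentive_pair`) are hypothetical. Each argument's states are scored against the summed representation of the other argument, normalized with a softmax, and used as weights for a new representation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_states(states: List[Vector]) -> Vector:
    dim = len(states[0])
    return [sum(h[k] for h in states) for k in range(dim)]


def attend(states: List[Vector], query: Vector) -> Tuple[List[float], Vector]:
    """Weight each state by softmax(query . h_i) and return (weights, weighted sum)."""
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    new_rep = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, new_rep


def attentive_pair(states1: List[Vector], states2: List[Vector]):
    """Return (p1, p2, R_pair), where each argument attends using the other's summary."""
    r1, r2 = sum_states(states1), sum_states(states2)
    p1, new_r1 = attend(states1, r2)  # weights of Arg-1 words depend on Arg-2
    p2, new_r2 = attend(states2, r1)  # weights of Arg-2 words depend on Arg-1
    return p1, p2, new_r1 + new_r2


# Toy hidden states (assumed values).
h_arg1 = [[1.0, 0.0], [0.0, 1.0]]
h_arg2 = [[0.0, 2.0]]
p1, p2, r_pair = attentive_pair(h_arg1, h_arg2)
```

In this toy case the second state of Arg-1 aligns with the summary of Arg-2, so it receives the larger attention weight in `p1`.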

Multi-task Attention-based Neural Networks
The model presented in Section 2.2 can perform implicit discourse relation recognition by itself. However, as with many deep learning models, one big issue is the lack of labeled data. Therefore, we propose a multi-task attention-based neural network that integrates the aforementioned model into a multi-task learning framework, addressing implicit discourse relation recognition with the aid of a large amount of unlabelled data. Figure 3 shows the general framework of our proposed multi-task attention-based neural network model.
Figure 3: The framework of our proposed multi-task attention-based neural network model.
We use the aforementioned attention-based neural network to map the argument pair into a low-dimensional vector ($R_{pair}$), denoted as the Arg Pair representation component in Figure 3. Under the multi-task learning framework, the parameters of the Arg Pair representation components are shared between the main task and the auxiliary tasks. We denote by $R_{main}$ and $R_{aux}$ the representations of the argument pair for the main and auxiliary tasks, respectively. We add a hidden layer after $R_{main}$ and $R_{aux}$ to learn task-specific representations, followed by softmax layers used to compute the loss of the main task ($\mathrm{Loss}_{main}$) and the loss of the auxiliary task ($\mathrm{Loss}_{aux}$), respectively.
Regarding the strategy for sharing knowledge learned from the auxiliary task with the main task, we propose the following three methods.

Equal Share
A simple and straightforward way is to share the knowledge learned from the main task and the auxiliary task equally. The total loss of the multi-task neural network is then calculated as:

$$\mathrm{Loss} = \mathrm{Loss}_{main} + \mathrm{Loss}_{aux}$$

where $\mathrm{Loss}_{aux}$ has the same weight as $\mathrm{Loss}_{main}$.

Weighted Share
Another method is to give different weights to the main and auxiliary tasks:

$$\mathrm{Loss} = \mathrm{Loss}_{main} + w \cdot \mathrm{Loss}_{aux}$$

where $w \in (0, 1]$ is a weight parameter. Clearly, a lower value of $w$ gives less importance to the auxiliary task.
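The equal and weighted sharing strategies can be sketched together in a few lines of Python. This assumes the weighted total loss has the form Loss_main + w * Loss_aux, with equal sharing as the special case w = 1; the function name `multitask_loss` and the loss values are hypothetical.

```python
def multitask_loss(loss_main: float, loss_aux: float, w: float = 1.0) -> float:
    """Combine main and auxiliary losses; w = 1.0 corresponds to equal sharing.

    Assumed form: Loss = Loss_main + w * Loss_aux with w in (0, 1].
    """
    if not 0.0 < w <= 1.0:
        raise ValueError("w must lie in (0, 1]")
    return loss_main + w * loss_aux


# Toy loss values (assumed, for illustration only).
equal_share = multitask_loss(0.8, 1.2)             # auxiliary task weighted like the main task
weighted_share = multitask_loss(0.8, 1.2, w=0.25)  # auxiliary task down-weighted
```

With a small `w`, the gradient contribution of the auxiliary task shrinks accordingly, so noisy synthetic labels influence the shared parameters less.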

Sigmoid (Gated) Interaction
The above two ways of sharing knowledge involve no deep interaction between the main and auxiliary tasks. They only combine equal or weighted contributions from the tasks into the final result. Therefore, we propose a model that performs interaction between tasks, shown in Figure 4. We introduce two important parameters, $W_{inter} \in \mathbb{R}^{d_{pair} \times d_{pair}}$ and $b_{inter} \in \mathbb{R}^{d_{pair}}$ (where $d_{pair}$ is the length of the argument pair representation vector $R_{pair}$), to realize the interaction between the main and auxiliary tasks. As given in Formulas (17) and (18), the new argument pair representation $R_{main}$ is updated by the combination of $W_{inter}$ and $R_{aux}$ using a sigmoid function.
$W_{inter}$ and the sigmoid function ($\sigma$) work together to let information interact between the two tasks and to select useful, relevant information from the opposite task. Clearly, $W_{inter}$ is a parameter to be trained. This mechanism acts as a gate that determines how much information passes through to the final result. Therefore, under the multi-task framework with the gated mechanism, the main and auxiliary tasks not only share the parameters for learning the argument pair representation but also let their learned representations interact with each other.
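To make the gating idea concrete, here is a minimal sketch of one possible gated update. The exact form of the update is an assumption, since the formulas are not reproduced here: the sketch scales one task's representation element-wise by a sigmoid gate computed from the other task's representation, i.e., r_own * sigmoid(W_inter r_other + b_inter). The function names and toy values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(w_inter: Matrix, b_inter: Vector, r_other: Vector) -> Vector:
    """Compute sigmoid(W_inter r_other + b_inter), one gate value per dimension."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, r_other)) + b)
        for row, b in zip(w_inter, b_inter)
    ]


def gated_update(r_own: Vector, r_other: Vector,
                 w_inter: Matrix, b_inter: Vector) -> Vector:
    """Assumed update: scale r_own element-wise by a gate driven by r_other."""
    g = gate(w_inter, b_inter, r_other)
    return [r * gk for r, gk in zip(r_own, g)]


# Toy values (assumed): with zero weights and biases every gate is 0.5.
r_main = [2.0, -4.0]
r_aux = [1.0, 3.0]
w_zero = [[0.0, 0.0], [0.0, 0.0]]
b_zero = [0.0, 0.0]
r_main_new = gated_update(r_main, r_aux, w_zero, b_zero)
```

In training, `w_inter` and `b_inter` would be learned, so the gate can open or close individual dimensions depending on what the other task's representation contains.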

Parameter Learning
We

Datasets
We adopted three corpora: the PDTB 2.0 and CoNLL-2016 datasets are annotated and used for evaluating discourse relation recognition, and the BLLIP corpus is unlabeled and used for the auxiliary task.
PDTB 2.0 is the largest annotated corpus of discourse relations and contains 2,312 Wall Street Journal (WSJ) articles. The sense labels of discourse relations are hierarchical, with three levels: class, type and sub-type. The top level contains four major semantic classes: Comparison (denoted as Comp.), Contingency (Cont.), Expansion (Exp.) and Temporal (Temp.). For each class, a set of types is used to refine the relation sense. The set of subtypes further specifies the semantic contribution of each argument. We focus on the top-level (class) relations. Following Pitler et al. (2009), we used sections 2-20 as the training set, sections 21-22 as the test set, and sections 0-1 as the development set.
The CoNLL-2016 Shared Task focuses on shallow discourse parsing and provides two test datasets: one from PDTB section 23, denoted as the CoNLL-Test set, and the other from a similar source and domain (English Wikinews), denoted as the CoNLL-Blind test set. Different from the sense labels in PDTB, the CoNLL-Test set has three sense levels and the EntRel label. Moreover, it merges several labels of the original annotation to reduce sparsity without losing too much of the utility and semantics of the senses.
BLLIP: The North American News Text (Complete) corpus is used as the unlabeled data source to generate synthetic labeled data for the auxiliary task. We remove the explicit discourse connectives from the raw texts and treat the explicit relations as synthetic implicit relations. We obtain a resulting corpus of 100,000 implicit relations by random sampling.
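The construction of synthetic implicit relations can be sketched as follows. This is a simplified illustration, not the actual pipeline: it assumes the raw text has already been split into (Arg-1, connective, Arg-2) triples, and the two-entry connective-to-sense table is a hypothetical toy mapping rather than the inventory used in the experiments.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical toy mapping from explicit connectives to top-level senses.
CONNECTIVE_SENSE = {"because": "Contingency", "but": "Comparison"}

Explicit = Tuple[str, str, str]   # (arg1, connective, arg2)
Synthetic = Tuple[str, str, str]  # (arg1, arg2, sense)


def to_synthetic_implicit(arg1: str, connective: str, arg2: str) -> Optional[Synthetic]:
    """Drop the connective and keep its sense as the label; None if unknown."""
    sense = CONNECTIVE_SENSE.get(connective.lower())
    if sense is None:
        return None
    return (arg1, arg2, sense)


def build_corpus(explicit: List[Explicit], sample_size: int, seed: int = 0) -> List[Synthetic]:
    """Convert all usable explicit instances, then draw a random sample."""
    converted = [
        item
        for item in (to_synthetic_implicit(*inst) for inst in explicit)
        if item is not None
    ]
    rng = random.Random(seed)
    return rng.sample(converted, min(sample_size, len(converted)))


# Toy explicit instances (invented sentences, for illustration only).
explicit_instances = [
    ("He stayed home", "because", "he was ill"),
    ("The price rose", "but", "demand stayed flat"),
    ("She left", "whereupon", "the meeting ended"),  # connective not in the toy table
]
corpus = build_corpus(explicit_instances, sample_size=2)
```

Instances whose connective is not covered by the mapping are skipped, and the sample size plays the role of the 100,000-instance cut described above.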

Evaluation Measures
We adopt precision (P), recall (R) and their harmonic mean, i.e., $F_1$, for performance evaluation. We also report accuracy for direct comparison with previous work.
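For reference, the three measures can be computed for one target class as in the sketch below. The function name and the toy label sequences are hypothetical; the definitions are the standard ones, with $F_1 = 2PR / (P + R)$.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], predicted: List[str],
                        positive: str) -> Tuple[float, float, float]:
    """Compute P, R and F1 for one target class (one-versus-other setting)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (assumed): one true positive, one false positive, one false negative.
gold = ["Comp.", "Other", "Comp.", "Other"]
pred = ["Comp.", "Comp.", "Other", "Other"]
p, r, f1 = precision_recall_f1(gold, pred, positive="Comp.")
```

Macro-averaged $F_1$ for multi-class classification is obtained by computing $F_1$ per class in this way and averaging the per-class values.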

Results on PDTB in multiple binary classification
To be consistent with previous work, we first perform multiple binary classification (one-versus-other) on the four top-level classes in PDTB. Several previous studies merged EntRel with Expansion; we also explore this setting and denote it as Exp+.
Table 2 shows the results of our three proposed models in terms of $F_1$ (%) on PDTB using multiple binary classification. STL means single task learning. Eshare, Wshare and Gshare denote the equal share, weighted share and gated interaction share under the multi-task framework, respectively. Imp denotes the standard implicit relations dataset in PDTB used for training (or, for experiments on the CoNLL dataset, the standard implicit relations dataset in CoNLL). Exp denotes all explicit relations in sections 00-24 of PDTB (or, for experiments on the CoNLL dataset, all explicit relations in CoNLL). BLLIP denotes the synthetic implicit relations extracted from BLLIP. For example, Imp + BLLIP indicates that Imp is used for the main task and BLLIP for the auxiliary task.
The first three rows in Table 2 list the results of LSTM, Bi-LSTM and the attention neural network in the single task learning (STL) framework, which act as baselines for comparison with multi-task learning. We see that Bi-LSTM achieves slightly better performance than LSTM, which is consistent with previous work, as Bi-LSTM considers contextual information in both the forward and backward directions while LSTM only considers the forward direction. Compared with LSTM and Bi-LSTM, the attention neural network achieves much better performance. This indicates the effectiveness of the attention mechanism in capturing the interaction between discourse arguments, which is crucial for relationship representation.
Generally, under the multi-task neural network framework, the three proposed multi-task neural networks, i.e., Eshare, Wshare and Gshare, outperform the single task learning methods. Comparing Eshare with Wshare, we see that using a low value of $w$ boosts performance and reduces the negative influence of the auxiliary task. We then use the best $w$ value from Wshare to construct the loss of Gshare, and Gshare achieves the best performance among all methods through information interaction between the main and auxiliary tasks.
Comparing Imp + Exp with Imp + BLLIP, we see that using Exp as the auxiliary task achieves lower performance than using BLLIP and even hurts performance compared with the single task. Two possible reasons are that (1) explicit and implicit discourse relations differ, and (2) the Exp dataset is much smaller than BLLIP and thus not large enough to boost performance.

Results on PDTB and CoNLL-2016 in multi-class classification
We also perform multi-class classification on PDTB and CoNLL-2016: a four-way classification on the four top-level classes in PDTB and a 15-way classification on the 15 sense labels in the CoNLL dataset. Table 3 shows the results of multi-class classification on the PDTB and CoNLL-2016 corpora in terms of accuracy (%) and macro-averaged $F_1$ (%). The results of multi-class classification are consistent with the results of binary classification. First, the attention neural network achieves better performance than LSTM and Bi-LSTM. Second, the multi-task learning methods outperform the single-task learning method. Third, the Gshare method achieves the best performance.
Table 4 compares our best model with the reported state-of-the-art systems on PDTB and CoNLL-2016. We see that our model achieves $F_1$ improvements of 1.64% on Cont., 0.97% on Exp., and 1.35% on Exp.+ over the best reported systems in binary classification. In multi-class classification, our model also achieves the best $F_1$ in four-way classification and the best accuracy on the CoNLL-2016 Blind test set, which indicates that our model generalizes well.
Specially,  and (Liu and Li, 2016) listed in Table 4, which adopted a neural network-based multi-task framework, are quite relevant to this work.  presented a multi-task neural network, which considered information sharing between the main and auxiliary tasks. Different from their work, our work integrates the attention-based interaction between arguments and the multi-task-based interaction between tasks into the final model. This is the main reason why our model achieves better performance on all types of relations, which shows the effectiveness of integrating the gated mechanism into the multi-task framework. Besides, (Liu and Li, 2016) used a complicated multi-level attention mechanism, and the performance of our attention neural network in the single task setting is comparable to their results. Our multi-task attention model achieves better performance on most types with the aid of the multi-task framework. In addition, our previous work (Lan et al., 2013), listed in Table 4, also presented a multi-task framework, using a traditional machine learning method to address implicit discourse recognition with BLLIP as the source of synthetic data. Clearly, under the neural network-based multi-task framework, the attention and gated mechanisms significantly improved the results and outperformed the traditional machine learning method on all types of relations.
Figure 5 shows how the performance of the four binary classifications on the four top-level classes is influenced by different share weights $w$ in the Wshare multi-task framework. We see that the best performance is achieved with a lower value of $w$. This indicates that a low value of $w$ can boost performance, reduce the negative influence of the auxiliary task, and enable our model to pay more attention to the main task.

Implicit Discourse
With the release of PDTB 2.0, a number of studies performed discourse relation recognition on natural (i.e., genuine) discourse data, using traditional NLP techniques to extract linguistically informed features together with traditional machine learning algorithms (Pitler et al., 2009; Lin et al., 2009; Wang et al., 2010; Braud and Denis, 2015; Fisher and Simmons, 2015).
Later, to make full use of unlabelled data, several studies applied multi-task or unsupervised learning methods (Lan et al., 2013; Braud and Denis, 2015; Fisher and Simmons, 2015; Rutherford and Xue, 2015).

Multi-task learning
Multi-task learning frameworks in traditional machine learning rely on human-selected knowledge, and the shared part is integrated into the cost function to favor learning of the main task. (Collobert and Weston, 2008) proposed a multi-task neural network trained jointly on relevant tasks using weight sharing (sharing the word embeddings among tasks).  proposed the multi-task neural network by modifying the recurrent neural network for text classification tasks. (Lan et al., 2013) presented a multi-task learning based system which can effectively use synthetic data for implicit discourse relation recognition. (Wu et al., 2016) used bilingually-constrained synthetic implicit data for implicit discourse relation recognition with a multi-task neural network.  propose a convolutional neural network embedded multi-task learning system to improve the performance of implicit discourse identification.

Deep learning with Attention
Recently, deep learning with attention has been widely adopted by NLP researchers.  proposed an attention-based Bi-LSTM for relation classification. (Wang et al., 2016c) proposed an attention-based LSTM for aspect-level sentiment classification. (Tan et al., 2016) proposed attentive LSTMs for question-answer matching. (Wang et al., 2016a) proposed an inner attention based RNN (adding attention information before the RNN hidden representation) for answer selection in QA. (Wang et al., 2016b) proposed multi-level attention CNNs for relation classification. (Yin et al., 2016) proposed an attentive convolutional neural network for QA.

Concluding Remarks
We present a novel multi-task attention-based neural network model for implicit discourse relationship representation and identification. Our method captures both the discourse relationships, through interactions between discourse arguments, and complementary knowledge, through interactions between annotated and unannotated data. The experimental results show that our proposed model outperforms the state-of-the-art systems on two benchmark corpora.