Adversarial Connective-exploiting Networks for Implicit Discourse Relation Classification

Implicit discourse relation classification is highly challenging due to the lack of connectives as strong linguistic cues, which motivates the use of annotated implicit connectives to improve recognition. We propose a feature imitation framework in which an implicit relation network is driven to learn from another neural network with access to connectives, and is thus encouraged to extract similarly salient features for accurate classification. We develop an adversarial model to enable an adaptive imitation scheme through competition between the implicit network and a rival feature discriminator. Our method effectively transfers the discriminability of connectives to the implicit features, and achieves state-of-the-art performance on the PDTB benchmark.


Introduction
Discourse relations connect linguistic units such as clauses and sentences to form coherent semantics. Identification of discourse relations can benefit a variety of downstream applications including question answering (Liakata et al., 2013), machine translation (Li et al., 2014), text summarization (Gerani et al., 2014), and so forth.
Connectives (e.g., but, so) are one of the most critical linguistic cues for identifying discourse relations. When explicit connectives are present in the text, a simple frequency-based mapping is sufficient to achieve over 85% classification accuracy (Xue et al., 2016). In contrast, implicit discourse relation recognition has long been seen as a challenging problem, with the best accuracy so far still lower than 50%. In the implicit case, discourse relations are not lexicalized by connectives, but must be inferred from the relevant sentences (i.e., arguments). For example, the following two adjacent sentences Arg1 and Arg2 imply the relation Cause (i.e., Arg2 is the cause of Arg1).
[Arg2]: You already know the answer.
[Implicit connective]: Because
[Discourse relation]: Cause
Various attempts have been made to directly infer underlying relations by modeling the semantics of the arguments, ranging from feature-based methods (Lin et al., 2009; Pitler et al., 2009) to the very recent end-to-end neural models (Chen et al., 2016a; Qin et al., 2016c). Despite impressive performance, the absence of strong explicit connective cues has made the inference extremely hard and hindered further improvement. In fact, even human annotators make use of connectives to aid relation annotation. For instance, the popular Penn Discourse Treebank (PDTB) benchmark data (Prasad et al., 2008) was annotated by first manually inserting a connective expression (i.e., an implicit connective, as shown in the above example), and then determining the abstract relation by combining both the implicit connective and contextual semantics.
Therefore, the huge performance gap between explicit and implicit parsing (namely, 85% vs 50%), as well as the human annotation practice, strongly motivates incorporating connective information to guide the reasoning process. This paper aims to advance implicit parsing by making use of the annotated implicit connectives available in training data. Few recent works have explored such a combination. Zhou et al. (2010) developed a two-step approach that first predicts implicit connectives, whose sense is then disambiguated to obtain the relation. However, the pipeline approach usually suffers from error propagation, and the method itself relies on hand-crafted features which do not necessarily generalize well. Other research leveraged explicit connective examples for data augmentation (Braud and Denis, 2015; Braud and Denis, 2016). Our work is orthogonal and complementary to this line.
In this paper, we propose a novel neural method that incorporates implicit connectives in a principled adversarial framework. We use deep neural models for relation classification, building on the intuition that sentence arguments integrated with connectives enable highly discriminative neural features for accurate relation inference, and that an ideal implicit relation classifier, even without access to connectives, should mimic the connective-augmented reasoning behavior by extracting similarly salient features. We therefore set up a secondary network in addition to the implicit relation classifier, built upon connective-augmented inputs and serving as a feature learning model for the implicit classifier to emulate.
Methodologically, however, feature imitation in our problem is challenging due to the semantic gap induced by adding the connective cues. It is necessary to develop an adaptive scheme to flexibly drive learning and transfer discriminability. We devise a novel adversarial approach which enables a self-calibrated imitation mechanism. Specifically, we build a discriminator which distinguishes between the features produced by the two counterpart networks. The implicit relation network is then trained to correctly classify relations and simultaneously to fool the discriminator, resulting in an adversarial framework. The adversarial mechanism has been an emerging method in different contexts, especially for image generation (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016; Chen et al., 2016c). Our adversarial framework is unique in addressing neural feature emulation between two models. Besides, to the best of our knowledge, this is the first adversarial approach in the context of discourse parsing. Compared to previous connective-exploiting work (Zhou et al., 2010; Xu et al., 2012), our method provides a new integration paradigm and an end-to-end procedure that avoids inefficient feature engineering and error propagation.
We evaluate our method on the PDTB 2.0 benchmark in a variety of experimental settings. The proposed adversarial model greatly improves over standalone neural networks and previous best-performing approaches. We also demonstrate that our implicit recognition network successfully imitates and extracts crucial hidden representations.
We begin by briefly reviewing related work in section 2. Section 3 presents the proposed adversarial model. Section 4 shows substantially improved experimental results over previous methods. Section 5 discusses extensions and future work.
Related Work

Implicit Discourse Relation Recognition
There has been a surge of interest in implicit discourse parsing since the release of PDTB (Prasad et al., 2008), the first large discourse corpus distinguishing implicit examples from explicit ones. A large body of work has focused on direct classification based on observed sentences, including structured methods with linguistically-informed features (Lin et al., 2009; Pitler et al., 2009; Zhou et al., 2010), end-to-end neural models (Qin et al., 2016b,c; Chen et al., 2016a), and combined approaches. However, the lack of connective cues makes learning purely from contextual semantics challenging. Prior work has attempted to leverage connective information. Zhou et al. (2010) also incorporate implicit connectives, but in a pipeline manner, by first predicting the implicit connective with a language model and then determining the discourse relation accordingly. Instead of treating implicit connectives as intermediate prediction targets, which can suffer from error propagation, we use the connectives to induce highly discriminative features to guide the learning of an implicit network, serving as an adaptive regularization mechanism for enhanced robustness and generalization. Our framework is also end-to-end, avoiding costly feature engineering. Another notable line aims at adapting explicit examples for data synthesis (Biran and McKeown, 2013; Braud and Denis, 2015), multi-task learning (Lan et al., 2013), and word representation (Braud and Denis, 2016). Our work is orthogonal and complementary to these methods, as we use implicit connectives which have been annotated for implicit examples.
Figure 1: Architecture of the proposed method. The framework contains three main components: 1) an implicit relation network i-CNN over raw sentence arguments, 2) a connective-augmented relation network a-CNN whose inputs are augmented with implicit connectives, and 3) a discriminator distinguishing between the features produced by the two networks. The features are fed to the final classifier for relation classification. The discriminator and i-CNN form an adversarial pair for feature imitation. At test time, the implicit network i-CNN with the classifier is used for prediction.

Adversarial Networks
Adversarial methods have achieved impressive success in deep generative modeling (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016). Generative adversarial nets (Goodfellow et al., 2014) learn to produce realistic images through competition between an image generator and a real/fake discriminator. Professor forcing (Lamb et al., 2016) applies a similar idea to improve long-term generation of a recurrent neural language model. Other approaches (Chen et al., 2016b; Hu et al., 2017) extend the framework for controllable image/text generation. Li et al. (2015) and Salimans et al. (2016) propose feature matching, which trains generators to match the statistics of real/fake examples. Their features are extracted by the discriminator rather than by the classifier networks as in our case. Our work differs from the above since we consider the context of discriminative modeling. Adversarial domain adaptation forces a neural network to learn domain-invariant features using a classifier that distinguishes the domain of the network's input data based on the hidden feature. Our adversarial framework is distinct in that, besides the implicit relation network, we construct a second neural network serving as a teacher model for feature emulation.
To the best of our knowledge, this is the first work to employ the idea of adversarial learning in the context of discourse parsing. We propose a novel connective-exploiting scheme based on feature imitation, and to this end derive a new adversarial framework, achieving substantial performance gains over existing methods. The proposed approach is generally applicable to other tasks for utilizing any indicative side information. We give more discussion in section 5.

Adversarial Method
Discourse connectives are key indicators of discourse relations. In the annotation procedure of the PDTB implicit relation benchmark, annotators inserted implicit connective expressions between adjacent sentences to lexicalize abstract relations and help with final decisions. Our model aims at making full use of the provided implicit connectives at training time to regulate the learning of the implicit relation recognizer, encouraging extraction of highly discriminative semantics from raw arguments, and improving generalization at test time. Our method provides a novel adversarial framework that leverages connective information in a flexible, adaptive manner, and is efficiently trained end-to-end through standard back-propagation.
The basic idea of the proposed approach is simple. We want our implicit relation recognizer, which predicts the underlying relation of sentence arguments without a discourse connective, to have prediction behaviors close to those of a connective-augmented relation recognizer, which is provided with a discourse connective in addition to the arguments. The connective-augmented recognizer is analogous to an annotator working with the help of connectives, as in the human annotation process, and the implicit recognizer is improved by learning from such an "informed" annotator. Specifically, we want the latent features extracted by the two models to match as closely as possible, which explicitly transfers the discriminability of the connective-augmented representations to the implicit ones.
To this end, instead of manually selecting a closeness metric, we take advantage of the adversarial framework by constructing a two-player zero-sum game between the implicit recognizer and a rival discriminator. The discriminator attempts to distinguish between the features extracted by the two relation models, while the implicit relation model is trained to maximize the accuracy on implicit data, and at the same time to confuse the discriminator.
In what follows, we first present the overall architecture of the proposed approach (section 3.1), and then develop the training procedure (section 3.2). The components are realized as deep (convolutional) neural networks, with detailed modeling choices discussed in section 3.3.

Model Architecture
Let $(x, y)$ be a pair of input and output of implicit relation classification, where $x = (x_1, x_2)$ is a pair of sentence arguments, and $y$ is the underlying discourse relation. Each training example also includes an annotated implicit connective $c$ that best expresses the relation. Figure 1 shows the architecture of our framework.
The neural model for implicit relation classification (i-CNN in the figure) extracts a latent representation from the arguments, denoted as $H_I(x_1, x_2)$, and feeds the feature into a classifier $C$ for the final prediction $C(H_I(x_1, x_2))$. For ease of notation, we will also use $H_I(x)$ to denote the latent feature on data $x$.
The second relation network (a-CNN) takes as inputs the sentence arguments along with an implicit connective, to induce the connective-augmented representation $H_A(x_1, x_2, c)$, and obtains the relation prediction $C(H_A(x_1, x_2, c))$. Note that the same final classifier $C$ is used for both networks, so that the feature representations of the two networks are ensured to be within the same semantic space, enabling feature emulation as presented shortly.
We further pair the implicit network with a rival discriminator D to form our adversarial game.
The discriminator differentiates between the reasoning behaviors of the implicit network i-CNN and the augmented network a-CNN. Specifically, $D$ is a binary classifier that takes as input a latent feature $H$ derived from either i-CNN or a-CNN given appropriate data (where the implicit connective is either missing or present, respectively). The output $D(H)$ estimates the probability that $H$ comes from the connective-augmented a-CNN rather than i-CNN.

Training Procedure
The system is trained through an alternating optimization procedure that updates the components in an interleaved manner. In this section, we first present the training objective for each component, and then give the overall training algorithm.
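Let $\theta_D$ denote the parameters of the discriminator. The training objective of $D$ is straightforward, i.e., to maximize the probability of correctly distinguishing the input features:
$$\max_{\theta_D} \; \mathcal{L}_D(\theta_D) = \mathbb{E}_{(x,c,y)\sim \text{data}}\left[ \log D(H_A(x, c); \theta_D) + \log\big(1 - D(H_I(x); \theta_D)\big) \right], \tag{1}$$
where $\mathbb{E}_{(x,c,y)\sim \text{data}}[\cdot]$ denotes the expectation in terms of the data distribution.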
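As a minimal sketch of the discriminator objective, assuming the standard GAN-style log-likelihood form (log D on connective-augmented features plus log(1 − D) on implicit features), the snippet below evaluates the objective on a small batch of discriminator outputs. The function name and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math

def discriminator_objective(d_on_augmented, d_on_implicit):
    """Average log-likelihood of the discriminator labelling features correctly.

    d_on_augmented: D(H_A) values, i.e. probabilities of "comes from a-CNN"
                    assigned to features that really come from a-CNN.
    d_on_implicit:  D(H_I) values for features that really come from i-CNN.
    Assumes equally sized batches with all values strictly inside (0, 1).
    """
    terms = [
        math.log(d_a) + math.log(1.0 - d_i)
        for d_a, d_i in zip(d_on_augmented, d_on_implicit)
    ]
    return sum(terms) / len(terms)

# Hypothetical discriminator outputs (not real model outputs).
confident = discriminator_objective([0.9, 0.8], [0.1, 0.2])
confused = discriminator_objective([0.5, 0.5], [0.5, 0.5])
print(confident, confused)
```

A discriminator that separates the two feature sources well obtains a higher objective value than one that outputs 0.5 everywhere, which is the fully confused case the implicit network tries to bring about.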
We denote the parameters of the implicit network i-CNN and the classifier $C$ as $\theta_I$ and $\theta_C$, respectively. The model is then trained to (a) correctly classify relations in training data and (b) produce salient features close to connective-augmented ones. The first objective can be fulfilled by minimizing the usual cross-entropy loss:
$$\mathcal{L}_{I,C}(\theta_I, \theta_C) = \mathbb{E}_{(x,y)\sim \text{data}}\left[ J\big(C(H_I(x; \theta_I); \theta_C),\, y\big) \right], \tag{2}$$
where $J(p, y) = -\sum_k \mathbb{I}(y = k) \log p_k$ is the cross-entropy loss between the predictive distribution $p$ and the ground-truth label $y$. We achieve objective (b) by minimizing the discriminator's chance of correctly telling apart the features:
$$\mathcal{L}_I(\theta_I) = \mathbb{E}_{(x,c,y)\sim \text{data}}\left[ \log D(H_A(x, c); \theta_D) + \log\big(1 - D(H_I(x; \theta_I); \theta_D)\big) \right]. \tag{3}$$
The parameters of the augmented network a-CNN, denoted as $\theta_A$, can be learned by simply fitting to the data, i.e., minimizing the cross-entropy loss as follows:
$$\mathcal{L}_A(\theta_A) = \mathbb{E}_{(x,c,y)\sim \text{data}}\left[ J\big(C(H_A(x, c; \theta_A)),\, y\big) \right]. \tag{4}$$
As mentioned above, here we use the same classifier $C$ as for the implicit network, forcing a unified feature space of both networks. We combine the above objectives Eqs.(2)-(4) of the relation classifiers and minimize the joint loss:
$$\mathcal{L}_{I,A,C}(\theta_I, \theta_A, \theta_C) = \mathcal{L}_{I,C}(\theta_I, \theta_C) + \lambda_1 \mathcal{L}_I(\theta_I) + \lambda_2 \mathcal{L}_A(\theta_A), \tag{5}$$
where $\lambda_1$ and $\lambda_2$ are two balancing parameters calibrating the weights of the classification losses and the feature-regulating loss. In practice, we pretrain the implicit and augmented networks independently by minimizing Eq.(2) and Eq.(4), respectively. In the adversarial training process, we found that setting $\lambda_2 = 0$ gives stable convergence. That is, the connective-augmented features are fixed after the pre-training stage. Algorithm 1 summarizes the training procedure, where we interleave the optimization of Eq.(1) and Eq.(5) at each iteration. More practical details are provided in section 4. We instantiate all modules as neural networks (section 3.3) which are differentiable, and perform the optimization efficiently through standard stochastic gradient descent and back-propagation.
Algorithm 1: Adversarial Model for Implicit Recognition
Input: Training data $\{(x, c, y)_n\}$
Parameters: $\lambda_1$, $\lambda_2$ (balancing parameters)
1: Initialize $\{\theta_I, \theta_C\}$ and $\{\theta_A\}$ by minimizing Eq.(2) and Eq.(4), respectively
2: repeat
3:   Train the discriminator through Eq.(1)
4:   Train the relation models through Eq.(5)
5: until convergence
Output: Adversarially enhanced implicit relation network i-CNN with classifier $C$ for prediction
Through Eq.(1) and Eq.(3), the discriminator and the implicit relation network follow a minimax competition, which drives both to improve until the implicit feature representations are close to the connective-augmented latent representations, encouraging the implicit network to extract highly discriminative features from raw sentence arguments for relation classification. Alternatively, we can see Eq.(3) as an adaptive regularization on the implicit model, which, compared to pre-fixed regularizers such as $\ell_2$-regularization, provides a more flexible, self-calibrated mechanism to improve generalization ability.
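The alternating procedure described in this section can be sketched as the skeleton below. It is only an illustration under stated assumptions: the four hooks are hypothetical caller-supplied callables standing in for the actual network updates, `max_iterations` is an added safety cap, and `joint_loss` assumes that the joint objective is the implicit classification loss plus the imitation loss weighted by λ1 and the a-CNN classification loss weighted by λ2, as the surrounding text suggests.

```python
def adversarial_training(pretrain, train_discriminator, train_relation_models,
                         has_converged, max_iterations=1000):
    """Skeleton of the alternating optimization.

    Hypothetical hooks supplied by the caller:
      pretrain()                -- fit i-CNN + C and a-CNN with cross-entropy only
      train_discriminator()     -- one minibatch update of the discriminator
      train_relation_models()   -- one minibatch update of the relation models
      has_converged(iteration)  -- stopping criterion
    Returns the number of adversarial iterations that were run.
    """
    pretrain()
    iteration = 0
    while iteration < max_iterations:
        train_discriminator()      # step 3: update D
        train_relation_models()    # step 4: update the relation models
        iteration += 1
        if has_converged(iteration):
            break
    return iteration

def joint_loss(loss_ic, loss_i, loss_a, lambda_1=0.1, lambda_2=0.0):
    """Weighted sum of the three relation-model losses (assumed form)."""
    return loss_ic + lambda_1 * loss_i + lambda_2 * loss_a

# Usage with stub hooks that only record the call order.
calls = []
n_iterations = adversarial_training(
    pretrain=lambda: calls.append("pretrain"),
    train_discriminator=lambda: calls.append("D"),
    train_relation_models=lambda: calls.append("R"),
    has_converged=lambda it: it >= 2,
)
print(n_iterations, calls)
```

With λ2 = 0 the a-CNN term drops out of the joint loss, which corresponds to keeping the connective-augmented network fixed after pre-training.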

Component Structures
We have presented our adversarial framework for implicit relation classification. We now discuss the model realization of each component. All components of the framework are parameterized with neural networks. Distinct roles of the modules in the framework lead to different modeling choices.
Relation Classification Networks. Figure 2 illustrates the structure of the implicit relation network i-CNN. We use a convolutional network as it is a common architectural choice for discourse parsing. The network takes as inputs the word vectors of the tokens in each sentence argument, and maps each argument to intermediate features through a shared convolutional layer. The resulting representations are then concatenated and fed into a max pooling layer to select the most salient features as the final representation. The final classifier C is a simple fully-connected layer followed by a softmax classifier. The connective-augmented network a-CNN has a structure similar to that of i-CNN, with the implicit connective appended to the second sentence as input. The key difference from i-CNN is that here we adopt average k-max pooling, which takes the average of the top-k maximum values in each pooling window. With max pooling, the network would select solely the connective-induced features, which are typically the most salient ones. Average k-max pooling prevents this and forces the network to also attend to contextual features derived from the arguments. This yields more homogeneous output features across the two networks, and thus facilitates feature imitation. In all the experiments we fixed k = 2.
Discriminator. The discriminator is a binary classifier that identifies the correct source of an input feature vector. To make it a strong rival to the feature-imitating network (i-CNN), we model the discriminator as a multi-layer perceptron (MLP) enhanced with a gating mechanism for efficient information flow (Srivastava et al., 2015; Qin et al., 2016c), as shown in Figure 3.
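To make the difference between max pooling and the average k-max pooling used in a-CNN concrete, the following sketch applies both to a single pooling window. The activation values are hypothetical, and treating the last position as the one covering the appended connective is an assumption made only for this example.

```python
def max_pool(activations):
    """Standard max pooling over one pooling window."""
    return max(activations)

def average_k_max_pool(activations, k=2):
    """Mean of the k largest activations in one pooling window."""
    if not 1 <= k <= len(activations):
        raise ValueError("k must be between 1 and the window size")
    top_k = sorted(activations, reverse=True)[:k]
    return sum(top_k) / k

# Hypothetical activations of one filter over the positions of an argument;
# the last position stands for the appended implicit connective.
window = [0.2, 0.5, 0.1, 0.9]
pooled_max = max_pool(window)                 # only the connective position contributes
pooled_avg = average_k_max_pool(window, k=2)  # connective plus one context position
print(pooled_max, pooled_avg)
```

With k = 2, the pooled value mixes the connective-induced activation with the strongest contextual activation, so the output cannot be determined by the connective alone.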

Experiments
We demonstrate the effectiveness of our approach both quantitatively and qualitatively with extensive experiments. We evaluate prediction performance on the PDTB benchmark in different settings. Our method substantially improves over a diverse set of previous models, especially in the practical multi-class classification task. We perform in-depth analysis of the model behaviors, and show our adversarial framework successfully enables the implicit relation model to imitate and learn discriminative features.

Experiment Setup
We use PDTB 2.0 (http://www.seas.upenn.edu/~pdtb/), one of the largest manually annotated discourse relation corpora. The dataset contains 16,224 implicit relation instances in total, with three levels of senses: Level-1 Class, Level-2 Type, and Level-3 Subtype. The 1st level consists of four major relation Classes: COMPARISON, CONTINGENCY, EXPANSION and TEMPORAL. The 2nd level contains 16 Types.
To make extensive comparison with prior work on implicit discourse relation classification, we evaluate on two popular experimental settings: 1) multi-class classification for 2nd-level types (Lin et al., 2009), and 2) one-versus-others binary classifications for 1st-level classes (Pitler et al., 2009). We describe the detailed configurations in the following respective sections. We will focus our analysis on the multi-class classification setting, which is most realistic in practice and serves as a building block for a complete discourse parser such as that for the shared tasks of CoNLL-2015 and 2016 (Xue et al., 2016).
Model Training. We provide detailed model and training configurations in the supplementary materials, and only mention a few of them here. Throughout the experiments, i-CNN and a-CNN contain 3 sets of convolutional filters with the filter sizes selected on the dev set. The final single-layer classifier C contains 512 neurons. The discriminator D consists of 4 fully-connected layers, with 2 gated pathways from layer 1 to layer 3 and layer 4 (Figure 3).
For adversarial model training, it is critical to keep a balance between the progress of the two players. We use a simple strategy that, at each iteration, optimizes the discriminator and the implicit relation network on a randomly sampled minibatch. We found this is enough to stabilize the training. The neural parameters are trained using AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.001. For the balancing parameters in Eq.(5), we set $\lambda_1 = 0.1$ and $\lambda_2 = 0$. That is, after the initialization stage the weights of the connective-augmented network a-CNN are fixed. We found that this gives stable and good predictive performance for our system.
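For readers unfamiliar with AdaGrad, the per-parameter update it performs can be sketched as follows. This is a generic illustration of the update rule on plain Python lists, not our training code; the stability constant `eps` and the toy gradients are assumed values.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.001, eps=1e-8):
    """One AdaGrad update on plain Python lists.

    accum holds the running sum of squared gradients per parameter and is
    updated in place; the updated parameter list is returned.
    eps is a small constant for numerical stability (assumed value).
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        accum[i] += g * g
        new_params.append(p - learning_rate * g / (math.sqrt(accum[i]) + eps))
    return new_params

# Toy parameters and gradients (hypothetical values).
params = [1.0, -2.0]
accum = [0.0, 0.0]
params = adagrad_step(params, grads=[2.0, -0.5], accum=accum)
print(params, accum)
```

Because each step is divided by the accumulated gradient magnitude, the first update moves every parameter by roughly the learning rate regardless of the gradient scale, and later updates shrink for parameters that keep receiving large gradients.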

Implicit Relation Classification
We will mainly focus on the general multi-class classification problem in two alternative settings adopted in prior work, showing the superiority of our model over previous state-of-the-art methods. We perform in-depth comparison with carefully designed baselines, providing empirical insights into the working mechanism of the proposed framework. For broader comparisons we also report the performance in the one-versus-all setting.
Table 1: Accuracy (%) on the test sets of the PDTB-Lin and PDTB-Ji settings for multi-class classification. Please see the text for more details.

Multi-class Classifications
We first adopt the standard PDTB splitting convention following Lin et al. (2009), denoted as PDTB-Lin, where sections 2-21, 22, and 23 are used as training, dev, and test sets, respectively. The most frequent 11 types of relations are selected in the task. During training, instances with more than one annotated relation type are considered as multiple instances, each of which has one of the annotations. At test time, a prediction that matches one of the gold types is considered correct. The test set contains 766 examples. Please refer to Lin et al. (2009) for more details. An alternative, slightly different multi-class setting, denoted as PDTB-Ji, uses sections 2-20, 0-1, and 21-22 as training, dev, and test sets, respectively. The resulting test set contains 1039 examples. We also evaluate in this setting for thorough comparisons.
Table 1 shows the classification accuracy in both of the settings. We see that our model (Row 10) achieves state-of-the-art performance, greatly outperforming previous methods (Rows 6-9) with various modeling paradigms, including the linguistic feature-based model (Lin et al., 2009), pure neural methods (Qin et al., 2016c), and a combined approach.
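The handling of instances with multiple annotated relation types can be illustrated with a short sketch. The helper names and the toy instances are hypothetical; the relation names serve only as placeholders.

```python
def expand_training_instances(instances):
    """Turn each (arguments, [types]) pair into one training instance per type."""
    return [(args, rel) for args, rels in instances for rel in rels]

def multi_gold_accuracy(predictions, gold_sets):
    """Fraction of predictions that match any one of the gold types."""
    if len(predictions) != len(gold_sets):
        raise ValueError("predictions and gold_sets must have the same length")
    correct = sum(1 for pred, gold in zip(predictions, gold_sets) if pred in gold)
    return correct / len(predictions)

# Hypothetical toy data.
train = [
    (("arg1 a", "arg2 a"), ["Cause"]),
    (("arg1 b", "arg2 b"), ["Conjunction", "Contrast"]),
]
expanded = expand_training_instances(train)

predictions = ["Cause", "Contrast", "Cause"]
gold = [{"Cause"}, {"Conjunction", "Contrast"}, {"Restatement"}]
accuracy = multi_gold_accuracy(predictions, gold)
print(len(expanded), accuracy)
```

The second training example contributes two instances, and the second test prediction counts as correct because it matches one of the two gold types.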
To obtain better insights into the working mechanism of our method, we further compare with a set of carefully selected baselines as shown in Rows 1-5. 1) "Word-vector" sums over the word vectors for sentence representation, showing the base effect of word embeddings. 2) "CNN" is a standalone convolutional net having exactly the same architecture as our implicit relation network. Our model trained within the proposed framework provides significant improvement, showing the benefits of utilizing implicit connectives at training time. 3) "Ensemble" has the same neural architecture as the proposed framework except that the input of a-CNN is not augmented with implicit connectives. This essentially is an ensemble of two implicit recognition networks. We see that the method performs even worse than the single CNN model. This further confirms the necessity of exploiting connective information. 4) "Multi-task" is the convolutional net augmented with an additional task of simultaneously predicting the implicit connectives based on the network features. As a straightforward way of incorporating connectives, the method slightly improves over the stand-alone CNN, while falling behind our approach by a large margin. This indicates that our proposed feature imitation is a more effective scheme for making use of implicit connectives. 5) Finally, "$\ell_2$-reg" also implements feature mimicking by imposing an $\ell_2$ distance penalty between the implicit relation features and the connective-augmented features. We see that this simple model obtains improvement over previous best-performing systems in both settings, further validating the idea of imitation. However, in contrast to the fixed $\ell_2$ regularization, our adversarial framework provides an adaptive mechanism, which is more flexible and performs better as shown in the table.
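The contrast between the fixed distance penalty of the last baseline and the adversarial imitation signal can be sketched as below. Both functions are illustrative assumptions: the penalty is taken to be the squared Euclidean distance between feature vectors, the adversarial term is taken to have the GAN-style form log(1 − D(H_I)), and the feature values are made up.

```python
import math

def l2_imitation_penalty(h_implicit, h_augmented):
    """Squared Euclidean distance between two feature vectors (fixed metric)."""
    return sum((a - b) ** 2 for a, b in zip(h_implicit, h_augmented))

def adversarial_imitation_term(d_on_implicit):
    """log(1 - D(H_I)) for one implicit feature (assumed GAN-style form).

    D(H_I) is the discriminator's probability that the feature comes from
    a-CNN. The implicit network lowers this term by raising D(H_I), i.e. by
    producing features the discriminator mistakes for connective-augmented ones.
    """
    return math.log(1.0 - d_on_implicit)

# Hypothetical feature vectors of i-CNN and a-CNN for the same example.
h_i = [0.2, 0.7, 0.1]
h_a = [0.4, 0.7, 0.5]
penalty = l2_imitation_penalty(h_i, h_a)
print(penalty, adversarial_imitation_term(0.1), adversarial_imitation_term(0.9))
```

The distance penalty weighs every feature dimension in the same predetermined way, whereas the adversarial term depends on a discriminator that is itself retrained during learning, so the notion of closeness adapts as the two feature distributions move.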

One-versus-all Classifications
We also report the results of four one-versus-all binary classifications for more comparisons with prior work. We follow the conventional experimental setting (Pitler et al., 2009) by selecting sections 2-20, 21-22, and 0-1 as training, dev, and test sets. More detailed data statistics are provided in Table 2. Our method outperforms most of the prior systems in all the tasks. We achieve state-of-the-art performance in recognition of the Expansion relation, and obtain scores comparable to the best-performing methods on each of the other relations. Notably, our feature imitation scheme greatly improves over Zhou et al. (2010), who leverage implicit connectives as an intermediate prediction task. This provides additional evidence for the effectiveness of our approach.

Qualitative Analysis
We now take a closer look into the modeling behavior of our framework, by investigating the process of the adversarial game during training, as well as the feature imitation effects. Figure 4 demonstrates the training progress of different components. The a-CNN network keeps high predictive accuracy as implicit connectives are given, showing the importance of connective cues. The rise-and-fall patterns in the accuracy of the discriminator clearly show its competition with the implicit relation network i-CNN as training proceeds. In the first few iterations the accuracy of the discriminator increases quickly to over 0.9, while at the late stage the accuracy drops to around 0.6, showing that the discriminator is getting confused by i-CNN (an accuracy of 0.5 indicates full confusion). The i-CNN network keeps improving in terms of implicit relation classification accuracy, as it is gradually fitting to the data and simultaneously learning increasingly discriminative features by mimicking a-CNN. The system exhibits similar learning patterns in the two different settings, showing the stability of the training strategy.
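The discriminator accuracy discussed above can be computed from its outputs as in the following sketch. The 0.5 decision threshold and the toy output values are assumptions for illustration and do not reproduce the numbers in Figure 4.

```python
def discriminator_accuracy(d_on_augmented, d_on_implicit, threshold=0.5):
    """Accuracy of the discriminator when its output is thresholded.

    An output above the threshold is read as "feature comes from a-CNN".
    """
    correct = sum(1 for d in d_on_augmented if d > threshold)
    correct += sum(1 for d in d_on_implicit if d <= threshold)
    return correct / (len(d_on_augmented) + len(d_on_implicit))

# Hypothetical outputs early in training: the two feature sources are easy to separate.
early = discriminator_accuracy([0.9, 0.8, 0.7, 0.95], [0.1, 0.2, 0.3, 0.6])
# Hypothetical outputs late in training: the outputs carry no information about the source.
late = discriminator_accuracy([0.6, 0.4, 0.55, 0.45], [0.45, 0.6, 0.4, 0.55])
print(early, late)
```

An accuracy near 0.5 on balanced inputs means that the discriminator does no better than guessing, which is the sense in which it is fully confused.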
We finally visualize the output feature vectors of i-CNN and a-CNN using the t-SNE method (Maaten and Hinton, 2008) in Figure 5. Without feature imitation, the features extracted by the two networks are clearly separated (Figure 5(a)). In contrast, as shown in Figures 5(b)-(c), the feature vectors are increasingly mixed as training proceeds. Thus our framework has successfully driven i-CNN to induce representations similar to those of a-CNN, even though connectives are not present.

Discussions
We have developed an adversarial neural framework that helps an implicit relation network extract highly discriminative features by mimicking a connective-augmented network. Our method achieved state-of-the-art performance for implicit discourse relation classification. Besides implicit connective examples, our model can naturally exploit the large amount of explicit connective data to further improve discourse parsing.
The proposed adversarial feature imitation scheme is also generally applicable to other contexts to incorporate indicative side information available at training time for enhanced inference. Our framework shares a similar spirit with the iterative knowledge distillation method (Hu et al., 2016a,b), which trains a "student" network to mimic the classification behavior of a knowledge-informed "teacher" network. Our approach encourages imitation on the feature level instead of the final prediction level. This allows our approach to apply to regression tasks and, more interestingly, to contexts in which the student and teacher networks have different prediction outputs, e.g., performing different tasks, where transferring knowledge between them can still be beneficial. Besides, our adversarial mechanism provides an adaptive metric to measure and drive the imitation procedure.