Unsupervised Adversarial Domain Adaptation for Implicit Discourse Relation Classification

Implicit discourse relations are not only more challenging to classify, but also to annotate, than their explicit counterparts. We tackle situations where training data for implicit relations are lacking, and exploit domain adaptation from explicit relations (Ji et al., 2015). We present an unsupervised adversarial domain adaptive network equipped with a reconstruction component. Our system outperforms prior works and other adversarial benchmarks for unsupervised domain adaptation. Additionally, we extend our system to take advantage of labeled data if some are available.


Introduction
Discourse relations capture the relationship between units of text (e.g., sentences and clauses) and are an important aspect of text coherence. While some relations are expressed explicitly with a discourse connective (e.g., "for example", "however"), relations are equally often expressed implicitly without an explicit connective (Prasad et al., 2008); in these cases, the relation needs to be inferred.
Resources for implicit discourse relations are scarce compared to the explicit ones, since they are harder to annotate (Miltsakaki et al., 2004). For example, among corpora annotated with discourse relations such as Arabic (Al-Saif and Markert, 2010), Czech (Poláková et al., 2013), Chinese (Zhou and Xue, 2015), English (Prasad et al., 2008), Hindi (Oza et al., 2009), and Turkish (Zeyrek et al., 2013), only the Chinese, English and Hindi corpora include implicit discourse relations (Prasad et al., 2014). In this low-resource scenario, Ji et al. (2015) proposed training with explicit relations via unsupervised domain adaptation, viewing explicit relations as a source domain with labeled training data, and implicit relations as a target domain with no labeled data. The domain gap between explicit and implicit relations is acknowledged by prior observations that the two types of discourse relations are linguistically dissimilar (Sporleder and Lascarides, 2008; Rutherford and Xue, 2015).
We present a new system for the unsupervised domain adaptation setup on the Penn Discourse Treebank (Prasad et al., 2008). Our system is based on Adversarial Discriminative Domain Adaptation (Tzeng et al., 2017), which decouples source-domain training from the representation mapping between source and target. We improve this framework by proposing a reconstruction component to preserve the discriminability of target features, and by incorporating techniques for more stable training on textual data.
Experimental results show that even with a simple architecture for representation learning, our unsupervised domain adaptation system outperforms prior work by 1.4–2.3 macro F1, with substantial improvements on Temporal and Contingency relations. It is also superior to DANN (Ganin et al., 2016), an adversarial framework widely used in NLP (Chen et al., 2018; Gui et al., 2017; Zhang et al., 2017; Fu et al., 2017; Joty et al., 2017; Xu and Yang, 2017), by 5.7 macro F1.
Finally, we extend the system to incorporate in-domain supervision, since it is sometimes feasible resource-wise to build a seed corpus that may not be large enough to train a fully supervised system. We simulate this scenario by enabling the system to jointly optimize over a varying number of labeled examples of implicit relations. Our system consistently outperforms two strong baselines.


Related Work
To supplement the training data of implicit discourse relations, prior works have used weak supervision from sentences with discourse connectives (Marcu and Echihabi, 2002; Sporleder and Lascarides, 2008; Braud and Denis, 2014; Ji et al., 2015), by analyzing connectives (Zhou et al., 2010a,b; Biran and McKeown, 2013; Rutherford and Xue, 2015; Braud and Denis, 2016; Wu et al., 2017), using a multi-task framework with other corpora (Lan et al., 2013; Liu et al., 2016; Lan et al., 2017), or utilizing cross-lingual data (Wu et al., 2016; Shi et al., 2017). The important distinction between this work and the research above is that these are supervised systems that used all of the annotated implicit relations from PDTB during training, while exploring non-PDTB corpora for additional, noisy discourse cues; in contrast, our main goal is to assume no labeled training data for implicit discourse relations.

Rutherford and Xue (2015) observed that explicit and implicit relations are linguistically dissimilar, warranting the unsupervised domain adaptation approach of Ji et al. (2015), who used a marginalized denoising autoencoder to obtain generalized feature representations across the source and target domains, with a linear SVM as the classification model. Our system improves upon this work using an adversarial network; we further generalize our network to semi-supervised settings.
Unsupervised domain adaptation with adversarial networks has become popular in recent years; this type of approach learns a representation for the target domain such that a discriminator is unable to distinguish between the source and target domains. Prior works proposed both generative approaches (Liu and Tuzel, 2016; Bousmalis et al., 2017; Sankaranarayanan et al., 2018; Russo et al., 2018) and discriminative approaches (Ganin et al., 2016; Tzeng et al., 2015, 2017). The discriminative DANN algorithm from Ganin et al. (2016) is frequently used in NLP tasks (Chen et al., 2018; Gui et al., 2017; Zhang et al., 2017; Fu et al., 2017; Joty et al., 2017; Xu and Yang, 2017). Our method builds upon Adversarial Discriminative Domain Adaptation (Tzeng et al., 2017), which was shown to outperform DANN in visual domain adaptation but has not been used in NLP tasks. The key differences between the two are discussed in Section 3. Qin et al. (2017) adopted adversarial strategies for supervised implicit discourse classification.
They train an adversarial model using implicit discourse relations with and without expert-inserted connectives. Note again that theirs is a fully supervised system using signals in addition to the implicit relation annotations themselves, while our main focus is unsupervised domain adaptation that does not train on implicit relations.

Model Architecture
To classify discourse relations, our system takes a pair of sentence arguments x as input, and outputs the discourse relation y between these two arguments. With unsupervised domain adaptation, we have labeled examples (X_s, Y_s) from the source domain, i.e., explicit discourse relations, and unlabeled examples (X_t) from the target domain, i.e., implicit discourse relations.
We use ADDA (Tzeng et al., 2017) as our underlying framework for domain adaptation. ADDA first learns a discriminative representation for the classification task in the source domain, then learns a representation for the target domain that mimics the distribution of the source domain. The key insight here is asymmetric mapping, where the target representation is "updated" until it matches the source, a process more similar to the original Generative Adversarial Networks (Goodfellow et al., 2014) than to joint training as in DANN (Ganin et al., 2016). Intuitively, since ADDA learns distinct feature encoders for the source and target domains instead of using a shared encoder, the same network does not have to handle instances from different domains.
Summarized in Figure 1, we first pre-train a source encoder M_s and source classifier C (Section 3.1), then train the target encoder M_t (initialized with M_s) and discriminator D in an adversarial way, to minimize the domain discrepancy between the target representation distribution M_t(X_t) and that of the source M_s(X_s) (Section 3.2). Eventually, the target feature space is trained to match the source, and the source classifier C can be directly used on the target domain.

Base encoder and classifier
The source and target encoders M_s and M_t follow the same architecture; M_t is initialized to be M_s during adaptation. The encoders encode relation arguments into latent representations, and then feed the representations into a classifier C to predict the discourse relation.

Encoder: The encoder generates a representation for each argument with an inner-attention BiLSTM (Yang et al., 2016) shared between the two arguments. The representations of the two arguments are then concatenated to form the final representation, shown in Figure 2.
Specifically, we encode each word in an argument into its word embedding, which is fed into a BiLSTM to obtain the concatenated hidden states h_i = [→h_i; ←h_i]; a fully-connected layer W_c on top of h_i yields the hidden representations z_i. We then apply an attention mechanism to induce a distribution of weights over all tokens in the argument; the final argument representation Arg is a weighted sum of z_i based on the attention weights α_i:

    z_i = tanh(W_c h_i)
    α_i = exp(u^T z_i) / Σ_j exp(u^T z_j)
    Arg = Σ_i α_i z_i        (1)

where u is a learned attention vector.

Classifier: The classifier consists of a single fully-connected layer on top of the encoder, finished with a softmax classification layer.
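The encoder described above can be sketched in PyTorch as follows; the hyperparameters match the paper's settings (300-dimensional embeddings, BiLSTM dimension 50, a 200-dimensional pair representation), but the class and parameter names, the tanh/softmax parameterization of the attention, and the vocabulary handling are our assumptions, not details specified by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InnerAttentionEncoder(nn.Module):
    """Sketch of the inner-attention BiLSTM argument-pair encoder."""

    def __init__(self, vocab_size=10000, emb_dim=300, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.W_c = nn.Linear(2 * hidden_dim, 2 * hidden_dim)  # z_i = tanh(W_c h_i)
        self.u = nn.Linear(2 * hidden_dim, 1, bias=False)     # attention scorer

    def encode_arg(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.bilstm(self.embed(tokens))       # (batch, seq, 2*hidden)
        z = torch.tanh(self.W_c(h))                  # hidden representations z_i
        alpha = F.softmax(self.u(z), dim=1)          # attention weights over tokens
        return (alpha * z).sum(dim=1)                # weighted sum -> Arg

    def forward(self, arg1, arg2):
        # Concatenate the two argument representations (dim 100 each -> 200).
        return torch.cat([self.encode_arg(arg1), self.encode_arg(arg2)], dim=-1)
```

A single fully-connected layer with softmax on top of this 200-dimensional output would serve as the classifier C.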
The source encoder M_s and the classifier C are trained using a standard supervised cross-entropy loss:

    L_cls(X_s, Y_s) = −E_{(x_s, y_s)} [ Σ_k 1[k = y_s] log C(M_s(x_s))_k ]        (2)

Unsupervised adversarial domain adaptation
We then learn a target encoder M_t to generate features for the target data that can be classified with classifier C, without assuming labels Y_t in the target domain. This is achieved by adversarially training a domain discriminator D, which classifies whether a feature comes from the source or the target domain, against the target encoder M_t, which produces features similar to the source domain features and tries to fool the discriminator into predicting the incorrect domain label. The discriminator D is optimized according to a standard supervised loss:

    L_D(X_s, X_t) = −E_{x_s} [ log D(M_s(x_s)) ] − E_{x_t} [ log(1 − D(M_t(x_t))) ]        (3)

D consists of two fully-connected layers on top of the encoder, finished with a softmax classification layer.
The target encoder M_t is optimized according to a standard GAN loss with inverted labels:

    L_Mt(X_t) = −E_{x_t} [ log D(M_t(x_t)) ]        (4)

Spectral normalization: To stabilize the training of the discriminator, we employ spectral normalization, a weight normalization technique (Miyato et al., 2018) that controls the Lipschitz constant of the discriminator function by constraining the spectral norm of each layer. Spectral normalization is easy to implement without tuning any hyper-parameters and adds only a small computational cost.
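A minimal sketch of the discriminator with spectral normalization and the two adversarial losses, Eq. (3) and Eq. (4). The hidden sizes (200, 200) follow the paper; the exact placement of spectral normalization, the ReLU activations, and the use of a two-logit cross-entropy formulation are our assumptions.

```python
import torch
import torch.nn as nn

def make_discriminator(feat_dim=200, hidden=200):
    """Two fully-connected hidden layers with spectral normalization,
    followed by a source-vs-target classification layer."""
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Linear(feat_dim, hidden)), nn.ReLU(),
        sn(nn.Linear(hidden, hidden)), nn.ReLU(),
        nn.Linear(hidden, 2),  # logits for source (1) vs. target (0)
    )

ce = nn.CrossEntropyLoss()

def discriminator_loss(D, feat_s, feat_t):
    # Eq. (3): D should label source features 1 and target features 0.
    ones = torch.ones(feat_s.size(0), dtype=torch.long)
    zeros = torch.zeros(feat_t.size(0), dtype=torch.long)
    return ce(D(feat_s.detach()), ones) + ce(D(feat_t.detach()), zeros)

def target_encoder_loss(D, feat_t):
    # Eq. (4): GAN loss with inverted labels -- M_t tries to make
    # target features look like source features to D.
    ones = torch.ones(feat_t.size(0), dtype=torch.long)
    return ce(D(feat_t), ones)
```

`nn.utils.spectral_norm` divides each weight matrix by an estimate of its largest singular value on every forward pass, which is what constrains the layer-wise Lipschitz constant.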
Label smoothing: We utilize label smoothing (Szegedy et al., 2016) to regularize the classifier during pre-training, which prevents the largest logit from becoming much larger than all others, and therefore prevents overfitting and makes the classifier, trained in the source domain, more adaptable.
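A minimal sketch of cross-entropy with label smoothing as used here; the function name and implementation are ours, but ε = 0.1 matches the paper's setting.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps=0.1):
    """Cross-entropy where the one-hot ground-truth distribution is mixed
    with a uniform distribution over the K classes (Szegedy et al., 2016)."""
    K = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    nll = -log_p.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # -log p(y)
    uniform = -log_p.mean(dim=-1)            # -sum_k (1/K) log p(k)
    return ((1 - eps) * nll + eps * uniform).mean()
```

With eps=0 this reduces to the standard cross-entropy; with eps>0 the loss penalizes over-confident logit gaps.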
For a source domain training example x_s with ground-truth label y_s and ground-truth distribution q(k|x_s), the classifier computes the classification probability over relation classes as p(k|x_s) for k ∈ {1...K}. With label smoothing, we replace the ground-truth label distribution q(k|x_s) in the standard cross-entropy loss with a linear combination of q(k|x_s) and a uniform distribution u(k) = 1/K over classes:

    L'_cls(x_s, y_s) = −Σ_k [ (1 − ε) q(k|x_s) + ε u(k) ] log p(k|x_s)        (5)

Reconstruction loss: In order to classify the target representations using the source classifier, the target encoder is trained to produce representations that mimic the source domain representations in the adversarial training stage. Since there is no supervised loss applied in this stage, the target encoder may lose its ability to produce discriminative features that are helpful during classification. We propose a reconstruction loss to preserve the discriminability of the target encoder while adversarially adapting its features.
Since we initialize the target encoder with the source encoder, the initial representation (before domain adaptation) of a target instance x_t is the representation produced by the source encoder, M_s(x_t) (which is then fixed). After training, M_t(x_t) adapts to the source domain and becomes dissimilar to M_s(x_t). The reconstruction loss encourages the target encoder to produce features that can be reconstructed back to M_s(x_t) (Figure 3).
For a target example x_t, we learn a reconstruction mapping M_r that maps the target representation M_t(x_t) back to M_s(x_t):

    x̂_t = M_r(M_t(x_t))        (6)

The target encoder M_t and the reconstruction mapping M_r are optimized jointly with a reconstruction loss:

    L_rec(X_t) = E_{x_t} || M_r(M_t(x_t)) − M_s(x_t) ||²        (7)

M_r consists of three fully-connected layers on top of the encoder.
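A sketch of the reconstruction mapping and Eq. (7). The 120-15-120 hidden sizes follow the paper's configuration; the ReLU activations and the final projection back to the feature dimension are our assumptions.

```python
import torch
import torch.nn as nn

def make_reconstructor(feat_dim=200):
    """Reconstruction mapping M_r: three hidden layers (120, 15, 120)
    forming a bottleneck, projected back to the feature dimension."""
    return nn.Sequential(
        nn.Linear(feat_dim, 120), nn.ReLU(),
        nn.Linear(120, 15), nn.ReLU(),
        nn.Linear(15, 120), nn.ReLU(),
        nn.Linear(120, feat_dim),
    )

def reconstruction_loss(M_r, feat_t_adapted, feat_t_initial):
    # Eq. (7): reconstruct the frozen pre-adaptation representation
    # M_s(x_t) from the adapted representation M_t(x_t).
    return torch.mean((M_r(feat_t_adapted) - feat_t_initial.detach()) ** 2)
```

The `.detach()` reflects the text's statement that M_s(x_t) is fixed during adaptation: gradients flow into M_t and M_r, never back into the source encoder.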
Unsupervised objective: For unsupervised domain adaptation, our full objective combines Eq. (3), Eq. (4), and Eq. (7):

    L_unsup = L_D(X_s, X_t) + L_Mt(X_t) + L_rec(X_t)        (8)

Finally, we test the model using the target encoder M_t and classifier C. The steps in the Repeat loop of Algorithm 1 (lines 4 and 5) each execute once per iteration, so the model is optimized in alternating two-step units.

Unsupervised Domain Adaptation Experiments
We first evaluate our model for the default task: unsupervised domain adaptation from explicit discourse relations to implicit discourse relations.

Settings
Data: We train and test our model on the PDTB, following the experimental setup of Ji et al. (2015).

We use GloVe (Pennington et al., 2014) word embeddings with dimension 300. The maximum argument length is set to 80. The encoder contains an inner-attention BiLSTM with dimension 50, producing a representation of dimension 200 for each example. The discriminator D consists of 2 hidden layers with 200 neurons each. The reconstruction mapping M_r contains 3 hidden layers with 120, 15 and 120 neurons. The label smoothing parameter is 0.1. We use Adam (Kingma and Ba, 2015) with learning rate 1e-4 for the base encoder and classifier, and 1e-6 for the adversarial domain adapter. We use the SGD optimizer with learning rate 1e-2 for the reconstruction component. All models were implemented in PyTorch (Paszke et al., 2017) and adapted from Conneau et al. (2017).

Systems
We experiment with three settings:

Implicit → Implicit: A supervised implicit discourse relation classifier using the base encoder and classifier, optimizing the standard cross-entropy loss on the full implicit training set; this serves as a supervised reference point.

Explicit → Implicit: The base encoder and classifier trained on explicit relations only and applied directly to implicit relations, i.e., no domain adaptation.

Our full adaptation system, as described in Section 3.

For benchmarking, we also train an unsupervised domain adaptation system using DANN (Ganin et al., 2016), which jointly learns domain-invariant representations and the classifier and is often used in NLP (cf. Section 2). We use the same encoder, classifier and discriminator structures, with parameters tuned on the implicit development data. The system is optimized using Adam with learning rate 2e-4 and adaptation parameter 0.25, chosen between 0.01 and 1 on a logarithmic scale.
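DANN's joint training hinges on a gradient reversal layer between the encoder and the domain discriminator. A minimal sketch, with the adaptation parameter λ = 0.25 matching the value tuned in our DANN baseline:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (Ganin et al., 2016): identity on the
    forward pass; multiplies gradients by -lambda on the backward pass,
    so the shared encoder learns to fool the domain discriminator while
    the discriminator itself is trained normally."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.25):
    return GradReverse.apply(x, lam)
```

This is the key mechanical contrast with ADDA: DANN pushes an inverted gradient through one shared encoder, while ADDA trains a separate target encoder against a fixed source in two stages.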

Results
To evaluate our model, we train four-way classifiers and report per-class and macro F1 scores. Table 2 tabulates the experimental results for unsupervised domain adaptation.
We also show reported results from Ji et al. (2015). Even though they trained four binary classifiers (instead of doing multi-class classification), it is the only prior work exploring unsupervised domain adaptation for implicit discourse relation classification. We include two settings: their best system with labeled data from PDTB explicit relations only (and an implicit development set), and their system with additional weak supervision from non-PDTB sources.

Our full system achieves the best average F1 measure, a 9.53% absolute increase over Explicit → Implicit. It also performs 2.28% better than Ji et al. (2015)'s model trained without weak supervision, and 1.44% better than their model trained with weak supervision. The full system achieved an average F1 comparable to the supervised Implicit → Implicit, while Ji et al. (2015)'s models did not. Compared with DANN, our system achieved superior performance for 3 of the 4 relations, showing that training target representations and the classifier in two stages outperforms doing both jointly.
The largest improvements over the Explicit → Implicit baseline are on Temporal (from 22.22 to 31.25) and Contingency (from 22.35 to 48.04) relations. Our system performs ∼11% better on Temporal, and ∼6% better on Contingency, than Ji et al. (2015)'s binary classifiers. The Comparison and Expansion relations improved by about 2% over the baseline, a smaller improvement compared to the other two relations. Our Comparison performance is comparable with Ji et al. (2015)'s model without weak supervision.
Notably, the performance for Expansion dropped after domain adaptation (without extensions) by about 5%. We suspect that this is because the proportion of Expansion among the relations differs greatly between the two domains (33% for explicit vs. 53% for implicit, cf. Table 1). By applying spectral normalization, the performance improved and surpassed Explicit → Implicit.
Component-wise, spectral normalization helps two of the four relations (Contingency and Expansion), but hurts the performance of Comparison. Label smoothing improves performance for all relations except Expansion; applying the reconstruction loss improves performance for all relations except Temporal. Overall, the best result on this task comes from incorporating all components.
Error analysis: Figure 4 shows the normalized confusion matrices before and after unsupervised domain adaptation. Before adaptation, Temporal and Contingency relations are often misclassified as Expansion, which is substantially improved after adaptation. The improvement in F score for Comparison is milder due to lower precision and higher recall, which is also reflected in the matrices. Finally, the drop in performance for Expansion can be traced to increased confusion between Expansion and Contingency.

How about a little supervision?
We have so far presented an unsupervised domain adaptation system that is not trained on any labels Y_t in the target domain. However, it is sometimes feasible to obtain some seed annotation that can be used to improve prediction. Hence we extend the model with an optional supervised component. We evaluate this extension by gradually adding labeled examples of implicit discourse relations, simulating situations where different numbers of labeled examples are available.

Incorporating supervision
We extend the model with a supervised component, where a subset X_t^L ⊆ X_t has labels Y_t^L. Illustrated in Figure 5, we jointly optimize the target encoder M_t and the classifier C according to an additional supervised loss:

    L_sup(X_t^L, Y_t^L) = −E_{(x_t, y_t)} [ Σ_k 1[k = y_t] log C(M_t(x_t))_k ]        (9)

Effectively, we encourage the target encoder to extract more discriminative features for all target examples (X_t), while learning target domain representations close to the source.
The full objective incorporates supervision from the in-domain labels by adding L_sup(X_t^L, Y_t^L) to the unsupervised objective:

    L_full = L_unsup + L_sup(X_t^L, Y_t^L)        (10)

Data and settings
We synthesize the labeled target subset (X_t^L, Y_t^L), X_t^L ⊆ X_t, by randomly extracting subsets from the implicit training set together with their labels. The sizes of this subset range from 1382 to 13824 with a step size of 1382. Note that we use the entire implicit training set (X_t without relation labels Y_t) as unlabeled target-domain data in the adversarial adaptation process; the sampled labeled data is used in the supervised component only. We use the same hyper-parameters as in the unsupervised experiment, except that we tune the learning rate on the implicit development set.

Systems
We compare three settings:

Supervised baseline: The encoder and classifier trained on the sampled implicit instances (X_t^L, Y_t^L), optimizing Eq. (9).

Pre-training baseline: Our model with the supervised component, but without the domain adaptation component. This setting is equivalent to pre-training on the explicit instances and then fine-tuning on the sampled implicit instances. It is trained on the explicit training set (X_s, Y_s), plus the sampled implicit instances (X_t^L, Y_t^L), optimizing Eq. (5) and Eq. (9).

Semi-supervised domain adaptation: Our full model with both the supervised and adaptation components, optimizing Eq. (10). The supervised component uses the sampled implicit instances (X_t^L, Y_t^L) for training.

Results
Since the added training data is randomly sampled, we average the performance across 3 different runs. Figure 6 shows the average F1 measure (y-axis) of the three supervised systems above, with varying numbers of labeled implicit relation training examples (x-axis). Standard errors are also shown in the graph.
Our full system outperforms both the supervised baseline and the pre-training baseline, regardless of the amount of labeled target data. This evaluation also reveals that the pre-training baseline improves upon the supervised baseline across the board, meaning that the performance of implicit relation classification can be improved by pre-training on explicit relations.
Finally, the macro F1 of our system using full supervision is 47.50. Since we focus on domain adaptation and use very simple encoders, we do not attempt to achieve the state of the art (e.g., Dai and Huang (2018), Bai and Zhao (2018)). However, this performance is on par with many recent works using multi-task learning or GANs, including Lan et al. (2017) (47.80), Qin et al. (2017) (44.38, reproduced results on four-way classification), and Liu et al. (2016) (44.98). These results confirm that our framework generalizes well with respect to the amount of supervision in the target domain.

Conclusion
Our work tackles implicit discourse relation classification in a low-resource setting that is flexible to the amount of supervision. We present a new system based on the adversarial discriminative domain adaptation framework (Tzeng et al., 2017) for unsupervised domain adaptation from explicit to implicit discourse relations. We propose a reconstruction loss to preserve the discriminability of features during adaptation, and we generalize the framework to make use of possibly available seed data by jointly optimizing it with a supervised loss. Our system outperforms prior work and strong adversarial baselines on unsupervised domain adaptation, and works effectively with varying amounts of supervision.

Figure 1: The framework of our proposed adversarial domain adaptation model, containing the pre-training stage, the adversarial adaptation stage, and the testing stage. The dashed box shows the supervised component.

Figure 2: Neural structure of the inner-attention BiLSTM to encode relation argument pairs.

Figure 3: The reconstruction loss component augmenting our unsupervised adversarial domain adaptor.
As summarized in Algorithm 1, the training procedure consists of three stages: pre-training, adversarial adaptation, and testing. During pre-training, we train the source encoder M_s and the classifier C according to Eq. (2). In the adversarial adaptation stage, we alternately train the discriminator D, target encoder M_t, and reconstruction mapping M_r.

Algorithm 1: Adversarial Adaptation
Input: explicit sentences with labels {x_s, y_s}; implicit sentences without labels {x_t}
Notation: source encoder M_s, classifier C, target encoder M_t, reconstruction mapping M_r, domain discriminator D
1  Train M_s, C through Eq. (2) with {x_s, y_s};
2  Initialize M_t as M_s;
3  Repeat:
4    Train D through Eq. (3) with {x_s, x_t}, and train M_t through Eq. (4) with {x_t};
5    Train M_t, M_r through Eq. (7) with {x_t};
Output: M_t and C for relation prediction

Figure 4: Normalized confusion matrices before and after unsupervised domain adaptation.

Figure 5: An extension component to incorporate supervision into our unsupervised adversarial domain adaptor.

Figure 6: The average F1 (%) with varying numbers of labeled implicit relation training examples.
Model configuration: The hyperparameters, as well as the number of fully-connected layers for the classifier C, discriminator D and the reconstruction mapping M_r, are all set according to performance on the development sets. We first set the hyper-parameters of the encoders M_s, M_t and classifier C based on development performance during the pre-training stage. Then, we set the hyper-parameters of D and M_r based on development performance in the adaptation stage.

Table 2: Per-class and macro average F1 (%) of unsupervised domain adaptation from explicit to implicit relations.