Next Sentence Prediction helps Implicit Discourse Relation Classification within and across Domains

Implicit discourse relation classification is one of the most difficult tasks in discourse parsing. Previous studies have generally focused on extracting better representations of the relational arguments. To solve the task, however, it is additionally necessary to capture what events are expected to cause or follow each other. Current discourse relation classifiers fall short in this respect. We show here that this shortcoming can be effectively addressed by using the bidirectional encoder representations from transformers (BERT) proposed by Devlin et al. (2019), which were trained on a next-sentence prediction task and thus encode a representation of likely next sentences. The BERT-based model outperforms the current state of the art in 11-way classification by 8% points on the standard PDTB dataset. Our experiments also demonstrate that the model can be successfully ported to other domains: on the BioDRB dataset, the model outperforms the state-of-the-art system by around 15% points.


Introduction
Discourse relation classification has been shown to be beneficial to multiple downstream NLP tasks such as machine translation (Li et al., 2014), question answering (Jansen et al., 2014) and summarization (Yoshida et al., 2014). Following the release of the Penn Discourse Tree Bank (Prasad et al., 2008, PDTB), discourse relation classification has received a lot of attention from the NLP community, including two CoNLL shared tasks (Xue et al., 2015, 2016). Discourse relations in texts are sometimes marked with an explicit connective (e.g., but, because, however), but these explicit signals are often absent. With explicit connectives acting as informative cues, it is relatively easy to classify the discourse relation with high accuracy (93.09% on four-way classification in Pitler et al. (2008)).
When there is no connective, classification has to rely on semantic information from the relational arguments. This task is very challenging, with state-of-the-art systems achieving an accuracy of only 45% to 48% on 11-way classification. Consider Example 1. In order to correctly classify the relation, it is necessary to understand that Arg1 raises the expectation that the next discourse segment may provide an explanation for why the venture wasn't good (e.g., that it was risky), and Arg2 contrasts with this discourse expectation. More generally, this means that a successful discourse relation classification model would have to be able to learn typical temporal event sequences, reasons, consequences etc. for all kinds of events. Statistical approaches attempted to address this intuition by using word pairs from the two arguments as features (Lin et al., 2009; Park and Cardie, 2012; Biran and McKeown, 2013; Rutherford and Xue, 2014), so that models could, for instance, learn to recognize antonym relations between words in the two arguments.
Recent models exploit such similarity relations between the two arguments, as well as simpler surface features that occur in one relational argument and correlate with specific coherence relations (e.g., the presence of negation, temporal expressions etc. may give hints as to what coherence relation may be present, see Park and Cardie (2012); Asr and Demberg (2015)). However, relations between arguments are often a lot more diverse than simple contrasts that can be captured through antonyms, and may rely on world knowledge (Kishimoto et al., 2018). It is hence clear that one cannot learn all these diverse relations from the very small amounts of available training data. Instead, we would have to learn a more general representation of discourse expectations.
Many recent discourse relation classification approaches have focused on cross-lingual data augmentation, or on training models to better represent the relational arguments by using various neural network models, including feed-forward networks (Rutherford et al., 2017), convolutional neural networks (Zhang et al., 2015), recurrent neural networks (Ji et al., 2016; Bai and Zhao, 2018) and character-based models (Qin et al., 2016), or on formulating relation classification as an adversarial task (Qin et al., 2017). These models typically use pre-trained semantic embeddings generated from language modeling tasks, such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018).
However, previously proposed neural models still crucially lack a representation of the typical relations between sentences: to solve the task properly, a model should ideally be able to form discourse expectations, i.e., to represent the typical causes, consequences, next events or contrasts to a given event described in one relational argument, and then assess the content of the second relational argument with respect to these expectations (see Example 1). Previous models would have to learn these relations only from the annotated training data, which is much too sparse for learning all possible relations between all events, states or claims.
The recently proposed BERT model (Devlin et al., 2019) takes a promising step towards addressing this problem: the BERT representations are trained using a language modelling and, crucially, a "next sentence prediction" task, where the model is presented with the actual next sentence vs. a different sentence and needs to select the original next sentence. We believe it is a good fit for discourse relation recognition, since the task allows the model to represent what a typical next sentence would look like.
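The construction of next sentence prediction examples described above can be illustrated with a minimal sketch. It assumes an ordered list of at least three distinct sentences and a 50/50 mix of true and random continuations, as in the original BERT setup; the function name `make_nsp_pairs` is hypothetical and not part of any released code.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences.

    With probability 0.5 sentence_b is the actual next sentence (label 1);
    otherwise it is a randomly chosen other sentence (label 0).
    Assumes at least three sentences so a negative candidate always exists.
    """
    if len(sentences) < 3:
        raise ValueError("need at least three sentences")
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the original following sentence.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any position except the current one
            # and the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            j = rng.choice(candidates)
            pairs.append((sentences[i], sentences[j], 0))
    return pairs


if __name__ == "__main__":
    toy = ["s0", "s1", "s2", "s3", "s4"]  # placeholder sentences
    for a, b, label in make_nsp_pairs(toy, random.Random(13)):
        print(a, b, label)
```

A model trained on such pairs has to judge whether the second sentence is a plausible continuation of the first, which is the property the argument above relies on.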
In this paper, we show that a BERT-based model outperforms the current state of the art by 8% points in 11-way implicit discourse relation classification on PDTB. We also show that, after being pre-trained on a small amount of cross-domain data, the model can be easily transferred to a new domain: it achieves an accuracy gain of around 16% on BioDRB compared to the state-of-the-art model. We also show that the Next Sentence Prediction task plays an important role in these improvements.
Devlin et al. (2019) proposed the Bidirectional Encoder Representation from Transformers (BERT), which is designed to pre-train a deep bidirectional representation by jointly conditioning on both left and right contexts. BERT is trained using two novel unsupervised prediction tasks: Masked Language Modeling and Next Sentence Prediction (NSP). The NSP task is formulated as a binary classification task: the model is trained to distinguish the original following sentence from a randomly chosen sentence from the corpus. It has proved very helpful in multiple NLP tasks, especially inference tasks. The resulting BERT representations thus encode a representation of upcoming discourse content, and hence contain discourse expectation representations which, as we argued above, are required for classifying coherence relations.
Figure: "[CLS]" is the special classification embedding, while "C" is the same as "[CLS]" in pre-training but the ground-truth label in fine-tuning.
In the experiments, we used the uncased base model (see footnote 1) provided by Devlin et al. (2019), which is trained on BooksCorpus and English Wikipedia with 3300M tokens in total.
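For fine-tuning, the two relational arguments can be packed into a single BERT input in the same way as a sentence pair in NSP pre-training. The sketch below shows one plausible layout; it assumes whitespace tokens as a stand-in for WordPiece tokens, and the function name `pack_argument_pair` and the default `max_len` are hypothetical choices, not settings reported in this paper.

```python
def pack_argument_pair(arg1_tokens, arg2_tokens, max_len=128):
    """Pack two arguments as [CLS] arg1 [SEP] arg2 [SEP] with segment ids.

    Segment id 0 covers [CLS], Arg1 and the first [SEP];
    segment id 1 covers Arg2 and the final [SEP].
    """
    if max_len < 3:
        raise ValueError("max_len must leave room for the three special tokens")
    a, b = list(arg1_tokens), list(arg2_tokens)
    # Trim the longer argument one token at a time until the pair fits.
    while len(a) + len(b) > max_len - 3:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    # Placeholder arguments, not taken from the corpus.
    toks, segs = pack_argument_pair(["first", "argument"], ["second", "argument", "text"])
    print(toks)
    print(segs)
```

The final hidden state at the "[CLS]" position is then fed to a classification layer over the relation labels.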

Evaluation on PDTB
We used the Penn Discourse Tree Bank (Prasad et al., 2008), the largest available manually annotated discourse corpus. It provides a three-level hierarchy of relation tags. Following the experimental settings and evaluation metrics in Bai and Zhao (2018), we use the two most common splits of the PDTB data: PDTB-Lin (Lin et al., 2009), which uses sections 2-21, 22 and 23 as training, validation and test sets, and PDTB-Ji (Ji and Eisenstein, 2015), which uses sections 2-20, 0-1 and 21-22 as training, validation and test sets. We report the overall accuracy. In addition, we performed 10-fold cross-validation over sections 0-22, as promoted in previous work. We also follow the standard in the literature and formulate the task as an 11-way classification task. Results are presented in Table 1. We evaluated three versions of the BERT-based model. All of our BERT models use the pre-trained representations and are fine-tuned on the PDTB training data. The version marked as "BERT" does not do any additional pre-training. BERT+WSJ additionally performs further pre-training on the parts of the Wall Street Journal corpus that do not have discourse relation annotation. The model version "BERT+WSJ w/o NSP" also performs pre-training on the WSJ corpus, but only uses the Masked Language Modelling task, not the Next Sentence Prediction task, in this pre-training. We added this variant to measure the benefit of in-domain NSP on discourse relation classification (note, though, that the downloaded pre-trained BERT model includes the NSP task in its original pre-training).
Footnote 1: https://github.com/google-research/bert#pre-trained-models
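The two section-based splits described above can be written down as a small lookup. This is only an illustrative sketch; the names `SPLITS` and `split_of` are hypothetical, and sections that a scheme does not use map to `None`.

```python
# WSJ section ranges for each split, as stated in the text above.
SPLITS = {
    "PDTB-Lin": {"train": range(2, 22), "dev": range(22, 23), "test": range(23, 24)},
    "PDTB-Ji": {"train": range(2, 21), "dev": range(0, 2), "test": range(21, 23)},
}


def split_of(section, scheme):
    """Return 'train', 'dev' or 'test' for a WSJ section, or None if unused."""
    for name, sections in SPLITS[scheme].items():
        if section in sections:
            return name
    return None


if __name__ == "__main__":
    for sec in (0, 21, 22, 23):
        print(sec, split_of(sec, "PDTB-Lin"), split_of(sec, "PDTB-Ji"))
```

Section 21, for example, belongs to the training set under PDTB-Lin but to the test set under PDTB-Ji, which is one reason why results on the two splits are not directly comparable.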
We compared the results with four state-of-the-art systems. Cai and Zhao (2017) proposed a model that takes a step towards calculating discourse expectations by using attention over an encoding of the first argument to generate the representation of the second argument, and then learning a classifier based on the concatenation of the encodings of the two discourse relation arguments. Kishimoto et al. (2018) fed external world knowledge (ConceptNet relations and coreferences) explicitly into MAGE-GRU (Dhingra et al., 2017) and achieved improvements compared to only using the relational arguments. However, we show here that it works even better when this knowledge is learned implicitly through the next sentence prediction task. Shi and Demberg (2019) used a seq2seq model that learns better argument representations by being trained to make the implicit connective explicit. In addition, their classifier uses a memory network that is intended to help remember similar argument pairs encountered during training. The current best performance was achieved by Bai and Zhao (2018), who combined representations from embeddings of different granularity, including contextualized word vectors from ELMo (Peters et al., 2018), which have proved very helpful. In addition, we compared our results with a simple bidirectional LSTM network with pre-trained word embeddings from Word2Vec.
We can see that in all settings, the model using BERT representations outperformed all existing systems by a substantial margin. It obtained improvements of 7.3% points on PDTB-Lin and 5.5% points on PDTB-Ji compared with the ELMo-based method proposed by Bai and Zhao (2018). Moreover, the BERT model outperformed Shi and Demberg (2019) on cross-validation by around 8%, with significance of p<0.01. The significance test was performed by estimating the variance of the model from its performance on the different folds in cross-validation (paired t-test). For the Lin and Ji evaluations, we estimated the variance due to random initialization by running each model 5 times and calculating the likelihood that the state-of-the-art result would come from that distribution.
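The paired t-test over cross-validation folds mentioned above can be sketched with the standard library alone. The fold accuracies below are made-up placeholders, not results from this paper, and the function name `paired_t_statistic` is hypothetical; the critical value 3.250 is the two-sided p<0.01 threshold of the t distribution with 9 degrees of freedom (10 folds).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two systems."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all fold differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical per-fold accuracies (%) for two systems on the same 10 folds.
    system_a = [53.1, 52.4, 54.0, 51.8, 53.5, 52.9, 54.2, 52.0, 53.3, 52.7]
    system_b = [45.2, 44.8, 46.1, 44.0, 45.9, 45.5, 46.3, 44.4, 45.0, 45.1]
    t = paired_t_statistic(system_a, system_b)
    T_CRIT_DF9_P01 = 3.250  # two-sided critical value for df = 9, alpha = 0.01
    print("t =", round(t, 2), "significant:", abs(t) > T_CRIT_DF9_P01)
```

Pairing the scores by fold removes the variance that comes from some folds simply being harder than others, so only the difference between the systems is tested.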

Evaluation On BioDRB
The Biomedical Discourse Relation Bank (Prasad et al., 2011) also follows PDTB-style annotation. It is a corpus annotated over 24 open-access full-text articles from the GENIA corpus (Kim et al., 2003) in the biomedical domain. Compared with PDTB, some new discourse relations and changes have been introduced in the annotation of BioDRB. In order to make the results comparable, we preprocessed the BioDRB annotations to map the relations to the PDTB ones, following the instructions in Prasad et al. (2011).
The biomedical domain is very different from the WSJ or the data on which the BERT model was trained. BioDRB contains many specialized words and phrases that are extremely hard to model. To test the ability of the BERT model on cross-domain data, we fine-tuned on PDTB and tested on BioDRB. We also tested the state-of-the-art model for implicit discourse relation classification proposed by Bai and Zhao (2018) on BioDRB. From Table 2, we can see that the BERT base model achieved an improvement of almost 12% points over the Bi-LSTM baseline and 15% points over Bai and Zhao (2018). When fine-tuned on in-domain data in the cross-validation setting, the improvement increases to around 17% points.
Table 2: Accuracy (%) on BioDRB level 2 relations with different settings. Cross-Domain means trained on PDTB and tested on BioDRB. For the In-Domain setting, we used 5-fold cross-validation and report average accuracy. Numbers in bold are significantly better than the state-of-the-art system with p<0.01, and numbers with * denote significant improvements over BERT + GENIA w/o NSP with p<0.01.
It is also interesting to know whether the performance of the BERT model can be improved by additional pre-training on in-domain data. BioBert continues pre-training BERT on biomedical texts, including the PubMed and PMC corpora (around 18B tokens), and achieved the best results in the in-domain setting. Similarly, BERT+GENIA refers to a model in which the downloaded BERT representations are further pre-trained on the parts of the GENIA corpus which consists of 18k sentences and is not annotated with coherence relations. Evaluation shows that this in-domain pre-training yields another 3% point improvement; our tests also show that the NSP task again plays a substantial role in the improvement. We believe that the gains from further pre-training on GENIA for the biomedical domain are higher than those from pre-training on WSJ for PDTB because the domain difference between the BooksCorpus and the biomedical domain is larger.
Currently, there are few published results on BioDRB for implicit discourse relation classification that we can compare with. We compared the BERT model with the naïve Bayes and MaxEnt methods proposed in Xu et al. (2012) on one-versus-all binary classification. We followed the settings in Xu et al. (2012) and used two articles ("GENIA 1421503", "GENIA 1513057") for testing and one article ("GENIA 111020") for validation. During training, we employed down-sampling or up-sampling to keep the numbers of positive and negative samples for each relation consistent. The BERT base model achieved 43.03% average F1 score and 77.34% average accuracy in one-versus-all level-1 classification. Compared with the current state-of-the-art performance [...]
Table 3: F1 score (Accuracy) of binary classification on level 1 implicit relations in BioDRB.
Table 4: Precision, Recall and F1 score for each level-2 relation on the PDTB-Lin setting and BioDRB with "BERT + WSJ/GENIA" systems w/ and w/o NSP. "-" indicates 0.00 and "C." means the number of instances of each relation in the test set.
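The one-versus-all setup with balanced positive and negative samples described above can be sketched as follows. This is a minimal illustration under assumptions: the function names `one_vs_all` and `balance` and the `mode` parameter are hypothetical, examples are (text, relation) pairs, and the exact sampling procedure used in the experiments is not specified in the text.

```python
import random


def one_vs_all(examples, target):
    """Relabel (text, relation) pairs as 1 for the target relation, else 0."""
    return [(text, 1 if relation == target else 0) for text, relation in examples]


def balance(binary_examples, rng, mode="down"):
    """Equalise class sizes by down-sampling the majority class or
    up-sampling (with replacement) the minority class."""
    pos = [e for e in binary_examples if e[1] == 1]
    neg = [e for e in binary_examples if e[1] == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if mode == "down":
        majority = rng.sample(majority, len(minority))
    elif mode == "up":
        minority = minority + rng.choices(minority, k=len(majority) - len(minority))
    else:
        raise ValueError("mode must be 'down' or 'up'")
    balanced = minority + majority
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Placeholder training pairs, not taken from BioDRB.
    data = [("t%d" % i, "Temporal") for i in range(2)]
    data += [("e%d" % i, "Expansion") for i in range(6)]
    binary = one_vs_all(data, "Temporal")
    print(len(balance(binary, random.Random(0), mode="down")))
    print(len(balance(binary, random.Random(0), mode="up")))
```

Down-sampling discards majority-class examples, while up-sampling repeats minority-class examples; which one is preferable depends on how rare the target relation is.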

Discussion
The use of the BERT model in this paper was motivated primarily by the use of the next-sentence prediction task during training. The results in Table 1 and Table 2 confirm that removing the "Next Sentence Prediction" task hurts performance on both PDTB and BioDRB. To gain better insight into which relations benefit from the NSP task, we also report the detailed performance for each relation with and without it in BERT. As illustrated in Table 4, performance on relations such as Temporal.Synchrony, Comparison.Contrast, Expansion.Conjunction and Expansion.Alternative improves by a large margin. This shows that representing the likely upcoming sentence helps the model form discourse expectations, which the classifier can then use to predict the coherence relation between the actually observed arguments.
However, compared with BERT+GENIA, the results of BioBert in Table 2 show that even a large amount of in-domain data for pre-training is of limited help in learning domain-specific representations. We therefore believe that the model could be further improved by including external domain-specific knowledge from an ontology (as in Kishimoto et al. (2018)) or a causal graph for biomedical concepts and events.
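The per-relation precision, recall and F1 scores discussed in connection with Table 4 follow the usual definitions and can be computed as in the sketch below. The gold and predicted labels are made-up placeholders, and the function name `per_relation_prf` is hypothetical; a score whose denominator is zero is reported as 0.0.

```python
from collections import Counter


def per_relation_prf(gold, pred):
    """Return {relation: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted relation p, but it was wrong
            fn[g] += 1  # gold relation g was missed
    scores = {}
    for rel in sorted(set(gold) | set(pred)):
        predicted = tp[rel] + fp[rel]
        actual = tp[rel] + fn[rel]
        precision = tp[rel] / predicted if predicted else 0.0
        recall = tp[rel] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[rel] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    # Hypothetical labels for four test instances.
    gold = ["Comparison.Contrast", "Comparison.Contrast",
            "Expansion.Conjunction", "Expansion.Conjunction"]
    pred = ["Comparison.Contrast", "Expansion.Conjunction",
            "Expansion.Conjunction", "Expansion.Conjunction"]
    for rel, (p, r, f) in per_relation_prf(gold, pred).items():
        print(rel, round(p, 2), round(r, 2), round(f, 2))
```

Comparing these scores for a model with and without NSP, relation by relation, shows where the additional pre-training objective makes a difference.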

Conclusion and Future work
In this paper, we show that BERT is very good at encoding the semantic relationship between sentences, thanks to the "next sentence prediction" task used in its pre-training. It significantly outperformed the current state-of-the-art systems by a substantial margin on both in-domain and cross-domain data. Our results also indicate that the next-sentence prediction task during training indeed plays a role in this improvement. Future work should explore the joint representation of discourse expectations through implicit representations that are learned during training and the inclusion of external knowledge. In addition, Yang et al. (2019) showed that NSP only helps tasks with longer texts. It would be interesting to see whether it has the same effect on the implicit discourse relation classification task; we leave this to future work.