Implicit Discourse Relation Classification: We Need to Talk about Evaluation

Implicit relation classification on the Penn Discourse TreeBank (PDTB) 2.0 is a common benchmark task for evaluating the understanding of discourse relations. However, a lack of consistency in preprocessing and evaluation poses challenges to fair comparison of results in the literature. In this work, we highlight these inconsistencies and propose an improved evaluation protocol. Paired with this protocol, we report strong baseline results from pretrained sentence encoders, which set a new state of the art for PDTB 2.0. Furthermore, this work is the first to explore fine-grained relation classification on PDTB 3.0. We expect our work to serve as a point of comparison for future work, and as a starting point for discussing models of larger context and possible data augmentations for downstream transferability.


Introduction
Understanding discourse relations in natural language text is crucial to end tasks involving larger context, such as question answering (Jansen et al., 2014) and conversational systems grounded on documents (Saeidi et al., 2018; Feng et al., 2020). One way to characterize discourse is through relations between two spans or arguments (ARG1/ARG2), as in the Penn Discourse TreeBank (PDTB) (Prasad et al., 2008, 2019). For instance:

[Arg1 I live in this world,] [Arg2 assuming that there is no morality, God or police.] (wsj_0790)
Label: EXPANSION.MANNER.ARG2-AS-MANNER

The literature has focused on implicit discourse relations from PDTB 2.0 (Pitler et al., 2009; Lin et al., 2009), on which deep learning has yielded substantial performance gains (Chen et al., 2016; Liu and Li, 2016; Lan et al., 2017; Qin et al., 2017; Bai and Zhao, 2018; Nguyen et al., 2019, i.a.). However, inconsistencies in preprocessing and evaluation, such as different label sets (Rutherford et al., 2017), pose challenges to fair comparison of results and to analyzing the impact of new models. In this paper, we revisit prior work to explicate these inconsistencies and propose an improved evaluation protocol to promote experimental rigor in future work. Paired with this protocol, we present a set of strong baselines from pretrained sentence encoders on both PDTB 2.0 and 3.0 that set the state of the art. We furthermore reflect on the results and discuss future directions. We summarize our contributions as follows:

• We highlight preprocessing and evaluation inconsistencies in work using PDTB 2.0 for implicit discourse relation classification. We expect our work to serve as a comprehensive guide to common practices in the literature.
• We lay out an improved evaluation protocol using section-based cross-validation that preserves document-level structure.
• We report state-of-the-art results on both top-level and second-level implicit discourse relation classification on PDTB 2.0, and the first set of results on PDTB 3.0. We expect these results to serve as simple but strong baselines that motivate future work.
• We discuss promising next steps in light of the strength of pretrained encoders, the shift to PDTB 3.0, and better context modeling.

The Penn Discourse TreeBank (PDTB)
In PDTB, two text spans in a discourse relation are labeled with either one or two senses from a three-level sense hierarchy. PDTB 2.0 contains around 43K annotations, with 18.4K explicit and 16K implicit relations, in over 2K Wall Street Journal (WSJ) articles. Identifying implicit relations (i.e., those without explicit discourse markers such as but) is more challenging than identifying explicitly signaled relations (Pitler et al., 2008). The new version of the dataset, PDTB 3.0 (Prasad et al., 2019), introduces a new annotation scheme with a revised sense hierarchy as well as 13K additional datapoints. The third level of the sense hierarchy is modified to contain only asymmetric (or directional) senses.

Table 1: Accuracy on PDTB 2.0 L2 classification. We report average performance and standard deviation across 5 random restarts. Significant improvements according to the N−1 χ² test after Bonferroni correction are marked with *, **, *** (2-tailed p < .05, < .01, < .001). We compare the best published model and the median result from the 5 restarts of our models. Because we use section-based cross-validation, significance over † is not computed.

Variation in preprocessing and evaluation
We survey the literature to identify several sources of variation in preprocessing and evaluation that could lead to inconsistencies in the results reported.
Choice of label sets. Due to the hierarchical annotation scheme and the skewed label distribution, a range of different label sets has been employed for formulating classification tasks (Rutherford et al., 2017). The most popular choices for PDTB 2.0 are (1) the four top-level senses (L1) and (2) the finer-grained Level-2 senses (L2). For L2, the standard protocol is to use 11 labels after eliminating five infrequent labels, as proposed in Lin et al. (2009).

Random initialization. Different random initializations of a network often lead to substantial variability (Dai and Huang, 2018). It is important to account for this variability, especially when the reported margin of improvement can be as small as half a percentage point (see the cited papers in Table 1). We report the mean over 5 random restarts for existing splits, and the mean of mean cross-validation accuracy over 5 random restarts.
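To make this reporting convention concrete, the following is a minimal sketch of how cross-validation accuracies can be aggregated over restarts. The data layout (a list of per-fold accuracies for each restart) is an assumption for illustration, not the interface of our released code.

```python
import statistics

def aggregate_cv_restarts(per_restart_fold_accuracies):
    """per_restart_fold_accuracies[r][f]: accuracy of random restart r on fold f.

    First average over folds within each restart, then report the mean (and
    standard deviation) of those per-restart means.
    """
    restart_means = [statistics.mean(folds) for folds in per_restart_fold_accuracies]
    return statistics.mean(restart_means), statistics.stdev(restart_means)
```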

Proposed Evaluation Protocol
While Xue et al. (2015) lay out one possible protocol, it does not fully address the issues raised in Section 2. Another limitation is that its preprocessing code was unavailable as of the date of this submission. We describe our proposal below, which will be accompanied by publicly available preprocessing code. In addition to accounting for the variation discussed above, we take the concerns of Shi and Demberg (2017) into consideration.
Cross-validation. We advocate using cross-validation for L2 classification, sharing the concerns of Shi and Demberg (2017) about label sparsity. However, we propose cross-validation at the section level rather than at the level of individual examples as suggested by Shi and Demberg (2017). This preserves paragraph and document structure, which is essential for investigating the effect of modeling larger context (e.g., Dai and Huang 2018). We further illustrate the potential utility of document structure in Section 4. We suggest dividing the 25 sections of PDTB into 12 folds, each with 2 development, 2 test, and 21 training sections. We use a sliding window of two sections starting from the P&K split (dev: 0-1, test: 23-24, train: 2-22). All sections but one (22) are used exactly once for testing. Whether future work should evaluate on these particular cross-validation splits or on randomized splits (Gorman and Bedrick, 2019) is an open issue; we provide an additional discussion in Appendix F.
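For concreteness, here is a minimal sketch of the sliding-window fold construction. It assumes that each subsequent fold shifts the development window forward by two sections and tests on the two sections immediately preceding it (modulo 25); this enumeration is our reading of the description above, and the exact code we release may differ in details.

```python
def build_section_folds(num_sections=25, num_folds=12, window=2):
    """Slide a 2-section window over the 25 WSJ sections.

    Fold 0 reproduces the P&K-style split (dev 0-1, test 23-24, train 2-22);
    later folds shift dev forward by `window` sections and test on the
    `window` sections just before dev, wrapping around modulo `num_sections`.
    """
    folds = []
    for i in range(num_folds):
        dev = [(window * i + k) % num_sections for k in range(window)]
        test = [(window * (i - 1) + k) % num_sections for k in range(window)]
        train = [s for s in range(num_sections) if s not in dev and s not in test]
        folds.append({"train": train, "dev": dev, "test": test})
    return folds

folds = build_section_folds()
assert folds[0]["dev"] == [0, 1] and folds[0]["test"] == [23, 24]
# Every section except 22 appears exactly once across the 12 test sets.
tested = [s for fold in folds for s in fold["test"]]
assert sorted(tested) == sorted(set(range(25)) - {22})
```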
Label sets. We recommend reporting results on both L1 and L2, using the standard 11-way classification for L2 in PDTB 2.0. A standardized label set does not yet exist for L2 in PDTB 3.0 (L1 remains unchanged). We propose using only the labels with > 100 instances, which leaves 14 L2 senses (see Appendix A for counts). We suggest using all four possible label fields when senses are multiply-annotated, as discussed in Section 2.

Single-span baselines

Table 4 lists the performance of single-span (either ARG1 or ARG2) baseline models for both PDTB 2.0 and 3.0. This baseline adapts the idea of hypothesis-only baselines in Natural Language Inference (Poliak et al., 2018): we limit the training data by showing the models only one of the two spans in a discourse relation. We discuss these baselines further in Section 4.

The revised annotation scheme of PDTB 3.0 offers advantages over PDTB 2.0. For instance, the annotation manual (Prasad et al., 2019) remarks that LIST was removed since it was "not in practice distinguishable from CONJUNCTION". Indeed, models trained on PDTB 2.0 behaved exactly so, classifying most instances of LIST as CONJUNCTION (but not vice versa, likely due to a frequency effect; see Appendix G). We conducted an additional experiment testing the impact of the new annotation scheme, addressing the question "If we want to detect relation X in a downstream task, which PDTB should we use to train our models?". We trained the same model (BERT-large) twice on the same set of datapoints, varying only the annotation scheme. Since PDTB 3.0 both adds and removes examples, we filtered the datasets so that the two PDTBs contained exactly the same span pairs (sketched below). With the model and inputs fixed, the labeling scheme should be the only effective factor. After filtering, the majority-class baseline was below 30% for both. Table 5 suggests that PDTB 3.0's annotation scheme does lead to improved distinguishability of CONJUNCTION. (We used pooled cross-validation accuracy for this comparison.) PDTB 3.0 overall yielded better (or unchanged) distinguishability of shared labels, with the exception of CONTRAST. The trend was especially salient for CONCESSION, which was practically unlearnable from PDTB 2.0. This supports the utility of PDTB 3.0 over 2.0 when downstream transfer is considered, motivating a transition to 3.0.

Unsurprisingly, the change in distinguishability was highly dependent on the change in label counts in the training data (Table 5, ∆). But the change in frequency alone does not give the full picture. For instance, SYNCHRONOUS remained difficult to learn even with a substantial increase in labeled examples. The absolute size of the class was also not deterministic of performance: there were 192 training instances of SYNCHRONOUS in the filtered PDTB 2.0 and 261 in PDTB 3.0, yet similar or smaller classes such as ALTERNATIVE (118 instances in PDTB 2.0) and SUBSTITUTION (191 instances in PDTB 3.0) were still learnable, with 26% and 48% accuracy, respectively. The difficulty was mostly due to SYNCHRONOUS being mislabeled as CONJUNCTION, which was also the case in the unfiltered dataset (see Appendix G). This also calls for data augmentation that would balance subclass ratios and alleviate label sparsity at L3 (Table 3).
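The dataset filtering referenced above can be sketched as follows, under the assumption that an example is keyed by its exact (ARG1, ARG2) text pair; the field names are hypothetical and not taken from our released code.

```python
# Keep only span pairs present in both PDTB versions, so that the annotation
# scheme is the only factor that varies between the two training conditions.
def filter_to_shared_pairs(pdtb2_examples, pdtb3_examples):
    def key(ex):
        # Each example is assumed to be a dict with "arg1" and "arg2" text fields.
        return (ex["arg1"].strip(), ex["arg2"].strip())

    shared = {key(ex) for ex in pdtb2_examples} & {key(ex) for ex in pdtb3_examples}
    pdtb2_filtered = [ex for ex in pdtb2_examples if key(ex) in shared]
    pdtb3_filtered = [ex for ex in pdtb3_examples if key(ex) in shared]
    return pdtb2_filtered, pdtb3_filtered
```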
Within-document label distribution is informative, even for shallow discourse parsing. We have advocated for an evaluation scheme that preserves larger contexts. This is motivated by the fact that discourse relations are not distributed independently of one another (even when they are annotated in isolation, as in PDTB). For instance, implicit CONJUNCTION (IC) relations are likely to be adjacent; in PDTB 3.0, the probability of one IC following another is P(IC2 | IC1) = 0.14, whereas P(IC) = 0.08. Implicit REASON is likely to be adjacent to RESULT: P(IReason | IResult) = 0.12, whereas P(IReason) = 0.05.
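Such conditional probabilities can be estimated from per-document relation sequences. A rough sketch follows, assuming each document is available as an ordered list of implicit-relation labels (a simplification of PDTB's actual structure).

```python
from collections import Counter

def adjacency_stats(documents):
    """Estimate P(label2 | label1) for adjacent relations and marginal P(label).

    `documents` is assumed to be a list of label sequences, one per document,
    ordered by position in the text.
    """
    pair_counts, first_counts, total = Counter(), Counter(), Counter()
    for labels in documents:
        for a, b in zip(labels, labels[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
        total.update(labels)
    n = sum(total.values())
    cond = {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}
    marginal = {a: c / n for a, c in total.items()}
    return cond, marginal
```

Comparing, e.g., cond[("Conjunction", "Conjunction")] with marginal["Conjunction"] corresponds to the P(IC2 | IC1) versus P(IC) comparison above.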
Vanilla pretrained encoders are strong, but overreliant on lexical cues. Simple fine-tuning of pretrained encoders yielded impressive gains. At the same time, the models overrelied on lexical cues. For instance, ARG2-initial to often signals PURPOSE; 79.9% of such cases are true PURPOSE relations. It is reasonable for our models to exploit this strong signal, but the association was greatly amplified in their predictions. For example, XLNet-base predicted PURPOSE for 95.8% of the examples with ARG2-initial to. We also found that model predictions were generally brittle: a simplistic lexical perturbation with no semantic effect, such as adding '-' to the beginning of each span, resulted in a 9 percentage point drop in performance for BERT-large models.
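As an illustration of this kind of surface perturbation (the exact string handling in our experiments may differ, and the example field names are hypothetical):

```python
def perturb_spans(example):
    """Prepend '-' to both argument spans: a surface edit with no semantic
    effect on the discourse relation, used to probe prediction brittleness."""
    return {**example,
            "arg1": "- " + example["arg1"],
            "arg2": "- " + example["arg2"]}
```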
Overall, there remains much room for improvement, with our best model at 66% accuracy on PDTB 3.0 L2 classification. Combining pretrained encoders with expanded context modeling that better captures document-level distributional patterns is a promising direction.

[Figure: example document with numbered sentences 1-7, beginning "Why can't I receive recovery email?"]

Aggregation of single-span baselines as decontextualized upper bounds. Lexical cues continue to be informative even for implicit relations, as with the case of ARG2-initial to. Although these signals could be genuine rather than artifactual, they require comparatively little multi-span reasoning. How much of the dataset requires only such shallow reasoning? To address this question, we constructed a decontextualized baseline by aggregating the predictions of single-span models, assuming an oracle that always chooses the right answer whenever it is in the prediction set. This provides an upper-bound estimate of the performance of a model that considers the two input spans only disjointly, but still has full lexical access. Comparing the final rows of Table 4 and Table 2, we see that no model reliably outperforms its decontextualized upper-bound counterpart.
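A minimal sketch of the oracle aggregation, assuming aligned prediction lists from the two single-span models (names are illustrative):

```python
def oracle_upper_bound(arg1_preds, arg2_preds, gold_labels):
    """Decontextualized upper bound: an oracle counts an example as correct
    whenever either single-span model predicts the gold label."""
    correct = sum(
        gold in (p1, p2)
        for p1, p2, gold in zip(arg1_preds, arg2_preds, gold_labels)
    )
    return correct / len(gold_labels)
```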

Conclusion
We have surveyed the literature to highlight experimental inconsistencies in implicit discourse relation classification, and suggested an improved protocol using section-level cross-validation. We provided a set of strong baselines for PDTB 2.0 and 3.0 following this protocol, as well as results on a range of existing setups to maintain comparability. We discussed several future directions, including data augmentation for downstream transferability, applicability of pretrained encoders to discourse, and utilizing larger discourse contexts.

B List of Splits in Prior Work
We compile a (non-exhaustive) list of the Wall Street Journal sections used as training, development, and test sets in published work to demonstrate their high variability. We mostly list works that do not explicitly specify the source of their splits, with some exceptions. Some of the works have overlapping sections across splits, which we suspect to be typos but cannot verify.

C Training Details
For all sentence encoder models, we fine-tuned each encoder for a maximum of 10 epochs with early stopping when the development set performance did not improve for 5 evaluation steps (step size = 500), with a batch size of 8. We used a learning rate of 5e-6 for all models except XLNet-large, for which we used 2e-6. We used accuracy as the validation metric. We ran each model 5 times with different random initializations of the fine-tuning layer, and report the average performance across the 5 runs.

Table 9: Accuracy and F1 on L1 classification (4-way) for PDTB 2.0 and 3.0, using the Ji split for both. We report average performance across 5 random restarts.

Table 10 lists the performance of single-span (either ARG1 or ARG2) baselines for PDTB 2.0. Results on PDTB 3.0 are reported in Table 4. We additionally note that ARG2-only models consistently outperform ARG1-only models on both PDTB 2.0 and 3.0. For PDTB 3.0, the strong association between ARG2-initial to and CONTINGENCY.PURPOSE was largely responsible for this discrepancy (see also Section 4).
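For reference, the fine-tuning setup described at the beginning of this appendix can be summarized in a small configuration sketch; the dictionary keys below are illustrative and not the interface of our released code.

```python
# Illustrative summary of the fine-tuning hyperparameters in Appendix C.
FINETUNE_CONFIG = {
    "max_epochs": 10,
    "batch_size": 8,
    "eval_every_steps": 500,
    "early_stopping_patience": 5,   # evaluation steps without dev improvement
    "learning_rate": {"default": 5e-6, "xlnet-large": 2e-6},
    "validation_metric": "accuracy",
    "num_random_restarts": 5,
}
```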

F Cross-validation and Randomized Validation
Gorman and Bedrick (2019) propose validation over randomized splits using significance testing with multiple-comparisons correction. An adaptation of this idea to our proposal of section-based evaluation would be to randomly sample sections to create section-based splits. Given label sparsity and the distributional skew across sections, cross-validation has the advantage of guaranteed coverage of the label counts used for testing, although this may not be a large issue if a sufficient number of random splits is sampled. Conversely, the main goal of evaluation on random splits, namely avoiding overfitting to the standard split, is partially addressed by reporting the average performance over cross-validation splits. Still, if a standard cross-validation split is adopted, overfitting may arise over time.
Although we leave it to future work to decide which practice should be followed, we provide comparisons between the four models we tested, using our proposed cross-validation splits and randomized validation splits (both n = 12). Random splitting was done section-wise rather than instance-wise: we randomly split the dataset into 21 training, 2 development, and 2 test sections 12 times. Table 11 shows the model comparison results.

G Additional Error Analyses

Figure 2 shows the confusion matrices generated from the PDTB 2.0 L2 classification results produced by the XLNet-large and BERT-large models. Figure 3 shows the confusion matrices for PDTB 3.0 L2 classification predictions, again from XLNet-large and BERT-large (we did not observe immediate qualitative differences between XLNet and BERT, or between large and base models). The figures aggregate the predictions from all test sets of the cross-validation experiment, so the datapoints shown span the full dataset except for WSJ section 22. The colors are normalized over each row; the darkest shade marks the most frequently predicted label for the true label denoted by the row.
For both models, it was generally the case that classes sharing the same L1 sense (e.g., CONTINGENCY.CAUSE and CONTINGENCY.PRAGMATIC CAUSE, or COMPARISON.CONTRAST and COMPARISON.CONCESSION) were confused. When such confusions occurred, the more frequent class often subsumed the predictions of the other class (e.g., CONTINGENCY.PRAGMATIC CAUSE was often classified as CONTINGENCY.CAUSE, but not vice versa).
As noted in Section 4, TEMPORAL.SYNCHRONOUS (SYNCHRONY in PDTB 2.0) was frequently confused with EXPANSION.CONJUNCTION (but not vice versa). The models generally had a tendency to predict CONTINGENCY.CAUSE across the board, likely due to it being the most frequent label.