Contextualized Embeddings for Connective Disambiguation in Shallow Discourse Parsing

This paper studies a novel model that simplifies the disambiguation of connectives for explicit discourse relations. We use a neural approach that integrates contextualized word embeddings and predicts whether a connective candidate is part of a discourse relation or not. We study the influence of those context-specific embeddings. Further, we show the benefit of training the tasks of connective disambiguation and sense classification together at the same time. The success of our approach is supported by state-of-the-art results.


Introduction
Coherence is crucial for humans to be able to interpret text. The area of discourse parsing models this by identifying certain phrases (arguments) within a text and using discourse relations to unfold their underlying connections. These discourse relations and their understanding are important for tasks such as machine translation (Sim Smith, 2017), abstractive summarization (Wu and Hu, 2018), and text simplification (Zhong et al., 2020). A subset of these relations is signaled by specific words, so-called discourse connectives (or discourse markers or cues), and thus referred to as explicit discourse relations. However, such cues can be ambiguous, as they may signal more than one relation type or may not always function as a relation indicator. Two challenges arise. The first is distinguishing connectives from words with mere sentential meaning:
1. Mr. Perkins believes, however, that the market could be stabilized.
2. "The 1987 crash was a false alarm however you view it," says University of Chicago economist.
Here, example 1 shows a discourse relation, while example 2 uses 'however' in its sentential reading. The second challenge consists in classifying a connective's sense (described in detail in Section 2):
3. She owns a bike, while her brother drives a car. (Comparison.Contrast)
4. You should take the deal or even try to negotiate this price down. (Expansion.Alternative)
5. If things work out, then everybody will be happy. (Contingency.Condition)
6. While it is raining outside, I clean the dishes. (Temporal.Synchronous)
Shallow discourse parsing (SDP) is the area that builds models to uncover such discourse structures within texts. SDP consists of the main tasks of identifying connectives, demarcating their arguments, assigning senses to them, and finding the senses of so-called implicit relations (which hold between adjacent text spans without a lexical signal being present). In this work, we focus in particular on the binary connective disambiguation of explicit discourse relations and, further, integrate explicit sense prediction into our model, as those two tasks are highly related.
Word embeddings provide dense token representations in a low-dimensional vector space pretrained on large unannotated text corpora. First, we use fastText (Bojanowski et al., 2017), which is based on the skip-gram model (Mikolov et al., 2013) and integrates character n-grams into its representation. Second, we use GloVe (Pennington et al., 2014); as opposed to fastText, those embeddings were calculated through co-occurrence statistics rather than trained by a neural network. Recently, models were introduced that provide contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019) on demand and thus tackle the problem of identical representations for homonymous words with different senses, which had been indistinguishable in older models. For our experiments, we use BERT (Devlin et al., 2019), which has been successful in many areas of NLP (Liu and Lapata, 2019).
In this work, we present a novel approach to identifying explicit relations in shallow discourse parsing. We introduce a simple yet powerful model that outperforms previous research on the binary disambiguation of connective candidates. Furthermore, we adopt connective sense classification as an auxiliary task to improve performance and generalization capabilities and study the benefits of jointly training the auxiliary task in addition to the main task. This is because, in various cases, training neural models on multiple related tasks has been shown to benefit the learned representation (Caruana, 1993), as it introduces inductive bias and thereby reduces the possible hypothesis space (Baxter, 2000). Specifically, the work of Collobert et al. (2011) has pointed out the advantages of multitask learning on NLP tasks. We compare our results with state-of-the-art SDP components that took part in the CoNLL Shared Task in 2016; we are not aware of more recent results.
The contributions of this paper are as follows:
1. We design a simple neural architecture that eliminates the need for hand-engineered features. To the best of our knowledge, this work is the first to provide state-of-the-art performance on word-embedding-based connective disambiguation.
2. We present a novel approach that successfully combines the two tasks of connective disambiguation and explicit sense classification into one single model. In contrast to previous work, we introduce a more sensitive measure and, with its help, demonstrate improved stability of the jointly trained model.
In the following, Section 2 describes the corpus for the experiments; Section 3 explains our method. The experiments and results are presented in Section 4 and Section 5; Section 6 discusses relevant related work, followed by conclusions in Section 7.

Penn Discourse Treebank
Shallow discourse parsing is a challenging task that was promoted by the development of the second version of the Penn Discourse Treebank (PDTB2) (Prasad et al., 2008). This corpus provides about 43,000 annotated discourse relations, of which roughly 18,000 are signaled by explicit discourse connectives. Those relations are further annotated with a three-level sense hierarchy (one or two senses per relation). All discourse relations consist of two arguments and are associated with one of various types; the focus of our work is on relations of the explicit type. The Shared Tasks at CoNLL 2015 and 2016 (Xue et al., 2015, 2016) used PDTB2 with minor changes. Successful systems include Wang and Lan (2016) and Oepen et al. (2016). They largely follow a pipeline architecture (Lin et al., 2014), which consists of successive tasks of connective identification, argument labeling, and sense classification for both explicit and implicit relations.
Recently, PDTB3 (Prasad et al., 2018) was published, which extends the previous work with more available relations and corrects several former annotations. The authors also adjusted the relations' sense labels for a more balanced class distribution. For the sake of comparison with previous work on SDP, we stick to the PDTB2 corpus and expect to achieve similar results with PDTB3. Table 1 summarizes the distribution of sense classes, where we denote candidate words with sentential reading by NoConn. In both settings, NoConn dominates the other classes and thus serves as the majority baseline. The first setting shows the four coarse sense classes provided by PDTB2. The second setting describes the fine senses as defined in the Shared Task. Compared with the first setting, the distribution changes slightly, as rare training samples were removed or combined with other classes. Although the exact numbers for NoConn are the same in both settings, the ratios are different, which can be explained by the small modifications made to PDTB2 in the competitions.

Method
This work introduces a first, simple neural architecture for shallow discourse connective disambiguation. The system builds upon previous observations that a word's context could be used as a strong indicator for the presence of a discourse relation (Lin et al., 2014). Our work investigates the limitations of knowledge-free approaches and introduces a simple yet flexible model without domain knowledge.
We assume that word embeddings contain information about the discourse that can be used for the disambiguation task. We study standard noncontextualized embeddings (in particular, GloVe embeddings and Wikipedia-based fastText embeddings) and compare those to the recently developed contextualized embeddings (represented by BERT). We first hypothesize that contextualized embeddings yield better results than their noncontextualized counterparts. Second, we expect the context span to influence the model's performance, as the context may indicate a word's function more clearly.
In addition, we propose a second model based on the first one, which successfully combines connective disambiguation with sense classification as an auxiliary task. We follow the idea of previous work that sense classification can be performed without extracting the connectives' arguments (Pitler and Nenkova, 2009; Lin et al., 2014). Further, it has been previously shown that, for the identification of an explicit relation's sense, the connective itself as well as its context already provide significant information (Pitler and Nenkova, 2009; Lin et al., 2014; Wang and Lan, 2015; Ghosh et al., 2011). Consequently, we assume the necessary information for sense classification to be already accessible by our neural connective disambiguation model to some degree. Also, this approach eliminates error propagation, as the performance of our joint model does not depend on predictions made in earlier steps. The reason for adding sense classification as an auxiliary task in the first place is that joint training with auxiliary tasks has shown benefits in earlier work, as mentioned in Section 1. We validate that this is the case with our connective disambiguation task as well, as later demonstrated in our experiments (see Section 4.2).
In the following sections, we explain in more detail our binary connective disambiguation approach and the joint sense classification model.

Embedding-Based Connective Disambiguation
For parsing explicit discourse relations, the first task usually involves the identification of possible connective candidates. For this purpose, we use a list of candidate patterns based on PDTB2. Some candidates might look like discourse connectives but occur only in sentential use.
Connective annotation in PDTB2 is quite flexible. Connectives can be individual words ('indeed'), multiple consecutive words ('in the end'), or distant words that function together ('neither . . . nor'). In addition, they can contain adverbial modifications ('at least when,' 'even when,' 'usually when'), which vastly increases the number of possible connectives. Regarding this problem, the CoNLL Shared Task introduced a mapping that normalizes instances of connectives to their head by removing adverbial modifiers. For example, the three full connectives above all normalize to their head 'when.' For our studies, we follow this approach and focus on the disambiguation of connective head candidates rather than fully annotated connectives as in the original corpus.
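The head normalization described above can be illustrated with a minimal sketch. The modifier list and the function name `connective_head` are hypothetical and chosen only for this example; the Shared Task provides its own mapping, which is not reproduced here.

```python
# Hypothetical modifier list for illustration only; the CoNLL Shared Task
# ships its own connective-to-head mapping, which this does not replicate.
MODIFIERS = ("at least", "even", "usually")


def connective_head(connective: str) -> str:
    """Strip leading adverbial modifiers from a connective string."""
    text = " ".join(connective.lower().split())
    changed = True
    while changed:
        changed = False
        for modifier in MODIFIERS:
            prefix = modifier + " "
            # Only strip when something remains after the modifier.
            if text.startswith(prefix):
                text = text[len(prefix):]
                changed = True
    return text


heads = [connective_head(c) for c in ("at least when", "even when", "usually when")]
```

All three modified connectives from the text reduce to 'when' under this sketch. A plain prefix rule like this would mishandle connectives in which the modifier is part of the head itself, which is one reason a curated lookup table is the safer choice in practice.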
We introduce a simple neural architecture (see Figure 1) that relies on pretrained word embeddings instead of hand-engineered features. The network consists of a multilayer perceptron with a single hidden layer. As the network's input, we use the candidate word's embeddings and its context.
Figure 1: Model overview. The average of the connective embeddings and their context serve as input, a single hidden layer is used for transformation, and the final layer outputs either the connective probability or sense classes.
A continuous token sequence of length $n$ is encoded as an embedding sequence $(e_1, e_2, \ldots, e_n)$. We define our input with regard to the candidate's positions within the sentence (denoted as $c$) and use $c_{min}$ and $c_{max}$ for the first and last occurrence of the candidate, respectively. Finally, with a context size of $ctx$, our input looks as follows:
$$\left(e_{c_{min}-ctx}, \ldots, e_{c_{min}-1}, \; (e_i)_{i \in c}, \; e_{c_{max}+1}, \ldots, e_{c_{max}+ctx}\right)$$
Because the candidate might consist of multiple words ('in particular'), we simply average all candidate embeddings and concatenate the remaining embeddings to build the network's input $x$:
$$x = \left[e_{c_{min}-ctx}; \ldots; e_{c_{min}-1}; \; \frac{1}{|c|}\sum_{i \in c} e_i; \; e_{c_{max}+1}; \ldots; e_{c_{max}+ctx}\right]$$
Thus, independent of the number of words describing a connective candidate, the input always has the same dimension. We do not average the embeddings of the context because this would lead to unwanted information loss.
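As a minimal sketch of this input construction, the following NumPy function averages the candidate embeddings and concatenates the unaveraged context embeddings on both sides. The function name `build_input` and the zero padding at sequence boundaries are assumptions made for the example; the text does not state how boundaries are handled.

```python
import numpy as np


def build_input(embeddings, candidate_positions, ctx):
    """Build the network input for one connective candidate.

    embeddings: array of shape (n, emb), one row per token.
    candidate_positions: token indices belonging to the candidate.
    ctx: number of context tokens taken on each side.
    """
    n, emb = embeddings.shape
    c_min, c_max = min(candidate_positions), max(candidate_positions)

    def context(start, stop):
        # Assumption: positions outside the sequence are zero-padded.
        return [embeddings[i] if 0 <= i < n else np.zeros(emb)
                for i in range(start, stop)]

    left = context(c_min - ctx, c_min)
    right = context(c_max + 1, c_max + 1 + ctx)
    # Multiword candidates are averaged into a single vector.
    candidate = embeddings[list(candidate_positions)].mean(axis=0)
    return np.concatenate(left + [candidate] + right)


toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, emb = 2
x = build_input(toy, [2, 3], ctx=1)
```

Regardless of how many tokens the candidate spans, the result has (2 * ctx + 1) * emb entries, so the toy call yields a vector of length 6.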
For the transformation of candidates and their context into word embeddings, we use the tokenization provided by the CoNLL Shared Tasks. No other annotations such as POS, constituent trees, and dependencies are used for our experiments. GloVe and fastText are used for noncontextualized embeddings and BERT for contextualized embeddings. For contextualized embeddings, we noticed a difference in the tokenization of contractions. Therefore, we simply replaced occurrences of the token 'n't' by 'not' without changing the overall meaning.
The usage of noncontextualized embeddings is straightforward: each token in a document is mapped to its embedding representation. In contrast, the contextualized embeddings are generated sentence-wise before extracting context and candidate embeddings. The input for BERT is prepared with special tags for sentence beginnings and ends. Further, original tokens might be split into smaller tokens by the WordPiece tokenizer (Wu et al., 2016) before they are fed into BERT's encoder. This can lead to a higher number of BERT subtoken embeddings than tokens defined in the original corpus. Based on the alignment of original tokens and BERT subtokens, only the first BERT subtoken embedding of a corresponding token is used as its embedding. This selection follows the original BERT publication (Devlin et al., 2019), where features were extracted and the finally predicted classes rely only on the first subtoken's position.
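The first-subtoken alignment can be sketched without any model dependency. The splitter below is a toy greedy longest-match routine over a made-up vocabulary that only mimics WordPiece-style behaviour; it is not BERT's actual tokenizer, and the names `toy_wordpiece` and `first_subtoken_indices` are hypothetical.

```python
def toy_wordpiece(token, vocab):
    """Greedy longest-match-first split over a toy vocabulary."""
    pieces, start = [], 0
    while start < len(token):
        end, piece = len(token), None
        while end > start:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no matching piece: whole token is unknown
        pieces.append(piece)
        start = end
    return pieces


def first_subtoken_indices(tokens, vocab):
    """Return the subtoken sequence and, per token, its first subtoken index."""
    subtokens, first = ["[CLS]"], []
    for token in tokens:
        first.append(len(subtokens))
        subtokens.extend(toy_wordpiece(token, vocab))
    subtokens.append("[SEP]")
    return subtokens, first


vocab = {"how", "##ever", "it", "rain", "##s"}
subtokens, first = first_subtoken_indices(["however", "it", "rains"], vocab)
```

Given a matrix of subtoken embeddings with one row per entry of `subtokens`, indexing its rows with `first` would yield one embedding per original token, which is the selection described above.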

Joint Disambiguation and Sense Classification
For the reasons explained in the beginning of Section 3, we combine binary connective disambiguation and sense classification into a single, second model. Thus, the model jointly learns whether a connective candidate serves as a discourse signal and, if so, determines its sense. We use the same model as in our previous experiment (see Figure 1) but introduce a novel prediction scheme for the joint classification. As both tasks have exclusive classes, our model either predicts whether a candidate is without sense, which is equivalent to having sentential reading, or predicts one of the desired sense classes.
Combining multiple tasks into a single model is called multitask learning (see Section 6).
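One way to picture the joint prediction scheme is a single softmax over NoConn plus the sense classes. The sketch below uses the four coarse PDTB2 classes; deriving the connective probability as 1 - P(NoConn) is an assumption made for this illustration and is not stated in the text.

```python
import numpy as np

# NoConn (sentential reading) shares one output layer with the coarse senses.
LABELS = ["NoConn", "Comparison", "Contingency", "Expansion", "Temporal"]


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def decode(logits):
    """Map output logits of shape (batch, len(LABELS)) to predictions."""
    probs = softmax(np.asarray(logits, dtype=float))
    # Assumption: a candidate's connective score is the mass not on NoConn.
    conn_prob = 1.0 - probs[:, 0]
    labels = [LABELS[i] for i in probs.argmax(axis=-1)]
    return conn_prob, labels


conn_prob, labels = decode([[5.0, 0.0, 0.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 4.0, 0.0]])
```

Because the classes are mutually exclusive, a candidate predicted as NoConn is simultaneously a negative for connective disambiguation, and any other label is both a positive and a sense prediction.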

Evaluation
For our experiments, we use PDTB2 (Prasad et al., 2008), specifically the version provided for the CoNLL Shared Task. We distinguish between coarse senses, which come from the original PDTB, and fine senses as defined by the Shared Task. Also, an official split is provided that makes comparisons to other systems more reliable. In particular, this means that we used folders 02-22 for training, folders 00 and 01 as a development set, and folders 23 and 24 for testing. We downloaded word embeddings for GloVe and fastText from their corresponding websites. For the contextualized embeddings, we extracted token embeddings that we had previously transformed using BERT.
Table 2: Experimental results for various embedding types (GloVe, fastText, BERT) and context sizes (ctx). Evaluation involves Section 23 of WSJ and the blind data set proposed for CoNLL Shared Task. All tasks are measured using F1 scores. Average precision is calculated for connective disambiguation and shown in parentheses. Results are ordered by primary task and separated with regards to the groups highlighted in Figure 2. Columns: Model, Conn Disambiguation, Coarse Sense Classification, Fine Sense Classification.
As we work with highly imbalanced data, we present our results using precision, recall, and F1 score. Typically, there is a natural inverse relation between precision and recall: as one increases, the other decreases. Depending on the final usage, either one could be optimized. While in previous work, scores were usually reported for one specific threshold only, we decided to use precision-recall curves for our experimental results. These give a better understanding of the models' sensitivity to the selected threshold in our binary disambiguation task. To approximate the area under the precision-recall curve, we compute the average precision (AP) score. While the F1 score indicates performance for a single threshold only, the AP score helps to compare the precision-recall curves of various models.
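The difference between the two metrics can be made concrete with a small sketch. Average precision is computed here as the mean of the precision values at the ranks of the positive examples, a common step-wise approximation of the area under the precision-recall curve; the function names and the toy scores are made up for illustration.

```python
import numpy as np


def average_precision(y_true, scores):
    """Mean precision at the rank of each positive example."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = y_true[order]
    if hits.sum() == 0:
        raise ValueError("average precision needs at least one positive")
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())


def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 score for one fixed decision threshold."""
    y_true = np.asarray(y_true).astype(bool)
    predicted = np.asarray(scores, dtype=float) >= threshold
    true_positives = np.sum(predicted & y_true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted.sum()
    recall = true_positives / y_true.sum()
    return float(2 * precision * recall / (precision + recall))


y = [1, 0, 1, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.1]
ap = average_precision(y, s)
f1 = f1_at_threshold(y, s, threshold=0.5)
```

The F1 value changes when the threshold argument is moved, whereas the AP value depends only on the ranking of the scores, which is why it is suited to comparing whole precision-recall curves.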
In our experiments, we study different embedding types (GloVe, fastText, BERT) with a varying context size (ctx ∈ {0, 1, 2}). The embedding dimension (emb) is 300 for the noncontextualized word embeddings and 768 for the contextualized embeddings, which results in an input size of (2 * ctx + 1) * emb. The size of the hidden layer is 2048. All models were trained for at most 50 epochs with a batch size of 128 and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001; early stopping (Prechelt, 1998) ended training when the validation loss did not improve over 10 epochs. For comparison, we also provide a baseline from a reimplementation of Lin et al. (2014). Table 2 reports the performances of our models per experimental setting for each data partition (test and blind) and highlights the best performances. In the remainder of this section, we discuss the experimental results obtained for the models presented in Sections 3.1 and 3.2.
We conclude that single noncontextualized word embeddings do not contain enough discourse information but that a model can compensate for the missing information with the connective's context. Finally, contextualized embeddings seem to already contain this discourse information, as varying context sizes did not lead to clearly different results. Also, these embeddings may have outperformed noncontextualized embeddings because their features are already based on full sentences. As shown in Figure 2, the baseline exhibited a high level of performance, between that of noncontextualized embeddings with context and contextualized embeddings.
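The training configuration given earlier in this section (input size (2 * ctx + 1) * emb, at most 50 epochs, stopping after 10 epochs without improvement in validation loss) can be sketched as follows. The helper names are hypothetical, and the list of validation losses is simulated and stands in for an actual training loop.

```python
def input_size(ctx, emb):
    """Input dimension for a given context size and embedding dimension."""
    return (2 * ctx + 1) * emb


def run_early_stopping(val_losses, max_epochs=50, patience=10):
    """Return (best_epoch, last_epoch_run) for a stream of validation losses."""
    best_loss, best_epoch, last_epoch = float("inf"), -1, -1
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement over the last `patience` epochs
    return best_epoch, last_epoch


sizes = {(ctx, emb): input_size(ctx, emb)
         for emb in (300, 768) for ctx in (0, 1, 2)}
# Simulated losses: improvement for three epochs, then a plateau.
best_epoch, last_epoch = run_early_stopping([1.0, 0.8, 0.7] + [0.9] * 20)
```

Under these settings, the smallest input (noncontextualized, ctx = 0) has 300 dimensions and the largest (contextualized, ctx = 2) has 3840, all feeding the same 2048-unit hidden layer.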

Embedding-Based Connective Disambiguation
Comparing the results on the test (Figure 2a) and blind (Figure 2d) data sets, we notice the usual drop in performance, as both data sets differ in their distribution. The test set comes from news articles, while the blind set is based on Wikipedia. We expect this performance drop to be larger for the feature-based models submitted to the Shared Tasks, which would indicate that our model generalizes better.
We carried out a further analysis on the test data in order to characterize weaknesses of using word embeddings for connective disambiguation. For this purpose, we examined our contextualized embedding model without context (bert-ctx-0), as it yielded high performance despite its low complexity. For most of the rare classification mistakes made by our model, we found that the misclassified embeddings had close neighbors with the opposite label, which naturally made them hard for our model to distinguish.

Joint Disambiguation and Sense Classification
For our second experimental setting, we study the influence of jointly training connective disambiguation and sense classification (coarse and fine senses shown in Figures 2b and 2c, respectively). We hypothesized that generalization improves with increasing task complexity. For the commonly evaluated F1 score, we do not see a vast improvement between connective disambiguation and the joint training approach. In addition to the previous metric, we use the average precision score, which better summarizes the overall ratio of precision and recall for a single model. With respect to this metric, we notice higher values for both kinds of sense classification than for connective disambiguation alone. This confirms that, although the single F1 score might not change much, more complex tasks indeed improve model generalization and result in more stable models. Further, it appears that training our model on fine senses is somewhat less effective for the main disambiguation task, as training on coarse senses often slightly outperforms it. Finally, we studied the predictions of our contextualized embedding model (see Figure 3) as before. Here, we compare both sense levels and notice a change of performance in Contingency and Expansion. While the coarse model performs better on Expansion than on Contingency, the reverse holds for the fine-sense model. Especially for the fine-sense model, we observe an overall drop in performance, which could be related to the smaller number of samples per class.

Discussion
In our final comparison, Table 3 shows our best-performing models for each category (standard vs. contextualized embeddings and connective disambiguation vs. joint training for sense classification). For comparison, we included test results from successful submissions to the CoNLL 2016 Shared Task (Xue et al., 2016), in particular, results that were achieved for connective disambiguation in the first part of the Shared Task and results for explicit sense classification taken from the second part. As Table 3 shows, when using contextualized embeddings, our model outperformed the other systems with F1 scores of up to 97.32. The authors of the models (Stepanov and Soochow) unfortunately submitted results only for the first task, and thus, we cannot compare their performance on sense classification to our model's performance. The numbers for sense classification of our proposed contextualized embedding approach are slightly below those of the compared systems. But it is important to note that the other systems are prone to error propagation: errors made early in the pipeline negatively affect all subsequent steps. However, in the competitions, error propagation was eliminated by providing preprocessed data to the competing systems. This can be considered an unfair advantage over our system, which performed all tasks simultaneously and thus had to operate on raw data.

Related Work
In this section, we discuss work relevant to the area of discourse parsing, in particular, connective disambiguation and sense classification. Finally, recent work on word embeddings and multitask learning with regard to discourse parsing is outlined.
For connective disambiguation, Pitler and Nenkova (2009) defined a set of syntactic features extracted from constituency trees. Besides the connective's surface and category information from related tree nodes (parent, siblings), they also used binary features that check whether categories are contained by the nodes' traces and pairwise interaction features. In addition to these features, Lin et al. (2014) propose a set of lexicosyntactic features, as they observe that a connective's immediate context and part of speech are already strong indicators for disambiguation. The authors further extend those features by category paths from the connective to the root. Wang and Lan (2015) further extend the previous two works and add similar features for more syntactic context information of the connective. Oepen et al. (2016) combine previous feature sets with work on identifying expressions of speculation and negation (Velldal et al., 2012). Recent work of Webber et al. (2019) highlights the complexity of several kinds of ambiguity when working with discourse connectives.
Table 3: … (Stepanov and Riccardi, 2016a), Soochow (Fan et al., 2016). The ending -mtl refers to the results for fine-sense classification in Table 2.
The connective and its explicit sense have a strong correlation as shown by Pitler and Nenkova (2009), who report accuracy higher than the interannotator agreement for their connective disambiguation features on coarse-grained level senses. Lin et al. (2014) use only context features and evaluate their work on second-level senses. Wang and Lan (2015) extend previous features and develop a model for the CoNLL Shared Task. Oepen et al. (2016) use an ensemble of three types of classifiers that are mainly based on previous features (Wang and Lan, 2015). Stepanov and Riccardi (2016b) use chained information extracted from syntactical trees and chunk tags. Other work uses convolutional neural networks on word-level embedded sentence pairs but a linear model with additional dependency features for sense classification.
Braud and Denis (2015) have shown that word embeddings outperform sparse features for implicit sense classification. They compare word pair features with Brown clusters and low-dimensional word embeddings. Bai and Zhao (2018) use different levels of input representations, ranging from character level to contextualized word embeddings. Kishimoto et al. (2020) adapt BERT to perform implicit discourse sense classification. They show promising results by adding tasks, such as connective prediction, for pretraining.
Multitask learning has also been successfully applied to implicit sense classification. One such approach combines four different tasks related to discourse parsing but, in contrast to our work, relies on previously extracted argument spans. Qin et al. (2017) propose a model that, in addition to their main task (implicit sense classification), also learns to predict a possible connective that could be inserted. Lan et al. (2017) introduce various models that perform multitask learning, and their focus also lies on implicit sense classification.

Conclusions
In this work, we studied the value of discourse information in different kinds of word embeddings. We first presented a novel feature-free approach to connective disambiguation that achieves state-of-the-art results on this task. Then, this approach was extended by explicit sense classification to study the influence of jointly training both tasks. While our second approach does not directly outperform previous approaches on explicit sense classification, our model can be directly applied to raw input without being subject to error propagation, which is an advantage of our approach.
As our work indicates that combining multiple subtasks avoids error propagation issues, a future direction could be to investigate what other kinds of subtasks could be combined in order to benefit from this. Also, word embeddings have been shown to be very flexible, and they are useful even for out-of-domain data. It is worth investigating whether they are suitable for language transfer. This is particularly interesting because data sets of a similar quality to PDTB do not exist for many languages other than English.