Coherence Modeling Improves Implicit Discourse Relation Recognition

The research described in this paper examines how to learn linguistic knowledge associated with discourse relations from unlabeled corpora. We introduce an unsupervised learning method on text coherence that could produce numerical representations that improve implicit discourse relation recognition in a semi-supervised manner. We also empirically examine two variants of coherence modeling: order-oriented and topic-oriented negative sampling, showing that, of the two, topic-oriented negative sampling tends to be more effective.


Introduction
Shallow discourse parsing aims to automatically identify discourse relations (e.g., comparisons) between adjacent sentences. When connectives such as "however" explicitly appear, discourse relations are relatively easy to classify, as connectives provide strong cues (Pitler et al., 2008). In contrast, it remains challenging to identify discourse relations across sentences that have no connectives.
One reason for this inferior performance is a shortage of labeled instances, despite the diversity of natural language discourses. Collecting annotations about implicit relations is highly expensive because it requires linguistic expertise: the Penn Discourse Treebank (PDTB) 2.0 corpus (Prasad et al., 2008), which is currently the largest corpus for discourse relation recognition, contains only about 16K annotated instances in total. A variety of semi-supervised or unsupervised methods have been explored to alleviate this issue. Marcu and Echihabi (2002) proposed generating synthetic instances by removing connectives from sentence pairs. This idea has been extended in many works and remains a core approach in the field (Zhou et al., 2010; Patterson and Kehler, 2013; Lan et al., 2013; Rutherford and Xue, 2015; Ji et al., 2015; Braud and Denis, 2016; Lan et al., 2017). However, these methods rely on automatically detecting connectives in unlabeled corpora beforehand, which makes it almost impossible to utilize parts of unlabeled corpora in which no connectives appear. In addition, as Sporleder and Lascarides (2008) discovered, it is difficult to obtain a generalized model by training on synthetic data due to domain shifts. Though several semi-supervised methods do not depend on detecting connectives (Hernault et al., 2010, 2011; Braud and Denis, 2015), these methods are restricted to manually selected features, linear models, or word-level knowledge transfer.
In this paper, our research question is how to exploit unlabeled corpora without explicitly detecting connectives to learn linguistic knowledge associated with implicit discourse relations.
Our core hypothesis is that unsupervised learning about text coherence could produce numerical representations related to discourse relations. Sentences that compose a coherent document should be connected with syntactic or semantic relations (Hobbs, 1985; Grosz et al., 1995). In particular, we expect that there should be latent relations among local sentences. In this study, we hypothesize that parameters learned through coherence modeling could contain useful information for identifying (implicit) discourse relations. To verify this hypothesis, we develop a semi-supervised system whose parameters are first optimized for coherence modeling and then transferred to implicit discourse relation recognition. We also empirically examine two variants of coherence modeling: (1) order-oriented negative sampling and (2) topic-oriented negative sampling. An example is shown in Figure 1.
Our experimental results demonstrate that coherence modeling improves Macro F1 on implicit discourse relation recognition by about 3 points on first-level relation classes and by about 5 points on second-level relation types. Coherence modeling is particularly effective for relation categories with few labeled instances, such as temporal relations. In addition, we find that topic-oriented negative sampling tends to be more effective than the order-oriented counterpart, especially on first-level relation classes.

Coherence Modeling
In this study, we adopt the sliding-window approach of Li and Hovy (2014) to form a conditional probability that a document is coherent. That is, we define the probability that a given document X is coherent as a product of probabilities at all possible local windows, i.e.,
$$P(\text{coherent} \mid X, \theta) = \prod_{x \in X} P(\text{coherent} \mid x, \theta), \tag{1}$$
where P(coherent | x, θ) denotes the conditional probability that the local clique x is coherent and θ denotes parameters. Clique x is a tuple of a central sentence and its left and right sentences, (s−, s, s+). Though larger window sizes may allow the model to learn linguistic properties and inter-sentence dependencies over broader contexts, they increase computational complexity during training and suffer from the data sparsity problem.
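As an illustration of the product over local windows, the following Python sketch scores a document in log space, which avoids numerical underflow for long documents. The function names and the constant-probability `toy_clique_prob` are hypothetical stand-ins; in the actual system the clique probability would come from the trained coherence classifier.

```python
import math
from typing import Callable, List, Sequence, Tuple

Clique = Tuple[str, str, str]


def cliques(sentences: Sequence[str]) -> List[Clique]:
    """All windows (s_prev, s, s_next) of three consecutive sentences."""
    return [(sentences[i - 1], sentences[i], sentences[i + 1])
            for i in range(1, len(sentences) - 1)]


def document_log_prob(sentences: Sequence[str],
                      clique_prob: Callable[[Clique], float]) -> float:
    """Sum of log P(coherent | x) over all local cliques x of the document."""
    return sum(math.log(clique_prob(x)) for x in cliques(sentences))


def toy_clique_prob(clique: Clique) -> float:
    """Hypothetical classifier that assigns every clique the same probability."""
    return 0.9


doc = ["s1", "s2", "s3", "s4", "s5"]
log_p = document_log_prob(doc, toy_clique_prob)
prob = math.exp(log_p)  # product of the three clique probabilities
```

A five-sentence document yields three cliques, so the document probability here is the product of three local probabilities.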
We automatically build a dataset D = P ∪ N for coherence modeling from an unlabeled corpus. Here, P and N denote sets of positive and negative instances, respectively. Given a source corpus C of |C| sentences s_1, s_2, ..., s_|C|, we collect positive instances as follows:
$$P = \{(s_{i-1}, s_i, s_{i+1}) \mid i = 2, \ldots, |C| - 1\}. \tag{2}$$
Text coherence can be corrupted in two ways, which correspond to two ways of building the negative set N.
The first variant is order-oriented negative sampling, i.e.,
$$N = \{x' \mid x' \in \phi(x),\ x \in P\}, \tag{3}$$
where φ(x) denotes the set of permutations of the sentences in x, excluding x itself. The second variant is topic-oriented negative sampling, i.e.,
$$N = \{(s^-, s', s^+) \mid s' \in C,\ (s^-, s, s^+) \in P\}, \tag{4}$$
where s' denotes a sentence randomly sampled from a uniform distribution over the entire corpus C. We call this method topic-oriented because the topic consistency shared across a clique (s−, s, s+) is expected to be corrupted by replacing s with s'.
Table 1: The results of implicit discourse relation recognition (multi-class classification) and coherence modeling (binary classification). IRel and O/T-Coh denote that the model is trained on implicit discourse relation recognition and order/topic-oriented coherence modeling, respectively. "Small" and "Large" correspond to the relative size of the used unlabeled corpus: 37K (WSJ) and 22M (BLLIP) positive instances, respectively.
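To make the construction of D = P ∪ N concrete, the following Python sketch builds the positive cliques and both kinds of negatives from a toy corpus. It is a minimal sketch under assumptions: the function names are hypothetical, and enumerating every reordering of a clique (rather than sampling one) is an illustrative choice that the text does not specify.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Clique = Tuple[str, str, str]


def positive_instances(corpus: Sequence[str]) -> List[Clique]:
    """Every window of three consecutive sentences is a positive clique."""
    return [(corpus[i - 1], corpus[i], corpus[i + 1])
            for i in range(1, len(corpus) - 1)]


def order_negatives(positives: Sequence[Clique]) -> List[Clique]:
    """Order-oriented: all reorderings of a clique except the original order."""
    negatives: List[Clique] = []
    for x in positives:
        negatives.extend(p for p in itertools.permutations(x) if p != x)
    return negatives


def topic_negatives(positives: Sequence[Clique], corpus: Sequence[str],
                    rng: random.Random) -> List[Clique]:
    """Topic-oriented: replace the central sentence with a corpus sentence
    drawn uniformly at random (it may occasionally equal the original)."""
    return [(left, rng.choice(corpus), right) for left, _, right in positives]


rng = random.Random(0)
corpus = ["s1", "s2", "s3", "s4", "s5"]
P = positive_instances(corpus)
N_order = order_negatives(P)
N_topic = topic_negatives(P, corpus, rng)
```

With three distinct sentences per clique, each positive instance produces five order-oriented negatives and one topic-oriented negative in this sketch.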

Model Architecture
Figure 2: The semi-supervised system we developed. The model consists of sentence encoder E, coherence classifier F_c, and implicit discourse relation classifier F_r. Panels: coherence modeling (unsupervised learning) and discourse relation recognition (supervised learning).
[...] 2016)), such architectures are outside the scope of this study, since the effectiveness of incorporating coherence-based knowledge would be broadly orthogonal to the model's complexity.

Sentence Encoder
Sentence encoder E transforms a symbol sequence (i.e., a sentence) into a continuous vector. First, a bidirectional LSTM (BiLSTM) is applied to a given sentence of n tokens w_1, ..., w_n, i.e.,
$$\overrightarrow{h}_t = \mathrm{FwdLSTM}(\overrightarrow{h}_{t-1}, w_t), \tag{5}$$
$$\overleftarrow{h}_t = \mathrm{BwdLSTM}(\overleftarrow{h}_{t+1}, w_t), \tag{6}$$
where FwdLSTM and BwdLSTM denote forward and backward LSTMs, respectively. We initialize the hidden states to zero vectors, i.e., $\overrightarrow{h}_0 = \overleftarrow{h}_{n+1} = \mathbf{0}$. In our preliminary experiments, we tested conventional pooling functions (e.g., summation, average, or maximum pooling); we found that the following concatenation tends to yield higher performance:
$$h = \left[\overrightarrow{h}_n ; \overleftarrow{h}_1\right]. \tag{7}$$
We use Eq. 7 as the aggregation function throughout our experiments.
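The encoder can be illustrated with a small dependency-free Python sketch of a BiLSTM whose sentence vector concatenates the last state of the forward pass with the last state of the backward pass. That this is the exact concatenation intended is an assumption, as are the helper names, the random initialization, and the toy dimensions; a real implementation would use a deep learning library and trained parameters.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Gate = Tuple[List[Vector], Vector]  # (weight rows, bias)


def init_lstm(input_dim: int, hidden_dim: int, rng: random.Random) -> Dict[str, Gate]:
    """Random parameters for the input (i), forget (f), output (o), and candidate (g) gates."""
    def gate() -> Gate:
        rows = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
        return rows, [0.0] * hidden_dim
    return {name: gate() for name in ("i", "f", "o", "g")}


def lstm_step(params: Dict[str, Gate], h: Vector, c: Vector, x: Vector) -> Tuple[Vector, Vector]:
    """One LSTM step: returns the new hidden state and cell state."""
    z = x + h  # concatenate the input with the previous hidden state

    def affine(name: str) -> Vector:
        rows, bias = params[name]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(rows, bias)]

    def sigmoid(v: float) -> float:
        return 1.0 / (1.0 + math.exp(-v))

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def last_state(params: Dict[str, Gate], tokens: List[Vector], hidden_dim: int) -> Vector:
    """Run an LSTM from zero initial states and return its final hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in tokens:
        h, c = lstm_step(params, h, c, x)
    return h


def encode(tokens: List[Vector], fwd: Dict[str, Gate], bwd: Dict[str, Gate],
           hidden_dim: int) -> Vector:
    """Concatenate the final forward state and the final backward state."""
    h_fwd = last_state(fwd, tokens, hidden_dim)
    h_bwd = last_state(bwd, list(reversed(tokens)), hidden_dim)
    return h_fwd + h_bwd


rng = random.Random(0)
EMB_DIM, D = 4, 3  # toy sizes, not the settings used in the experiments
fwd, bwd = init_lstm(EMB_DIM, D, rng), init_lstm(EMB_DIM, D, rng)
sentence = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in range(5)]
vec = encode(sentence, fwd, bwd, D)  # sentence vector of length 2 * D
```

The resulting vector has 2D components, which matches the per-sentence input size assumed by the classifiers.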

Classifiers
We develop two multi-layer perceptrons (MLPs), one for F_c and one for F_r, each with ReLU nonlinearities followed by softmax normalization. The MLP inputs are concatenations of sentence vectors: three for F_c (one clique) and two for F_r (one sentence pair). Since each sentence vector has 2D dimensions, the dimensionalities of the input layers are 2D×3 and 2D×2, respectively. Each MLP consists of input, hidden, and output layers.
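The input sizes of the two classifiers can be checked with a short Python sketch of a one-hidden-layer MLP with ReLU and softmax. All names, the random weights, and the toy sizes are assumptions for illustration; only the input widths (three sentence vectors for the coherence classifier, two for the relation classifier) follow the description above.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialized affine layer with n_out rows of n_in weights."""
    rows = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return rows, [0.0] * n_out


def linear(layer: Layer, x: Vector) -> Vector:
    rows, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(rows, bias)]


def softmax(x: Vector) -> Vector:
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def mlp(x: Vector, layers: List[Layer]) -> Vector:
    """Input -> hidden (ReLU) -> output (softmax)."""
    hidden = [max(0.0, v) for v in linear(layers[0], x)]
    return softmax(linear(layers[1], hidden))


rng = random.Random(0)
D, HIDDEN = 2, 5          # toy sizes, not the experimental settings
SENT_DIM = 2 * D          # each sentence vector has 2D dimensions
F_c = [init_layer(3 * SENT_DIM, HIDDEN, rng), init_layer(HIDDEN, 2, rng)]
F_r = [init_layer(2 * SENT_DIM, HIDDEN, rng), init_layer(HIDDEN, 4, rng)]

s_prev, s_mid, s_next = ([rng.uniform(-1.0, 1.0) for _ in range(SENT_DIM)]
                         for _ in range(3))
p_coh = mlp(s_prev + s_mid + s_next, F_c)  # coherent vs. incoherent
p_rel = mlp(s_mid + s_next, F_r)           # four first-level relation classes
```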

Preparation
We used the Penn Discourse Treebank (PDTB) 2.0 corpus (Prasad et al., 2008) as a dataset for implicit discourse relation recognition. We followed the standard section partition, which is to use Sections 2-20 for training, Sections 0-1 for development, and Sections 21-22 for testing. We evaluate multi-class classification with first-level relation classes (four classes) and second-level relation types (11 classes).
We used the Wall Street Journal (WSJ) articles (Marcus et al., 1993) or the BLLIP North American News Text (Complete) (McClosky et al., 2008) to build a coherence modeling dataset, resulting in about 48K (WSJ) or 23M (BLLIP) positive instances. We inserted a special symbol "ARTICLE BOUNDARY" at each article boundary. For the WSJ corpus, we split the sections into training/development/test sets in the same way as for implicit relation recognition. For the BLLIP corpus, we randomly sampled 10,000 articles each for the development and test sets. Negative instances are generated following the procedure described in Section 2. Note that this procedure requires neither human annotation nor special connective detection.
Table 2:
System | Acc. (%) | Macro F1 (%)
Rutherford and Xue (2015) | 57.10 | 40.50
[...] | 57.27 | 44.98
Braud and Denis (2016) | [...] | [...]
We set the dimensionalities of the word embeddings, hidden states of the BiLSTM, and hidden layers of the MLPs to 100, 200, and 100, respectively. GloVe (Pennington et al., 2014) was used to produce pre-trained word embeddings on the BLLIP corpus. To avoid overfitting, we fixed the word embeddings during training in both coherence modeling and implicit relation recognition. Dropout (ratio 0.2) was applied to the word embeddings and the MLPs' layers. At every iteration during training in both tasks, we constructed class-balanced batches by resampling.
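One possible way to build class-balanced batches by resampling is sketched below in Python. The function name, the equal per-class quota, and sampling with replacement are assumptions; the text does not state how the resampling was carried out.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def balanced_batch(examples: Sequence[int], labels: Sequence[Hashable],
                   batch_size: int, rng: random.Random) -> List[Tuple[int, Hashable]]:
    """Draw the same number of examples from every class."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    per_class = batch_size // len(by_class)
    batch: List[Tuple[int, Hashable]] = []
    for label in sorted(by_class, key=str):
        # Sample with replacement so that rare classes can fill their quota.
        batch.extend((rng.choice(by_class[label]), label) for _ in range(per_class))
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
# Toy label distribution with a skew similar in spirit to first-level classes.
labels = (["Expansion"] * 50 + ["Contingency"] * 30
          + ["Comparison"] * 15 + ["Temporal"] * 5)
examples = list(range(len(labels)))
batch = balanced_batch(examples, labels, batch_size=8, rng=rng)
counts = Counter(label for _, label in batch)
```

In this toy setting every batch of eight contains two instances of each of the four classes, regardless of how rare a class is in the training set.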

Results
To verify whether unsupervised learning on coherence modeling could improve implicit discourse relation recognition, we compared the semi-supervised model (i.e., implicit discourse relation recognition (IRel) + coherence modeling with order/topic-oriented negative sampling (O/T-Coh)) with the baseline model (i.e., IRel only). The evaluation metrics are accuracy (%) and Macro F1 (%). We report the mean scores over 10 trials. Table 1 shows that coherence modeling improves Macro F1 by about 3 points on first-level relation classes and by about 5 points on second-level relation types. The semi-supervised model also outperforms the baseline in accuracy. We observed that the higher the coherence modeling performance (see Small vs. Large), the higher the implicit relation recognition score. These results support our claim that coherence modeling could learn linguistic knowledge that is useful for identifying discourse relations.
We also found that topic-oriented negative sampling tends to outperform its order-oriented counterpart, especially on first-level relation classes. We suspect that this is because order-oriented coherence modeling is more fine-grained and challenging than topic-oriented identification, resulting in poor generalization. For example, there could be order-invariant cliques that still hold coherence relations after random shuffling, whereas topic-invariant cliques hardly exist. Indeed, training with order-oriented negative sampling converged to lower scores than training with topic-oriented negative sampling (see coherence accuracy).
Next, for reference, we compared our system with previous work that exploits unlabeled corpora. As shown in Table 2, we found our model to outperform previous systems in Macro F1. In this task, Macro F1 is more important than accuracy because the class balance in the test set is highly skewed. Note that these previous models rely on previously detected connectives in the unlabeled corpus, whereas our system is free from such detection procedures.
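The difference between accuracy and Macro F1 under class skew can be seen in a small Python example. The label counts are invented for illustration and are not taken from the PDTB test set; the `macro_f1` helper is a plain reimplementation of the standard definition (unweighted mean of per-class F1).

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for c in sorted(set(gold), key=str):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy skewed test set: a classifier that always predicts the majority class.
gold = ["Expansion"] * 8 + ["Temporal"] * 2
pred = ["Expansion"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
```

Here the majority-class predictor reaches 80% accuracy, yet its Macro F1 is only 4/9 because the F1 on the minority class is zero.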
To assess the effectiveness of coherence modeling on different relation classes, we trained and evaluated the models on one-vs-others binary classification. That is, we treated each of the first-level relation classes (4 classes) in turn as the positive class and the others as the negative class. Table 3 shows that coherence modeling is effective, especially for the Temporal relation, which has fewer labeled instances than the others, indicating that coherence modeling could compensate for the shortage of labeled data.
We also performed an ablation study to measure the performance contribution of coherence modeling while changing the number of training instances used in implicit relation recognition. Here, we assume that in real-world situations, we do not have sufficient labeled data. We downsampled from the original training set and maintained the balance of classes as much as possible. As shown in Figure 3, coherence modeling robustly yields improvements, even when the labeled instances are reduced to 10%.
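A downsampling step of this kind might look like the following Python sketch, which keeps a fixed fraction of each class. Interpreting "maintaining the balance of classes" as preserving class proportions (stratified sampling) is an assumption, as are the function name and the toy label counts.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_downsample(examples: Sequence[int], labels: Sequence[Hashable],
                          fraction: float, rng: random.Random
                          ) -> List[Tuple[int, Hashable]]:
    """Keep roughly `fraction` of each class, and at least one instance per class."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    sampled: List[Tuple[int, Hashable]] = []
    for label in sorted(by_class, key=str):
        k = max(1, round(len(by_class[label]) * fraction))
        # Sample without replacement so that no instance is duplicated.
        sampled.extend((example, label) for example in rng.sample(by_class[label], k))
    return sampled


rng = random.Random(0)
labels = ["Expansion"] * 100 + ["Temporal"] * 20  # invented counts
examples = list(range(len(labels)))
subset = stratified_downsample(examples, labels, fraction=0.1, rng=rng)
subset_counts = Counter(label for _, label in subset)
```

At a fraction of 10%, the toy training set shrinks from 120 to 12 instances while the 5:1 class ratio is preserved.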

Conclusion
In this paper, we showed that unsupervised learning on coherence modeling improves implicit discourse relation recognition in a semi-supervised manner. Our approach does not require detecting explicit connectives, which makes it possible to exploit entire unlabeled corpora. We empirically examined two variants of coherence modeling and showed that topic-oriented negative sampling tends to be more effective than the order-oriented counterpart on first-level relation classes.
It remains unclear whether the coherence-based knowledge is complementary to the knowledge exploited in previous work. It would also be interesting to qualitatively inspect the differences in learned properties between order-oriented and topic-oriented negative sampling. We will examine this line of research in future work.