Discourse-Aware Neural Extractive Text Summarization

Recently BERT has been adopted for document encoding in state-of-the-art text summarization models. However, sentence-based extractive models often result in redundant or uninformative phrases in the extracted summaries. Also, long-range dependencies throughout a document are not well captured by BERT, which is pre-trained on sentence pairs instead of documents. To address these issues, we present a discourse-aware neural summarization model - DiscoBert. DiscoBert extracts sub-sentential discourse units (instead of sentences) as candidates for extractive selection on a finer granularity. To capture the long-range dependencies among discourse units, structural discourse graphs are constructed based on RST trees and coreference mentions, encoded with Graph Convolutional Networks. Experiments show that the proposed model outperforms state-of-the-art methods by a significant margin on popular summarization benchmarks compared to other BERT-base models.


Introduction
Neural networks have achieved great success in the task of text summarization (Nenkova et al., 2011;Yao et al., 2017). There are two main lines of research: abstractive and extractive. While the abstractive paradigm (Rush et al., 2015;See et al., 2017;Celikyilmaz et al., 2018;Sharma et al., 2019) focuses on generating a summary word-by-word after encoding the full document, the extractive approach (Cheng and Lapata, 2016;Narayan et al., 2018) directly selects sentences from the document to assemble into a summary. The abstractive approach is more flexible  and generally produces less redundant summaries, while the extractive approach enjoys better factuality and efficiency (Cao et al., 2018). Recently, some hybrid methods have been proposed to take advantage of both, by designing a two-stage pipeline to first select and then rewrite (or compress) candidate sentences (Chen and Bansal, 2018;Gehrmann et al., 2018;Zhang et al., 2018;Xu and Durrett, 2019). Compression or rewriting aims to discard uninformative phrases in the selected sentences. However, most of these hybrid systems suffer from the inevitable disconnection between the two stages in the pipeline.
Meanwhile, modeling long-range context for document summarization remains a challenge (Xu et al., 2016). Pre-trained language models (Devlin et al., 2019) are designed mostly for sentences or a short paragraph, and are thus poor at capturing long-range dependencies throughout a document. Empirical observations (Liu and Lapata, 2019) show that adding standard encoders such as LSTM or Transformer (Vaswani et al., 2017) on top of BERT to model inter-sentential relations does not bring much performance gain.
In this paper, we present DISCOBERT, a discourse-aware neural extractive summarization model built upon BERT. To perform compression with extraction simultaneously and reduce redundancy across sentences, we take Elementary Discourse Unit (EDU), a sub-sentence phrase unit originating from RST (Mann and Thompson, 1988;Carlson et al., 2001) 2 as the minimal selection unit (instead of sentence) for extractive summarization. Figure 1 shows an example of discourse segmentation, with sentences broken down into EDUs (annotated with brackets). By operating on the discourse unit level, our model can discard redundant details in sub-sentences, therefore retaining additional capacity to include more concepts or events, leading to more concise and informative summaries.
Furthermore, we finetune the representations of discourse units with the injection of prior knowledge to leverage intra-sentence discourse relations. More specifically, two discourse-oriented graphs are proposed: RST Graph $G_R$ and Coreference Graph $G_C$. Over these discourse graphs, a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is imposed to capture long-range interactions among EDUs. The RST Graph is constructed from RST parse trees over the EDUs of the document. The Coreference Graph, on the other hand, connects entities and their coreference clusters/mentions across the document. The path of coreference navigates the model from the core event to other occurrences of that event, and in parallel explores its interactions with other concepts or events.
The main contribution is threefold: (i) We propose a discourse-aware extractive summarization model, DISCOBERT, which operates on a sub-sentential discourse unit level to generate concise and informative summaries with low redundancy. (ii) We propose to structurally model inter-sentential context with two types of discourse graph. (iii) DISCOBERT achieves new state of the art on two popular newswire text summarization datasets, outperforming other BERT-base models.
2 We adopt RST as the discourse framework due to the availability of existing tools, the nature of the RST tree structure for compression, and the observations from Louis et al. (2010). Other alternatives include Graph Bank (Wolf and Gibson, 2005) and PDTB (Miltsakaki et al., 2004).

Discourse Graph Construction
In this section, we first introduce the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), a linguistic theory for discourse analysis, and then explain how we construct discourse graphs used in DISCOBERT. Two types of discourse graph are considered: RST Graph and Coreference Graph. All edges are initialized as disconnected, and connections are later added for a subset of nodes based on RST discourse parse tree or coreference mentions.

Discourse Analysis
Discourse analysis focuses on inter-sentential relations in a document or conversation. In the RST framework, the discourse structure of text can be represented in a tree format. The whole document can be segmented into contiguous, adjacent and non-overlapping text spans called Elementary Discourse Units (EDUs). Each EDU is tagged as either Nucleus or Satellite, which characterizes its nuclearity or saliency. Nucleus nodes are generally more central, and Satellite nodes are more peripheral and less important in terms of content and grammatical reliance. There are dependencies among EDUs that represent their rhetorical relations.
In this work, we treat EDU as the minimal unit for content selection in text summarization. Figure 2 shows an example of discourse segmentation and the parse tree of a sentence. Among these EDUs, rhetorical relations represent the functions of different discourse units. As observed in Louis et al. (2010), the RST tree structure already serves as a strong indicator for content selection. On the other hand, the agreement between rhetorical relations tends to be lower and more ambiguous. Thus, we do not encode rhetorical relations explicitly in our model.
In content selection for text summarization, we expect the model to select the most concise and pivotal concepts in the document, with low redundancy. However, in traditional extractive summarization methods, the model is required to select a whole sentence, even though some parts of the sentence are not necessary. Our proposed approach can select one or several fine-grained EDUs to render the generated summaries less redundant. This serves as the foundation of our DISCOBERT model.

RST Graph
When selecting sentences as candidates for extractive summarization, we assume each sentence is grammatically self-contained. But for EDUs, some restrictions need to be considered to ensure grammaticality. For example, Figure 2 illustrates an RST discourse parse tree of a sentence, where "[2] This iconic ... series" is a grammatical sentence but "[3] and shows ... 8" is not. We need to understand the dependencies between EDUs to ensure the grammaticality of the selected combinations. Details of the derivation of the dependencies can be found in Sec 4.3.
The construction of the RST Graph aims to provide not only local paragraph-level but also long-range document-level connections among EDUs. We use the converted dependency version of the tree to build the RST Graph $G_R$, by initializing an empty graph and treating every discourse dependency from the i-th EDU to the j-th EDU as a directed edge, i.e., $G_R[i][j] = 1$.
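The construction just described can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `build_rst_graph`, the 0-based indexing, and the example dependency list are assumptions made here, and the dependencies themselves are taken as given (they would come from the converted discourse tree).

```python
def build_rst_graph(num_edus, dependencies):
    """Return an n x n adjacency matrix with G_R[i][j] = 1 for each edge i -> j.

    `dependencies` is an iterable of (i, j) pairs of 0-based EDU indices,
    one pair per discourse dependency from the i-th to the j-th EDU.
    """
    # Start from an empty graph: every edge is initialized as disconnected.
    graph = [[0] * num_edus for _ in range(num_edus)]
    for i, j in dependencies:
        if not (0 <= i < num_edus and 0 <= j < num_edus):
            raise ValueError(f"EDU index out of range: ({i}, {j})")
        graph[i][j] = 1  # directed edge, so graph[j][i] is left untouched
    return graph


# Hypothetical dependencies for a five-EDU document (illustrative only).
example_deps = [(2, 1), (4, 2)]
g_r = build_rst_graph(5, example_deps)
```

Because the edges are directed, the matrix is generally not symmetric.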

Coreference Graph
Text summarization, especially news summarization, usually suffers from the well-known 'position bias' issue (Kedzie et al., 2018), where most of the key information is described at the very beginning of the document. However, there is still a decent amount of information spread in the middle or at the end of the document, which is often ignored by summarization models. We observe that around 25% of oracle sentences appear after the first 10 sentences in the CNNDM dataset. Besides, in long news articles, there are often multiple core characters and events throughout the whole document. However, existing neural models are poor at modeling such long-range context, especially when there are multiple ambiguous coreferences to resolve.
Algorithm 1: Construction of the Coreference Graph $G_C$.
To encourage and guide the model to capture the long-range context in the document, we propose a Coreference Graph built upon discourse units. Algorithm 1 describes how to construct the Coreference Graph. We first use Stanford CoreNLP (Manning et al., 2014) to detect all the coreference clusters in an article. For each coreference cluster, all the discourse units containing a mention from that cluster are connected. This process is iterated over all the coreference clusters to create the final Coreference Graph. Figure 1 provides an example, where 'Pulitzer prizes' is an important entity that occurs multiple times in multiple discourse units. The constructed Coreference Graph is shown on the right side of the document. When graph $G_C$ is constructed, the EDUs 1-1, 2-1, 20-1 and 22-1 are all connected to one another due to the mentions of 'Pulitzer prizes'.
In the Document Encoder, a pre-trained BERT model first encodes the whole document on the token level. Then, a self-attentive span extractor is used to obtain the EDU representations from the corresponding text spans. The Graph Encoder takes the output of the Document Encoder as input and updates the EDU representations with a Graph Convolutional Network based on the constructed discourse graphs; the updated representations are then used to predict the oracle labels.
Assume that document D is segmented into n EDUs in total, i.e., $D = \{d_1, d_2, \cdots, d_n\}$, where $d_i$ denotes the i-th EDU. Following Liu and Lapata (2019), we formulate extractive summarization as a sequential labeling task, where each EDU $d_i$ is scored by neural networks, and decisions are made based on the scores of all EDUs. The oracle labels are a sequence of binary labels, where 1 stands for being selected and 0 for not. We denote the labels as $Y = \{y^*_1, y^*_2, \cdots, y^*_n\}$. During training, we aim to predict the sequence of labels Y given the document D.
During inference, we need to further consider discourse dependency to ensure the coherence and grammaticality of the output summary.
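As a rough sketch of the Coreference Graph construction described in this section (every pair of discourse units sharing a coreference cluster gets connected), the snippet below assumes the clusters have already been mapped to EDU indices. The function name `build_coref_graph`, the input layout, and the choice not to add self-loops are assumptions of this sketch; the coreference resolver itself is not called here.

```python
from itertools import combinations


def build_coref_graph(num_edus, clusters):
    """Return a symmetric n x n adjacency matrix for the Coreference Graph.

    `clusters` is a list of coreference clusters; each cluster is a list of
    0-based indices of the EDUs that contain a mention from that cluster.
    """
    graph = [[0] * num_edus for _ in range(num_edus)]
    for mention_edus in clusters:
        # Several mentions may fall into the same EDU, so deduplicate first.
        edus = sorted(set(mention_edus))
        for i, j in combinations(edus, 2):
            # Undirected edge: set both directions.
            graph[i][j] = 1
            graph[j][i] = 1
    return graph


# Hypothetical clusters over six EDUs (illustrative only).
example_clusters = [[0, 1, 4, 5, 4], [2, 3]]
g_c = build_coref_graph(6, example_clusters)
```

Two EDUs are linked whenever they share at least one cluster, regardless of how far apart they are in the document, which is what gives the graph its long-range connections.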

Document Encoder
BERT is a pre-trained deep bidirectional Transformer encoder (Vaswani et al., 2017;Devlin et al., 2019). Following Liu and Lapata (2019), we encode the whole document with BERT and finetune the BERT model for summarization.
BERT is originally trained to encode a single sentence or sentence pair. However, a news article typically contains more than 500 words, so we need to adapt BERT for document encoding. Specifically, we insert CLS and SEP tokens at the beginning and the end of each sentence, respectively. In order to encode long documents such as news articles, we also extend the maximum sequence length that BERT can take from 512 to 768 in all our experiments.
The input document after tokenization is denoted as $D = \{d_1, \cdots, d_n\}$, and $d_i = \{w_{i1}, \cdots, w_{i\ell_i}\}$, where $\ell_i$ is the number of BPE tokens in the i-th EDU. If $d_i$ is the first EDU in a sentence, a CLS token is prepended to $d_i$; if $d_j$ is the last EDU in a sentence, a SEP token is appended to $d_j$ (see Figure 3). This scheme for inserting CLS and SEP tokens follows Liu and Lapata (2019). For simplicity, these two tokens are not shown in the equations. The BERT model is then used to encode the document:
$$\{h^B_{11}, \cdots, h^B_{n\ell_n}\} = \mathrm{BERT}(\{w_{11}, \cdots, w_{n\ell_n}\}),$$
where $\{h^B_{11}, \cdots, h^B_{n\ell_n}\}$ is the BERT output of the whole document, with the same length as the input.
After the BERT encoder, the representation of the CLS token can be used as sentence representation. However, this approach does not work in our setting, since we need to extract the representation for EDUs instead. Therefore, we adopt a Self-Attentive Span Extractor (SpanExt), proposed in Lee et al. (2017), to learn EDU representation.
For the i-th EDU with $\ell_i$ words, with the output from the BERT encoder $\{h^B_{i1}, h^B_{i2}, \cdots, h^B_{i\ell_i}\}$, we obtain the EDU representation as follows:
$$a_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{k=1}^{\ell_i} \exp(\alpha_{ik})}, \qquad h^S_i = \sum_{j=1}^{\ell_i} a_{ij}\, h^B_{ij},$$
where $\alpha_{ij}$ is the score of the j-th word in the EDU, computed from $h^B_{ij}$ with learned parameters, and $a_{ij}$ is the normalized attention of the j-th word w.r.t. all the words in the span. $h^S_i$ is a weighted sum of the BERT output hidden states. Throughout the paper, all the W matrices and b vectors are parameters to learn. We abstract the above Self-Attentive Span Extractor as $h^S_i = \mathrm{SpanExt}(h^B_{i1}, \cdots, h^B_{i\ell_i})$.
After the span extraction step, the whole document is represented as a sequence of EDU representations: $h^S = \{h^S_1, \cdots, h^S_n\} \in \mathbb{R}^{d_h \times n}$, which will be sent to the graph encoder.
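To make the span extraction step concrete, here is a small pure-Python sketch of attention pooling over the token vectors of one EDU: each token gets a scalar score, the scores are normalized with a softmax over the span, and the EDU vector is the attention-weighted sum of the token vectors. The single linear scorer (`score_w`, `score_b`) is an assumption of this sketch, chosen only to keep it short; it should not be read as the exact scoring function used in the paper.

```python
import math


def span_extract(hidden, score_w, score_b):
    """Pool the token vectors of one EDU into a single vector.

    hidden:  list of token vectors (each a list of floats) for the span
    score_w: weight vector of the (assumed) linear scorer
    score_b: bias of the (assumed) linear scorer
    Returns (pooled_vector, attention_weights).
    """
    # Unnormalized score for each token in the span.
    scores = [sum(w * x for w, x in zip(score_w, h)) + score_b for h in hidden]
    # Softmax over the span; subtracting the max keeps exp() stable.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of the token vectors.
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, hidden)) for d in range(dim)]
    return pooled, attn


# Toy span of three 2-dimensional token vectors (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edu_vec, attn = span_extract(tokens, score_w=[0.0, 0.0], score_b=0.0)
```

With all-zero scorer weights every token receives the same attention, so the pooled vector reduces to the plain average of the token vectors; a trained scorer would instead concentrate the weight on the more informative tokens.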

Graph Encoder
Given the constructed graph G = (V, E), nodes V correspond to the EDUs in a document, and edges E correspond to either RST discourse relations or coreference mentions. We then use Graph Convolutional Network to update the representations of all the EDUs, to capture long-range dependencies missed by BERT for better summarization. To modularize architecture design, we present a single Discourse Graph Encoder (DGE) layer. Multiple DGE layers are stacked in our experiments.
Assume that the input for the k-th DGE layer is denoted as $h^{(k)} = \{h^{(k)}_1, \cdots, h^{(k)}_n\} \in \mathbb{R}^{d_h \times n}$, and the corresponding output is denoted as $h^{(k+1)} = \{h^{(k+1)}_1, \cdots, h^{(k+1)}_n\}$. In the k-th DGE layer, LN(·) represents Layer Normalization and $N_i$ denotes the neighborhood of the i-th EDU node. $h^{(k+1)}_i$ is the output of the i-th EDU in the k-th DGE layer, and $h^{(1)} = h^S$, which is the output from the Document Encoder. After K layers of graph propagation, we obtain $h^G = h^{(K+1)}$, the EDU representations used for prediction.
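For intuition only, the following sketch shows one generic graph-convolution step over EDU vectors: each node averages its neighbors' vectors, applies a linear map and ReLU, adds the result back to its own vector, and applies layer normalization. This is not the paper's DGE layer; the mean aggregation, the single linear map, the residual connection, and all names below are assumptions made for illustration.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def graph_layer(h, adj, weight, bias):
    """One mean-aggregation graph convolution with residual and LayerNorm.

    h:      list of n node vectors of size d
    adj:    n x n adjacency matrix; adj[i][j] != 0 means j is a neighbor of i
    weight: d x d matrix (list of rows); bias: vector of size d
    """
    n, dim = len(h), len(h[0])
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        if nbrs:
            # Average the neighbor vectors, then apply linear map + ReLU.
            avg = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
            msg = [max(0.0, sum(w * x for w, x in zip(row, avg)) + b)
                   for row, b in zip(weight, bias)]
        else:
            msg = [0.0] * dim  # isolated node: nothing to aggregate
        # Residual connection followed by layer normalization.
        out.append(layer_norm([x + m for x, m in zip(h[i], msg)]))
    return out


# Toy input: three EDUs, EDU 0 and EDU 1 connected, EDU 2 isolated.
h1 = [[1.0, 2.0], [3.0, 5.0], [0.0, 4.0]]
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h2 = graph_layer(h1, adj, identity, [0.0, 0.0])
```

Stacking K such layers lets information travel up to K hops along the discourse graph, which is how EDUs that are far apart in the text can still influence each other's representation.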

Training & Inference
During training, $h^G$ is used for predicting the oracle labels. Specifically, $\hat{y}_i = \sigma(W_7 h^G_i + b_7)$, where σ(·) represents the logistic function, and $\hat{y}_i$ is the prediction probability ranging from 0 to 1. The training loss of the model is the binary cross-entropy loss given the predictions and oracles:
$$\mathcal{L} = -\sum_{i=1}^{n} \left( y^*_i \log \hat{y}_i + (1 - y^*_i) \log (1 - \hat{y}_i) \right).$$
For DISCOBERT without graphs, the output from the Document Encoder $h^S$ is used for prediction instead. Oracle labels are created at the EDU level: we greedily pick EDUs together with their necessary dependencies until R-1 $F_1$ drops.
During inference, given an input document, after obtaining the prediction probabilities of all the EDUs, i.e., $\hat{y} = \{\hat{y}_1, \cdots, \hat{y}_n\}$, we sort $\hat{y}$ in descending order, and select EDUs accordingly. Note that the dependencies between EDUs are also enforced in prediction to ensure the grammaticality of generated summaries.
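The inference procedure above (rank EDUs by predicted probability, then pull in whatever each chosen EDU depends on) can be sketched as follows. The function name `select_edus`, the `requires` mapping, the fixed `max_edus` budget, and the policy of skipping a candidate whose dependency closure would exceed the budget are all assumptions of this sketch rather than details given in the text.

```python
def select_edus(probs, requires, max_edus):
    """Greedy EDU selection that respects discourse dependencies.

    probs:    predicted probability for each EDU (0-based list)
    requires: dict mapping an EDU index to the EDU indices it depends on
    max_edus: budget on the number of selected EDUs (assumed to be tuned)
    Returns the selected EDU indices in document order.
    """
    def closure(start):
        # Collect `start` plus everything it transitively depends on.
        needed, stack = set(), [start]
        while stack:
            cur = stack.pop()
            if cur in needed:
                continue
            needed.add(cur)
            stack.extend(requires.get(cur, ()))
        return needed

    selected = set()
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    for i in ranked:
        if i in selected:
            continue
        candidate = selected | closure(i)
        if len(candidate) > max_edus:
            continue  # adding this EDU with its dependencies is too costly
        selected = candidate
    return sorted(selected)


# Toy case: the top-scoring EDU 4 depends on EDU 2, which depends on EDU 1.
scores = [0.1, 0.4, 0.3, 0.2, 0.9]
deps = {4: {2}, 2: {1}}
summary_edus = select_edus(scores, deps, max_edus=3)
```

In the toy case the highest-scoring EDU brings its two dependencies with it, which exhausts the budget; the remaining EDUs are skipped, and the output is emitted in document order so that the summary reads naturally.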

Experiments
In this section, we present experimental results on two popular news summarization datasets. We compare our proposed model with state-of-the-art baselines and conduct detailed analysis to validate the effectiveness of DISCOBERT.

Datasets
We evaluate the models on two datasets: New York Times (NYT) (Sandhaus, 2008) and CNN and Dailymail (CNNDM) (Hermann et al., 2015). We use the script from See et al. (2017) to extract summaries from raw data, and Stanford CoreNLP for sentence boundary detection, tokenization and parsing (Manning et al., 2014). Due to the limitation of BERT, we only encode up to 768 BERT BPEs. Table 1 provides statistics of the datasets. The edges in $G_C$ are undirected, while those in $G_R$ are directional. For CNNDM, there are 287,226, 13,368 and 11,490 samples for training, validation and test, respectively. We use the un-anonymized version as in previous summarization work. NYT is licensed by LDC 6. Following previous work (Xu and Durrett, 2019), we use 137,778, 17,222 and 17,223 samples for training, validation, and test, respectively.

State-of-the-art Baselines
We compare our model with the following state-of-the-art neural text summarization models.
Extractive Models: BanditSum treats extractive summarization as a contextual bandit problem, trained with policy gradient methods (Dong et al., 2018). NeuSum is an extractive model with seq2seq architecture, where the attention mechanism scores the document and emits the index as the selection.
Compressive Models: JECS is a neural text-compression-based summarization model using BLSTM as the encoder (Xu and Durrett, 2019). The first stage selects sentences, and the second stage compresses them by pruning the constituency parse tree.
BERT-based Models: BERT-based models have achieved significant improvement on CNNDM and NYT, when compared with LSTM counterparts. BertSum is the first BERT-based extractive summarization model (Liu and Lapata, 2019). Our baseline model BERT is the re-implementation of BertSum. PNBert proposed a BERT-based model with various training strategies, including reinforcement learning and Pointer Networks (Zhong et al., 2019). HiBert is a hierarchical BERT-based model for document encoding, which is further pretrained with unlabeled data.

Implementation Details
We use AllenNLP (Gardner et al., 2018) as the code framework. The implementation of graph encoding is based on DGL. Experiments are conducted on a single NVIDIA P100 card, and the mini-batch size is set to 6 due to GPU memory capacity. The length of each document is truncated to 768 BPEs. We use the pre-trained 'bert-base-uncased' model and fine-tune it for all experiments. We train all our models for up to 80,000 steps. ROUGE (Lin, 2004) is used as the evaluation metric, and 'R-2' is used as the validation criterion.
The realization of discourse units and structure is a critical part of EDU pre-processing, which requires two steps: discourse segmentation and RST parsing. In the segmentation phase, we use a neural discourse segmenter based on the BiLSTM CRF framework (Wang et al., 2018). The segmenter achieved 94.3 $F_1$ score on the RST-DT test set, on which human performance is 98.3. In the parsing phase, we use a shift-reduce discourse parser to extract relations and identify nuclearity (Ji and Eisenstein, 2014).
The dependencies among EDUs are crucial to the grammaticality of selected EDUs. The derivation of dependencies takes two steps: head inheritance and tree conversion. Head inheritance defines the head node for each valid non-terminal tree node. For each leaf node, the head is itself. We determine the head node(s) of non-terminal nodes based on their nuclearity. 9 For example, in Figure 2, the heads of text spans [1-5], [2-5], [3-5] and [4-5] need to be grounded to a single EDU. We propose a simple yet effective schema to convert the RST discourse tree to a dependency-based discourse tree. 10 We always consider the dependency restrictions, such as the reliance of Satellite on Nucleus, when we create oracles during pre-processing and when the model makes predictions. For the example in Figure 2, if the model selects "[5] being carried ... Liberia." as a candidate span, we will enforce the model to select "[3] and shows ... 8," and "[2] This ... series," as well.
6 https://catalog.ldc.upenn.edu/LDC2008T19
The number of chosen EDUs depends on the average length of the reference summaries, dependencies across EDUs as mentioned above, and the length of the existing content. The optimal average number of EDUs selected is tuned on the development set.
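The head inheritance and tree conversion steps can be illustrated with a small sketch over a binary RST tree. It encodes the rules stated in the paper's footnotes: a leaf is its own head; with two Nucleus children the parent inherits the head of the left child; with one Nucleus and one Satellite child the parent inherits the head of the Nucleus child, and the Satellite's head depends on the Nucleus's head; with two Nucleus children whose right child contains no subject, the right head depends on the left head. The `Node` class, the `has_subject` flag, and the toy tree are hypothetical constructs introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    nuclearity: str                   # "N" (Nucleus) or "S" (Satellite)
    edu: Optional[int] = None         # EDU id for leaves, None otherwise
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    has_subject: bool = True          # hypothetical flag: span has a subject


def head(node: Node) -> int:
    """Head inheritance: ground every tree node to a single EDU."""
    if node.edu is not None:
        return node.edu               # a leaf is its own head
    if node.left.nuclearity == "N":
        return head(node.left)        # covers both N-N and N-S
    return head(node.right)           # S-N: inherit from the Nucleus child


def dependencies(node: Node, out=None) -> List[Tuple[int, int]]:
    """Tree conversion: return (dependent_edu, head_edu) pairs."""
    if out is None:
        out = []
    if node.edu is not None:
        return out
    left, right = node.left, node.right
    if left.nuclearity == "N" and right.nuclearity == "S":
        out.append((head(right), head(left)))
    elif left.nuclearity == "S" and right.nuclearity == "N":
        out.append((head(left), head(right)))
    elif left.nuclearity == right.nuclearity == "N" and not right.has_subject:
        out.append((head(right), head(left)))
    dependencies(left, out)
    dependencies(right, out)
    return out


# Toy tree over EDUs 1..3: EDU 1 is the Nucleus; the Satellite span [2-3]
# consists of two Nucleus leaves, the second of which lacks a subject.
inner = Node("S", left=Node("N", edu=2), right=Node("N", edu=3, has_subject=False))
root = Node("N", left=Node("N", edu=1), right=inner)
pairs = dependencies(root)
```

Reading the resulting pairs as "dependent requires head", selecting EDU 3 in this toy tree would also require EDU 2 and, transitively, EDU 1.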

Experimental Results
Results on CNNDM: Table 2 shows results on CNNDM. The first section includes Lead3 baseline, sentence-based oracle, and discourse-based oracle. The second section lists the performance of baseline models, including non-BERT-based and BERT-based variants. The performance of our proposed model is listed in the third section. BERT is our implementation of sentence-based BERT model. DISCOBERT is our discourse-based BERT model without Discourse Graph Encoder. DISCOBERT w. $G_C$ and DISCOBERT w. $G_R$ are the discourse-based BERT models with Coreference Graph and RST Graph, respectively. DISCOBERT w. $G_R$ & $G_C$ is the fusion model encoding both graphs.
The proposed DISCOBERT beats the sentence-based counterpart and all the competitor models. With the help of Discourse Graph Encoder, the graph-based DISCOBERT beats the state-of-the-art BERT model by a significant margin (0.52/0.61/1.04 $F_1$ on R-1/-2/-L). Ablation study with individual graphs shows that the RST Graph is slightly more helpful than the Coreference Graph, while the combination of both achieves better performance overall.
9 If both children are N(ucleus), then the head of the current node inherits the head of the left child. Otherwise, when one child is N and the other is S, the head of the current node inherits the head of the N child.
10 If one child node is N and the other is S, the head of the S node depends on the head of the N node. If both children are N and the right child does not contain a subject in the discourse, the head of the right N node depends on the head of the left N node.
Results on NYT: Results are summarized in Table 3. The proposed model surpasses the previous state-of-the-art BERT-based model by a significant margin. HIBERT$^*_S$ and HIBERT$^*_M$ used extra data for pre-training the model. We notice that in the NYT dataset, most of the improvement comes from the use of EDUs as minimal selection units. DISCOBERT provides 1.30/1.29/1.82 gain on R-1/-2/-L over the BERT baseline. However, the use of discourse graphs does not help much in this case.

Grammaticality
Due to segmentation and partial selection of sentences, the output of our model might not be as grammatical as the original sentences. We manually examined and automatically evaluated the model output, and observed that overall, the generated summaries are still grammatical, given the RST dependency tree constraining the rhetorical relations among EDUs. A set of simple yet effective post-processing rules helps to complete the EDUs in some cases.
Automatic Grammar Checking: We followed Xu and Durrett (2019) to perform automatic grammar checking using Grammarly. Table 4 shows the grammar checking results, where the average number of errors in every 10,000 characters on CNNDM and NYT datasets is reported. We compare DISCOBERT with the sentence-based BERT model. 'All' shows the summation of the number of errors in all categories. As shown in the table, the summaries generated by our model have retained the quality of the original text.
Human Evaluation: We sampled 200 documents from the test set of CNNDM and, for each sample, asked two Turkers to grade three summaries from 1 to 5. Results are shown in Table 5. The Sent-BERT model (the original BERTSum model) selects sentences from the document, hence providing the best overall readability, coherence, and grammaticality. In some cases, reference summaries are just long phrases, so the scores are slightly lower than those from the sentence model. The DISCOBERT model is slightly worse than the Sent-BERT model but is fully comparable to the other two variants.

Examples & Analysis
We show some examples of model output in Table 6. We notice that a considerable number of irrelevant details are removed from the extracted summary. Despite the success, we further conducted error analysis and found that the errors mostly originated from the RST dependency resolution and the upstream parsing errors of the discourse parser. The misclassification of RST dependencies and the hand-crafted rules for dependency resolution hurt the grammaticality and coherence of the 'generated' outputs. Common punctuation issues include extra or missing commas, as well as missing quotation marks. Some of the coherence issues originate from missing or improper anaphora resolution. In the example "['Johnny is believed to have drowned,]_1 [but actually he is fine,']_2 [the police say.]_3", selecting only the second EDU yields the sentence "actually he is fine", in which it is unclear who 'he' refers to.
Table 6 (example documents):
Clare Hines, who lives in Brisbane, was diagnosed with a brain tumour after suffering epileptic seizures. After a number of tests doctors discovered she had a benign tumour that had wrapped itself around her acoustic, facial and balance nerve -and told her she had have it surgically removed or she risked the tumour turning malignant. One week before brain surgery she found out she was pregnant.
Jordan Henderson, in action against Aston Villa at Wembley on Sunday, has agreed a new Liverpool deal. The club's vice captain puts pen to paper on a deal which will keep him at Liverpool until 2020. Rodgers will consider Henderson for the role of club captain after Steven Gerrard moves to LA Galaxy at the end of the campaign but, for now, the England international is delighted to have agreed terms on a contract that will take him through the peak years of his career.

Related Work
Neural Extractive Summarization: Neural networks have been widely used in extractive summarization. Various decoding approaches, including ranking (Narayan et al., 2018), index prediction and sequential labelling (Nallapati et al., 2017;Zhang et al., 2018;Dong et al., 2018), have been applied to content selection. Our model uses a similar configuration to encode the document with BERT as Liu and Lapata (2019) did, but we use discourse graph structure and graph encoder to handle the long-range dependency issue.
Neural Compressive Summarization: Text summarization with compression and deletion has been explored in some recent work. Xu and Durrett (2019) presented a two-stage neural model for selection and compression based on constituency tree pruning. Dong et al. (2019) presented a neural sentence compression model with discrete operations including deletion and addition. Different from these studies, as we use EDUs as the minimal selection basis, sentence compression is achieved automatically in our model.

Discourse & Summarization
The use of discourse theory for text summarization has been explored before. Louis et al. (2010) examined the benefit of graph structure provided by discourse relations for text summarization. Hirao et al. (2013); Yoshida et al. (2014) formulated the summarization problem as the trimming of the document discourse tree. Durrett et al. (2016) presented a system of sentence extraction and compression with ILP methods using discourse structure. Li et al. (2016) demonstrated that using EDUs as units of content selection leads to stronger summarization performance. Compared with them, our proposed method is the first neural end-to-end summarization model using EDUs as the selection basis.
Graph-based Summarization: Graph-based approaches have been explored in text summarization for decades. LexRank introduced a stochastic graph-based method for computing relative importance of textual units (Erkan and Radev, 2004). Yasunaga et al. (2017) employed a GCN on the relation graphs with sentence embeddings obtained from RNN. Tan et al. (2017) also proposed graph-based attention in an abstractive summarization model. Fernandes et al. (2018) developed a framework to reason about long-distance relationships for text summarization.

Conclusion
In this paper, we present DISCOBERT, which uses discourse unit as the minimal selection basis to reduce summarization redundancy and leverages two types of discourse graphs as inductive bias to capture long-range dependencies among discourse units. We validate the proposed approach on two popular summarization datasets, and observe consistent improvement over baseline models. For future work, we will explore better graph encoding methods, and apply discourse graphs to other tasks that require long document encoding.