Neural Extractive Text Summarization with Syntactic Compression

Recent neural network approaches to summarization are largely either selection-based extraction or generation-based abstraction. In this work, we present a neural model for single-document summarization based on joint extraction and syntactic compression. Our model chooses sentences from the document, identifies possible compressions based on constituency parses, and scores those compressions with a neural model to produce the final summary. For learning, we construct oracle extractive-compressive summaries, then learn both of our components jointly with this supervision. Experimental results on the CNN/Daily Mail and New York Times datasets show that our model achieves strong performance (comparable to state-of-the-art systems) as evaluated by ROUGE. Moreover, our approach outperforms an off-the-shelf compression module, and human and manual evaluations show that our model's output generally remains grammatical.


Introduction
Neural network approaches to document summarization have ranged from purely extractive (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018) to abstractive (Rush et al., 2015; Nallapati et al., 2016; Chopra et al., 2016; Tan et al., 2017; Gehrmann et al., 2018). Extractive systems are robust and straightforward to use. Abstractive systems are more flexible for varied summarization situations (Grusky et al., 2018), but can make factual errors (Cao et al., 2018) or fall back on extraction in practice (See et al., 2017). Extractive and compressive systems (Berg-Kirkpatrick et al., 2011; Qian and Liu, 2013; Durrett et al., 2016) combine the strengths of both approaches; however, there has been little work studying neural network models in this vein, and the approaches that have been employed typically use seq2seq-based sentence compression (Chen and Bansal, 2018).
In this work, we propose a model that combines the high performance of neural extractive systems, additional flexibility from compression, and the interpretability given by discrete compression options. Our model first encodes the source document and its sentences and then sequentially selects a set of sentences to further compress. Each sentence has a set of compression options available that are selected to preserve meaning and grammaticality; these are derived from syntactic constituency parses and represent an expanded set of discrete options relative to prior work (Berg-Kirkpatrick et al., 2011; Wang et al., 2013). The neural model additionally scores and chooses which compressions to apply given the context of the document, the sentence, and the decoder model's recurrent state.
A principal challenge of training an extractive and compressive model is constructing the oracle summary for supervision. We identify a set of high-quality sentences from the document with beam search and derive oracle compression labels in each sentence through an additional refinement process. Our model's training objective combines these extractive and compressive components and learns them jointly.
We conduct experiments on standard single-document news summarization datasets: CNN, Daily Mail (Hermann et al., 2015), and the New York Times Annotated Corpus (Sandhaus, 2008). Our model matches or exceeds the state of the art on all of these datasets and achieves the largest improvement on CNN (+2.4 ROUGE F1 over our extractive baseline) due to the more compressed nature of CNN summaries. We show that our model's compression threshold is robust across a range of settings yet tunable to give different-length summaries. Finally, we investigate the fluency and grammaticality of our compressed sentences. Human evaluation shows that our system yields generally grammatical output, with many remaining errors attributable to the parser.

Figure 2: Text compression example. In this case, "intimate", "well-known", "with their furry friends", and "featuring ... friends" are deletable given our compression rules.

Compression in Summarization
Sentence compression is a long-studied problem dealing with how to delete the least critical information in a sentence to make it shorter (Knight and Marcu, 2000, 2002; Martins and Smith, 2009; Cohn and Lapata, 2009; Wang et al., 2013; Li et al., 2014). Many of these approaches are syntax-driven, though end-to-end neural models have been proposed as well (Filippova et al., 2015; Wang et al., 2017). Past non-neural work on summarization has used both syntax-based (Berg-Kirkpatrick et al., 2011; Woodsend and Lapata, 2011) and discourse-based (Carlson et al., 2001; Hirao et al., 2013; Li et al., 2016) compressions. Our approach follows in the syntax-driven vein.
Our high-level approach to summarization is shown in Figure 1. In Section 3, we describe the models for extraction and compression. Our compression depends on having a discrete set of valid compression options that maintain the grammaticality of the underlying sentence, which we now proceed to describe.
Compression Rules We draw on the rules derived in Li et al. (2014), Wang et al. (2013), and Durrett et al. (2016) and design a concise set of syntactic rules licensing the removal of:
1. Appositive noun phrases;
2. Relative clauses and adverbial clauses;
3. Adjective phrases in noun phrases, and adverbial phrases (see Figure 2);
4. Gerundive verb phrases as part of noun phrases (see Figure 2);
5. Prepositional phrases in certain configurations like on Monday;
6. Content within parentheses and other parentheticals.
Figure 2 shows examples of several compression rules applied to a short snippet. All combinations of compressions maintain grammaticality, though some content is fairly important in this context (the VP and PP) and should not be deleted. Our model must learn not to delete these elements.
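As a concrete illustration, rules of this kind can be approximated by walking a constituency parse and collecting spans of nodes with deletable labels. The sketch below is a simplified subset of the rules: the label set, the gerundive-VP condition, and the tiny bracketed-parse reader are illustrative assumptions, not the authors' implementation.

```python
def parse_ptb(s):
    """Read a PTB-style bracketed parse into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                children.append(child)
            else:
                children.append(tokens[i])  # leaf word
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def leaves(node):
    label, children = node
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

# Simplified: any parenthetical, adjective/adverb phrase, or clause is a
# candidate; the paper's rules further restrict these by configuration.
DELETABLE = {"PRN", "ADJP", "ADVP", "SBAR"}

def compression_options(node, parent_label=None):
    """Collect deletable spans: DELETABLE labels, plus gerundive VPs
    inside NPs (rule 4 in the text)."""
    label, children = node
    options = []
    if label in DELETABLE:
        options.append(" ".join(leaves(node)))
    if label == "VP" and parent_label == "NP":  # gerundive VP modifier
        options.append(" ".join(leaves(node)))
    for c in children:
        if isinstance(c, tuple):
            options.extend(compression_options(c, label))
    return options

tree = parse_ptb("(S (NP (NP (DT a) (NN exhibit)) (VP (VBG featuring) "
                 "(NP (JJ furry) (NNS friends)))) "
                 "(VP (VBD opened) (ADVP (RB recently))))")
options = compression_options(tree)
# options == ['featuring furry friends', 'recently']
```

Each returned span is a discrete option the downstream model scores; any subset of deletions is assumed to leave a grammatical sentence.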
Compressibility Summaries from different sources may feature various levels of compression. At one extreme, a summary could be fully sentence-extractive; at the other, the editor may have heavily compressed the content of each sentence. In Section 4, we examine this question on our summarization datasets and use it to motivate our choice of evaluation datasets.
Universal Compression with ROUGE While we use syntax as a source of compression options, we note that other ways of generating compression options are possible, including using labeled compression data. However, supervising compression with ROUGE is critical for learning what information is important for this particular source, and in any case, labeled compression data is unavailable in many domains. In Section 5, we compare our model to an off-the-shelf sentence compression module and find that the latter substantially underperforms our approach.

Model
Our model is a neural network that encodes a source document, chooses sentences from that document, and selects discrete compression options to apply. The architectures of the sentence extraction module and the text compression module are shown in Figure 3.

Extractive Sentence Selection
A single document consists of n sentences D = {s_1, s_2, ..., s_n}. The i-th sentence is denoted s_i = {w_{i1}, w_{i2}, ...}, where w_{ij} is the j-th word in s_i. The content selection module learns to pick a subset of D, denoted D̂ = {ŝ_1, ŝ_2, ..., ŝ_k | ŝ_i ∈ D}, where k sentences are selected.

Figure 3: Sentence extraction module of JECS. Words in input document sentences are encoded with BiLSTMs. Two layers of CNNs aggregate these into sentence representations h_i and then the document representation v_doc. This is fed into an attentive LSTM decoder which selects sentences based on the decoder state d and the representations h_i, similar to a pointer network.

Sentence & Document Encoder
We first use a bidirectional LSTM to encode the words of each sentence in the document separately, and then apply multiple convolution and max pooling layers to extract a representation h_i of the i-th sentence. This process is shown on the left side of Figure 3, illustrated in purple blocks. We then aggregate these sentence representations into a document representation v_doc with a similar BiLSTM and CNN combination, shown in Figure 3 with orange blocks.
Decoding The decoding stage selects a number of sentences given the document representation v_doc and the sentence representations h_i. This process is depicted in the right half of Figure 3. We use a sequential LSTM decoder where, at each time step t, we take the representation h_k of the sentence selected at time step t−1, the overall document vector v_doc, and the recurrent state d_{t−1}, and produce a distribution over all of the remaining sentences, excluding those already selected. This approach resembles pointer network-style approaches used in past work.
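The selection loop can be sketched as follows. The scorer here is a toy stand-in for the learned attentive decoder (in the real model, scores come from d_{t−1}, v_doc, and the h_i); only the greedy masking logic is shown.

```python
import math

def select_sentences(score_fn, n_sents, k):
    """Sequentially pick k sentences, masking out those already selected,
    in the style of a pointer-network decoder (greedy decoding sketch)."""
    selected = []
    for t in range(k):
        scores = score_fn(t, selected)  # one raw score per sentence
        masked = [s if i not in selected else -math.inf
                  for i, s in enumerate(scores)]
        selected.append(max(range(n_sents), key=lambda i: masked[i]))
    return selected

# Toy scorer: fixed salience scores standing in for the neural decoder.
salience = [0.1, 0.9, 0.5, 0.3]
picked = select_sentences(lambda t, sel: salience, 4, 2)
# picked == [1, 2]
```

In the trained model the scores change at each step because the decoder state is updated with the previously selected sentence; the masking of already-chosen indices is the same.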

Text Compression
After selecting the sentences, the text compression module evaluates our discrete compression options and decides whether to remove certain phrases or words in the selected sentences. Figure 4 shows an example of this process for deciding whether or not to delete a PP in this sentence. This PP was marked as deletable based on the rules described in Section 2. Our network then encodes this sentence and the compression, combines this information with the document context v_doc and decoding context h_dec, and uses a feedforward network to decide whether or not to delete the span.
Let C_i = {c_{i1}, ..., c_{il}} denote the possible compression spans derived from the rules described in Section 2. Let y_{i,c} be a binary variable equal to 1 if we delete the c-th option of the i-th sentence. Our text compression module models p(y_{i,c} | D, ŝ_t = s_i) as described in the following section.
Compression Encoder We use a contextualized encoder, ELMo (Peters et al., 2018), to compute contextualized word representations. We then use CNNs with max pooling to encode the sentence (shown in blue in Figure 4) and the candidate compression (shown in light green in Figure 4). The sentence representation v_sent and the compression span representation v_comp are concatenated with the sentence decoder hidden state h_dec and the document representation v_doc.

Compression Classifier
We feed the concatenated representation to a feedforward neural network to predict whether the compression span should be deleted or kept, formulated as a binary classification problem. This classifier computes the final probability p(y_{i,c} | D, ŝ_t = s_i). The overall probability of a summary (ŝ, ŷ), where ŝ is the sequence of extracted sentences and ŷ the compression decisions, is the product of the extraction and compression model probabilities.

Heuristic Deduplication Inspired by the trigram avoidance trick proposed in Paulus et al. (2018) to reduce redundancy, we take advantage of our linguistically motivated compression rules and the constituency parse tree and allow our model to delete chunks carrying redundant information. Specifically, we apply a postprocessing stage to our model's output in which we remove any compression option whose unigrams are completely covered elsewhere in the summary. This deduplication is performed after the model's prediction and compression steps.
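The deduplication heuristic can be sketched as below. This is our reading of "completely covered elsewhere": every unigram of the option also appears in the summary outside the option's own span. The token handling (lowercasing, no stemming) is an illustrative assumption.

```python
def dedup(sentences, options):
    """sentences: list of token lists; options: list of (sent_idx, start, end)
    spans marked deletable. Removes any span whose unigrams all appear
    elsewhere in the summary; returns the deduplicated sentences."""
    to_delete = []
    for (i, s, e) in options:
        span = set(w.lower() for w in sentences[i][s:e])
        rest = set()
        for j, sent in enumerate(sentences):
            for k, w in enumerate(sent):
                if not (j == i and s <= k < e):
                    rest.add(w.lower())
        if span <= rest:  # fully covered elsewhere -> redundant
            to_delete.append((i, s, e))
    out = []
    for j, sent in enumerate(sentences):
        dels = [(s, e) for (i, s, e) in to_delete if i == j]
        out.append([w for k, w in enumerate(sent)
                    if not any(s <= k < e for s, e in dels)])
    return out

sents = [["the", "cat", "sat", "on", "the", "mat"],
         ["the", "dog", "saw", "the", "cat", "on", "the", "mat"]]
# "on the mat" in sentence 1 is fully covered elsewhere -> deleted;
# "sat" appears nowhere else -> kept.
result = dedup(sents, [(1, 5, 8), (0, 2, 3)])
# result[1] == ["the", "dog", "saw", "the", "cat"]
```

Because deletions are restricted to syntactically deletable spans, the output stays grammatical even though the trigger is purely lexical.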

Training
Our model makes a series of sentence extraction decisions ŝ and then compression decisions ŷ. To supervise it, we need to derive gold-standard labels for these decisions. Our oracle identification approach relies on first identifying an oracle set of sentences and then the oracle compression options within them.

Oracle Construction
Sentence Extractive Oracle We first identify an oracle set of sentences to extract using a beam search procedure similar to Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). For each additional sentence we propose to add, we compute a heuristic score equal to the ROUGE of that sentence with respect to the reference summary. When pruning states, we calculate the ROUGE score of the combination of sentences currently selected and sort in descending order. Let the beam width be β. The time complexity of this approximate approach is O(nkβ), where in practice k ≪ n and β ≪ n. We set β = 8 and n = 30, which means we only consider the first 30 sentences of the document.
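A minimal version of this beam search looks like the following; the scoring function is a toy unigram-recall stand-in for ROUGE (the real oracle scores combinations against the reference with ROUGE proper).

```python
def oracle_beam_search(score_fn, n, k, beta):
    """Grow sentence combinations one sentence at a time, keeping the
    top-beta combinations by score at each step (extractive-oracle sketch)."""
    beam = [()]
    for _ in range(k):
        candidates = set()
        for combo in beam:
            for i in range(n):
                if i not in combo:
                    candidates.add(tuple(sorted(combo + (i,))))
        beam = sorted(candidates, key=score_fn, reverse=True)[:beta]
    return beam

# Toy stand-in for ROUGE recall: fraction of reference tokens covered.
ref = {"a", "b", "c", "d"}
sents = [["a", "b"], ["a"], ["c", "d"], ["x"]]
def recall(combo):
    covered = set(w for i in combo for w in sents[i])
    return len(covered & ref) / len(ref)

best = oracle_beam_search(recall, n=4, k=2, beta=2)[0]
# best == (0, 2): sentences "a b" and "c d" cover the whole reference
```

This is O(nkβ) combination scorings, matching the complexity stated above, and the final beam of β combinations is exactly the set of oracles used later for training.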
The beam search procedure returns β different sentence combinations in the final beam. We use the sentence extractive oracle for both the extraction-only model and the joint extraction-compression model.
Oracle Compression Labels To form our joint extractive and compressive oracle, we need to give the compression decisions binary labels y_{i,c} in each set of extracted sentences. For simplicity and computational efficiency, we assign each compression option a single label y_{i,c} independent of the context it occurs in. For each compression option, we assess its value by comparing the ROUGE score of the sentence with and without the phrase. Any option whose deletion increases ROUGE is treated as a compression that should be applied. When calculating this ROUGE value, we remove stop words and apply stemming.
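Concretely, the labeling step compares sentence-level ROUGE with and without each span. Here is a sketch with a toy unigram F1 in place of ROUGE (omitting the stemming and stop-word removal that the real oracle applies):

```python
def unigram_f1(cand_tokens, ref_tokens):
    """Toy unigram F1 as a stand-in for ROUGE-1."""
    cand, ref = set(cand_tokens), set(ref_tokens)
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def label_options(sentence, spans, ref):
    """Label an option 1 (delete) iff removing its span improves the score."""
    base = unigram_f1(sentence, ref)
    labels = []
    for (s, e) in spans:
        compressed = sentence[:s] + sentence[e:]
        labels.append(1 if unigram_f1(compressed, ref) > base else 0)
    return labels

sentence = ["stocks", "rose", "sharply", "on", "monday"]
ref = ["stocks", "rose", "sharply"]
labels = label_options(sentence, [(2, 3), (3, 5)], ref)
# labels == [0, 1]: keep "sharply" (it appears in the reference),
# delete "on monday" (pure precision loss)
```

Since each option is labeled independently of the others, the procedure is linear in the number of options per sentence.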
We run this procedure on each of our oracle extractive sentences. The fraction of positive and negative labels assigned to compression options is shown for each of the three datasets in Table 2. CNN is the most compressible dataset among CNN, DM, and NYT.
ILP-based oracle construction Past work has derived oracles for extractive and compressive systems using integer linear programming (ILP) (Gillick and Favre, 2009; Berg-Kirkpatrick et al., 2011). Following their approach, we can directly optimize for the ROUGE recall of an extractive or compressive summary in our framework if we specify a length limit. However, we evaluate on ROUGE F1, as is standard when comparing to neural models that don't produce fixed-length summaries. Optimizing for ROUGE F1 cannot be formulated as an ILP, since computing precision requires dividing by the number of selected words, making the objective nonlinear. We experimented with optimizing for ROUGE F1 indirectly by finding optimal ROUGE recall summaries at various settings of the maximum summary length. However, these summaries frequently contained short sentences added just to fill the budget, and the collection of summaries returned tended to be less diverse than those found by beam search.
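To see concretely why F1 resists an ILP formulation, let x_i ∈ {0,1} indicate whether sentence i is selected, l_i its length, and m_i the number of reference words it matches (a simplified word-overlap view that ignores double counting across sentences):

```latex
P(x) = \frac{\sum_i x_i m_i}{\sum_i x_i l_i}, \qquad
R(x) = \frac{\sum_i x_i m_i}{L_{\text{ref}}}, \qquad
F_1(x) = \frac{2\,P(x)\,R(x)}{P(x) + R(x)}
       = \frac{2 \sum_i x_i m_i}{\sum_i x_i l_i + L_{\text{ref}}}.
```

The selection variables appear in the denominator, so F1 is a ratio of linear functions of x rather than a linear objective; recall, by contrast, is linear in x once a length budget fixes the feasible set, which is why the ILP oracles of past work target recall.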

Learning Objective
Often, many oracle summaries achieve very similar ROUGE values, so we want to avoid committing to a single oracle summary during learning. Our procedure from Section 4.1 can generate m extractive oracles s*_i; let s*_{i,t} denote the gold sentence for the i-th oracle at timestep t. Past work (Narayan et al., 2018; Chen and Bansal, 2018) has employed policy gradient in this setting to optimize directly for ROUGE. However, because oracle summaries usually have very similar ROUGE scores, we simplify this objective to L_sent = -(1/m) Σ_i Σ_t log p(s*_{i,t} | D, s*_{i,<t}). Put another way, we optimize the log likelihood averaged across the m different oracles to ensure that each has high likelihood. We use m = 5 oracles during training. The oracle sentence indices are sorted according to individual salience (ROUGE score) rather than document order.
The objective of the compression module is defined analogously: L_comp is the negative log likelihood of the target decisions, summed over compression options, where each term is the probability of the target decision for the c-th compression option of the i-th sentence. The joint loss function is L = L_sent + αL_comp. We set α = 1 in practice.

Experiments
We evaluate our model on two axes. First, for content selection, we use ROUGE, as is standard. Second, we evaluate the grammaticality of our model's output to ensure that it is not substantially damaged by compression.

Experimental Setup
Datasets We evaluate the proposed method on three popular news summarization datasets: the New York Times corpus (Sandhaus, 2008) and CNN and Daily Mail (DM) (Hermann et al., 2015). As discussed in Section 2, compression will give different results on different datasets depending on how much compression is optimal from the standpoint of reproducing the reference summaries, which changes how measurable the impact of compression is. In Table 2, we show the "compressibility" of these three datasets: how valuable various compression options seem to be from the standpoint of improving ROUGE. We found that CNN has significantly more positive compression options than the other two. Critically, CNN also has the shortest references (37 words on average).

Models We present several variants of our model to show how extraction and compression work jointly. In extractive summarization, the LEAD baseline (the first k sentences) is a strong baseline due to how newswire articles are written. LEADDEDUP is a non-learned baseline that applies our heuristic deduplication technique to the lead sentences. LEADCOMP is a compression-only model where compression is performed on the lead sentences; this shows the effectiveness of the compression module in isolation rather than in combination with learned extraction. EXTRACTION is the extraction-only model. JECS is the full Joint Extractive and Compressive Summarizer. We compare our model with various abstractive and extractive summarization models. NeuSum (Zhou et al., 2018) uses a seq2seq model to predict a sequence of sentence indices to be selected from the document; our extractive approach is most similar to this model. Refresh (Narayan et al., 2018), BanditSum (Dong et al., 2018), and LatSum are extractive summarization models included for comparison. We also compare with abstractive models including PointGen-Cov (See et al., 2017), FARS (Chen and Bansal, 2018), and CBDec (Jiang and Bansal, 2018).
We also compare our joint model with a pipeline model using an off-the-shelf compression module: we implement a deletion-based BiLSTM model for sentence compression (Wang et al., 2017) and run it on top of our extraction output. (We reimplemented the authors' model following their specification and matched their accuracy; for fair comparison, we tuned the deletion threshold to match the compression rate of our model, and other choices did not lead to better ROUGE scores.) The pipeline model is denoted EXTLSTMDEL.
Table 3 shows experimental results on CNN. We list the performance of the LEAD baseline and of competitor models on this dataset. Starred models are evaluated according to our ROUGE settings; these numbers very closely match the originally reported results. Our model achieves substantially higher performance than all baselines and past systems (+2 ROUGE F1 compared to any of these). On this dataset, compression is substantially useful. Compression is somewhat effective in isolation, as shown by the performance of LEADDEDUP and LEADCOMP, but compression in isolation still gives less benefit (on top of LEAD) than when combined with the extractive model in the joint framework (JECS). Furthermore, our model beats the pipeline model EXTLSTMDEL, which shows the necessity of training a joint model with ROUGE supervision.

Results on Combined CNNDM and NYT
We also report results on the full CNNDM and NYT datasets, although they are less compressible. Tables 4 and 5 show the experimental results on these datasets.

Table 5: Experimental results on the NYT50 dataset. ROUGE-1, -2, and -L F1 are reported. JECS substantially outperforms our lead-based systems and our extractive model.

Our models still yield strong performance compared to baselines and past work on the CNNDM dataset. The EXTRACTION model achieves results comparable to past successful extractive approaches on CNNDM, and JECS improves on this across the datasets. In some cases, our model slightly underperforms on ROUGE-2. One possible reason is that we remove stop words when constructing our oracles, which could underestimate the importance of bigrams containing stop words for evaluation. Finally, we note that our compressive approach substantially outperforms the compression-augmented LatSum model; that model used a separate seq2seq model for rewriting, which is potentially harder to learn than our compression model. On NYT, we see again that the inclusion of compression leads to improvements both in the LEAD setting and for our full JECS model.

Grammaticality
We evaluate the grammaticality of our compressed summaries in three ways. First, we use Amazon Mechanical Turk to compare different compression techniques. Second, to measure absolute grammaticality, we use an automated out-of-the-box tool, Grammarly. Finally, we conduct a manual analysis.
Human Evaluation We first conduct a human evaluation on the Amazon Mechanical Turk platform. We ask Turkers to rank different compression versions of a sentence in terms of grammaticality. We compare our full JECS model and the off-the-shelf pipeline model EXTLSTMDEL, which have matched compression ratios. We also propose another baseline, EXTRACTDROPOUT, which randomly drops words in a sentence to match the compression ratio of the other two models.

Table 6: Human preference, ROUGE, and Grammarly grammar checking results. We asked Turkers to rank the models' output based on grammaticality. Error shows the number of grammar errors in 500 sentences reported by Grammarly. Our JECS model achieves the highest ROUGE and is preferred by humans while still making relatively few errors.

The results are shown in Table 6. Turkers give roughly equal preference to our model and the EXTLSTMDEL model, which was learned from supervised compression data. However, our JECS model achieves a substantially higher ROUGE score, indicating that it represents a more effective compression approach. We found that absolute grammaticality judgments were hard to obtain on Mechanical Turk: Turkers' ratings of grammaticality were very noisy, and they did not consistently rate true article sentences above obviously noised variants. We therefore turn to other methods, described in the next two paragraphs.
Automatic Grammar Checking We use Grammarly to check 500 sentences from CNN sampled from the outputs of the three models mentioned above. Both EXTLSTMDEL and JECS make a small number of grammar errors, not much higher than the purely extractive LEAD3 baseline. One major source of errors for JECS is having the wrong article after the deletion of an adjective, as in an [awesome] style.

Manual Error Analysis
To get a better sense of our model's output, we conduct a manual analysis of the applied compressions to see how many are valid. We manually examined 40 model summaries, comparing the output with the raw sentences before compression, and identified errors of several types. We see a variety of compression options used in the first two examples, including removal of temporal PPs, large subordinate clauses, adjectives, and parentheticals. The last example features less compression, only removing a handful of adjectives in a manner that slightly changes the meaning of the summary.
Improving the parser and deriving a more semantically aware set of compression rules could help achieve better grammaticality and readability. However, we note that such errors are largely orthogonal to the core of our approach; a more refined set of compression options could be dropped into our system without changing our fundamental model.

Compression Analysis
Compression Threshold Compression in our model is an imbalanced binary classification problem. The trained model's natural classification threshold (probability of DEL > 0.5) may not be optimal for downstream ROUGE. We experiment with varying the classification threshold from 0 (no deletion, only heuristic deduplication) to 1 (all compressible pieces removed). Figure 5 shows the average ROUGE value on CNN at different compression thresholds. The model achieves its best performance at 0.45 but performs well over a wide range from 0.3 to 0.55. Our compression is therefore robust, yet it also provides a controllable parameter to change the amount of compression in the produced summaries.
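The threshold sweep amounts to the following; the scoring function here is a toy stand-in for computing ROUGE of the resulting summaries on a development set.

```python
def apply_threshold(del_probs, tau):
    """Delete a compression option iff its predicted p(DEL) exceeds tau."""
    return [p > tau for p in del_probs]

def best_threshold(del_probs, score_fn, taus):
    """Pick the threshold whose deletions score best downstream."""
    return max(taus, key=lambda t: score_fn(apply_threshold(del_probs, t)))

# Toy setup: reward agreement with "gold" deletions [False, True, True].
probs = [0.2, 0.6, 0.9]
gold = [False, True, True]
score = lambda dels: sum(d == g for d, g in zip(dels, gold))
tau = best_threshold(probs, score, [0.1, 0.5, 0.95])
# tau == 0.5
```

Raising tau toward 1 recovers the purely extractive (plus deduplication) behavior, while lowering it toward 0 deletes every compressible span, which is what makes the threshold a usable length control.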
Compression Type Analysis We further break down the types of compressions used in the model.

Related Work
Neural Extractive Summarization Neural networks have been shown to be effective in extractive summarization. Past approaches have structured the decision either as binary classification over sentences (Cheng and Lapata, 2016; Nallapati et al., 2017) or as classification followed by ranking (Narayan et al., 2018); other work (such as NeuSum, discussed above) used a seq2seq decoder instead. For our model, text compression forms a module largely orthogonal to the extraction module, so additional improvements to extractive modeling might be expected to stack with our approach.
Syntactic Compression Prior to the explosion of neural models for summarization, syntactic compression (Martins and Smith, 2009; Woodsend and Lapata, 2011) was relatively more common. Several systems explored the use of constituency parses (Berg-Kirkpatrick et al., 2011; Wang et al., 2013; Li et al., 2014) as well as RST-based approaches (Hirao et al., 2013; Durrett et al., 2016). Our approach follows in this vein but could be combined with more sophisticated neural text compression methods as well.
Neural Text Compression Filippova et al. (2015) presented an LSTM approach to deletion-based sentence compression. Miao and Blunsom (2016) proposed a deep generative model for text compression. Other work has explored applying a compression module after an extraction model, but the separation of these two modules hurt performance. For this work, we find that relying on syntax gives us more easily understandable and controllable compression options. Contemporaneously with our work, Mendes et al. (2019) explored an extractive and compressive approach using compression integrated into a sequential decoding process; however, their approach does not leverage explicit syntax and makes several different model design choices.

Conclusion
In this work, we presented a neural network framework for extractive and compressive summarization. Our model consists of a sentence extraction model joined with a compression classifier that decides whether or not to delete syntax-derived compression options for each sentence. Training the model involves finding an oracle set of extraction and compression decisions with high score, which we do through a combination of a beam search procedure and heuristics. Our model outperforms past work on the CNN/Daily Mail corpus in terms of ROUGE, achieves substantial gains over the extractive model, and produces output with acceptable grammaticality according to human evaluation.