At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization

Extractive methods have proven effective in automatic document summarization. Previous works perform this task by identifying informative content at the sentence level. However, it is unclear whether performing extraction at the sentence level is the best solution. In this work, we show that unnecessity and redundancy issues exist when extracting full sentences, and that extracting sub-sentential units is a promising alternative. Specifically, we propose extracting sub-sentential units based on the constituency parsing tree. We present a neural extractive model that leverages sub-sentential information and extracts these units. Extensive experiments and analyses show that extracting sub-sentential units performs competitively compared to full sentence extraction under both automatic and human evaluations. We hope our work provides some inspiration regarding the basic extraction unit in extractive summarization for future research.


Introduction
Automatic text summarization aims to produce a brief piece of text which preserves the most important information of the input. The important contents are identified and then extracted to form the output summary (Nenkova and McKeown, 2011). In recent decades, extractive methods have proven effective in many systems (Carbonell and Goldstein, 1998; Mihalcea and Tarau, 2004; McDonald, 2007; Cao et al., 2015; Cheng and Lapata, 2016; Zhou et al., 2018; Nallapati et al., 2017).
In previous works, extractive summarization systems perform extraction at the sentence level (Mihalcea and Tarau, 2004; Cheng and Lapata, 2016; Nallapati et al., 2017). As the extraction unit, a sentence is a grammatical unit of one or more words that express a statement, question, request, etc. There are several advantages of extracting sentences to form the output summary. First, extractive systems are simpler, easier to develop, and faster at run-time in real application scenarios, compared with abstractive systems. Moreover, original sentences in the input are naturally fluent and grammatically correct. Finally, extracted sentences are factually faithful to the input document, compared with the output of abstractive methods (Cao et al., 2018).
* Contribution during internship at Microsoft Research.
Despite the success of extractive systems, previous works leave it unclear whether extracting at the sentence level is the best solution for extractive document summarization. There are several drawbacks of extracting full sentences from the input document. The most obvious issue is that the extracted sentences may contain unnecessary information. Some previous works have also noticed this problem and try to solve it by compressing or rewriting the extracted sentences (Martins and Smith, 2009; Chen and Bansal, 2018; Xu and Durrett, 2019). Furthermore, extracted sentences may contain duplicate content. Thus, methods such as Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) and sentence fusion (Barzilay and McKeown, 2005; Lebanoff et al., 2019) have been proposed to avoid or merge duplicate content.
The redundancy and unnecessity issues might be caused by extracting full sentences, since an important sentence may also contain unnecessary information. Besides, sentences of different importance may share duplicate (un)important words. This motivates us to perform extraction at a finer granularity so that the important and unimportant contents can be separated. Therefore, we propose extracting sub-sentential units within a sentence as a solution. As for the sub-sentential units, we mainly focus on the non-terminal nodes of a constituency parsing tree in this paper. In the parsing tree of a sentence, the root node represents the entire sentence while each leaf node represents its corresponding lexical token. An extractive system can perform extraction on the non-terminal nodes, which express more fine-grained information. To keep the advantages of extractive methods, we choose the nodes which can still express a full statement. Specifically, the nodes with a clause tag, such as S and SBAR, are used for creating extraction units.
In this paper, we conduct experiments and analyses to answer the following questions:
From the perspective of supervision, extractive summarization methods fall into two major types: unsupervised methods and supervised methods. One of the difficulties in training an extractive system is the lack of extraction labels. The reason is that most reference summaries are written by human experts, so it is hard to find their exact counterparts in the input document. Without natural training labels, unsupervised and supervised methods treat extractive summarization as different problems.
Graph-based methods (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan and Yang, 2006) are very useful unsupervised methods. In these methods, the input document is represented as a connected graph. The vertices represent the sentences, and the edges between vertices carry weights that indicate the similarity of the two sentences. The score of a sentence is the importance of its corresponding vertex, which can be computed using graph algorithms.
Supervised methods for extractive summarization create training labels manually. Cao et al. (2015) and Ren et al. (2017) directly train regression models using ROUGE scores as the supervision. Cheng and Lapata (2016), Nallapati et al. (2017), Zhou et al. (2018) and Zhang et al. (2019) search for the oracle extracted sentences and use them as the training labels. Cheng and Lapata (2016) propose treating document summarization as a sequence labeling task. They first encode the sentences in the document and then classify each sentence into two classes, i.e., extraction or not. Nallapati et al. (2017) propose a system called SummaRuNNer with more features, which also treats extractive document summarization as a sequence labeling task. Zhou et al. (2018) propose using pointer networks (Vinyals et al., 2015) to repeatedly extract sentences.
Recently, Reinforcement Learning (RL) has also been introduced in unsupervised extractive summarization (Dong et al., 2018; Böhm et al., 2019). Dong et al. (2018) treat sentence extraction as a bandit problem so that they can train an RL-based system whose reward is the ROUGE score. Böhm et al. (2019) propose that using human judgments as the reward is better than using ROUGE. However, these methods still need some kind of reward as a signal, which differs from the fully unsupervised methods introduced above.

Q1: Review of Extracting Full Sentence
Performing sentence extraction in summarization systems has proven effective in previous works (Luhn, 1958; Cao et al., 2015; Cheng and Lapata, 2016; Nallapati et al., 2017). Despite the success of these systems, it is still unclear whether performing content extraction at the sentence level is the best solution. In this section, we examine the drawbacks of extraction at the sentence level.

The Dataset
There are various datasets for text summarization, such as DUC/TAC, CNN/Daily Mail, New York Times, etc. In this paper, we take the most commonly used dataset in recent research works (Cheng and Lapata, 2016; Nallapati et al., 2017; Zhou et al., 2018; Xu and Durrett, 2019; Lebanoff et al., 2019), CNN/Daily Mail, as our testbed. Its statistics can be found in Table 2. One of the most distinguishing features of this dataset is that the output summary is in the form of highlights written by the news editors. As shown in the example in Figure 1, the summary (highlights) is a list of bullets. Therefore, extractive methods perform well on this dataset (Grusky et al., 2018).

The Drawbacks
There are two main potential drawbacks of extracting sentences. First, unnecessary information is smuggled in with the extracted sentences. Second, duplicate content may appear when extracting multiple sentences. To analyze whether these issues exist, we conduct experiments and analyses with both count-based statistics and human judgments. We consider two different settings to reach our final conclusion, i.e., the extractive oracle and a real extractive system. First, we check the quality of the sentence-level extractive oracle, since it is the upper bound of any extraction system. Two different methods are used in recent extractive summarization research for building the oracle training labels. The first one is based on the semantic correspondence (Woodsend and Lapata, 2010) between document sentences and the reference summary, and is used in Cheng and Lapata (2016). The second one is heuristic and maximizes the ROUGE score with respect to the gold summaries. It is more broadly used in recent extractive systems (Nallapati et al., 2017; Zhou et al., 2018; Zhang et al., 2019; Liu, 2019). We adopt the second method since it is more widely used and easy to implement. The extractive oracle is computed with the ROUGE-2 F1 score, which is also the metric used in the final automatic evaluation in these systems.
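The heuristic oracle described above can be illustrated with a minimal sketch. It assumes a greedy search, a simplified ROUGE-2 F1 (clipped bigram overlap without stemming, with bigrams formed across sentence boundaries), and a hypothetical `max_sentences` cap; none of these details are specified in the text, and the official ROUGE script would give different absolute scores.

```python
from collections import Counter


def bigrams(tokens):
    """Return the multiset of bigrams in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-2 F1: clipped bigram overlap, no stemming."""
    cand, ref = bigrams(candidate_tokens), bigrams(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(doc_sentences, reference_tokens, max_sentences=3):
    """Greedily add the sentence that most improves the simplified ROUGE-2 F1."""
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for idx in range(len(doc_sentences)):
            if idx in selected:
                continue
            # Score the candidate summary in document order.
            chosen = sorted(selected + [idx])
            candidate = [tok for i in chosen for tok in doc_sentences[i]]
            score = rouge2_f1(candidate, reference_tokens)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected), best_score
```

The returned indices serve as binary extraction labels: a sentence is labeled 1 if its index is selected and 0 otherwise.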
Second, we check the output of a BERT-based sentence-level extraction method, denoted as BERT-SENT. Following previous works (Devlin et al., 2018; Liu, 2019; Zhang et al., 2019), BERT-SENT treats extractive document summarization as a sentence classification task. The model is borrowed from Liu (2019), but we remove its interval segment embeddings since they do not bring obvious benefits.

Unnecessary Information
In this section, we examine whether unnecessary information is introduced unavoidably when extracting full sentences.
Oracle: Table 1 shows the information of the extractive oracle on the CNN/Daily Mail test set.
The ROUGE-1 precision of the extractive oracle is 52.59, which means that 47.41 percent of the extracted unigrams are not in the reference summary. As for ROUGE-2, the precision drops drastically to 33.97. These two metrics show that a large amount of unwanted lexical units, i.e., unigrams and bigrams, are extracted along with the desired contents. This indicates that unnecessary information exists at the lexical level.
Surface lexical matching (ROUGE scores) has its limits, since it cannot fully capture the semantic level. We therefore also conduct human analysis to check whether unnecessary information is extracted at the same time. The labeling criterion for unnecessary information is whether a 5-token span is not needed when compared with the ground-truth summary. We randomly sampled 50 documents from the CNN/Daily Mail test set. Evaluation results show that 48% of the extractive oracles contain unnecessary information.
BERT-SENT: Similar experiments and analyses are also conducted on BERT-SENT. Table 6 shows the count-based statistics. Results show that 63.07% of the unigrams and 82.73% of the bigrams are not in the reference summary. These rates are much higher than those of the sentence-level extractive oracle, and show that the unnecessity issue is quite severe. Human evaluation shows that 54% of the outputs contain unnecessary information.

Redundancy
In this section, we check whether the redundancy problem exists in extractive summarization. Similarly, we conduct the experiments and analysis on the extractive oracle and the BERT-SENT summarization system. We first define a metric for redundancy, i.e., the n-gram overlap rate. We calculate the n-gram overlap between each pair of sentences. This overlap is calculated as:
Oracle: Table 1 shows the information of the extractive oracle on the CNN/Daily Mail test set. It can be observed that 19.24% of the unigrams and 2.22% of the bigrams are duplicated in the extractive oracle, which is much higher than in the reference summary.
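As an illustration of a pairwise n-gram overlap rate, the sketch below counts the n-grams shared by each pair of sentences and averages over all pairs. The normalization (dividing by the n-gram count of the shorter sentence of each pair) and the averaging over pairs are assumptions made for this example only; the exact formula used for the reported numbers is not reproduced here, so the sketch should not be expected to match them.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pair_overlap(sent_a, sent_b, n):
    """Shared n-grams of two sentences, normalized by the smaller n-gram count."""
    a, b = ngrams(sent_a, n), ngrams(sent_b, n)
    smaller = min(sum(a.values()), sum(b.values()))
    if smaller == 0:
        return 0.0
    return sum((a & b).values()) / smaller


def overlap_rate(sentences, n):
    """Average pairwise n-gram overlap over all sentence pairs of a summary."""
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    return sum(pair_overlap(a, b, n) for a, b in pairs) / len(pairs)
```

A summary with a single extracted sentence has no pairs and therefore an overlap rate of zero under this definition.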
Beyond these lexical-level statistics, human evaluation is also conducted. Results show that 12% of the extractive oracles have the redundancy issue. This result matches the n-gram overlap rates and shows that the redundancy issue exists even in the oracle.
BERT-SENT: Table 6 shows that BERT-SENT has high n-gram overlap rates, i.e., 27.18% 1-gram and 7.68% 2-gram overlap. Thus, the redundancy issue is more severe in a real system than in the extractive oracle, even for a state-of-the-art BERT-based system. Human evaluation also shows that 49% of the BERT-SENT outputs have the redundancy issue.

Q2: Efficacy of Extracting Sub-sentential Units
In this section, we propose an alternative to performing extraction on full sentences for extractive document summarization. Instead, we perform extraction on sub-sentential units. Specifically, the extraction units are based on the clause nodes in the constituency parsing tree of a sentence. Figure 3 shows two simplified examples of the constituency tree. The root node in a constituency tree represents the entire sentence, and each leaf node represents its corresponding lexical token. Extracting the root node is essentially extracting the full sentence, while extracting leaf nodes amounts to compression by extracting words (Filippova et al., 2015). We perform extraction on the non-terminal nodes which can both express a relatively complete meaning and be human-readable. Therefore, the clause nodes, such as S and SBAR, are a good choice. In this section, we introduce how to perform extraction on the sub-sentential units, and present a BERT-based model for it.

The Sub-Sentential Units
In order to perform extraction on the sub-sentential units, we need to determine which units can be extracted. The proposed method is based on the sub-sentential clauses in the constituency parsing tree. In our experiments, we adopt the syntactic tagset used in the Penn Treebank (PTB) (Marcus et al., 1993). There are two main types of tags in the PTB tagset, phrase and clause. We use the clause tags since the information in a clause is more complete than that in a phrase. Given the parsing tree $t_i$ of sentence $s_i$, we traverse it to determine the boundaries of the extraction units. Specifically, every clause is treated as an extraction unit candidate. If one of its ancestors is a clause node, we choose the highest-level ancestor clause node (except for the root node) as the extraction unit so as to include more complete information. This heuristic is visualized in Figure 3. If no sub-sentential clause can be found in a sentence, we use the full sentence as the extraction unit. Finally, the input sentence is split into chunks using the selected clauses' boundaries.
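The traversal heuristic above can be sketched as follows. The tree encoding (nested tuples of the form `(label, child, ...)` with plain strings as leaves), the function names, and the choice to group tokens outside any selected clause into their own chunks are assumptions of this illustration, not details given in the text. The clause-level tag set is the standard PTB one.

```python
CLAUSE_TAGS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}  # PTB clause-level tags


def leaves(node):
    """Return the tokens under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [tok for child in children for tok in leaves(child)]


def split_into_units(root):
    """Split a sentence into chunks at the highest non-root clause nodes."""
    chunks, pending = [], []

    def visit(node):
        nonlocal pending
        if isinstance(node, str):
            pending.append(node)          # token outside any selected clause
            return
        label, *children = node
        if label in CLAUSE_TAGS:
            if pending:                   # close the run of uncovered tokens
                chunks.append(pending)
                pending = []
            chunks.append(leaves(node))   # highest clause: do not descend
            return
        for child in children:
            visit(child)

    _label, *children = root              # the root itself is never a unit
    for child in children:
        visit(child)
    if pending:
        chunks.append(pending)
    return chunks
```

When the tree contains no clause node below the root, all tokens end up in a single chunk, which corresponds to falling back to the full sentence. How stray tokens such as sentence-final punctuation are attached to neighboring units is not handled in this sketch.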

Model
In this section, we propose a BERT-based neural extractive summarization model for extracting sub-sentential units (SSE). We follow previous works (Cheng and Lapata, 2016; Nallapati et al., 2017; Xu and Durrett, 2019; Liu, 2019) and treat document summarization as a sequence labeling task. Figure 2 shows an overview of the proposed model. It consists of two levels of encoders. The first level is the BERT-based document encoder, and the second level is the Transformer-based sub-sentential unit encoder. The BERT-based document encoder reads the tokens in the document, and then the Transformer-based encoder constructs the final extraction unit representations.

BERT-based Encoder
Following previous works (Liu, 2019; Zhang et al., 2019; Lebanoff et al., 2019) which use BERT and achieve state-of-the-art results, we use BERT as the first-level encoder. The processed input document is denoted as $D = (S_1, S_2, \ldots, S_n) = (w_1, w_2, \ldots, w_m)$ with $n$ sentences and $m$ BPE tokens. The $i$-th sentence contains $l$ chunks, $S_i = (C_{i,1}, C_{i,2}, \ldots, C_{i,l})$. The $j$-th chunk with $k$ words in $S_i$ is denoted as $C_{i,j} = (w_{i,j,1}, \ldots, w_{i,j,k})$. Following Liu (2019), we add additional [CLS] and [SEP] labels between sentences to separate them. However, since the extraction unit is not the full sentence, the vector of [CLS] is not used for classification in our model. After the BERT encoder, the vectors of the $m$ document tokens are represented as $(w^{BT}_1, w^{BT}_2, \ldots, w^{BT}_m)$.

Transformer-based Encoder
The BERT-based encoder reads the entire document and builds the representation of each word. The Transformer-based encoder then constructs the final representation of each chunk. As shown in Figure 2, we first apply average pooling at the chunk level. Specifically, given the BERT-based encoder output of chunk $C_{i,j}$, i.e., $(w^{BT}_{i,j,1}, \ldots, w^{BT}_{i,j,k})$, the pooled representation $\bar{C}_{i,j}$ is:
$$\bar{C}_{i,j} = \frac{1}{k} \sum_{t=1}^{k} w^{BT}_{i,j,t}$$
Note that the [CLS] and [SEP] labels are not covered by the chunks, and thus are not used in the average pooling.
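A minimal sketch of the chunk-level average pooling is given below, using plain Python lists in place of encoder tensors. The function name and the `(start, end)` half-open span format are hypothetical; spans are assumed to be non-empty and to skip the positions of [CLS] and [SEP].

```python
def average_pool_chunks(token_vectors, chunk_spans):
    """Mean-pool token vectors over each chunk span [start, end)."""
    pooled = []
    for start, end in chunk_spans:
        rows = token_vectors[start:end]
        dim = len(rows[0])
        # Average each dimension over the tokens of the chunk.
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Positions 0 and 4 stand for [CLS] and [SEP] and are left out of every span.
vectors = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
chunk_vectors = average_pool_chunks(vectors, [(1, 3), (3, 4)])
```

In a tensor framework the same operation would typically be done with a mask or a segment mean rather than a Python loop.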
After the average pooling, the document is represented as a sequence of chunk vectors: $\bar{C} = (\bar{C}_{1,1}, \ldots, \bar{C}_{1,l_1}, \ldots, \bar{C}_{n,1}, \ldots, \bar{C}_{n,l_n})$. We then apply a chunk-level Transformer to capture their relationships for extracting summaries, where MultiHead(·) is the multi-head attention of the Transformer (Vaswani et al., 2017), LN(·) is layer normalization (Ba et al., 2016), and FFN(·) is a feed-forward network which consists of two linear transformations with a ReLU activation in between. In this paper, we simplify MultiHead($Q$, $K$, $V$) to MultiHead(·) since we only use the self-attention mechanism for encoding, and thus $Q = K = V$.
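For reference, one layer of a chunk-level Transformer built from the three components named above can be written in the standard post-layer-normalization form of Vaswani et al. (2017). This is an assumed form given for illustration; the symbols $h^{l}$ and $\tilde{h}^{l}$ and the layer index $l$ are introduced here, with $h^{0}$ taken to be the sequence of pooled chunk vectors.

```latex
\tilde{h}^{l} = \mathrm{LN}\big(h^{l-1} + \mathrm{MultiHead}(h^{l-1})\big), \qquad
h^{l} = \mathrm{LN}\big(\tilde{h}^{l} + \mathrm{FFN}(\tilde{h}^{l})\big)
```

Under this reading, the output of the last layer provides one vector per chunk, which is what the classification layer consumes.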

Training Objective
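With the chunk-level representation vector $h_{i,j}$ produced by the chunk-level Transformer for chunk $C_{i,j}$, the model predicts the output probability of each chunk:
$$p(C_{i,j}) = \sigma(W_o h_{i,j} + b_o)$$
where $\sigma(\cdot)$ is the sigmoid function, and $W_o$ and $b_o$ are the weight and bias parameters of a linear layer.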
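The training objective of the model is the binary cross-entropy loss given the extractive oracle label $y_{i,j}$ and the predicted probability $p(C_{i,j})$:
$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{j=1}^{l_i} \big( y_{i,j} \log p(C_{i,j}) + (1 - y_{i,j}) \log (1 - p(C_{i,j})) \big)$$

Experiment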

Dataset
Following previous extractive works (Zhou et al., 2018; Xu and Durrett, 2019; Lebanoff et al., 2019; Zhang et al., 2019; Dong et al., 2018), we conduct data preprocessing using the same method¹ as See et al. (2017), including sentence splitting and word tokenization. We then process the input documents with a state-of-the-art BERT-based constituency parser (Kitaev and Klein, 2018), which achieves 95.17 F1 on the WSJ test set. The statistics of the original CNN/Daily Mail dataset and the sub-sentential version are listed in Table 2.


Implementation Details
We found that the tokenizer used in the constituency parser is different from the one in BERT. Therefore, we apply some simple tokenization fixes to the text before feeding it into BERT. The input of the BERT-based encoder is then processed with BERT's subword tokenizer. Since the maximum length in BERT's position embedding is 512, we truncate the document to 512 subwords. We use Adam (Kingma and Ba, 2015) as the optimization algorithm. For the hyperparameters of the Adam optimizer, we set the learning rate $\alpha = 2 \times 10^{-5}$, the two momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The model is implemented with PyTorch (Paszke et al., 2017) and PyTorch Transformer (Wolf et al., 2019). We use the bert-base-uncased version of BERT, which has 12 pre-trained Transformer layers. We train the model using 4 NVIDIA P100 GPUs with a batch size of 40. The dropout (Srivastava et al., 2014) rates in all the Transformer layers are set to 0.1. We train the model for 4 epochs, which takes about 6 hours. The final model is picked according to the performance on the development set among the 4 model checkpoints.
During inference, we rank the extraction units according to $p(C_{i,j})$ and select the top ones. Since the extraction unit in this paper is shorter than a full sentence, we repeatedly select the next sub-sentential unit until the summary length reaches the limit. The length limit is set to 60 words according to the statistics on the development set in Table 2.
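The inference procedure can be sketched as a simple rank-and-select loop. The function name is hypothetical, and two details are assumptions of this sketch: the loop stops as soon as the accumulated length reaches the limit (so the last unit may push the summary slightly past it), and the selected units are returned in document order.

```python
def select_units(units, probabilities, max_words=60):
    """Pick units by descending probability until the word limit is reached."""
    ranked = sorted(range(len(units)), key=lambda i: probabilities[i], reverse=True)
    chosen, length = [], 0
    for idx in ranked:
        if length >= max_words:   # the summary has reached the length limit
            break
        chosen.append(idx)
        length += len(units[idx])
    # Restore document order for the final summary.
    return [units[i] for i in sorted(chosen)]
```

Each unit is a list of tokens, and `probabilities[i]` is the predicted extraction probability of `units[i]`.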

Evaluation Metric
We employ ROUGE (Lin, 2004) as our evaluation metric. ROUGE measures the quality of a summary by computing overlapping lexical units, such as unigrams, bigrams, trigrams, and the longest common subsequence (LCS). It has become the standard evaluation metric for DUC shared tasks and is popular for summarization evaluation. Following previous work, we use ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (LCS) as the evaluation metrics in the reported experimental results.
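To make the LCS-based variant concrete, the sketch below computes a simplified ROUGE-L F1 between two token sequences. It is an illustration only: it treats the whole summary as one sequence and uses a balanced F1, whereas the official ROUGE script applies a summary-level union LCS and its own preprocessing, so the scores will not match the official ones.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Precision is the LCS length divided by the candidate length, and recall is the LCS length divided by the reference length.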
Additionally, we conduct human evaluation on the output summaries. Following previous works (Cheng and Lapata, 2016; Nallapati et al., 2017; Liu, 2019; Zhang et al., 2019), we randomly sampled 50 documents from the CNN/Daily Mail test set, which are the same as in §3.

Automatic Evaluation
Table 3 shows the ROUGE evaluation results. We compare SSE with the following systems:
Abstractive Systems: Pointer-Generator Network (PGN) (See et al., 2017) and DCA (Celikyilmaz et al., 2018) are sequence-to-sequence models with copy and coverage mechanisms. FastRewrite (Chen and Bansal, 2018) conducts extraction first and then generation. InconsisLoss (Hsu et al., 2018) regularizes the word-level attention with sentence-level extraction attention. Bottom-Up (Gehrmann et al., 2018) applies constraints on the copying probability.
The proposed method SSE achieves state-of-the-art results on the CNN/Daily Mail dataset. According to the output of the official ROUGE script, the differences between SSE and the baselines are all statistically significant with a 0.95 confidence interval. Compared to our sentence-level extraction baseline system BERT-SENT, using sub-sentential unit extraction leads to a +0.

Human Evaluation
Human evaluations are also conducted on the same 50 randomly sampled documents as in Section 3. The BERT-SENT and SSE models are evaluated.
The workers are asked to rank the outputs of these systems from best to worst by overall quality (with ties allowed). In addition, we are also curious about how sub-sentential extraction solves the problems of full sentence extraction. Specifically, the workers are asked to identify whether redundant or unnecessary information exists. Table 4 presents the human evaluation results. We compare SSE with BERT-SENT. As shown in the results, SSE performs better than BERT-SENT on both redundancy and unnecessity. The frequency of these issues drops by 20% and 17% respectively. Thus the overall quality of SSE is also better than that of BERT-SENT.

Document: (CNN) When Etan Patz went missing in York City at age 6 , hardly anyone in America could help but see his face at their breakfast table . His photo 's appearance on milk cartons after his May 1979 disappearance marked an era of heightened awareness of crimes against children . On Friday , more than 35 years after frenzied media coverage of his case horrified parents everywhere , a New York jury will again deliberate over a possible verdict against the man charged in his killing , Pedro Hernandez . He confessed to police three years ago . Etan Patz 's parents have waited that long for justice , but some have questioned whether that is at all possible in Hernandez 's case ....... He said he killed the boy and threw his body away in a plastic bag . Neither the child nor his remains have ever been recovered . But Hernandez has been repeatedly diagnosed with schizophrenia and has an " IQ in the borderline-to-mild mental retardation range , " his attorney Harvey Fishbein has said . Police interrogated Hernandez for 7 1/2 hours before he confessed ....... A judge found Ramos responsible for the boy 's death and ordered him to pay the family $ 2 million -money the Patz family has never received .......
Reference: The young boy 's face appeared on milk cartons all across the United States . Patz 's case marked a time of heightened awareness of crimes against children . Pedro Hernandez confessed three years ago to the 1979 killing in .

Analysis
Table 5 shows an example of a document, its gold summary and the output of SSE. It can be observed that by performing sub-sentential extraction, the full sentence is broken into more fine-grained semantic units. Therefore, during extraction, the model can extract important parts without introducing unimportant content.
Table 6 shows the statistics of the outputs of BERT-SENT and SSE. As in Section 3, we conduct experiments and analyses with both statistics and human judgments, on both the unnecessary information and redundancy issues. Compared to BERT-SENT, SSE achieves much higher ROUGE precision. This shows that extracting sub-sentential units brings in less unimportant information. We also found that the n-gram overlap rate of SSE is much lower than that of BERT-SENT, which shows that the output contains less redundant content.

Q4: Readability of Sub-Sentential Units
Performing sub-sentential unit extraction improves the ROUGE scores and alleviates the problems of extracting full sentences. However, it is not clear whether this method introduces new issues. One possible issue is that the sub-sentential units are fragmented, so the readability is poor. To investigate this problem, we also add an item about the readability of the produced summary to the human evaluation questionnaire. In detail, the workers are asked to rank the system outputs by readability (with ties allowed).
Table 4 shows the readability results. As shown in the results, BERT-SENT is always ranked as the best since its sentences are fully extracted. The readability of extracted sub-sentential units is slightly worse than that of full sentences. We also manually checked the SSE outputs whose readability is worse. We found two reasons: 1) the sub-sentence is fragmented, which affects the readability; 2) the sub-sentence is wrongly extracted due to errors of the constituency parser. Therefore, we expect that the readability of SSE could be improved if we can: 1) design a better sub-sentential unit extraction algorithm; 2) have an even better syntactic parser.

Conclusion
In this paper, we investigate the problem of extraction granularity for extractive document summarization. We observe that performing extraction at the sentence level suffers from redundancy and unnecessity issues. We found that these problems can be alleviated by sub-sentential unit extraction.
Both automatic and human evaluations show that sub-sentential extraction performs competitively compared to full-sentence extraction systems. Therefore, sub-sentential unit extraction could be a promising alternative to full-sentence extraction. Our experiments and analyses on revisiting the basic extraction unit could provide some hints for future research in this direction.

Figure 1: A screenshot example of a document-summary pair in the CNN/Daily Mail dataset.

Figure 3: Two simplified constituency parsing trees. The nodes in circles are candidates. The final selected node is the one in the red solid-lined circle.
Figure 2: The overview of the BERT-based model for sub-sentential extraction (SSE). In this simplified example, the document has 3 sentences. The first and the third sentences have two extraction units and the second sentence has one. After encoding the document with the pre-trained BERT encoder, an average pooling layer is used to aggregate the information of each extraction unit. The final Transformer layer captures the document-level information and then the MLP predicts the extraction probability.

Table 2: Data statistics of the CNN/Daily Mail dataset.
¹ https://github.com/abisee/cnn-dailymail
56 ROUGE improvement. As for the other existing systems which leverage BERT or other pre-training techniques and perform extraction at the sentence level, SSE still outperforms them with statistical significance in terms of ROUGE.

Table 3: Full-length ROUGE F1 evaluation (%) on the CNN/Daily Mail test set. Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software.

Table 4: Human evaluation results. Unnecessity and Redundancy are reported as occurrence frequency, and lower is better. Readability and Overall are reported as ranking, and lower is better.

Table 5: An example document and gold summary in the CNN/Daily Mail test set. The words highlighted in red are extracted as a full sentence. The italic words highlighted in cyan are extracted as sub-sentential units.

Table 6: Statistics of the BERT-SENT and SSE methods on the CNN/Daily Mail test set.