Searching for Effective Neural Extractive Summarization: What Works and What’s Next

Recent years have seen remarkable success in the use of deep neural networks for text summarization. However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper, we seek to better understand how neural extractive summarization systems could benefit from different types of model architectures, transferable knowledge, and learning schemas. Based on our observations and analysis, we find an effective way to improve the current framework and achieve a new state-of-the-art result on CNN/DailyMail by a large margin. We hope our work can provide more hints for future research on extractive summarization.


Introduction
Recent years have seen remarkable success in the use of deep neural networks for text summarization (See et al., 2017; Celikyilmaz et al., 2018; Jadhav and Rajan, 2018). So far, most research applying neural networks to text summarization has revolved around architecture engineering (Zhou et al., 2018; Chen and Bansal, 2018; Gehrmann et al., 2018).
Despite their success, it remains poorly understood why they perform well and what their shortcomings are, which limits our ability to design better architectures. The rapid development of neural architectures calls for a detailed empirical study of analyzing and understanding existing models.
In this paper, we primarily focus on extractive summarization, since extractive systems are computationally efficient and can generate grammatically correct and coherent summaries (Nallapati et al., 2017). We seek to better understand how neural network-based approaches to this task could benefit from different types of model architectures, transferable knowledge, and learning schemas, and how they might be improved.
Architectures Architecturally, better performance usually comes at the cost of our understanding of the system. To date, we know little about the functionality of each neural component or the differences between them (Peters et al., 2018b), which raises the following questions: 1) How does the choice of neural architecture (CNN, RNN, Transformer) influence the performance of the summarization system? 2) Which components matter for a specific dataset? 3) Do current models suffer from an over-engineering problem?
Understanding the above questions can not only help us choose suitable architectures for different application scenarios, but also motivate us to move toward more powerful frameworks.
External Transferable Knowledge and Learning Schemas Clearly, the improvements in accuracy and performance are not merely due to the shift from feature engineering to structure engineering, but also to flexible ways of incorporating external knowledge (Mikolov et al., 2013; Peters et al., 2018a; Devlin et al., 2018) and learning schemas that introduce extra instructive constraints (Paulus et al., 2017; Arumae and Liu, 2018). For this part, we make some first steps toward answering the following questions: 1) Which type of pre-trained model (supervised or unsupervised pre-training) is more beneficial to the summarization task? 2) When architectures have been explored exhaustively, can we push the state-of-the-art results to a new level by introducing external transferable knowledge or switching to another learning schema?
To make a comprehensive study of the above analytical perspectives, we first build a testbed for summarization systems, in which training and testing environments are constructed. In the training environment, we design different summarization models to analyze how they influence performance. Specifically, these models differ in their architectures (encoders: CNN, LSTM, Transformer (Vaswani et al., 2017); decoders: auto-regressive, non auto-regressive), external transferable knowledge (GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), NEWSROOM (Grusky et al., 2018)) and learning schemas (supervised learning and reinforcement learning).
To peer into the internal working mechanisms of the above testing cases, we provide sufficient evaluation scenarios in the testing environment. Concretely, we present a multi-domain test and a sentence shuffling test, and analyze models with different metrics: repetition, sentence length, and positional bias, which we additionally developed to provide a better understanding of the characteristics of different datasets.
Empirically, our main observations are summarized as follows: 1) Architecturally speaking, models with an auto-regressive decoder tend to achieve better performance than those with a non auto-regressive decoder. Besides, LSTM is more likely to suffer from the architecture overfitting problem, while Transformer is more robust.
2) The success of extractive summarization system on the CNN/DailyMail corpus heavily relies on the ability to learn positional information of the sentence.
3) Unsupervised transferable knowledge is more useful than supervised transferable knowledge, since the latter is easily affected by the domain shift problem. 4) We find an effective way to improve the current system, achieving a state-of-the-art result on CNN/DailyMail by a large margin with the help of unsupervised transferable knowledge (42.39 R-1 score). This result can be further enhanced by introducing reinforcement learning (42.69 R-1 score).
Hopefully, this detailed empirical study can provide more hints for follow-up researchers to design better architectures and explore new state-of-the-art results in the right direction.

Related Work
This work is connected to the following threads of NLP research.

Interpreting Task-oriented Neural Networks
Without knowing the internal working mechanism of a neural network, it is easy to get stuck when the performance of a task reaches a bottleneck. More recently, Peters et al. (2018b) investigated how different learning frameworks influence the properties of learned contextualized representations. Different from that work, in this paper we focus on dissecting neural models for text summarization.
The work closest to ours is Kedzie et al. (2018), which studies how deep learning models perform content selection across several typical summarization architectures and domains. Compared with that work, we make a more comprehensive study and provide more analytic perspectives. For example, we additionally investigate how transferable knowledge influences extractive summarization, as well as a more popular neural architecture, the Transformer. Besides, we come to different conclusions when analyzing the auto-regressive decoder. More importantly, our paper also shows how existing systems can be improved, and we achieve state-of-the-art performance on CNN/DailyMail.
Extractive Summarization Most recent work attempts to explore different neural components or their combinations to build end-to-end learning models. Specifically, these works instantiate the encoder-decoder framework by choosing recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2017; Zhou et al., 2018) as the encoder, and either an auto-regressive (Chen and Bansal, 2018; Jadhav and Rajan, 2018; Zhou et al., 2018) or a non auto-regressive decoder (Isonuma et al., 2017; Narayan et al., 2018; Arumae and Liu, 2018), based on pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014). However, how to use the Transformer in extractive summarization remains an open issue. In addition, some works use reinforcement learning techniques (Narayan et al., 2018; Wu and Hu, 2018; Chen and Bansal, 2018), which can provide more direct optimization goals. Although the above works improve the performance of summarization systems from different perspectives, a comprehensive study remains missing.

A Testbed for Text Summarization
To analyze neural summarization systems, we propose to build a training-testing environment, in which different testing cases (models) are first generated under different training settings and then evaluated under different testing settings. Before introducing our training-testing testbed, we first give a description of text summarization.

Task Description
Existing methods of extractive summarization directly choose and output the salient sentences (or phrases) from the original document. Formally, given a document D = (d_1, ..., d_n) consisting of n sentences, the objective is to extract a subset of sentences R = (r_1, ..., r_m) from D; m is deterministic during training, while it is a hyper-parameter in the testing phase.
Generally, most existing extractive summarization systems can be abstracted into the following framework, consisting of three major modules: a sentence encoder, a document encoder, and a decoder. First, the sentence encoder converts each sentence d_i into a sentential representation. These sentence representations are then contextualized by the document encoder into s_i. Finally, the decoder extracts a subset of sentences based on these contextualized sentence representations.
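To make the data flow of this three-module framework concrete, the following minimal sketch (our own illustration, not the paper's code) uses mean pooling as a stand-in sentence encoder, a simple mixing operation as a stand-in document encoder, and a toy scoring decoder; real systems replace each stage with CNN/LSTM/Transformer layers.

```python
# Illustrative sketch of the generic extractive framework:
# sentence encoder -> document encoder -> decoder.
import numpy as np

def sentence_encoder(sentence_word_vecs):
    """Collapse a sentence's word vectors into one vector d_i (mean pooling)."""
    return np.mean(sentence_word_vecs, axis=0)

def document_encoder(sent_reprs):
    """Contextualize each d_i into s_i; here, mix in the document mean."""
    doc_mean = np.mean(sent_reprs, axis=0)
    return [0.5 * d + 0.5 * doc_mean for d in sent_reprs]

def decoder(contextual_reprs, m):
    """Score each s_i and extract the m highest-scoring sentences."""
    scores = [float(np.sum(r)) for r in contextual_reprs]
    return sorted(np.argsort(scores)[::-1][:m].tolist())

doc = [np.random.rand(5, 8) for _ in range(4)]   # 4 sentences, 8-dim word vectors
d = [sentence_encoder(sent) for sent in doc]
s = document_encoder(d)
print(decoder(s, m=2))  # indices of the 2 extracted sentences
```

The decoder stage is where the SeqLab/Pointer distinction discussed later comes in: SeqLab scores all sentences at once, while Pointer extracts them one at a time.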

Setup for Training Environment
The objective of this step is to provide typical and diverse testing cases (models) in terms of model architectures, transferable knowledge and learning schemas.

Sentence Encoder
We instantiate our sentence encoder with a CNN layer (Kim, 2014). We do not explore other options for the sentence encoder, since strong evidence from previous work (Kedzie et al., 2018) shows that the differences between existing sentence encoders do not matter much for final performance.

Document Encoder
Given a sequence of sentential representations d_1, ..., d_n, the duty of the document encoder is to contextualize each sentence, thereby obtaining the contextualized representations s_1, ..., s_n. To achieve this goal, we investigate the LSTM-based structure and the Transformer structure, both of which have proven effective and achieved state-of-the-art results in many other NLP tasks. Notably, to let each model make the best of its structural bias, stacking deep layers is allowed.
LSTM Layer The long short-term memory network (LSTM) was proposed by Hochreiter and Schmidhuber (1997) to address the difficulty of learning long-term dependencies, and it has proven effective in a wide range of NLP tasks, such as text classification (Liu et al., 2016b), semantic matching (Rocktäschel et al., 2015; Liu et al., 2016a), text summarization (Rush et al., 2015) and machine translation (Sutskever et al., 2014).
Transformer Layer The Transformer (Vaswani et al., 2017) is essentially a feed-forward self-attention architecture, which achieves pairwise interaction via the attention mechanism. Recently, the Transformer has achieved great success in many other NLP tasks (Vaswani et al., 2017; Dai et al., 2018), and it is appealing to know how this neural module performs on the text summarization task.

Decoder
The decoder is used to extract a subset of sentences from the original document based on the contextualized representations s_1, ..., s_n. Most existing decoder architectures can be divided into auto-regressive and non auto-regressive versions, both of which are investigated in this paper.
Sequence Labeling (SeqLab) Models that formulate the extractive summarization task as a sequence labeling problem are equipped with a non auto-regressive decoder. Formally, given a document D consisting of n sentences d_1, ..., d_n, the summary is extracted by predicting a sequence of labels y_1, ..., y_n (y_i ∈ {0, 1}) for the document, where y_i = 1 indicates that the i-th sentence should be included in the summary.
Pointer Network (Pointer) As a representative auto-regressive decoder, the pointer-network-based decoder has shown superior performance for extractive summarization (Chen and Bansal, 2018; Jadhav and Rajan, 2018). The pointer network selects sentences with an attention mechanism using the glimpse operation (Vinyals et al., 2015). When it extracts a sentence, the pointer network is aware of its previous predictions.
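The key difference from SeqLab can be sketched as below: at each step the model attends over sentence representations, masks sentences it has already selected, and conditions the next step on its last choice. The scoring function here is a toy placeholder; the real model uses an LSTM decoder state and the glimpse operation.

```python
# Hypothetical sketch of auto-regressive, pointer-style extraction
# (our own simplification, not the paper's implementation).
import numpy as np

def pointer_decode(sent_reprs, m):
    n, dim = sent_reprs.shape
    query = np.zeros(dim)                # stand-in for the decoder state
    selected = []
    for _ in range(m):
        # attend over sentences (toy scoring: dot product + content term)
        scores = sent_reprs @ query + sent_reprs.sum(axis=1)
        scores[selected] = -np.inf       # awareness of previous predictions
        pick = int(np.argmax(scores))
        selected.append(pick)
        query = sent_reprs[pick]         # condition the next step on the choice
    return selected

reps = np.random.rand(6, 8)              # 6 sentences, 8-dim representations
print(pointer_decode(reps, m=3))         # 3 distinct sentence indices
```

Masking previously selected sentences is what gives Pointer its built-in defense against repetition, a point the repetition analysis later relies on.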

External transferable knowledge
The success of neural network-based models on NLP tasks cannot be attributed solely to the shift from feature engineering to structural engineering, but also to flexible ways of incorporating external knowledge (Mikolov et al., 2013; Peters et al., 2018a; Devlin et al., 2018). The most common form of external transferable knowledge is parameters pre-trained on other corpora.
To investigate how different pre-trained models influence the summarization system, we take the following pre-trained knowledge into consideration.
Unsupervised transferable knowledge Two typical kinds of unsupervised transferable knowledge are explored in this paper: context-independent word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and contextualized word embeddings (Peters et al., 2018a; Devlin et al., 2018), the latter of which have recently pushed state-of-the-art results to a new level on a large number of NLP tasks.
Supervised pre-trained knowledge Besides unsupervised pre-trained knowledge, we can also utilize parameters of networks pre-trained on other summarization datasets. The value of this investigation is to understand the transferability between different datasets. To achieve this, we first pre-train our model on the NEWSROOM dataset (Grusky et al., 2018), which is one of the largest summarization datasets and contains samples from different domains. Then, we fine-tune our model on the target domains that we investigate.

Learning Schemas
Utilizing external knowledge provides a way to seek new state-of-the-art results from the perspective of introducing extra data. Alternatively, one can change the learning schema of the model. In this paper, we also explore how different learning schemas influence the extractive summarization system by comparing supervised learning and reinforcement learning.

Setup for Testing Environment
In the testing environment, we provide sufficient evaluation scenarios to probe the internal working mechanisms of the testing models. Next, we give a detailed description.
ROUGE Following previous work in text summarization, we evaluate the performance of different architectures with the standard ROUGE-1, ROUGE-2 and ROUGE-L F1 scores (Lin, 2004), computed with the pyrouge package.

Cross-domain Evaluation
We present a multi-domain evaluation, in which each testing model is evaluated on multi-domain datasets based on CNN/DailyMail and NEWSROOM. Details of the multi-domain datasets are described in Tab. 2.
Repetition We design a repetition score to test how differently architectures behave in avoiding unnecessarily lengthy and repeated information. We use the percentage of repeated n-grams in the extracted summary to measure word-level repetition, which can be calculated as:

REP_n(S) = Count(Uniq(n-gram(S))) / Count(n-gram(S)),

where Count counts the number of n-grams and Uniq eliminates n-gram duplication. The closer the word-based repetition score is to 1, the lower the repetition of words in the summary.
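A minimal implementation of this word-level repetition score, following the description above (ratio of unique n-grams to total n-grams in the extracted summary), might look like:

```python
# Repetition score: unique n-grams / total n-grams.
# A score of 1 means no n-gram in the summary is repeated.
def repetition_score(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

# "the cat" appears twice among the 9 bigrams, so 8 are unique.
summary = "the cat sat on the mat and the cat slept".split()
print(round(repetition_score(summary, 2), 3))  # 0.889
```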
Positional Bias It is meaningful to study whether the ground-truth distributions of the datasets differ and how this affects different architectures. To this end, we design a positional bias to describe the uniformity of the ground-truth distribution in different datasets, which can be calculated as:

PosBias = − Σ_{i=1}^{k} p(i) log p(i),

where we divide each article into k parts (we choose k = 30 because articles from CNN/DailyMail and NEWSROOM have 30 sentences on average) and p(i) denotes the probability that the first golden label is in part i of the article.
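Assuming this entropy reading of the measure (a dataset whose gold labels always fall in the same part scores 0, while a perfectly uniform distribution over k = 30 parts would score log(30) ≈ 3.4), a sketch of the computation is:

```python
# Positional bias as the entropy of p(i), the distribution over article
# parts of the first gold-labeled sentence (illustrative sketch).
import math

def positional_bias(first_label_positions, article_lengths, k=30):
    counts = [0] * k
    for pos, length in zip(first_label_positions, article_lengths):
        part = min(int(pos / length * k), k - 1)   # which of the k parts
        counts[part] += 1
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# All gold labels in the first part -> zero positional bias.
print(positional_bias([0, 0, 0], [30, 30, 30]))  # 0.0
```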
Sentence Length Sentence length affects different metrics to some extent. We count the average length of the k-th sentence extracted by different decoders to explore whether the decoder can perceive sentence length information.
Sentence Shuffling We attempt to explore the impact of sentence-position information on different structures. Therefore, we shuffle the order of sentences and observe the robustness of different architectures to out-of-order sentences.

Datasets
Instead of evaluating models solely on a single dataset, we care more about how our testing models perform on different types of data, which allows us to know whether current models suffer from the over-engineering problem.
CNN/DailyMail The CNN/DailyMail dataset (Nallapati et al., 2016) is commonly used for summarization. It consists of online news articles with paired human-generated summaries (3.75 sentences on average). For data preprocessing, we use the non-anonymized version as in See et al. (2017), which does not replace named entities.
NEWSROOM The NEWSROOM dataset was recently constructed by Grusky et al. (2018); it contains 1.3 million articles with summaries extracted from 38 major news publications across 20 years. We regard this diversity of sources as a diversity of summarization styles, and select the seven publications with the largest amounts of data as different domains for cross-domain evaluation. Due to the large scale of NEWSROOM, we also choose this dataset for the transfer experiments.

Training Settings
For the different learning schemas, we utilize a cross-entropy loss function and a reinforcement learning method close to Chen and Bansal (2018), with a small difference: we use the precision of ROUGE-1 as the reward for every extracted sentence, instead of the F1 value of ROUGE-L.
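The per-sentence reward described above can be sketched as follows (our own minimal implementation of ROUGE-1 precision: overlapping unigrams divided by the candidate's unigram count, with clipped counts):

```python
# ROUGE-1 precision of an extracted sentence against the reference summary,
# used as the per-sentence reward in the RL setting (illustrative sketch).
from collections import Counter

def rouge1_precision(candidate_tokens, reference_tokens):
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())  # clipped overlap
    return overlap / max(len(candidate_tokens), 1)

ref = "police arrested two suspects on friday".split()
cand = "two suspects were arrested".split()
print(rouge1_precision(cand, ref))  # 0.75 (3 of 4 candidate unigrams overlap)
```

Precision (rather than ROUGE-L F1) rewards each extracted sentence for containing only reference content, penalizing padding with irrelevant words.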
For context-independent word representations (GloVe, word2vec), we directly use them to initialize the words of each sentence; these can be fine-tuned during the training phase.
For BERT, we truncate each article to 512 tokens and feed it to a feature-based BERT (without gradient updates), concatenate the last four layers, and obtain a 128-dimensional token embedding after passing the result through an MLP.
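Shape-wise, this feature-based usage can be sketched as below. The random arrays stand in for frozen BERT activations; the hidden size of 768 (BERT-base) and the 256-unit MLP hidden layer are our assumptions for illustration.

```python
# Sketch of the feature-based BERT pipeline: concatenate the last four
# (frozen) layers per token, then project to 128 dims with a small MLP.
import numpy as np

seq_len, hidden = 512, 768                 # truncated article, BERT-base width
rng = np.random.default_rng(0)
last_four_layers = [rng.standard_normal((seq_len, hidden)) for _ in range(4)]

concat = np.concatenate(last_four_layers, axis=-1)        # (512, 3072)
W1 = rng.standard_normal((4 * hidden, 256)) * 0.01        # MLP weights (toy init)
W2 = rng.standard_normal((256, 128)) * 0.01
token_emb = np.maximum(concat @ W1, 0.0) @ W2             # ReLU MLP
print(token_emb.shape)  # (512, 128)
```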

Experimental Observations and Analysis
Next, we will show our findings and analyses in terms of architectures and external transferable knowledge.

Analysis of Decoders
We examine the differences between the Pointer and SeqLab decoders by probing their behaviors in different testing environments.
Domains From Tab. 3, we can observe that models with a pointer-based decoder tend to achieve better performance than those with a SeqLab-based decoder. Specifically, among these eight datasets, models with a pointer-based decoder outperform SeqLab on six domains and achieve comparable results on the other two. For example, in the "NYTimes", "WashingtonPost" and "TheGuardian" domains, Pointer surpasses SeqLab by at least 1.0 point (R-1). We attempt to explain this difference from the following three perspectives.
Table 3: Results of different architectures over different domains, where Enc. and Dec. represent the document encoder and decoder respectively. Lead extracts the first k sentences as the summary, usually a competitive lower bound. Oracle represents the ground truth extracted by the greedy algorithm (Nallapati et al., 2017), usually an upper bound. The number k in parentheses denotes that k sentences are extracted during testing, with lead-k as the lower bound for that domain. All experiments use word2vec to obtain word representations.
Repetition For domains that require extracting multiple sentences as the summary (the first two domains in Tab. 3), Pointer is aware of its previous predictions, which allows it to reduce the duplication of n-grams compared to SeqLab. As shown in Fig. 1(a), models with Pointer always obtain higher repetition scores than models with SeqLab when extracting six sentences, which indicates that Pointer does capture word-level information from previously selected sentences, with positive effects on subsequent decisions.
Positional Bias For domains that require only one sentence as the summary (the last six domains in Tab. 3), Pointer still performs better than SeqLab. As shown in Fig. 1(b), the performance gap between the two decoders grows as the positional bias of the dataset increases. For example, from Tab. 3 we can see that in domains with low positional bias, such as "FoxNews" (1.8) and "NYDailyNews" (1.9), SeqLab achieves performance comparable to Pointer.
Sentence Length We find that Pointer shows the ability to capture sentence-length information based on its previous predictions, while SeqLab does not. As Fig. 1(c) shows, models with Pointer tend to choose a longer sentence as the first sentence and greatly reduce sentence length in subsequent extractions. In comparison, models with SeqLab tend to extract sentences of similar length. This ability allows Pointer to adaptively change the length of the extracted sentences, thereby achieving better performance regardless of whether one or multiple sentences are required.

Analysis of Encoders
In this section, we analyze the two encoders, LSTM and Transformer, in different testing environments.
Domains From Tab. 3, we make the following observations: 1) Transformer can outperform LSTM on some datasets (e.g., "NYDailyNews") by a relatively large margin, while LSTM beats Transformer on some domains only by small margins. Besides, across the training phases of these eight domains, the hyper-parameters of Transformer are kept unchanged (4 layers with 512 dimensions for Pointer and 12 layers with 512 dimensions for SeqLab), while for LSTM many sets of hyper-parameters are used. The above phenomena suggest that LSTM suffers from the architecture overfitting problem more easily than Transformer. Additionally, in our experimental setting, Transformer is more efficient to train, since it is two to three times faster than LSTM.
2) When equipped with the SeqLab decoder, Transformer always obtains better performance than LSTM; we attribute this to the non-local bias (Wang et al., 2018) of the Transformer.
Shuffled Testing In this setting, we shuffle the order of sentences in the training set while the test set remains unchanged. We compare two models with different encoders (LSTM, Transformer); as shown in Fig. 2, performance drops for both models. However, Transformer shows a smaller decrease than LSTM, suggesting that Transformer is more robust.
Disentangling Testing The Transformer provides an effective way to disentangle positional and content information, which enables us to design a specific experiment investigating what role positional information plays. As shown in Tab. 4, we dynamically regulate the ratio between the sentence embedding and the positional embedding via two coefficients α and β.
Surprisingly, we find that even when only the positional embedding is used (the model is only told how many sentences the document contains), our model can achieve 40.08 R-1, which is comparable to many existing models. (In Vaswani et al. (2017), the input of the Transformer is √d times the word embedding plus the positional embedding, so we design the above proportions to carry out the disentangling test.) By contrast, once the positional information is removed, the performance drops by a large margin. This experiment shows that the success of such extractive summarization heavily relies on the ability to learn positional information on CNN/DailyMail, which has been the benchmark dataset for most current work.
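Concretely, the disentangling test amounts to rescaling the two additive terms of the standard Transformer input. A minimal sketch (dimension 64 and the helper name are our assumptions):

```python
# Disentangling sketch: the Transformer input is sqrt(d) * sentence embedding
# plus positional embedding; alpha and beta rescale the two terms so either
# content or position can be suppressed.
import numpy as np

def transformer_input(sent_emb, pos_emb, alpha, beta):
    d = sent_emb.shape[-1]
    return alpha * np.sqrt(d) * sent_emb + beta * pos_emb

n, d = 10, 64
sent, pos = np.ones((n, d)), np.ones((n, d)) * 2.0

# alpha = 0, beta = 1: the position-only model from the experiment above.
only_pos = transformer_input(sent, pos, alpha=0.0, beta=1.0)
print(np.allclose(only_pos, pos))  # True
```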

Analysis of Transferable Knowledge
Next, we show how different types of transferable knowledge influence our summarization models.
Unsupervised Pre-training Here, as a baseline, word2vec is used to obtain word representations trained solely on the training set of CNN/DailyMail.
As shown in Tab. 5, context-independent word representations do not contribute much to the current models. However, when the models are equipped with BERT, the performance of all types of architectures improves by a large margin. Specifically, the CNN-LSTM-Pointer model achieves a new state-of-the-art of 42.11 R-1, surpassing existing models dramatically.
Supervised Pre-training In most cases, our models benefit from the pre-trained parameters learned on the NEWSROOM dataset. However, the CNN-LSTM-Pointer model fails, and its performance decreases. We explain this phenomenon as follows: the transfer from NEWSROOM to CNN/DailyMail suffers from the domain shift problem, in which the distribution of the golden labels' positions changes. The observation from Fig. 2 shows that CNN-LSTM-Pointer is more sensitive to ordering changes, therefore obtaining lower performance.
Why does BERT work? We investigate two different ways of using BERT to figure out where BERT brings improvements for the extractive summarization system. In the first usage, we feed each individual sentence to BERT to obtain a sentence representation that contains no contextualized information; the model achieves a high R-1 score of 41.7. However, when we feed the entire article to BERT to obtain token representations and compute the sentence representation through mean pooling, model performance soars to a 42.3 R-1 score.
This experiment indicates that although BERT can provide powerful sentence embeddings, the key factor for extractive summarization is contextualized information, and this type of information carries the positional relationships between sentences, which, as shown above, are critical to the extractive summarization task.

Learning Schema and Complementarity
Besides supervised learning, reinforcement learning has recently been used in text summarization to introduce additional constraints. In this paper, we also explore whether these advanced techniques are complementary to each other.
We first choose the base models LSTM-Pointer and LSTM-Pointer + BERT, and then introduce reinforcement learning to further optimize them. As shown in Tab. 6, even though the performance of LSTM-Pointer has been largely improved by BERT, applying reinforcement learning improves it further, which indicates that there is indeed complementarity between architecture, transferable knowledge, and reinforcement learning.

Conclusion
In this paper, we seek to better understand how neural extractive summarization systems could benefit from different types of model architectures, transferable knowledge, and learning schemas. Our detailed observations can provide more hints for the follow-up researchers to design more powerful learning frameworks.