Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents

Neural-based summarization models suffer from the length limitation of text encoders. Long documents have to be truncated before they are fed to the model, which results in a substantial loss of summary-relevant content. To address this issue, we propose the sliding selector network with dynamic memory for extractive summarization of long-form documents, which employs a sliding window to extract summary sentences segment by segment. Moreover, we adopt a memory mechanism to preserve and update history information dynamically, allowing semantic flow across different windows. Experimental results on two large-scale datasets of scientific papers demonstrate that our model substantially outperforms previous state-of-the-art models. In addition, we perform qualitative and quantitative investigations of how our model works and where the performance gain comes from.


Introduction
Text summarization is an important natural language processing task that aims to distill salient content from a textual document. Existing summarization models can be roughly classified into two categories: abstractive and extractive. Abstractive summarization usually adopts natural language generation technology to produce a word-by-word summary. In general, these approaches are flexible but may yield disfluent summaries (Liu and Lapata, 2019a). By comparison, extractive approaches aim to select a subset of the sentences in the source document, thereby enjoying better fluency and efficiency (Cao et al., 2017).
Although many summarization approaches have demonstrated their success on relatively short documents, such as news articles, they usually fail to achieve the desired performance when directly applied to long-form documents, such as scientific papers. This inferior performance is partly due to the truncation operation, which inevitably leads to information loss, especially for extractive models, because part of the gold sentences becomes inaccessible. In addition, the accurate modeling of long texts remains a challenge (Frermann and Klementiev, 2019).

Figure 1: An example document about "medical tourism." Paragraphs 1 and 2 introduce the topic and its driving factors, whereas Paragraph 5 mainly discusses consumer and product marketing.
A practical solution for this problem is to use a sliding window to process documents separately. This approach is used in other NLP tasks, such as machine reading comprehension (Wang et al., 2019b). However, such a paradigm is not suitable for the summarization task because the concatenation of summaries that are independently extracted from local contexts is usually inconsistent with the gold summary of the entire document. Figure 1 shows an example that illustrates this problem. The core topic of the source document is "medical tourism," which is discussed in Paragraphs 1 and 2. However, the fifth paragraph is mainly about "consumer and product." As a consequence, the paragraph-by-paragraph extraction approach might produce a summary that is both repetitive and noisy. Under this circumstance, the supervised signals will have a negative effect on model behaviors because, without information conveyed from previous texts, it is confusing for the model to understand why Paragraph 5 should yield an empty extraction result.
In this paper, we propose a novel extractive summarization model for long-form documents. We split the input document into multiple windows and encode them with a sliding encoder sequentially. During this process, we introduce a memory to preserve salient information learned from previous windows, which is used to complete and enrich local texts. Intuitively, our model has the following advantages: 1) In each window, the text encoder processes a relatively short segment, thereby yielding more accurate representations. 2) The local text representations can capture beyond-window contextual information via the memory module. 3) The previous selection results are also parameterized in the memory block, allowing the collaboration among summary sentences.
To sum up, our contributions are threefold.
(1) We propose a novel extractive summarization model that can summarize documents of arbitrary length without truncation loss. It also employs a memory mechanism to address context fragmentation. To the best of our knowledge, we are the first to apply memory networks to the extractive text summarization task.
(2) The proposed framework (i.e., a sliding encoder combined with dynamic memory) provides a general solution for summarizing long documents and can be easily extended to other abstractive and extractive summarization models.
(3) Our model achieves the state-of-the-art results on two widely used datasets for long document summarization. Moreover, we conduct extensive analysis to understand how our model works and where the performance gain comes from.

Related Work
Neural Extractive Summarization.
Recently, pre-trained language models (e.g., BERT (Devlin et al., 2018)) have provided substantial performance gains for extractive summarization. Liu and Lapata (2019b) modified the standard BERT for document modelling. Xu et al. (2019) used a span-BERT to perform span-level summarization. Zhong et al. (2020) regarded document summarization as a semantic matching task and used a Siamese-BERT as the matching model. However, the valid input length of standard BERT is only 512 tokens, which means most of these models can hardly generalize to long-form documents effectively.

Long Document Summarization.
Recent years have seen a surge of interest in long document summarization, especially of scientific publications. Celikyilmaz et al. (2018) used a multi-agent framework to boost the encoder performance. Cohan et al. (2018) proposed a hierarchical network that incorporates discourse structures into the encoder and decoder. Xiao and Carenini (2019) proposed to model the local and global contexts jointly. Cui et al. (2020) proposed a hybrid model that employs a neural topic model (NTM) to infer latent topics as a kind of global information.
Despite their success, these approaches still face the input length limitation and the difficulty of encoding long texts accurately. In comparison, our model addresses these problems with a novel segment-wise extraction scheme and can summarize arbitrarily long documents without any content truncation.

Memory Networks.
Memory network (Weston et al., 2015) is a general framework that employs a memory bank to model long-term information. Due to its flexible architecture and superior adaptability, it has been applied to various NLP scenarios, such as text classification (Zeng et al., 2018), question answering (Kumar et al., 2016; Xiong et al., 2016), and sentiment analysis (Tang et al., 2016). In this study, we leverage a memory module to capture beyond-window context when performing segment-level summarization. To the best of our knowledge, memory networks have never been applied to the extractive summarization task.

Model
This section describes our model, namely, the Sliding Selector Network with Dynamic Memory (SSN-DM); Figure 2 gives its overall architecture. Formally, given a document D of arbitrary length, we first split D into multiple segments according to a pre-defined window length. Then, we use a BERT encoder to sequentially encode each segment and select salient sentences. During this process, a memory module is applied to achieve information flow across different windows. Finally, the extracted sentences are aggregated to generate the final summary. We elucidate each module in the following subsections.
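To make the overall procedure concrete, the sketch below outlines the segment-wise extraction loop in Python. It is an illustrative reconstruction, not the authors' released code: `split_into_windows`, `encoder`, `memory_layer`, `predictor`, and `memory_updater` are hypothetical stand-ins for the components detailed in the following subsections.

```python
def split_into_windows(sent_token_counts, window_len=512):
    """Greedily pack whole sentences into windows of at most window_len tokens,
    so that no sentence is cut across a window boundary."""
    windows, current, size = [], [], 0
    for i, n_tokens in enumerate(sent_token_counts):
        if current and size + n_tokens > window_len:
            windows.append(current)
            current, size = [], 0
        current.append(i)
        size += n_tokens
    if current:
        windows.append(current)
    return windows                                   # lists of sentence indices

def summarize(sentences, sent_token_counts, encoder, memory_layer,
              predictor, memory_updater, init_memory, top_k, window_len=512):
    """Segment-wise extraction: score sentences window by window while a
    memory block carries history information across windows."""
    memory, scores = init_memory, []                 # M^0: fixed initial values
    for k, idx in enumerate(split_into_windows(sent_token_counts, window_len)):
        H = encoder([sentences[i] for i in idx], k)      # sentence reps H^k
        H_tilde, M_tilde = memory_layer(H, memory)       # graph-based interaction
        y = predictor(H, H_tilde)                        # per-sentence probabilities
        memory = memory_updater(memory, M_tilde, H, y)   # dynamic memory update
        scores.extend(float(p) for p in y)
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(order[:top_k])                     # summary indices in document order
```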

Sliding Encoder
Let $seg_k = \{s^k_1, s^k_2, \ldots, s^k_n\}$ be the $k$th window consisting of $n$ sentences. We encode the window text with a pre-trained BERT, which has been proven effective on the extractive summarization task (Liu and Lapata, 2019b; Xu et al., 2019; Cui et al., 2020). Following previous studies, we modify the standard BERT by inserting $[CLS]$ and $[SEP]$ tokens at the beginning and end of each sentence, respectively:

$$O_B = \mathrm{BERT}\big(w^k_{1,1}, w^k_{1,2}, \ldots\big), \quad (1)$$

where $w^k_{i,j}$ denotes the $j$th word of the $i$th sentence, and $O_B = \{h^k_{1,CLS}, h^k_{1,2}, \ldots, h^k_{n,SEP}\}$ denotes the representations of each token learned by BERT. We regard the hidden states of the $[CLS]$ tokens, $H^k = \{h^k_{1,CLS}, h^k_{2,CLS}, \ldots, h^k_{n,CLS}\}$, as the corresponding sentence representations.
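A minimal sketch of this encoding step, assuming the HuggingFace `transformers` API for BERT and omitting details (such as interval segment embeddings) that a BertSum-style setup may also use:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_window(sentences):
    """Return one [CLS] vector per sentence in the window."""
    # Insert [CLS] before and [SEP] after every sentence.
    text = "".join(f"[CLS] {s} [SEP] " for s in sentences)
    inputs = tokenizer(text, return_tensors="pt",
                       add_special_tokens=False, truncation=True, max_length=512)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state.squeeze(0)   # (tokens, 768)
    cls_positions = (inputs["input_ids"].squeeze(0)
                     == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
    return hidden[cls_positions]                               # (n_sentences, 768)
```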
On top of the BERT encoder, we add an additional layer to incorporate two types of structural information. The first is the position information of the current window. In our segment-wise encoding, the position embeddings of BERT are recalculated in each window, so the exact position of each token in the entire document is lost. This positional bias may lead to inferior performance (Zhong et al., 2019; Dai et al., 2019). To address this problem, we assign a window-level position encoding to each window as a complementary feature, indicating its relative position in the document.
In addition, we further introduce a group of section embeddings (e.g., introduction, conclusion) to capture discourse information, which has been proved an important feature for scientific paper summarization (Cohan et al., 2018). Combining these two parts, the structural encoding layer can be denoted as:

$$h^k_i = W_1 h^k_{i,CLS} + W_2 e^k_w + W_3 e_s, \quad (2)$$

where $e^k_w$ indicates the $k$th window-level position embedding and $e_s$ the section embedding. Both are randomly initialized and learned as part of the model. Throughout the paper, $W_*$ represents a trainable parameter matrix.

Figure 2: The framework of our model. There are three major components: (1) the sliding encoder generates a representation of each sentence in the current window; (2) the memory layer infuses history information into the sentence representations via graph neural networks; (3) the prediction layer aggregates the learned features to compute the binary sentence labels.
Noticeably, the section features might not be generally available for long texts of other genres. Therefore, in our experiments, we consider $e_s$ an optional setting and conduct quantitative investigations in Section 5 to probe its effect on model performance.
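The following sketch shows one plausible reading of the structural encoding layer in PyTorch; the additive combination through trainable matrices $W_1$-$W_3$ follows Eq. 2 as reconstructed above, and the embedding table sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StructuralEncoding(nn.Module):
    """Add window-level position and (optional) section information to H^k."""
    def __init__(self, d=768, max_windows=64, n_sections=8):
        super().__init__()
        self.window_emb = nn.Embedding(max_windows, d)   # e_w, randomly initialized
        self.section_emb = nn.Embedding(n_sections, d)   # e_s, optional feature
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.W3 = nn.Linear(d, d, bias=False)

    def forward(self, H, window_idx, section_ids=None):
        # H: (n, d) [CLS] vectors of the k-th window; window_idx: scalar k.
        n = H.size(0)
        e_w = self.window_emb(torch.full((n,), window_idx, dtype=torch.long))
        out = self.W1(H) + self.W2(e_w)
        if section_ids is not None:      # sections may be absent for other genres
            out = out + self.W3(self.section_emb(section_ids))
        return out
```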

Graph-based Memory Interaction
After encoding the window text, we infuse the history information of previous texts into the learned representations $H^k$ via a memory module. Let $M^k \in \mathbb{R}^{l \times d_m}$ be the memory block in the $k$th window that preserves salient information of the previous $k-1$ windows, where $l$ represents the number of memory slots and $d_m$ represents the dimension of each memory vector. $M^0$ is initialized with fixed values in the first window and then updated dynamically during learning. The details of this part are explained in Section 3.4.
We use a graph neural network to model the interaction between the memory module and the current window. Concretely, we first construct a bipartite graph that consists of $l$ memory nodes and $n$ sentence nodes, whose embeddings are initialized with $M^k$ and $H^k$, respectively. Then, we use a graph attention network (GAT; Velickovic et al., 2018) to encode this graph. Given a sentence node $h^k_i$, we update its representation by aggregating its neighboring nodes:

$$\tilde{h}^k_i = \sigma\Big(\sum_{j=1}^{l} \alpha_{i,j}\, W\, \mathrm{SG}(m^k_j)\Big), \quad (3)$$

where $\alpha_{i,j}$ denotes the attention weight from node $h^k_i$ to node $m^k_j$, computed with the standard GAT attention function. Multi-head attention is applied to stabilize the calculation process, and $\mathrm{SG}(\cdot)$ stands for the stop-gradient operation.
We refer to $\tilde{H}^k$ and $\tilde{M}^k$ as the sentence representations and memory vectors after graph propagation, respectively. During the graph interaction, the sentence representations are completed and enriched by history information, and vice versa.
Empirical observations in prior research (Tang et al., 2016; Zeng et al., 2018) have shown that stacking multiple memory layers can bring further performance gains. Similarly, in our model, a multi-hop setting can be achieved by increasing the graph iteration number, i.e., repeating the GAT calculation process (Eq. 3).
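A minimal single-head PyTorch sketch of this bipartite memory-sentence interaction follows. The attention is the standard GAT form; the authors' multi-head implementation may differ in detail, and the stop-gradient on the memory side is realized here with `detach()`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryGAT(nn.Module):
    """Bipartite attention between l memory slots and n sentence nodes."""
    def __init__(self, d=768):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)       # shared node projection
        self.a = nn.Linear(2 * d, 1, bias=False)   # attention scoring vector

    def attend(self, queries, keys):
        """Update each query node by attending over the opposite node set."""
        q, k = self.W(queries), self.W(keys)                    # (nq,d), (nk,d)
        pairs = torch.cat([q.unsqueeze(1).expand(-1, k.size(0), -1),
                           k.unsqueeze(0).expand(q.size(0), -1, -1)], dim=-1)
        alpha = F.softmax(F.leaky_relu(self.a(pairs).squeeze(-1)), dim=-1)
        return F.elu(alpha @ k)                                 # (nq, d)

    def forward(self, H, M, hops=2):
        for _ in range(hops):                     # multi-hop = repeated iterations
            H_new = self.attend(H, M.detach())    # sentences read memory; SG via detach
            M = self.attend(M, H)                 # memory reads current window
            H = H_new
        return H, M                               # H~^k, M~^k
```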

Prediction Layer
We have obtained the sentence representations $H^k$ derived from the window text and their extended version $\tilde{H}^k$ enriched by memory information. Given the $i$th sentence, we send $h^k_i$ and $\tilde{h}^k_i$ to an MLP classifier to compute its summary label:
$$\tilde{y}_i = \sigma\big(f_o([h^k_i;\ \tilde{h}^k_i;\ h^k_i \circ \tilde{h}^k_i])\big), \quad (4)$$

where $\tilde{y}_i$ represents the predicted probability of the $i$th sentence and $\circ$ represents the point-wise operation. $f_o$ is a feed-forward network with three hidden layers. We construct interaction features between $\tilde{h}^k_i$ and $h^k_i$ to capture the importance of the $i$th sentence in both the current segment and the history context.
The training objective of the model is to minimize the binary cross-entropy loss given the predictions and the ground-truth sentence labels $y_i$:

$$\mathcal{L} = -\sum_{i} \big(y_i \log \tilde{y}_i + (1 - y_i)\log(1 - \tilde{y}_i)\big).$$

After processing the entire document, we rank all the sentences and select the top-$k$ as the final summary, where $k$ is a hyperparameter set according to the average length of the reference summaries. It is worth noting that the memory module also acts as an intermediary that makes the sentence scores of different windows comparable.
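A sketch of the prediction layer and training objective; the exact interaction feature set fed to $f_o$ is an assumption consistent with Eq. 4 as reconstructed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """MLP classifier over window-local and memory-enriched sentence vectors."""
    def __init__(self, d=768):
        super().__init__()
        self.f_o = nn.Sequential(            # feed-forward net with three hidden layers
            nn.Linear(3 * d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 1))

    def forward(self, H, H_tilde):
        # Interaction feature: point-wise product of local and enriched views.
        feats = torch.cat([H, H_tilde, H * H_tilde], dim=-1)
        return torch.sigmoid(self.f_o(feats)).squeeze(-1)   # probabilities y~_i

def training_loss(y_pred, y_gold):
    """Binary cross-entropy against the oracle sentence labels."""
    return F.binary_cross_entropy(y_pred, y_gold)

def select_top_k(all_scores, k):
    """Rank all sentences of the document; return top-k indices in document order."""
    order = sorted(range(len(all_scores)), key=all_scores.__getitem__, reverse=True)
    return sorted(order[:k])
```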

Dynamic Memory Updating
Now we explain the learning process of the memory module. Figure 3 presents the information flow of our model. In each window, after the prediction layer, we update the memory values with two inputs.
First, recall that in the GAT calculation, the updated memory vectors $\tilde{M}^k$ have also encoded the contextual information of the current window during the interaction with $H^k$. Therefore, we combine $\tilde{M}^k$ and $M^k$ with a gating mechanism (Chung et al., 2014):

$$\sigma^k_i = \mathrm{sigmoid}\big(W_g[\tilde{m}^k_i;\ m^k_i]\big), \qquad u^k_i = \sigma^k_i \circ \tilde{m}^k_i + (1 - \sigma^k_i) \circ m^k_i,$$

where $u^k_i$ is the linear interpolation between the history memory $m^k_i$ and the newly computed $\tilde{m}^k_i$, and $\sigma^k_i \in \mathbb{R}^{d_m}$ is a gate vector that modulates the information flow.
The second part refers to the extraction result of the current window. We first aggregate the sentence representations with their predicted probabilities (Eq. 4) to parameterize the selected sentences:

$$r^k_{sum} = \sum_{i=1}^{n} \tilde{y}_i \cdot h^k_i.$$

Here, $r^k_{sum}$ can be considered a sentence-level coverage vector (See et al., 2017) that records what content has been extracted from the current window. This ensures that the following selection is informed by previous decisions.
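A minimal PyTorch sketch of the dynamic update, combining the gated interpolation and the coverage-style aggregation; the concatenation fed to the final feed-forward layer (described next) is an assumption:

```python
import torch
import torch.nn as nn

class MemoryUpdater(nn.Module):
    """Produce M^{k+1} from the gated memory and the current extraction record."""
    def __init__(self, d=768):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)
        self.ffn = nn.Linear(2 * d, d)   # single feed-forward generation layer

    def forward(self, M, M_tilde, H, y):
        # Gate vector modulating old memory vs. its graph-updated version.
        sigma = torch.sigmoid(self.gate(torch.cat([M, M_tilde], dim=-1)))
        U = sigma * M_tilde + (1.0 - sigma) * M          # u^k_i, linear interpolation
        # Coverage-style record: probability-weighted sum of sentence vectors.
        r_sum = (y.unsqueeze(-1) * H).sum(dim=0)         # r^k_sum, shape (d,)
        r = r_sum.unsqueeze(0).expand(U.size(0), -1)     # broadcast to every slot
        return self.ffn(torch.cat([U, r], dim=-1))       # M^{k+1}
```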
Then, we use a single feed-forward layer to generate the new memory $M^{k+1} = \{m^{k+1}_1, \ldots, m^{k+1}_l\}$ for the next window:

$$m^{k+1}_i = \mathrm{FFNN}\big([u^k_i;\ r^k_{sum}]\big).$$

Experiments

Datasets

The arXiv and PubMed datasets (Cohan et al., 2018) are two recently constructed datasets collected from arXiv.com and PubMed.com, respectively. Both consist of scientific papers, which are much longer than common news articles. We preprocess and split the datasets in accordance with Cohan et al. (2018) and use the oracle labels created by Xiao and Carenini (2019). Their statistics are summarized in Table 1. Figure 4 shows the position distributions of the ground-truth sentences in the two datasets, which illustrates the importance of long-text processing ability for extractive summarization models. For example, the maximum length of standard BERT is 512, which means that a large proportion (colored in grey) of ground-truth sentences would be inaccessible to existing state-of-the-art BERT-based summarization models.

Models for Comparison
We compare our model with the following state-of-the-art summarization approaches.

Pointer Generator Network (PGN; See et al., 2017) extends the standard seq2seq framework with attention, coverage, and copy mechanisms.

Discourse-Aware (Cohan et al., 2018) is an abstractive model particularly designed for summarizing long-form documents with discourse structure. It employs a hierarchical encoder and explicitly introduces the section information of scientific papers.

Seq2seq-local&global (Xiao and Carenini, 2019) is an extractive model for long document summarization that jointly encodes local and global contexts.

MatchSum (Zhong et al., 2020) is a state-of-the-art BERT-based summarization model. It performs summary-level extraction based on the matching scores between a candidate summary and the source document.

Topic-GraphSum (Cui et al., 2020) introduces a joint neural topic model to explore latent topics as a kind of global information to help summarize long documents. Since Cui et al. (2020) used different data preprocessing, we repeat the experiments using the model released by the authors and preprocess the data in accordance with previous studies (Cohan et al., 2018; Xiao and Carenini, 2019) to make the results comparable.

Implementation Details
For the sliding encoder, we use the "bert-base-uncased" version with a hidden size of 768 and fine-tune it in all experiments. The maximum window length is set to 512, and we segment the documents with the sentence as the smallest unit to alleviate semantic fragility. For the memory module, we set the number of slots to 50 and the dimension of the memory vector to 768, the same as the hidden size of the encoder. The iteration number of GAT is set to 2. We use Rouge (Lin, 2004) as the evaluation metric and select the hyperparameters by grid search based on the Rouge-2 performance on the validation sets. Further analysis of the impact of hyperparameters is given in Section 5.2. We train our model on 2 NVIDIA V100 cards with a small batch size of 16. During training, we use Adam (Kingma and Ba, 2015) to optimize the parameters with a learning rate of 5e-4. An early-stop strategy (Caruana et al., 2000) is applied when the validation loss no longer decreases. The number of extracted sentences is set to 7 for the arXiv dataset and 6 for the PubMed dataset according to their average summary lengths. We report the average results over 5 runs.

Main Results

Table 2 presents the results of different models on the two datasets. The first section includes traditional approaches and the Oracle; the second and third sections include abstractive and extractive models, respectively; and the last section reports ours, where "with discourse" denotes that we leverage the section information as an additional feature (Eq. 2). Several observations deserve to be mentioned.
• Encoding long texts for abstractive summarization is a challenge. The vanilla seq2seq-with-attention model and the pointer network perform rather poorly on the two datasets. A possible reason is that most encoders experience difficulties in modeling long-range contextual dependencies when encoding long texts (Vaswani et al., 2017; Frermann and Klementiev, 2019), leading to inferior performance during the generation (decoding) process.
• Global information modeling is important for summarizing long documents. We also observe that Seq2seq-local&global and Topic-GraphSum show promising results on the two datasets. Both of them explicitly model global information (e.g., latent topics). This observation provides a useful guideline for designing summarization models for long documents.
• Our framework is effective. Our two models substantially outperform all the baselines on both datasets. Figure 5 shows the proportion of sentences selected from each window, where we can see that our model can extract content from any position in the entire document. By contrast, BERT-Sum and Topic-GraphSum, two strong BERT-based baselines, can only select sentences from the first 512 or 768 words because of their truncation settings. This superiority endows our model with a higher upper bound when summarizing long documents.
• Discourse structure is automatically captured. The last section of Table 2 shows that the incorporation of discourse information brings no substantial performance gain for our model, although previous studies (Cohan et al., 2018; Xiao and Carenini, 2019) have shown it to be an effective feature on the arXiv and PubMed datasets. A possible reason is that our window-level position encoding has already learned such discourse information because it indicates a window's relative position in the document, and scientific papers are generally organized in a specific and relatively fixed structure. This observation implies that the performance of our model does not rely on dataset-specific prior information. As a result, our model could be easily generalized to long texts of other genres.

Results on Varying Hyperparameters
We conduct experiments to probe the impact of several important hyperparameters on model performance, including the window length, the number of memory slots, and the number of memory hops (i.e., the iteration number of GAT).

Table 3: R-1 results on varying iteration numbers t of GAT.
Impact of Window Length. Intuitively, a shorter window means more accurate text encoding. However, it also results in more segments, which is demanding for the memory module. Therefore, it is important to find a balanced window length. Figure 6 (left) shows that the overall performance improves as the window length increases from a small value (128), because windows that are too short suffer from semantic fragility. When the window length is set between 368 and 512, the performance is stable, implying that the step number and text length are both in a suitable range. For the sake of efficiency, we set the window length to 512 in our experiments.
Impact of Slot Number. Figure 6 (right) presents the Rouge-1 results on varying slot numbers. As can be seen, the curves on the two datasets are not monotonic but show a similar trend. In particular, within the range where l is relatively small, more slots produce better performance because the memory capacity grows. However, this increasing trend reaches saturation once the slot number exceeds a threshold, which is 60 in our experiments.
Impact of Iteration Number. Recall that in the memory layer, we employ a GAT to model the interaction between the memory and the window texts. To select the best iteration number (hop number) t, we compare the performance of different t on the validation sets of the two datasets. Table 3 shows that when t goes from 0 to 2, the performance is slightly boosted. However, this increasing trend is not monotonic, and a larger t does not bring further substantial gain. To balance time cost and performance, we select t = 2 for both datasets.

Effect of Dynamic Memory
In this subsection, we perform quantitative and qualitative investigations to understand the effect of the memory module. To this end, we construct an ablated version by removing the memory module and then observe the resulting differences.

Figure 7: Comparison between the output of our full model (top) and the ablated model (bottom) on an example document about social isolation and health. Underlined text denotes model-selected sentences and bold text denotes the ground-truth sentences. The ablated model selects repetitive content in the 4th window and noisy content in the 5th window.
Case Study. Figure 7 provides a case study that compares the selection results of the ablated model and our full model. In the 4th window, the ablated model selects a repetitive sentence, whereas our full model avoids this error. This positive effect comes from the extraction results preserved in the memory module, which serve as a reminder of what information has already been selected. We also note that the ablated model selects wrong sentences in the 5th window because it mistakes "self-esteem" for the salient information. By contrast, our model, being aware of previous texts, correctly captures "social isolation" as the core topic and filters out the noisy sentences.
Quantitative analysis. In Figure 8, we compare the Rouge scores of our full model and the ablated one. As can be seen, the performance declines dramatically on both datasets when the memory module is removed, which demonstrates that the dynamic memory plays a necessary role in our model.
We further analyze the effect of the memory module at a finer granularity. Intuitively, the memory module should enhance our model in the following aspects: (1) Reducing Redundancy. Our memory module explicitly records the previous predictions and functions like a sentence-level coverage mechanism, which is expected to reduce repetition.
(2) Avoiding Noise. As discussed in Section 1, segment-wise extraction tends to mistake locally important content for summary sentences due to the lack of global context. Our memory module allows cross-window information flow and therefore should alleviate this problem.
(3) Perceiving Sentence Length. The awareness of previous selections may also allow the model to capture sentence-length information (Zhong et al., 2019). Ideally, our model can adaptively adjust the length of the extracted sentences, thereby achieving better performance.
To verify our hypothesis, we design three measurements to quantitatively evaluate model performance on the above aspects. Similar to Zhong et al. (2019), we use $S_{Rep} = 1 - \frac{\mathrm{CountUniq}(ngram)}{\mathrm{Count}(ngram)}$ to measure the degree of repetition, where $\mathrm{Count}(ngram)$ and $\mathrm{CountUniq}(ngram)$ are the total and unique numbers of n-grams in the selected sentences. For the noise measurement, we have $S_{Noise} = \frac{\mathrm{Count}(NoisySent)}{\mathrm{Count}(ExtractSent)}$, where $NoisySent$ are the sentences whose R-1 is smaller than a threshold. For the length deviation, we have $S_{Len} = \frac{|sum| - |ref|}{|ref|}$, where $|sum|$ and $|ref|$ denote the lengths of the model-produced summary and the reference summary, respectively. Table 4 presents the comparison results. The model achieves better performance on all three indicators when combined with the memory mechanism, consistent with the aforementioned analysis.

Table 4: Comparison between our full model and the ablated version. $S_{Rep}$, $S_{Noise}$, and $S_{Len}$ are the metrics of repetition, noise, and length deviation. Lower is better.
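The three diagnostics are straightforward to compute; the sketch below assumes a trigram order for $S_{Rep}$ and an illustrative R-1 threshold for $S_{Noise}$, since neither is specified above:

```python
def s_rep(selected_sentences, n=3):
    """Repetition: 1 - (unique n-grams / total n-grams) over the summary."""
    tokens = " ".join(selected_sentences).split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def s_noise(rouge1_per_sentence, threshold=0.1):
    """Noise: fraction of extracted sentences whose R-1 falls below a threshold."""
    noisy = sum(1 for r in rouge1_per_sentence if r < threshold)
    return noisy / max(len(rouge1_per_sentence), 1)

def s_len(summary_len, ref_len):
    """Length deviation between produced and reference summaries."""
    return (summary_len - ref_len) / ref_len
```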

Conclusion and Future Work
In this study, we propose a novel extractive summarization model that can summarize long-form documents without content loss. We conduct extensive experiments on two well-studied datasets of scientific papers. Experimental results demonstrate that our model outperforms previous state-of-the-art models. In the future, we will extend our framework (i.e., a sliding encoder combined with long-range memory modeling) to abstractive summarization models.