Self-Supervised Learning for Contextualized Extractive Summarization

Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn to capture the document-level context in a self-supervised fashion. Experiments on the widely-used CNN/DM dataset validate the effectiveness of the proposed auxiliary tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform previous state-of-the-art models that are carefully designed.


Introduction
Extractive summarization aims to shorten the original article while retaining the key information by selecting sentences from the original article. This paradigm has been proven effective by many previous systems (Carbonell and Goldstein, 1998; Mihalcea and Tarau, 2004; McDonald, 2007; Cao et al., 2015). In order to decide whether to choose a particular sentence, the system should have a global view of the document context, e.g., the subject and structure of the document. However, previous works (Nallapati et al., 2017; Al-Sabahi et al., 2018) usually directly build an end-to-end training system that learns to choose sentences without explicitly modeling the document context, relying on the system to learn the document-level context automatically.
We argue that it is hard for these end-to-end systems to learn to leverage the document context from scratch due to the challenges of this task, and that a well pre-trained embedding model that incorporates document context should help on this task.[1] In recent years, extensive work (Pennington et al., 2014; Nie and Bansal, 2017; Lin et al., 2017; Peters et al., 2018; Devlin et al., 2018; Subramanian et al., 2018; Cer et al., 2018; Logeswaran and Lee, 2018; Pagliardini et al., 2018) has been done on learning word or sentence representations, but most of it uses only a sentence or a few sentences when learning the representation, so the document context can hardly be included in the representation. Hence, we introduce new pre-training methods that take the whole document into consideration to learn contextualized sentence representations with self-supervision.

[1] Code can be found in this repository: https://github.com/hongwang600/Summarization

Figure 1: An example of the Mask pre-training task. Masked paragraph: "Last week, I went to attend a one-day meeting. I booked the flight in advance. [masked sentence] The earliest next flight would be a few days later. I had to use the online discussion instead." Candidate sentences: "But the flight was cancelled due to the weather."; "But I lost my passport."; "The meeting was cancelled."; "The weather is good today." A sentence is masked in the original paragraph, and the model is required to predict the missing sentence from the candidate sentences.
Self-supervised learning (Raina et al., 2007; Doersch et al., 2015; Agrawal et al., 2015; Wang and Gupta, 2015) is a recently emerged paradigm that aims to learn from the intrinsic structure of the raw data. The general framework is to construct training signals directly from the structured raw data and use them to train the model. The structural information learned through this process can then be easily transferred to benefit other tasks. Thus, self-supervised learning has been widely applied to structured data like text (Okanohara and Tsujii, 2007; Collobert and Weston, 2008; Peters et al., 2018; Devlin et al., 2018; Wu et al., 2019) and images (Doersch et al., 2015; Agrawal et al., 2015; Wang and Gupta, 2015; Lee et al., 2017).
Since documents are well organized and structured, it is intuitive to employ the power of self-supervised learning to learn the intrinsic structure of the document and model the document-level context for the summarization task.
In this paper, we propose three self-supervised tasks (Mask, Replace, and Switch) through which the model is required to learn the document-level structure and context. The knowledge about the document learned during the pre-training process is then transferred to benefit the summarization task. Specifically, the Mask task randomly masks some sentences and predicts the missing sentence from a candidate pool; the Replace task randomly replaces some sentences with sentences from other documents and predicts whether a sentence has been replaced; the Switch task switches some sentences within the same document and predicts whether a sentence has been switched. An illustrative example is shown in Figure 1, where the model is required to take the document context into account in order to predict the missing sentence. To verify the effectiveness of the proposed methods, we conduct experiments on the CNN/DM dataset (Hermann et al., 2015; Nallapati et al., 2016) based on a hierarchical model. We demonstrate that all three pre-training tasks perform better and converge faster than the basic model, and that one of them even outperforms the state-of-the-art extractive method NEUSUM.
The contributions of this work include:
• To the best of our knowledge, we are the first to consider using the whole document to learn contextualized sentence representations with self-supervision and without any human annotations.
• We introduce and experiment with various self-supervised approaches for extractive summarization, one of which achieves new state-of-the-art results with a basic hierarchical model.
• Benefiting from the self-supervised pre-training, the summarization model is more sample-efficient and converges much faster than those trained from scratch.

Basic Model
As shown in Figure 2, our basic model for extractive summarization is mainly composed of two parts: a sentence encoder and a document-level self-attention module. The sentence encoder is a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), which encodes each individual sentence X_i (a sequence of words); its output vector at the last step is viewed as the sentence representation S_i. Given the representations of all the sentences, a self-attention module (Vaswani et al., 2017) is employed to incorporate document-level context and learn the contextualized sentence representation D_i for each sentence. Finally, a linear layer is applied to predict whether to choose the sentence to form the summary.

Figure 2: The structure of the basic model. We use an LSTM and a self-attention module to encode the sentence and the document respectively. X_i represents the word embeddings for sentence i; S_i and D_i represent the independent and document-contextualized sentence embeddings for sentence i respectively.
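This architecture can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class and variable names are ours, and the dimensions follow the experiment section (100-dimensional word embeddings, hidden size 200, a 5-layer Transformer encoder with 4 heads).

```python
import torch
import torch.nn as nn

class BasicExtractiveModel(nn.Module):
    """Sketch of the basic model: BiLSTM sentence encoder + document-level
    self-attention + linear scoring layer (our naming, not the authors')."""

    def __init__(self, emb_dim=100, hidden=200, layers=5, heads=4):
        super().__init__()
        # Sentence encoder: one-layer bidirectional LSTM.
        self.sent_enc = nn.LSTM(emb_dim, hidden, num_layers=1,
                                bidirectional=True, batch_first=True)
        # Document-level self-attention module: Transformer encoder over
        # the sequence of sentence representations S_i.
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=heads,
                                           batch_first=True)
        self.doc_att = nn.TransformerEncoder(layer, num_layers=layers)
        # Linear layer predicting whether to extract each sentence.
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, doc):
        # doc: (n_sents, n_words, emb_dim) word embeddings of one document
        out, _ = self.sent_enc(doc)
        S = out[:, -1, :]                   # S_i: last-step output per sentence
        D = self.doc_att(S.unsqueeze(0))    # D_i: document-contextualized embedding
        return self.classifier(D).squeeze(-1).squeeze(0)  # one score per sentence
```

A document with 7 sentences of 12 words each, for instance, enters as a (7, 12, 100) tensor and yields 7 extraction scores.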

Self-supervised Pre-training Methods
In this section, we describe three self-supervised pre-training approaches. By solving each pre-training task, the model is expected to learn a document-level contextualized sentence embedding model from the raw documents, which is then used to solve the downstream summarization task. Note that we only pre-train the sentence encoder and the document-level self-attention module of the basic model for extractive summarization.
Mask Similar to the task of predicting a missing word, the Mask task is to predict a masked sentence from a candidate pool. Specifically, we first mask some sentences within a document with probability P_m and put these masked sentences (x_1^m, x_2^m, ..., x_t^m) into a candidate pool T^m. The model is required to predict the correct sentence from the pool for each masked position i. We replace the sentence in the masked position i with a special token ⟨unk⟩ and compute its document-contextualized sentence embedding D_i. We use the same sentence encoder as in the basic model to obtain the sentence embedding S_j^m for each candidate sentence j in T^m, and score candidate j by the cosine similarity:

score(i, j) = cos(D_i, S_j^m).

To train the model, we adopt a ranking loss that maximizes the margin between the gold sentence and the other sentences:

L_mask = max(0, γ − cos(D_i, S_j^m) + cos(D_i, S_k^m)),

where γ is a tuned hyper-parameter, j points to the gold sentence in T^m for the masked position i, and k points to any non-target sentence in T^m.
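The scoring function and ranking loss above can be sketched in plain NumPy as follows; the function names and the default margin value are our own choices for illustration.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mask_loss(D_i, S_gold, S_neg, gamma=0.5):
    # Hinge ranking loss: max(0, gamma - cos(D_i, S_gold) + cos(D_i, S_neg)).
    # It pushes the gold candidate's score above the non-target candidate's
    # score by at least the margin gamma.
    return max(0.0, gamma - cos(D_i, S_gold) + cos(D_i, S_neg))
```

For example, when D_i is aligned with the gold candidate and orthogonal to the negative one, the margin is satisfied and the loss is zero.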
Replace The Replace task is to randomly replace some sentences (with probability P_r) in the document with sentences from other documents, and then predict whether each sentence has been replaced. Specifically, we use sentences from 10,000 randomly chosen documents to form a candidate pool T^r. Each sentence in the document is replaced with probability P_r by a random sentence in T^r. Let C^r be the set of positions where sentences are replaced. We use a linear layer f_r to predict whether a sentence is replaced based on its document-contextualized embedding D_i, and minimize the MSE loss over the n sentences of the document:

L_replace = (1/n) Σ_{i=1}^{n} (f_r(D_i) − y_i^r)^2,

where y_i^r = 1 if i ∈ C^r (i.e., the sentence in position i has been replaced), and y_i^r = 0 otherwise.

Switch The Switch task is similar to the Replace task. Instead of filling the selected positions with sentences from outside the document, this task uses sentences within the same document by switching the selected sentences, i.e., each selected sentence is moved to another position within the same document. Let C^s be the set of positions where sentences are switched. Similarly, we use a linear layer f_s to predict whether a sentence is switched and minimize the MSE loss:

L_switch = (1/n) Σ_{i=1}^{n} (f_s(D_i) − y_i^s)^2,

where y_i^s = 1 if i ∈ C^s, and y_i^s = 0 otherwise.
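The Switch corruption can be sketched as follows. This reflects our reading of the procedure; the function name and the detail that every selected sentence must actually move (a derangement of the selected positions) are assumptions for illustration.

```python
import random

def switch_sentences(sents, p_s=0.25, seed=0):
    """Corrupt a document for the Switch task (a sketch).

    Positions are selected independently with probability p_s; the sentences
    at the selected positions are then permuted among themselves so that each
    selected sentence lands in another position. Returns the corrupted
    document and the 0/1 labels y_i (1 = the sentence at position i moved).
    """
    rng = random.Random(seed)
    pos = [i for i in range(len(sents)) if rng.random() < p_s]
    corrupted = list(sents)
    if len(pos) > 1:
        shuffled = pos[:]
        while True:  # resample until no selected sentence stays in place
            rng.shuffle(shuffled)
            if all(a != b for a, b in zip(pos, shuffled)):
                break
        for a, b in zip(pos, shuffled):
            corrupted[a] = sents[b]
    labels = [1 if corrupted[i] != sents[i] else 0 for i in range(len(sents))]
    return corrupted, labels
```

The corrupted document is a permutation of the original sentences, and the labels mark exactly the moved positions, which is what f_s is trained to recover.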

Experiment
To show the effectiveness of the pre-training methods (Mask, Replace, and Switch), we conduct experiments on the commonly used CNN/DM dataset (Hermann et al., 2015; Nallapati et al., 2016), and compare them with a popular baseline, Lead3 (See et al., 2017), which selects the first three sentences as the summary, and the state-of-the-art extractive summarization method NEUSUM, which jointly scores and selects sentences using a pointer network.

On CNN/DM Dataset
Model and training details We use the rule-based system from prior work to label sentences in a document, e.g., sentences to be extracted are labeled as 1. The Rouge score (Lin, 2004) is used to evaluate the performance of the model, and we report Rouge-1, Rouge-2, and Rouge-L as in prior work. We use pre-trained GloVe embeddings (Pennington et al., 2014) with 100 dimensions to initialize the word embeddings. A one-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is used as the sentence encoder, and the size of its hidden state is 200. A 5-layer Transformer encoder (Vaswani et al., 2017) with 4 heads is used as the document-level self-attention module. A linear classification layer is used to predict whether to choose the sentence.

The training process consists of two phases. First, we use the pre-training task to pre-train the basic model on the raw articles from the CNN/DM dataset without labels. Second, we fine-tune the pre-trained model for the extractive summarization task using the sentence labels. The learning rate is set to 0.0001 in the pre-training phase and 0.00001 in the fine-tuning phase. We train each pre-training task until it converges or the number of training epochs reaches the upper bound of 30. We set the probability to mask, replace, or switch sentences to 0.25.

Table 1: Rouge (Lin, 2004) scores for the basic model, baselines, pre-training methods, and analytic experiments. All of our Rouge scores have a 95% confidence interval of at most ±0.25 as reported by the official ROUGE script. The best result is marked in bold, and those that are not significantly worse than the best are marked with *.

Results
We show the Rouge scores on the development set during the training process in Figure 3, and present the best Rouge score for each method in Table 1. All pre-training methods improve the performance compared with the basic model. In particular, the Switch method achieves the best result on all three evaluation metrics among the pre-training methods, and is even better than the state-of-the-art extractive model NEUSUM. In terms of convergence, the Mask, Replace, and Switch tasks take 21, 24, and 17 epochs in the pre-training phase respectively, and 18, 13, and 9 epochs to achieve the best performance in the fine-tuning phase. The basic model takes 24 epochs to obtain its best result. From Figure 3, we can see that the Switch task converges much faster than the basic model. Even counting the epochs of the pre-training phase, the Switch method (26 epochs) takes roughly the same time as the basic model (24 epochs) to achieve its best performance.

Ablation Study
Reuse only the sentence encoder Our basic model has two main components: a sentence encoder and a document-level self-attention module. The sentence encoder focuses on each individual sentence, while the document-level self-attention module incorporates more document information. To investigate the role of the document-level self-attention module, we reuse only the sentence encoder of the pre-trained model and randomly initialize the document-level self-attention module. The result is shown in Table 1 as SentEnc. We can see that using the whole pre-trained model (Switch 0.25) achieves better performance, which indicates that the model learns some useful document-level information from the pre-training task. We also notice that reusing only the sentence encoder still yields some improvement over the basic model, which means the pre-training task may also help to learn better independent sentence representations.
On the sensitivity of the hyper-parameter In this part, we investigate the sensitivity of the model to the important hyper-parameter P_w, i.e., the probability to switch sentences. In the previous experiment, we switch sentences with probability 0.25. We further try probabilities of 0.15 and 0.35, and show the results in Table 1 as Switch 0.15 and Switch 0.35. We can see that Switch 0.15 achieves basically the same result as Switch 0.25, while Switch 0.35 is slightly worse. Thus the model is not very sensitive to this hyper-parameter, and a switching probability between 0.15 and 0.25 should work well.

Conclusion
In this paper, we propose three self-supervised tasks that force the model to learn about the document context, which benefits the summarization task. Experiments on the CNN/DM dataset verify that, by pre-training on our proposed tasks, the model performs better and converges faster when learning the summarization task. In particular, with the Switch pre-training task, the model even outperforms the state-of-the-art method NEUSUM. Further analytic experiments show that the document context learned by the document-level self-attention module benefits the model on the summarization task, and that the model is not very sensitive to the hyper-parameter of the probability to switch sentences.

Figure 4: The Rouge-1 and Rouge-L scores for each pre-training method and the basic model on the development set during the training process.

Table 2: Rouge (Lin, 2004) scores for the basic model, the pre-training methods, and the baselines. We use the script from https://github.com/magic282/NeuSum to compute the Rouge score. All of our Rouge scores have a 95% confidence interval of at most ±0.22 as reported by the official ROUGE script. The best result for each score is marked in bold, and those that are not significantly worse than the best are marked with *.

A.1 Evaluation results using scripts from NEUSUM
A.2 Rouge-1 and Rouge-L results
The Rouge-1 and Rouge-L results are shown in Figure 4, from which we can see that the Switch method achieves the best performance.