Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers

Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for unsupervised extractive summarization. Specifically, we first pre-train a hierarchical transformer model using unlabeled documents only. Then we propose a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization. We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results.

1 Introduction Document summarization is the task of transforming a long document into a shorter version that still retains its important content. Researchers have explored many paradigms for summarization, and the most popular ones are extractive summarization and abstractive summarization (Nenkova and McKeown, 2011). As their names suggest, extractive summarization generates summaries by extracting text from the original document, while abstractive summarization rewrites documents by paraphrasing or deleting words or phrases.
Most summarization models require labeled data, where documents are paired with human-written summaries. Unfortunately, human labeling for summarization is expensive, and high-quality large-scale labeled summarization datasets are therefore rare (Hermann et al., 2015) compared to the ever-growing number of web documents created every day. It is also not feasible to create summaries for documents in all text domains and styles. In this paper, we focus on unsupervised summarization, where we only need unlabeled documents during training.
Many attempts at unsupervised summarization are extractive (Carbonell and Goldstein, 1998; Radev et al., 2000; Lin and Hovy, 2002; Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Wan, 2008; Wan and Yang, 2008; Hirao et al., 2013; Parveen et al., 2015). The core problem is to identify salient sentences in a document. The most popular approaches among these works rank sentences in the document using graph-based algorithms, where each node is a sentence and edge weights are measured by sentence similarities. A graph ranking method is then employed to estimate sentence importance. For example, TextRank (Mihalcea and Tarau, 2004) utilizes word co-occurrence statistics to compute similarity and then employs PageRank (Page et al., 1999) to rank sentences. Sentence similarities in Zheng and Lapata (2019) are measured with BERT (Devlin et al., 2019), and sentences are sorted w.r.t. their centralities in a directed graph.
Recently, there has been increasing interest in developing unsupervised abstractive summarization models (Wang and Lee, 2018; Fevry and Phang, 2018; Chu and Liu, 2019; Yang et al., 2020). These models are mostly based on sequence-to-sequence learning (Sutskever et al., 2014) and sequential denoising auto-encoding (Dai and Le, 2015). Unfortunately, there is no guarantee that summaries produced by these models are grammatical and consistent with the facts described in the original documents. Zhang et al. (2019) propose an unsupervised method to pre-train a hierarchical transformer model (i.e., HIBERT) for document modeling. The hierarchical transformer has a token-level transformer to learn sentence representations and a sentence-level transformer to learn interactions between sentences with self-attention. In Zhang et al. (2019), HIBERT is applied to supervised extractive summarization. However, we believe that after pre-training HIBERT on large-scale unlabeled data, the self-attention scores in the sentence-level transformer become meaningful for estimating the importance of sentences. Intuitively, if many sentences in a document attend to one particular sentence with high attention scores, then this sentence should be important. In this paper, we find that (sentence-level) transformer attentions (in a hierarchical transformer) can be used to rank sentences for unsupervised extractive summarization, while previous work mostly leverages graph-based (or rule-based) methods and sentence similarities computed with off-the-shelf sentence embeddings. Specifically, we first introduce two pre-training tasks for hierarchical transformers (i.e., extended HIBERT) to obtain sentence-level self-attentions using unlabeled documents only. Then, we design a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization.
We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results. Our code and models are available at https://github.com/xssstory/STAS.

Related Work
In this section, we introduce work on supervised summarization, unsupervised summarization and pre-training.
Supervised Summarization Most summarization models require supervision from labeled datasets, where documents are paired with human-written summaries. As mentioned earlier, extractive summarization aims to extract important sentences from documents and is usually viewed as a (sentence) ranking problem using scores from classifiers (Kupiec et al., 1995) or sequential labeling models (Conroy and O'leary, 2001). The performance of this class of methods is greatly improved when human-engineered features (Nenkova et al., 2006; Filatova and Hatzivassiloglou, 2004) are replaced with convolutional neural networks (CNN) and long short-term memory networks (LSTM) (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018; Zhang et al., 2018).
Abstractive summarization, on the other hand, can generate new words or phrases and is mostly based on sequence-to-sequence (seq2seq) learning (Bahdanau et al., 2015). To better fit the summarization task, the original seq2seq model has been extended with a copy mechanism (Gu et al., 2016), a coverage model (See et al., 2017), reinforcement learning (Paulus et al., 2018) as well as bottom-up attention (Gehrmann et al., 2018). Recently, pre-trained transformers (Vaswani et al., 2017) have achieved tremendous success in many NLP tasks (Devlin et al., 2019). Pre-training methods customized for both extractive (Zhang et al., 2019; Liu and Lapata, 2019) and abstractive (Dong et al., 2019) summarization again advance the state of the art in supervised summarization. Our model also leverages pre-training methods and models, but it is unsupervised. Unsupervised Summarization Compared to supervised models, unsupervised models only need unlabeled documents during training. Most unsupervised extractive models are graph-based (Carbonell and Goldstein, 1998; Radev et al., 2000; Lin and Hovy, 2002; Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Wan, 2008; Wan and Yang, 2008; Hirao et al., 2013; Parveen et al., 2015). For example, TextRank (Mihalcea and Tarau, 2004) treats sentences in a document as nodes in an undirected graph, and edge weights are measured with co-occurrence based similarities between sentences. PageRank (Page et al., 1999) is then employed to determine the final ranking scores for sentences. Zheng and Lapata (2019) build a directed graph by utilizing BERT (Devlin et al., 2019) to compute sentence similarities. The importance score of a sentence is the weighted sum of all its out edges, where weights for edges between the current sentence and preceding sentences are negative. Thus, leading sentences tend to obtain high scores.
Unlike Zheng and Lapata (2019), sentence positions are not explicitly modeled in our model and therefore our model is less dependent on sentence positions (as shown in experiments).
There is also an interesting line of work on unsupervised abstractive summarization. Yang et al. (2020) pre-train a seq2seq Transformer by predicting the first three sentences of news documents and then further tune the model with semantic classification and denoising auto-encoding objectives. The model described in Wang and Lee (2018) utilizes seq2seq auto-encoding coupled with adversarial training and reinforcement learning. Fevry and Phang (2018) and Baziotis et al. (2019) focus on sentence summarization (i.e., compression). Chu and Liu (2019) propose yet another denoising auto-encoding based model in the multi-document summarization domain. However, the performance of these unsupervised abstractive models is still unsatisfactory compared to their extractive counterparts.
Pre-training Pre-training methods in NLP learn to encode text by leveraging unlabeled text. Early work mostly concentrates on pre-training word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). Later, sentence encoders could also be pre-trained with language model (or masked language model) objectives (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Zhang et al. (2019) propose a method to pre-train a hierarchical transformer encoder (document encoder) by predicting masked sentences in a document for supervised summarization, while we focus on unsupervised summarization. In our method, we also propose a new task (sentence shuffling) for pre-training hierarchical transformer encoders. Iter et al. (2020) propose a contrastive pre-training objective to predict the relative distances of surrounding sentences to an anchor sentence, while our sentence shuffling task predicts the original positions of sentences from a shuffled document. Besides, the pre-training methods mentioned above focus on learning good word, sentence or document representations for downstream tasks, while our method focuses on learning sentence-level attention distributions (i.e., sentence associations), which our experiments show to be very helpful for unsupervised summarization.

Model
In this section, we describe our unsupervised summarization model STAS (shorthand for Sentence-level Transformer based Attentive Summarization). We first introduce how documents are encoded in our model. Then we present the methods used to pre-train our document encoder. Finally, we apply the pre-trained encoder to unsupervised summarization.

Document Modeling
Let D = (S_1, S_2, ..., S_|D|) denote a document, where S_i = (w^i_1, w^i_2, ..., w^i_|S_i|) is a sentence in D and w^i_j is a token in S_i. As is common practice, we also add two special tokens (i.e., w^i_0 = <s> and w^i_|S_i| = </s>) to S_i, which represent the beginning and end of a sentence, respectively. Transformer models (Vaswani et al., 2017), which are composed of multiple self-attentive layers and skip connections (He et al., 2016), have shown tremendous success in text encoding (Devlin et al., 2019). Due to the hierarchical nature of documents, we encode the document D using a hierarchical Transformer encoder, which contains a token-level Transformer Trans^T and a sentence-level Transformer Trans^S, as shown in Figure 1. Let || denote the sequence concatenation operator. Trans^T views D as a flat sequence of tokens D = (S_1 || S_2 || ... || S_|D|). After applying Trans^T to D, we obtain contextual representations for all tokens. We use the representation at each <s> token as the representation of its sentence, which yields the sentence representations V = (v_1, v_2, ..., v_|D|). The sentence-level Transformer Trans^S takes V as input and learns sentence representations given the other sentences in D as context:

(H, A) = Trans^S(V)

where H = (h_1, h_2, ..., h_|D|) and h_i is the final representation of S_i; A is the self-attention matrix and A_{i,j} is the attention score from sentence S_i to sentence S_j. Trans^S contains multiple layers and each layer contains multiple attention heads.
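As a concrete illustration of this encoding scheme, the flattening step can be sketched as follows (a minimal sketch; the helper name and the literal `<s>`/`</s>` strings are ours, and a real implementation would map tokens to vocabulary ids):

```python
def flatten_document(sentences):
    """Flatten a document into one token sequence for the token-level
    Transformer: wrap every sentence with <s> ... </s> and record the
    position of each <s>, whose contextual vector later serves as the
    sentence representation."""
    tokens, sent_positions = [], []
    for sent in sentences:
        sent_positions.append(len(tokens))  # index of this sentence's <s>
        tokens.extend(["<s>"] + sent + ["</s>"])
    return tokens, sent_positions

doc = [["the", "cat", "sat"], ["it", "slept"]]
tokens, positions = flatten_document(doc)
# tokens: <s> the cat sat </s> <s> it slept </s>;  positions: [0, 5]
```

The recorded positions are all that is needed to gather the sentence vectors V from the token-level encoder output.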
To obtain A, we first average the attention scores across different heads and then across different layers. Our hierarchical document encoder is similar to the hierarchical Transformer model described in Zhang et al. (2019). The main difference is that our token-level Transformer encodes all sentences in a document as a whole rather than separately.
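The head-then-layer averaging can be sketched as follows (a minimal sketch, assuming the attention maps are collected into a tensor of shape (layers, heads, n_sents, n_sents)):

```python
import numpy as np

def average_attention(attn):
    """Collapse per-layer, per-head attention maps into a single
    sentence-level matrix A by averaging over heads, then over layers.

    `attn` is assumed to have shape (layers, heads, n_sents, n_sents),
    with each (n_sents, n_sents) slice row-stochastic."""
    per_layer = attn.mean(axis=1)   # average over heads  -> (layers, n, n)
    return per_layer.mean(axis=0)   # average over layers -> (n, n)
```

Since each individual attention map is row-stochastic, the averaged A is too, which keeps the later graph-style score transmission well scaled.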

Pre-training
In this section, we describe how we pre-train the hierarchical document encoder introduced in Section 3.1 using unlabeled documents only. We expect that after pre-training, the encoder will have acquired the ability to model interactions (i.e., attentions) among sentences in a document. In the following, we introduce the two tasks we use to pre-train the encoder.

Masked Sentences Prediction
The first task is Masked Sentences Prediction (MSP), described in Zhang et al. (2019). We randomly mask 15% of the sentences in a document and then predict the original sentences. Let D = (S_1, S_2, ..., S_|D|) denote a document and D̃ = (S̃_1, ..., S̃_|D|) the document with some sentences masked, where S̃_i = mask(S_i) for masked positions. In 80% of cases, mask(S_i) replaces each token in S_i with the [MASK] token; in 10% of cases it replaces S_i with a random sentence; and in the remaining 10% of cases it keeps S_i unchanged. The masking strategy is similar to that of BERT (Devlin et al., 2019), but it is applied at the sentence level. Let I = {i | S̃_i = mask(S_i)} denote the set of indices of masked sentences and O = {S_i | i ∈ I} the original sentences corresponding to the masked ones. Supposing i ∈ I, we demonstrate how we predict the original sentence S_i = (w^i_0, w^i_1, ..., w^i_|S_i|) given D̃. As shown in Figure 2, we first encode D̃ using the encoder in Section 3.1 and obtain H̃ = (h̃_1, h̃_2, ..., h̃_|D|). Then we use h̃_i (i.e., the contextual representation of S_i) to predict S_i one token at a time with a conditional Transformer decoder TransDec^M. We inject the information of S_i into TransDec^M by adding h̃_i after the self-attention sub-layer of each Transformer block in TransDec^M. Assuming w^i_{0:j-1} has been generated, TransDec^M estimates the probability of the next token, p(w^i_j | w^i_{0:j-1}, D̃). The probability of all original sentences given D̃ is

p(O | D̃) = ∏_{i∈I} ∏_{j=1}^{|S_i|} p(w^i_j | w^i_{0:j-1}, D̃)

MSP was proposed in HIBERT (Zhang et al., 2019) for supervised summarization, while we use MSP and transformer attention for sentence ranking in unsupervised summarization (Section 3.3). Note that both the goal and the way MSP is used in this work differ from those in HIBERT.

Figure 2: An example of masked sentences prediction. The third sentence in the document is masked and the hierarchical encoder encodes the masked document. We then use TransDec^M to predict the original sentence one token at a time.
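The sentence-level masking rule can be sketched as follows (a simplified illustration under our reading of the 80/10/10 split; the helper names are ours, and for brevity the random-replacement case here draws from the same document):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_sentence(sentence, candidate_sentences, rng):
    """80/10/10 rule applied at the sentence level: 80% of the time
    replace every token with [MASK], 10% swap in a random sentence,
    10% keep the sentence unchanged."""
    p = rng.random()
    if p < 0.8:
        return [MASK_TOKEN] * len(sentence)
    if p < 0.9:
        return list(rng.choice(candidate_sentences))
    return list(sentence)

def mask_document(doc, rng, mask_rate=0.15):
    """Mask ~15% of the sentences in `doc`; return the masked document
    and the sorted index set I of masked sentences."""
    n_masked = max(1, round(mask_rate * len(doc)))
    masked_ids = set(rng.sample(range(len(doc)), n_masked))
    masked_doc = [mask_sentence(s, doc, rng) if i in masked_ids else s
                  for i, s in enumerate(doc)]
    return masked_doc, sorted(masked_ids)
```

The returned index set corresponds to I above; only those positions contribute to the MSP loss.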
Sentence Shuffling We propose a new task that shuffles the sentences in a document and then selects the sentences back in their original order, one by one. We expect that the hierarchical document encoder learns to select sentences based on their contents rather than their positions.
Recall that D = (S_1, S_2, ..., S_|D|) is a document. We shuffle the sentences in D and obtain a permuted document D' = (S'_1, S'_2, ..., S'_|D|), where S_i is the i-th sentence in the original document and S'_{P_i} = S_i (i.e., P_i ∈ [1, |D|] is the position of S_i in D'). In this task, we predict P = (P_1, P_2, ..., P_|D|).
As shown in Figure 3, we first use the document encoder in Section 3.1 to encode D', which yields context-dependent sentence representations H = (h_1, h_2, ..., h_|D|). Supposing that P_0, P_1, ..., P_{t-1} are known, we predict P_t using a Pointer Network (Vinyals et al., 2015) with a Transformer as its decoder. Let TransDec^P denote the Transformer decoder of the Pointer Network, E_{P_i} the absolute positional embedding of P_i in the original document, and p_i the positional embedding of P_i during decoding. The input of TransDec^P at each step is the sum of the sentence representation and the positional embeddings:

x_i = h_{P_i} + E_{P_i} + p_i

The output h^o_t summarizes the sentences de-permuted so far.
Then the probability of selecting S_{P_t} is estimated with the attention (Bahdanau et al., 2015) between h^o_t and all sentences in D' as follows:

p(P_t | P_{0:t-1}, D') = exp(g(h^o_t, h_{P_t})) / Σ_{k=1}^{|D|} exp(g(h^o_t, h_k))

where g is a feed-forward neural network with the following parametrization:

g(h^o_t, h_k) = v_a^T tanh(U_a h^o_t + W_a h_k)

where v_a ∈ R^{d×1}, U_a ∈ R^{d×d}, W_a ∈ R^{d×d} are trainable parameters. Finally, the probability of the positions of the original sentences in the shuffled document is:

p(P | D') = ∏_{t=1}^{|D|} p(P_t | P_{0:t-1}, D')

During training, for each batch of documents we apply both the masked sentences prediction and sentence shuffling tasks. One document D generates a masked document D̃ and a shuffled document D'. Note that 15% of the sentences are masked in D̃, while all sentences are shuffled in D'. The whole model is optimized with the following objective:

L(θ) = - Σ_{D∈X} ( log p(O | D̃) + log p(P | D') )

where D is a document in the training document set X.
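The additive pointer attention used for selection can be sketched as follows (a minimal numpy sketch; the function name and the random parameters are illustrative, and the shapes follow the parametrization above):

```python
import numpy as np

def pointer_distribution(h_o, H, v_a, U_a, W_a):
    """Additive attention g(h_o, h_k) = v_a^T tanh(U_a h_o + W_a h_k),
    turned into a selection distribution over sentences via softmax."""
    scores = np.array([v_a @ np.tanh(U_a @ h_o + W_a @ h_k) for h_k in H])
    scores -= scores.max()  # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```

At each decoding step the argmax (or a beam) over this distribution picks the next original-order sentence.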

Unsupervised Summarization
In this section, we present our unsupervised extractive summarization method. Extractive summarization aims to select the most important sentences in a document. Once we have obtained a hierarchical encoder using the pre-training methods in Section 3.2, we are ready to rank sentences; no additional fine-tuning is needed in this step.
Our first ranking criterion is based on the probabilities of sentences in a document. Recall that D = (S_1, S_2, ..., S_|D|) is a document; its probability is

p(D) = ∏_{i=1}^{|D|} p(S_i | S_{1:i-1})

It is not straightforward to estimate p(S_i | S_{1:i-1}) directly, since the document models in this work are all bidirectional. However, we can estimate p(S_i | D_{¬S_i}) using the masked sentences prediction task in Section 3.2, and we therefore use p(S_i | D_{¬S_i}) to approximate p(S_i | S_{1:i-1}). Finding the most important sentence is then equivalent to finding the sentence with the highest probability p(S_i | D_{¬S_i}). In the following, we demonstrate how to estimate p(S_i | D_{¬S_i}). As in Section 3.2, we create D_{¬S_i} by masking S_i in D (i.e., replacing all tokens in S_i with [MASK] tokens); p(S_i | D_{¬S_i}) can then be estimated using Equation (5). To make the probabilities of different sentences comparable, we normalize them by sentence length and obtain r̂_i as follows (also see Equation (5)):

r̂_i = (1 / |S_i|) Σ_{j=1}^{|S_i|} log p(w^i_j | w^i_{0:j-1}, D_{¬S_i})

We also normalize r̂_i across the sentences in a document and obtain our first ranking criterion r_i:

r_i = exp(r̂_i) / Σ_{k=1}^{|D|} exp(r̂_k)

For the second ranking criterion, we model the contributions of other sentences to the current sentence explicitly. We view a document D as a directed graph, where each sentence is a node. The connections between sentences (i.e., edge weights) are modeled using the self-attention matrix A of the sentence-level Transformer encoder described in Section 3.1, produced by the pre-trained hierarchical document encoder. We assume that a sentence S_j can transmit its importance score r_j to an arbitrary sentence S_i through the edge between them. Let A_{j,i} denote the attention score from S_j to S_i. After receiving the transmissions from all sentences, the second ranking score of S_i is:

r'_i = Σ_{j=1}^{|D|} A_{j,i} r_j

The final ranking score of S_i combines the score from the sentence itself and from the other sentences:

r̄_i = γ_1 r_i + γ_2 r'_i

where γ_1 and γ_2 are coefficients tuned on the development set.
r̄_i can be computed iteratively by assigning r̄_i to r_i and repeating Equation (13) and Equation (14) for T iterations. We find that a small T (T ≤ 3) works well on the development set.
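The two-term scoring and its iterative refinement can be sketched as follows (a simplified reading of Equations (13) and (14); the renormalization step and the default parameter values are illustrative assumptions):

```python
import numpy as np

def rank_sentences(r, A, gamma1=1.0, gamma2=1.0, T=2):
    """Combine the probability-based score r with attention-transmitted
    scores: r'_i = sum_j A[j, i] * r_j, then r_bar = gamma1*r + gamma2*r',
    iterated T times with r <- r_bar (renormalized to sum to 1)."""
    r = np.asarray(r, dtype=float)
    for _ in range(T):
        r_prime = A.T @ r            # entry i receives sum_j A[j, i] * r_j
        r_bar = gamma1 * r + gamma2 * r_prime
        r = r_bar / r_bar.sum()      # keep scores comparable across rounds
    return r
```

With a row-stochastic A and a uniform initial r, a sentence that many other sentences attend to accumulates a higher final score, matching the intuition behind the attention-based criterion.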

Experiments
In this section, we assess the performance of STAS on the document summarization task. We first introduce the datasets we use and give our implementation details. Finally, we compare our method against previous methods.

Datasets
We evaluate STAS on the CNN/DailyMail (CNN/DM) and New York Times (NYT) datasets. For NYT, we adopted the splits widely used in abstractive summarization (Paulus et al., 2018), which rank articles by publication date and use the first 589,284 for training, the next 32,736 for validation and the remaining 32,739 for test. We then filter out documents whose summaries are shorter than 50 words, as in Zheng and Lapata (2019), and finally retain 36,745 documents for training, 5,531 for validation and 4,375 for test. We segment sentences using the Stanford CoreNLP toolkit. Sentences are then tokenized with the UTF-8 based BPE tokenizer used in RoBERTa and GPT-2 (Radford et al., 2019), yielding a vocabulary of 50,265 subwords. During training, we only leverage the articles in CNN/DM or NYT; we do, however, use both articles and summaries in the validation sets to tune the hyper-parameters of our models.
We evaluate the quality of summaries from different models using ROUGE (Lin, 2004). We report full-length F1 based ROUGE-1, ROUGE-2 and ROUGE-L on both the CNN/DM and NYT datasets. These ROUGE scores are computed using the ROUGE-1.5.5.pl script.

Implementation Details
The main building blocks of STAS are Transformers (Vaswani et al., 2017). In the following, we describe their sizes using the number of layers L, the number of attention heads A, and the hidden size H. As in Vaswani et al. (2017) and Devlin et al. (2019), the hidden size of the feed-forward sublayer is always 4H. STAS contains one hierarchical encoder (see Section 3.1) and two decoders, used for the masked sentences prediction and sentence shuffling pre-training tasks (see Section 3.2). The token-level encoder is initialized with the parameters of RoBERTa_BASE and we set L = 12, H = 768, A = 12. The sentence-level encoder and the two decoders are shallower and all adopt the setting L = 6, H = 768, A = 12.
We trained our models with 4 Nvidia Tesla V100 GPUs and optimized them using Adam (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.999. Since the encoder is partly pre-trained (initialized with RoBERTa) while the decoders are initialized randomly, we set a larger learning rate for the decoders: specifically, 4e-5 for the encoder and 4e-4 for the decoders. Since CNN/DM is larger than NYT, we employed a batch size of 512 for CNN/DM and 64 for NYT (to ensure a sufficient number of model updates). Limited by RoBERTa's positional embeddings, all documents are truncated to 512 subword tokens. We trained our models on both CNN/DM and NYT for 100 epochs; each epoch takes around one hour on CNN/DM and 30 minutes on NYT. The best checkpoint is at around epoch 85 on CNN/DM and epoch 65 on NYT according to the validation sets.
When extracting the summary for a new document at test time, we rank all sentences using Equation (14) and select the top-3 sentences as the summary. When selecting sentences on the CNN/DM dataset, we find that trigram blocking (i.e., skipping sentences that share a trigram with already-selected summary sentences) (Paulus et al., 2018) reduces redundancy, while trigram blocking does not help on NYT.
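The selection step with trigram blocking can be sketched as follows (a simplified version; whitespace tokenization and the helper names are illustrative):

```python
def trigrams(sentence):
    """Set of lower-cased word trigrams in a whitespace-tokenized sentence."""
    toks = sentence.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def select_summary(sentences, scores, k=3, block_trigrams=True):
    """Pick the top-k sentences by score, optionally skipping any
    sentence sharing a trigram with already-selected ones, then
    restore document order."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen, seen = [], set()
    for i in order:
        tri = trigrams(sentences[i])
        if block_trigrams and tri & seen:
            continue  # redundant with an already-selected sentence
        chosen.append(i)
        seen |= tri
        if len(chosen) == k:
            break
    return [sentences[i] for i in sorted(chosen)]
```

Setting `block_trigrams=False` recovers plain top-k selection, which matches the NYT setting where blocking does not help.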

Results
Our main results are shown in Table 1. The first block includes several recent supervised models for document summarization. REFRESH (Narayan et al., 2018) is an extractive model trained by globally optimizing the ROUGE metric with reinforcement learning. PTR-GEN (See et al., 2017) is a sequence-to-sequence based abstractive model with copy and coverage mechanisms. Liu and Lapata (2019) initialize the encoders of their extractive (BertSumExt) and abstractive (BertSumAbs) models with pre-trained BERT.
We present the results of previous unsupervised methods in the second block. LEAD-3 simply selects the first three sentences of each document as its summary. TEXTRANK (Mihalcea and Tarau, 2004) views a document as a graph with sentences as nodes and edge weights measured by sentence similarities; it selects top sentences as the summary w.r.t. PageRank (Page et al., 1999) scores. PACSUM (Zheng and Lapata, 2019) is another graph-based extractive model, using BERT for sentence features. Its sentences are ranked by centrality (the sum of all out-edge weights), and the ranking criterion is made position-sensitive by forcing negative weights for edges between the current sentence and its preceding sentences. Adv-RF (Wang and Lee, 2018) and TED (Yang et al., 2020) are both based on unsupervised seq2seq auto-encoding, with the additional objectives of adversarial training and reinforcement learning (Adv-RF) and seq2seq pre-training to predict leading sentences (TED).
PACSUM is based on BERT (Devlin et al., 2019) initialization. RoBERTa, which extends BERT with better training strategies and more training data, outperforms BERT on many tasks. We therefore re-implemented PACSUM and extended it with both BERT and RoBERTa initialization (i.e., PACSUM (BERT) and PACSUM (RoBERTa)). On CNN/DM, our re-implemented PACSUM (BERT) is comparable with Zheng and Lapata (2019). The results of PACSUM (BERT) and the RoBERTa-initialized PACSUM (RoBERTa) are almost the same, perhaps because PACSUM relies more on position information than on the sentence similarities computed by BERT or RoBERTa. STAS outperforms all unsupervised models in comparison on CNN/DM, and the differences between STAS and all other unsupervised models are significant with a 0.95 confidence interval according to the ROUGE script. In the following, all significance tests on ROUGE are measured with a 0.95 confidence interval using the ROUGE script. Since STAS does not model sentence positions explicitly during ranking while PACSUM does, we linearly combine the ranking scores of STAS and PACSUM (i.e., STAS + PACSUM). The combination further improves performance.
On NYT, the trend is similar. STAS is slightly better than PACSUM, although not significantly (STAS is significantly better than all the other unsupervised models in comparison). Interestingly, there are also no significant differences between STAS and two supervised models (REFRESH and PTR-GEN). STAS + PACSUM even significantly outperforms the supervised REFRESH.

Analysis
Ablation Study In Section 3.2, we proposed two pre-training tasks. Are both of them useful for summarization? As shown in Table 2, when we only employ the masked sentences prediction task (MSP), we obtain a ROUGE-2 of 17.73, which is already very close to the result of PACSUM (see Table 1). When we add the sentence shuffling task (denoted MSP+SS (STAS)), performance improves over MSP alone. Note that we cannot use only the sentence shuffling task (SS), because the first term in our sentence scoring equation (see Equation (14)) depends on the probabilities produced by the decoder of the MSP task. In Section 3.3, we proposed two criteria to score sentences (the two terms in Equation (14)); their effects are shown in the second block of Table 2. Since the attention-based criterion r' relies on the sentence probability based criterion r, we cannot simply remove r; instead, we set r_i = 1/|D| to see the effect of r. As a result, ROUGE-2 decreases by 0.14, which indicates that r is necessary for ranking. Note that when setting r_i = 1/|D|, sentences are in effect ranked by the attention scores alone. To study the effect of the attention-based criterion r', we set r' = 0, which means sentences are ranked using the sentence probability based criterion r only. The performance then drops dramatically, by about 5 ROUGE-2 points.
A_{j,i} vs. A_{i,j} In Equation (13), we compute r'_i with A_{j,i} (the attention score from S_j to S_i). The intuition behind using A_{j,i} is that a sentence is important if the interpretation of other important sentences depends on it. An alternative, however, is to use A_{i,j}. Table 3 shows that A_{j,i} is indeed better.

Sentence Position Distribution
We also analyze how the sentences extracted by different models are distributed in documents. We compare STAS against LEAD-3, PACSUM and ORACLE using the first 12 sentences of all documents in the CNN/DM test set. ORACLE is the upper bound for extractive models; its extractive summaries are generated by selecting the subset of sentences in a document that maximizes the ROUGE score (Nallapati et al., 2017). As shown in Figure 4, sentences selected by ORACLE are smoothly distributed across all positions, while LEAD-3 only selects the first three sentences. The sentence distribution of PACSUM is closer to that of LEAD-3, while STAS produces a distribution more similar to that of ORACLE. This indicates that our model relies less on sentence positions than PACSUM does. We further computed the Kullback-Leibler divergence between the sentence position distribution of each unsupervised model and that of ORACLE, denoted KL(·||ORC). We found that KL(PACSUM||ORC) = 0.614 is much larger than KL(STAS||ORC) = 0.098, indicating that STAS is better correlated with ORACLE. We introduced the sentence shuffling task to encourage STAS to select sentences based on their contents rather than their positions only (see Section 3.2). After removing the sentence shuffling task from STAS during pre-training (see MSP in Figure 4), there is a clear trend that leading sentences are selected more frequently. Moreover, KL(STAS||ORC) < KL(MSP||ORC) = 0.108: with the sentence shuffling task, the sentence position distribution of STAS is closer to that of ORACLE.
Why Sentence Shuffling? The sentence shuffling task aims to make STAS less dependent on sentence positions, but there are potentially simpler ways to remove sentence position information. For example, (a) we can remove the sentence-level positional embeddings, or (b) restart the token-level positional embeddings from position 0 for each sentence. Results in Table 4 indicate that, on top of the MSP objective, strategies (a) and (b) significantly hurt performance, while SS improves over MSP. This may be because positional embeddings, whether at the token or sentence level, are important (at least for the MSP task). One advantage of SS over (a) and (b) is that it makes our model less dependent on positions (see Section 4.4) while retaining the power of positional embeddings and the MSP objective at the same time.
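The KL(·||ORC) comparison used in the position-distribution analysis above can be sketched as follows (the toy distributions are illustrative, not the measured ones from Figure 4):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete sentence-position distributions;
    eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# toy example: a lead-biased selector diverges more from a smooth,
# oracle-like position distribution than a flatter selector does
oracle = [0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10]
lead3  = [0.34, 0.33, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00]
flat   = [0.22, 0.16, 0.14, 0.10, 0.10, 0.10, 0.09, 0.09]
```

A smaller KL against the oracle distribution indicates position behavior closer to the extractive upper bound.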

Conclusions
In this paper, we find that (sentence-level) transformer attentions (in a hierarchical transformer) can be used to rank sentences for unsupervised extractive summarization, while previous work leverages graph-based (or rule-based) methods and sentence similarities computed with off-the-shelf sentence embeddings. We propose the sentence shuffling task for pre-training hierarchical transformers, which helps our model select sentences based on their contents rather than their positions only. Experimental results on the CNN/DM and NYT datasets show that our model outperforms other recently proposed unsupervised methods. The sentence position distribution analysis shows that our method is less dependent on sentence positions. When combined with a recent unsupervised model that explicitly models sentence positions, we obtain even better results. In future work, we plan to apply our models to unsupervised abstractive summarization.