HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization

Neural extractive summarization models usually employ a hierarchical encoder for document encoding and are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by the recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding and a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.


Introduction
Automatic document summarization is the task of rewriting a document into a shorter form while still retaining its important content. Over the years, many paradigms for document summarization have been explored (see Nenkova and McKeown (2011) for an overview). The two most popular among them are extractive approaches and abstractive approaches. As the name implies, extractive approaches generate summaries by extracting parts of the original document (usually sentences), while abstractive methods may generate new words or phrases which are not in the original document.
Extractive summarization is usually modeled as a sentence ranking problem with length constraints (e.g., max number of words or sentences). Top ranked sentences (under constraints) are selected as summaries. Early attempts mostly leverage manually engineered features (Filatova and Hatzivassiloglou, 2004a). Based on these sparse features, sentences are selected using a classifier or a regression model. Later, the feature engineering part in this paradigm was replaced with neural networks. Cheng and Lapata (2016) propose a hierarchical long short-term memory network (LSTM; Hochreiter and Schmidhuber 1997) to encode a document and then use another LSTM to predict binary labels for each sentence in the document. This architecture has been widely adopted recently (Nallapati et al., 2017; Narayan et al., 2018; Zhang et al., 2018). Our model also employs a hierarchical document encoder, but we adopt a hierarchical transformer (Vaswani et al., 2017) rather than a hierarchical LSTM, because recent studies (Vaswani et al., 2017; Devlin et al., 2018) show the transformer model performs better than LSTM in many tasks.
Abstractive models did not attract much attention until recently. They are mostly based on sequence to sequence (seq2seq) models (Bahdanau et al., 2015), where a document is viewed as a sequence and its summary is viewed as another sequence. Although seq2seq based summarizers can be equipped with a copy mechanism (Gu et al., 2016; See et al., 2017), a coverage model (See et al., 2017) and reinforcement learning (Paulus et al., 2017), there is still no guarantee that the generated summaries are grammatical and convey the same meaning as the original document does. It seems that extractive models are more reliable than their abstractive counterparts.
However, extractive models require sentence level labels, which are usually not included in most summarization datasets (most datasets only contain document-summary pairs). Sentence labels are usually obtained by rule-based methods (e.g., maximizing the ROUGE score between a set of sentences and reference summaries) and may not be accurate. Extractive models proposed recently (Cheng and Lapata, 2016; Nallapati et al., 2017) employ hierarchical document encoders and even have neural decoders, which are complex. Training such complex neural models with inaccurate binary labels is challenging. We observed in our initial experiments on one of our datasets that our extractive model (see Section 3.3 for details) overfits to the training set quickly after the second epoch, which indicates the training set may not be fully utilized. Inspired by the recent pre-training work in natural language processing (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), our solution to this problem is to first pre-train the "complex" part (i.e., the hierarchical encoder) of the extractive model on unlabeled data and then learn to classify sentences with our model initialized from the pre-trained encoder. In this paper, we propose HIBERT, which stands for HIerarchical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times datasets.

Related Work
In this section, we introduce work on extractive summarization, abstractive summarization and pre-trained natural language processing models. For a more comprehensive review of summarization, we refer the interested readers to Nenkova and McKeown (2011) and Mani (2001).
Extractive Summarization Extractive summarization aims to select important sentences (sometimes other textual units such as elementary discourse units (EDUs)) from a document as its summary. It is usually modeled as a sentence ranking problem by using the scores from classifiers (Kupiec et al., 1995), sequential labeling models (Conroy and O'Leary, 2001) as well as integer linear programming (Woodsend and Lapata, 2010). Early work with the models above mostly leverages human engineered features such as sentence position and length (Radev et al., 2004), word frequency (Nenkova et al., 2006) and event features (Filatova and Hatzivassiloglou, 2004b).
Following the very successful application of neural networks to a wide range of NLP tasks, the manually engineered features (for document encoding) have been replaced with hierarchical LSTMs/CNNs and the sequence labeling (or classification) model has been replaced with an LSTM decoder (Cheng and Lapata, 2016; Nallapati et al., 2017). The architecture is widely adopted in recent neural extractive models and is extended with reinforcement learning (Narayan et al., 2018; Dong et al., 2018), latent variable models (Zhang et al., 2018), joint scoring and iterative document representation. Recently, transformer networks (Vaswani et al., 2017) have achieved good performance in machine translation (Vaswani et al., 2017) and a range of NLP tasks (Devlin et al., 2018; Radford et al., 2018). Different from the extractive models above, we adopt a hierarchical Transformer for document encoding and also propose a method to pre-train the document encoder.
Abstractive Summarization Abstractive summarization aims to generate the summary of a document by rewriting. Most recent abstractive models (Nallapati et al., 2016) are based on neural sequence to sequence learning (Bahdanau et al., 2015; Sutskever et al., 2014). However, the generated summaries of these models cannot be controlled (i.e., their meanings can be quite different from the original and contents can be repeated). Therefore, a copy mechanism (Gu et al., 2016), a coverage model (See et al., 2017) and a reinforcement learning model optimizing ROUGE (Paulus et al., 2017) have been introduced. These problems are alleviated but not solved. There is also an interesting line of work combining extractive and abstractive summarization with reinforcement learning (Chen and Bansal, 2018), fused attention (Hsu et al., 2018) and bottom-up attention (Gehrmann et al., 2018). Our model, which is a very good extractive model, can be used as the sentence extraction component in these models and can potentially improve their performance.
Pre-trained NLP Models Most model pre-training methods in NLP leverage the natural ordering of text. For example, word2vec uses the surrounding words within a fixed size window to predict the word in the middle with a log bilinear model. The resulting word embedding table can be used in other downstream tasks. There are other word embedding pre-training methods using similar techniques (Pennington et al., 2014; Bojanowski et al., 2017). Peters et al. (2018) and Radford et al. (2018) find that even a sentence encoder (not just word embeddings) can also be pre-trained with language model objectives (i.e., predicting the next or previous word). The language model objective is unidirectional, while many tasks can leverage the context in both directions. Therefore, Devlin et al. (2018) propose the naturally bidirectional masked language model objective (i.e., masking several words with a special token in a sentence and then predicting them). All the methods above aim to pre-train word embeddings or sentence encoders, while our method aims to pre-train hierarchical document encoders (i.e., hierarchical transformers), which are important in summarization.
Figure 1: The architecture of HIBERT during training. sent i is a sentence in the document above, which has four sentences in total. sent 3 is masked during encoding and the decoder predicts the original sent 3.

Model
In this section, we present our model HIBERT. We first introduce how documents are represented in HIBERT. We then describe our method to pre-train HIBERT and finally move on to the application of HIBERT to summarization.

Document Representation
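Let $D = (S_1, S_2, \dots, S_{|D|})$ denote a document, where $S_i = (w^i_1, w^i_2, \dots, w^i_{|S_i|})$ is a sentence in $D$ and $w^i_j$ a word in $S_i$. Note that, following common practice in the natural language processing literature, $w^i_{|S_i|}$ is an artificial EOS (End Of Sentence) token.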
To obtain the representation of $D$, we use two encoders: a sentence encoder to transform each sentence in $D$ to a vector and a document encoder to learn sentence representations given their surrounding sentences as context. Both the sentence encoder and document encoder are based on the Transformer encoder described in Vaswani et al. (2017). As shown in Figure 1, they are nested in a hierarchical fashion. A transformer encoder usually has multiple layers and each layer is composed of a multi-head self attentive sub-layer followed by a feed-forward sub-layer with residual connections (He et al., 2016) and layer normalizations (Ba et al., 2016). For more details of the Transformer encoder, we refer the interested readers to Vaswani et al. (2017). To learn the representation of $S_i$, $S_i = (w^i_1, w^i_2, \dots, w^i_{|S_i|})$ is first mapped into continuous space

$$E_i = (e^i_1, e^i_2, \dots, e^i_{|S_i|}), \qquad e^i_j = e(w^i_j) + p_j \tag{1}$$

where $e(w^i_j)$ and $p_j$ are the word and positional embeddings of $w^i_j$, respectively. The word embedding matrix is randomly initialized and we adopt the sine-cosine positional embedding (Vaswani et al., 2017). Then the sentence encoder (a Transformer) transforms $E_i$ into a list of hidden representations $(h^i_1, h^i_2, \dots, h^i_{|S_i|})$. We take the last hidden representation $h^i_{|S_i|}$ (i.e., the representation at the EOS token) as the representation of sentence $S_i$. Similar to the representation of each word in $S_i$, we also take the sentence position into account. The final representation of $S_i$ is

$$\hat{h}_i = h^i_{|S_i|} + p_i \tag{2}$$

Note that words and sentences share the same positional embedding matrix. In analogy to the sentence encoder, as shown in Figure 1, the document encoder is yet another Transformer but applied at the sentence level. After running the Transformer on a sequence of sentence representations $(\hat{h}_1, \hat{h}_2, \dots, \hat{h}_{|D|})$, we obtain the context sensitive sentence representations $(d_1, d_2, \dots, d_{|D|})$. Now we have finished the encoding of a document with a hierarchical bidirectional transformer encoder HIBERT.
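The sine-cosine positional embedding and the way a sentence vector is formed (EOS hidden state plus the sentence's positional embedding) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names are hypothetical, positions are assumed to start at 0, and the hidden states are made-up numbers standing in for the output of the sentence-level Transformer.

```python
import math

def sinusoidal_embedding(position, dim):
    """Sine-cosine positional embedding (Vaswani et al., 2017) for one position."""
    vec = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / dim).
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def sentence_representation(hidden_states, sentence_index):
    """Take the hidden state at the EOS token (the last position) and add the
    positional embedding of the sentence's index within the document."""
    eos_hidden = hidden_states[-1]
    position = sinusoidal_embedding(sentence_index, len(eos_hidden))
    return [h + p for h, p in zip(eos_hidden, position)]

# Toy usage: a 3-token sentence with 4-dimensional hidden states (made-up values).
hidden = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [1.0, 1.0, 1.0, 1.0]]
rep = sentence_representation(hidden, sentence_index=0)
```

Because the same embedding function is used for word positions and sentence positions, the sketch mirrors the statement that words and sentences share one positional embedding matrix.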
Note that in previous work, document representations are also learned with hierarchical models, but each hierarchy is a Recurrent Neural Network (Nallapati et al., 2017) or Convolutional Neural Network (Cheng and Lapata, 2016). We choose the Transformer because it outperforms CNN and RNN in machine translation (Vaswani et al., 2017), semantic role labeling (Strubell et al., 2018) and other NLP tasks (Devlin et al., 2018). In the next section we will introduce how we train HIBERT with an unsupervised training objective.

Pre-training
Most recent encoding neural models used in NLP (e.g., RNNs, CNNs or Transformers) can be pretrained by predicting a word in a sentence (or a text span) using other words within the same sentence (or span). For example, ELMo (Peters et al., 2018) and OpenAI-GPT (Radford et al., 2018) predict a word using all words on its left (or right); while word2vec (Mikolov et al., 2013) predicts one word with its surrounding words in a fixed window and BERT (Devlin et al., 2018) predicts (masked) missing words in a sentence given all the other words.
All the models above learn the representation of a sentence, where its basic units are words. HIBERT aims to learn the representation of a document, where its basic units are sentences. Therefore, a natural way of pre-training a document level model (e.g., HIBERT) is to predict a sentence (or sentences) instead of a word (or words). We could predict a sentence in a document with all the sentences on its left (or right) as in a (document level) language model. However, in summarization, context in both directions is available. We therefore opt to predict a sentence using all sentences on both its left and right.
We randomly select 15% of the sentences in D and mask them. Then, we predict these masked sentences. The prediction task here is similar to the Cloze task (Taylor, 1953; Devlin et al., 2018), but the missing part is a sentence. However, the input document is not masked at test time. So that our model can adapt to documents without masks, we do not always mask the selected sentences. Once a sentence is selected (as one of the 15% selected sentences), we transform it with one of the three methods below. We will use an example to demonstrate the transformation. For instance, we have the following document and the second sentence is selected:
William Shakespeare is a poet . He died in 1616 . He is regarded as the greatest writer .
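In 80% of the cases, we mask the selected sentence (i.e., we replace each word in the sentence with a mask token [MASK]). The document above becomes William Shakespeare is a poet . [MASK] [MASK] [MASK] [MASK] [MASK] He is regarded as the greatest writer . (where "He died in 1616 . " is masked).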
In 10% of the cases, we keep the selected sentence as it is. This strategy is to simulate the input document during test time (with no masked sentences).
In the remaining 10% of cases, we replace the selected sentence with a random sentence. In this case, the document after transformation is William Shakespeare is a poet . Birds can fly . He is regarded as the greatest writer . The second sentence is replaced with "Birds can fly ." This strategy intends to add some noise during training and make the model more robust.
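The selection and transformation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mask_document` and the `random_pool` argument (where replacement sentences come from) are hypothetical, and rounding 15% up to at least one sentence for short documents is our own choice.

```python
import random

MASK = "[MASK]"

def mask_document(sentences, random_pool, rng, select_rate=0.15):
    """Return (transformed_document, selected_indices).

    sentences:   list of sentences, each a list of tokens.
    random_pool: sentences to draw random replacements from (hypothetical source).
    rng:         a random.Random instance, so the sketch is reproducible.
    """
    # Assumption: always select at least one sentence, even in short documents.
    n_select = max(1, round(select_rate * len(sentences)))
    selected = sorted(rng.sample(range(len(sentences)), n_select))
    transformed = [list(s) for s in sentences]  # copy, so the input is untouched
    for k in selected:
        roll = rng.random()
        if roll < 0.8:
            # 80%: replace every word of the sentence with the mask token.
            transformed[k] = [MASK] * len(sentences[k])
        elif roll < 0.9:
            # 10%: keep the sentence as it is (simulates unmasked test input).
            pass
        else:
            # 10%: replace the sentence with a random one (adds noise).
            transformed[k] = list(rng.choice(random_pool))
    return transformed, selected

doc = [
    "William Shakespeare is a poet .".split(),
    "He died in 1616 .".split(),
    "He is regarded as the greatest writer .".split(),
]
pool = ["Birds can fly .".split()]
masked_doc, selected = mask_document(doc, pool, random.Random(0))
```

Whatever branch is taken, the original sentences at the selected indices remain the prediction targets.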

Sentence Prediction
After the application of the above procedures to a document $D = (S_1, S_2, \dots, S_{|D|})$, we obtain the masked document $\tilde{D} = (\tilde{S}_1, \tilde{S}_2, \dots, \tilde{S}_{|D|})$. Let $K$ denote the set of indices of selected sentences in $D$. Now we are ready to predict the masked sentences $M = \{S_k \mid k \in K\}$ using $\tilde{D}$. We first apply the hierarchical encoder HIBERT in Section 3.1 to $\tilde{D}$ and obtain its context sensitive sentence representations $(\tilde{d}_1, \tilde{d}_2, \dots, \tilde{d}_{|D|})$. We will demonstrate how we predict the masked sentence $S_k = (w^k_0, w^k_1, w^k_2, \dots, w^k_{|S_k|})$ one word per step ($w^k_0$ is an artificially added BOS token). At the $j$th step, we predict $w^k_j$ given $w^k_0, \dots, w^k_{j-1}$ and $\tilde{D}$. $\tilde{d}_k$ already encodes the information of $\tilde{D}$ with a focus around its $k$th sentence $\tilde{S}_k$. As shown in Figure 1, we employ a Transformer decoder (Vaswani et al., 2017) to predict $w^k_j$ with $\tilde{d}_k$ as its additional input. The transformer decoder we used here is slightly different from the original one. The original decoder employs two multi-head attention layers to include both the context in the encoder and the decoder, while we only need one to learn the decoder context, since the context in the encoder is a vector (i.e., $\tilde{d}_k$). Specifically, after applying the word and positional embeddings to $(w^k_0, \dots, w^k_{j-1})$, we obtain $\tilde{E}^k_{0:j-1} = (\tilde{e}^k_0, \dots, \tilde{e}^k_{j-1})$ (also see Equation 1). Then we apply the multi-head attention sub-layer to $\tilde{E}^k_{0:j-1}$:

$$\tilde{h}_{j-1} = \mathrm{MultiHead}(q_{j-1}, K_{j-1}, V_{j-1}), \qquad q_{j-1} = W^Q \tilde{e}^k_{j-1}, \quad K_{j-1} = W^K \tilde{E}^k_{0:j-1}, \quad V_{j-1} = W^V \tilde{E}^k_{0:j-1} \tag{3}$$

where $q_{j-1}$, $K_{j-1}$, $V_{j-1}$ are the input query, key and value matrices of the multi-head attention function (Vaswani et al., 2017) $\mathrm{MultiHead}(\cdot, \cdot, \cdot)$, respectively. $W^Q \in \mathbb{R}^{d \times d}$, $W^K \in \mathbb{R}^{d \times d}$ and $W^V \in \mathbb{R}^{d \times d}$ are weight matrices. Then we include the information of $\tilde{D}$ by addition:

$$\tilde{x}_{j-1} = \tilde{h}_{j-1} + \tilde{d}_k \tag{4}$$

We also apply a feedforward sub-layer (one hidden layer with ReLU (Glorot et al., 2011) activation function) after $\tilde{x}_{j-1}$ as in Vaswani et al. (2017):

$$\tilde{g}_{j-1} = W^{ff}_2 \max(0, W^{ff}_1 \tilde{x}_{j-1} + b_1) + b_2 \tag{5}$$

Note that the transformer decoder can have multiple layers by applying Equations (3) to (5) multiple times and we only show the computation of one layer for simplicity.
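To make the data flow of one decoder step concrete, here is a deliberately simplified sketch in plain Python. It assumes a single attention head, no bias terms, and no residual connections or layer normalization, so it is a reduced illustration of the attention, addition and feed-forward sequence described above rather than the full multi-head decoder. All names (`decoder_step`, `W_q`, `W_1`, and so on) are illustrative.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(prefix_embeddings, d_k, W_q, W_k, W_v, W_1, W_2):
    """One simplified decoder layer for predicting the next word.

    prefix_embeddings: embeddings of the words generated so far (BOS first).
    d_k:               encoder vector of the sentence being reconstructed.
    """
    # Query from the most recent word; keys and values from the whole prefix.
    q = matvec(W_q, prefix_embeddings[-1])
    keys = [matvec(W_k, e) for e in prefix_embeddings]
    values = [matvec(W_v, e) for e in prefix_embeddings]
    # Single-head scaled dot-product attention over the prefix.
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    h = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]
    # Inject the document context by simple addition of the sentence vector.
    x = [hi + di for hi, di in zip(h, d_k)]
    # Feed-forward sub-layer: one hidden layer with ReLU.
    hidden = [max(0.0, z) for z in matvec(W_1, x)]
    return matvec(W_2, hidden)

identity = [[1.0, 0.0], [0.0, 1.0]]
g = decoder_step([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5],
                 identity, identity, identity, identity, identity)
```

With identity weights the output is just the attention-weighted average of the prefix plus the sentence vector, which makes the role of each step easy to inspect.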
The probability of $w^k_j$ given $w^k_0, \dots, w^k_{j-1}$ and $\tilde{D}$ is:

$$p(w^k_j \mid w^k_{0:j-1}, \tilde{D}) = \mathrm{softmax}(W^O \tilde{g}_{j-1}) \tag{6}$$

Finally the probability of all masked sentences $M$ given $\tilde{D}$ is

$$p(M \mid \tilde{D}) = \prod_{k \in K} \prod_{j=1}^{|S_k|} p(w^k_j \mid w^k_{0:j-1}, \tilde{D}) \tag{7}$$

The model above can be trained by minimizing the negative log-likelihood of all masked sentences given their paired documents. We can in theory have an unlimited amount of training data for HIBERT, since it can be generated automatically from (unlabeled) documents. Therefore, we can first train HIBERT on a large amount of data and then apply it to downstream tasks. In the next section, we will introduce its application to document summarization.
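The training objective is a sum of per-word negative log-probabilities over all masked sentences. A small sketch follows, assuming the model's probabilities for the gold words have already been computed; the function name and input layout are hypothetical.

```python
import math

def masked_sentence_nll(gold_word_probs):
    """Negative log-likelihood of all masked sentences in one document.

    gold_word_probs: one list per masked sentence, holding the probability the
    model assigned to each gold word of that sentence (in order).
    """
    return -sum(math.log(p) for sentence in gold_word_probs for p in sentence)

# Two masked sentences: one with two words, one with a single word.
loss = masked_sentence_nll([[0.5, 0.5], [0.25]])
```

Because the likelihood is a product over sentences and words, its logarithm decomposes into the flat sum used here.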

Extractive Summarization
Extractive summarization selects the most important sentences in a document as its summary. In this section, summarization is modeled as a sequence labeling problem. Specifically, a document is viewed as a sequence of sentences and a summarization model is expected to assign a True or False label to each sentence, where True means this sentence should be included in the summary. In the following, we will introduce the details of our summarization model based on HIBERT. Let $D = (S_1, S_2, \dots, S_{|D|})$ denote a document and $Y = (y_1, y_2, \dots, y_{|D|})$ its sentence labels (methods for obtaining these labels are in Section 4.1). As shown in Figure 2, we first apply the hierarchical bidirectional transformer encoder HIBERT to $D$, which yields the context dependent representations for all sentences $(d_1, d_2, \dots, d_{|D|})$. The probability of the label of $S_i$ can be estimated using an additional linear projection and a softmax:

$$p(y_i \mid D) = \mathrm{softmax}(W^S d_i) \tag{8}$$

where $W^S \in \mathbb{R}^{2 \times d}$. The summarization model can be trained by minimizing the negative log-likelihood of all sentence labels given their paired documents.
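The classification head is small enough to sketch directly: a 2 × d projection followed by a softmax, trained with negative log-likelihood. In the sketch below the names are hypothetical, and the convention that row 0 scores False and row 1 scores True is our assumption; the sentence vectors stand in for the encoder outputs.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_distribution(d_i, W_s):
    """Return [p(False), p(True)] for one sentence vector; W_s has shape 2 x d."""
    logits = [sum(w * x for w, x in zip(row, d_i)) for row in W_s]
    return softmax(logits)

def labeling_nll(sentence_vectors, labels, W_s):
    """Negative log-likelihood of the gold True/False labels of one document."""
    return -sum(
        math.log(label_distribution(d, W_s)[int(y)])
        for d, y in zip(sentence_vectors, labels)
    )

# Toy projection: only the first feature contributes to the True logit.
W_s = [[0.0, 0.0], [1.0, 0.0]]
probs = label_distribution([math.log(3), 0.0], W_s)
```

At inference time only the True probability of each sentence is needed for ranking.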

Experiments
In this section we assess the performance of our model on the document summarization task. We first introduce the datasets we used for pre-training and for the summarization task and give implementation details of our model. We then compare our model against multiple previous models.

Datasets
We conducted our summarization experiments on the non-anonymized version of the CNN/Dailymail (CNNDM) dataset (Hermann et al., 2015; See et al., 2017), and the New York Times dataset (Durrett et al., 2016; Xu and Durrett, 2019). To create sentence level labels for extractive summarization, we used a strategy similar to Nallapati et al. (2017). We label the subset of sentences in a document that maximizes ROUGE (Lin, 2004) (against the human summary) as True and all other sentences as False.
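One common way to approximate the ROUGE-maximizing subset is greedy selection: repeatedly add the sentence that most improves the score and stop when no sentence helps. The sketch below illustrates that idea only. It is not the exact procedure used here: the greedy search, the cap of three sentences, and the unigram-overlap F1 used as a stand-in for ROUGE are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

def overlap_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a rough stand-in for ROUGE-1 F1."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def greedy_labels(sentences, reference, max_selected=3):
    """Return one True/False label per sentence."""
    selected, best_score = [], 0.0
    while len(selected) < max_selected:
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the current selection plus sentence i, in document order.
            tokens = [t for j in sorted(selected + [i]) for t in sentences[j]]
            candidates.append((overlap_f1(tokens, reference), i))
        if not candidates:
            break
        # Highest score wins; ties go to the earlier sentence.
        score, index = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best_score:
            break  # no remaining sentence improves the score
        selected.append(index)
        best_score = score
    return [i in selected for i in range(len(sentences))]

doc = [["the", "cat", "sat"], ["dogs", "bark"], ["on", "the", "mat"]]
summary = ["the", "cat", "sat", "on", "the", "mat"]
labels = greedy_labels(doc, summary)
```

Labels produced this way depend on the scoring proxy and the stopping rule, which is one reason such heuristic labels can be inaccurate.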
To pre-train our document model HIBERT without supervision (see Section 3.2 for details), we created the GIGA-CM dataset (6,626,842 documents and 2,854 million words in total), which includes 6,339,616 documents sampled from the English Gigaword dataset and the training split of the CNNDM dataset. We used the validation set of CNNDM as the validation set of GIGA-CM as well. As in See et al. (2017), documents and summaries in CNNDM, NYT50 and GIGA-CM are all segmented and tokenized using the Stanford CoreNLP toolkit. To reduce the vocabulary size, we applied byte pair encoding (BPE; Sennrich et al. 2016) to all of our datasets. To limit the memory consumption during training, we limit the length of each sentence to 50 words (the 51st word and onwards are removed) and split documents with more than 30 sentences into smaller documents, each containing at most 30 sentences.
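The two length limits can be applied with a couple of list slices. The sketch below assumes that long documents are cut into consecutive chunks, which the text does not spell out, and the function name is hypothetical.

```python
def truncate_and_split(document, max_words=50, max_sentences=30):
    """Clip each sentence to max_words tokens, then cut the document into
    consecutive pieces of at most max_sentences sentences.

    document: list of sentences, each a list of tokens.
    """
    clipped = [sentence[:max_words] for sentence in document]
    return [
        clipped[start:start + max_sentences]
        for start in range(0, len(clipped), max_sentences)
    ]

# A toy document with 65 sentences of 60 tokens each.
long_doc = [["tok"] * 60 for _ in range(65)]
pieces = truncate_and_split(long_doc)
```

With these defaults a 65-sentence document becomes three smaller documents, the last one holding the remaining five sentences.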

Implementation Details
Our model is trained in three stages: two pre-training stages and one finetuning stage. The first stage is the open-domain pre-training, in which we train HIBERT with the pre-training objective (Section 3.2) on the GIGA-CM dataset. In the second stage, we perform the in-domain pre-training on the CNNDM (or NYT50) dataset, still with the same pre-training objective. In the final stage, we finetune HIBERT in the summarization model (Section 3.3) to predict extractive sentence labels on CNNDM (or NYT50).
The sizes of the sentence and document level Transformers as well as the Transformer decoder in HIBERT are the same. Let L denote the number of layers in the Transformer, H the hidden size and A the number of attention heads. As in Vaswani et al. (2017) and Devlin et al. (2018), the hidden size of the feedforward sublayer is 4H. We mainly trained two model sizes: HIBERT S (L = 6, H = 512 and A = 8) and HIBERT M (L = 6, H = 768 and A = 12). We trained both HIBERT S and HIBERT M on a single machine with 8 Nvidia Tesla V100 GPUs with a batch size of 256 documents. We optimized our models using Adam with a learning rate of 1e-4, β 1 = 0.9, β 2 = 0.999, L2 norm of 0.01, learning rate warmup over 10,000 steps and learning rate decay afterwards using the strategies in Vaswani et al. (2017). The dropout rate in all layers is 0.1. In the pre-training stages, we trained our models until validation perplexities did not decrease significantly (around 45 epochs on the GIGA-CM dataset and 100 to 200 epochs on CNNDM and NYT50). Training HIBERT M for one epoch on the GIGA-CM dataset takes approximately 20 hours.
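The text gives the learning rate, the warmup length and a pointer to the schedule of Vaswani et al. (2017), but not the exact formula. One plausible reading, shown below purely as an assumption, is linear warmup to the stated rate followed by inverse square-root decay, i.e. the Vaswani et al. schedule rescaled so that its peak equals 1e-4. The function name and defaults are illustrative.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to peak_lr, then decay proportional to 1 / sqrt(step).

    Assumption: this is the Vaswani et al. (2017) schedule rescaled so that
    the maximum, reached at step == warmup_steps, equals peak_lr.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)

rates = [learning_rate(s) for s in (5_000, 10_000, 40_000)]
```

Under this reading the rate is halved at step 5,000 (mid-warmup) and again at step 40,000 (four times the warmup length).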
During the fine-tuning stage, our models can be trained on a single GPU. The hyper-parameters are almost identical to those in the pre-training stages except that the learning rate is 5e-5, the batch size is 32, the warmup steps are 4,000 and we train our models for 5 epochs. During inference, we rank sentences using p(y i |D) (Equation (8)) and choose the top K sentences as the summary, where K is tuned on the validation set.
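Inference therefore reduces to sorting sentences by their predicted probability of being labeled True and keeping the best K. A minimal sketch follows; returning the chosen sentences in their original document order is our assumption, and the function name is hypothetical.

```python
def select_summary(sentences, true_probs, k):
    """Pick the k sentences with the highest probability of the True label.

    Assumption: the selected sentences are emitted in document order.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: true_probs[i], reverse=True)
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

summary = select_summary(["s1", "s2", "s3", "s4"], [0.1, 0.9, 0.3, 0.8], k=2)
```

Since K is tuned on the validation set, the same function can simply be evaluated for several values of `k`.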

Evaluations
We evaluated the quality of summaries from different systems automatically using ROUGE (Lin, 2004). We report the full length F1 based ROUGE-1, ROUGE-2 and ROUGE-L on the CNNDM and NYT50 datasets. We compute ROUGE scores using the ROUGE-1.5.5.pl script. Additionally, we also evaluated the generated summaries by eliciting human judgments. Following Cheng and Lapata (2016) and Narayan et al. (2018), we randomly sampled 20 documents from the CNNDM test set. Participants were presented with a document and a list of summaries produced by different systems. We asked subjects to rank these summaries (ties allowed) by taking informativeness (does the summary capture the important information from the document?) and fluency (is the summary grammatical?) into account. Each document is annotated by three different subjects.
Table 1: Results of various models on the CNNDM test set using full-length F1 ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L).

Results
Our main results on the CNNDM dataset are shown in Table 1, with abstractive models in the top block and extractive models in the bottom block. Pointer+Coverage (See et al., 2017), Abstract-ML+RL (Paulus et al., 2017) and DCA (Celikyilmaz et al., 2018) are all sequence to sequence learning based models with copy and coverage modeling, reinforcement learning and deep communicating agents extensions. SentRewrite (Chen and Bansal, 2018) and InconsisLoss (Hsu et al., 2018) both try to decompose the word by word summary generation into sentence selection from the document and "sentence" level summarization (or compression). Bottom-Up (Gehrmann et al., 2018) generates summaries by combining a word prediction model with the decoder attention model. The extractive models are usually based on hierarchical encoders (SummaRuNNer; Nallapati et al. 2017 and NeuSum; Cheng and Lapata 2016). They have been extended with reinforcement learning (Refresh; Narayan et al. 2018 and BanditSum; Dong et al. 2018), Maximal Marginal Relevance (NeuSum-MMR), latent variable modeling (LatentSum; Zhang et al. 2018) and syntactic compression (JECS; Xu and Durrett 2019). Lead3 is a baseline which simply selects the first three sentences. Our model HIBERT S (in-domain), which only uses one pre-training stage on the in-domain CNNDM training set, outperforms all of them and the differences between them are all significant with a 0.95 confidence interval (estimated with the ROUGE script). Note that pre-training HIBERT S (in-domain) is very fast: it only takes around 30 minutes for one epoch on the CNNDM training set. Our models with two pre-training stages (HIBERT S ) or larger size (HIBERT M ) perform even better and HIBERT M outperforms BERT by 0.5 ROUGE. We also implemented two baselines. One is the hierarchical transformer summarization model (HeriTransformer; described in Section 3.3) without pre-training. Note the setting for HeriTransformer is (L = 4, H = 300 and A = 4).
We can see that the pre-training (details in Section 3.2) leads to a +1.25 ROUGE improvement. Another baseline is based on a pre-trained BERT (Devlin et al., 2018) and finetuned on the CNNDM dataset. We used the BERT base model because our 16G RAM V100 GPU cannot fit BERT large for the summarization task even with a batch size of 1. The positional embedding of BERT supports input lengths up to 512 words; we therefore split documents with more than 10 sentences into multiple blocks. In Table 2, EXTRACTION is an extractive model based on a hierarchical LSTM and we use the numbers reported by Xu and Durrett (2019). The improvement of HIBERT M over the baseline without pre-training (HeriTransformer) becomes 2.0 ROUGE. HIBERT S (in-domain), HIBERT M (in-domain), HIBERT S and HIBERT M all outperform BERT significantly according to the ROUGE script.
We also conducted a human evaluation with 20 randomly sampled documents from the CNNDM test set. We compared our model HIBERT M against Lead3, DCA, Latent, BERT and the human reference (Human); we obtained the outputs of DCA and Latent via email. We asked the subjects to rank the outputs of these systems from best to worst. As shown in Table 4, the output of HIBERT M is selected as the best in 30% of cases and it obtains a lower mean rank than all systems except for Human. We also converted the rank numbers into ratings (rank i to 7 − i) and applied Student's t-test on the ratings. HIBERT M is significantly different from all systems in comparison (p < 0.05), which indicates our model still lags behind Human, but is better than all other systems.
Footnote 8: We use 10 sentences per block, because the maximum sentence length 50 × 10 < 512 (the maximum length supported by BERT). The last block of a document may have fewer than 10 sentences.
Pre-training Strategies As mentioned earlier, our pre-training includes two stages. The first stage is the open-domain pre-training stage on the GIGA-CM dataset and the following stage is the in-domain pre-training on the CNNDM (or NYT50) dataset. As shown in Table 3, we pretrained HIBERT S using only open-domain stage (Open-Domain), only in-domain stage (In-Domain) or both stages (Open+In-Domain) and applied it to the CNNDM summarization task. Results on the validation set of CNNDM indicate the two-stage pre-training process is necessary.

Conclusions
The core part of a neural extractive summarization model is the hierarchical document encoder. We proposed a method to pre-train document level hierarchical bidirectional transformer encoders on unlabeled data. When we only pre-train hierarchical transformers on the training sets of summarization datasets with our proposed objective, application of the pre-trained hierarchical transformers to extractive summarization models already leads to wide improvement of summarization performance.
Adding the large open-domain dataset to pre-training leads to even better performance.
In the future, we plan to apply our models to other tasks that also require hierarchical document encodings (e.g., document question answering). We are also interested in improving the architectures of hierarchical document encoders and designing other objectives to train hierarchical transformers.