Summary Level Training of Sentence Rewriting for Abstractive Summarization

As an attempt to combine extractive and abstractive summarization, Sentence Rewriting models adopt the strategy of first extracting salient sentences from a document and then paraphrasing the selected ones to generate a summary. However, the existing models in this framework mostly rely on sentence-level rewards or suboptimal labels, causing a mismatch between the training objective and the evaluation metric. In this paper, we present a novel training signal that directly maximizes summary-level ROUGE scores through reinforcement learning. In addition, we incorporate BERT into our model, making good use of its natural language understanding ability. In extensive experiments, we show that a combination of our proposed model and training procedure obtains new state-of-the-art performance on both the CNN/Daily Mail and New York Times datasets. We also demonstrate that it generalizes better on the DUC-2002 test set.


Introduction
The task of automatic text summarization aims to compress a textual document to a shorter highlight while keeping salient information of the original text. In general, there are two ways to do text summarization: Extractive and Abstractive (Mani and Maybury, 2001). Extractive approaches generate summaries by selecting salient sentences or phrases from a source text, while abstractive approaches involve a process of paraphrasing or generating sentences to write a summary.
Recent work (Liu, 2019; Zhang et al., 2019c) demonstrates that it is highly beneficial for extractive summarization models to incorporate pre-trained language models (LMs) such as BERT (Devlin et al., 2019) into their architectures. However, the performance improvement from pre-trained LMs is known to be relatively small in the case of abstractive summarization (Zhang et al., 2019a; Hoang et al., 2019). This discrepancy may be due to the difference in how extractive and abstractive approaches deal with the task: the former classifies whether each sentence should be included in a summary, while the latter generates a whole summary from scratch. In other words, as most pre-trained LMs are designed to help tasks that can be categorized as classification, including extractive summarization, they are not guaranteed to be advantageous to abstractive summarization models, which must be capable of generating language (Wang and Cho, 2019; Zhang et al., 2019b).
On the other hand, recent studies on abstractive summarization (Chen and Bansal, 2018; Hsu et al., 2018; Gehrmann et al., 2018) have attempted to exploit extractive models. Among these, a notable one is Chen and Bansal (2018), in which a sophisticated model called Reinforce-Selected Sentence Rewriting is proposed. The model consists of an extractor and an abstractor, where the extractor first picks out salient sentences from a source article, and then the abstractor rewrites and compresses the extracted sentences into a complete summary. It is further fine-tuned by training the extractor with rewards derived from sentence-level ROUGE scores of the summary generated by the abstractor.
In this paper, we improve the model of Chen and Bansal (2018), addressing two primary issues. Firstly, we argue there is a bottleneck in the existing extractor, on the basis of the observation that its performance as an independent summarization model (i.e., without the abstractor) is no better than solid baselines such as selecting the first 3 sentences. To resolve the problem, we present a novel neural extractor exploiting pre-trained LMs (BERT in this work), which are expected to perform better according to recent studies (Liu, 2019; Zhang et al., 2019c). Since the extractor is a sort of sentence classifier, we expect that it can make good use of the ability of pre-trained LMs, which has proven to be effective in classification.
Secondly, there is a mismatch between the training objective and the evaluation metric: the previous work utilizes sentence-level ROUGE scores as a reinforcement learning objective, while the final performance of a summarization model is evaluated by summary-level ROUGE scores. Moreover, as Narayan et al. (2018) pointed out, sentences with the highest individual ROUGE scores do not necessarily lead to an optimal summary, since they may contain overlapping content, causing verbose and redundant summaries. Therefore, we propose to directly use the summary-level ROUGE scores as an objective instead of the sentence-level scores. A potential problem arising from this approach is the sparsity of training signals, because the summary-level ROUGE scores are calculated only once for each training episode. To alleviate this problem, we use reward shaping (Ng et al., 1999) to give an intermediate signal for each action while preserving the optimal policy.
We empirically demonstrate the superiority of our approach by achieving new state-of-the-art abstractive summarization results on CNN/Daily Mail and New York Times datasets (Hermann et al., 2015;Durrett et al., 2016). It is worth noting that our approach shows large improvements especially on ROUGE-L score which is considered a means of assessing fluency (Narayan et al., 2018). In addition, our model performs much better than previous work when testing on DUC-2002 dataset, showing better generalization and robustness of our model.
Our contributions in this work are three-fold: a novel successful application of pre-trained transformers for abstractive summarization; suggesting a training method to globally optimize sentence selection; achieving the state-of-the-art results on the benchmark datasets, CNN/Daily Mail and New York Times.

Sentence Rewriting
In this paper, we focus on single-document multi-sentence summarization and propose a neural abstractive model based on the Sentence Rewriting framework (Chen and Bansal, 2018; Xu and Durrett, 2019), which consists of two parts: a neural network for the extractor and another network for the abstractor. The extractor network is designed to extract salient sentences from a source article. The abstractor network rewrites the extracted sentences into a short summary.

Learning Sentence Selection
The most common way to train an extractor to select informative sentences is to build extractive oracles as gold targets and train with cross-entropy (CE) loss. An oracle consists of a set of sentences with the highest possible ROUGE scores. Building oracles amounts to finding an optimal combination of sentences, where there are $2^n$ possible combinations for each example. Because of this, exact optimization for ROUGE scores is intractable. Therefore, alternative methods identify the set of sentences with greedy search (Nallapati et al., 2017), sentence-level search (Hsu et al., 2018; Shi et al., 2019) or collective search using a limited number of sentences (Xu and Durrett, 2019), all of which construct suboptimal oracles. Even if all the optimal oracles are found, training with CE loss using these labels will cause underfitting, as it will only maximize probabilities for sentences in the label sets and ignore all other sentences.
Figure 1: The overview architecture of the extractor network
Alternatively, reinforcement learning (RL) can give room for exploration in the search space. Chen and Bansal (2018), our baseline work, proposed to apply policy gradient methods to train an extractor. This approach makes an end-to-end trainable stochastic computation graph, encouraging the model to select sentences with high ROUGE scores. However, they define a reward for an action (sentence selection) as a sentence-level ROUGE score between the chosen sentence and a sentence in the ground truth summary for that time step. This leads the extractor agent to a suboptimal policy; the set of sentences matching individually with each sentence in a ground truth summary isn't necessarily optimal in terms of summary-level ROUGE score. Narayan et al. (2018) proposed policy gradient with rewards from summary-level ROUGE. They defined an action as sampling a summary from candidate summaries that contain a limited number of plausible sentences. After training, a sentence is ranked high for selection if it often occurs in high-scoring summaries. However, their approach still has a risk of ranking redundant sentences high; if two highly overlapping sentences have salient information, they would be ranked high together, increasing the probability of being sampled in one summary.
To tackle this problem, we propose a training method using reinforcement learning which globally optimizes summary-level ROUGE score and gives intermediate rewards to ease the learning.

Pre-trained Transformers
Transferring representations from pre-trained transformer language models has been highly successful in the domain of natural language understanding tasks (Radford et al., 2018;Devlin et al., 2019;Radford et al., 2019;Yang et al., 2019). These methods first pre-train highly stacked transformer blocks (Vaswani et al., 2017) on a huge unlabeled corpus, and then fine-tune the models or representations on downstream tasks.

Model
Our model consists of two neural network modules, i.e., an extractor and an abstractor. The extractor encodes a source document and chooses sentences from the document, and then the abstractor paraphrases the summary candidates. Formally, a single document consists of $n$ sentences $D = \{s_1, s_2, \cdots, s_n\}$. We denote the $i$-th sentence as $s_i = \{w_{i1}, w_{i2}, \cdots, w_{im}\}$, where $w_{ij}$ is the $j$-th word in $s_i$. The extractor learns to pick out a subset of $D$ denoted as $\hat{D} = \{\hat{s}_1, \hat{s}_2, \cdots, \hat{s}_k \mid \hat{s}_i \in D\}$, where $k$ sentences are selected. The abstractor rewrites each of the selected sentences to form a summary $S = \{f(\hat{s}_1), f(\hat{s}_2), \cdots, f(\hat{s}_k)\}$, where $f$ is an abstracting function. A gold summary consists of $l$ sentences $A = \{a_1, a_2, \cdots, a_l\}$.

Extractor Network
The extractor is based on the encoder-decoder framework. We adapt BERT for the encoder to exploit contextualized representations from pre-trained transformers. BERT as the encoder maps the input sequence $D$ to sentence representation vectors $H = \{h_1, h_2, \cdots, h_n\}$, where $h_i$ is for the $i$-th sentence in the document. Then, the decoder utilizes $H$ to extract $\hat{D}$ from $D$.

Leveraging Pre-trained Transformers
Although we require the encoder to output the representation for each sentence, the output vectors from BERT are grounded to tokens instead of sentences. Therefore, we modify the input sequence and embeddings of BERT as Liu (2019) did.
In the original BERT configuration, a [CLS] token is used to get features from one sentence or a pair of sentences. Since we need a symbol for each sentence representation, we insert a [CLS] token before each sentence. We also add a [SEP] token at the end of each sentence, which is used to differentiate multiple sentences. As a result, the vector for the $i$-th [CLS] symbol from the top BERT layer corresponds to the $i$-th sentence representation $h_i$.
In addition, we add interval segment embeddings as input for BERT to distinguish multiple sentences within a document. For $s_i$ we assign a segment embedding $E_A$ or $E_B$ depending on whether $i$ is odd or even. For example, for a consecutive sequence of sentences $s_1, s_2, s_3, s_4, s_5$, we assign $E_A, E_B, E_A, E_B, E_A$ in order. All the words in each sentence are assigned the same segment embedding, i.e., the segment embeddings for $w_{11}, w_{12}, \cdots, w_{1m}$ are $E_A, E_A, \cdots, E_A$. An illustration of this procedure is shown in Figure 1.
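The input construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `build_bert_input` is hypothetical, sentences are assumed to be already tokenized, and the [CLS] and [SEP] tokens are assumed to share the segment of the sentence they wrap.

```python
def build_bert_input(sentences):
    """Flatten tokenized sentences into one BERT-style input sequence.

    Returns the token list, the interval segment ids (0 for E_A, 1 for E_B),
    and the position of each [CLS] token, whose top-layer vector serves as h_i.
    """
    tokens, segment_ids, cls_positions = [], [], []
    for i, words in enumerate(sentences):
        seg = i % 2  # s_1, s_3, ... get E_A (0); s_2, s_4, ... get E_B (1)
        cls_positions.append(len(tokens))
        piece = ["[CLS]"] + list(words) + ["[SEP]"]
        tokens.extend(piece)
        # Assumption: the special tokens take the segment of their sentence.
        segment_ids.extend([seg] * len(piece))
    return tokens, segment_ids, cls_positions


doc = [["the", "cat", "sat"], ["it", "slept"], ["then", "it", "left"]]
tokens, segment_ids, cls_positions = build_bert_input(doc)
```

Gathering the encoder outputs at `cls_positions` would then yield one vector per sentence, which is the role $H$ plays in the decoder.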

Sentence Selection
We use an LSTM Pointer Network (Vinyals et al., 2015) as the decoder to select the extracted sentences based on the above sentence representations. The decoder extracts sentences recurrently, producing a distribution over all of the remaining sentence representations, excluding those already selected. Since we use a sequential model which selects one sentence per time step, our decoder can consider the previously selected sentences. This property is needed to avoid selecting sentences that have overlapping information with the sentences already extracted.
As the decoder structure is almost the same as in the previous work, we reproduce the equations of Chen and Bansal (2018) to avoid confusion, with minor modifications to agree with our notation. Formally, the extraction probability is calculated as:
$$u_{t,i} = v_m^\top \tanh(W_e e_t + W_h h_i) \tag{1}$$
$$P(\hat{s}_t \mid D, \hat{s}_1, \cdots, \hat{s}_{t-1}) = \mathrm{softmax}(u_t) \tag{2}$$
where $e_t$ is the output of the glimpse operation:
$$c_{t,i} = v_g^\top \tanh(W_{g1} h_i + W_{g2} z_t) \tag{3}$$
$$\alpha_t = \mathrm{softmax}(c_t) \tag{4}$$
$$e_t = \sum_i \alpha_{t,i} W_{g1} h_i \tag{5}$$
In Equation 3, $z_t$ is the hidden state of the LSTM decoder at time $t$ (shown in green in Figure 1). All the $W$ and $v$ are trainable parameters.
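To make one decoding step concrete, the following NumPy sketch computes a glimpse over the sentence vectors and then a masked pointer distribution. It assumes the usual additive-attention form of a pointer network with a glimpse; the parameter names, shapes, and random values are illustrative assumptions, and the LSTM that produces the decoder state is omitted.

```python
import numpy as np


def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def pointer_step(H, z_t, selected, params):
    """One decoding step: glimpse over H, then point to an unselected sentence."""
    W_g1, W_g2, v_g, W_h, W_e, v_m = params
    # Glimpse: attend over sentence vectors using the decoder state z_t.
    proj = H @ W_g1.T                        # (n, d)
    c = np.tanh(proj + W_g2 @ z_t) @ v_g     # (n,)
    alpha = softmax(c)
    e_t = alpha @ proj                       # (d,)
    # Pointer scores; sentences that were already extracted are masked out.
    u = np.tanh(H @ W_h.T + W_e @ e_t) @ v_m
    for j in selected:
        u[j] = -np.inf
    return softmax(u)


rng = np.random.default_rng(0)
n, d = 5, 4                                  # toy sizes, chosen arbitrarily
H = rng.normal(size=(n, d))
z_t = rng.normal(size=d)
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d),
          rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
probs = pointer_step(H, z_t, selected={1, 3}, params=params)
```

Masking with negative infinity before the softmax gives the already selected sentences exactly zero probability, which is one simple way to realize a distribution over the remaining sentences only.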

Abstractor Network
The abstractor network approximates $f$, which compresses and paraphrases an extracted document sentence into a concise summary sentence. We use the standard attention-based sequence-to-sequence (seq2seq) model (Bahdanau et al., 2015; Luong et al., 2015) with the copying mechanism (See et al., 2017) for handling out-of-vocabulary (OOV) words. Our abstractor is practically identical to the one proposed in Chen and Bansal (2018).

Training
In our model, an extractor selects a series of sentences, and then an abstractor paraphrases them. As they work in different ways, we need a different training strategy suitable for each of them. Training the abstractor is relatively straightforward: we maximize the log-likelihood of the next word given the previous ground truth words. However, there are several issues in extractor training. First, the extractor should consider the abstractor's rewriting process when it selects sentences. This causes a weak supervision problem (Jehl et al., 2019), since the extractor gets training signals indirectly, after the paraphrasing processes are finished. In addition, since this procedure contains sampling or maximum selection, the extractor performs a non-differentiable extraction. Lastly, although our goal is maximizing ROUGE scores, neural models cannot be trained on them directly by maximum likelihood estimation.
To address the issues above, we apply standard policy gradient methods, and we propose a novel training procedure for the extractor which guides it to the optimal policy in terms of summary-level ROUGE. As usual in RL for sequence prediction, we pre-train the submodules and apply RL to fine-tune the extractor.

Training Submodules
Extractor Pre-training. Starting from a poor random policy makes it difficult to train the extractor agent to converge towards the optimal policy. Thus, we pre-train the network using cross-entropy (CE) loss like previous work (Bahdanau et al., 2017; Chen and Bansal, 2018). However, there is no gold label for extractive summarization in most summarization datasets. Hence, we employ a greedy approach (Nallapati et al., 2017) to make the extractive oracles, where we add one sentence at a time incrementally to the summary, such that the ROUGE score of the current set of selected sentences is maximized with respect to the entire ground truth summary. This doesn't guarantee optimality, but it is enough to teach the network to select plausible sentences. Formally, the network is trained to minimize the cross-entropy loss as follows:
$$L_{ext} = -\frac{1}{T} \sum_{t=1}^{T} \log P(s^*_t \mid D, s^*_1, \cdots, s^*_{t-1}) \tag{6}$$
where $s^*_t$ is the $t$-th generated oracle sentence and $T$ is the number of oracle sentences.
Abstractor Training. For the abstractor training, we should create training pairs of input and target sentences. As the abstractor paraphrases at the sentence level, we perform a sentence-level search for each ground-truth summary sentence. We find the most similar document sentence $s_t$ by:
$$s_t = \arg\max_{s_i} \text{ROUGE-L}^{sent}_{F_1}(s_i, a_t) \tag{7}$$
Then the abstractor is trained as a usual sequence-to-sequence model to minimize the cross-entropy loss:
$$L_{abs} = -\frac{1}{m} \sum_{j=1}^{m} \log P(w^a_j \mid w^a_1, \cdots, w^a_{j-1}, \Phi) \tag{8}$$
where $w^a_j$ is the $j$-th word of the target sentence $a_t$, and $\Phi$ is the encoded representation for $s_t$.
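The greedy oracle construction can be sketched as follows. This is a simplified illustration under stated assumptions: a unigram-overlap F1 (`unigram_f1`) stands in for ROUGE, the search stops as soon as no sentence improves the score, and the cap `max_sents` is a hypothetical parameter.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Toy stand-in for ROUGE: unigram-overlap F1 between two token lists."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate), overlap / len(reference)
    return 2 * p * r / (p + r)


def greedy_oracle(doc_sents, gold_tokens, max_sents=5):
    """Add one sentence at a time while the summary-level score keeps improving."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        scored = []
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            cand = [w for j in selected + [i] for w in doc_sents[j]]
            scored.append((unigram_f1(cand, gold_tokens), i))
        if not scored:
            break
        # Highest score wins; ties go to the earlier sentence.
        score, idx = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best:  # assumed stopping rule: no further improvement
            break
        selected.append(idx)
        best = score
    return selected, best


doc = [["a", "b"], ["c", "d"], ["a", "b", "c"], ["x", "y"]]
gold = ["a", "b", "c", "d"]
oracle, oracle_score = greedy_oracle(doc, gold)
```

On this toy document the greedy search first commits to the third sentence and ends with a lower score than the pair formed by the first two sentences, which covers the gold summary exactly. This is the kind of suboptimality the text refers to.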
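A sketch of how such sentence-level training pairs might be built is given below. It uses a plain LCS-based F1 as a simplified stand-in for sentence-level ROUGE-L; the function names are hypothetical and the official ROUGE toolkit applies further processing (such as stemming) that is not reproduced here.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(cand, ref):
    """Plain LCS-based F1; a simplified stand-in for sentence-level ROUGE-L."""
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


def make_abstractor_pairs(doc_sents, gold_sents):
    """Pair each gold sentence a_t with its most similar document sentence s_t."""
    pairs = []
    for a_t in gold_sents:
        s_t = max(doc_sents, key=lambda s: rouge_l_f1(s, a_t))
        pairs.append((s_t, a_t))
    return pairs


doc = [["the", "cat", "sat", "on", "the", "mat"],
       ["dogs", "bark", "loudly", "at", "night"]]
gold = [["the", "cat", "sat"], ["dogs", "bark"]]
pairs = make_abstractor_pairs(doc, gold)
```

Each resulting pair (document sentence, gold sentence) is one input and target example for the seq2seq abstractor.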

Guiding to the Optimal Policy
To optimize the ROUGE metric directly, we treat the extractor as an agent in the reinforcement learning paradigm (Sutton et al., 1998). We view the extractor as having a stochastic policy that generates actions (sentence selection) and receives the score of the final evaluation metric (summary-level ROUGE in our case) as the return:
$$R(S) = \text{ROUGE-L}^{summ}_{F_1}(S, A) \tag{9}$$
While we are ultimately interested in the maximization of the score of a complete summary, simply awarding this score at the last step provides a very sparse training signal. For this reason we define intermediate rewards using reward shaping (Ng et al., 1999), which is inspired by Bahdanau et al. (2017)'s attempt for sequence prediction. Namely, we compute summary-level score values for all intermediate summaries:
$$(R(\{\hat{s}_1\}), R(\{\hat{s}_1, \hat{s}_2\}), \cdots, R(\{\hat{s}_1, \hat{s}_2, \cdots, \hat{s}_k\})) \tag{10}$$
The reward for each step $r_t$ is the difference between consecutive pairs of scores:
$$r_t = R(\{\hat{s}_1, \cdots, \hat{s}_t\}) - R(\{\hat{s}_1, \cdots, \hat{s}_{t-1}\}) \tag{11}$$
This measures the amount of increase or decrease in the summary-level score from selecting $\hat{s}_t$. Using the shaped reward $r_t$ instead of awarding the whole score $R$ at the last step does not change the optimal policy (Ng et al., 1999). We define a discounted future reward for each step as $R_t = \sum_{t=1}^{k} \gamma^t r_{t+1}$, where $\gamma$ is a discount factor.
Additionally, we add a 'stop' action to the action space by concatenating trainable parameters $h_{stop}$ (of the same dimension as $h_i$) to $H$. The agent treats it as another candidate to extract. When it selects 'stop', the extracting episode ends and the final return is given. This encourages the model to extract additional sentences only when they are expected to increase the final return.
Following Chen and Bansal (2018), we use the Advantage Actor Critic (Mnih et al., 2016) method to train. We add a critic network to estimate a value function $V_t(D, \hat{s}_1, \cdots, \hat{s}_{t-1})$, which is then used to compute the advantage of each action (we omit the current state $(D, \hat{s}_1, \cdots, \hat{s}_{t-1})$ to simplify):
$$A_t(s_i) = Q_t(s_i) - V_t \tag{12}$$
where $Q_t(s_i)$ is the expected future reward for selecting $s_i$ at the current step $t$. We maximize this advantage with the policy gradient, using the Monte-Carlo sample $A_t(s_i) \approx R_t - V_t$:
$$\nabla_{\theta_\pi} L_\pi \approx \frac{1}{k} \sum_{t=1}^{k} \nabla_{\theta_\pi} \log P(s_i \mid D, \hat{s}_1, \cdots, \hat{s}_{t-1}) A_t(s_i) \tag{13}$$
where $\theta_\pi$ denotes the trainable parameters of the actor network (the original extractor). The critic is trained to minimize the square loss:
$$L_\psi(\theta_\psi) = (V_t - R_t)^2 \tag{14}$$
where $\theta_\psi$ denotes the trainable parameters of the critic network.
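The shaped rewards can be illustrated with a short Python sketch. The score values below are hypothetical, the empty summary is assumed to score zero, and the discounting uses the standard return-to-go recursion, which may differ in indexing from the formula in the text.

```python
def shaped_rewards(scores):
    """Turn summary-level scores of growing summaries into per-step rewards.

    scores[t] is the score after selecting t + 1 sentences; the empty
    summary is assumed to score 0.
    """
    prev, rewards = 0.0, []
    for s in scores:
        rewards.append(s - prev)
        prev = s
    return rewards


def discounted_returns(rewards, gamma=0.95):
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed right to left."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


scores = [0.30, 0.42, 0.40]        # hypothetical scores after 1, 2, 3 sentences
rewards = shaped_rewards(scores)   # the third sentence lowers the score
returns = discounted_returns(rewards, gamma=0.95)
```

Because the per-step rewards telescope, their undiscounted sum equals the final summary-level score, so every action receives a signal while the total reward of an episode is unchanged.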
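As a rough numerical illustration of the actor and critic objectives, the sketch below computes both losses for one episode with NumPy. It assumes that the sampled discounted return stands in for $Q_t$ and that the advantage is treated as a constant when differentiating the actor loss; automatic differentiation and the networks themselves are left out.

```python
import numpy as np


def a2c_losses(log_probs, returns, values):
    """Actor and critic losses for one extraction episode of k steps.

    log_probs[t]: log-probability of the sentence chosen at step t
    returns[t]:   sampled discounted return R_t (stand-in for Q_t)
    values[t]:    critic estimate V_t
    """
    log_probs, returns, values = map(np.asarray, (log_probs, returns, values))
    advantage = returns - values              # held constant for the actor
    actor_loss = -np.mean(log_probs * advantage)
    critic_loss = np.mean((values - returns) ** 2)
    return actor_loss, critic_loss


actor_loss, critic_loss = a2c_losses(
    log_probs=[-1.0, -2.0], returns=[1.0, 0.5], values=[0.5, 1.0]
)
```

Minimizing `actor_loss` raises the probability of actions with positive advantage and lowers it for actions with negative advantage, while `critic_loss` pulls the value estimates toward the observed returns.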

Datasets
We evaluate the proposed approach on the CNN/Daily Mail (Hermann et al., 2015) and New York Times (Sandhaus, 2008) datasets, which are both standard corpora for multi-sentence abstractive summarization. Additionally, we test the generalization of our model on the DUC-2002 test set. The CNN/Daily Mail dataset consists of more than 300K news articles, each of which is paired with several highlights. We used the standard splits of Hermann et al. (2015) for training, validation and testing (90,226/1,220/1,093 documents for CNN and 196,961/12,148/10,397 for Daily Mail). We did not anonymize entities. We followed the preprocessing methods in See et al. (2017) after splitting sentences with Stanford CoreNLP (Manning et al., 2014).
The New York Times dataset also consists of many news articles. We followed the dataset splits of Durrett et al. (2016); 100,834 for training and

Implementation Details
Our extractor is built on BERT$_{\text{BASE}}$ with fine-tuning; we use this smaller version rather than BERT$_{\text{LARGE}}$ due to limitations of time and space. We set the LSTM hidden size to 256 for all of our models. To initialize word embeddings for our abstractor, we use word2vec (Mikolov et al., 2013) vectors of 128 dimensions trained on the same corpus. We optimize our model with the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. For extractor pre-training, we use a learning rate schedule following Vaswani et al. (2017) with $warmup = 10000$: $lr = 2\mathrm{e}{-3} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})$.
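The pre-training schedule is easy to check numerically. The sketch below transcribes the formula from the text; the function name and the guard for step 0 are assumptions added for illustration.

```python
def extractor_lr(step, warmup=10000, base=2e-3):
    """Warmup-then-decay schedule: base * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0, where step^-0.5 is undefined
    return base * min(step ** -0.5, step * warmup ** -1.5)


lr_mid_warmup = extractor_lr(5000)
lr_peak = extractor_lr(10000)
lr_late = extractor_lr(40000)
```

The two arguments of the minimum coincide at `step == warmup`, so the rate grows linearly during warmup, peaks there, and then decays with the inverse square root of the step.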

Models                            R-1    R-2    R-L
lead-3 (See et al., 2017)         40.34  17.70  36.57
rnn-ext (Chen and Bansal, 2018)

We set the learning rate to 1e-3 for the abstractor and 4e-6 for RL training. We apply gradient clipping using the L2 norm with threshold 2.0. For RL training, we use $\gamma = 0.95$ for the discount factor. To ease learning $h_{stop}$, we set the reward for the stop action to $\lambda \cdot \text{ROUGE-L}^{summ}_{F_1}(S, A)$, where $\lambda$ is a stop coefficient set to 0.08. Our critic network shares the encoder with the actor (extractor) and has the same architecture except for the output layer, which estimates a scalar state value. The critic is initialized with the parameters of the pre-trained extractor in the parts where the two architectures are the same.
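Gradient clipping by L2 norm with threshold 2.0 can be sketched as below. Whether the norm is taken jointly over all parameters or per parameter is not stated in the text; this sketch assumes a joint (global) norm, and the function name is hypothetical.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=2.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [g * scale for g in grads], total


grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=2.0)
```

Rescaling all gradients by the same factor keeps the update direction and only limits its length, which is a common way to stabilize RL fine-tuning with small learning rates.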

Evaluation
We evaluate the performance of our method using different variants of the ROUGE metric computed with respect to the gold summaries. On the CNN/Daily Mail and DUC-2002 datasets, we use standard ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) on full-length $F_1$ with stemming, as previous work did (Nallapati et al., 2017; See et al., 2017; Chen and Bansal, 2018). On the NYT50 dataset, following Durrett et al. (2016) and Paulus et al. (2018), we used the limited-length ROUGE recall metric, truncating the generated summary to the length of the ground truth summary.
Table 1 shows the experimental results on the CNN/Daily Mail dataset, with extractive models in the top block and abstractive models in the bottom block. For comparison, we list the performance of many recent approaches alongside ours.
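The limited-length recall used for NYT50 can be illustrated with a toy function. This sketch assumes that length is counted in tokens and uses unigram recall only; it is not the official ROUGE implementation, and the function name is hypothetical.

```python
from collections import Counter


def limited_length_recall(generated, reference):
    """Unigram recall after truncating the generated summary to the reference length."""
    if not reference:
        return 0.0
    truncated = generated[: len(reference)]
    overlap = sum((Counter(truncated) & Counter(reference)).values())
    return overlap / len(reference)


generated = ["a", "b", "x", "c", "d"]
reference = ["a", "b", "c"]
recall = limited_length_recall(generated, reference)
```

Without truncation this generated summary would reach full recall simply by being longer; cutting it to the reference length removes that advantage, which is the motivation for the limited-length variant.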

CNN/Daily Mail
Extractive Summarization. As See et al. (2017) showed, the first 3 sentences (lead-3) of an article form a strong summarization baseline on the CNN/Daily Mail dataset. Therefore, the very first objective of extractive models is to outperform the simple method which always returns the 3 or 4 sentences at the top. However, as Table 2 shows, the ROUGE scores of lead baselines and of extractors from previous work in the Sentence Rewrite framework (Chen and Bansal, 2018; Xu and Durrett, 2019) are almost tied. We conjecture that the limited performance of their full models is due to their extractor networks. Our extractor network with BERT (BERT-ext), as a single model, outperforms those models by large margins. Adding reinforcement learning (BERT-ext + RL) gives higher performance, which is competitive with other extractive approaches using pre-trained Transformers (see Table 1). This shows the effectiveness of our learning method.
Abstractive Summarization. Our abstractive approaches combine the extractor with the abstractor. The combined model (BERT-ext + abs) without additional RL training outperforms the Sentence Rewrite model (Chen and Bansal, 2018) without reranking, showing the effectiveness of our extractor network. With the proposed RL training procedure (BERT-ext + abs + RL), our model exceeds the best model of Chen and Bansal (2018). In addition, the result is better than those of all the other abstractive methods that exploit extractive approaches (Hsu et al., 2018; Chen and Bansal, 2018; Gehrmann et al., 2018).
Redundancy Control. Although the proposed RL training inherently gives training signals that induce the model to avoid redundancy across sentences, there can still be remaining overlaps between extracted sentences. We found that additional methods for reducing redundancy can improve the summarization quality, especially on the CNN/Daily Mail dataset.
We tried Trigram Blocking (Liu, 2019) for the extractor and Reranking (Chen and Bansal, 2018) for the abstractor, and we empirically found that only reranking improves the performance. This helps the model to compress the extracted sentences focusing on disjoint information, even if there are some partial overlaps between the sentences. Our best abstractive model (BERT-ext + abs + RL + rerank) achieves new state-of-the-art performance for abstractive summarization in terms of average ROUGE score, with large margins on ROUGE-L.
However, we empirically found that the reranking method has no effect or a negative effect on the NYT50 and DUC-2002 datasets. Hence, we don't apply it for the remaining datasets.
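For reference, Trigram Blocking as commonly described (skip a candidate sentence if it shares a trigram with the sentences already chosen) can be sketched as follows. The function names and the cap `k` are illustrative assumptions, and the input is assumed to be sentences already sorted by model score.

```python
def trigrams(tokens):
    """Set of all consecutive token triples in a sentence."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sents, k=3):
    """Take sentences in ranked order, skipping any that repeats a chosen trigram."""
    chosen, seen = [], set()
    for sent in ranked_sents:
        tri = trigrams(sent)
        if tri & seen:
            continue
        chosen.append(sent)
        seen |= tri
        if len(chosen) == k:
            break
    return chosen


ranked = [["a", "b", "c", "d"], ["x", "a", "b", "c"], ["e", "f", "g"], ["h", "i"]]
summary = select_with_trigram_blocking(ranked, k=3)
```

In this toy input the second sentence is dropped because it repeats the trigram (a, b, c) from the first one, so the next candidates are taken instead.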
Combinatorial Reward. Before seeing the effects of our summary-level rewards on the final results, we check the upper bounds of different training signals for the full model. All the document sentences are paraphrased with our trained abstractor, and then we find the best set for each search method. Sentence-matching finds the sentence with the highest ROUGE-L score for each sentence in the gold summary. This search method matches the best reward from Chen and Bansal (2018). Greedy Search is the same method explained for extractor pre-training in section 4.1.

Models                                     Relevance  Readability  Total
Sentence Rewrite (Chen and Bansal, 2018)   56         59           115
BERTSUM (Liu, 2019)                        58         60           118
BERT-ext + abs + RL + rerank (ours)        66         61           127

Combination Search selects the set of sentences which has the highest summary-level ROUGE-L score from all the possible combinations of sentences. Due to time constraints, we limited the maximum number of sentences to 5. This method corresponds to our final return in RL training. Table 3 shows the summary-level ROUGE scores of the previously explained methods. We see considerable gaps between Sentence-matching and Greedy Search, while the scores of Greedy Search are close to those of Combination Search. Note that since we limited the number of sentences for Combination Search, the exact scores for it would be higher. The scores can be interpreted as upper bounds for the corresponding training methods. This result supports our training strategy: pre-training with Greedy Search and final optimization with the combinatorial return.
Additionally, we experiment to verify the contribution of our training method. We train the same model with different training signals: the sentence-level reward from Chen and Bansal (2018) and our combinatorial reward. The results are shown in Table 4. Both with and without reranking, the models trained with the combinatorial reward consistently outperform those trained with the sentence-level reward.
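An exhaustive search over sentence combinations, capped at a maximum set size, can be sketched as below. A unigram-overlap F1 stands in for summary-level ROUGE-L, and the function names and toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(candidate, reference):
    """Toy stand-in for a summary-level ROUGE score."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate), overlap / len(reference)
    return 2 * p * r / (p + r)


def combination_search(doc_sents, gold_tokens, max_sents=5):
    """Score every subset of up to max_sents sentences and keep the best one."""
    best_set, best_score = (), 0.0
    for k in range(1, max_sents + 1):
        for idx in combinations(range(len(doc_sents)), k):
            cand = [w for i in idx for w in doc_sents[i]]
            score = unigram_f1(cand, gold_tokens)
            if score > best_score:
                best_set, best_score = idx, score
    return best_set, best_score


doc = [["a", "b"], ["c", "d"], ["a", "b", "c"], ["x", "y"]]
gold = ["a", "b", "c", "d"]
best_set, best_score = combination_search(doc, gold, max_sents=2)
```

The number of candidate sets grows combinatorially with document length, which is why a cap on the set size is needed to keep the search feasible.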

Human Evaluation
We also conduct human evaluation to ensure the robustness of our training procedure. We measure the relevance and readability of the summaries. Relevance is based on the summary containing important, salient information from the input article, being correct by avoiding contradictory or unrelated information, and avoiding repeated or redundant information. Readability is based on the summary's fluency, grammaticality, and coherence. To evaluate both these criteria, we design an Amazon Mechanical Turk experiment based on a ranking method, inspired by Kiritchenko and Mohammad (2017). We randomly select 20 samples from the CNN/Daily Mail test set and ask the human testers (3 for each sample) to rank summaries (for relevance and readability) produced by 3 different models: our final model, that of Chen and Bansal (2018) and that of Liu (2019). 2, 1 and 0 points were given according to the ranking.
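The scoring scheme implies a fixed point budget per criterion, which gives a quick consistency check on the reported totals. The sketch below assumes strict rankings without ties; the tally function and the model labels in the example are hypothetical.

```python
def tally_points(rankings, models):
    """Sum 2/1/0 points for 1st/2nd/3rd place over all rankings."""
    totals = {m: 0 for m in models}
    for ranking in rankings:              # each ranking lists models, best first
        for points, model in zip((2, 1, 0), ranking):
            totals[model] += points
    return totals


n_rankings = 20 * 3                       # 20 samples, 3 testers each
points_per_criterion = n_rankings * (2 + 1 + 0)

example = tally_points([["A", "B", "C"], ["B", "A", "C"]], models=["A", "B", "C"])
```

With 60 rankings and 3 points awarded per ranking, each criterion distributes 180 points in total, which agrees with the reported relevance scores (56 + 58 + 66) and readability scores (59 + 60 + 61).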

The models were anonymized and randomly shuffled. Following previous work, the input article and ground truth summaries are also shown to the human participants in addition to the three model summaries. From the results shown in Table 5, we can see that our model is better in relevance compared to the others. In terms of readability, there was no noticeable difference.
Table 6 gives the results on the NYT50 dataset. We see that our BERT-ext + abs + RL outperforms all the extractive and abstractive models, except for ROUGE-1 from Liu (2019). Compared with two recent models that adapted BERT for summarization (Liu, 2019; Zhang et al., 2019a), our model is another method that successfully leverages BERT for summarization. In addition, the experiment demonstrates the effectiveness of our RL training, with about a 2 point improvement for each ROUGE metric.

Models                                    R-1    R-2    R-L
Extractive
First sentences (Durrett et al., 2016)    28.60  17.30  -
First k words (Durrett et al., 2016)      35.70  21.60  -
Full (Durrett et al., 2016)               42.20  24.90  -
BERTSUM (Liu, 2019)                       46.66  26.35  42.62
Abstractive
Deep Reinforced (Paulus et al., 2018)     42.94  26.02  -
Two-Stage BERT (Zhang et al., 2019a)

DUC-2002
We also evaluated the models trained on the CNN/Daily Mail dataset on the out-of-domain DUC-2002 test set, as shown in Table 7. BERT-ext + abs + RL outperforms the baseline models by large margins on all of the ROUGE scores. This result shows that our model generalizes better.

Related Work
There has been a variety of deep neural network models for abstractive document summarization. One of the most dominant structures is the sequence-to-sequence (seq2seq) models with attention mechanism (Rush et al., 2015;Nallapati et al., 2016). See et al. (2017) introduced Pointer Generator network that implicitly combines the abstraction with the extraction, using copy mechanism (Gu et al., 2016;Zeng et al., 2016). More recently, there have been several studies that have attempted to improve the performance of the abstractive summarization by explicitly combining them with extractive models. Some notable examples include the use of inconsistency loss (Hsu et al., 2018), key phrase extraction (Li et al., 2018;Gehrmann et al., 2018), and sentence extraction with rewriting (Chen and Bansal, 2018). Our model improves Sentence Rewriting with BERT as an extractor and summary-level rewards to optimize the extractor. Reinforcement learning has been shown to be effective to directly optimize a non-differentiable objective in language generation including text summarization (Ranzato et al., 2016;Bahdanau et al., 2017;Paulus et al., 2018;Celikyilmaz et al., 2018;Narayan et al., 2018). Bahdanau et al. (2017) use actor-critic methods for language generation, using reward shaping (Ng et al., 1999) to solve the sparsity of training signals. Inspired by this, we generalize it to sentence extraction to give per step reward preserving optimality.

Conclusions
We have improved Sentence Rewriting approaches for abstractive summarization, proposing a novel extractor architecture exploiting BERT and a novel training procedure which globally optimizes summary-level ROUGE metric. Our approach achieves the new state-of-the-art on both CNN/Daily Mail and New York Times datasets as well as much better generalization on DUC-2002 test set.