BanditSum: Extractive Summarization as a Contextual Bandit

In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. We call our approach BanditSum as it treats extractive summarization as a contextual bandit (CB) problem, where the model receives a document to summarize (the context), and chooses a sequence of sentences to include in the summary (the action). A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. We perform a series of experiments demonstrating that BanditSum is able to achieve ROUGE scores that are better than or comparable to the state-of-the-art for extractive summarization, and converges using significantly fewer update steps than competing approaches. In addition, we show empirically that BanditSum performs significantly better than competing approaches when good summary sentences appear late in the source document.


Introduction
Single-document summarization methods can be divided into two categories: extractive and abstractive. Extractive summarization systems form summaries by selecting and copying text snippets from the document, while abstractive methods aim to generate concise summaries with paraphrasing. This work is primarily concerned with extractive summarization. Though abstractive summarization methods have made strides in recent years, extractive techniques are still very attractive as they are simpler, faster, and more reliably yield semantically and grammatically correct sentences.
* Equal contribution.
Many extractive summarizers work by selecting sentences from the input document (Luhn, 1958; Mihalcea and Tarau, 2004; Wong et al., 2008; Kågebäck et al., 2014; Yin and Pei, 2015; Cao et al., 2015; Yasunaga et al., 2017). Furthermore, a growing trend is to frame this sentence selection process as a sequential binary labeling problem, where binary inclusion/exclusion labels are chosen for sentences one at a time, starting from the beginning of the document, and decisions about later sentences may be conditioned on decisions about earlier sentences. Recurrent neural networks may be trained with stochastic gradient ascent to maximize the likelihood of a set of ground-truth binary label sequences (Cheng and Lapata, 2016; Nallapati et al., 2017). However, this approach has two well-recognized disadvantages. First, it suffers from exposure bias, a form of mismatch between training and testing data distributions which can hurt performance (Ranzato et al., 2015; Bahdanau et al., 2017; Paulus et al., 2018). Second, extractive labels must be generated by a heuristic, as summarization datasets do not generally include ground-truth extractive labels; the ultimate performance of models trained on such labels is thus fundamentally limited by the quality of the heuristic.
An alternative to maximum likelihood training is to use reinforcement learning to train the model to directly maximize a measure of summary quality, such as the ROUGE score between the generated summary and a ground-truth abstractive summary (Wu and Hu, 2018). This approach has become popular because it avoids exposure bias, and directly optimizes a measure of summary quality. However, it also has a number of downsides. For one, the search space is quite large: for a document of length $T$, there are $2^T$ possible extractive summaries. This makes the exploration problem faced by the reinforcement learning algorithm during training very difficult. Another issue is that due to the sequential nature of selection, the model is inherently biased in favor of selecting earlier sentences over later ones, a phenomenon which we demonstrate empirically in Section 7. The first issue can be resolved to a degree using either a cumbersome maximum likelihood-based pre-training step (using heuristically-generated labels) (Wu and Hu, 2018), or placing a hard upper limit on the number of sentences selected. The second issue is more problematic, as it is inherent to the sequential binary labeling setting.
In the current work, we introduce BANDITSUM, a novel method for training neural network-based extractive summarizers with reinforcement learning. This method does away with the sequential binary labeling setting, instead formulating extractive summarization as a contextual bandit. This move greatly reduces the size of the space that must be explored, removes the need to perform supervised pre-training, and prevents systematically privileging earlier sentences over later ones. Although the strong performance of Lead-3 indicates that good sentences often occur early in the source article, we show in Sections 6 and 7 that the contextual bandit setting greatly improves model performance when good sentences occur late without sacrificing performance when good sentences occur early.
Under this reformulation, BANDITSUM takes the document as input and outputs an affinity for each of the sentences therein. An affinity is a real number in [0, 1] which quantifies the model's propensity for including a sentence in the summary. These affinities are then used in a process of repeated sampling-without-replacement which does not privilege earlier sentences over later ones. BANDITSUM is free to process the document as a whole before yielding affinities, which permits affinities for different sentences in the document to depend on one another in arbitrary ways. In our technical section, we show how to apply policy gradient reinforcement learning methods to this setting.
The contributions of our work are as follows:
• We propose a theoretically grounded method, based on the contextual bandit formalism, for training neural network-based extractive summarizers with reinforcement learning. Based on this training method, we propose the BANDITSUM system for extractive summarization.
• We perform experiments demonstrating that BANDITSUM obtains state-of-the-art performance on a number of datasets and requires significantly fewer update steps than competing approaches.
• We perform human evaluations showing that in the eyes of human judges, summaries created by BANDITSUM are less redundant and of higher overall quality than summaries created by competing approaches.
• We provide evidence, in the form of experiments in which models are trained on subsets of the data, that the improved performance of BANDITSUM over competitors stems in part from better handling of summary-worthy sentences that come near the end of the document (see Section 7).

Related Work
Extractive summarization has been widely studied in the past. Recently, neural network-based methods have been gaining popularity over classical methods (Luhn, 1958; Gong and Liu, 2001; Conroy and O'Leary, 2001; Mihalcea and Tarau, 2004; Wong et al., 2008), as they have demonstrated stronger performance on large corpora. Central to the neural network-based models is the encoder-decoder structure. These models typically use either a convolutional neural network (Kalchbrenner et al., 2014; Kim, 2014; Yin and Pei, 2015; Cao et al., 2015), a recurrent neural network (Chung et al., 2014; Cheng and Lapata, 2016; Nallapati et al., 2017), or a combination of the two (Narayan et al., 2018; Wu and Hu, 2018) to create sentence and document representations, using word embeddings (Mikolov et al., 2013; Pennington et al., 2014) to represent words at the input level. These vectors are then fed into a decoder network to generate the output summary.
The use of reinforcement learning (RL) in extractive summarization was first explored by Ryang and Abekawa (2012), who proposed to use the TD(λ) algorithm to learn a value function for sentence selection. Rioux et al. (2014) improved this framework by replacing the learning agent with another TD(λ) algorithm. However, the performance of their methods was limited by the use of shallow function approximators, which required performing a fresh round of reinforcement learning for every new document to be summarized. The more recent works of Paulus et al. (2018) and Wu and Hu (2018) use reinforcement learning in a sequential labeling setting to train abstractive and extractive summarizers, respectively, while Chen and Bansal (2018) combine both approaches, applying abstractive summarization to a set of sentences extracted by a pointer network (Vinyals et al., 2015) trained via REINFORCE. However, pre-training with a maximum likelihood objective is required in all of these models.
The two works most similar to ours are Yao et al. (2018) and Narayan et al. (2018). Yao et al. (2018) recently proposed an extractive summarization approach based on deep Q-learning, a type of reinforcement learning. However, their approach is extremely computationally intensive (a minimum of 10 days before convergence), and was unable to achieve ROUGE scores better than the best maximum likelihood-based approach. Narayan et al. (2018) use a cascade of filters in order to arrive at a set of candidate extractive summaries, which we can regard as an approximation of the true action space. They then use an approximation of a policy gradient method to train their neural network to select summaries from this approximated action space. In contrast, BANDITSUM samples directly from the true action space, and uses exact policy gradient parameter updates.

Extractive Summarization as a Contextual Bandit
Our approach formulates extractive summarization as a contextual bandit which we then train an agent to solve using policy gradient reinforcement learning. A bandit is a decision-making formalization in which an agent repeatedly chooses one of several actions, and receives a reward based on this choice. The agent's goal is to quickly learn which action yields the most favorable distribution over rewards, and choose that action as often as possible. In a contextual bandit, at each trial, a context is sampled and shown to the agent, after which the agent selects an action and receives a reward; importantly, the rewards yielded by the actions may depend on the sampled context. The agent must quickly learn which actions are favorable in which contexts. Contextual bandits are a subset of Markov Decision Processes in which every episode has length one.
Extractive summarization may be regarded as a contextual bandit as follows. Each document is a context, and each ordered subset of a document's sentences is a different action. Formally, assume that each context is a document $d$ consisting of sentences $s = (s_1, \ldots, s_{N_d})$, and that each action is a length-$M$ sequence of unique sentence indices $i = (i_1, \ldots, i_M)$, where $i_t \in \{1, \ldots, N_d\}$, $i_t \neq i_{t'}$ for $t \neq t'$, and $M$ is an integer hyper-parameter. For each $i$, the extractive summary induced by $i$ is given by $(s_{i_1}, \ldots, s_{i_M})$. An action $i$ taken in context $d$ is given a reward $R(i, a)$, where $a$ is the gold-standard abstractive summary that is paired with document $d$, and $R$ is a scalar reward function quantifying the degree of match between $a$ and the summary induced by $i$.
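To make the action space concrete, the following is a minimal sketch that enumerates all ordered sequences of unique sentence indices for a toy document and compares the size of that space with the number of labelings in the sequential binary setting. The function names and the toy document are illustrative assumptions, and indices are 0-based here for convenience.

```python
from itertools import permutations
from math import perm


def enumerate_actions(num_sentences, m):
    """All length-m sequences of unique sentence indices (0-based in this sketch)."""
    return list(permutations(range(num_sentences), m))


def induced_summary(sentences, action):
    """Extractive summary induced by an index sequence."""
    return [sentences[t] for t in action]


# Toy document with four placeholder sentences (hypothetical data).
sentences = ["s1", "s2", "s3", "s4"]
actions = enumerate_actions(len(sentences), m=2)
print(len(actions))                             # 4 * 3 ordered pairs
print(induced_summary(sentences, actions[0]))   # summary for the first action

# Size of the bandit action space versus the 2^N binary labelings
# for a document with N = 30 sentences and M = 3.
n, m = 30, 3
print(perm(n, m), 2 ** n)
```

For a 30-sentence document and M = 3, the ordered-subset space has 30 · 29 · 28 elements, several orders of magnitude fewer than the 2^30 binary labelings.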
A policy for extractive summarization is a neural network $p_\theta(\cdot|d)$, parameterized by a vector $\theta$, which, for each input document $d$, yields a probability distribution over index sequences. Our goal is to find parameters $\theta$ which cause $p_\theta(\cdot|d)$ to assign high probability to index sequences that induce extractive summaries that a human reader would judge to be of high quality. We achieve this by maximizing the following objective function with respect to parameters $\theta$:
$$J(\theta) = \mathbb{E}\left[ R(i, a) \right] \qquad (1)$$
where the expectation is taken over documents $d$ paired with gold-standard abstractive summaries $a$, as well as over index sequences $i$ generated according to $p_\theta(\cdot|d)$.

Policy Gradient Reinforcement Learning
Ideally, we would like to maximize (1) using gradient ascent. However, the required gradient cannot be obtained using usual techniques (e.g. simple backpropagation) because i must be discretely sampled in order to compute R(i, a).
Fortunately, we can use the likelihood ratio gradient estimator from reinforcement learning and stochastic optimization (Williams, 1992; Sutton et al., 2000), which tells us that the gradient of this function can be computed as:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log p_\theta(i|d) \, R(i, a) \right] \qquad (2)$$
where the expectation is taken over the same variables as (1). Since we typically do not know the exact document distribution and thus cannot evaluate the expected value in (2), we instead estimate it by sampling. We found that we obtained the best performance when, for each update, we first sample one document/summary pair $(d, a)$, then sample $B$ index sequences $i^1, \ldots, i^B$ from $p_\theta(\cdot|d)$, and finally take the empirical average:
$$\nabla_\theta J(\theta) \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta \log p_\theta(i^b|d) \, R(i^b, a) \qquad (3)$$
This overall learning algorithm can be regarded as an instance of the REINFORCE policy gradient algorithm (Williams, 1992).
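As a reading aid, the likelihood ratio identity can be spelled out for a fixed document/summary pair $(d, a)$. This is the standard derivation rather than anything specific to this work; it uses only symbols already defined above and assumes $p_\theta(i|d) > 0$ wherever the logarithm is taken.

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{i \sim p_\theta(\cdot|d)}\left[ R(i, a) \right]
  &= \nabla_\theta \sum_{i} p_\theta(i|d) \, R(i, a) \\
  &= \sum_{i} \nabla_\theta p_\theta(i|d) \, R(i, a) \\
  &= \sum_{i} p_\theta(i|d) \, \nabla_\theta \log p_\theta(i|d) \, R(i, a) \\
  &= \mathbb{E}_{i \sim p_\theta(\cdot|d)}\left[ \nabla_\theta \log p_\theta(i|d) \, R(i, a) \right].
\end{aligned}
```

The third line uses $\nabla_\theta p = p \, \nabla_\theta \log p$. The reward never needs to be differentiated, which is why discrete sampling of $i$ is not an obstacle.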

Structure of p θ (·|d)
There are many possible choices for the structure of $p_\theta(\cdot|d)$; we opt for one that avoids privileging early sentences over later ones. We first decompose $p_\theta(\cdot|d)$ into two parts: $\pi_\theta$, a deterministic function which contains all the network's parameters, and $\mu$, a probability distribution parameterized by the output of $\pi_\theta$. Concretely:
$$p_\theta(\cdot|d) = \mu(\cdot|\pi_\theta(d)) \qquad (4)$$
Given an input document $d$, $\pi_\theta$ outputs a real-valued vector of sentence affinities whose length is equal to the number of sentences in the document (i.e. $\pi_\theta(d) \in \mathbb{R}^{N_d}$) and whose elements fall in the range [0, 1]. The $t$-th entry $\pi(d)_t$ may be roughly interpreted as the network's propensity to include sentence $s_t$ in the summary of $d$.
Given sentence affinities $\pi_\theta(d)$, $\mu$ implements a process of repeated sampling-without-replacement. This proceeds by repeatedly normalizing the set of affinities corresponding to sentences that have not yet been selected, thereby obtaining a probability distribution over unselected sentences, and sampling from that distribution to obtain a new sentence to include. This normalize-and-sample step is repeated $M$ times, yielding $M$ unique sentences to include in the summary.
At each step of sampling-without-replacement, we also include a small probability $\epsilon$ of sampling uniformly from all remaining sentences. This is used to achieve adequate exploration during training, and is similar to the $\epsilon$-greedy technique from reinforcement learning.
Under this sampling scheme, we have the following expression for $p_\theta(i|d)$:
$$p_\theta(i|d) = \prod_{j=1}^{M} \left( \frac{\epsilon}{N_d - j + 1} + \frac{(1 - \epsilon) \, \pi(d)_{i_j}}{z(d) - \sum_{k=1}^{j-1} \pi(d)_{i_k}} \right) \qquad (5)$$
where $z(d) = \sum_t \pi(d)_t$. For index sequences that have length different from $M$, or that contain duplicate indices, we have $p_\theta(i|d) = 0$. Using this expression, it is straightforward to use automatic differentiation software to compute $\nabla_\theta \log p_\theta(i|d)$, which is required for the gradient estimate in (3).
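The sampling procedure described above (normalize the remaining affinities, mix in a uniform exploration term, sample, remove the chosen sentence) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the 0-based indexing, and the example affinities are assumptions, and affinities are assumed to be strictly positive.

```python
import random
from itertools import permutations


def sequence_prob(affinities, seq, m, eps):
    """Probability of drawing `seq` under eps-mixed sampling without replacement."""
    n = len(affinities)
    if len(seq) != m or len(set(seq)) != len(seq):
        return 0.0                      # wrong length or duplicate indices
    remaining_mass = sum(affinities)    # z(d) minus affinities already selected
    prob = 1.0
    for j, idx in enumerate(seq):
        uniform = 1.0 / (n - j)         # uniform over the n - j unselected sentences
        normalized = affinities[idx] / remaining_mass
        prob *= eps * uniform + (1.0 - eps) * normalized
        remaining_mass -= affinities[idx]
    return prob


def sample_sequence(affinities, m, eps, rng):
    """Draw m unique sentence indices, exploring uniformly with probability eps."""
    remaining = list(range(len(affinities)))
    seq = []
    for _ in range(m):
        if rng.random() < eps:
            pick = rng.choice(remaining)
        else:
            weights = [affinities[t] for t in remaining]
            pick = rng.choices(remaining, weights=weights, k=1)[0]
        seq.append(pick)
        remaining.remove(pick)
    return tuple(seq)


# Made-up affinities for a 4-sentence document.
affinities = [0.9, 0.2, 0.6, 0.1]
m, eps = 2, 0.1
total = sum(sequence_prob(affinities, seq, m, eps)
            for seq in permutations(range(len(affinities)), m))
print(total)   # should equal 1 up to floating-point error
print(sample_sequence(affinities, m, eps, random.Random(0)))
```

Summing the probability over every ordered pair is a quick sanity check that the expression defines a valid distribution. In a training setup one would compute the logarithm of this product inside an automatic differentiation framework so that its gradient with respect to the network parameters is available.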

Baseline for Variance Reduction
Our sample-based gradient estimate can have high variance, which can slow the learning. One potential cause of this high variance can be seen by inspecting (3), and noting that it basically acts to change the probability of a sampled index sequence to an extent determined by the reward R(i, a). However, since ROUGE scores are always positive, the probability of every sampled index sequence is increased, whereas intuitively, we would prefer to decrease the probability of sequences that receive a comparatively low reward, even if it is positive. This can be remedied by the introduction of a so-called baseline which is subtracted from all rewards.
Using a baseline $r$, our sample-based estimate of $\nabla_\theta J(\theta)$ becomes:
$$\nabla_\theta J(\theta) \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta \log p_\theta(i^b|d) \, \left( R(i^b, a) - r \right) \qquad (6)$$
It can be shown that the introduction of $r$ does not bias the gradient estimator and can significantly reduce its variance if chosen appropriately (Sutton et al., 2000). There are several possibilities for the baseline, including the long-term average reward and the average reward across different samples for one document-summary pair. We choose an approach known as self-critical reinforcement learning, in which the test-time performance of the current model is used as the baseline (Ranzato et al., 2015; Rennie et al., 2017; Paulus et al., 2018). More concretely, after sampling the document-summary pair $(d, a)$, we greedily generate an index sequence using the current parameters $\theta$:
$$i_{\text{greedy}} = \arg\max_{i} \, p_\theta(i|d) \qquad (7)$$
and calculate the baseline for the current update as $r = R(i_{\text{greedy}}, a)$. This baseline has the intuitively satisfying property of only increasing the probability of a sampled index sequence when the summary it induces is better than what would be obtained by greedy decoding.
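The claim that a baseline leaves the expected gradient unchanged can be checked numerically on a toy problem. The sketch below deliberately swaps in a three-action softmax policy rather than the BANDITSUM sampling distribution, so that the expectation can be computed exactly by enumeration; the logits, the rewards, and the function names are made-up assumptions.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient(logits, rewards, baseline=0.0):
    """Exact E[grad log p(a) * (R(a) - baseline)] for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        for k in range(len(logits)):
            # d/d logit_k of log softmax(logits)[a] is 1[a == k] - probs[k]
            score = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * score * (rewards[a] - baseline)
    return grad


logits = [0.3, -1.2, 0.8]
rewards = [0.41, 0.35, 0.38]            # made-up ROUGE-like rewards
greedy = max(range(len(logits)), key=lambda a: logits[a])
baseline = rewards[greedy]              # self-critical: reward of the greedy action

plain = expected_gradient(logits, rewards)
with_baseline = expected_gradient(logits, rewards, baseline)
print(max(abs(x - y) for x, y in zip(plain, with_baseline)))  # ~0 up to float error

# Per-sample weights now carry a sign: only actions that beat greedy are reinforced.
print([r - baseline for r in rewards])
```

The expected gradient is identical with and without the baseline, while the per-sample weights become negative for actions that score below the greedy choice.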

Reward Function
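A final consideration is a concrete choice for the reward function $R(i, a)$. Throughout this work we use:
$$R(i, a) = \frac{1}{3} \left( \text{ROUGE-1}_f(i, a) + \text{ROUGE-2}_f(i, a) + \text{ROUGE-L}_f(i, a) \right). \qquad (8)$$
The above reward function optimizes the average of all the ROUGE variants (Lin, 2004) while balancing precision and recall.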
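The text describes the reward as an average over ROUGE variants that balances precision and recall. The following is a simplified sketch of such a reward, assuming an unweighted mean of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. It is not the official ROUGE implementation: it works on pre-tokenized word lists, uses a single reference, applies no stemming, and computes ROUGE-L over the whole summary rather than sentence by sentence. All function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f(cand, ref, n):
    c, r = ngrams(cand, n), ngrams(ref, n)
    overlap = sum((c & r).values())     # clipped n-gram matches
    return f1(overlap, sum(c.values()), sum(r.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(cand, ref):
    return f1(lcs_length(cand, ref), len(cand), len(ref))


def reward(cand, ref):
    """Mean of ROUGE-1, ROUGE-2 and ROUGE-L F1 (simplified)."""
    return (rouge_n_f(cand, ref, 1) + rouge_n_f(cand, ref, 2) + rouge_l_f(cand, ref)) / 3.0


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()
print(reward(candidate, reference))
```

Using F1 rather than recall alone penalizes overly long extracts, which is one way to read the phrase "balancing precision and recall".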

Model
In this section, we discuss the concrete instantiations of the neural network $\pi_\theta$ that we use in our experiments. We break $\pi_\theta$ up into two components: a document encoder $f_{\theta_1}$, which outputs a sequence of sentence feature vectors $(h_1, \ldots, h_{N_d})$, and a decoder $g_{\theta_2}$ which yields sentence affinities:
$$(h_1, \ldots, h_{N_d}) = f_{\theta_1}(d) \qquad (9)$$
$$\pi_\theta(d) = g_{\theta_2}(h_1, \ldots, h_{N_d}) \qquad (10)$$
Encoder. Features for each sentence in isolation are first obtained by applying a word-level Bidirectional Recurrent Neural Network (BiRNN) to the embeddings for the words in the sentence, and averaging the hidden states over words. A separate sentence-level BiRNN is then used to obtain a representation $h_i$ for each sentence in the context of the document.
Decoder. A multi-layer perceptron is used to map from the representation $h_t$ of each sentence through a final sigmoid unit to yield sentence affinities $\pi_\theta(d)$.
The use of a bidirectional recurrent network in the encoder is crucial, as it allows the network to process the document as a whole, yielding representations for each sentence that take all other sentences into account. This procedure is necessary to deal with some aspects of summary quality such as redundancy (avoiding the inclusion of multiple sentences with similar meaning), which requires the affinities for different sentences to depend on one another. For example, to avoid redundancy, if the affinity for some sentence is high, then sentences which express similar meaning should have low affinities.

Experiments
In this section, we discuss the setup of our experiments. We first discuss the corpora that we used and our evaluation methodology. We then discuss the baseline methods against which we compared, and conclude with a detailed overview of the settings of the model parameters.

Corpora
Three datasets are used for our experiments: CNN, Daily Mail, and the combined CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016). We use the standard split of Hermann et al. (2015) for training, validation, and testing, and the same non-anonymized setting for the three corpora as See et al. (2017). The Daily Mail corpus has 196,557 training documents, 12,147 validation documents, and 10,397 test documents, while the CNN corpus has 90,266/1,220/1,093 documents, respectively.

Evaluation
The models are evaluated based on ROUGE (Lin, 2004). We obtain our ROUGE scores using the standard pyrouge package for the test set evaluation and a faster Python implementation of the ROUGE metric for training and for evaluation on the validation set. We report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L, which measure unigram, bigram, and longest-common-subsequence overlap with the reference summaries.

Model Settings
We use 100-dimensional GloVe embeddings (Pennington et al., 2014) as our embedding initialization. We do not limit the sentence length, nor the maximum number of sentences per document. We use a one-layer BiLSTM for the word-level RNN, and a two-layer BiLSTM for the sentence-level RNN. The hidden state dimension is 200 for each direction on all LSTMs. For the decoder, we use a feed-forward network with one hidden layer of dimension 100.
During training, we use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 5e-5, beta parameters (0, 0.999), and a weight decay of 1e-6, to maximize the objective function defined in equation (1). We employ gradient clipping of 1 to regularize our model. At each iteration, we sample B = 20 times to estimate the gradient defined in equation (3). For our system, the reported performance is obtained within two epochs of training.
At test time, we pick sentences sorted by the predicted probabilities until the length limit is reached. The full-length ROUGE F1 score is used as the evaluation metric. For $M$, the number of sentences selected per summary, we use a value of 3, based on our validation results as well as on the settings described in Nallapati et al. (2017).
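The test-time selection rule can be sketched as a top-M pick over the predicted affinities. This is a minimal sketch under stated assumptions: the function name and example values are hypothetical, the length limit is taken to be M sentences, and the selected sentences are returned in descending affinity order because the text does not say whether document order is restored.

```python
def select_summary(sentences, affinities, m=3):
    """Pick the m sentences with the highest predicted affinity."""
    ranked = sorted(range(len(sentences)), key=lambda t: affinities[t], reverse=True)
    return [sentences[t] for t in ranked[:m]]


# Hypothetical document and affinities.
doc = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.4, 0.7]
print(select_summary(doc, scores, m=3))
```

Unlike training, no sampling or exploration is involved here; the ranking depends only on the affinities.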

Experiment Results
In this section, we present quantitative results from the ROUGE evaluation and qualitative results based on human evaluation. In addition, we demonstrate the stability of our RL model by comparing the validation curve of BANDITSUM with SummaRuNNer (Nallapati et al., 2017) trained with a maximum likelihood objective.

Rouge Evaluation
We present the results of comparing BANDITSUM to several baseline algorithms 4 on the CNN/Daily Mail corpus in Tables 1 and 2. Compared to other extractive summarization systems, BANDITSUM achieves performance that is significantly better than two RL-based approaches, Refresh (Narayan et al., 2018) and DQN (Yao et al., 2018), as well as SummaRuNNer, the state-of-the-art maximum likelihood-based extractive summarizer (Nallapati et al., 2017). BANDITSUM performs a little better than RNES (Wu and Hu, 2018) in terms of ROUGE-1 and slightly worse in terms of ROUGE-2. However, RNES requires pre-training with the maximum likelihood objective on heuristically-generated extractive labels; in contrast, BANDITSUM is very light-weight and converges significantly faster. We discuss the advantage of framing extractive summarization as a contextual bandit (BANDITSUM) over the sequential binary labeling setting (RNES) in the discussion in Section 7.
We also noticed that different choices for the policy gradient baseline (see Section 3.3) in BANDITSUM affect learning speed, but do not significantly affect asymptotic performance. Models trained with an average reward baseline learned most quickly, while models trained with three different baselines (greedy, average reward in a batch, average global reward) all perform roughly the same after training for one epoch. Models trained without a baseline were found to underperform other baseline choices by about 2 points of ROUGE score on average.
4 [...] the test set provided by Narayan et al. (2018). Since their Lead score is a combination of Lead-3 for CNN and Lead-4 for Daily Mail, we recompute the Lead-3 scores for both CNN and Daily Mail with the preprocessing steps used in See et al. (2017). Additionally, our results are not directly comparable to results based on the anonymized dataset used by Nallapati et al. (2017).

Human Evaluation
We also conduct a qualitative evaluation to understand the effects of the improvements introduced in BANDITSUM on human judgments of the generated summaries. To assess the effect of training with RL rather than maximum likelihood, in the first set of human evaluations we compare BANDITSUM with the state-of-the-art maximum likelihood-based model SummaRuNNer. To evaluate the importance of using an exact, rather than approximate, policy gradient to optimize ROUGE scores, we perform another human evaluation comparing BANDITSUM and Refresh, an RL-based method that uses an approximation of the policy gradient.
We follow a human evaluation protocol similar to the one used in Wu and Hu (2018). Given a set of N documents, we ask K volunteers to evaluate the summaries extracted by both systems. For each document, a reference summary and a pair of randomly ordered extractive summaries (one generated by each of the two models) are presented to the volunteers. They are asked to compare and rank the extracted summaries along three dimensions: overall, coverage, and non-redundancy.
To compare with SummaRuNNer, we randomly sample 57 documents from the test set of Daily Mail and ask 5 volunteers to evaluate the extracted summaries. When comparing with Refresh, we present the 20 documents (10 CNN and 10 Daily Mail) provided by Narayan et al. (2018) to 4 volunteers. Tables 3 and 4 show the results of human evaluation in these two settings. BANDITSUM is shown to be better than Refresh and SummaRuNNer in terms of overall quality and non-redundancy. These results indicate that the use of the true policy gradient, rather than the approximation used by Refresh, improves overall quality. It is interesting to observe that, even though BANDITSUM does not have an explicit redundancy avoidance mechanism, it actually outperforms the other systems on non-redundancy.

Learning Curve
Reinforcement learning methods are known for sometimes being unstable during training. However, this seems to be less of a problem for BANDITSUM, perhaps because it is formulated as a contextual bandit rather than a sequential labeling problem. We show this by comparing the validation curves generated by BANDITSUM and the state-of-the-art maximum likelihood-based model, SummaRuNNer (Nallapati et al., 2017) (Figure 1). From Figure 1, we observe that BANDITSUM converges significantly more quickly to good results than SummaRuNNer. Moreover, there is less variance in the performance of BANDITSUM.
One possible reason is that extractive summarization does not have well-defined supervised labels. There exists a mismatch between the provided labels and human-generated abstractive summaries. Hence, the gradient, computed from the maximum likelihood loss function, is not optimizing the evaluation metric of interest. Another important message is that both models are still far from the estimated upper bound 5 , which shows that there is still significant room for improvement.

Run Time
On the CNN/Daily Mail dataset, our model's time-per-epoch is about 25.5 hours on a TITAN Xp. We trained the model for 3 epochs, which took about 76 hours in total. For comparison, DQN took about 10 days to train on a GTX 1080 (Yao et al., 2018). Refresh took about 12 hours on a single GPU to train (Narayan et al., 2018). Note that this figure does not take into account the significant time required by Refresh for pre-computing ROUGE scores.

Discussion: Contextual Bandit Setting Vs. Sequential Full RL Labeling
We conjecture that the contextual bandit (CB) setting is a more suitable framework for modeling extractive summarization than the sequential binary labeling setting, especially in the cases when good summary sentences appear later in the document. The intuition behind this is that models based on the sequential labeling setting are affected by the order of the decisions, which biases them towards selecting sentences that appear earlier in the document. By contrast, our CB-based RL model has more flexibility and freedom to explore the search space, as it samples the sentences without replacement based on the affinity scores. Note that although we do not explicitly make the selection decisions in a sequential fashion, the sequential information about dependencies between sentences is implicitly embedded in the affinity scores, which are produced by bidirectional RNNs.
We provide empirical evidence for this conjecture by comparing BANDITSUM to the sequential RL model proposed by Wu and Hu (2018) (Figure 2) on two subsets of the data: one with good summary sentences appearing early in the article, while the other contains articles where good summary sentences appear late. Specifically, we construct two evaluation datasets by selecting the first 50 documents ($D_{early}$, i.e., best summary occurs early) and the last 50 documents ($D_{late}$, i.e., best summary occurs late) from a sample of 1000 documents that is ordered by the average extractive label index idx. Given an article with $n$ sentences indexed from $1, \ldots, n$ and a greedy extractive label set with three sentences $(i, j, k)$, the average index for the extractive label is computed by $\text{idx} = (i + j + k)/(3n)$.
Given these two subsets of the data, three different models (BANDITSUM, RNES and RNES3) are trained and evaluated on each of the two datasets without extractive labels. Since the original sequential RL model (RNES) is unstable without supervised pre-training, we propose the RNES3 model that is limited to selecting no more than three sentences. Starting with random initializations without supervised pre-training, we train each model ten times for 100 epochs and plot the learning curve of the average ROUGE-F1 score computed based on the trained model in Figure 2. We can clearly see that BANDITSUM finds a better solution more quickly than RNES and RNES3 on both datasets. Moreover, it displays a significant speed-up in exploration and finds the best solution when good summary sentences appear later in the document ($D_{late}$).
Figure 2: [...] For each model, the results were obtained by averaging f (the average ROUGE-F1 score) across ten trials with 100 epochs in each trial. $D_{early}$ and $D_{late}$ consist of 50 articles each, such that the good summary sentences appear early and late in the article, respectively. We observe a significant advantage of BANDITSUM compared to RNES and RNES3 (based on the sequential binary labeling setting) on $D_{late}$.
5 The supervised labels for the upper bound estimation are obtained using the heuristic described in Nallapati et al. (2017).
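The construction of the two evaluation subsets can be sketched as follows: compute the normalized average label index for each document, sort, and take the two ends. This is an illustrative sketch; the function names, the tuple layout of `docs`, and the toy data are assumptions, and the greedy extractive labels themselves are taken as given.

```python
def average_label_index(label_indices, num_sentences):
    """idx = (i + j + k) / (3n) for 1-based labels; works for any number of labels."""
    return sum(label_indices) / (len(label_indices) * num_sentences)


def split_early_late(docs, k=50):
    """docs: list of (doc_id, label_indices, num_sentences) tuples.

    Returns the k documents with the smallest idx (early) and the k with the
    largest idx (late).
    """
    ordered = sorted(docs, key=lambda d: average_label_index(d[1], d[2]))
    return ordered[:k], ordered[-k:]


# Toy sample of three 10-sentence documents (hypothetical labels).
docs = [
    ("a", (1, 2, 3), 10),
    ("b", (8, 9, 10), 10),
    ("c", (4, 5, 6), 10),
]
early, late = split_early_late(docs, k=1)
print(early[0][0], late[0][0])
```

Dividing by n makes documents of different lengths comparable: a value near 0 means the labeled sentences sit at the start of the article, and a value near 1 means they sit at the end.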

Conclusion
In this work, we presented a contextual bandit learning framework, BANDITSUM, for extractive summarization, based on neural networks and reinforcement learning algorithms. BANDITSUM does not require sentence-level extractive labels and optimizes ROUGE scores between summaries generated by the model and abstractive reference summaries. Empirical results show that our method performs better than or comparable to state-of-the-art extractive summarization models which must be pre-trained on extractive labels, and converges using significantly fewer update steps than competing approaches. In future work, we will explore the direction of adding an extra coherence reward (Wu and Hu, 2018) to improve the quality of extracted summaries in terms of sentence discourse relation.