Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary. We use a novel sentence-level policy gradient method to bridge the non-differentiable computation between these two neural networks in a hierarchical way, while maintaining language fluency. Empirically, we achieve the new state-of-the-art on all metrics (including human evaluation) on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores. Moreover, by first operating at the sentence-level and then the word-level, we enable parallel decoding of our neural generative model that results in substantially faster (10-20x) inference speed as well as 4x faster training convergence than previous long-paragraph encoder-decoder models. We also demonstrate the generalization of our model on the test-only DUC-2002 dataset, where we achieve higher scores than a state-of-the-art model.


Introduction
The task of document summarization has two main paradigms: extractive and abstractive. The former method directly chooses and outputs the salient sentences (or phrases) from the original document (Jing and McKeown, 2000; Knight and Marcu, 2000; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011). The latter abstractive approach involves rewriting the summary (Banko et al., 2000; Zajic et al., 2004), and has seen substantial recent gains due to neural sequence-to-sequence models (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018). Abstractive models can be more concise by performing generation from scratch, but they suffer from slow and inaccurate encoding of very long documents, because the attention model is required to look at all encoded words (in long paragraphs) when decoding each generated summary word, one by one sequentially. Abstractive models also suffer from redundancy (repetitions), especially when generating multi-sentence summaries.
To address both these issues and combine the advantages of both paradigms, we propose a hybrid extractive-abstractive architecture, with policy-based reinforcement learning (RL) to bridge together the two networks. Similar to how humans summarize long documents, our model first uses an extractor agent to select salient sentences or highlights, and then employs an abstractor network to rewrite (i.e., compress and paraphrase) each of these extracted sentences. To overcome the non-differentiable behavior of our extractor and train on available document-summary pairs without saliency labels, we next use actor-critic policy gradient with sentence-level metric rewards to connect these two neural networks and to learn sentence saliency. We also avoid common language fluency issues (Paulus et al., 2018) by preventing the policy gradients from affecting the abstractive summarizer's word-level training, which is supported by our human evaluation study. Our sentence-level reinforcement learning takes into account the word-sentence hierarchy, which better models the language structure and makes parallelization possible. Our extractor combines reinforcement learning and pointer networks, which is inspired by Bello et al. (2017)'s attempt to solve the Traveling Salesman Problem. Our abstractor is a simple encoder-aligner-decoder model (with copying) and is trained on pseudo document-summary sentence pairs obtained via simple automatic matching criteria.
Thus, our method incorporates the abstractive paradigm's advantages of concisely rewriting sentences and generating novel words from the full vocabulary, yet it adopts intermediate extractive behavior to improve the overall model's quality, speed, and stability. Instead of encoding and attending to every word in the long input document sequentially, our model adopts a human-inspired coarse-to-fine approach that first extracts all the salient sentences and then decodes (rewrites) them (in parallel). This also avoids almost all redundancy issues because the model has already chosen non-redundant salient sentences to abstractively summarize (but adding an optional final reranker component does give additional gains by removing the few remaining across-sentence repetitions).
Empirically, our approach is the new state-of-the-art on all ROUGE metrics (Lin, 2004) as well as on METEOR (Denkowski and Lavie, 2014) for the CNN/Daily Mail dataset, achieving statistically significant improvements over previous models that use complex long-encoder, copy, and coverage mechanisms (See et al., 2017). The test-only DUC-2002 improvement also shows our model's better generalization than this strong abstractive system. In addition, we surpass the popular lead-3 baseline on all ROUGE scores with an abstractive model. Moreover, our sentence-level abstractive rewriting module also produces substantially more (3x) novel N-grams that are not seen in the input document, as compared to the strong flat-structured model of See et al. (2017). This empirically justifies that our RL-guided extractor has learned sentence saliency, rather than benefiting from simply copying longer sentences. We also show that our model maintains the same level of fluency as a conventional RNN-based model because the reward does not leak to our abstractor's word-level training. Finally, our model's training is 4x and inference is more than 20x faster than the previous state-of-the-art. The optional final reranker gives further improvements while maintaining a 7x speedup.
Overall, our contribution is threefold: First, we propose a novel sentence-level RL technique for the well-known task of abstractive summarization, effectively utilizing the word-then-sentence hierarchical structure without annotated matching sentence-pairs between the document and ground-truth summary. Next, our model achieves the new state-of-the-art on all metrics of multiple versions of a popular summarization dataset (as well as a test-only dataset) both extractively and abstractively, without loss in language fluency (also demonstrated via human evaluation and abstractiveness scores). Finally, our parallel decoding results in a significant 10-20x speed-up over the previous best neural abstractive summarization system, with even better accuracy.

Model
In this work, we consider the task of summarizing a given long text document into several (ordered) highlights, which are then combined to form a multi-sentence summary. Formally, given a training set of document-summary pairs $\{x_i, y_i\}_{i=1}^N$, our goal is to approximate the summarization function $h$ such that $h(x_i) = y_i$. Furthermore, we assume there exists an abstracting function $g$ defined as: $\forall s \in S_i, \exists d \in D_i$ such that $g(d) = s$, where $S_i$ is the set of summary sentences in $y_i$ and $D_i$ the set of document sentences in $x_i$; i.e., in any given pair of document and summary, every summary sentence can be produced from some document sentence. For simplicity, we omit the subscript $i$ in the remainder of the paper. Under this assumption, we can further define another latent function $f: X \to D^n$ that satisfies $f(x) = \{d_t\}_{t=1}^n$ and $y = h(x) = [g(d_1), g(d_2), \dots, g(d_n)]$, where $[\cdot,\cdot]$ denotes sentence concatenation. This latent function $f$ can be seen as an extractor that chooses the salient (ordered) sentences in a given document for the abstracting function $g$ to rewrite. Our overall model consists of these two submodules, the extractor agent and the abstractor network, to approximate the above-mentioned $f$ and $g$, respectively.
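For concreteness, this decomposition can be sketched in a few lines of Python; `extractor` and `abstractor` below are hypothetical stand-ins for $f$ and $g$, not the actual interface of our implementation:

```python
from typing import Callable, List

def summarize(doc_sents: List[str],
              extractor: Callable[[List[str]], List[int]],
              abstractor: Callable[[str], str]) -> str:
    """Sketch of y = h(x) = [g(d_1), ..., g(d_n)]."""
    salient_ids = extractor(doc_sents)            # f: ordered salient sentence indices
    extracted = [doc_sents[j] for j in salient_ids]
    # g is applied to each extracted sentence independently, which is what
    # allows the rewrites to be decoded in parallel (sequential here for clarity).
    rewritten = [abstractor(d) for d in extracted]
    return " ".join(rewritten)                    # [,]: sentence concatenation
```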

Extractor Agent
The extractor agent is designed to model f , which can be thought of as extracting salient sentences from the document. We exploit a hierarchical neural model to learn the sentence representations of the document and a 'selection network' to extract sentences based on their representations.
Figure 1: Our extractor agent: the convolutional encoder computes representation $r_j$ for each sentence. The RNN encoder (blue) computes context-aware representation $h_j$ and then the RNN decoder (green) selects sentence $j_t$ at time step $t$. With $j_t$ selected, $h_{j_t}$ will be fed into the decoder at time $t + 1$.

Hierarchical Sentence Representation
We use a temporal convolutional model (Kim, 2014) to compute $r_j$, the representation of each individual sentence in the document (details in the supplementary). To further incorporate the global context of the document and capture the long-range semantic dependencies between sentences, a bidirectional LSTM-RNN (Hochreiter and Schmidhuber, 1997; Schuster et al., 1997) is applied on the convolutional output. This enables learning a strong representation, denoted as $h_j$ for the $j$-th sentence in the document, that takes into account the context of all previous and future sentences in the same document.
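As an illustration, the context encoder can be sketched in PyTorch as follows (the dimensions are our assumptions, not the reported hyper-parameters):

```python
import torch
from torch import nn

conv_dim, hidden_dim = 300, 256   # illustrative sizes
bilstm = nn.LSTM(conv_dim, hidden_dim, batch_first=True, bidirectional=True)

r = torch.randn(1, 20, conv_dim)  # r_j for the 20 sentences of one document
h, _ = bilstm(r)                  # h_j: (1, 20, 2*hidden_dim), each fusing
                                  # forward and backward document context
```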

Sentence Selection
Next, to select the extracted sentences based on the above sentence representations, we add another LSTM-RNN to train a Pointer Network (Vinyals et al., 2015) that extracts sentences recurrently. We calculate the extraction probability by:

$$u_j^t = \begin{cases} v_p^\top \tanh(W_{p1} h_j + W_{p2} e_t) & \text{if } j \neq j_k \;\forall k < t \\ -\infty & \text{otherwise} \end{cases} \qquad (1)$$

$$P(j_t \mid j_1, \dots, j_{t-1}) = \mathrm{softmax}(u^t) \qquad (2)$$

where the $e_t$'s are the output of the glimpse operation (Vinyals et al., 2016):

$$a_j^t = v_g^\top \tanh(W_{g1} h_j + W_{g2} z_t) \qquad (3)$$

$$\alpha^t = \mathrm{softmax}(a^t) \qquad (4)$$

$$e_t = \sum_j \alpha_j^t \, W_{g1} h_j \qquad (5)$$

Figure 2: Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor. For simplicity, the critic network is not shown. Note that all $d$'s and $s_t$ are raw sentences, not vector representations.
In Eqn. (3), $z_t$ is the output of the added LSTM-RNN (shown in green in Fig. 1), which is referred to as the decoder. All the $W$'s and $v$'s are trainable parameters. At each time step $t$, the decoder performs a 2-hop attention mechanism: it first attends to the $h_j$'s to get a context vector $e_t$, and then attends to the $h_j$'s again for the extraction probabilities. This model is essentially classifying all sentences of the document at each extraction step. An illustration of the whole extractor is shown in Fig. 1.
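A minimal PyTorch sketch of this 2-hop attention (Eqns. 1-5) is given below; the class name and dimensions are ours, and training details are omitted:

```python
import torch
from torch import nn
import torch.nn.functional as F

class PointerDecoder(nn.Module):
    """Sketch of the glimpse (Eqns. 3-5) and pointer scoring (Eqns. 1-2)."""
    def __init__(self, dim):
        super().__init__()
        self.W_g1 = nn.Linear(dim, dim, bias=False)
        self.W_g2 = nn.Linear(dim, dim, bias=False)
        self.v_g = nn.Linear(dim, 1, bias=False)
        self.W_p1 = nn.Linear(dim, dim, bias=False)
        self.W_p2 = nn.Linear(dim, dim, bias=False)
        self.v_p = nn.Linear(dim, 1, bias=False)

    def forward(self, h, z_t, extracted):
        # h: (n_sents, dim) sentence reps; z_t: (dim,) decoder output
        a = self.v_g(torch.tanh(self.W_g1(h) + self.W_g2(z_t))).squeeze(-1)  # Eqn. 3
        alpha = F.softmax(a, dim=0)                                          # Eqn. 4
        e_t = (alpha.unsqueeze(-1) * self.W_g1(h)).sum(dim=0)                # Eqn. 5
        u = self.v_p(torch.tanh(self.W_p1(h) + self.W_p2(e_t))).squeeze(-1)  # Eqn. 1
        u[list(extracted)] = float('-inf')    # mask already-extracted sentences
        return F.softmax(u, dim=0)            # Eqn. 2: P(j_t | j_1, ..., j_{t-1})

dec = PointerDecoder(dim=512)
probs = dec(torch.randn(20, 512), torch.randn(512), extracted={3})
```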

Abstractor Network
The abstractor network approximates $g$, which compresses and paraphrases an extracted document sentence to a concise summary sentence. We use the standard encoder-aligner-decoder (Bahdanau et al., 2015; Luong et al., 2015). We add the copy mechanism 3 to help directly copy some out-of-vocabulary (OOV) words (See et al., 2017). For more details, please refer to the supplementary.

Learning
Given that our extractor performs a non-differentiable hard extraction, we apply standard policy gradient methods to bridge the back-propagation and form an end-to-end trainable (stochastic) computation graph. However, simply starting from a randomly initialized network to train the whole model in an end-to-end fashion is infeasible. When randomly initialized, the extractor would often select sentences that are not relevant, so it would be difficult for the abstractor to learn to abstractively rewrite. On the other hand, without a well-trained abstractor the extractor would get noisy rewards, which leads to a bad estimate of the policy gradient and a sub-optimal policy. We hence propose optimizing each sub-module separately using maximum-likelihood (ML) objectives: train the extractor to select salient sentences (fit $f$) and the abstractor to generate shortened summaries (fit $g$). Finally, RL is applied to train the full model end-to-end (fit $h$).

Maximum-Likelihood Training for Submodules
Extractor Training: In Sec. 2.1.2, we have formulated our sentence selection as classification. However, most of the summarization datasets are end-to-end document-summary pairs without extraction (saliency) labels for each sentence. Hence, we propose a simple similarity method to provide a 'proxy' target label for the extractor. Similar to the extractive model of Nallapati et al. (2017), 4 for each ground-truth summary sentence $s_t$ we find the most similar document sentence $d_{j_t}$ by:

$$j_t = \mathrm{argmax}_i \; \mathrm{ROUGE\text{-}L}_{recall}(d_i, s_t) \qquad (6)$$
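This matching step can be sketched as follows, assuming sentences are given as token lists and using a plain LCS-based ROUGE-L recall as a simplified stand-in for the official ROUGE script:

```python
def lcs_len(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(doc_sent, summ_sent):
    return lcs_len(doc_sent, summ_sent) / max(len(summ_sent), 1)

def proxy_labels(doc_sents, summ_sents):
    """For each ground-truth summary sentence s_t, pick the document sentence
    index j_t maximizing ROUGE-L recall (Eqn. 6)."""
    return [max(range(len(doc_sents)),
                key=lambda i: rouge_l_recall(doc_sents[i], s))
            for s in summ_sents]

print(proxy_labels([["a", "b", "c"], ["d", "e"]], [["b", "c"]]))  # -> [0]
```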
Given these proxy training labels, the extractor is then trained to minimize the cross-entropy loss.

3 We use the terminology of copy mechanism (originally named pointer-generator) in order to avoid confusion with the pointer network (Vinyals et al., 2015).

4 Nallapati et al. (2017) selected sentences greedily to maximize the global summary-level ROUGE, whereas we match exactly one document sentence for each ground-truth summary sentence based on the individual sentence-level score.
Abstractor Training: For the abstractor training, we create training pairs by taking each summary sentence and pairing it with its extracted document sentence (based on Eqn. 6). The network is trained as a usual sequence-to-sequence model to minimize the cross-entropy loss

$$L(\theta_{abs}) = -\frac{1}{M} \sum_{m=1}^{M} \log P_{\theta_{abs}}(w_m \mid w_{1:m-1})$$

of the decoder language model at each generation step, where $\theta_{abs}$ is the set of trainable parameters of the abstractor and $w_m$ the $m$-th generated word.

Reinforce-Guided Extraction
Here we explain how policy gradient techniques are applied to optimize the whole model. To make the extractor an RL agent, we can formulate a Markov Decision Process (MDP) 5: at each extraction step $t$, the agent observes the current state $c_t = (D, d_{j_{t-1}})$, samples an action $j_t \sim \pi_{\theta_a,\omega}(c_t, j) = P(j)$ from Eqn. (2) to extract a document sentence, and receives a reward 6

$$r(t+1) = \mathrm{ROUGE\text{-}L}_{F_1}(g(d_{j_t}), s_t) \qquad (7)$$

after the abstractor summarizes the extracted sentence $d_{j_t}$. We denote the trainable parameters of the extractor agent by $\theta = \{\theta_a, \omega\}$, for the decoder and hierarchical encoder respectively. We can then train the extractor with policy-based RL. We illustrate this process in Fig. 2.

The vanilla policy gradient algorithm, REINFORCE (Williams, 1992), is known for high variance. To mitigate this problem, we add a critic network with trainable parameters $\theta_c$ to predict the state-value function $V^{\pi_{\theta_a,\omega}}(c)$. The predicted value of the critic, $b_{\theta_c,\omega}(c)$, is called the 'baseline', which is then used to estimate the advantage function $A^{\pi_\theta}(c, j) = Q^{\pi_{\theta_a,\omega}}(c, j) - V^{\pi_{\theta_a,\omega}}(c)$, because the total return $R_t$ is an estimate of the action-value function $Q(c_t, j_t)$. Instead of maximizing $Q(c_t, j_t)$ as done in REINFORCE, we maximize $A^{\pi_\theta}(c, j)$ with the following policy gradient:

$$\nabla_{\theta_a,\omega} J(\theta_a, \omega) = \mathbb{E}\left[\nabla_{\theta_a,\omega} \log \pi_\theta(c, j) \, A^{\pi_\theta}(c, j)\right]$$

And the critic is trained to minimize the square loss $L_c(\theta_c, \omega) = (b_{\theta_c,\omega}(c_t) - R_t)^2$. This is known as the Advantage Actor-Critic (A2C), a synchronous variant of A3C (Mnih et al., 2016). For more A2C details, please refer to the supplementary.

5 Strictly speaking, this is a Partially Observable Markov Decision Process (POMDP). We approximate it as an MDP by assuming that the RNN hidden state contains all past info.

6 In Eqn. (6), we use ROUGE-recall because we want the extracted sentence to contain as much information as possible for rewriting. Nevertheless, for Eqn. (7), ROUGE-F1 is more suitable because the abstractor $g$ is supposed to rewrite the extracted sentence $d$ to be as concise as the ground truth $s$.
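Assuming the per-step log-probabilities, critic values, and returns of one episode have already been collected, the resulting A2C losses can be sketched as follows (names and shapes are illustrative):

```python
import torch

def a2c_losses(log_probs, values, returns):
    # advantage A(c, j) ~ R_t - b(c_t); detach() keeps the policy gradient
    # from flowing into the critic through the baseline
    advantage = returns - values.detach()
    policy_loss = -(log_probs * advantage).mean()   # maximize E[log pi * A]
    critic_loss = (values - returns).pow(2).mean()  # (b(c_t) - R_t)^2
    return policy_loss, critic_loss

log_p = torch.log(torch.rand(5, requires_grad=True))  # log pi(j_t | c_t), dummy
v = torch.rand(5, requires_grad=True)                 # critic baselines b(c_t)
R = torch.rand(5)                                     # returns R_t
pl, cl = a2c_losses(log_p, v, R)
(pl + cl).backward()
```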
Intuitively, our RL training works as follows: if the extractor chooses a good sentence, then after the abstractor rewrites it the ROUGE match would be high and thus the action is encouraged. If a bad sentence is chosen, though the abstractor still produces a compressed version of it, the summary would not match the ground truth, and the low ROUGE score discourages this action. Our RL with a sentence-level agent is a novel attempt in neural summarization. We use RL as a saliency guide without altering the abstractor's language model, while previous work applied RL at the word-level, which could be prone to gaming the metric at the cost of language fluency. 7

Learning how many sentences to extract: In a typical RL setting like game playing, an episode is usually terminated by the environment. On the other hand, in text summarization, the agent does not know in advance how many summary sentences to produce for a given article (since the desired length varies for different downstream applications). We make an important yet simple, intuitive adaptation to solve this: adding a 'stop' action to the policy action space. In the RL training phase, we add another set of trainable parameters $v_{EOE}$ (EOE stands for 'End-Of-Extraction') with the same dimension as the sentence representation. The pointer-network decoder treats $v_{EOE}$ as one of the extraction candidates and hence naturally results in a stop action in the stochastic policy. We set the reward for the agent performing EOE to the ROUGE-1 $F_1$ score of the whole generated summary against the whole ground-truth summary; whereas for any extraneous, unwanted extraction step, the agent receives zero reward. 8 The model is therefore encouraged to extract when there are still remaining ground-truth summary sentences (to accumulate intermediate reward), and learns to stop by optimizing a global ROUGE and avoiding extra extraction. Overall, this modification allows dynamic decisions on the number of sentences based on the input document, eliminates the need for tuning a fixed number of steps, and enables a data-driven adaptation for any specific dataset/application.

7 During this RL training of the extractor, we keep the abstractor parameters fixed. Because the input sentences for the abstractor are extracted by an intermediate stochastic policy of the extractor, it is impossible to find the correct target summary for the abstractor to fit $g$ with the ML objective. Though it is possible to optimize the abstractor with RL, in our preliminary experiments we found that this does not improve the overall ROUGE, most likely because this RL optimizes at a sentence-level and can add across-sentence redundancy. We achieve SotA results without this abstractor-level RL.

8 We use ROUGE-1 for the terminal reward because it is a better measure of bag-of-words information (i.e., has all the important information been generated).
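The stop action can be sketched by scoring the trainable $v_{EOE}$ alongside the real sentences; the dot-product scoring below is only a dummy stand-in for the pointer scoring of Eqn. (1):

```python
import torch
import torch.nn.functional as F

dim, n_sents = 256, 20
v_eoe = torch.nn.Parameter(torch.randn(dim))  # trainable 'End-Of-Extraction' vector
h = torch.randn(n_sents, dim)                 # context-aware sentence reps h_j
query = torch.randn(dim)                      # decoder state z_t (dummy)

cands = torch.cat([h, v_eoe.unsqueeze(0)], dim=0)  # v_EOE as an extra candidate
scores = cands @ query                             # stand-in for Eqn. (1) scoring
action = int(torch.multinomial(F.softmax(scores, dim=0), 1))
stop = (action == n_sents)                         # sampling the last index stops
```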

Repetition-Avoiding Reranking
Existing abstractive summarization systems on long documents suffer from generating repeated and redundant words and phrases. To mitigate this issue, See et al. (2017) propose the coverage mechanism, and Paulus et al. (2018) incorporate tri-gram avoidance during beam search at test time. Our model without these already performs well because the summary sentences are generated from mutually exclusive document sentences, which naturally avoids redundancy. However, we do get a small further boost in summary quality by removing a few 'across-sentence' repetitions, via a simple reranking strategy: at the sentence level, we apply the same beam-search tri-gram avoidance (Paulus et al., 2018). We keep all $k$ sentence candidates generated by beam search, where $k$ is the size of the beam. We then rerank all $k^n$ combinations of the $n$ generated summary sentence beams. The summaries are reranked by the number of repeated N-grams, the smaller the better (see the sketch below). We also apply the diverse decoding algorithm described in Li et al. (2016) (which has almost no computation overhead) so as to get the above approach to produce useful diverse reranking lists. We show how much redundancy affects the summarization task in Sec. 6.2.
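A brute-force sketch of this reranking, assuming each sentence's beam candidates are token lists (enumerating all $k^n$ combinations is feasible for small $k$ and $n$):

```python
from itertools import product
from collections import Counter

def repeated_ngrams(sents, n=3):
    """Number of n-gram repetitions across one candidate summary."""
    grams = Counter(tuple(s[i:i + n]) for s in sents for i in range(len(s) - n + 1))
    return sum(c - 1 for c in grams.values() if c > 1)

def rerank(sentence_beams, n=3):
    """sentence_beams[i] holds the k beam candidates of the i-th sentence;
    pick the combination with the fewest repeated n-grams."""
    return min(product(*sentence_beams), key=lambda combo: repeated_ngrams(combo, n))

beams = [[["the", "cat", "sat"], ["a", "cat", "sat"]],
         [["the", "cat", "sat", "again"], ["it", "sat", "again"]]]
print(rerank(beams))  # avoids the combination repeating the tri-gram 'the cat sat'
```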
Related Work

Our model shares some high-level intuition with extract-then-compress methods. Earlier attempts in this paradigm used Hidden Markov Models and rule-based systems (Jing and McKeown, 2000), statistical models based on parse trees (Knight and Marcu, 2000), and integer linear programming based methods (Martins and Smith, 2009; Gillick and Favre, 2009; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011). Recent approaches investigated discourse structures (Louis et al., 2010; Hirao et al., 2013; Kikuchi et al., 2014; Wang et al., 2015), graph cuts (Qian and Liu, 2013), and parse trees (Li et al., 2014; Bing et al., 2015). For neural models, Cheng and Lapata (2016) used a second neural net to select words from an extractor's output. Our abstractor does not merely 'compress' the sentences but generatively produces novel words. Moreover, our RL bridges the extractor and the abstractor for end-to-end training.
Reinforcement learning has been used to optimize the non-differentiable metrics of language generation and to mitigate exposure bias (Ranzato et al., 2016; Bahdanau et al., 2017). Henß et al. (2015) use Q-learning based RL for extractive summarization. Paulus et al. (2018) use RL policy gradient methods for abstractive summarization, utilizing sequence-level metric rewards with curriculum learning (Ranzato et al., 2016) or a weighted ML+RL mixed loss (Paulus et al., 2018) for stability and language fluency. We use sentence-level rewards to optimize the extractor while keeping our ML-trained abstractor decoder fixed, so as to achieve the best of both worlds.
Training a neural network to use another fixed network has been investigated in machine translation for better decoding (Gu et al., 2017a) and real-time translation (Gu et al., 2017b). They used a fixed pretrained translator and applied policy gradient techniques to train another task-specific network. In question answering (QA), Choi et al. (2017) extract one sentence and then generate the answer from the sentence's vector representation with RL bridging. Another recent work attempted a new coarse-to-fine attention approach on summarization (Ling and Rush, 2017) and found desired sharp focus properties for scaling to larger inputs (though without metric improvements). Very recently (concurrently), Narayan et al. (2018) use RL to rank sentences for purely extractive summarization. Finally, there are some loosely-related recent works: Zhou et al. (2017) proposed a selective gate to improve the attention in abstractive summarization. Tan et al. (2018) used an extract-then-synthesis approach on QA, where an extraction model predicts the important spans in the passage and then another synthesis model generates the final answer. Swayamdipta et al. (2017) attempted cascaded non-recurrent small networks on extractive QA, resulting in a scalable, parallelizable model. Fan et al. (2017) added controlling parameters to adapt the summary to length, style, and entity preferences. However, none of these used RL to bridge the non-differentiability of neural models.

Experimental Setup
Please refer to the supplementary for full training details (all hyper-parameter tuning was performed on the validation set). We use the CNN/Daily Mail dataset (Hermann et al., 2015) modified for summarization (Nallapati et al., 2016). Because there are two versions of the dataset, original text and entity-anonymized, we show results on both versions for a fair comparison to prior work, running training and evaluation for each version separately. Although the two versions have been treated by the summarization community as two different datasets, we use the same hyper-parameter values for both versions to show the generalization of our model. We also show improvements on the DUC-2002 dataset in a test-only setup.

Modular Extractive vs. Abstractive
Our hybrid approach is capable of both extractive and abstractive (i.e., rewriting every sentence) summarization. The extractor alone performs extractive summarization. To investigate the effect of the recurrent extractor (rnn-ext), we implement a feed-forward extractive baseline ff-ext (details in the supplementary). It is also possible to apply RL to the extractor without using the abstractor (rnn-ext + RL). 9 Benefiting from the high modularity of our model, we can make our summarization system abstractive by simply applying the abstractor to the extracted sentences. Our abstractor rewrites each sentence and generates novel words from a large vocabulary, and hence every word in our overall summary is generated from scratch, placing our full model in the abstractive paradigm. 10 We run experiments on separately trained extractor/abstractor (ff-ext + abs, rnn-ext + abs) and the reinforced full model (rnn-ext + abs + RL) as well as the final reranking version (rnn-ext + abs + RL + rerank).

9 In this case the abstractor function is the identity: $g(d) = d$.

10 Note that the abstractive CNN/DM dataset does not include any human-annotated extraction labels, and hence our models do not receive any direct extractive supervision.

Results
For easier comparison, we show separate tables for the original-text vs. anonymized versions: Table 1 and Table 2, respectively. Overall, our model achieves strong improvements and the new state-of-the-art on both extractive and abstractive settings for both versions of the CNN/DM dataset (with some comparable results on the anonymized version). Moreover, Table 3 shows the generalization of our abstractive system to an out-of-domain test-only setup (DUC-2002), where our model achieves better scores than See et al. (2017).

Extractive Summarization
In the extractive paradigm, we compare our model with the extractive model from Nallapati et al. (2017) and a strong lead-3 baseline. For producing our summary, we simply concatenate the extracted sentences from the extractors. From Table 1 and Table 2, we can see that our feed-forward extractor outperforms the lead-3 baseline, empirically showing that our hierarchical sentence encoding model is capable of extracting salient sentences. The reinforced extractor performs the best, because of its ability to get the summary-level reward and the reduced train-test mismatch of feeding in the previous extraction decision. The improvement over lead-3 is consistent across both tables. In Table 2, it outperforms the previous best neural extractive model (Nallapati et al., 2017). In Table 1, our model also outperforms a recent, concurrent work (Narayan et al., 2018), showing that our pointer-network extractor and reward formulations are very effective when combined with A2C RL.

Abstractive Summarization
After applying the abstractor, the ff-ext based model still outperforms the rnn-ext model. Both combined models exceed the pointer-generator model (See et al., 2017) without coverage by a large margin on all metrics, showing the effectiveness of our 2-step hierarchical approach: our method naturally avoids repetition by extracting multiple sentences with different key points. 12 Moreover, after applying reinforcement learning, our model performs better than the best model of See et al. (2017) and the best ML-trained model of Paulus et al. (2018). Our reinforced model outperforms the ML-trained rnn-ext + abs baseline with statistical significance of p < 0.01 on all metrics for both versions of the dataset, indicating the effectiveness of the RL training. Also, rnn-ext + abs + RL is statistically significantly better than See et al. (2017) on all metrics with p < 0.01. 13 In the supplementary, we show the learning curve of our RL training, where the average reward goes up quickly after the extractor learns the End-of-Extract action and then stabilizes. For all the above models, we use standard greedy decoding and find that it performs well.
Reranking and Redundancy: Although the extract-then-abstract approach inherently will not generate repeating sentences the way other neural decoders do, there might still be across-sentence redundancy because the abstractor is not aware of the other extracted sentences when decoding one. Hence, we incorporate the optional reranking strategy described in Sec. 3.3. The improved ROUGE scores indicate that this successfully removes some remaining redundancies and hence produces more concise summaries. Our best abstractive model (rnn-ext + abs + RL + rerank) is clearly superior to that of See et al. (2017). We are comparable on R-1 and R-2, with a 0.4-point improvement on R-L, w.r.t. Paulus et al. (2018). We also outperform the results of Fan et al. (2017) on both the original and anonymized dataset versions. Several previous works have pointed out that extractive baselines are very difficult to beat (in terms of ROUGE) by an abstractive system (See et al., 2017; Nallapati et al., 2017). Note that our best model is one of the first abstractive models to outperform the lead-3 baseline on the original-text CNN/DM dataset. Our extractive experiment serves as a complementary analysis of the effect of RL with extractive systems.

12 A trivial lead-3 + abs baseline obtains ROUGE of (37.37, 15.59, 34.82), which again confirms the importance of our reinforce-based sentence selection.

13 We calculate statistical significance based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994).

Human Evaluation
We also conduct human evaluation to ensure the robustness of our training procedure. We measure relevance and readability of the summaries. Relevance is based on the summary containing important, salient information from the input article, being correct by avoiding contradictory/unrelated information, and avoiding repeated/redundant information. Readability is based on the summary's fluency, grammaticality, and coherence. To evaluate both these criteria, we design the following Amazon MTurk experiment: we randomly select 100 samples from the CNN/DM test set and ask the human testers (3 for each sample) to rank between summaries (for relevance and readability) produced by our model and that of See et al. (2017) (the models were anonymized and randomly shuffled), i.e., A is better, B is better, or both are equally good/bad. Following previous work, the input article and ground-truth summaries are also shown to the human participants in addition to the two model summaries. From the results shown in Table 4, we can see that our model is better in both relevance and readability w.r.t. See et al. (2017).

Table 5: Speed comparison with See et al. (2017).

Models                      | total time (hr)    | words / sec
(See et al., 2017)          | 12.9               | 14.8
rnn-ext + abs + RL          | 0.68               | 361.3
rnn-ext + abs + RL + rerank | 2.00 (1.46 + 0.54) | 109.8

Speed Comparison
Our two-stage extractive-abstractive hybrid model is not only the SotA on summary quality metrics but, more importantly, also gives a significant speed-up in both train and test time over a strong neural abstractive system (See et al., 2017). 16 Our full model is composed of an extremely fast extractor and a parallelizable abstractor, where the computation bottleneck is the abstractor, which has to generate summaries with a large vocabulary from scratch. 17 The main advantage of our abstractor at decoding time is that we can first compute all the extracted sentences for the document, and then abstract every sentence concurrently (in parallel) to generate the overall summary. In Table 5, we show the substantial test-time speed-up of our model compared to See et al. (2017). 18 We calculate the total decoding time for producing all summaries for the test set. 19 Because the main test-time speed bottleneck of an RNN language generation model is that it is constrained to generate one word at a time, the total decoding time depends on the number of total words generated; we hence also report the decoded words per second for a fair comparison. Our model without reranking is extremely fast. From Table 5 we can see that we achieve a speed-up of 18x in time and 24x in word generation rate. Even after adding the (optional) reranker, we still maintain a 6-7x speed-up (and hence a user can choose to use the reranking component depending on their downstream application's speed requirements). 20

16 The only publicly available code with a pretrained model for neural summarization on which we can test the speed.

17 The time needed for the extractor is negligible w.r.t. the abstractor because it does not require large matrix multiplications for generating every word. Moreover, with the convolutional encoder at the word-level made parallelizable by the hierarchical rnn-ext, our model is scalable for very long documents.

18 For details of the training speed-up, please see the supplementary.

19 We time the model of See et al. (2017) using a beam size of 4 (used for their best-reported scores). Without beam search, it gets significantly worse ROUGE of (36.62, 15.12, 34.08), so we do not compare speed-ups w.r.t. that version.

20 Most of the recent neural abstractive summarization systems are of similar algorithmic complexity to that of See et al. (2017). The main differences, such as the training objective (ML vs. RL) and copying (soft/hard), have negligible test runtime compared to the slowest component: the long-summary decoder.

Abstractiveness
We compute an abstractiveness score (See et al., 2017) as the ratio of novel n-grams in the generated summary that are not present in the input document. The results are shown in Table 6: our model writes substantially more abstractive summaries than previous work. A potential reason for this is that, when trained with individual sentence pairs, the abstractor learns to drop more document words so as to write individual summary sentences as concisely as human-written ones; hence the improvement in multi-gram novelty.

Table 6: Abstractiveness: novel n-gram counts (%).

Models                      | 1-gm | 2-gm | 3-gm | 4-gm
(See et al., 2017)          | 0.1  | 2.2  | 6.0  | 9.7
rnn-ext + abs + RL + rerank | 0.3  | 10.0 | 21.7 | 31.6
reference summaries         | 10.8 | 47.5 | 68.2 | 78.2
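One plausible way to compute this score is a set-based novel n-gram ratio (a simplification of the exact counting):

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(summary_tokens, doc_tokens, n):
    """Fraction of summary n-grams that never appear in the source document."""
    summ = ngrams(summary_tokens, n)
    return len(summ - ngrams(doc_tokens, n)) / len(summ) if summ else 0.0

doc = "the quick brown fox jumps over the lazy dog".split()
summ = "a quick fox jumps over a dog".split()
print(novelty(summ, doc, 2))  # share of novel bigrams in the summary
```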

Qualitative Analysis on Output Examples
We show examples of how our best model selects sentences and then rewrites them. In the supplementary Fig. 4 and Fig. 5, we can see how the abstractor rewrites the extracted sentences concisely while keeping the mentioned facts. Adding the reranker makes the output more compact globally. We observe that when rewriting longer text, the abstractor has many facts to choose from (Fig. 5, sentence 2), and this is where the reranker helps avoid redundancy across sentences.

Conclusion
We propose a novel sentence-level RL model for abstractive summarization, which makes the model aware of the word-sentence hierarchy. Our model achieves the new state-of-the-art on both CNN/Daily Mail versions and generalizes to the test-only DUC-2002 dataset, while being substantially more abstractive and faster than previous models.

A.1 Convolutional Sentence Encoder

We use a temporal convolutional model (Kim, 2014) to compute the representation of every individual sentence in the document. First, the words are converted to distributed vector representations by a learned word embedding matrix $W_{emb}$. The sequence of word vectors in each sentence is then fed through 1-D single-layer convolutional filters with various window sizes (3, 4, 5) to capture the temporal dependencies of nearby words, followed by a relu non-linear activation and max-over-time pooling. The convolutional representation $r_j$ for the $j$-th sentence is then obtained by concatenating the activation outputs from all filter window sizes.
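A minimal PyTorch sketch of this encoder (the embedding and filter sizes are our assumptions):

```python
import torch
from torch import nn
import torch.nn.functional as F

class ConvSentEncoder(nn.Module):
    """Temporal conv sentence encoder (Kim, 2014): windows 3/4/5, relu,
    max-over-time pooling, outputs concatenated into r_j."""
    def __init__(self, vocab_size, emb_dim=128, n_filters=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # W_emb
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in (3, 4, 5))

    def forward(self, word_ids):                 # word_ids: (batch, seq_len)
        x = self.emb(word_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)          # r_j: (batch, 3 * n_filters)

enc = ConvSentEncoder(vocab_size=30000)
r_j = enc(torch.randint(0, 30000, (2, 40)))      # two sentences of 40 tokens
```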

A.2 Abstractor
In this section we discuss the architecture choices for our abstractor network in Sec. 2.2. At a high level, it is a sequence-to-sequence model with attention and a copy mechanism (but no coverage). Note that the abstractor network is a separate neural network from the extractor agent, without any form of parameter sharing.

Sequence-Attention-Sequence Model
We use a standard encoder-aligner-decoder model (Bahdanau et al., 2015; Luong et al., 2015) with the bilinear multiplicative attention function (Luong et al., 2015), $f_{att}(h_i, z_j) = h_i^\top W_{attn} z_j$, for the context vector $e_j$. We share the source and target embedding matrix $W_{emb}$ as well as the output projection matrix, as in Inan et al. (2017); Press and Wolf (2017); Paulus et al. (2018).

Copy Mechanism
We add the copying mechanism as in See et al. (2017) to extend the decoder to predict over the extended vocabulary of words in the input document. A copy probability $p_{copy} = \sigma(v_{\hat{z}}^\top \hat{z}_j + v_s^\top z_j + v_w^\top w_j + b)$ is calculated from the learnable parameters (the $v$'s and $b$), and is then used to compute a weighted sum of the probabilities over the source vocabulary and the predefined vocabulary. At test time, an OOV prediction is replaced by the document word with the highest attention score.
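Mixing the two distributions with $p_{copy}$ can be sketched as follows (dummy inputs; the variable names are ours):

```python
import torch
import torch.nn.functional as F

V, src_len = 50000, 30
p_vocab = F.softmax(torch.randn(V), dim=0)     # decoder's distribution over vocab
attn = F.softmax(torch.randn(src_len), dim=0)  # attention over source positions
src_ids = torch.randint(0, V, (src_len,))      # source tokens mapped to vocab ids
p_copy = torch.sigmoid(torch.randn(()))        # from context, state, input features

final = (1 - p_copy) * p_vocab                        # generation share
final = final.scatter_add(0, src_ids, p_copy * attn)  # add copy mass per source word
# 'final' still sums to 1; OOV source tokens would get extra vocabulary slots
```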

A.3 Actor-Critic Policy Gradient
Here we discuss the details of the actor-critic policy gradient training. Given the MDP formulation described in Sec. 3.2, the return (total discounted future reward) is

$$R_t = \sum_{t'=t}^{N_s} \gamma^{t'-t} \, r(t'+1)$$

for each recurrent step $t$, where $N_s$ is the total number of sentences the agent extracts. To learn an optimal policy $\pi^*$ that maximizes the state-value function

$$V^{\pi_{\theta_a,\omega}}(c) = \mathbb{E}_{\pi_{\theta_a,\omega}}[R_t \mid c_t = c],$$

we make use of the action-value function

$$Q^{\pi_{\theta_a,\omega}}(c, j) = \mathbb{E}_{\pi_{\theta_a,\omega}}[R_t \mid c_t = c, j_t = j].$$

We then take the policy gradient theorem and substitute the action-value function with its Monte-Carlo sample:

$$\nabla_{\theta_a,\omega} J = \mathbb{E}\left[\nabla_{\theta_a,\omega} \log \pi_\theta(c_t, j_t) \, R_t\right] \qquad (10)$$

which runs a single episode and gets the return (an estimate of the action-value function) by sampling from the policy $\pi_\theta$. This gradient update is also known as the REINFORCE algorithm (Williams, 1992). The vanilla REINFORCE algorithm is known for high variance. To mitigate this problem, we add a critic network with trainable parameters $\theta_c$ having the same structure as the pointer-network's decoder (described in Sec. 2.1.2) but with the final output layer changed to regress the state-value function $V^{\pi_{\theta_a,\omega}}(c)$. The predicted value $b_{\theta_c,\omega}(c)$ is called the baseline and is subtracted from the action-value function to estimate the advantage $A^{\pi_\theta}(c, j) = Q^{\pi_{\theta_a,\omega}}(c, j) - b_{\theta_c,\omega}(c)$, where $\theta = \{\theta_a, \theta_c, \omega\}$ denotes the set of all trainable parameters. The new policy gradient for our extractor can be estimated by substituting the action-value function in Eqn. (10) by the advantage and then using Monte-Carlo samples (using $R_t$ to estimate $Q$):

$$\nabla_{\theta_a,\omega} J = \mathbb{E}\left[\nabla_{\theta_a,\omega} \log \pi_\theta(c, j) \, A^{\pi_\theta}(c, j)\right]$$

Here we also show an interesting finding on the effect of adding the EOE action. In Fig. 3, we can see that the average reward is low in the beginning but quickly goes up after the agent picks up the EOE action. The low beginning reward is because the agent does not choose the EOE action and hence keeps getting zero rewards when extracting extra sentences, which lowers the average.
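The returns $R_t$ can be computed backwards from the per-step rewards, e.g.:

```python
def discounted_returns(rewards, gamma=1.0):
    """R_t = sum_{t' >= t} gamma^(t'-t) * r(t'+1), accumulated right-to-left."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# per-step sentence-level ROUGE rewards, ending with the ROUGE-1 terminal reward
print(discounted_returns([0.3, 0.5, 0.2, 0.4]))
```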

A.4 Sentence Selection Baseline ff-ext
In this subsection, we describe the detailed network structure of the feed-forward extractor baseline (ff-ext). Following the hierarchical sentence representation described in Sec. 2.1.1, if we add the further assumption that there exists a sequence $j_1, j_2, \dots, j_{N_s}$ with $j_1 < j_2 < \cdots < j_{N_s}$ such that $[d_1, d_2, \dots, d_{N_d}] = x$ and $[g(d_{j_1}), g(d_{j_2}), \dots, g(d_{j_{N_s}})] = y$, i.e., the extracted document sentences are summarized in their original order, we can apply the following feed-forward structure for sentence selection, where $N_d$ and $N_s$ denote the number of sentences in the document $x$ and the summary $y$, respectively. We first learn a document representation and then compute the extraction probability for each sentence.
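One plausible instantiation of such a feed-forward selector is sketched below; this is our illustrative guess (a simple mean-pooled document representation and a per-sentence score), not the exact equations of the baseline:

```python
import torch
from torch import nn

dim = 512
h = torch.randn(20, dim)                # h_j for N_d = 20 sentences (Sec. 2.1.1)
doc_rep = torch.tanh(h.mean(dim=0))     # pooled document representation (assumed)
scorer = nn.Linear(2 * dim, 1)

feats = torch.cat([h, doc_rep.expand_as(h)], dim=1)   # sentence + document features
p_extract = torch.sigmoid(scorer(feats).squeeze(-1))  # per-sentence extraction prob.
```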