Bottom-Up Abstractive Summarization

Neural summarization produces outputs that are fluent and readable, but which can be poor at content selection, for instance often copying full sentences from the source document. This work explores the use of data-efficient content selectors to over-determine phrases in a source document that should be part of the summary. We use this selector as a bottom-up attention step to constrain the model to likely phrases. We show that this approach improves the ability to compress text, while still generating fluent summaries. This two-step process is both simpler and higher performing than other end-to-end content selection models, leading to significant improvements on ROUGE for both the CNN-DM and NYT corpus. Furthermore, the content selector can be trained with as little as 1,000 sentences making it easy to transfer a trained summarizer to a new domain.


Introduction
Text summarization systems aim to generate natural language summaries that compress the information in a longer text. Approaches using neural networks have shown promising results on this task with end-to-end models that encode a source document and then decode it into an abstractive summary. Current state-of-the-art neural abstractive summarization models combine extractive and abstractive techniques by using pointergenerator style models which can copy words from the source document (Gu et al., 2016;See et al., 2017). These end-to-end models produce fluent abstractive summaries but have had mixed success in content selection, i.e. deciding what to summarize, compared to fully extractive models.
There is an appeal to end-to-end models from a modeling perspective; however, there is evidence that when summarizing people follow a two-step  approach of first selecting important phrases and then paraphrasing them (Anderson and Hidi, 1988;Jing and McKeown, 1999). A similar argument has been made for image captioning. Anderson et al. (2017) develop a state-of-the-art model with a two-step approach that first pre-computes bounding boxes of segmented objects and then applies attention to these regions. This so-called bottom-up attention is inspired by neuroscience research describing attention based on properties in-herent to a stimulus (Buschman and Miller, 2007).
Motivated by this approach, we consider bottom-up attention for neural abstractive summarization. Our approach first selects a selection mask for the source document and then constrains a standard neural model by this mask. This approach can better decide which phrases a model should include in a summary, without sacrificing the fluency advantages of neural abstractive summarizers. Furthermore, it requires much fewer data to train, which makes it more adaptable to new domains.
Our full model incorporates a separate content selection system to decide on relevant aspects of the source document. We frame this selection task as a sequence-tagging problem, with the objective of identifying tokens from a document that are part of its summary. We show that a content selection model that builds on contextual word embeddings  can identify correct tokens with a recall of over 60%, and a precision of over 50%. To incorporate bottom-up attention into abstractive summarization models, we employ masking to constrain copying words to the selected parts of the text, which produces grammatical outputs. We additionally experiment with multiple methods to incorporate similar constraints into the training process of more complex end-to-end abstractive summarization models, either through multi-task learning or through directly incorporating a fully differentiable mask.
Our experiments compare bottom-up attention with several other state-of-the-art abstractive systems. Compared to our baseline models of See et al. (2017) bottom-up attention leads to an improvement in ROUGE-L score on the CNN-Daily Mail (CNN-DM) corpus from 36.4 to 38.3 while being simpler to train. We also see comparable or better results than recent reinforcement-learning based methods with our MLE trained system. Furthermore, we find that the content selection model is very data-efficient and can be trained with less than 1% of the original training data. This provides opportunities for domain-transfer and lowresource summarization. We show that a summarization model trained on CNN-DM and evaluated on the NYT corpus can be improved by over 5 points in ROUGE-L with a content selector trained on only 1,000 in-domain sentences.

Related Work
There is a tension in document summarization between staying close to the source document and allowing compressive or abstractive modification. Many non-neural systems take a select and compress approach. For example, Dorr et al. (2003) introduced a system that first extracts noun and verb phrases from the first sentence of a news article and uses an iterative shortening algorithm to compress it. Recent systems such as Durrett et al. (2016) also learn a model to select sentences and then compress them.
In contrast, recent work in neural network based data-driven extractive summarization has focused on extracting and ordering full sentences Dlikman and Last, 2016). Nallapati et al. (2016b) use a classifier to determine whether to include a sentence and a selector that ranks the positively classified ones. These methods often over-extract, but extraction at a word level requires maintaining grammatically correct output , which is difficult. Interestingly, key phrase extraction while ungrammatical often matches closely in content with human-generated summaries (Bui et al., 2016).
A third approach is neural abstractive summarization with sequence-to-sequence models (Sutskever et al., 2014;Bahdanau et al., 2014). These methods have been applied to tasks such as headline generation (Rush et al., 2015) and article summarization (Nallapati et al., 2016a). Chopra et al. (2016) show that attention approaches that are more specific to summarization can further improve the performance of models. Gu et al. (2016) were the first to show that a copy mechanism, introduced by Vinyals et al. (2015), can combine the advantages of both extractive and abstractive summarization by copying words from the source. See et al. (2017) refine this pointer-generator approach and use an additional coverage mechanism (Tu et al., 2016) that makes a model aware of its attention history to prevent repeated attention.
Most recently, reinforcement learning (RL) approaches that optimize objectives for summarization other than maximum likelihood have been shown to further improve performance on these tasks (Paulus et al., 2017;Li et al., 2018b;Celikyilmaz et al., 2018). Paulus et al. (2017) approach the coverage problem with an intra-attention in which a decoder has an attention over previously generated words. However RL-based training can be difficult to tune and slow to train. Our method does not utilize RL training, although in theory this approach can be adapted to RL methods.
Several papers also explore multi-pass extractive-abstractive summarization.
Nallapati et al. (2017) create a new source document comprised of the important sentences from the source and then train an abstractive system.  describe an extractive phase that extracts full paragraphs and an abstractive one that determines their order. Finally Zeng et al. (2016) introduce a mechanism that reads a source document in two passes and uses the information from the first pass to bias the second. Our method differs in that we utilize a completely abstractive model, biased with a powerful content selector.
Other recent work explores alternative approaches to content selection. For example, Cohan et al. (2018) use a hierarchical attention to detect relevant sections in a document, Li et al. (2018a) generate a set of keywords that is used to guide the summarization process, and Pasunuru and Bansal (2018) develop a loss-function based on whether salient keywords are included in a summary. Other approaches investigate the content-selection at the sentence-level. Tan et al. (2017) describe a graphbased attention to attend to one sentence at a time, Chen and Bansal (2018) first extract full sentences from a document and then compress them, and Hsu et al. (2018) modulate the attention based on how likely a sentence is included in a summary.

Background: Neural Summarization
Throughout this paper, we consider a set of pairs of texts (X , Y) where x ∈ X corresponds to source tokens x 1 , . . . , x n and y ∈ Y to a summary y 1 , . . . , y m with m n. Abstractive summaries are generated one word at a time. At every time-step, a model is aware of the previously generated words. The problem is to learn a function f (x) parametrized by θ that maximizes the probability of generating the correct sequences. Following previous work, we model the abstractive summarization with an attentional sequence-to-sequence model. The attention distribution p(a j |x, y 1:j−1 ) for a decoding step j, calculated within the neural network, represents an embedded soft distribution over all of the source tokens and can be interpreted as the current focus of the model.
The model additionally has a copy mecha- nism (Vinyals et al., 2015) to copy words from the source. Copy models extend the decoder by predicting a binary soft switch z j that determines whether the model copies or generates. The copy distribution is a probability distribution over the source text, and the joint distribution is computed as a convex combination of the two parts of the model, where the two parts represent copy and generation distribution respectively. Following the pointergenerator model of See et al. (2017), we reuse the attention p(a j |x, y 1:j−1 ) distribution as copy distribution, i.e. the copy probability of a token in the source w through the copy attention is computed as the sum of attention towards all occurrences of w. During training, we maximize marginal likelihood with the latent switch variable.

Bottom-Up Attention
We next consider techniques for incorporating a content selection into abstractive summarization, illustrated in Figure 2.

Content Selection
We define the content selection problem as a wordlevel extractive summarization task. While there has been significant work on custom extractive summarization (see related work), we make a simplifying assumption and treat it as a sequence tagging problem. Let t 1 , . . . , t n denote binary tags for each of the source tokens, i.e. 1 if a word is copied in the target sequence and 0 otherwise. While there is no supervised data for this task, we can generate training data by aligning the summaries to the document. We define a word x i as copied if (1) it is part of the longest possible subsequence of tokens s = x i−j:i:i+k , for integers j ≤ i; k ≤ (n − i), if s ∈ x and s ∈ y, and (2) there exists no earlier sequence u with s = u.
We use a standard bidirectional LSTM model trained with maximum likelihood for the sequence labeling problem. Recent results have shown that better word representations can lead to significantly improved performance in sequence tagging tasks (Peters et al., 2017). Therefore, we first map each token w i into two embedding channels e (w) i and e (c) i . The e (w) embedding represents a static channel of pre-trained word embeddings, e.g. GLoVE (Pennington et al., 2014). The e (c) are contextual embeddings from a pretrained language model, e.g. ELMo (Peters et al., 2018) which uses a character-aware token embedding (Kim et al., 2016) followed by two bidirectional LSTM layers h (1) i and h (2) i . The contextual embeddings are fine-tuned to learn a task-specific embedding e (c) i as a linear combination of the states of each LSTM layer and the token embedding, with γ and s 0,1,2 as trainable parameters. Since these embeddings only add four additional parameters to the tagger, it remains very data-efficient despite the high-dimensional embedding space. Both embeddings are concatenated into a single vector that is used as input to a bidirectional LSTM, which computes a representation h i for a word w i . We can then calculate the probability q i that the word is selected as σ(W s h i + b s ) with trainable parameters W s and b s .

Bottom-Up Copy Attention
Inspired by work in bottom-up attention for images (Anderson et al., 2017) which restricts attention to predetermined bounding boxes within an image, we use these attention masks to limit the available selection of the pointer-generator model.
As shown in Figure 1, a common mistake made by neural copy models is copying very long sequences or even whole sentences. In the baseline model, over 50% of copied tokens are part of copy sequences that are longer than 10 tokens, whereas this number is only 10% for reference summaries. While bottom-up attention could also be used to modify the source encoder representations, we found that a standard encoder over the full text was effective at aggregation and therefore limit the bottom-up step to attention masking.
Concretely, we first train a pointer-generator model on the full dataset as well as the content selector defined above. At inference time, to generate the mask, the content selector computes selection probabilities q 1:n for each token in a source document. The selection probabilities are used to modify the copy attention distribution to only include tokens identified by the selector. Let a i j denote the attention at decoding step j to encoder word i. Given a threshold , the selection is applied as a hard mask, such that To ensure that Eq. 1 still yields a correct probability distribution, we first multiply p(ã j |x, y 1:j−1 ) by a normalization parameter λ and then renormalize the distribution. The resulting normalized distribution can be used to directly replace a as the new copy probabilities.

End-to-End Alternatives
Two-step BOTTOM-UP attention has the advantage of training simplicity. In theory, though, standard copy attention should be able to learn how to perform content selection as part of the end-to-end training. We consider several other end-to-end approaches for incorporating content selection into neural training. Method 1: (MASK ONLY): We first consider whether the alignment used in the bottom-up approach could help a standard summarization system. Inspired by Nallapati et al. (2017), we investigate whether aligning the summary and the source during training and fixing the gold copy attention to pick the "correct" source word is beneficial. We can think of this approach as limiting the set of possible copies to a fixed source word. Here the training is changed, but no mask is used at test time.
Method 2 (MULTI-TASK): Next, we investigate whether the content selector can be trained alongside the abstractive system. We first test this hypothesis by posing summarization as a multi-task problem and training the tagger and summarization model with the same features. For this setup, we use a shared encoder for both abstractive summarization and content selection. At test time, we apply the same masking method as bottom-up attention.
Method 3 (DIFFMASK): Finally we consider training the full system end-to-end with the mask during training. Here we jointly optimize both objectives, but use predicted selection probabilities to softly mask the copy attention p(ã i j |x, y 1:j−1 ) = p(a i j |x, y 1:j−1 )×q i , which leads to a fully differentiable model. This model is used with the same soft mask at test time.

Inference
Several authors have noted that longer-form neural generation still has significant issues with incorrect length and repeated words than in short-form problems like translation. Proposed solutions include modifying models with extensions such as a coverage mechanism (Tu et al., 2016;See et al., 2017) or intra-sentence attention Paulus et al., 2017). We instead stick to the theme of modifying inference, and modify the scoring function to include a length penalty lp and a coverage penalty cp, and is defined as s(x, y) = log p(y|x)/lp(x) + cp(x; y).
Length: To encourage the generation of longer sequences, we apply length normalizations during beam search. We use the length penalty by Wu et al. (2016), which is formulated as lp(y) = (5 + |y|) α (5 + 1) α , with a tunable parameter α, where increasing α leads to longer summaries. We additionally set a minimum length based on the training data. Repeats: Copy models often repeatedly attend to the same source tokens, generating the same phrase multiple times. We introduce a new summary specific coverage penalty, Intuitively, this penalty increases whenever the decoder directs more than 1.0 of total attention within a sequence towards a single encoded token. By selecting a sufficiently high β, this penalty blocks summaries whenever they would lead to repetitions. Additionally, we follow (Paulus et al., 2017) and restrict the beam search to never repeat trigrams.

Data and Experiments
We evaluate our approach on the CNN-DM corpus (Hermann et al., 2015;Nallapati et al., 2016a), and the NYT corpus (Sandhaus, 2008), which are both standard corpora for news summarization. The summaries for the CNN-DM corpus are bullet points for the articles shown on their respective websites, whereas the NYT corpus contains summaries written by library scientists. CNN-DM summaries are full sentences, with on average 66 tokens (σ = 26) and 4.9 bullet points. NYT summaries are not always complete sentences and are shorter, with on average 40 tokens (σ = 27) and 1.9 bullet points. Recent work has used both the anonymized and the non-anonymized versions of on CNN-DM, so direct comparison can be difficult. Following See et al. (2017), we use the non-anonymized version of this corpus and truncate source documents to 400 tokens and the target summaries to 100 tokens in training and validation sets. For experiments with the NYT corpus, we use the preprocessing described by Paulus et al. (2017), and additionally remove author information and truncate source documents to 400 tokens instead of 800. These changes lead to an average of 326 tokens per article, a decrease from the 549 tokens with 800 token truncated articles. The target (non-copy) vocabulary is limited to 50,000 tokens for all models. The content selection model uses pre-trained GloVe embeddings of size 100, and ELMo with size 1024. The bi-LSTM has two layers and a hidden size of 256. Dropout is set to 0.5, and the model is trained with Adagrad, an initial learning rate of 0.15, and an initial accumulator value of 0.1. We limit the number of training examples to 100,000 on either corpus, which only has a small impact on performance. For the jointly trained content selection models, we use the same configuration as the abstractive model. For the base model, we re-implemented the Pointer-Generator model as described by See et al. (2017). To have a comparable number of parameters to previous work, we use an encoder with 256 hidden states for both directions in the onelayer LSTM, and 512 for the one-layer decoder. The embedding size is set to 128. We found that increasing model size or changing the model to the Transformer (Vaswani et al., 2017) can lead to slightly improved performance, but at the cost of increased training time and parameters. The model is trained with the same Adagrad configuration as the content selector. Additionally, the learning rate halves after each epoch once the validation perplexity does not decrease after an epoch. We do not use dropout and use gradient-clipping with a maximum norm of 2. All inference parameters are tuned on a 200 sentence subset of the validation set. Length penalty parameter α and copy mask differ across models and baselines, with α ranging from 0.6 to 1.4, and ranging from 0.1 to 0.2. The minimum length of the generated summary is set to 35 for CNN-DM and 6 for NYT. While the Pointer-Generator uses a beam size of 5 and does not improve with a larger beam, we found that bottom-up attention requires a larger beam size and set it to 10. The coverage penalty parameter β is set to 10, and the copy attention normalization parameter λ to 2 for both approaches.
We use AllenNLP  for the content selector, and the abstractive models are implemented in OpenNMT-py (Klein et al., 2017). 3 . Table 1 shows our main results on the CNN-DM corpus, with abstractive models shown in the top, and bottom-up attention methods at the bottom. We first observe that using a coverage inference penalty scores the same as a full coverage mechanism, without requiring any additional model parameters. We found that none of our end-to-end models lead to improvements, indicating that it is difficult to apply the masking during training without hurting the training process. The Mask Only model with increased supervision on the copy mechanism performs very similar to the Multi-Task model. On the other hand, bottom-up attention leads to a major improvement across all three scores. While we would expect better content selection to primarily improve ROUGE-1, the fact all three increase hints that the fluency is not being hurt specifically. Our cross-entropy trained approach even outperforms all of the reinforcementlearning based approaches in ROUGE-1 and 2, while the highest reported ROUGE-L score by Chen and Bansal (2018) falls within the 95% confidence interval of our results. Table 2 shows experiments with the same systems on the NYT corpus. We see that the 2 point improvement compared to the baseline Pointer-Generator maximum-likelihood approach carries over to this dataset. Here, the model outperforms  Table 2: Results on the NYT corpus, where we compare to RL trained models. * marks models and results by Paulus et al. (2017), and † results by Celikyilmaz et al. (2018). the RL based model by Paulus et al. (2017) in ROUGE-1 and 2, but not L, and is comparable to the results of (Celikyilmaz et al., 2018) except for ROUGE-L. The same can be observed when comparing ML and our Pointer-Generator. We suspect that a difference in summary lengths due to our inference parameter choices leads to this difference, but did not have access to their models or summaries to investigate this claim. This shows that a bottom-up approach achieves competitive results even to models that are trained on summary-specific objectives.

Results
The main benefit of bottom-up summarization seems to be from the reduction of mistakenly copied words. With the best Pointer-Generator models, the precision of copied words is 50.0% compared to the reference. This precision increases to 52.8%, which mostly drives the increase in R1. An independent-samples t-test shows that this improvement is statistically significant with t=14.7 (p < 10 −5 ). We also observe a decrease in average sentence length of summaries from 13 to 12 words when adding content selection compared to the Pointer-Generator while holding all other inference parameters constant.
Domain Transfer While end-to-end training has become common, there are benefits to a twostep method. Since the content selector only needs to solve a binary tagging problem with pretrained vectors, it performs well even with very limited training data. As shown in Figure 3, with only 1,000 sentences, the model achieves an AUC of over 74. Beyond that size, the AUC of the model increases only slightly with increasing training data.
To further evaluate the content selection, we consider an application to domain transfer. In this experiment, we apply the Pointer-Generator    trained on CNN-DM to the NYT corpus. In addition, we train three content selectors on 1, 10, and 100 thousand sentences of the NYT set, and use these in the bottom-up summarization. The results, shown in Table 3, demonstrates that even a model trained on the smallest subset leads to an improvement of almost 5 points over the model without bottom-up attention. This improvement increases with the larger subsets to up to 7 points. While this approach does not reach a comparable performance to models trained directly on the NYT dataset, it still represents a significant increase over the not-augmented CNN-DM model and produces summaries that are quite readable. We show two example summaries in Appendix A. This technique could be used for low-resource domains and for problems with limited data availability.

Analysis and Discussion
Extractive Summary by Content Selection? Given that the content selector is effective in conjunction with the abstractive model, it is interesting to know whether it has learned an effective extractive summarization system on its own. Table 4 shows experiments comparing content selec-  tion to extractive baselines. The LEAD-3 baseline is a commonly used baseline in news summarization that extracts the first three sentences from an article. Top-3 shows the performance when we extract the top three sentences by average copy probability from the selector. Interestingly, with this method, only 7.1% of the top three sentences are not within the first three, further reinforcing the strength of the LEAD-3 baseline. Our naive sentence-extractor performs slightly worse than the highest reported extractive score by Zhou et al. (2018) that is specifically trained to score combinations of sentences. The final entry shows the performance when all the words above a threshold are extracted such that the resulting summaries are approximately the length of reference summaries. The oracle score represents the results if our model had a perfect accuracy, and shows that the content selector, while yielding competitive results, has room for further improvements in future work. This result shows that the model is quite effective at finding important words (ROUGE-1) but less effective at chaining them together (ROUGE-2). Similar to Paulus et al. (2017), we find that the decrease in ROUGE-2 indicates a lack of fluency and grammaticality of the generated summaries. A typical example looks like this: a man food his first hamburger wrongfully for 36 years. michael hanline, 69, was convicted of murder for the shooting of truck driver jt mcgarry in 1980 on judge charges. This particular ungrammatical example has a ROUGE-1 of 29.3. This further highlights the benefit of the combined approach where bottom- Bottom-Up Attention 0.5 53.3 24.8 6.5 Table 5: %Novel shows the percentage of words in a summary that are not in the source document. The last three columns show the part-of-speech tag distribution of the novel words in generated summaries.
up predictions are chained together fluently by the abstractive system. However, we also note that the abstractive system requires access to the full source document. Distillation experiments in which we tried to use the output of the contentselection as training-input to abstractive models showed a drastic decrease in model performance.
Analysis of Copying While Pointer-Generator models have the ability to abstract in summary, the use of a copy mechanism causes the summaries to be mostly extractive. Table 5 shows that with copying the percentage of generated words that are not in the source document decreases from 6.6% to 2.2%, while reference summaries are much more abstractive with 14.8% novel words. Bottom-up attention leads to a further reduction to only a half percent. However, since generated summaries are typically not longer than 40-50 words, the difference between an abstractive system with and without bottom-up attention is less than one novel word per summary. This shows that the benefit of abstractive models has been less in their ability to produce better paraphrasing but more in the ability to create fluent summaries from a mostly extractive process. Table 5 also shows the part-of-speech-tags of the novel generated words, and we can observe an interesting effect. Application of bottom-up attention leads to a sharp decrease in novel adjectives and nouns, whereas the fraction of novel words that are verbs sharply increases. When looking at the novel verbs that are being generated, we notice a very high percentage of tense or number changes, indicated by variation of the word "say", for example "said" or "says", while novel nouns are mostly morphological variants of words in the source. Figure 4 shows the length of the phrases that are being copied. While most copied phrases in the reference summaries are in groups of 1 to 5 words, the Pointer-Generator copies many very long sequences and full sentences of over 11 words. Since the content selection mask interrupts most long copy sequences, the model has to either generate the unselected words using only the generation probability or use a different word instead. While we observed both cases quite frequently in generated summaries, the fraction of very long copied phrases decreases. However, either with or without bottom-up attention, the distribution of the length of copied phrases is still quite different from the reference.
Inference Penalty Analysis We next analyze the effect of the inference-time loss functions. Table 6 presents the marginal improvements over the simple Pointer-Generator when adding one penalty at a time. We observe that all three penalties improve all three scores, even when added on top of the other two. This further indicates that the unmodified Pointer-Generator model has already learned an appropriate representation of the abstractive summarization problem, but is limited by ineffective content selection and inference methods.

Conclusion
This work presents a simple but accurate content selection model for summarization that identifies phrases within a document that are likely included in its summary. We showed that this content selector can be used for a bottom-up attention that restricts the ability of abstractive summarizers to copy words from the source. The combined bottom-up summarization system leads to improvements in ROUGE scores of over two points on both the CNN-DM and NYT corpora. A  comparison to end-to-end trained methods showed that this particular problem cannot be easily solved with a single model, but instead requires finetuned inference restrictions. Finally, we showed that this technique, due to its data-efficiency, can be used to adjust a trained model with few data points, making it easy to transfer to a new domain. Preliminary work that investigates similar bottom-up approaches in other domains that require a content selection, such as grammar correction, or data-to-text generation, have shown some promise and will be investigated in future work.

Examples
Generated summary Reference green bay packers successful season is largely due to quarterback brett favre S2S ahman green rushed for 000 yards in 00-00 victory over the giants . true , dorsey levens , good enough to start for most teams but now green 's backup , contributed kickoff returns of 00 , 00 and 00 yards . Content Selection playoff-bound green bay packers beat the giants in the 00-00 victory . the packers won three games and six of each other .
Reference paul byers , pioneer of visual anthropology , dies at age 00 S2S paul byers , an early practitioner of mead , died on dec. 00 at his home in manhattan . he enlisted in the navy , which trained him as a cryptanalyst and stationed him in australia . Content Selection paul byers , an early practitioner of anthropology , pioneered with margaret mead .

A Domain Transfer Examples
We present two generated summaries for the CNN-DM to NYT domain transfer experiment in Table 7. S2S refers to a Pointer-Generator with Coverage Penalty trained on CNN-DM that scores 20.6 ROUGE-L on the NYT dataset. The content-selection improves this to 27.7 ROUGE-L without any fine-tuning of the S2S model.