Attention Head Masking for Inference Time Content Selection in Abstractive Summarization

How can we effectively inform content selection in Transformer-based abstractive summarization models? In this work, we present a simple yet effective attention head masking technique, applied to encoder-decoder attentions to pinpoint salient content at inference time. Using attention head masking, we reveal the relation between encoder-decoder attentions and the content selection behaviors of summarization models. We then demonstrate its effectiveness on three document summarization datasets under both in-domain and cross-domain settings. Importantly, our models outperform prior state-of-the-art models on the CNN/Daily Mail and New York Times datasets. Moreover, our inference-time masking technique is data-efficient, requiring only 20% of the training samples to outperform BART fine-tuned on the full CNN/DailyMail dataset.


Introduction
Large pre-trained Transformers have achieved state-of-the-art results on various summarization datasets, with a fine-tuning phase to streamline the summarization pipeline (Lewis et al., 2020; Yan et al., 2020). Yet, it is still unclear how one can use these large models more effectively for abstractive summarization. For example, prior work shows that informing content selection via attention weight updating in recurrent neural networks can further boost summarizer performance (Gehrmann et al., 2018). However, with multi-head attentions at all layers in Transformers (Vaswani et al., 2017), highlighting salient content becomes non-trivial.
In this work, we propose an inference-time attention head masking mechanism that works on encoder-decoder attentions to underscore salient content from the source and improve the quality of abstractive summaries. Based on this mechanism, we first demonstrate the relation between encoder-decoder attentions and content selection behaviors on three summarization datasets: CNN/DailyMail (CNN/DM), New York Times (NYT), and XSum. Second, we study whether multiple heads at the same layer collectively guide the summarization. Partial masking is found to be most effective, indicating a strong collaborative effect and the importance of head selection.
Based on these observations, we evaluate attention head masking on summarization benchmarks with salience labels provided by externally trained content selectors. On all three datasets, our model consistently outperforms fine-tuned BART (Lewis et al., 2020) and several top-performing Transformer-based abstractive summarization models (Zhang et al., 2019b; Yan et al., 2020). Summaries generated by our model are also considered to have better informativeness by human judges. Moreover, we illustrate that attention head masking is data-efficient: on CNN/DM, BART fine-tuned on less than 20% of the training data outperforms a version trained on the full set. Finally, we show that our method is effective under a cross-domain setting. With a content selector trained on NYT, BART fine-tuned on CNN/DM gains more than three points of ROUGE scores when tested on NYT articles.


Related Work

Large Pre-trained Models for Summarization. Many recent advancements in text summarization have been achieved by large pre-trained language models (Zhang et al., 2019a; Liu and Lapata, 2019; Song et al., 2019; Zhang et al., 2019b). In particular, BART has demonstrated impressive performance on summarization and is used as the base model in this work. Nonetheless, all prior attempts take pre-trained models as is and conduct fine-tuning on target datasets, without knowing whether this is the most effective usage. In contrast, we bring insights into the relation between attentions and content selection via masking operations to further improve summarization performance.
Content Selection for Abstractive Summarization. Content selection is a crucial step, where salient information is first detected and then summarized into concise abstracts (Chen and Bansal, 2018; Xu and Durrett, 2019). To minimize the propagation of selection errors, prior work models content selection as an extra component and learns it within an end-to-end trained model (Zhou et al., 2017; Li et al., 2018; Gehrmann et al., 2018). To the best of our knowledge, we are the first to apply masks to selected layers and attention heads in Transformers for content selection in summarization. Moreover, our masking mechanism is only activated during inference, without any model modification.
Analyzing Multi-head Attentions has attracted growing interest in the NLP community (Clark et al., 2019; Kovaleva et al., 2019). Among the work relevant to encoder-decoder attentions, Michel et al. (2019) and Voita et al. (2019) observe that only a small portion of heads is relevant for translation and that encoder-decoder attentions tend to be more important than self-attentions. Meanwhile, word alignments for machine translation can be induced from encoder-decoder attention weights (Li et al., 2019; Kobayashi et al., 2020). However, none of the prior work employs attentions to improve generation quality. As far as we are aware, this is the first work that studies the content selection effects of encoder-decoder attentions and uses them to guide better summary generation.

Attention Head Masking
We adopt large pre-trained sequence-to-sequence Transformer models (BART, specifically) for abstractive summarization. Transformers are built with multi-head attentions, which are computed per step based on a query q along with the key and value matrices, K and V:

    attn(q, K, V) = softmax(qK^T / sqrt(d_k) + m) V    (1)

where d_k is a scaling factor and m is used for padding or masking future tokens (with value −∞ at masked positions).
Masking Operation. We propose attention head masking on encoder-decoder attentions, which blocks attentions to unimportant tokens so that the multi-head attentions better concentrate on salient input tokens. Importantly, it is activated during inference. Concretely, we add a mask m̃ inside the softmax operator of Eq. 1, with the implementation displayed in Fig. 1. The size of m̃ is the same as the input length. If the i-th token is tagged as salient, the corresponding element in m̃ is set to 0 (attendable to the attention heads), and to −∞ otherwise (hidden from these heads). The salience labels can be predicted by an externally trained content selector.
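The masking operation can be sketched in a few lines. Below is a minimal single-head NumPy illustration of Eq. 1 with the additive mask m̃; the actual implementation operates on batched multi-head tensors inside BART, and all names here are illustrative:

```python
import numpy as np

def masked_cross_attention(q, K, V, salience=None):
    """Single-head encoder-decoder attention with optional attention
    head masking: non-salient source positions receive an additive
    -inf term inside the softmax, so their weights become exactly 0."""
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)            # (src_len,)
    if salience is not None:
        # m~ from the paper: 0 for salient tokens, -inf otherwise
        m_tilde = np.where(salience == 1, 0.0, -np.inf)
        scores = scores + m_tilde
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ V, weights
```

With salience = [1, 0, 1], the second source token receives zero attention weight, and the remaining weights renormalize over the salient positions.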

Encoder-decoder Attentions and Content Selection
In this section, we first probe into the content selection behavior of each single head ( § 4.1), and then study the synergism among heads at the same layer ( § 4.2). In § 4.3, we analyze the attentions' focus. Our analysis is conducted on CNN/DM (Hermann et al., 2015), NYT (Consortium and Company, 2008), and XSum (Narayan et al., 2018). We follow Lewis et al. (2020) for data preprocessing and train/validation/test splits on CNN/DM and XSum, and adopt the setups in Paulus et al. (2018) for NYT, except that we keep entities and numbers. For experiments in this section, we create an analysis set of 1,000 random samples from the validation split of each dataset to reduce computational cost.

Content Selection Effects
First, we study the feasibility of using encoder-decoder attentions to inform content selection and subsequently boost summary informativeness. Concretely, we apply attention head masking based on oracle content selection labels (henceforth oracle masking). Oracle labels are constructed by aligning a reference summary to the source article, where we iteratively find the longest common subsequences between the two.
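This alignment can be sketched as follows, interpreting the iterative procedure as repeated longest-common-substring matching over tokens; this is a simplified reimplementation for illustration, not the authors' released code:

```python
def oracle_labels(source, reference):
    """Construct oracle salience labels by iteratively aligning the
    reference to the source: repeatedly take the longest common
    contiguous token sequence, mark those source tokens as salient,
    and consume the matched reference tokens."""
    labels = [0] * len(source)
    ref = list(reference)
    while True:
        best_len, best_s, best_r = 0, -1, -1
        for i in range(len(source)):
            for j in range(len(ref)):
                k = 0
                while (i + k < len(source) and j + k < len(ref)
                       and labels[i + k] == 0
                       and source[i + k] == ref[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_s, best_r = k, i, j
        if best_len == 0:
            break
        for k in range(best_len):
            labels[best_s + k] = 1
            ref[best_r + k] = None  # consumed; None never matches a token
    return labels
```

Matched reference tokens are consumed so that each reference span contributes at most one source span.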
Taking a fine-tuned BART model, we apply oracle masking on each head at each layer when decoding on the analysis set. The ROUGE score obtained in this setting is denoted as r_ora. We then apply uniform encoder-decoder attention weights over the source to build a baseline that mimics no content selection, inspired by Wiegreffe and Pinter (2019). This yields a ROUGE score of r_uni. The content selection effect per head can thus be calculated as the ROUGE improvement, i.e., r_ora − r_uni.
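The baseline and the per-head effect can be expressed as below; this is an illustrative sketch, with r_ora and r_uni standing in for ROUGE scores obtained from two full decoding runs:

```python
import numpy as np

def uniform_attention(V):
    """No-content-selection baseline: replace a head's encoder-decoder
    attention weights with a uniform distribution over source positions
    (after Wiegreffe and Pinter, 2019)."""
    n = V.shape[0]
    weights = np.full(n, 1.0 / n)
    return weights @ V, weights

def content_selection_effect(r_ora, r_uni):
    """Per-head content selection effect as the ROUGE improvement."""
    return r_ora - r_uni
```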
Overall, it is more effective to constrain attentions to salient content at the top layers, according to the results on CNN/DM in Fig. 2. Specifically, with oracle masking, the top layer yields the most ROUGE-1 improvement. We observe similar trends on NYT and XSum (figures are in Appendix C). This indicates the feasibility of leveraging attention head masking to improve summary informativeness.

Synergism Analysis
Next, we study whether masking multiple heads can further boost content selection and whether they form synergy. On the left of Fig. 3, we show content selection effect by gradually applying oracle masking on more heads at each layer, with heads sorted based on individual ROUGE improvements. Notably, the most ROUGE-1 improvement is achieved by masking 15 (out of 16) heads at the top layer, suggesting a strong collaborative effect on content selection by masking multiple heads. We further compare the ROUGE score gains between oracle masking on all heads and the sum of individual effects, illustrated on the right of Fig. 3. The discrepancies between the two values suggest that the heads may not be independent at pinpointing salient content. In Appendix D, we reach similar results on NYT and XSum.
Based on the above observations, we argue that it is necessary to select layers and heads accordingly to achieve the best content selection effect, with more summarization results reported in § 5.

Attention Focus
We further provide a fine-grained study of what types of words the heads attend to. Concretely, we consider each word generated during decoding, denoted as y. Given an attention head, we follow the highest attention weight to identify the input word x (the "attendee"). We study several categories of attendee x: (1) a word in the reference (SALIENT); (2) a CONTENT word; (3) the FIRST and LAST words in the document. For SALIENT and CONTENT, we further consider two subcategories: x = y (COPY) and x ≠ y (NON-COPY). We then tally the occurrences of each type of attendee per head at each layer on the analysis set.
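The categorization can be sketched as follows; the stopword list used to define CONTENT words here is a hypothetical stand-in for whatever function-word list is actually used:

```python
# Illustrative stopword list; the precise definition of CONTENT words
# is an assumption in this sketch.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "was"}

def attendee_categories(x, y, source, reference, stopwords=STOPWORDS):
    """Categorize the attendee x, i.e., the source word receiving the
    highest attention weight while generating the output word y."""
    cats = set()
    if x in reference:
        cats.add("SALIENT-COPY" if x == y else "SALIENT-NONCOPY")
    if x not in stopwords:
        cats.add("CONTENT-COPY" if x == y else "CONTENT-NONCOPY")
    if x == source[0]:
        cats.add("FIRST")
    if x == source[-1]:
        cats.add("LAST")
    return cats
```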
We show the percentages of COPY and NON-COPY SALIENT attendees, COPY CONTENT attendees, and FIRST attendees on CNN/DM in Fig. 4. As can be seen, top layers tend to focus on input tokens that will be generated as is, while bottom layers attend to salient words that are not used for the current generation step. Additionally, bottom layers frequently attend to the first word of the document.

Summarization Results with Attention Head Masking
In this section, we show how to leverage attention head masking and a content selector to improve summary informativeness on the three datasets. We first train a binary sequence tagger for each dataset to label salient tokens in the source, which are used as system masks for attention heads. Our sequence tagger is a RoBERTa (Liu et al., 2019) encoder followed by a two-layer multilayer perceptron (MLP) with a hyperbolic tangent activation in between. To obtain the probability for each token, the MLP output is further fed into a sigmoid activation function. Details for training and decoding are in Appendix A. The decision boundary for the sequence tagger is selected according to the F1 score calculated between the predicted tags and the ground-truth labels on the validation set. We search for the best decision boundary from 0.1 to 0.4, with a step size of 0.01. The final decision boundaries used for taggers trained on CNN/DM, NYT, and XSum are 0.20, 0.24, and 0.18, achieving ROUGE-1 F1 of 43.70, 44.10, and 31.56, respectively.

To select which heads at which layers to mask, we employ a greedy selection strategy. On the analysis set, we gradually apply system masking on the four heads with the most ROUGE improvement according to the study in § 4.1, and we select the heads that achieve the highest sum of ROUGE-1 F1 and ROUGE-2 F1. We apply four heads at a time to reduce the computational cost of hyperparameter search. Heads selected for each dataset are listed in Appendix B.
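The greedy selection loop can be sketched as follows; score_fn is a stand-in for decoding the analysis set with the given heads masked and summing ROUGE-1 and ROUGE-2 F1:

```python
def greedy_head_selection(ranked_heads, score_fn, group=4):
    """Greedy head selection: grow the masked-head set four heads at a
    time in order of individual ROUGE gain, and keep the prefix whose
    validation score (in the paper, ROUGE-1 F1 + ROUGE-2 F1 on the
    analysis set) is highest."""
    best_set, best_score = [], score_fn([])
    current = []
    for i in range(0, len(ranked_heads), group):
        current = current + ranked_heads[i:i + group]
        score = score_fn(current)
        if score > best_score:
            best_set, best_score = list(current), score
    return best_set, best_score
```

Evaluating only prefixes of the ranked list keeps the number of decoding runs linear in the number of heads rather than exponential in the subsets.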
In-domain Results. Table 1 shows that applying our attention head masking technique to BART yields significantly better results on CNN/DM and NYT, compared to several top-performing abstractive summarization models trained with large Transformers. The improvement is more pronounced on CNN/DM than on the other two datasets. We believe this is due to the difference in abstractiveness among the three datasets: CNN/DM has more extractive summaries than the other datasets (Grusky et al., 2018), suggesting attention head masking is more effective on extractive datasets. Notably, PEGASUS is pre-trained on 3.8TB of news articles, while the BART model used in our work is pre-trained on only 160GB of a combination of news, books, stories, and web text. The large size of the pre-training data might be a major contributor to the better performance of PEGASUS on XSum.
For human evaluation, we hire three fluent English speakers to rate 50 pairs of summaries generated with and without attention head masking based on BART for informativeness and faithfulness. Informativeness measures how well the summary captures salient content from the article, while faithfulness indicates whether the summary correctly reflects the content in the source article. The annotators are asked to determine whether attention head masking improves either of the two aspects. As shown in Table 2, where all ratings by the three judges are considered, summaries generated with attention head masking are considered to have better informativeness, but no substantial improvement in faithfulness is observed.
Limited Training Data. Next, we study whether our masking technique is still effective given limited training samples. We use the limited training samples to train both the summarizer and the content selector. As can be seen in Fig. 5, with less than 20% of the training data, BART with attention head masking already outperforms the model fine-tuned on the full CNN/DM training set.

Conclusion
We propose attention head masking that constrains encoder-decoder attention heads to attend to salient tokens, to inform content selection in abstractive summarization. With this technique, we first demonstrate the relation between encoder-decoder attentions and content selection behaviors. With system masks predicted by external content selectors, we show that attention head masking can consistently improve ROUGE scores over competitive summarization models on three benchmarks. Summaries generated with attention head masking are also preferred by human judges more frequently. Additional experiments demonstrate that our method is more data-efficient and effective on both in-domain and cross-domain settings.

A Training and Decoding Settings
When training the sequence taggers, we minimize the average binary cross-entropy between each token's selection probability and the ground-truth label. The parameters of the RoBERTa encoder are fixed. We set the learning rate to 5 × 10^-4 and the batch size to 128. Unless specified otherwise, all the models in this paper are trained with the Adam optimizer (Kingma and Ba, 2015), and training is stopped if there is no improvement on the validation set for 2 consecutive epochs. For BART models, we follow the instructions provided by Fairseq (Ott et al., 2019) to set the training hyperparameters on CNN/DM and XSum. We use the same hyperparameters for CNN/DM and NYT, except that we adopt a linear learning rate decay of 30,000 steps in total for NYT.
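The tagger's classification head and training loss can be sketched in NumPy; the RoBERTa encoder is replaced by placeholder hidden states H, and all shapes are illustrative:

```python
import numpy as np

def tagger_probs(H, W1, b1, W2, b2):
    """Salience probability per token: a two-layer MLP with tanh in
    between over frozen encoder states H (seq_len, d), followed by a
    sigmoid on the output."""
    z = np.tanh(H @ W1 + b1)             # (seq_len, hidden)
    logits = (z @ W2 + b2).squeeze(-1)   # (seq_len,)
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(probs, labels):
    """Average binary cross-entropy against ground-truth salience labels."""
    eps = 1e-12
    p = np.clip(probs, eps, 1 - eps)
    return float(np.mean(-(labels * np.log(p) + (1 - labels) * np.log(1 - p))))
```

Tokens whose probability exceeds the tuned decision boundary (e.g., 0.20 on CNN/DM) are tagged as salient.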
At test time, we use beam sizes of 5, 5, and 6 for CNN/DM, NYT, and XSum, respectively. To reduce computational cost, we use a beam size of 1 for our analysis experiments on all datasets. The length penalties are 2.0, 1.5, and 1.0 for CNN/DM, NYT, and XSum, following Lewis et al. (2020). We set the minimal and maximal lengths during decoding as 55 and 140 for CNN/DM, 0 and 140 for NYT, and 10 and 60 for XSum.

B Head Selection
For CNN/DM, we apply masking to all heads at layer 1. The ROUGE-1/2/L F1 on the analysis set are 36.43/16.02/33.59.

C Content Selection Effects on XSum and NYT
The content selection effects for BART models fine-tuned on XSum and NYT, measured by the ROUGE improvement from the uniform attention weight setting to the oracle masking setting, are shown in Fig. 6. On all three datasets, it is more effective to constrain attentions to salient content at the top layers. In particular, the top layer yields the most ROUGE-1 improvement. Moreover, the ROUGE improvement by a specific head varies across datasets.

D Additional Results for Synergism Analysis
We show the synergism analysis for models fine-tuned on XSum and NYT in Fig. 7. They both echo the observation on CNN/DM that multiple heads have strong collaborative effects and that heads may not be independent at pinpointing different salient content.

E Attention Focus
We show the percentages of each type of attendees on the analysis sets of XSum, NYT, and CNN/DM in Fig. 8, Fig. 9, and Fig. 10, respectively. We find that heads have similar focus for salient words, content words, and the last word across different datasets. Interestingly, the attention focus for the first word on NYT is different from other datasets. On NYT, many articles start with all capitalized words, which might become the focus of some heads.