Self-Attention Guided Copy Mechanism for Abstractive Summarization

The copy module has been widely adopted in recent abstractive summarization models, facilitating the decoder to extract words from the source text into the summary. Generally, the encoder-decoder attention serves as the copy distribution, but how to guarantee that important words in the source are copied remains a challenge. In this work, we propose a Transformer-based model to enhance the copy mechanism. Specifically, we identify the importance of each source word based on its degree centrality in a directed graph built from the self-attention layer of the Transformer. We use the centrality of each source word to guide the copy process explicitly. Experimental results show that the self-attention graph provides useful guidance for the copy distribution. Our proposed models significantly outperform the baseline methods on the CNN/Daily Mail dataset and the Gigaword dataset.


Introduction
The explosion of information has expedited the rapid development of text summarization technology, which helps us quickly grasp the key points buried in miscellaneous information. There are broadly two types of summarization methods: extractive and abstractive. Extractive approaches select segments of the input text to form a summary, while abstractive approaches "create" novel sentences using natural language generation techniques.
Source: two u.s. senators are blocking 11 of president barack obama 's nominees for senior administration posts at the pentagon and justice department in protest over a proposal to house guantanamo detainees at the fort leavenworth prison in their midwestern home state of kansas
Reference: us senators bar obama nominees protest guantanamo
Transformer: us senators block pentago justice nominees
Transformer + Copy: us senators block 11 from pentago justice posts
Transformer + Guided Copy: us senators block obama nominees over guantanamo
Top Words from Self-attention: nominees, obama, senators, pentagon, guantanamo

Table 1: Yellow shades represent overlap with the reference. The summary generated by the standard copy mechanism misses some important words, such as "obama" and "nominees".

Transformer-based models (Vaswani et al., 2017) have recently shown great success on many NLP tasks, including machine translation (Vaswani et al., 2017; Dehghani et al., 2019), sentence classification (Devlin et al., 2019; Cohan et al., 2019), and text summarization (Song et al., 2019). One of the most successful frameworks for the summarization task is the Pointer-Generator Network (See et al., 2017), which combines extractive and abstractive techniques with a pointer (Vinyals et al., 2015) that enables the model to copy words from the source text directly. Although the copy mechanism has been widely used in summarization, how to guarantee that important tokens in the source are copied remains a challenge. In our experiments, we find that a Transformer-based summarization model with the copy mechanism may miss some important words. As shown in Table 1, words like "nominees" and "obama" are ignored by the standard copy mechanism. To tackle this problem, we look for clues about the importance of words in the self-attention graph.
We propose a Self-Attention Guided Copy mechanism (SAGCopy) that aims to encourage the summarizer to copy important source words. The self-attention layer in the Transformer (Vaswani et al., 2017) builds a directed graph whose vertices represent the source words and whose edges are weighted by the relevance score between each pair of source words, computed via dot-product attention (Vaswani et al., 2017) between the query Q and the key K. We calculate the centrality of each source word based on the resulting adjacency matrices. A straightforward method is the TextRank algorithm (Mihalcea and Tarau, 2004), which assumes that a word receiving more relevance score from other words is more likely to be important. This measure is known as indegree centrality. We also adopt another measure, outdegree centrality, which assumes that a word sending out more relevance score to other words is more likely to be critical.
We utilize the centrality score as guidance for the copy distribution. Specifically, we extend the dot-product attention to a centrality-aware function. Furthermore, we introduce an auxiliary loss computed from the divergence between the copy distribution and the centrality distribution, which encourages the model to focus on important words.
Our contribution is threefold:
• We present a guided copy mechanism based on source word centrality, which is obtained by indegree or outdegree centrality measures.
• We propose a centrality-aware attention and a guidance loss to encourage the model to pay attention to important source words.
• We achieve state-of-the-art results on public text summarization datasets.

Related Work
Neural network based models (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2017; Tan et al., 2017; Gehrmann et al., 2018; Zhu et al., 2019; Li et al., 2020b,a) achieve promising results for abstractive text summarization. The copy mechanism (Gu et al., 2016; See et al., 2017; Zhou et al., 2018) equips summarizers with the ability to copy from the source into the target via pointing (Vinyals et al., 2015). Recently, pre-training based methods (Devlin et al., 2019; Radford et al., 2018) have attracted growing attention and achieved state-of-the-art performance on many NLP tasks, and pre-trained encoder-decoder Transformers (Song et al., 2019; Dong et al., 2019; Lewis et al., 2019; Xiao et al., 2020; Bao et al., 2020) have shown great success on the summarization task. In this work, we explore the copy module upon a Transformer-based summarization model.

Background
We first introduce the copy mechanism. In Pointer-Generator Networks (PGNet) (See et al., 2017), the source text x is fed into a bidirectional LSTM (BiLSTM) encoder, producing a sequence of encoder hidden states h:

h_i = \mathrm{BiLSTM}(x_i, h_{i-1})

On each step t, a unidirectional LSTM decoder receives the word embedding of the previous word and produces the decoder state s_t:

s_t = \mathrm{LSTM}(s_{t-1}, y_{t-1}, c_t)

where c_t is a context vector generated from the attention distribution (Bahdanau et al., 2015):

e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}), \quad \alpha^t = \mathrm{softmax}(e^t), \quad c_t = \sum_i \alpha_i^t h_i

The vocabulary distribution P_{vocab} over all words in the target vocabulary is calculated as:

P_{vocab} = \mathrm{softmax}(V'(V[s_t; c_t] + b) + b')

By incorporating a generating-copying switch p_{gen} \in [0, 1], the final probability of the ground-truth target word y_t is:

P(y_t) = p_{gen} P_{vocab}(y_t) + (1 - p_{gen}) P_{copy}(y_t)

p_{gen} = \mathrm{sigmoid}(w_a^\top c_t + u_a^\top s_t + v_a^\top y_{t-1})

The copy distribution P_{copy} determines where to attend at time step t. In most previous work, the encoder-decoder attention weight \alpha^t serves as the copy distribution (See et al., 2017):

P_{copy}(y_t) = \sum_{i: x_i = y_t} \alpha_i^t

The loss function L is the average negative log-likelihood of the ground-truth target word y_t over all time steps:

L = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t)

Figure 1: The framework of our proposed model. Based on the encoder self-attention graph, we calculate the centrality score for each source word to guide the copy module.
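To make the generate-copy mixture above concrete, the following is a minimal NumPy sketch of the final distribution; the helper name final_distribution and the variable names are ours, not PGNet's.

```python
import numpy as np

def final_distribution(p_vocab, attn, src_ids, p_gen):
    """Mix the generation and copy distributions as in the PGNet
    formulation above.

    p_vocab -- (V,) softmax over the target vocabulary
    attn    -- (S,) encoder-decoder attention over source positions,
                    used directly as the copy distribution
    src_ids -- (S,) vocabulary id of each source token
    p_gen   -- scalar generating-copying switch in [0, 1]
    """
    p_final = p_gen * p_vocab
    # Scatter-add copy probability onto each source token's vocabulary id,
    # so repeated source words accumulate mass (the sum over i : x_i = y_t).
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attn)
    return p_final
```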

Model
In this section, we present our approach to enhance the copy mechanism. First, we briefly describe the Transformer model with the copy mechanism. Then, we introduce two methods to calculate the centrality scores for the source words based on the encoder self-attention layer. Finally, we incorporate the centrality score into the copy distribution and the loss function. The framework of our model is shown in Figure 1.

Transformer with the Copy Mechanism
Scaled dot-product attention (Vaswani et al., 2017) is widely used in self-attention networks:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

where d_k is the number of columns of the query matrix Q, the key matrix K, and the value matrix V. We take the encoder-decoder attention in the last decoder layer as the copy distribution:

\alpha^t = \mathrm{softmax}\left(\frac{Q_t K^\top}{\sqrt{d_k}}\right), \quad P_{copy}(y_t) = \sum_{i: x_i = y_t} \alpha_i^t

Note that for multi-head attention, we obtain the copy distribution by summing over the heads.
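As a reference point for the two formulas above, here is a NumPy sketch; renormalizing after the sum over heads is our assumption, since the exact normalization is not specified in the text.

```python
import numpy as np

def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)) over a stack of heads.
    Q: (H, T, d_k) queries, K: (H, S, d_k) keys -> (H, T, S) weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def copy_attention(Q, K):
    """Copy distribution from the last decoder layer: sum the encoder-decoder
    attention over heads, then renormalize each decoding step (the
    renormalization step is our assumption)."""
    summed = attention_weights(Q, K).sum(axis=0)  # (T, S)
    return summed / summed.sum(axis=-1, keepdims=True)
```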

Self-Attention-Based Centrality
We introduce two approaches, i.e., indegree centrality and outdegree centrality, to calculate the centrality score for each source word based on the last encoder self-attention layer of the Transformer.
Centrality measures were proposed to investigate the importance of nodes in social networks (Freeman, 1978; Bonacich, 1987; Borgatti and Everett, 2006; Kiss and Bichler, 2008; Li et al., 2011). Degree centrality is one of the simplest such measures and can be divided into indegree centrality and outdegree centrality (Freeman, 1978), determined by the edges coming into and leaving a node, respectively.
Indegree centrality of a word is proportional to the total relevance score incoming from other words, and can be measured by the sum of the incoming scores or by graph-based extractive summarization methods (Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Zheng and Lapata, 2019).

Outdegree centrality of a word is proportional to the total relevance score outgoing to other words, and can be computed as the sum of the outgoing scores.
Formally, let G = (V, D) be a directed graph representing self-attention, where the vertex set V contains the source words and the edge weight D_{i,j} is the encoder self-attention weight from the word x_i to the word x_j, with \sum_i D_{i,j} = 1. Next, we introduce the approaches to calculate word centrality on the graph G.

We first construct a transition probability matrix T by renormalizing each word's outgoing relevance and transposing, so that T_{i,j} is the probability of transitioning from word x_j to word x_i:

T_{i,j} = \frac{D_{j,i}}{\sum_k D_{j,k}}

A basic indegree centrality sums the incoming transition probabilities of each word:

score_i = \sum_j T_{i,j}

Alternatively, TextRank (Mihalcea and Tarau, 2004), which is inspired by the PageRank algorithm (Page et al., 1999), calculates the indegree centrality of the source words iteratively via a Markov chain:

score = T \cdot score

where score_i is the indegree centrality score of vertex V_i, with the initial score set to 1/|V|. We can obtain a stationary indegree centrality distribution by computing score = T \cdot score iteratively; we take at most three iterations in our implementation. Outdegree centrality measures how much a word x_i contributes to the other words in the directed graph:

score_i = \sum_j D_{i,j}

Next, we incorporate the source word centrality score into the decoding process.
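Before moving to the decoder, here is a compact NumPy sketch of both centrality measures as reconstructed above; the helper name and the keyword argument are ours.

```python
import numpy as np

def centrality_scores(D, indegree_iters=1):
    """Indegree and outdegree centrality from a self-attention matrix D,
    where D[i, j] is the relevance sent from word x_i to word x_j.

    indegree_iters=1 yields the basic indegree centrality; larger values
    run more TextRank-style iterations (at most three in the paper).
    """
    n = D.shape[0]
    # Transition matrix: renormalize each word's outgoing relevance,
    # then transpose so T[i, j] is the probability of moving from j to i.
    T = (D / D.sum(axis=1, keepdims=True)).T
    score = np.full(n, 1.0 / n)   # uniform initialization over |V| words
    for _ in range(indegree_iters):
        score = T @ score         # the Markov-chain update: score = T . score
    outdegree = D.sum(axis=1)     # total relevance each word sends out
    return score, outdegree
```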

Guided Copy Mechanism
The motivation is that word centrality indicates the salience of the source words, providing prior knowledge that can guide the copy module to focus on important source words.
We use the word centrality score as an extra input when calculating the copy distribution, extending the dot-product attention to a centrality-aware function:

e_i^t = \frac{Q_t K_i^\top}{\sqrt{d_k}} + \lambda \cdot score_i, \quad \alpha^t = \mathrm{softmax}(e^t)

where score_i is the indegree or outdegree centrality score of the i-th word in the source text and \lambda is a scaling factor. The above implementation can be referred to as centrality-aware dot-product attention. Moreover, we expect important source words to draw enough encoder-decoder attention. Thus, we adopt a centrality-aware auxiliary loss that encourages consistency between the overall copy distribution \bar{\alpha}, i.e., the copy attention averaged over all decoding steps, and the word centrality distribution, based on the Kullback-Leibler (KL) divergence:

L_{guide} = D_{KL}(score \parallel \bar{\alpha}) = \sum_i score_i \log \frac{score_i}{\bar{\alpha}_i}

This guidance loss is added to the negative log-likelihood loss during training.
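The following NumPy sketch mirrors this reconstruction; the scaling factor lam and the direction of the KL term follow the equations above and are our assumptions rather than confirmed details of the paper.

```python
import numpy as np

def guided_copy_attention(q, K, score, lam=1.0):
    """Centrality-aware dot-product attention for one decoding step:
    the (scaled) centrality score is added to the attention logits.

    q     -- (d_k,)   decoder query at step t
    K     -- (S, d_k) encoder keys
    score -- (S,)     centrality distribution over source words
    lam   -- scaling factor (treated here as a hyperparameter)
    """
    logits = K @ q / np.sqrt(K.shape[-1]) + lam * score
    logits -= logits.max()                # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def guidance_loss(copy_attn, score, eps=1e-12):
    """KL(score || mean copy attention): large when high-centrality words
    receive little overall copy attention.

    copy_attn -- (T, S) copy distribution at every decoding step
    score     -- (S,)   centrality distribution over source words
    """
    overall = copy_attn.mean(axis=0)      # overall copy distribution
    return float(np.sum(score * np.log((score + eps) / (overall + eps))))
```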

Experimental Setting
We evaluate our model on the CNN/Daily Mail dataset (Hermann et al., 2015) and the Gigaword dataset (Rush et al., 2015). Our experiments are conducted on 4 NVIDIA P40 GPUs. We adopt a 6-layer encoder and a 6-layer decoder with 12 attention heads and d_model = 768. Byte Pair Encoding (BPE) (Sennrich et al., 2016) word segmentation is used for data pre-processing. We warm-start the model parameters with the MASS pre-trained base model and train for about 10 epochs until convergence. During decoding, we use beam search with a beam size of 5.

Experimental Results
We compare our proposed Self-Attention Guided Copy (SAGCopy) model with the following baseline models.
Lead-3 uses the first three sentences of the article as its summary.
Bottom-Up (Gehrmann et al., 2018) is a sequence-to-sequence model augmented with a bottom-up content selector.
MASS (Song et al., 2019) is a sequence-to-sequence pre-trained model based on the Transformer.
ABS (Rush et al., 2015) relies on a CNN encoder and an NNLM decoder.
SEASS (Zhou et al., 2017) controls the information flow from the encoder to the decoder with a selective encoding strategy.
SeqCopyNet (Zhou et al., 2018) extends the copy mechanism so that it can copy entire sequences from the source.
We adopt the ROUGE (RG) F1 score (Lin, 2004) as the evaluation metric. As shown in Table 2 and Table 3, SAGCopy with either outdegree or indegree centrality based guidance significantly outperforms the baseline models, which demonstrates the effectiveness of the self-attention guided copy mechanism. The basic indegree centrality (Indegree-1) is the most favorable choice considering both the ROUGE score and the computational complexity.
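For reference, per-sample ROUGE F1 can be computed with the rouge-score package, a common implementation of Lin (2004); this usage sketch is illustrative and not necessarily the paper's exact evaluation pipeline.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "us senators bar obama nominees protest guantanamo",   # reference
    "us senators block obama nominees over guantanamo",    # system output
)
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```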
Besides the ROUGE evaluation, we further investigate the guidance from the view of the loss function. For each sample in the Gigaword test set, we measure the KL divergence between the centrality score and the copy distribution, together with the ROUGE-1 and ROUGE-2 scores. Figure 2 demonstrates that a lower KL divergence yields a higher ROUGE score, showing that our loss function is reasonable. Additionally, we visualize the self-attention weights learned by our model in Figure 3, which illustrates the guidance process.

Table 2: ROUGE F1 scores on the CNN/Daily Mail dataset. Results with * are taken from the corresponding papers. Indegree-i denotes the indegree centrality obtained by TextRank with i iterations. Note that Indegree-1 is the basic indegree centrality, which is equivalent to TextRank with one iteration.

Human Evaluation
We conduct human evaluations to measure the quality of the summaries in terms of importance and readability.
We randomly selected 100 samples from the Gigaword test set. The annotators are required to compare two model summaries that are presented anonymously. The results in Table 4 show that SAGCopy significantly outperforms MASS+Copy in terms of Importance and is comparable in terms of Readability.

Figure 3: The guidance process of the SAGCopy Indegree model, with and without guidance, showing that the keyword "northern" is correctly copied by our model.

Conclusion
In this paper, we propose the SAGCopy summarization model that acquires guidance signals for the copy mechanism from the encoder self-attention graph. We first calculate the centrality score for each source word. Then, we incorporate the importance score into the copy module. The experimental results show the effectiveness of our model. For future work, we intend to apply our method to other Transformer-based summarization models.