Contrastive Attention Mechanism for Abstractive Sentence Summarization

We propose a contrastive attention mechanism to extend the sequence-to-sequence framework for the abstractive sentence summarization task, which aims to generate a brief summary of a given source sentence. The proposed contrastive attention mechanism accommodates two categories of attention: one is the conventional attention that attends to relevant parts of the source sentence, the other is the opponent attention that attends to irrelevant or less relevant parts of the source sentence. The two attentions are trained in opposite ways, so that the contribution from the conventional attention is encouraged and the contribution from the opponent attention is discouraged through a novel softmax and softmin functionality. Experiments on benchmark datasets show that the proposed contrastive attention mechanism is more focused on the parts relevant to the summary than the conventional attention mechanism, and greatly advances the state-of-the-art performance on the abstractive sentence summarization task. We release the code at https://github.com/travel-go/Abstractive-Text-Summarization.


Introduction
Abstractive sentence summarization aims at generating concise and informative summaries based on the core meaning of source sentences. Previous endeavors tackle the problem through either rule-based methods (Dorr et al., 2003) or statistical models trained on relatively small-scale training corpora (Banko et al., 2000). Following its successful application to machine translation (Sutskever et al., 2014; Bahdanau et al., 2015), the sequence-to-sequence framework has also been applied to the abstractive sentence summarization task using large-scale sentence summary corpora (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016), obtaining better performance than the traditional methods.
* Equal contribution.
One central component in state-of-the-art sequence to sequence models is the use of attention for building connections between the source sequence and target words, so that a more informed decision can be made for generating a target word by considering the most relevant parts of the source sequence (Bahdanau et al., 2015;Vaswani et al., 2017). For abstractive sentence summarization, such attention mechanisms can be useful for selecting the most salient words for a short summary, while filtering the negative influence of redundant parts.
We consider improving abstractive summarization quality by enhancing target-to-source attention. In particular, a contrastive mechanism is taken, by encouraging the contribution from the conventional attention that attends to relevant parts of the source sentence, while at the same time penalizing the contribution from an opponent attention that attends to irrelevant or less relevant parts. Contrastive attention was first proposed in computer vision (Song et al., 2018a), which is used for person re-identification by attending to person and background regions contrastively. To our knowledge, we are the first to use contrastive attention for NLP and deploy it in the sequence-to-sequence framework.
In particular, we take Transformer (Vaswani et al., 2017) as the baseline summarization model, and enhance it with a proponent attention module and an opponent attention module. The former acts as the conventional attention mechanism, while the latter can be regarded as a dual module to the former, with similar weight calculation structure, but using a novel softmin function to discourage contributions from irrelevant or less relevant words.
To our knowledge, we are the first to investigate Transformer as a sequence to sequence summarizer. Results on three benchmark datasets show that it gives highly competitive accuracies compared with RNN and CNN alternatives. When equipped with the proposed contrastive attention mechanism, our Transformer model achieves the best reported results on all data. The visualization of attentions shows that through using the contrastive attention mechanism, our attention is more focused on relevant parts than the baseline. We release our code at XXX.

Related Work
Automatic summarization has been investigated in two main paradigms: the extractive method and the abstractive method. The former extracts important pieces of the source document and concatenates them sequentially (Jing and McKeown, 2000; Knight and Marcu, 2000; Neto et al., 2002), while the latter grasps the core meaning of the source text and restates it in a short text as an abstractive summary (Banko et al., 2000; Rush et al., 2015). In this paper, we focus on abstractive summarization, and especially on abstractive sentence summarization. Previous work deals with the abstractive sentence summarization task by using rule-based methods (Dorr et al., 2003), statistical methods that use a source-summary parallel corpus to train a machine translation model (Banko et al., 2000), or a syntax-based transduction model (Cohn and Lapata, 2008; Woodsend et al., 2010).
In recent years, the sequence-to-sequence neural framework has become predominant on this task, encoding long source texts and decoding them into short summaries together with the attention mechanism. RNN is the most commonly adopted and extensively explored architecture (Chopra et al., 2016; Li et al., 2017). A CNN-based architecture was recently employed by Gehring et al. (2017) using ConvS2S, which applies CNN to both the encoder and the decoder. Later, Wang et al. (2018) build upon ConvS2S with topic word embedding and encoding, and train the system with reinforcement learning.
The work most related to our contrastive attention mechanism is in the field of computer vision. Song et al. (2018a) first propose the contrastive attention mechanism for person re-identification. In their work, based on a pre-provided person and background segmentation, the two regions are contrastively attended so that they can be easily discriminated. In comparison, we apply the contrastive attention mechanism to sentence-level summarization by contrastively attending to relevant parts and to irrelevant or less relevant parts. Furthermore, we propose a novel softmax and softmin functionality to train the attention mechanism, which differs from Song et al. (2018a), who use a mean squared error loss for attention training.
Other explorations with respect to the characteristics of the abstractive summarization task include the copying mechanism that copies words from source sequences for composing summaries (Gu et al., 2016; Song et al., 2018b), the selection mechanism that elaborately selects important parts of source sentences (Zhou et al., 2017; Lin et al., 2018), the distraction mechanism that avoids repeated attention on the same area (Chen et al., 2016), and sequence-level training that avoids exposure bias in teacher-forcing methods (Ayana et al., 2016; Li et al., 2018; Edunov et al., 2018). Such methods are built on conventional attention, and are orthogonal to our proposed contrastive attention mechanism.

Approach
We use two categories of attention for summary generation. One is the conventional attention that attends to relevant parts of the source sentence; the other is the opponent attention that, contrarily, attends to irrelevant or less relevant parts. Both categories of attention output probability distributions over summary words, which are jointly optimized by encouraging the contribution from the conventional attention and discouraging the contribution from the opponent attention. Figure 1 illustrates the overall network. We use the Transformer architecture as our basis, upon which we build the contrastive attention mechanism. The left part is the original Transformer. We derive the opponent attention from the conventional attention, which is the encoder-decoder attention of the original Transformer, and stack several layers on top of the opponent attention, as shown in the right part of Figure 1. Both parts contribute to the summary generation by producing their own probability distributions over the target vocabulary. The left part outputs the conventional probability based on the conventional attention, as the original Transformer does, while the right part outputs the opponent probability based on the opponent attention. The two probabilities in Figure 1 are jointly optimized in a novel way, as explained in Section 3.3.

Transformer for Abstractive Sentence Summarization
Transformer is an attention network based sequence-to-sequence architecture (Vaswani et al., 2017), which encodes the source text into hidden vectors and decodes into the target text based on the source side information and the target generation history. In comparison to the RNN based architecture and the CNN based architecture, both the encoder and the decoder of Transformer adopt attention as main function.
Let X and Y denote the source sentence and its summary, respectively. Transformer is trained to maximize the probability of Y given X:
$$P(Y \mid X) = \prod_{i} P_c(y_i \mid y_1^{i-1}, X),$$
where $P_c(y_i \mid y_1^{i-1}, X)$ is the conventional probability of the current summary word $y_i$ given the source sentence and the summary generation history. $P_c$ is computed based on the attention mechanism and the stacked deep layers, as shown in the left part of Figure 1.

Attention Mechanism
Scaled dot-product attention is applied in Transformer:
$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
where Q, K, and V denote the query vector, key vectors, and value vectors, respectively, and $d_k$ denotes the dimension of one vector of K. The softmax function outputs the attention weights distributed over V. attention(Q, K, V) is the weighted sum of the elements of V, and represents the current context information.
We focus on the encoder-decoder attention, which builds the connection between source and target by informing the decoder which area of the source text should be attended to. Specifically, in the encoder-decoder attention, Q is the single vector coming from the current position of the decoder, K and V are the same sequence of vectors that are the outcomes of the encoder at all source positions. Softmax function distributes the attention weights over the source positions.
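As a rough illustration of the encoder-decoder attention just described, the following minimal sketch computes the attention weights and the context vector for a single decoder position. It uses plain Python lists instead of a tensor library, and the function names and toy vectors are assumptions made for this example only, not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encoder_decoder_attention(q, keys, values):
    """Single-head scaled dot-product attention for one decoder position.

    q is the query vector; keys and values hold one vector per source position.
    Returns (attention weights over source positions, context vector).
    """
    d_k = len(keys[0])
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, context

# Toy example with made-up numbers: three source positions, d_k = 2.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = keys  # K and V are the same encoder outputs in encoder-decoder attention
weights, context = encoder_decoder_attention(q, keys, values)
```

The source position whose key is most aligned with the query receives the largest weight, which is the behavior the opponent attention later inverts.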
The attentions in Transformer adopt the multi-head implementation, in which each head computes attention as in Equation (1) but with smaller Q, K, and V, whose dimensions are 1/h of their original dimensions. The attentions from the h heads are concatenated and linearly projected to compose the final attention. In this way, multi-head attention provides multiple views of attention behavior, which is beneficial for the final performance.
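The multi-head computation can be sketched as follows. This is a simplified illustration under stated assumptions: each head simply takes a contiguous slice of dimension d/h, whereas the original Transformer obtains the per-head Q, K, and V through learned projections, which are omitted here. The helper names and the matrix `w_out` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query; returns the context vector."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def split_heads(vec, h):
    """Cut a vector of dimension d into h contiguous chunks of dimension d/h."""
    size = len(vec) // h
    return [vec[i * size:(i + 1) * size] for i in range(h)]

def multi_head_attention(q, keys, values, h, w_out):
    """Run h heads independently, concatenate them, then apply the projection w_out."""
    q_heads = split_heads(q, h)
    k_heads = [split_heads(k, h) for k in keys]
    v_heads = [split_heads(v, h) for v in values]
    concat = []
    for i in range(h):
        head_keys = [k[i] for k in k_heads]
        head_values = [v[i] for v in v_heads]
        concat.extend(attend(q_heads[i], head_keys, head_values))
    # Linear projection of the concatenated heads (w_out has one row per output unit).
    return [sum(w * c for w, c in zip(row, concat)) for row in w_out]

# Toy example: d = 4, h = 2, identity output projection (illustrative values only).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
out = multi_head_attention([1.0, 0.0, 0.0, 1.0], keys, keys, 2, identity)
```

In the toy example the two heads attend to different source positions, which is the kind of diversity across heads that the head selection in the next section has to deal with.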

Deep Layers
The "N×" plates in Figure 1 stand for N stacked identical layers. On the source side, each of the N stacked layers contains two sublayers: the self-attention mechanism and the fully connected feed-forward network. Each sublayer employs a residual connection that adds the sublayer input to the sublayer output, and layer normalization is then applied to the result of the residual connection.
On the target summary side, each layer contains an additional sublayer of the encoder-decoder attention between the self-attention sublayer and the feed-forward sublayer. At the top of the decoder, the softmax layer is applied to convert the decoder output to summary word generation probabilities.
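The sublayer wiring described above (residual connection followed by layer normalization) can be sketched as below. This is a minimal sketch, assuming a layer normalization without learned gain and bias and treating each sublayer as an opaque function on a single vector; the function names are chosen for illustration.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def encoder_layer(x, self_attention, feed_forward):
    """Source side: self-attention, then feed-forward."""
    x = residual_sublayer(x, self_attention)
    return residual_sublayer(x, feed_forward)

def decoder_layer(y, self_attention, enc_dec_attention, feed_forward):
    """Target side: the encoder-decoder attention sits between the other two sublayers."""
    y = residual_sublayer(y, self_attention)
    y = residual_sublayer(y, enc_dec_attention)
    return residual_sublayer(y, feed_forward)

# Toy sublayers standing in for the real attention and feed-forward networks.
enc_out = encoder_layer([1.0, 2.0, 3.0], lambda x: x, lambda x: [0.0] * len(x))
dec_out = decoder_layer([1.0, 2.0, 3.0], lambda x: x, lambda x: x, lambda x: x)
```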

Opponent Attention
As illustrated in Figure 1, the opponent attention is derived from the conventional encoder-decoder attention. Since multi-head attention is employed in Transformer, there are N×h heads in total in the conventional encoder-decoder attention, where N denotes the number of layers and h denotes the number of heads in each layer. These heads exhibit diverse attention behaviors, which poses the challenge of determining the head from which to derive the opponent attention, so that it attends to irrelevant or less relevant parts. Figure 2 illustrates the attention weights of two sampled heads. The attention weights in (a) reflect the word-level relevance relation between the source sentence and the target summary well, while the attention weights in (b) do not. We find that this behavior of each head is fixed. For example, head (a) always exhibits the relevance relation across different sentences and different runs. Based on heatmaps of all heads for a few sentences, we choose the head that corresponds well to the relevance relation between source and target to derive the opponent attention¹.
¹ Given manual alignments between source and target of sampled sentence-summary pairs, we select the head that has the lowest alignment error rate (AER) of its attention weights.
Specifically, let $\alpha_c$ denote the conventional encoder-decoder attention weights of the head that is used for deriving the opponent attention:
$$\alpha_c = \mathrm{softmax}\left(\frac{qk^{T}}{\sqrt{d_k}}\right) \qquad (2)$$
where q and k are from the same head as $\alpha_c$. Let $\alpha_o$ denote the opponent attention weights. They are obtained by applying the opponent function to $\alpha_c$, followed by the softmax function:
$$\alpha_o = \mathrm{softmax}(\mathrm{opponent}(\alpha_c)) \qquad (3)$$
The opponent function in Equation (3) performs a masking operation: it finds the maximum weight in $\alpha_c$ and replaces it with negative infinity, so that the softmax function outputs zero at that position. The maximum weight in $\alpha_c$ thus becomes zero in $\alpha_o$ after the opponent and softmax functions. In this way, the most relevant part of the source sequence, which receives the maximum attention in the conventional attention weights $\alpha_c$, is masked and neglected in $\alpha_o$. Instead, the remaining less relevant or irrelevant parts are extracted into $\alpha_o$ for the following contrastive training and decoding.
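The masking opponent function can be illustrated with a few lines of Python. This is a sketch of the operation as described in the text, not the released implementation; the function names and the toy weights are assumptions, and ties for the maximum are broken by taking the first position.

```python
import math

def softmax(scores):
    """Softmax that tolerates -inf entries (math.exp(-inf) evaluates to 0.0)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def opponent(alpha_c):
    """Replace the maximum conventional attention weight with negative infinity."""
    masked = list(alpha_c)
    masked[masked.index(max(masked))] = -math.inf
    return masked

def opponent_attention_weights(alpha_c):
    """alpha_o = softmax(opponent(alpha_c)); needs at least two source positions."""
    return softmax(opponent(alpha_c))

# Toy conventional weights over three source positions (illustrative values).
alpha_c = [0.70, 0.20, 0.10]
alpha_o = opponent_attention_weights(alpha_c)
```

In this toy case the first position, which dominates `alpha_c`, receives exactly zero weight in `alpha_o`, and the remaining mass is redistributed over the other two positions.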
We also tried other methods to calculate the opponent attention weights, such as $\alpha_o = \mathrm{softmax}(1 - \alpha_c)$ (Song et al., 2018a) or $\alpha_o = \mathrm{softmax}(1/\alpha_c)$, which aim to make $\alpha_o$ contrary to $\alpha_c$, but they underperform the masking opponent function on all benchmark datasets. We therefore present only the masking opponent function in the following sections.
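For comparison, the two alternative opponent functions mentioned above can be sketched as follows. The small constant `eps` guarding the division is an assumption added for this example; the text does not say how zero weights were handled.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def opponent_one_minus(alpha_c):
    """alpha_o = softmax(1 - alpha_c)."""
    return softmax([1.0 - a for a in alpha_c])

def opponent_reciprocal(alpha_c, eps=1e-9):
    """alpha_o = softmax(1 / alpha_c); eps is a hypothetical guard against division by zero."""
    return softmax([1.0 / max(a, eps) for a in alpha_c])

alpha_c = [0.70, 0.20, 0.10]  # illustrative values
alt_minus = opponent_one_minus(alpha_c)
alt_recip = opponent_reciprocal(alpha_c)
```

Both variants reverse the ranking of the source positions, but unlike the masking function they still leave non-zero weight on the most relevant position.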
After $\alpha_o$ is obtained via Equation (3), the opponent attention is:
$$\mathrm{attention}_o = \alpha_o v$$
where v is from the same head as the q and k used in computing $\alpha_c$.
Compared to the conventional attention $\mathrm{attention}_c$, which summarizes the current relevant context, $\mathrm{attention}_o$ summarizes the current irrelevant or less relevant context. They constitute a contrastive pair and jointly contribute to the final summary word generation.

Opponent Probability
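The opponent probability $P_o(y_i \mid y_1^{i-1}, X)$ is computed by stacking several layers on top of $\mathrm{attention}_o$, with a softmin layer at the end, as shown in the right part of Figure 1. In particular,
$$z_1 = \mathrm{LayerNorm}(\mathrm{attention}_o) \qquad (4)$$
$$z_2 = \mathrm{FeedForward}(z_1) \qquad (5)$$
$$z_3 = \mathrm{LayerNorm}(z_1 + z_2) \qquad (6)$$
$$P_o(y_i \mid y_1^{i-1}, X) = \mathrm{softmin}(W z_3) \qquad (7)$$
where W is the matrix of the linear projection sublayer.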
$\mathrm{attention}_o$ contributes to $P_o$ via Equations (4)-(7) step by step. The LayerNorm and FeedForward layers with residual connection are similar to those of the original Transformer, while a novel softmin function is introduced at the end to invert the contribution from $\mathrm{attention}_o$:
$$\mathrm{softmin}(v_i) = \frac{e^{-v_i}}{\sum_j e^{-v_j}} \qquad (8)$$
where $v = W z_3$, i.e., the input vector to the softmin function in Equation (7). Softmin normalizes v so that the scores of all words in the summary vocabulary sum to one. The bigger $v_i$ is, the smaller $P_{o,i}$ is; softmin thus behaves contrarily to softmax. As a result, when we maximize $P_o(y_i = y \mid y_1^{i-1}, X)$, where y is the gold summary word, we effectively search for an appropriate $\mathrm{attention}_o$ that generates the lowest $v_g$, where g is the index of y in v. That is, the more irrelevant $\mathrm{attention}_o$ is to the summary, the lower the $v_g$ that can be obtained, resulting in a higher $P_o$.
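The inverting behavior of softmin can be checked with a small sketch. The code below assumes plain Python lists; `opponent_probability` only covers the final linear projection and softmin, and the toy matrix `W` and vector `z3` are made-up values.

```python
import math

def softmin(v):
    """softmin(v)_i = exp(-v_i) / sum_j exp(-v_j), i.e. softmax applied to -v."""
    m = min(v)  # shift by the minimum for numerical stability
    exps = [math.exp(-(x - m)) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def opponent_probability(z3, W):
    """Project z3 with W (one row per vocabulary word), then apply softmin."""
    v = [sum(w * z for w, z in zip(row, z3)) for row in W]
    return softmin(v)

# The smallest score receives the largest probability.
p = softmin([2.0, 0.0, 1.0])

# Toy projection over a three-word vocabulary (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_o = opponent_probability([1.0, 2.0], W)
```

Because a low score yields a high probability, maximizing the opponent probability of the gold word pushes its score from the opponent branch down, which is the effect described in the paragraph above.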

Training and Decoding
During training, we jointly maximize the conventional probability $P_c$ and the opponent probability $P_o$:
$$J = \log P_c(y_i \mid y_1^{i-1}, X) + \lambda \log P_o(y_i \mid y_1^{i-1}, X) \qquad (9)$$
where λ is the balancing weight. The conventional probability is computed as in the original Transformer, based on the feed-forward, linear projection, and softmax sublayers stacked over the conventional attention, as illustrated in the left part of Figure 1. The opponent probability is based on similar sublayers stacked over the opponent attention, but with softmin as the last sublayer, as illustrated in the right part of Figure 1.
Due to the contrary properties of softmax and softmin, jointly maximizing $P_c$ and $P_o$ actually maximizes the contribution from the conventional attention to summary word generation, while at the same time minimizing the contribution from the opponent attention. In other words, the training objective is to let the relevant parts attended by $\mathrm{attention}_c$ contribute more to the summarization, while letting the irrelevant or less relevant parts attended by $\mathrm{attention}_o$ contribute less.
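The joint objective can be written down directly from the two distributions. The sketch below is illustrative only: summing the per-word objective over summary positions is an assumption of this example (the text gives the objective for a single word), and the value of `lam` as well as the toy distributions are made up.

```python
import math

def word_objective(p_c, p_o, gold, lam):
    """J for one summary position: log P_c(gold) + lam * log P_o(gold)."""
    return math.log(p_c[gold]) + lam * math.log(p_o[gold])

def sentence_objective(steps, lam):
    """Sum of per-word objectives; steps is a list of (p_c, p_o, gold_index)."""
    return sum(word_objective(p_c, p_o, gold, lam) for p_c, p_o, gold in steps)

# Toy two-word vocabulary, one decoding step (illustrative values).
j_word = word_objective([0.5, 0.5], [0.25, 0.75], gold=0, lam=2.0)
j_sent = sentence_objective([([0.5, 0.5], [0.25, 0.75], 0),
                             ([0.9, 0.1], [0.8, 0.2], 0)], lam=2.0)
```

A training loop would maximize this quantity, or equivalently minimize its negative, with respect to the model parameters.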
During decoding, we search for the maximum J of Equation (9) in the beam search process.
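One way to use J as the beam search score is sketched below. This is a hedged illustration, not the released decoder: `step_fn` is a hypothetical callback that returns the two distributions for a given prefix, hypotheses are scored by accumulating J over positions, and all probabilities are assumed to be strictly positive.

```python
import heapq
import math

def beam_step(beams, step_fn, lam, beam_size):
    """Expand every hypothesis by one word and keep the beam_size best by accumulated J.

    beams is a list of (score, prefix) pairs; step_fn(prefix) returns (p_c, p_o),
    two probability distributions over the vocabulary.
    """
    candidates = []
    for score, prefix in beams:
        p_c, p_o = step_fn(prefix)
        for word in range(len(p_c)):
            j = math.log(p_c[word]) + lam * math.log(p_o[word])
            candidates.append((score + j, prefix + [word]))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

# Toy model that ignores the prefix (illustrative distributions only).
def toy_step(prefix):
    return [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]

beams = beam_step([(0.0, [])], toy_step, lam=1.0, beam_size=2)
```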

Experiments
We conduct experiments on abstractive sentence summarization benchmark datasets to demonstrate the effectiveness of the proposed contrastive attention mechanism.

Datasets
In this paper, we evaluate our proposed method on three abstractive text summarization benchmark datasets. First, we use the annotated Gigaword corpus and preprocess it identically to Rush et al. (2015), which results in around 3.8M training samples, 190K validation samples, and 1951 test samples for evaluation. The source-summary pairs are formed by pairing the first sentence of each article with its headline. We use DUC-2004 as another English dataset, only for testing in our experiments. It contains 500 documents, each with four human-generated reference summaries. The length of the summary is capped at 75 bytes. The last dataset we use is a large corpus of Chinese short text summarization (LCSTS) (Hu et al., 2015), which is collected from the Chinese microblogging website Sina Weibo. We follow the data split of the original paper, with 2.4M source-summary pairs from the first part of the corpus for training, and 725 pairs with high annotation scores from the last part for testing.

Experimental Setup
We employ Transformer as our basis architecture. Six layers are stacked in both the encoder and the decoder, and the dimensions of the embedding vectors and all hidden vectors are set to 512. The inner layer of the feed-forward sublayer has a dimensionality of 2048. We use eight heads in the multi-head attention. The source embedding, the target embedding, and the linear sublayer are shared in our experiments. Byte-pair encoding is employed in the English experiments, with a shared source-target vocabulary of about 32k tokens (Sennrich et al., 2015).
[Table 1: ROUGE results on the English evaluation sets. Columns: System; Gigaword R-1, R-2, R-L; DUC2004 R-1, R-2, R-L. Only one complete row survives: (Edunov et al., 2018), Gigaword 36.70 / 17.88 / 34.29, no DUC2004 scores.]
Regarding the contrastive attention mechanism, the opponent attention is derived from the head whose attention is most synchronous to the word alignments of the source-summary pair. In our experiments, we select the fifth head of the third layer for deriving the opponent attention in the English experiments, and the second head of the third layer in the Chinese experiments. All dimensions in the contrastive architecture are set to 64. The λ in Equation (9) is tuned on the development set in each experiment. During training, we use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The initial learning rate is 0.0005. The inverse square root schedule is applied for initial warm-up and annealing (Vaswani et al., 2017). We use a dropout rate of 0.3 on all datasets.
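An inverse square root schedule with warm-up can be sketched as below. This is one common formulation and rests on two assumptions that the text does not confirm: the stated rate of 0.0005 is treated as the peak reached at the end of warm-up, and `warmup_steps=4000` is a hypothetical value, since the number of warm-up steps is not given.

```python
def inverse_sqrt_lr(step, peak_lr=0.0005, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay proportional to 1 / sqrt(step).

    peak_lr follows the learning rate stated in the text; warmup_steps is a
    hypothetical value chosen for illustration.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# Learning rate at a few points of training.
lr_mid_warmup = inverse_sqrt_lr(2000)
lr_peak = inverse_sqrt_lr(4000)
lr_late = inverse_sqrt_lr(16000)
```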
During evaluation, we employ ROUGE (Lin, 2004) as our evaluation metric. Since the standard ROUGE package is used to evaluate English summarization systems, we follow the method of Hu et al. (2015) and map Chinese words to numerical IDs in order to evaluate the performance on the Chinese dataset.

English Results
The experimental results on the English evaluation sets are listed in Table 1. Following the setting of previous works, we report the full-length F-1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) on the evaluation set of the annotated Gigaword, and the recall-based scores of R-1, R-2, and R-L on the evaluation set of DUC2004.
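To make the difference between the recall-based and F-1 variants concrete, here is a simplified ROUGE-N sketch. It is not the standard ROUGE package: it handles a single reference, does no stemming, and ignores the 75-byte cap used for DUC2004. The example sentences are taken from the qualitative examples of this paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Return (recall, precision, F-1) of n-gram overlap against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

reference = "hoyer wins singles title".split()
candidate = "hoyer wins men 's singles title".split()
r1_recall, r1_precision, r1_f1 = rouge_n(candidate, reference, 1)
r2_recall, r2_precision, r2_f1 = rouge_n(candidate, reference, 2)
```

For this pair the unigram recall is perfect while the F-1 score is lower, because the candidate contains two extra tokens that reduce precision.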
The results of our work are shown at the bottom of Table 1. The performance of related work is reported in the upper part of Table 1 for comparison. ABS and ABS+ are the pioneering works that use neural models for abstractive text summarization. RAS-Elman extends ABS/ABS+ with an attentive CNN encoder. words-lvt5k-1sent uses a large vocabulary and linguistic features such as POS and NER tags. RNN MRT, Actor-Critic, and StructuredLoss are sequence-level training methods that overcome the problems of the usual teacher-forcing methods. DRGD uses a recurrent latent random model to improve summarization quality. FactAware generates summary words conditioned on both the source text and fact descriptions extracted from OpenIE or dependencies. Besides the above RNN-based related works, the CNN-based architectures ConvS2S and ConvS2S ReinforceTopic are included for comparison.
Table 1 shows that Transformer alone is a strong baseline: it obtains state-of-the-art performance on the Gigaword evaluation set, and performance comparable to the state of the art on DUC2004. Introducing the contrastive attention mechanism into Transformer significantly improves its performance and greatly advances the state of the art on both the Gigaword evaluation set and DUC2004, as shown in the row "Transformer+ContrastiveAttention".
Table 2 presents the evaluation results on LCSTS. The upper rows list the performance of related work, and the bottom rows list the performance of our Transformer baseline and of the integration of the contrastive attention mechanism into Transformer.
Table 2: ROUGE results on LCSTS.
System                                  R-1     R-2     R-L
RNN context (Hu et al., 2015)           29.90   17.40   27.20
CopyNet (Gu et al., 2016)               34.40   21.60   31.30
RNN MRT (Ayana et al., 2016)            38.20   25.20   35.40
RNN distraction (Chen et al., 2016)     35.20   22.60   32.50
DRGD (Li et al., 2017)                  36.99   24.15   34.21
Actor-Critic                            37.51   24.68   35.02
Global (Lin et al., 2018)               39 [...]
We only take character sequences as source-summary pairs and evaluate the performance based on reference characters for strict comparison to the related works. Table 2 shows that Transformer also sets a strong baseline on LCSTS that surpasses the performances of the previous works. When Transformer is equipped with our proposed contrastive attention mechanism, the performance is significantly improved and drastically advances the state-of-the-art on LCSTS.

Effect of the Contrastive Attention Mechanism on Attentions
Figure 3 shows the attention weights before and after using the contrastive attention mechanism. We depict the averaged attention weights of all heads in one layer in Figures 3a and 3b to study how the mechanism contributes to the conventional probability computation, and depict the opponent attention weights in Figure 3c to study their contribution to the opponent probability. Since we select the fifth head of the third layer to derive the opponent attention in the English experiments, the studies are carried out on the third layer. Figure 3a is from the baseline Transformer, and Figure 3b is from "Transformer+ContrastiveAttention". We can see that "Transformer+ContrastiveAttention" is more focused on the source parts that are most relevant to the summary than the baseline Transformer, which scatters attention weights over the neighbors of summary words or even over function words such as "-lrb-" and "the". "Transformer+ContrastiveAttention" cancels such scattered attention by using the contrastive attention mechanism.
Figure 3: (a) the average attention weights of the third layer of the baseline Transformer; (b) those of "Transformer+ContrastiveAttention"; (c) the opponent attention derived from the fifth head of the third layer.
Figure 3c depicts the opponent attention weights. They are optimized during training to generate the lowest score, which is fed into softmin to obtain the highest opponent probability $P_o$. The more irrelevant the opponent attention is to the summary word, the lower the score that can be obtained, resulting in a higher $P_o$. Figure 3c shows that the attention is formed over irrelevant parts with varied weights, as the result of maximizing $P_o$ during training.

Effect of the Opponent Probability in Decoding
We study the contribution of the opponent probability $P_o$ by dropping it during decoding to see whether this hurts the performance. Table 4 shows that dropping $P_o$ significantly harms the performance of "Transformer+ContrastiveAtt". The performance difference between the model without $P_o$ and the baseline Transformer is marginal, indicating that adding the opponent probability $P_o$ is key to achieving the performance improvement.

Explorations on Deriving the Opponent Attention
Masking More Attention Weights for Deriving the Opponent Attention. In Section 3.2.1, we mask the most salient word, which has the maximum weight in $\alpha_c$, to derive the opponent attention. In this subsection, we experiment with masking more weights of $\alpha_c$ in two ways: 1) masking the top k weights, and 2) dynamic masking. In the dynamic masking method, we first sort the weights in descending order, and then keep masking neighboring weights until the ratio between two neighbors exceeds a threshold. The threshold is set to 1.02 based on training and tuning on the development set. The upper rows of Table 3 present the performance comparison between masking the maximum weight and masking more weights. Masking only the maximum weight performs better, indicating that it leaves more irrelevant or less relevant words for computing the opponent probability $P_o$, which is more reliable than the probability computed from the fewer words that remain after masking more weights.
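The two masking variants can be sketched as follows. The dynamic variant is only one possible reading of the description above and should be taken as an assumption: starting from the largest weight, the next weight in descending order is also masked as long as the ratio between the two neighbors stays within the threshold. The guard that always leaves the smallest weight unmasked is likewise an addition of this sketch.

```python
import math

def mask_top_k(alpha_c, k):
    """Replace the k largest weights with negative infinity."""
    order = sorted(range(len(alpha_c)), key=lambda i: alpha_c[i], reverse=True)
    masked = list(alpha_c)
    for i in order[:k]:
        masked[i] = -math.inf
    return masked

def mask_dynamic(alpha_c, threshold=1.02):
    """Mask the maximum, then keep masking while neighboring weights are nearly equal.

    Assumed reading of the text: stop at the first pair of sorted neighbors whose
    ratio exceeds the threshold. The smallest weight is never masked, so that the
    following softmax stays well defined (needs at least two source positions).
    """
    order = sorted(range(len(alpha_c)), key=lambda i: alpha_c[i], reverse=True)
    masked = list(alpha_c)
    masked[order[0]] = -math.inf
    for prev, cur in zip(order, order[1:-1]):
        if alpha_c[prev] > threshold * alpha_c[cur]:
            break
        masked[cur] = -math.inf
    return masked

top2 = mask_top_k([0.5, 0.3, 0.2], 2)
dyn_close = mask_dynamic([0.40, 0.395, 0.205])  # top two weights are nearly tied
dyn_peaked = mask_dynamic([0.70, 0.20, 0.10])   # clear single maximum
```

Either masked vector would then be passed through softmax to obtain the opponent attention weights, exactly as in Equation (3).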
Selecting Non-synchronous Head or Averaged Head for Deriving the Opponent Attention. As explained in Section 3.2.1, the opponent attention is derived from the head that is most synchronous to the word alignments between the source sentence and the summary. We denote it "synchronous head". We also explored deriving the opponent attention from the fifth head of the first layer, which is non-synchronous to the word alignments, as illustrated in Figure 2b. Its result is presented in the "non-synchronous head" row. In addition, the attention weights averaged over all heads of the third layer are used to derive the opponent attention. We denote this "averaged head".
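The three sources of conventional weights compared here differ only in which attention distribution is handed to the opponent function. A minimal sketch, assuming the encoder-decoder attention weights of all heads are available as a nested list indexed by layer and head (the layout and the function names are assumptions of this example):

```python
def select_head(all_heads, layer, head):
    """Pick one head's attention distribution; layer and head are 1-based as in the text."""
    return all_heads[layer - 1][head - 1]

def average_heads(head_weights):
    """Average several attention distributions over the same source positions."""
    h = len(head_weights)
    return [sum(col) / h for col in zip(*head_weights)]

# Toy weights: 2 layers x 2 heads over 2 source positions (illustrative values).
all_heads = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.8, 0.2], [0.4, 0.6]],
]
single = select_head(all_heads, layer=2, head=1)
averaged = average_heads(all_heads[1])
```

Averaging requires the weights of every head in the layer, which is consistent with the observation below that the "averaged head" variant is slower.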
As shown in the middle part of Table 3, both "non-synchronous head" and "averaged head" underperform "synchronous head". "Non-synchronous head" performs worst, even worse than the Transformer baseline on Gigaword. This indicates that it is better to compose the opponent attention from irrelevant parts, which can be easily located in the synchronous head. "Averaged head" performs slightly worse than "synchronous head", and is also slower because all heads are involved.

Qualitative Study
Table 5 shows the qualitative results. The highlights in the baseline Transformer mark the incorrect areas extracted by the baseline system. In contrast, the highlights in Transformer+ContrastiveAtt show that correct contents are extracted, since the contrastive system distinguishes relevant parts from irrelevant parts on the source side, which makes attending to the correct areas easier.

Conclusion
We proposed a contrastive attention mechanism for abstractive sentence summarization, using both the conventional attention that attends to the relevant parts of the source sentence, and a novel opponent attention that attends to irrelevant or less relevant parts for the summary word generation. The two categories of attention constitute a contrastive pair, and we encourage the contribution from the conventional attention and penalize the contribution from the opponent attention through joint training. Using Transformer as a strong baseline, experiments on three benchmark datasets show that the proposed contrastive attention mechanism significantly improves the performance, advancing the state-of-the-art performance for the task.
Table 5: Qualitative results.
Src: press freedom in algeria remains at risk despite the release on wednesday of prominent newspaper editor mohamed UNK after a two-year prison sentence , human rights organizations said .
Ref: algerian press freedom at risk despite editor 's release UNK picture
Transformer: press freedom remains at risk in algeria rights groups say
Transformer+ContrastiveAtt: press freedom remains at risk despite release of algerian editor
Src: denmark 's poul-erik hoyer completed his hat-trick of men 's singles badminton titles at the european championships , winning the final here on saturday
Ref: hoyer wins singles title
Transformer: hoyer completes hat-trick
Transformer+ContrastiveAtt: hoyer wins men 's singles title
Src: french bank credit agricole launched on tuesday a public cash offer to buy the ## percent of emporiki bank it does not already own , in a bid valuing the greek group at #.# billion euros ( #.# billion dollars ) .
Ref: credit agricole announces #.#-billion-euro bid for greek bank emporiki
Transformer: credit agricole launches public cash offer for greek bank
Transformer+ContrastiveAtt: french bank credit agricole bids #.# billion euros for greek bank