Improving Abstractive Document Summarization with Salient Information Modeling

Comprehensive document encoding and salient information selection are two major difficulties for generating summaries with adequate salient information. To tackle the above difficulties, we propose a Transformer-based encoder-decoder framework with two novel extensions for abstractive document summarization. Specifically, (1) to encode the documents comprehensively, we design a focus-attention mechanism and incorporate it into the encoder. This mechanism models a Gaussian focal bias on attention scores to enhance the perception of local context, which contributes to producing salient and informative summaries. (2) To distinguish salient information precisely, we design an independent saliency-selection network which manages the information flow from encoder to decoder. This network effectively reduces the influences of secondary information on the generated summaries. Experimental results on the popular CNN/Daily Mail benchmark demonstrate that our model outperforms other state-of-the-art baselines on the ROUGE metrics.


Introduction
Document summarization is a fundamental task of natural language generation which condenses the given documents and automatically generates fluent summaries with salient information. Recent successes of neural sequence-to-sequence (seq2seq) models (Luong et al., 2015; Wu et al., 2016; Tu et al., 2016) enable the end-to-end framework for natural language generation, which inspires the research on abstractive summarization. Abstractive document summarization employs an end-to-end language model to encode a document into high-dimensional representations and then decode the representations into an abstractive summary. Though promising improvements have been achieved recently (Li et al., 2018c; Kryściński et al., 2018), many problems are still not well studied, such as the inadequate modeling of salient information.
Documents: a [duke student] has [admitted to hanging a noose made of rope] from a tree near a student union , [university officials] said thursday . the prestigious private school did n't identify the student , citing federal privacy laws . in a news release , it said the student was [no longer] on campus and [will face] student conduct [review] . the [student was identified during an investigation] by campus police and the office of student affairs and admitted to placing the noose on the tree early wednesday , the university said . ... at a forum held on the steps of duke chapel , close to where [the noose was discovered at 2 a.m]. , hundreds of people gathered . " you came here for the reason that you want to say with me , ' this is no duke we will accept . ...
Reference summary: student is no longer on duke university campus and will face disciplinary review . school officials identified student during investigation and the person admitted to hanging the noose , duke says . the noose , made of rope , was discovered on campus about 2 a.m.
Table 1: Example of a document and its corresponding reference summary. We consider that the reference summary contains all salient information and mark the words or phrases that also appear in the document in [red].
Modeling salient information involves two procedures: information representation and information discrimination. Generally, the most essential prerequisite for a practical document summarization model is that the generated summaries contain adequate salient information from the original documents. However, previous seq2seq models are still incapable of achieving convincing performance, as they are restricted by the following two difficulties.
The first difficulty lies in the procedure of encoding. Since a document is a long sequence of multiple sentences, the semantics of each token depend both on other distant tokens and on its local context. Both contribute to producing high-quality summaries with adequate salient information. The lack of long-term dependencies among tokens often leads to incomplete summaries (Li et al., 2018c). Unfortunately, traditional seq2seq encoders (recurrent or convolution based) are deficient in modeling dependencies among distant segments (Bengio et al., 1994; Li et al., 2018c). In recent years, the Transformer model (Vaswani et al., 2017) has shown remarkable performance in many similar tasks (Devlin et al., 2018) by exploiting long-term dependencies, but recent studies point out that this model may occasionally overlook local context (Yang et al., 2018). The absence of local context information accounts for inadequate details of salient information. Therefore, it is challenging to encode global information and local context comprehensively for each token in a document, which requires capturing long-term dependencies and local semantics at the same time.
The second difficulty is to distinguish salient information from long documents precisely. In the example shown in Table 1, salient segments account for only a small part of the whole document, which makes it laborious for naive seq2seq models to separate important information from the large amount of secondary information. The summaries generated by these models usually lose salient information of the original documents or even contain repetitions (Li et al., 2018c).
In this paper, we propose the Extended Transformer model for Abstractive Document Summarization (ETADS) to tackle the above issues. Specifically, we design a novel focus-attention mechanism and a saliency-selection network, equipped in the encoder and decoder respectively: (1) To comprehensively encode the documents, we design a focus-attention mechanism, where a learnable Gaussian focal bias is employed as a regularization term on attention scores. This focal bias implicitly concentrates attention on local continuous scopes to emphasize the corresponding part of the document. (2) To distinguish salient information in documents, we design an independent saliency-selection network to explicitly manage the information flow from encoder to decoder. The saliency-selection network employs a gate mechanism to assign a saliency score to each token in the source document according to its encoded representation. We consider the lower-scored tokens relatively insignificant and reduce their likelihood of appearing in the final summaries. Finally, we conduct extensive experiments on the CNN/Daily Mail dataset, which is widely used for the document summarization task. The experimental results show that ETADS achieves state-of-the-art ROUGE scores and outperforms many strong baselines.

Related Work
With the development of seq2seq models on the neural translation task, more and more researchers have taken note of their great potential in the text summarization area (Fan et al., 2017; Ling and Rush, 2017; Cheng and Lapata, 2016), especially for abstractive methods. Rush et al. (2015) are the first to apply a seq2seq model with an attention mechanism to abstractive summarization and achieve promising improvement. Nallapati et al. (2016) modify the basic model with an RNN-based encoder and decoder and propose several techniques. Subsequent work further proposes to improve the novelty of generated summaries and designs a distraction-based attentional model. Li et al. (2017) creatively incorporate the variational auto-encoder into the seq2seq model to learn the latent structure information. However, these models are mostly designed for abstractive sentence summarization; they focus on encoding and mining salient information at the sentence level, which leads to unsatisfactory performance on document summarization.
Some recent work improves the performance of neural abstractive models on the document summarization task from various aspects. To better grasp the essential meaning for summarization, prior work proposes not only to pay attention to specific regions and content of input documents with attention models, but also to distract them to traverse between different content. Tan et al. (2017) propose a graph-based attention mechanism in a hierarchical encoder-decoder framework to generate multi-sentence summaries. Gehrmann et al. (2018) present a content selection model for summarization that identifies phrases within a document that are likely to be included in its summary. To produce more informative summaries, Gu et al. (2016) are the first to show that the copy mechanism (Vinyals et al., 2015) can alleviate the Out-Of-Vocabulary problem by copying words from the source documents. See et al. (2017) rebuild this pointer-generator network and incorporate an additional coverage mechanism into the decoder. Li et al. (2018b) notice the necessity of explicit information selection and build a gated global information filter and a local sentence selection mechanism. Moreover, reinforcement learning (RL) approaches have been shown to further improve performance on these tasks (Celikyilmaz et al., 2018; Li et al., 2018a). Pasunuru and Bansal (2018) develop a loss function based on whether salient segments are included in a summary. However, RL-based models can be difficult to tune and slow to train.

Model
In this section, we describe our approach from three aspects: (i) the Transformer-based encoder-decoder framework, (ii) the focus-attention mechanism for the encoder to emphasize the local context, and (iii) the saliency-selection network for the decoder to select salient information.

Encoder-Decoder Framework
Given a document $X = (x_1, x_2, ..., x_m)$, the encoder maps its corresponding symbol representations $E = (e_1, e_2, ..., e_m)$ to a sequence of continuous representations $Z = (z_1, z_2, ..., z_m)$, where $m$ is the length of the document. The decoder then decodes $Z$ into continuous representations $S = (s_1, s_2, ..., s_n)$ and generates the abstractive summary $Y = (y_1, y_2, ..., y_n)$ one token at a time, where $n$ is the length of the summary. $V_s$ and $V_t$ are the source and target vocabularies, with $x_i \in V_s$ and $y_j \in V_t$. $E$ is the sum of word embedding representations and position embedding representations, where $e_i \in \mathbb{R}^{d_e}$. Both embedding representations are initialized as in Vaswani et al. (2017) and learned during the process of optimization.

Encoder
The encoder is composed of a stack of $N$ identical layers, and each layer has two sub-layers. The first is the self-attention sub-layer and the second is the feed-forward sub-layer. A residual connection is employed around each of the two sub-layers, followed by layer normalization. Given an example input $t$, the output of each sub-layer can be formalized as $\mathrm{LayerNorm}(t + \mathrm{SubLayer}(t))$. For the encoder, $\mathrm{SubLayer}(t)$ can be replaced with $\mathrm{ATT}(t)$ or $\mathrm{FFN}(t)$, which represent the pre-output of the self-attention sub-layer and the feed-forward sub-layer, respectively. The details of each sub-layer are presented as follows.
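The sub-layer wrapper described above can be illustrated with a minimal pure-Python sketch. It operates on a single position vector and omits the learnable gain and bias of layer normalization; the names `layer_norm` and `residual_sublayer` and the `eps` value are illustrative assumptions, not part of the original model description.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance.

    Assumption: no learnable gain/bias, which a full implementation would add.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_sublayer(t, sublayer):
    """Compute LayerNorm(t + SubLayer(t)) for one position vector t."""
    return layer_norm([a + b for a, b in zip(t, sublayer(t))])


# Hypothetical sub-layer used only for illustration: halve every component.
example = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda t: [0.5 * x for x in t])
```

Any callable that maps a vector to a vector of the same length can stand in for the self-attention or feed-forward pre-output here.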
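The self-attention sub-layer takes the output of the previous layer as its input. Formally, the input for the self-attention sub-layer of the $l$-th layer is $Z_{l-1} \in \mathbb{R}^{m \times d_m}$, where $d_m$ is the dimension of the output. In particular, $Z_0 = E$ and the output of the encoder is $Z = Z_N$. In the process of computation, three matrices, the query $Q_l \in \mathbb{R}^{m \times d_m}$, the key $K_l \in \mathbb{R}^{m \times d_m}$ and the value $V_l \in \mathbb{R}^{m \times d_m}$, are first obtained by linear projections from $Z_{l-1}$ with three different weight matrices $W^Q_l \in \mathbb{R}^{d_m \times d_m}$, $W^K_l \in \mathbb{R}^{d_m \times d_m}$ and $W^V_l \in \mathbb{R}^{d_m \times d_m}$. Then the pre-output of the self-attention sub-layer can be computed with the scaled dot-product attention mechanism, which in its standard form (Vaswani et al., 2017) reads:
$$\mathrm{ATT}(Z_{l-1}) = \mathrm{softmax}\left(\frac{Q_l K_l^{\top}}{\sqrt{d_m}}\right) V_l \quad (1)$$
and the final output $A_l$ of this sub-layer is obtained with residual connection and layer normalization. Moreover, the self-attention sub-layer can be further extended into the multi-head manner (Eq. 2): as in Vaswani et al. (2017), $h$ heads with separate projections attend in parallel and their outputs are concatenated.
The feed-forward sub-layer takes the output of the self-attention sub-layer $A_l$ as the input, and the computation of the pre-output $\mathrm{FFN}(A_l)$ is straightforward with a position-wise fully connected feed-forward network, in its standard form:
$$\mathrm{FFN}(A_l) = \max(0, A_l W^1_l + b^1_l) W^2_l + b^2_l$$
where $W^1_l \in \mathbb{R}^{d_m \times d_f}$ and $W^2_l \in \mathbb{R}^{d_f \times d_m}$ are two learnable weight matrices, and $b^1_l \in \mathbb{R}^{d_f}$ and $b^2_l \in \mathbb{R}^{d_m}$ are two learnable biases. The final output of the feed-forward sub-layer $Z_l$ is also the output of the $l$-th layer, which is obtained after residual connection and layer normalization.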
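As a rough illustration of the scaled dot-product attention named above, the following sketch implements the standard formulation of Vaswani et al. (2017) in pure Python. It assumes the query, key and value matrices have already been projected and are given as lists of rows; the helper names are hypothetical and the scaling uses the key width, which is an assumption about the exact constant.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def scaled_dot_product_attention(q, k, v):
    """Return (softmax(Q K^T / sqrt(d)) V, attention weights)."""
    d = len(k[0])  # assumed scaling dimension: width of the key vectors
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Two positions with two-dimensional representations, for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(queries, queries, values)
```

Each row of `weights` is a probability distribution over source positions, and each output row is the corresponding weighted average of the value rows.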

Decoder
The decoder in our framework has a similar stacked structure with $N$ identical layers. In addition to the two sub-layers introduced above, the decoder inserts another attention sub-layer in between, which performs multi-head attention over the output of the encoder. For clarity, we use "bridge sub-layer" to refer to this additional attention sub-layer and $\mathrm{BATT}(Z, t)$ to represent its pre-output, where $Z$ is the encoder output and $t$ is an example of the encoded partially generated summary. The calculation of $\mathrm{BATT}(Z, t)$ is similar to Eq. (1). Specifically, for the $l$-th bridge sub-layer in the decoder, the key $K_l$ and value $V_l$ are obtained by linear projections from $Z$. Apart from the additional sub-layer, the rest of the computation is the same as in the encoder, and the output of the last layer $H_N$ is considered the final decoder output $H$.
Finally, for the $i$-th decoding step, we compute a distribution over $V_t$ for the target element $y_i$ by projecting the output of the decoder stack $S_i$ via a linear layer with weights $W_o \in \mathbb{R}^{d_m \times T}$ and a bias term.

Focus-Attention Mechanism
To take full advantage of the document information during encoding, we design a focus-attention mechanism and build it into the self-attention sub-layers of the encoder, as depicted in Figure 1. The "dotted boxes" indicate that the corresponding modules can be adapted into the multi-head manner.
The focus-attention mechanism models a focal bias as a regularization term on attention scores, which is determined by the position of its center and its effective coverage scope. In the $l$-th self-attention sub-layer, the query $Q_l$, key $K_l$ and value $V_l$ are all obtained by linear projections from the input $Z_{l-1}$, so they contain similar information in different semantic spaces. To reduce the amount of calculation, we only utilize the query matrix $Q_l$ to compute the center position and the coverage scope. Specifically, for the $i$-th encoding step in the $l$-th layer, the center position scalar $\mu^i_l \in \mathbb{R}$ and the coverage scope scalar $\sigma^i_l \in \mathbb{R}$ are calculated from the query by two linear projections, which use two shared weight matrices $W_p \in \mathbb{R}^{d_m \times d_m}$ and $W_g \in \mathbb{R}^{d_m \times d_m}$ together with two vectors $U_c \in \mathbb{R}^{d_m}$ and $U_d \in \mathbb{R}^{d_m}$. According to the definition of the Gaussian distribution, the focal bias for the $i$-th step, $f^i_l \in \mathbb{R}^m$, can be easily obtained with $\tilde{\mu}^i_l$ and $\tilde{\sigma}^i_l$ as follows:
$$f^{i,j}_l = -\frac{(P_j - \tilde{\mu}^i_l)^2}{2 (\tilde{\sigma}^i_l)^2}$$
where $P_j$ is the absolute position of word $x_j$ in the document. $f^{i,j}_l \in (-\infty, 0]$ measures the distance between word $x_j$ and the center position $\tilde{\mu}^i_l$. Eventually, this focal bias is added to the attention energy of the encoder layers before softmax normalization, so that the pre-output of the sub-layer becomes
$$\mathrm{softmax}\left(\frac{Q_l K_l^{\top}}{\sqrt{d_m}} \oplus f_l\right) V_l$$
where $\oplus$ denotes addition. Moreover, we further adapt the focus-attention mechanism into the multi-head manner as in Eq. (2). Accordingly, distinct focal biases are assigned to each head and different weight matrices are utilized in the computation.
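To make the effect of the focal bias concrete, here is a small sketch for a single query position. It assumes the bias takes the log-Gaussian form $-(P_j - \mu)^2 / (2\sigma^2)$ with 0-based positions, and it takes the center and scope as plain numbers rather than predicting them from the query as the model does. The function names are illustrative.

```python
import math


def focal_bias(center, scope, length):
    """Bias f_j = -(j - center)^2 / (2 * scope^2) for positions j = 0..length-1.

    Assumption: positions are 0-based and center/scope are already rescaled
    to the document length.
    """
    return [-((j - center) ** 2) / (2.0 * scope ** 2) for j in range(length)]


def focus_attention_weights(energies, center, scope):
    """Add the focal bias to raw attention energies, then apply softmax."""
    bias = focal_bias(center, scope, len(energies))
    biased = [e + b for e, b in zip(energies, bias)]
    peak = max(biased)
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [x / total for x in exps]


# With flat energies, attention mass concentrates around the center position.
weights = focus_attention_weights([0.0] * 7, center=3.0, scope=1.0)
```

A small scope sharpens the distribution around the center, while a large scope makes the bias nearly constant and leaves the original attention pattern almost unchanged.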

Saliency-Selection Network
Abstractive document summarization is a special NLP generation task which requires reducing the influence of secondary information and integrating salient segments to produce a condensed summary. Traditional seq2seq models often have limited performance on distinguishing salient segments (Tan et al., 2017), which emphasizes the necessity of a customized selection network. In this work, we design the saliency-selection network for information selection, as depicted in Figure 2.
Concretely, we measure the saliency of each word in the document by assigning a saliency score and make a soft selection. For the $i$-th decoding step in the $l$-th layer, the saliency-selection network takes the query vector $Q^i_l \in \mathbb{R}^{d_m}$ and the key matrix $K_l \in \mathbb{R}^{m \times d_m}$ as the input, where $m$ is the length of the input document. Then, the network computes the saliency score $g^i_l \in \mathbb{R}^m$ through a gate parameterized by two learnable weight matrices $W_h \in \mathbb{R}^{d_m \times d_m}$ and $W_s \in \mathbb{R}^{d_m \times d_m}$. $g^{i,j}_l \in [0, 1]$ measures the saliency of the $j$-th token in the document for the $i$-th position in the summary. Furthermore, we incorporate the computed saliency score $g_l$ into the attention network of the bridge sub-layer. Moreover, we also adapt the saliency-selection network into the multi-head manner, which allows saliency to be modeled from different perspectives at different positions.
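The soft selection can be sketched as follows. This is only an illustration under stated assumptions: the gate is taken to be a sigmoid of a scaled dot product between the query and each key (the projections by $W_h$ and $W_s$ are left out, i.e. treated as identity), and the scores are applied by rescaling attention weights and renormalizing. Neither choice is confirmed by the text above, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def saliency_scores(query, keys):
    """One gate value per source token.

    Assumed form: sigmoid of the scaled dot product between query and key.
    """
    d = len(query)
    return [
        sigmoid(sum(a * b for a, b in zip(query, key)) / math.sqrt(d))
        for key in keys
    ]


def gated_attention(weights, gates):
    """Scale attention weights by the gates and renormalize (assumed usage)."""
    scaled = [w * g for w, g in zip(weights, gates)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Three source tokens: aligned with, opposed to, and orthogonal to the query.
gates = saliency_scores([1.0, 0.0], [[4.0, 0.0], [-4.0, 0.0], [0.0, 0.0]])
selected = gated_attention([1.0 / 3.0] * 3, gates)
```

Tokens with low gate values keep a smaller share of the attention mass, which is the intended effect of suppressing secondary information.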

Objective Function
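Our goal is to maximize the probability of the output summary given the input document. Therefore, we optimize the negative log-likelihood loss function:
$$\mathcal{L} = -\frac{1}{|\tau|} \sum_{(X, Y) \in \tau} \log p(Y | X; \theta) \quad (11)$$
where $\theta$ is the model parameter and $(X, Y)$ is a document-summary pair in the training set $\tau$, and
$$\log p(Y | X; \theta) = \sum_{i=1}^{n} \log p(y_i | y_1, ..., y_{i-1}, X; \theta) \quad (12)$$
where $p(y_i | y_1, ..., y_{i-1}, X; \theta)$ is calculated by the decoder.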
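A minimal sketch of this objective, assuming the decoder has already produced the probability it assigns to each reference token. Averaging over document-summary pairs is an assumption about the normalization; a plain sum would lead to the same optimum. The function names are illustrative.

```python
import math


def sequence_log_prob(step_probs):
    """log p(Y | X): sum of per-step log-probabilities of the reference tokens."""
    return sum(math.log(p) for p in step_probs)


def nll_loss(batch_step_probs):
    """Mean negative log-likelihood over a batch of document-summary pairs."""
    total = sum(sequence_log_prob(probs) for probs in batch_step_probs)
    return -total / len(batch_step_probs)


# Two hypothetical summaries with the decoder's probability for each gold token.
loss = nll_loss([[0.9, 0.8, 0.7], [0.6, 0.5]])
```

The loss is zero only when every reference token receives probability one, and it grows as the decoder assigns less mass to the reference summary.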

Experiments
In this section, we introduce the experiment setup, the implementation details, the baseline models and the experimental results.

Setup
We conduct the experiments on the large-scale CNN/Daily Mail corpus, which has been widely used for explorations on document summarization. The corpus was originally constructed by collecting human-generated highlights for news stories on the CNN and Daily Mail websites (Hermann et al., 2015). We use the scripts supplied by Nallapati et al. (2016) to obtain the CNN/Daily Mail dataset. This dataset contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. We use the same non-anonymized version of the dataset as See et al. (2017), which requires no pre-processing. The average numbers of sentences in documents and summaries are 42.1 and 3.8, respectively. We limit documents to at most 400 tokens and summaries to at most 100 tokens. The word dictionary shared by documents and summaries contains the 50,000 most frequent tokens in the documents.
In our model, we set the number of encoder/decoder layers $N = 4$ and the number of heads $h = 8$. The dimensions of the symbol representation $d_e$ and output $d_m$ are set to 512, and the dimension of the intermediate output $d_f$ is set to 2048. Besides, the dropout rate is set to 0.8 during training. We implement our model in PyTorch 1.0 (see footnote 2). In all experiments, the batch size is set to 4096. We use the Adam optimizer (Kingma and Ba, 2014) to train our model with $\beta_1 = 0.9$, $\beta_2 = 0.998$ and $\epsilon = 10^{-8}$. The learning rate varies every step with the Noam decay strategy (Vaswani et al., 2017) and the warmup threshold is 8000. The maximum norm for gradient clipping is set to 2. We conduct our experiments on one machine with 4 NVIDIA Titan Xp GPUs, and the training process lasts 200,000 steps for each model.
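For reference, the Noam decay strategy of Vaswani et al. (2017) can be written in a few lines. The sketch below plugs in the model dimension 512 and the warmup threshold 8000 from this section; the `factor` argument is a hypothetical global multiplier that is not specified in the text.

```python
def noam_learning_rate(step, d_model=512, warmup=8000, factor=1.0):
    """Learning rate d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).

    `factor` is an assumed extra multiplier; 1.0 reproduces the plain formula.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Sample the schedule at a few training steps.
schedule = [noam_learning_rate(s) for s in (1, 4000, 8000, 16000, 200000)]
```

The rate grows linearly until the warmup threshold and then decays proportionally to the inverse square root of the step number.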
We use the beam search algorithm (Sutskever et al., 2014) with the coverage technique (Tu et al., 2016) to generate multiple summary candidates in parallel and obtain better results; the coverage weight is set to 1. To avoid favoring shorter generated summaries, we utilize a length penalty (Wu et al., 2016) during inference. We set the beam size to 10, the length penalty parameter $\alpha$ to 0.9 and $\beta$ to 5. The minimum length of the generated summary is set to 35 and the batch size for inference is set to 1.
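The length penalty of Wu et al. (2016) divides a candidate's log-probability by a factor that grows with its length. The sketch below assumes that $\beta$ plays the role of the constant 5 in their formula, so that the penalty is $((\beta + |Y|) / (\beta + 1))^{\alpha}$; the coverage term is left out and the function names are illustrative.

```python
def length_penalty(length, alpha=0.9, beta=5.0):
    """Penalty ((beta + length) / (beta + 1)) ** alpha.

    Assumption: beta generalizes the constant 5 used by Wu et al. (2016).
    """
    return ((beta + length) / (beta + 1.0)) ** alpha


def rescored(log_prob, length, alpha=0.9, beta=5.0):
    """Length-normalized beam score: log-probability divided by the penalty."""
    return log_prob / length_penalty(length, alpha, beta)


# Two hypothetical candidates with the same raw log-probability.
short_score = rescored(-10.0, 20)
long_score = rescored(-10.0, 40)
```

Because log-probabilities are negative, dividing by a larger penalty raises the score of longer candidates and counteracts the bias of beam search toward short outputs.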
Following previous studies, we use the ROUGE scores (Lin and Hovy, 2003) to evaluate the performance of our model, with a Python implementation (see footnote 3) and standard options. ROUGE scores measure the quality of a summary by computing the lexical units it shares with the references, such as uni-grams, bi-grams and the longest common subsequence (LCS). The F-measures of ROUGE-1 (uni-gram), ROUGE-2 (bi-gram) and ROUGE-L (LCS) are reported as the evaluation metrics.
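To show what these metrics count, here is a simplified sketch of ROUGE-N and ROUGE-L F-measures for one tokenized candidate and one reference. It is not the pyrouge implementation used in the experiments: it applies no stemming, supports a single reference, and computes the LCS over the whole summary rather than sentence by sentence. All function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall, or 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    """F-measure over clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F-measure over the summary-level longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = "apple sold 61 million iphones".split()
reference = "apple sold more than 61 million iphones".split()
scores = (
    rouge_n_f1(candidate, reference, 1),
    rouge_n_f1(candidate, reference, 2),
    rouge_l_f1(candidate, reference),
)
```

In this example every candidate token appears in the reference in order, so the uni-gram and LCS scores coincide, while the bi-gram score is lower because the missing words break two reference bi-grams.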

Baselines
In this work, we compare our approach with the following state-of-the-art baselines:
words-lvt2k-temp-att: Nallapati et al. (2016) build this model with the basic seq2seq encoder-decoder architecture and attention mechanism, which is a pioneering effort for much subsequent work.
Pointer-generator+coverage: To deal with Out-Of-Vocabulary (OOV) words and the repetition problem, See et al. (2017) combine the pointer network with the RNN-based seq2seq model and design a coverage mechanism.
ConvS2S: Gehring et al. (2017) creatively utilize convolutional neural networks to build a seq2seq model and achieve high performance on many tasks, including abstractive summarization.
Explicit-Selection: Li et al. (2018b) propose to extend the basic seq2seq model with an information selection layer to explicitly control the information flow.
ROUGESal+Ent(RL): Pasunuru and Bansal (2018) address the main difficulties via a reinforcement learning approach with two novel reward functions.
Bottom-Up Summarization: This work combines extractive and abstractive summarization by first using a data-efficient content selector to over-determine phrases in the source document that are likely to be part of the summary (Gehrmann et al., 2018).
Table 2 note: To save space, we use "PG+cov" and "Bottom-Up" to denote the baselines "Pointer-Generator+coverage" and "Bottom-Up Summarization". The symbol "+" indicates that the corresponding module is added to the "Basic model", which is a vanilla Transformer with 4 identical layers.
Footnote 2: https://pytorch.org/
Footnote 3: https://github.com/falcondai/pyrouge

Results
The experimental results are given in Table 2. Overall, ETADS achieves higher ROUGE F1 scores than all of the other baselines (as reported in their own articles), and both of the extensions we propose improve on the basic model. Concretely, we design the focus-attention mechanism to improve the capability of capturing local context information and to encode the document comprehensively. Therefore, the basic model with the focus-attention mechanism is expected to improve at producing summaries with continuous salient segments. The significant improvement on ROUGE-L verifies our hypothesis. Besides, we notice that the improvements provided by the basic model with the saliency-selection network lie particularly in the ROUGE-1 F1 scores. We consider that the reason may be that the saliency-selection network is more sensitive to short segments due to the separate saliency measuring process.
Compared with the two classical RNN-based baselines words-lvt2k-temp-att and Pointer-generator+coverage and the CNN-based baseline ConvS2S, our basic model achieves equivalent performance. We attribute this to its capability of modeling long-term dependencies.
Among more recent work, Explicit-Selection is equipped with a selection layer similar to our saliency-selection network to mine salient information. Although Explicit-Selection is aware of this problem, our saliency-selection network achieves better performance with the help of the stacked architecture. The performance of the reinforcement learning based model ROUGESal+Ent is clearly lower than that of our model. The strongest baseline, Bottom-Up Summarization, combines the advantages of CNN-based and RNN-based models but is also slightly inferior to our model.

Case Study
To further illustrate the effectiveness of the proposed ETADS and analyze the reasons for the improved performance, we compare the summaries generated by the baselines words-lvt2k-temp-att and Bottom-Up Summarization with those of our ETADS approach. For the case in Table 3, the input document analyzes the latest financial report of Apple and further discusses the impact of the new Apple Watch on retail revenue. The performance of words-lvt2k-temp-att is unsatisfactory: three generated sentences are irrelevant to the main concepts and the summary even contains repetitions at the end. The abstractive summary generated by the baseline Bottom-Up Summarization is much better, which indicates the effectiveness of its modifications. However, the generated summary only contains part of the salient information of the document. ETADS achieves the best performance in this case, since two of the generated sentences contain salient information and there are no repetitions. The above results verify that the extensions in our model improve document summarization from both quantitative and qualitative perspectives.
Reference summary: apple sold more than 61 million iphones in the quarter . apple did n't report any results for the new apple watch . believed around 2 million watches have been sold , according to estimates .
words-lvt2k-temp-att: the iphone is still the engine behind apple 's phenomenal success . apple has vied with south korea 's samsung for the no. 1 position in the global smartphone market . apple ceo tim cook has said he 's optimistic about new markets such as [china china china china china ...]
Bottom-Up Summarization: [apple sold more than 61 million iphones in the quarter] , accounting for more than two-thirds of its $ 58 billion in revenue for the quarter and the lion 's share of $ 13.6 billion in profit -and up 40 % from a year ago . $ 200 billion in cash , up from around $ 150 billion for one year . revenue from mac computers rose 2 % to $ 5.6 billion .
ETADS: [apple sold more than 61 million iphones in the quarter .] it was a 40 percent increase over the number of iphones sold in the first three months of 2014 . [apple did n't report any results for the new apple watch] , which it began selling this month , after the quarter ended .

Discussion
In this section, we first validate the robustness of our model with different encoder/decoder architectures and then discuss the different deployment strategies for our extensions.

Architecture Robustness
We conduct experiments to see how the model's performance is affected by the stacked architecture. We perform a set of experiments which adjust the structures of the encoder and decoder to 2, 4 and 6 layers respectively. Experimental results on the test set in Table 4 show that there is no notable difference between 4 and 6 layers for either the encoder or the decoder. However, the number of parameters increases by nearly one quarter with 6 layers, which means more time is needed for convergence. Employing 2 layers for either the encoder or the decoder leads to rapid performance degradation. Considering both efficiency and effectiveness, we eventually use 4 layers for both the encoder and the decoder.

Deployment Strategies
In this section, we discuss the different deployment strategies for our extensions on the encoder-decoder framework.
Firstly, we deploy the saliency-selection network on different layers to discuss strategies of saliency-selection deployment. As mentioned before, the major difficulty of salient information selection is to comprehend the relative semantic meanings and make the correct selection, which significantly affects the precision scores. Therefore, it is proper to use precision scores to measure effectiveness. From Table 5, it can be observed that the improvements brought by our saliency-selection network do not increase linearly with the number of layers. In the shallow layers, the saliency-selection network contributes a notable improvement which is close to the best results we achieved. However, for the deeper layers, the improvement brought by the saliency-selection network is limited. We believe this can be attributed to the characteristics of our encoder-decoder framework. The self-attention sub-layer effectively reduces the cost of long-term information fusion, which makes the original semantic information difficult to comprehend. The saliency-selection network we propose is not able to distinguish noisy information when the original semantic information becomes confusing.
Table 6: ROUGE recall scores on the CNN/Daily Mail test set. "-" indicates the basic model, which does not contain the focus-attention mechanism. Other symbols have the same meaning as in Table 5.
Furthermore, we discuss the deployment strategies for the focus-attention mechanism with ROUGE recall scores. The results in Table 6 demonstrate a phenomenon similar to Table 5, where improvements mainly come from shallow layers. We believe this reflects a trade-off between local context and global information. The focus-attention mechanism aims to concentrate attention on the local context around a center, which deviates from the original goal of self-attention, namely capturing global dependencies. Vaswani et al. (2017) and Shi et al. (2016) indicate that there is a consensus in the NLP community that shallow layers of a stacked model are sensitive to local context while deeper layers model global semantics. Therefore, for a module designed to capture local context, we believe it is reasonable to obtain larger gains when it is equipped on shallower layers, which also serves as indirect evidence of its effectiveness.

Conclusion
In this paper, we propose a novel framework for abstractive document summarization with an extended Transformer model. The proposed model consists of a concise pipeline. First, the stacked encoder with the focus-attention mechanism comprehensively captures long-term dependencies and the local context of the input document. Then the decoder with the saliency-selection network distinguishes the salient information and condenses it into the output. Finally, an inference algorithm produces the abstractive summaries. Our experiments show that the proposed model achieves a significant improvement for abstractive document summarization over previous state-of-the-art baselines.