Abstractive Document Summarization with a Graph-Based Attentional Neural Model

Abstractive summarization is the ultimate goal of document summarization research, but it has previously been less investigated due to the immaturity of text generation techniques. Recently, impressive progress has been made in abstractive sentence summarization using neural models. Unfortunately, attempts at abstractive document summarization are still at a primitive stage, and the evaluation results are worse than those of extractive methods on benchmark datasets. In this paper, we review the difficulties of neural abstractive document summarization, and propose a novel graph-based attention mechanism in the sequence-to-sequence framework. The intuition is to address the saliency factor of summarization, which has been overlooked by prior works. Experimental results demonstrate that our model achieves considerable improvement over previous neural abstractive models. The data-driven neural abstractive method is also competitive with state-of-the-art extractive methods.


Introduction
Document summarization is the task of generating a fluent, condensed summary of a document while keeping its important information. As a useful technique to alleviate the information overload people are facing today, document summarization has been extensively investigated. Efforts on document summarization can be categorized into extractive and abstractive methods. Extractive methods produce the summary of a document by extracting sentences from the original document. They have the advantage of producing fluent sentences and preserving the meaning of original documents, but also inevitably face the drawbacks of information redundancy and incoherence between sentences. Moreover, extraction is far from the way humans write summaries.
In contrast, abstractive methods are able to generate better summaries with the use of arbitrary words and expressions, but generating abstractive summaries is much more difficult in practice. Abstractive summarization involves sophisticated techniques including meaning representation, content organization, and surface realization. Each of these techniques leaves considerable room for improvement (Yao et al., 2017). Due to the immaturity of natural language generation techniques, fully abstractive approaches are still in their infancy and cannot always ensure grammatical abstracts.
Recent neural networks enable an end-to-end framework for natural language generation. Success has been witnessed on tasks like machine translation and image captioning, as well as abstractive sentence summarization (Rush et al., 2015). Unfortunately, the extension of sentence abstractive methods to the document summarization task is not straightforward. Encoding and decoding a long sequence of multiple sentences still lack satisfactory solutions (Yao et al., 2017). Recent abstractive document summarization models are not yet able to achieve convincing performance, and a considerable gap from extractive methods remains.
In this paper, we review the key factors of document summarization, i.e., the saliency, fluency, coherence, and novelty requirements of the generated summary. Fluency is what neural generation models are naturally good at, but the other factors are less considered in previous neural abstractive models. A recent study (Chen et al., 2016) starts to consider the factor of novelty, using a distraction mechanism to avoid redundancy. As far as we know, however, saliency has not been addressed by existing neural abstractive models, despite its importance for summary generation.
In this work, we study how neural summarization models can discover the salient information of a document. Inspired by the graph-based extractive summarization methods, we introduce a novel graph-based attention mechanism in the encoder-decoder framework. Moreover, we investigate the challenges of accepting and generating long sequences for sequence-to-sequence (seq2seq) models, and propose a new hierarchical decoding algorithm with a reference mechanism to generate the abstractive summaries. The proposed method is able to tackle the constraints of saliency, non-redundancy, information correctness, and fluency under a unified framework.
We conduct experiments on two large-scale corpora with human generated summaries. Experimental results demonstrate that our approach consistently outperforms previous neural abstractive summarization models, and is also competitive with state-of-the-art extractive methods.
We organize the paper as follows. Section 2 introduces related work. Section 3 describes our method. Section 4 presents and discusses the experiments. Finally, Section 5 concludes the paper.

Extractive Summarization Methods
Document summarization can be categorized into extractive methods and abstractive methods. Extractive methods extract sentences from the original document to form the summary. Notable early works include (Edmundson, 1969; Carbonell and Goldstein, 1998; McDonald, 2007). In recent years much progress has also been made under traditional extractive frameworks (Li et al., 2013; Dasgupta et al., 2013; Nishikawa et al., 2014).
Neural networks have also been widely investigated on the extractive summarization task. Earlier works explore the use of deep learning techniques in the traditional framework (Kobayashi et al., 2015; Yin and Pei, 2015; Cao et al., 2015a,b). More recent works predict the extraction of sentences in a more data-driven way. Cheng and Lapata (2016) propose an encoder-decoder approach where the encoder learns the representation of sentences and documents while the decoder classifies each sentence using an attention mechanism. Nallapati et al. (2017) propose a recurrent neural network (RNN)-based sequence model for extractive summarization of documents. Neural sentence extractive models are able to leverage large-scale training data and achieve performance better than traditional extractive summarization methods.

Abstractive Summarization Methods
Abstractive summarization aims at generating the summary based on understanding the input text. It involves multiple subproblems like simplification, paraphrasing, and fusion. Previous research is mostly restricted to one or a few of the subproblems or to specific domains (Woodsend and Lapata, 2012; Thadani and McKeown, 2013; Cheung and Penn, 2014; Pighin et al., 2014; Sun et al., 2015).
As for neural network models, success has been achieved on sentence abstractive summarization. Rush et al. (2015) train a neural attention model on a large corpus of news documents and their headlines, and later Chopra et al. (2016) extend their work with an attentive recurrent neural network framework. Nallapati et al. (2016) introduce various effective techniques in the RNN seq2seq framework. These neural sentence abstraction models are able to achieve state-of-the-art results on the DUC competition of generating headline-level summaries for news documents.
Some recent works investigate neural abstractive models on the document summarization task. Cheng and Lapata (2016) also adopt a word extraction model, which is restricted to using the words of the source document to generate a summary, although its performance is much worse than that of the sentence extractive model. Nallapati et al. (2016) extend the sentence summarization model by trying a hierarchical attention architecture and a limited vocabulary during the decoding phase. However, these models still investigate few properties of the document summarization task. Chen et al. (2016) first attempt to explore the novelty factor of summarization, and propose a distraction-based attentional model. Unfortunately, these state-of-the-art neural abstractive summarization models are still not competitive with extractive methods, and several problems remain to be solved.

Overview
In this section we introduce our method. We adopt an encoder-decoder framework, which is widely used in machine translation (Bahdanau et al., 2014) and dialog systems (Mou et al., 2016), among other tasks. In particular, we use a hierarchical encoder-decoder framework similar to , as shown in Figure 1. The main distinction of this work is that we introduce a graph-based attention mechanism, illustrated in Figure 1b, and propose a hierarchical decoding algorithm with a reference mechanism to tackle the difficulty of abstractive summary generation. In the following parts, we first introduce the encoder-decoder framework, and then describe the graph-based attention and the hierarchical decoding algorithm.

Encoder
The goal of the encoder is to map the input document to a vector representation. A document $d$ is a sequence of sentences $d = \{s_i\}$, and a sentence $s_i$ is a sequence of words $s_i = \{w_{i,k}\}$. Each word $w_{i,k}$ is represented by its distributed representation $e_{i,k}$, which is mapped by a word embedding matrix $E_v$. We adopt a hierarchical encoder framework, where we use a word encoder $enc_{word}$ to encode the words of a sentence $s_i$ into the sentence representation, and use a sentence encoder $enc_{sent}$ to encode the sentences of a document $d$ into the document representation. The input to the word encoder is the word sequence of a sentence, appended with an "<eos>" token indicating the end of a sentence. The word encoder sequentially updates its hidden state after receiving each word, as $h_{i,k} = enc_{word}(h_{i,k-1}, e_{i,k})$. The last hidden state (after the word encoder receives "<eos>") is denoted as $h_{i,-1}$, and used as the embedding representation of the sentence $s_i$, denoted as $x_i$. A sentence encoder is used to sequentially receive the embeddings of the sentences, given by $h_i = enc_{sent}(h_{i-1}, x_i)$. A pseudo sentence consisting of an "<eod>" token is appended at the end of the document to indicate the end of the whole document. The hidden state after the sentence encoder receives "<eod>" is treated as the representation of the input document, $c = h_{-1}$.
We use the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) as both the word encoder $enc_{word}$ and the sentence encoder $enc_{sent}$. In particular, we adopt the variant of the LSTM structure in (Graves, 2013).
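The two-level encoding described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: a toy tanh recurrent cell (`toy_cell`, a hypothetical name) stands in for the LSTM, the embeddings and weights are random, and the "<eod>" pseudo sentence is assumed to pass through the word encoder like any other sentence.

```python
import math
import random

def toy_cell(weights, h_prev, x):
    """Stand-in for one LSTM step: h = tanh(W [h_prev; x])."""
    inp = h_prev + x                      # list concatenation [h_prev; x]
    return [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in weights]

def encode_document(doc, embed, w_word, w_sent, dim):
    """Return the sentence-encoder states h_i and the document vector c."""
    states, h_sent = [], [0.0] * dim
    for sent in doc + [["<eod>"]]:        # pseudo sentence closes the document
        h_word = [0.0] * dim
        for tok in sent + ["<eos>"]:      # every sentence ends with <eos>
            h_word = toy_cell(w_word, h_word, embed[tok])
        h_sent = toy_cell(w_sent, h_sent, h_word)   # x_i is the last word state
        states.append(h_sent)
    return states, states[-1]             # c is the state after <eod>

# Toy data: all sizes and values below are made up for the example.
rng = random.Random(0)
dim, emb_dim = 4, 3
doc = [["the", "cat", "sat"], ["it", "purred"]]
vocab = sorted({t for s in doc for t in s} | {"<eos>", "<eod>"})
embed = {t: [rng.uniform(-1, 1) for _ in range(emb_dim)] for t in vocab}
w_word = [[rng.uniform(-0.5, 0.5) for _ in range(dim + emb_dim)] for _ in range(dim)]
w_sent = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
states, c = encode_document(doc, embed, w_word, w_sent, dim)
```

The list `states` holds one vector per input sentence plus one for the "<eod>" pseudo sentence; these are the vectors an attention mechanism would later attend over.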

Decoder with Attention
The decoder is used to generate output sentences $\{s_j\}$ according to the representation of the input sentences. We also use an LSTM-based hierarchical decoder framework to generate the summary, because the summary typically comprises several sentences. The sentence decoder $dec_{sent}$ receives the document representation $c$ as the initial state $h_0 = c$, and predicts the sentence representations sequentially, by $h_j = dec_{sent}(h_{j-1}, x_{j-1})$, where $x_{j-1}$ is the encoded representation of the previously generated sentence $s_{j-1}$. The word decoder $dec_{word}$ receives a sentence representation $h_j$ as the initial state $h_{j,0} = h_j$, and predicts the word representations sequentially, by $h_{j,k} = dec_{word}(h_{j,k-1}, e_{j,k-1})$, where $e_{j,k-1}$ is the embedding of the previously generated word. The predicted word representations are mapped to vectors whose dimension is the vocabulary size, and then normalized by a softmax layer to give the probability distribution over the words in the vocabulary. A word decoder stops when it generates the "<eos>" token, and similarly the sentence decoder stops when it generates the "<eod>" token.
In primitive decoder models, $c$ is the same for generating all the output words, which requires $c$ to be a sufficient representation for the whole input sequence. The attention mechanism (Bahdanau et al., 2014) is usually introduced to alleviate the burden of remembering the whole input sequence, and to allow the decoder to pay different attention to different parts of the input at different generation states. The attention mechanism sets a different $c_j$ when generating sentence $j$, by $c_j = \sum_i \alpha_i^j h_i$. $\alpha_i^j$ indicates how much the i-th original sentence $s_i$ contributes to generating the j-th sentence. $\alpha_i^j$ is usually computed as:
$$\alpha_i^j = \frac{e^{\eta(h_i, h_j)}}{\sum_l e^{\eta(h_l, h_j)}} \qquad (1)$$
where $\eta$ is the function modeling the relation between $h_i$ and $h_j$. $\eta$ can be defined using various functions, including a non-linear function realized by a multi-layer neural network. In this paper we use $\eta(a, b) = a^T M b$, where $M$ is a parameter matrix.
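As a minimal sketch of the traditional attention just described, the snippet below computes softmax weights from the bilinear score a^T M b and the resulting context vector. The function names, toy vectors, and the identity matrix used for M are assumptions for illustration; in the model M is learned.

```python
import math

def bilinear(a, M, b):
    """eta(a, b) = a^T M b."""
    Mb = [sum(m * y for m, y in zip(row, b)) for row in M]
    return sum(x * y for x, y in zip(a, Mb))

def attention(enc_states, dec_state, M):
    """Return (alpha, context): softmax weights over input sentences and c_j."""
    scores = [bilinear(h_i, M, dec_state) for h_i in enc_states]
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(a * h[k] for a, h in zip(alpha, enc_states)) for k in range(dim)]
    return alpha, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy h_i vectors
dec_state = [1.0, 0.0]                               # toy decoder state h_j
M = [[1.0, 0.0], [0.0, 1.0]]                         # identity: eta is a dot product
alpha, context = attention(enc_states, dec_state, M)
```

Here the first and third input vectors score equally against the decoder state, so they receive equal weight, and both outweigh the second.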

Graph-based Attention Mechanism
Traditional attention computes the importance score of a sentence $s_i$, when generating sentence $s_j$, according to the relation between the hidden state $h_i$ and the current decoding state $h_j$, as shown in Figure 1a. This attention mechanism is useful in scenarios like machine translation and image captioning, because the model is able to learn a relevance mapping between the input and output. However, for document summarization, it is not easy for the model to learn how to summarize the salient information of a document, i.e., which sentences are more important to a document.
To tackle this challenge, we learn from the graph-based extractive summarization models TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004), which are based on the PageRank (Page et al., 1999) algorithm. These unsupervised graph-based models show good ability to identify important sentences in a document. The underlying idea is that a sentence is important in a document if it is heavily linked with many important sentences (Wan, 2010).
In graph-based extractive summarization, a graph G is constructed to rank the original sentences. The vertices V are the set of n sentences to be considered, and the edges E are the relations between the sentences, which are typically modeled by the similarity of sentences. Let $W \in \mathbb{R}^{n \times n}$ be the adjacency matrix. Then the saliency scores of the sentences are determined by making use of the global information on the graph recursively, as:
$$f(t+1) = \lambda W D^{-1} f(t) + (1 - \lambda) y \qquad (2)$$
where $f = [f_1, \ldots, f_n] \in \mathbb{R}^n$ denotes the rank scores of the n sentences, and $f(t)$ denotes the rank scores at the t-th iteration. $D$ is a diagonal matrix with its (i, i)-element equal to the sum of the i-th column of $W$. We use $h_i$ as the representation of $s_i$ and set $W(i, j) = h_i^T M h_j$, where $M$ is a parameter matrix to be learned. $\lambda$ is a damping factor, and $y \in \mathbb{R}^n$ has all elements equal to $1/n$. The solution of $f$ can be calculated in closed form:
$$f = (1 - \lambda)(I - \lambda W D^{-1})^{-1} y \qquad (3)$$
In the graph model, the importance score of a sentence $s_i$ is determined by the relation between $h_i$ and the $\{h_l\}$ of all other sentences. In contrast, in traditional attention mechanisms, the importance (attention) score $\alpha_i^j$ is determined by the relation between $h_i$ and $h_j$, regardless of the other original sentences. In our model we hope to combine the two effects, and compute the rank scores of the original sentences with respect to $h_j$, so that the importance scores of the original sentences differ for each decoding state $h_j$; we denote these scores by $f^j$ and use them to compute the attention. Therefore, $h_j$ should be considered in the graph model. Inspired by the query-focused graph-based extractive summarization model (Wan et al., 2007), we realize this by applying the idea of topic-sensitive PageRank (Haveliwala, 2002), which ranks the sentences according to their relevance to the topic. We treat the current decoding state $h_j$ as the topic and add it into the graph as the 0-th pseudo-sentence.
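The recursive ranking above can be illustrated with a short sketch that iterates the update f(t+1) = λ W D^{-1} f(t) + (1 - λ) y until it settles. The adjacency values are hand-picked, non-negative similarities chosen for the example (an assumption; in the model W comes from learned sentence representations), and `rank_sentences` is a hypothetical helper name.

```python
def rank_sentences(W, lam=0.9, iters=200):
    """Iterate f(t+1) = lam * W D^{-1} f(t) + (1 - lam) * y with uniform y."""
    n = len(W)
    # D holds the column sums of W, so W D^{-1} normalizes each column to 1.
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    P = [[W[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    y = [1.0 / n] * n
    f = y[:]
    for _ in range(iters):
        f = [lam * sum(P[i][j] * f[j] for j in range(n)) + (1 - lam) * y[i]
             for i in range(n)]
    return f

# Toy similarity graph over three sentences; sentence 0 is the best connected.
W = [[0.0, 0.8, 0.6],
     [0.8, 0.0, 0.1],
     [0.6, 0.1, 0.0]]
f = rank_sentences(W)
```

Because each column of W D^{-1} sums to one and y sums to one, the scores keep summing to one at every iteration, and the most strongly linked sentence ends up with the highest score. The fixed point of this iteration is the same vector that the closed-form solution describes.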
Given a topic $T$, the topic-sensitive PageRank is similar to Eq. 3 except that $y$ becomes:
$$(y_T)_i = \begin{cases} \frac{1}{|T|} & i \in T \\ 0 & i \notin T \end{cases} \qquad (4)$$
In our setting the topic contains only the decoding state, so $y_T$ is always a one-hot vector with only $y_0 = 1$, indicating that the 0-th sentence is $s_j$. Denote $W^j$ as the new adjacency matrix extended with $h_j$, and $D^j$ as the new diagonal matrix corresponding to $W^j$. Then the convergence score vector $f^j$ contains the importance scores for all the input sentences when generating sentence $s_j$, as:
$$f^j = (1 - \lambda)(I - \lambda W^j (D^j)^{-1})^{-1} y_T \qquad (5)$$
The new scores $f^j$ can be used to compute the graph-based attention when decoding $h_j$, to find the sentences which are both globally important and relevant to the current decoding state $h_j$. Inspired by (Chen et al., 2016), we adopt a distraction mechanism to compute the final attention value $\alpha_i^j$, which subtracts the rank scores of the previous step. This penalizes the model for attending to previously attended sentences, and also helps to normalize the rank scores $f^j$. The graph-based attention is finally computed as:
$$\alpha_i^j = \frac{\max(f_i^j - f_i^{j-1}, 0)}{\sum_l \max(f_l^j - f_l^{j-1}, 0)} \qquad (6)$$
where $f^0$ is initialized with all elements equal to $1/n$. The graph-based attention only focuses on those sentences ranked higher than in the previous decoding step, so that it concentrates more on the sentences which are both salient and novel. Both Eq. 5 and Eq. 6 are differentiable; thus we can use the graph-based attention function Eq. 6 to replace the traditional attention function Eq. 1, and the neural model using the graph-based attention can also be trained using traditional gradient-based methods.
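A small sketch of the two steps in this paragraph: ranking with the decoding state added as node 0 and a one-hot y, then turning consecutive rank vectors into attention by subtracting, clipping at zero, and normalizing. The graph values are invented, the ranking is done by iteration rather than by matrix inversion, and two choices are assumptions of this sketch rather than details given in the text: the score of node 0 is dropped before computing attention, and a uniform fallback is used when no sentence gains rank.

```python
def topic_rank(W, lam=0.9, iters=300):
    """Rank a graph whose node 0 is the decoding-state pseudo-sentence.
    y is one-hot on node 0, as in topic-sensitive PageRank."""
    n = len(W)
    col = [sum(W[i][j] for i in range(n)) for j in range(n)]
    y = [1.0] + [0.0] * (n - 1)
    f = [1.0 / n] * n
    for _ in range(iters):
        f = [lam * sum(W[i][j] / col[j] * f[j] for j in range(n)) + (1 - lam) * y[i]
             for i in range(n)]
    return f[1:]              # assumption: keep only the real input sentences

def graph_attention(f_curr, f_prev):
    """alpha_i = max(f_curr_i - f_prev_i, 0) / sum_l max(f_curr_l - f_prev_l, 0)."""
    gains = [max(c - p, 0.0) for c, p in zip(f_curr, f_prev)]
    total = sum(gains)
    if total == 0.0:          # assumption: uniform fallback if nothing gained rank
        return [1.0 / len(gains)] * len(gains)
    return [g / total for g in gains]

# Toy graph: node 0 is the decoder state, nodes 1..3 are input sentences.
W_j = [[0.0, 0.9, 0.1, 0.2],
       [0.9, 0.0, 0.5, 0.3],
       [0.1, 0.5, 0.0, 0.4],
       [0.2, 0.3, 0.4, 0.0]]
f_prev = [1.0 / 3] * 3        # f^0: all elements equal to 1/n
f_curr = topic_rank(W_j)
alpha = graph_attention(f_curr, f_prev)
```

Only sentences whose rank increased relative to the previous step receive attention, which is how the mechanism steers the decoder away from content it has already covered.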

Model Training
The loss function $L$ of the model is the negative log likelihood of generating summaries over the training set $D$:
$$L = -\sum_{(X, Y) \in D} \log p(Y \mid X; \theta) \qquad (7)$$
where $X = x_1, \ldots, x_{|X|}$ and $Y = y_1, \ldots, y_{|Y|}$ denote the word sequences of a document and its summary respectively, including the "<eos>" and "<eod>" tokens for structure information. Then
$$\log p(Y \mid X; \theta) = \sum_{\tau=1}^{|Y|} \log p(y_\tau \mid \{y_1, \ldots, y_{\tau-1}\}, c; \theta) \qquad (8)$$
and $\log p(y_\tau \mid \{y_1, \ldots, y_{\tau-1}\}, c; \theta)$ is modeled by the LSTM encoder and decoder. We use the Adamax (Kingma and Ba, 2014) gradient-based optimization method to optimize the model parameters $\theta$.
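The loss can be made concrete with a few lines that sum the negative log probabilities of the target tokens. The per-step distributions below are invented numbers standing in for the decoder's softmax outputs, and the function names are hypothetical.

```python
import math

def sequence_nll(step_probs, target_ids):
    """-sum over tau of log p(y_tau | y_<tau, c) for one summary."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))

def corpus_loss(examples):
    """Sum the per-summary negative log likelihood over a training set."""
    return sum(sequence_nll(probs, targets) for probs, targets in examples)

# One toy summary of three tokens over a three-word vocabulary.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
targets = [0, 1, 2]
loss = corpus_loss([(probs, targets)])
```

For this toy summary the loss equals -log(0.7 × 0.8 × 0.5), the negative log of the probability the model assigns to the whole target sequence.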

Decoding Algorithm
We find there are several problems during the generation of summary, including out-of-vocabulary (OOV) words, information incorrectness, error accumulation and repetition. These problems make the generated abstractive summaries far from satisfactory. In this work, we propose a hierarchical decoding algorithm with a reference mechanism to tackle these difficulties, which effectively improves the quality of generated summaries.
As OOV words frequently occur in named entities, we first identify the entities of a document using an NLP toolkit such as Stanford CoreNLP. Then we prefix every entity with an "@entity" token and a number indicating how many words the entity has. We hope the entity prefixes can help the model deal with entities which have more than one word, and help improve the accuracy of recovering OOV words in entities. After decoding we recover the OOV words by matching entities in the original document according to the contexts.
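The entity prefixing step can be sketched as below. The entity spans are supplied by hand here and stand in for the output of a named entity recognizer; `prefix_entities` is a hypothetical helper, and the output format follows the "@entity 2 mary day" pattern visible in the case study.

```python
def prefix_entities(tokens, spans):
    """Insert "@entity" and the entity length before each entity.
    spans: non-overlapping (start, end) token offsets, end exclusive."""
    starts = {start: end for start, end in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end = starts[i]
            out += ["@entity", str(end - i)] + tokens[i:end]
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "mary day , of swanage in dorset".split()
spans = [(0, 2), (4, 5), (6, 7)]      # hypothetical recognizer output
marked = prefix_entities(tokens, spans)
```

The length marker tells the decoder how many of the following tokens belong to the entity, which is what allows multi-word entities to be matched back to the source after decoding.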
For the hierarchical decoder, a major challenge is that the same sentences or phrases are often repeated in the output. A beam search strategy may help to alleviate repetition within a sentence, but repetition across the whole generated summary remains a problem. The word-level beam search is not easy to extend to the sentence level. The reason is that the K-best sentences generated by a word decoder will mostly be similar to each other, which is also noticed by Li et al. (2016).
In this paper we propose a hierarchical beam search algorithm with a reference mechanism. The hierarchical algorithm comprises K-best word-level beam search and N-best sentence-level beam search.
At the word level, the only difference from vanilla beam search is that we add an additional term to the score $\tilde{p}(y_\tau)$ of generating word $y_\tau$, and now
$$score(y_\tau) = \tilde{p}(y_\tau) + \gamma \left( ref(Y_{\tau-1} + y_\tau, s_*) - ref(Y_{\tau-1}, s_*) \right) \qquad (9)$$
where $Y_{\tau-1} = \{y_1, \ldots, y_{\tau-1}\}$ and $\tilde{p}(y_\tau) = \log p(y_\tau \mid Y_{\tau-1}, c; \theta)$. $s_*$ is an original sentence to refer to. $ref$ is a function which calculates the ratio of bigram overlap between two texts. The added term favors a generated word $y_\tau$ that improves the bigram overlap between the currently generated summary $Y_{\tau-1}$ and the target original sentence $s_*$. At the word decoder level, the reference mechanism helps both to improve information correctness and to avoid redundancy. Because the reference score is based on the improvement in bigram overlap over the whole generated summary $Y_{\tau-1}$, the awareness of previously generated sentences also helps alleviate sentence-level redundancy. A factor $\gamma$ is introduced to control the influence of the reference mechanism. Note that because the search is non-optimal, the generated sentence will still differ from the original sentence even with an extremely large $\gamma$.
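The effect of the reference term can be seen in a short sketch. The exact normalization of the bigram-overlap ratio is not spelled out in the text, so `ref` below assumes it is the fraction of the reference sentence's bigrams that already appear in the generated text; the log probabilities and the value of γ are toy numbers.

```python
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def ref(generated, reference):
    """Assumed definition: share of the reference's bigrams found in the generated text."""
    ref_bi = bigrams(reference)
    if not ref_bi:
        return 0.0
    return len(ref_bi & bigrams(generated)) / len(ref_bi)

def word_score(log_prob, generated, candidate, reference, gamma):
    """log p(candidate) plus gamma times the gain in bigram overlap."""
    gain = ref(generated + [candidate], reference) - ref(generated, reference)
    return log_prob + gamma * gain

reference = "she pleaded guilty to fraud".split()   # the sentence s* to refer to
generated = ["she", "pleaded"]                       # summary generated so far
with_ref = word_score(-2.0, generated, "guilty", reference, gamma=4.0)
without_ref = word_score(-1.5, generated, "innocent", reference, gamma=4.0)
```

With γ = 4 the candidate "guilty" scores -1.0 and overtakes "innocent" at -1.5, even though the model alone prefers "innocent"; with γ = 0 the ranking reverts to the plain log probabilities.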
At the sentence level, the N-best sentence beam keeps N generated sentences, each produced by referring to a different original sentence; the N reference sentences are those which have the highest attention scores and have not yet been used as a reference. Because they refer to N different sentences, the N candidate sentences are guaranteed to be diverse. Sentence-level beam search is realized by maximizing the accumulated score of all the sentences generated.
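Choosing the reference sentences for one sentence-level step can be sketched as follows. The attention values and the set of already used indices are invented, and `pick_references` is a hypothetical helper covering only the selection rule, not the full beam search.

```python
def pick_references(attention, used, n_best):
    """Indices of the n_best highest-attention input sentences not yet used."""
    candidates = [i for i in range(len(attention)) if i not in used]
    candidates.sort(key=lambda i: attention[i], reverse=True)
    return candidates[:n_best]

attention = [0.05, 0.40, 0.10, 0.30, 0.15]   # toy attention over five sentences
used = {1}                                    # sentence 1 was referred to earlier
refs = pick_references(attention, used, n_best=2)
```

Each selected index then serves as the reference sentence for one candidate in the sentence beam, so the candidates start from different source sentences.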

Dataset
We conduct experiments on two large-scale corpora, CNN and DailyMail, which have been widely used in neural document summarization tasks. The corpora were originally constructed by Hermann et al. (2015) by collecting human-generated abstractive highlights from the news stories on the CNN and DailyMail websites. The statistics and split of the two datasets are listed in Table 1.

Implementation
We use the corpora which are already provided with labeled entities (Nallapati et al., 2016). The documents and summaries are first lowercased and tokenized, and all digit characters are replaced with the "#" symbol, similar to (Nallapati et al., 2016, 2017). We keep the 40,000 most frequently occurring words, and other words are replaced with the "<OOV>" token. We use Theano for implementation. For the word encoder and decoder we use three layers of LSTM, and for the sentence encoder and decoder we use one layer of LSTM. The hidden vectors all have dimension 512. We use pre-trained GloVe (Pennington et al., 2014) vectors for the initialization of word vectors, which are further trained in the model. The dimension of word vectors is 100. λ is set to 0.9. The parameters of Adamax are set to those provided in (Kingma and Ba, 2014). The batch size is set to 8 documents, and an epoch is defined as 10,000 randomly sampled documents. Convergence is reached within 200 epochs on the DailyMail dataset and 120 epochs on the CNN dataset. It takes about one day for every 30 epochs on a GTX-1080 GPU card. γ is tuned on the validation set and the best choice is 300. The beam sizes for the word decoder and sentence decoder are 15 and 2, respectively.
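The text preprocessing described above can be sketched in a few lines. Whitespace splitting stands in for a real tokenizer, the two example sentences and the vocabulary size of 4 are made up for the illustration (the paper keeps 40,000 words), and the helper names are hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace every digit with '#', and split on whitespace."""
    return re.sub(r"\d", "#", text.lower()).split()

def build_vocab(token_lists, size):
    """Keep the `size` most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(size)}

def apply_vocab(tokens, vocab):
    """Replace tokens outside the vocabulary with <OOV>."""
    return [tok if tok in vocab else "<OOV>" for tok in tokens]

docs = ["Paid 12,500 in 2014 .", "Paid 300 in 2015 ."]   # invented sentences
token_lists = [normalize(d) for d in docs]
vocab = build_vocab(token_lists, size=4)
encoded = apply_vocab(token_lists[0], vocab)
```

Digit masking collapses "2014" and "2015" into the same "####" token, which is why that token makes it into the small vocabulary while the two differently shaped amounts do not.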

Evaluation
We adopt the widely used ROUGE (Lin, 2004) toolkit for evaluation. We first compare with the reported results in (Chen et al., 2016) including various traditional extractive methods and a state-of-the-art abstractive model (Distraction-M3) on the CNN dataset, as shown in Table 2. Uni-GRU is a non-hierarchical seq2seq baseline model. In Table 3 we compare our method with the results of state-of-the-art neural summarization methods reported in recent papers. Extractive models include NN-SE (Cheng and Lapata, 2016) and SummaRuNNer (Nallapati et al., 2017), while SummaRuNNer-abs is also an extractive model similar to SummaRuNNer but is trained directly on the abstractive summaries. Moreover, we include several baselines for comparison, including the baselines reported in (Cheng and Lapata, 2016) although they are tested on 500 samples of the test set. LREG is a feature based method using linear regression. NN-ABS is a neural abstractive baseline which is a simple hierarchical extension of (Rush et al., 2015). NN-WE is the abstractive model which restricts the generation of words from the original document. Lead-3 is a strong extractive baseline that uses the lead three sentences as the summary.
In Table 4 we compare our model with the abstractive attentional encoder-decoder models in (Nallapati et al., 2016), which leverage several effective techniques and achieve state-of-the-art performance on sentence abstractive summarization tasks. The words-lvt2k and words-lvt2k-ptr are flat models and words-lvt2k-hieratt is a hierarchical extension. Results in Table 2 show that our abstractive method is able to outperform traditional extractive methods and the distraction-based abstractive model. The results in Tables 3 and 4 show that our method improves considerably over neural abstractive baselines, and is able to outperform state-of-the-art neural extractive methods. An interesting observation is that the results of the hierarchical model in Table 4 are lower than those of the flat models, which may reflect the difficulty for a traditional attention model to identify the important information in a document.
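For readers unfamiliar with the metric and the Lead-3 baseline, the following is a simplified sketch of an n-gram overlap F1 score in the spirit of Rouge-N, plus a lead-three extractor. It is not the official ROUGE toolkit: it applies no stemming or other preprocessing, and its numbers should not be expected to match the toolkit's output. All names and example sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """F1 of clipped n-gram overlap between a candidate and a reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())   # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lead3(sentences):
    """Lead-3 baseline: concatenate the first three sentences."""
    return [tok for sent in sentences[:3] for tok in sent]

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()
r1 = rouge_n_f1(candidate, reference, 1)
r2 = rouge_n_f1(candidate, reference, 2)
```

In this toy case every candidate n-gram appears in the reference, so precision is 1 and the score is driven by recall: 3 of 6 unigrams and 2 of 5 bigrams are covered.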
Table 4: Rouge scores of the abstractive models of Nallapati et al. (2016) and our method.
Method               Rouge-1  Rouge-2  Rouge-L
words-lvt2k          32.5     11.8     29.5
words-lvt2k-ptr      32.1     11.7     29.2
words-lvt2k-hieratt  31.8     11.6     28.7
Our Method           38.1     13.9     34.0
We also conducted human evaluation on 20 random samples from the DailyMail test set and compared the summaries generated by our method with the outputs of Lead-3, NN-SE (Cheng and Lapata, 2016) and Distraction (Chen et al., 2016). The output summaries of NN-SE are provided by the authors, and the output summaries of Distraction are obtained by running the code provided by the authors on the DailyMail dataset. Three participants were asked to compare the generated summaries with the human summaries, and assess each summary from four independent perspectives: (1) How informative is the summary? (2) How concise is the summary? (3) How coherent (between sentences) is the summary? (4) How fluent and grammatical are the sentences of the summary? Each property is assessed with a score from 1 (worst) to 5 (best). The average results are presented in Table 5.
As shown in Table 5, our method consistently outperforms the previous state-of-the-art abstractive method Distraction. Compared with extractive methods, our method is able to generate more informative and concise summaries, which shows the advantage of abstractive methods. The Distraction method in fact usually produces the shortest summaries, but the conciseness score is low mainly because sometimes it generates repeated sentences. The repetition also causes Distraction to achieve a low coherence score. Concerning coherence and fluency, our abstractive method achieves slightly better scores than NN-SE, while not surprisingly Lead-3 gets the best scores. The fluency scores show the good ability of the abstractive model to generate fluent and grammatical sentences.

Model Validation
We conduct experiments to see how the model's performance is affected by the choice of the hyperparameters. For efficiency we test on 500 random samples from the DailyMail test set. Figure 2a shows the maximum average Rouge-2 F1-score achieved when the model is trained using different λ values within 200 and 300 epochs. When using a larger λ, the performance is better and the convergence is faster. When λ = 1.0 the model fails to train because it runs into a singular matrix. Figure 2b shows the results achieved when using different γ values in the hierarchical decoding algorithm. γ = 0 is the baseline of the traditional decoding algorithm which does not refer to the original document. The poor results indicate that even if the model is able to learn to identify the salient information in the original document, the performance is limited by the model's ability to generate a long output sequence. That may be a reason why simple extensions of seq2seq models fail on the abstractive document summarization task. The performance is significantly improved using a reasonable γ, and the optimal γ value is consistent with the one chosen on the validation set. When using an extremely large γ, the performance begins to decrease, because the model copies too much from the original document, and the generated text also becomes less fluent. Results show that introducing the reference mechanism in the hierarchical beam search is very effective. The γ factor significantly affects the results, but the optimal value is easy to determine on a validation set.
We also conduct ablation experiments on the CNN dataset to verify the effectiveness of the proposed model. Results on the CNN test set are shown in Table 6. "w/o GraphAtt" replaces the graph-based attention by a traditional attention function. "w/o SentenceBeam" removes the sentence-level beam search. "w/o BeamSearch" removes both the sentence-level and word-level beam search, and uses a greedy decoding algorithm with the reference mechanism. As seen from Table 6, the graph-based attention mechanism is significantly better than the traditional attention mechanism for the document summarization task. Beam search helps significantly improve the generated summaries. Our proposed decoding algorithm enables a sentence-level beam search, which helps improve generated summaries with multiple sentences.

Case Study
We show a case study of a sample from the DailyMail test set in Figure 3. We show the "@entity" tokens and the entity lengths here although they are removed in the evaluation. We compare our result with the output of a model using traditional attention, referred to as Baseline Attention. We also show the output generated by a Baseline Decoder, which sets γ = 0 and does not use the sentence-level beam search, to study the difficulty for a traditional decoder to generate multiple sentences.
Several observations can be made from Figure 3. The lead three sentences mainly focus on the money information and are not sufficient. As for the Baseline Decoder, it usually ends the generation too early. The "<eod>" token indicates where the original output stops. When we force the decoder not to end there, the model shows the ability to continue producing the important information. However, two flaws appear. The first is the repetition of "## -year -old". Because the word decoder is unaware of the previously generated sentences, it keeps generating this sequence as the subject. The second, and more important, is information incorrectness. The "## -month -old" is not appropriate to describe the woman in the story, and the "six -month prison sentence" is in fact "three months". Information incorrectness occurs because the decoder aims at generating a fluent sentence according to the input representation, with nothing favoring consistency with the original input. The proposed hierarchical decoding algorithm helps to alleviate the two problems. Awareness of all the generated sentences helps prevent the decoder from repeatedly generating the same important information. Favoring bigram overlap with the original sentences helps generate more correct sentences. For example, the model is able to correctly distinguish between the "three-month sentence" and the "##month suspend". In conclusion, our method is able to identify the most important information in the original document, and the decoding algorithm we propose is able to generate a more discourse-fluent and information-correct abstractive summary.
The visualization of the graph-based attention when our method generates the presented example is shown in Figure 4. It seems that the graph-based attention mechanism is able to find the important sentences in the input document, and the distraction mechanism makes the decoder focus on different sentences during decoding. Gradually the decoder attends to "<eod>" until it stops.
Figure 3:
Gold Summary: @entity 2 mary day , ## , claimed over £ ##,### in benefits despite not being eligible . she had £ ##,### savings in the bank which meant she was not entitled . day used taxpayers ' money to go on luxury holidays to @entity 1 indian resort of @entity 1 goa . pleaded guilty to dishonestly claiming benefits and has paid back money .
Lead3: a benefits cheat who pocketed almost £ ##,### of taxpayers ' money and spent it on a string of luxury holidays despite having £ ##,### in the bank has avoided jail . @entity 2 mary day , ## , of @entity 1 swanage in @entity 1 dorset , used taxpayers ' money to go on luxury holidays to the @entity 1 indian resort of @entity 1 goa for up to a month each time . day fraudulently claimed £ ##,### of income support and disability allowance despite having £ ##,### of her own savings in the bank .
## -year -old pleaded guilty to two counts of fraud .
Our Method: @entity 2 mary day , ## , used taxpayers ' money to go on luxury holidays to the @entity 1 indian resort of @entity 1 goa . despite having £ ##,### of her own savings in the bank , she claimed £ ##,### of income support and disability allowance . she pleaded guilty and had given the sentence for three months in prison , but suspended the sentence for ## months .
Figure 4: Attention heatmap when generating the example summary. I_i and O_i indicate the i-th sentence of the input and output, respectively. (Input axis: I1 to I19 and <eod>; output axis: O1 to O3 and <eod>.)

Conclusion and Future Work
In this paper we tackle the challenging task of abstractive document summarization, which is still less investigated to date. We study the difficulty of the abstractive document summarization task, and address the need to find salient content in the original document, which is overlooked by previous studies. We propose a novel graph-based attention mechanism in a hierarchical encoder-decoder framework, and propose a hierarchical beam search algorithm to generate multi-sentence summaries. Extensive experiments verify the effectiveness of the proposed method. Experimental results on two large-scale datasets demonstrate that our method achieves state-of-the-art abstractive document summarization performance. It is also able to achieve competitive results with state-of-the-art neural extractive summarization models.
Much future work remains. An appealing direction is to investigate the neural abstractive method on the multi-document summarization task, which is more challenging and lacks training data.