A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.


Introduction
Existing large-scale summarization datasets consist of relatively short documents. For example, articles in the CNN/Daily Mail dataset (Hermann et al., 2015) are on average about 600 words long. Similarly, existing neural summarization models have focused on summarizing sentences and short documents. In this work, we propose a model for effective abstractive summarization of longer documents. Scientific papers are an example of documents that are significantly longer than news articles (see Table 1). They also follow a standard discourse structure describing the problem, methodology, experiments/results, and finally conclusions (Suppe, 1998).
Most prior work on summarization focuses on extractive summarization. Examples of prominent approaches include frequency-based methods (Vanderwende et al., 2007), graph-based methods (Erkan and Radev, 2004), topic modeling (Steinberger and Jezek, 2004), and neural models (Nallapati et al., 2017). Abstractive summarization is an alternative approach where the generated summary may contain novel words and phrases and is more similar to how humans summarize documents (Jing, 2002). Recently, neural methods have led to encouraging results in abstractive summarization (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017; Li et al., 2017). These approaches employ a general framework of sequence-to-sequence (seq2seq) models (Sutskever et al., 2014) where the document is fed to an encoder network and another (recurrent) network learns to decode the summary. While promising, these methods focus on summarizing news articles, which are relatively short. Many other document types, however, are longer and structured. Seq2seq models tend to struggle with longer sequences because at each decoding step, the decoder needs to learn to construct a context vector capturing relevant information from all the tokens in the source sequence (Shao et al., 2017).
Our main contribution is an abstractive model for summarizing scientific papers, which are an example of long-form structured document types. Our model includes a hierarchical encoder that captures the discourse structure of the document and a discourse-aware decoder that generates the summary. Our decoder attends to different discourse sections, which allows the model to more accurately represent important information from the source and results in a better context vector. We also introduce two large-scale datasets of long and structured scientific papers obtained from arXiv and PubMed to support both training and evaluating models on the task of long document summarization. Evaluation results show that our method outperforms state-of-the-art summarization models.

Background
In the seq2seq framework for abstractive summarization, an input document $x$ is encoded using a Recurrent Neural Network (RNN) with $h^{(e)}_i$ being the hidden state of the encoder at timestep $i$. The last step of the encoder is fed as input to another RNN which decodes the output one token at a time. Given an input document along with the corresponding ground-truth summary $y$, the model is trained to output a summary $\hat{y}$ that is close to $y$. The output at timestep $t$ is predicted using the decoder input $x_t$, decoder hidden state $h^{(d)}_t$, and some information about the input sequence. This framework is the general seq2seq framework employed in many generation tasks including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) and summarization (Nallapati et al., 2016; Chopra et al., 2016).
Figure 1: Overview of our model. The word-level RNN is shown in blue and section-level RNN is shown in green. The decoder also consists of an RNN (orange) and a "predict" network for generating the summary. At each decoding time step $t$ (here $t=3$ is shown), the decoder forms a context vector $c_t$ which encodes the relevant source context ($c_0$ is initialized as a zero vector). Then the section and word attention weights are respectively computed using the green "section attention" and the blue "word attention" blocks. The context vector is used as another input to the decoder RNN and as an input to the "predict" network which outputs the next word using a joint pointer-generator network.
Attentive decoding. The attention mechanism maps the decoder state and the encoder states to an output vector, which is a weighted sum of the encoder states and is called context vector (Bahdanau et al., 2014). Incorporating this context vector at each decoding timestep (attentive decoding) is proven effective in seq2seq models. Formally, the context vector $c_t$ is defined as:
$$c_t = \sum_{i} \alpha^{(t)}_i h^{(e)}_i$$
where $\alpha^{(t)}_i$ are the attention weights calculated as follows:
$$\alpha^{(t)}_i = \mathrm{softmax}_i\big(\mathrm{score}(h^{(e)}_i, h^{(d)}_{t-1})\big)$$
where $\mathrm{softmax}_i$ means that the denominator's sum in the softmax function is over $i$. The score function can be defined in bilinear, additive, or multiplicative ways (Luong et al., 2015). We use the additive scoring function:
$$\mathrm{score}(h^{(e)}_i, h^{(d)}_{t-1}) = v_a^{\top} \tanh\big(\mathrm{linear}(h^{(e)}_i, h^{(d)}_{t-1})\big) \quad (2)$$
where $v_a$ is a weight vector and linear is a linear mapping function. I.e.,
$$\mathrm{linear}(\mathbf{X}_1, \mathbf{X}_2) = \mathbf{W}_1 \mathbf{X}_1 + \mathbf{W}_2 \mathbf{X}_2 + b \quad (3)$$
where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices and $b$ is the bias vector.
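As a rough illustration of attentive decoding, the following pure-Python sketch computes additive attention weights and the resulting context vector for a toy example. It assumes the additive score has the form v_a^T tanh(W1 h_enc + W2 h_dec + b); the function names, the toy dimensions, and all numeric values are hypothetical and are not taken from the authors' implementation.

```python
import math


def linear(W1, x1, W2, x2, b):
    """Return W1 x1 + W2 x2 + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * v for w, v in zip(row1, x1))
        + sum(w * v for w, v in zip(row2, x2))
        + bias
        for row1, row2, bias in zip(W1, W2, b)
    ]


def additive_score(h_enc, h_dec, v_a, W1, W2, b):
    """Additive score (assumed form): v_a^T tanh(linear(h_enc, h_dec))."""
    hidden = [math.tanh(u) for u in linear(W1, h_enc, W2, h_dec, b)]
    return sum(v * u for v, u in zip(v_a, hidden))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(enc_states, h_dec, v_a, W1, W2, b):
    """Return (attention weights, weighted sum of encoder states)."""
    scores = [additive_score(h, h_dec, v_a, W1, W2, b) for h in enc_states]
    alphas = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, enc_states)) for k in range(dim)]
    return alphas, context


# Toy example with three encoder states of dimension 2 (hypothetical values).
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_dec = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v_a = [1.0, 1.0]

alphas, context = context_vector(enc_states, h_dec, v_a, W1, W2, b)
```

In this toy setting the third encoder state receives the largest weight, so the context vector is pulled toward it. A real model would learn `W1`, `W2`, `b`, and `v_a` jointly with the encoder and decoder.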

Model
We now describe our discourse-aware summarization model (shown in Figure 1).
Encoder. Our encoder extends the RNN encoder to a hierarchical RNN that captures the document discourse structure. We first encode each discourse section and then encode the document. Formally, we encode the document as a vector $d$ according to the following:
$$d = \mathrm{RNN}_{doc}\big(\{h^{(s)}_1, \ldots, h^{(s)}_N\}\big)$$
where $\mathrm{RNN}(\cdot)$ denotes a function which is a recurrent neural network whose output is the final state of the network encoding the entire sequence. $N$ is the number of sections in the document and $h^{(s)}_j$ is the representation of section $j$ in the document consisting of a sequence of tokens:
$$h^{(s)}_j = \mathrm{RNN}_{sec}\big(\{x_{(j,1)}, \ldots, x_{(j,M)}\}\big)$$
where $x_{(j,i)}$ are dense embeddings corresponding to the tokens $w_{(j,i)}$ and $M$ is the maximum section length. The parameters of $\mathrm{RNN}_{sec}$ are shared for all the discourse sections. We use a single layer bidirectional LSTM (following the LSTM formulation of Graves et al. (2013)) for both $\mathrm{RNN}_{doc}$ and $\mathrm{RNN}_{sec}$; further extension to multilayer LSTMs is straightforward. We combine the forward and backward LSTM states to a single state using a simple feed-forward network:
$$h = \mathrm{relu}\big(W([\overrightarrow{h}, \overleftarrow{h}]) + b\big)$$
where $[\cdot,\cdot]$ shows the concatenation operation. Throughout, when we mention the RNN (LSTM) state, we are referring to this combined state of both forward and backward RNNs (LSTMs).
Discourse-aware decoder. When humans summarize a long structured document, depending on the domain and the nature of the document, they write about important points from different discourse sections of the document. For example, scientific paper abstracts typically include the description of the problem, discussion of the methods, and finally results and conclusions (Suppe, 1998). Motivated by this observation, we propose a discourse-aware attention method. Intuitively, at each decoding timestep, in addition to the words in the document, we also attend to the relevant discourse section (the "section attention" block in Figure 1). Then we use the discourse-related information to modify the word-level attention function. Specifically, the context vector representing the source document is:
$$c_t = \sum_{j=1}^{N} \sum_{i=1}^{M} \alpha^{(t)}_{(j,i)} h^{(e)}_{(j,i)}$$
where $h^{(e)}_{(j,i)}$ shows the encoder state of word $i$ in discourse section $j$ and $\alpha^{(t)}_{(j,i)}$ shows the corresponding attention weight to that encoder state. The scalar weights $\alpha^{(t)}_{(j,i)}$ are obtained according to:
$$\alpha^{(t)}_{(j,i)} = \mathrm{softmax}\big(\beta^{(t)}_j \, \mathrm{score}(h^{(e)}_{(j,i)}, h^{(d)}_{t-1})\big)$$
The score function is the additive attention function (Equation 2) and the weights $\beta^{(t)}_j$ are updated according to:
$$\beta^{(t)}_j = \mathrm{softmax}\big(\mathrm{score}(h^{(s)}_j, h^{(d)}_{t-1})\big)$$
At each timestep $t$, the decoder state $h^{(d)}_t$ and the context vector $c_t$ are used to estimate the probability distribution of next word $y_t$:
$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}\big(V^{\top} \mathrm{linear}(h^{(d)}_t, c_t)\big) \quad (7)$$
where $V$ is a vocabulary weight matrix and softmax is over the entire vocabulary.
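To make the discourse-aware attention more concrete, here is a minimal pure-Python sketch for a toy document with two sections. It assumes the word-level and section-level scores have already been computed, and that each section weight multiplies the scores of its words inside a single softmax over all words in the document; that exact form, along with every name and number below, is an assumption of this sketch rather than a description of the authors' code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def discourse_aware_attention(word_scores, section_scores):
    """word_scores[j][i] is the score of word i in section j.

    Returns (betas, alphas) where betas are section weights and
    alphas[j][i] are word weights normalized over the whole document.
    """
    betas = softmax(section_scores)
    # Scale every word score by the weight of the section it belongs to.
    scaled = [beta * s for beta, row in zip(betas, word_scores) for s in row]
    flat_alphas = softmax(scaled)
    # Restore the section/word nesting.
    alphas, start = [], 0
    for row in word_scores:
        alphas.append(flat_alphas[start:start + len(row)])
        start += len(row)
    return betas, alphas


def context_vector(alphas, enc_states):
    """Weighted sum of word-level encoder states over all sections."""
    dim = len(enc_states[0][0])
    context = [0.0] * dim
    for alpha_row, state_row in zip(alphas, enc_states):
        for alpha, state in zip(alpha_row, state_row):
            for k in range(dim):
                context[k] += alpha * state[k]
    return context


# Two sections with two words each; all word scores are equal, so any
# difference in attention comes from the section weights alone.
word_scores = [[1.0, 1.0], [1.0, 1.0]]
section_scores = [0.0, 2.0]
enc_states = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

betas, alphas = discourse_aware_attention(word_scores, section_scores)
context = context_vector(alphas, enc_states)
```

Because the second section has the higher section score, its words receive more attention even though the word scores are identical, which is the intended effect of letting section relevance modulate word-level attention.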
Copying from source. There has been a surge of recent works in sequence learning tasks to address the problem of unknown token prediction by allowing the model to occasionally copy words directly from source instead of generating a new token (Gu et al., 2016; See et al., 2017; Paulus et al., 2017; Wiseman et al., 2017). Following these works, we add an additional binary variable $z_t$ to the decoder, indicating generating a word from vocabulary ($z_t=0$) or copying a word from the source ($z_t=1$). The probability is learnt during training according to the following equation:
$$p(z_t = 1 \mid y_{1:t-1}) = \sigma\big(\mathrm{linear}(h^{(d)}_t, c_t, x_t)\big) \quad (8)$$
Then the next word $y_t$ is generated according to:
$$p(y_t \mid y_{1:t-1}) = \sum_{z \in \{0,1\}} p(y_t, z_t = z \mid y_{1:t-1})$$
The joint probability is decomposed as:
$$p(y_t, z_t = z) = \begin{cases} p_c(y_t \mid y_{1:t-1}) \, p(z_t = z \mid y_{1:t-1}), & z = 1 \\ p_g(y_t \mid y_{1:t-1}) \, p(z_t = z \mid y_{1:t-1}), & z = 0 \end{cases}$$
$p_g$ is the probability of generating a word from the vocabulary and is defined according to Equation 7.
$p_c$ is the probability of copying a word from the source vector $x$ and is defined as the sum of the word's attention weights. Specifically, the probability of copying a word $x_l$ is defined as:
$$p_c(y_t = x_l \mid y_{1:t-1}) = \sum_{(j,i):\, x_{(j,i)} = x_l} \alpha^{(t)}_{(j,i)} \quad (9)$$
Decoder coverage. In long sequences, the neural generation models tend to repeat phrases where the softmax layer predicts the same phrase multiple times over multiple timesteps. To address this issue, following See et al. (2017), we track attention coverage to avoid repeatedly attending to the same steps. This is done with a coverage vector $\mathrm{cov}^{(t)}$, the sum of attention weight vectors at previous timesteps:
$$\mathrm{cov}^{(t)} = \sum_{k=0}^{t-1} \alpha^{(k)}$$
The coverage implicitly includes information about the attended document discourse sections. We incorporate the decoder coverage as an additional input to the attention function:
$$\alpha^{(t)}_{(j,i)} = \mathrm{softmax}\big(\beta^{(t)}_j \, \mathrm{score}(h^{(e)}_{(j,i)}, \mathrm{cov}^{(t)}_{(j,i)}, h^{(d)}_{t-1})\big)$$
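The copy mechanism and the coverage vector described above can be sketched together in a few lines of pure Python. The sketch assumes that the copy probability of a word is the sum of the attention weights on its occurrences in the source, that the final distribution marginalizes over the copy/generate switch, and that coverage is a running sum of past attention vectors. The gate value `p_z_copy`, the toy vocabulary, and all function names are hypothetical; how coverage is fed back into the score function is not shown.

```python
from collections import defaultdict


def copy_distribution(source_tokens, attention):
    """p_c(word): sum of attention weights over every position holding that word."""
    p_copy = defaultdict(float)
    for token, weight in zip(source_tokens, attention):
        p_copy[token] += weight
    return dict(p_copy)


def next_word_distribution(p_generate, p_copy, p_z_copy):
    """Marginalize over the switch z: p(y) = p(z=1) p_c(y) + p(z=0) p_g(y)."""
    words = set(p_generate) | set(p_copy)
    return {
        w: p_z_copy * p_copy.get(w, 0.0) + (1.0 - p_z_copy) * p_generate.get(w, 0.0)
        for w in words
    }


def update_coverage(coverage, attention):
    """Add the current attention vector to the running coverage (new list)."""
    return [c + a for c, a in zip(coverage, attention)]


# Toy source with a repeated word, and attention over it at two decoding steps.
source_tokens = ["cascade", "hash", "tables", "hash"]
attention_steps = [[0.1, 0.3, 0.2, 0.4], [0.7, 0.1, 0.1, 0.1]]
p_generate = {"the": 0.6, "hash": 0.3, "<unk>": 0.1}  # hypothetical p_g
p_z_copy = 0.5  # hypothetical gate value; a model would predict it per step

# Coverage seen at step t is the sum of attention from steps before t.
coverage = [0.0] * len(source_tokens)
coverage_before_step = []
for attention in attention_steps:
    coverage_before_step.append(coverage)
    coverage = update_coverage(coverage, attention)

p_copy = copy_distribution(source_tokens, attention_steps[0])
p_next = next_word_distribution(p_generate, p_copy, p_z_copy)
```

Here "hash" appears twice in the source, so its copy probability pools both attention weights, and out-of-vocabulary source words such as "cascade" still receive non-zero probability through the copy path.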

Related work
Neural abstractive summarization models have been studied in the past (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016) and later extended by source copying (Miao and Blunsom, 2016; See et al., 2017), reinforcement learning (Paulus et al., 2017), and sentence salience information (Li et al., 2017). One model variant of Nallapati et al. (2016) is related to our model in using sentence-level information in attention. However, our model is different as it contains a hierarchical encoder, uses discourse sections in the decoding step, and has a coverage mechanism. Similarly, Ling and Rush (2017) proposed a coarse-to-fine attention model that uses hard attention to find the text chunks of importance and then only attends to words in those chunks. In contrast, we consider all the discourse sections using soft attention. The closest models to ours are those of See et al. (2017) and Paulus et al. (2017), who used a joint pointer-generator network for summarization. However, our model extends theirs by (i) a hierarchical encoder for modeling long documents and (ii) a discourse-aware decoder that captures the information flow from all discourse sections of the document. Finally, the recent work of Liu et al. is on multi-document summarization. Our datasets are obtained from scientific papers. Scientific document summarization has recently received extended attention (Qazvinian et al., 2013; Goharian, 2015, 2017b,a). In contrast to ours, existing approaches are extractive and rely on external information such as citations, which may not be available for all papers.

Data
Seq2seq models typically have a large number of parameters and thus require large amounts of training data with ground truth summaries. Researchers have constructed such training data from news articles (e.g., CNN, Daily Mail and New York Times articles), where the abstracts or highlights of news articles are considered as ground truth summaries (Nallapati et al., 2016; Paulus et al., 2017). However, news articles are relatively short and not suitable for the task of long-form document summarization. Following these works, we take scientific papers as an example of long documents with discourse information, where their abstracts can be used as ground-truth summaries. We introduce two datasets collected from scientific repositories, arXiv.org and PubMed.com.
The choice of scientific papers for our dataset is motivated by the fact that scientific papers are examples of long documents that follow a standard discourse structure and they already come with ground truth summaries, making it possible to train supervised neural models. We follow existing work in constructing large-scale summarization datasets that take news article abstracts as ground truth.
We remove the documents that are excessively long (e.g., theses) or too short (e.g., tutorial announcements), or that do not have an abstract or discourse structure. We use the level-1 section headings as the discourse information. For arXiv, we use the LaTeX files and convert them to plain text using Pandoc (https://pandoc.org) to preserve the discourse section information. We remove figures and tables using regular expressions to only preserve the textual information. We also normalize math formulas and citation markers with special tokens. We analyze the document section names and identify the most common concluding section names (e.g., conclusion, concluding remarks, summary). We only keep the sections up to the conclusion section of the document and we remove sections after the conclusion.
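Two of the preprocessing steps above, replacing math and citation markers with special tokens and dropping everything after the concluding section, might look roughly like the sketch below. The regular expressions assume LaTeX-style input with inline `$...$` math and `\cite{...}` commands, which is an assumption of this example; the paper does not give the actual patterns. The token `@xmath` mirrors the `@xmath2` that appears in the sample summary in Figure 2, while `@xcite`, the heading list, and the function names are hypothetical.

```python
import re

# Assumed set of concluding section names (lowercased); the real list was
# derived by the authors from the most common headings in the data.
CONCLUSION_NAMES = {"conclusion", "conclusions", "concluding remarks", "summary"}

MATH_RE = re.compile(r"\$[^$]+\$")               # inline math such as $n$
CITE_RE = re.compile(r"\\cite[a-z]*\{[^}]*\}")   # \cite{...}, \citep{...}, ...


def normalize(text):
    """Replace math formulas and citation markers with placeholder tokens."""
    text = MATH_RE.sub("@xmath", text)
    text = CITE_RE.sub("@xcite", text)
    return text


def truncate_after_conclusion(sections):
    """Keep sections up to and including the first concluding section.

    `sections` is a list of (heading, text) pairs in document order.
    """
    kept = []
    for heading, text in sections:
        kept.append((heading, text))
        if heading.strip().lower() in CONCLUSION_NAMES:
            break
    return kept


sections = [
    ("Introduction", r"hash tables are widely used \cite{knuth}"),
    ("Method", r"we use $n$ levels of tables"),
    ("Conclusions", "cascade hashing lowers collisions"),
    ("Acknowledgments", "we thank the reviewers"),
    ("Appendix", "additional proofs"),
]

cleaned = [(h, normalize(t)) for h, t in truncate_after_conclusion(sections)]
```

Matching headings against a fixed set is deliberately conservative: a document with no recognized concluding heading is kept in full rather than truncated at a guess.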
The statistics of our datasets are shown in Table 1. In our datasets, both documents and summaries are significantly longer than those in existing large-scale summarization datasets. We retain about 3% (5%) of PubMed (arXiv) as validation data and about another 3% (5%) for test; the rest is used for training.

Experiments
Setup. Similar to the majority of published research in the summarization literature (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017), we evaluate using the ROUGE automatic summarization evaluation metric (Lin, 2004) with full-length F-1 ROUGE scores. We lowercase all tokens and perform sentence and word tokenization using spaCy (Honnibal and Johnson, 2015).
Implementation details. We use TensorFlow 1.4 for implementing our models. We use the hyperparameters suggested by See et al. (2017). In particular, we use two bidirectional LSTMs with cell size of 256 and embedding dimensions of 128. Embeddings are trained from scratch and we did not find any gain using pre-trained embeddings. The vocabulary size is constrained to 50,000; using a larger vocabulary size did not result in any improvement. We use mini-batches of size 16 and we limit the document length to 2000 tokens, the section length to 500 tokens, and the number of sections to 4. We use batch-padding and dynamic unrolling to handle variable sequence lengths in LSTMs. Training was done using the Adagrad optimizer with learning rate 0.15 and an initial accumulator value of 0.1. The maximum decoder size was 210 tokens, which is in line with the average abstract length in our datasets. We first train the model without coverage and add it during the last two epochs to help the model converge faster. We train the models on NVIDIA Titan X Pascal GPUs. Training is performed for about 10 epochs and each training step takes about 3.2 seconds. We use beam search at decoding time with a beam size of 4. We train the abstractive baselines for about 250K iterations as suggested by their authors.
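The input length limits mentioned above (at most 4 sections, 500 tokens per section, and 2000 tokens per document) could be applied with a helper along the following lines. This is only a sketch: it assumes that sections beyond the fourth are simply dropped and that longer sections are clipped from the end, neither of which is specified in the text, and the function name is hypothetical.

```python
def truncate_document(sections, max_sections=4, max_section_len=500, max_doc_len=2000):
    """Clip a document, given as a list of token lists, to the length limits.

    Assumption of this sketch: extra sections are dropped and long sections
    are clipped at the end.
    """
    kept, total = [], 0
    for tokens in sections[:max_sections]:
        budget = min(max_section_len, max_doc_len - total)
        if budget <= 0:
            break
        clipped = tokens[:budget]
        kept.append(clipped)
        total += len(clipped)
    return kept


# Toy document with five sections of varying length.
document = [["w"] * 700, ["w"] * 300, ["w"] * 600, ["w"] * 50, ["w"] * 900]
truncated = truncate_document(document)
section_lengths = [len(s) for s in truncated]
```

With the default limits the per-section and per-document caps coincide (4 x 500 = 2000), so the document-level check only matters if the limits are changed independently.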
Comparison. We compare our method with several well-known extractive baselines as well as state-of-the-art abstractive models using their open-sourced implementations, when available; we follow the same training setup described in the corresponding papers. The compared methods are: LexRank (Erkan and Radev, 2004), SumBasic (Vanderwende et al., 2007), LSA (Steinberger and Jezek, 2004), Attn-Seq2Seq (Nallapati et al., 2016; Chopra et al., 2016), Pntr-Gen-Seq2Seq (See et al., 2017). The first three are extractive models and last two are abstractive. Pntr-Gen-Seq2Seq extends Attn-Seq2Seq by using a joint pointer network during decoding. For Pntr-Gen-Seq2Seq we use their reported hyperparameters to ensure that the result differences are not due to hyperparameter tuning.
Results. Our main results are shown in Tables 2 and 3. Our model significantly outperforms the state-of-the-art abstractive methods, showing its effectiveness on both datasets. We observe that our ROUGE-1 score is about 4 and 3 points higher than that of the abstractive model Pntr-Gen-Seq2Seq on the arXiv and PubMed datasets, respectively, providing a significant improvement. Our method also outperforms most of the extractive methods except for LexRank in one of the ROUGE scores. We note that since extractive methods copy salient sentences from the document, it is usually easier for them to achieve higher ROUGE scores.
Figure 2 illustrates the effectiveness of our model extensions in capturing various discourse information from the papers. It can be observed that the state-of-the-art Pntr-Gen-Seq2Seq model generates a summary that mostly focuses on introducing the problem, whereas our model generates a summary that includes more information about the methodology and impacts of the target paper. This indicates that the context vector in our model compared with Pntr-Gen-Seq2Seq is better able to capture important information from the source by attending to various discourse sections.
Figure 2: Example of a generated summary.
Our method: cascade hash tables are a common data structure used in large set of data storage and retrieval . such a time variation is essentially caused by possibly many collisions during keys hashing . in this paper , we present a set of hash schemes called cascade hash tables which consist of several levels ( @xmath2 ) of hash tables with different size . after constant probes , if an item ca 'nt find a free slot in limited probes in any hash table , it will try to find a cell in the second level , or subsequent lower levels . with this simple strategy , these hash tables will have descendant load factors , therefore lower collision probabilities .

Conclusions and future work
This work was the first attempt at addressing neural abstractive summarization of single, long documents. We presented a neural sequence-to-sequence model that is able to effectively summarize long and structured documents such as scientific papers. While our results are encouraging, there is still much room for improvement for this challenging task; our new datasets can help the community to further explore this problem.
We note that, following the convention in summarization research, our quantitative evaluation is performed with the ROUGE automatic metric. While ROUGE is an effective evaluation framework, it does not capture nuances in the coherence or coverage of the summaries. It is non-trivial to evaluate such qualities, especially for long document summarization; future work can design expert human evaluations to explore these nuances.