Leveraging Graph to Improve Abstractive Multi-Document Summarization

Graphs that capture relations between textual units are of great benefit for detecting salient information in multiple documents and for generating overall coherent summaries. In this paper, we develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents, such as similarity graphs and discourse graphs, to more effectively process multiple input documents and produce abstractive summaries. Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial for summarizing long documents. Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries. Furthermore, pre-trained language models can be easily combined with our model, further improving the summarization performance significantly. Empirical results on the WikiSum and MultiNews datasets show that the proposed architecture brings substantial improvements over several strong baselines.


Introduction
Multi-document summarization (MDS) brings great challenges to the widely used sequence-to-sequence (Seq2Seq) neural architecture, as it requires effective representation of multiple input documents and content organization of long summaries. In MDS, different documents may contain the same content, include additional information, and present complementary or contradictory information (Radev, 2000). Thus, different from single-document summarization (SDS), cross-document links are very important for extracting salient information, detecting redundancy and generating overall coherent summaries in MDS. Graphs that capture relations between textual units are of great benefit to MDS, as they can help generate more informative, concise and coherent summaries from multiple documents. Moreover, graphs can be easily constructed by representing text spans (e.g. sentences, paragraphs, etc.) as graph nodes and the semantic links between them as edges. Graph representations of documents, such as similarity graphs based on lexical similarities (Erkan and Radev, 2004) and discourse graphs based on discourse relations (Christensen et al., 2013), have been widely used in traditional graph-based extractive MDS models. However, they are not well studied by most abstractive approaches, especially end-to-end neural approaches. Little work has studied the effectiveness of explicit graph representations for neural abstractive MDS.
In this paper, we develop a neural abstractive MDS model which can leverage explicit graph representations of documents to more effectively process multiple input documents and distill abstractive summaries. Our model augments the end-to-end neural architecture with the ability to incorporate well-established graphs into both the document representation and summary generation processes. Specifically, a graph-informed attention mechanism is developed to incorporate graphs into the document encoding process, which enables our model to capture richer cross-document relations. Furthermore, graphs are utilized to guide the summary generation process via a hierarchical graph attention mechanism, which takes advantage of the explicit graph structure to help organize the summary content. Benefiting from the graph modeling, our model can extract salient information from long documents and generate coherent summaries more effectively. We experiment with three types of graph representations, including similarity graph, topic graph and discourse graph, all of which significantly improve the MDS performance. Additionally, our model is complementary to most pre-trained language models (LMs), like BERT (Devlin et al., 2019), RoBERTa and XLNet. They can be easily combined with our model to process much longer inputs. The combined model adopts the advantages of both our graph model and pre-trained LMs. Our experimental results show that our graph model significantly improves the performance of pre-trained LMs on MDS.
The contributions of our paper are as follows: • Our work demonstrates the effectiveness of graph modeling in neural abstractive MDS. We show that explicit graph representations are beneficial for both document representation and summary generation.
• We propose an effective method to incorporate explicit graph representations into the neural architecture, and an effective method to combine pre-trained LMs with our graph model to process long inputs more effectively.
• Our model brings substantial improvements over several strong baselines on both the WikiSum and MultiNews datasets. We also report extensive analysis results, demonstrating that graph modeling enables our model to process longer inputs with better performance, and that graphs with richer relations are more beneficial for MDS.

Related Work

Graph-based MDS
Most previous MDS approaches are extractive: they extract salient textual units from documents based on graph-based representations of sentences. Various ranking methods have been developed to rank textual units based on graphs and select the most salient ones for inclusion in the final summary. Erkan and Radev (2004) propose LexRank to compute sentence importance based on a lexical similarity graph of sentences. Mihalcea and Tarau (2004) propose a graph-based ranking model to extract salient sentences from documents. Wan (2008) further proposes to incorporate document-level information and sentence-to-document relations into the graph-based ranking process. A series of variants of the PageRank algorithm has been further developed to compute the salience of textual units recursively based on various graph representations of documents (Wan and Xiao, 2009; Cai and Li, 2012). More recently, Yasunaga et al. (2017) propose a neural graph-based model for extractive MDS. An approximate discourse graph is constructed based on discourse markers and entity links. The salience of sentences is estimated using features from graph convolutional networks (Kipf and Welling, 2016). Yin et al. (2019) also propose a graph-based neural sentence ordering model, which utilizes an entity linking graph to capture the global dependencies between sentences. (Code and results are available at: https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2020-GraphSum)
Although neural abstractive models have achieved promising results on SDS (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Celikyilmaz et al., 2018; Li et al., 2018a,b; Narayan et al., 2018; Yang et al., 2019a; Sharma et al., 2019; Perez-Beltrachini et al., 2019), it is not straightforward to extend them to MDS. Due to the lack of sufficient training data, earlier approaches try to simply transfer an SDS model to the MDS task (Lebanoff et al., 2018; Zhang et al., 2018; Baumel et al., 2018) or utilize unsupervised models relying on reconstruction objectives (Ma et al., 2016; Chu and Liu, 2019). Later, Liu et al. (2018) combine an extractive model with a standard Seq2Seq model. The above Seq2Seq models have not studied the importance of cross-document relations and graph representations in MDS.

Figure 1: Illustration of our model, which follows the encoder-decoder architecture. The encoder is a stack of transformer layers and graph encoding layers, while the decoder is a stack of graph decoding layers. We incorporate explicit graph representations into both the graph encoding layers and graph decoding layers.
Most recently, Liu and Lapata (2019a) propose a hierarchical transformer model to utilize the hierarchical structure of documents. They propose to learn cross-document relations based on a self-attention mechanism. They also propose to incorporate explicit graph representations into the model by simply replacing the attention weights with a graph matrix; however, this does not achieve obvious improvement according to their experiments. Our work is partly inspired by theirs, but our approach is quite different. In contrast to their approach, we incorporate explicit graph representations into the encoding process via a graph-informed attention mechanism. Under the guidance of the explicit relations in graphs, our model can learn better and richer cross-document relations, and thus achieves significantly better performance. We also leverage the graph structure to guide the summary decoding process, which is beneficial for long summary generation. Additionally, we combine the advantages of pre-trained LMs into our model.

Summarization with Pretrained LMs
Pretrained LMs (Peters et al., 2018; Radford et al.; Devlin et al., 2019; Dong et al., 2019; Sun et al., 2019) have recently emerged as a key technology for achieving impressive improvements in a wide variety of natural language tasks, including both language understanding and language generation (Edunov et al., 2019; Rothe et al., 2019). Liu and Lapata (2019b) incorporate a pre-trained BERT encoder into an SDS model and achieve significant improvements. Dong et al. (2019) further propose a unified LM for both language understanding and language generation tasks, which achieves state-of-the-art results on several generation tasks including SDS. In this work, we propose an effective method to combine pre-trained LMs with our graph model, enabling them to process much longer inputs effectively.

Model Description
In order to process long source documents more effectively, we follow Liu and Lapata (2019a) in splitting source documents into multiple paragraphs by line-breaks. The graph representation of the documents is then constructed over paragraphs. For example, a similarity graph can be built based on cosine similarities between tf-idf representations of paragraphs. Let G denote a graph representation matrix of the input documents, where G[i][j] indicates the relation weight between paragraphs P_i and P_j. Formally, the task is to generate the summary S of the document collection given the L input paragraphs P_1, ..., P_L and their graph representation G.
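As an illustration, a similarity graph over paragraphs can be sketched as follows. The helper names (`tfidf_vectors`, `similarity_graph`) and the exact tf-idf weighting are our own assumptions for this sketch; the paper does not fix a particular formulation.

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Compute sparse tf-idf vectors for tokenized paragraphs.
    (One common tf-idf variant; assumed here for illustration.)"""
    n = len(paragraphs)
    df = Counter()
    for para in paragraphs:
        df.update(set(para))
    vectors = []
    for para in paragraphs:
        tf = Counter(para)
        vectors.append({t: (tf[t] / len(para)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_graph(paragraphs):
    """G[i][j] holds the tf-idf cosine similarity between paragraphs i and j."""
    vecs = tfidf_vectors(paragraphs)
    L = len(vecs)
    return [[cosine(vecs[i], vecs[j]) for j in range(L)] for i in range(L)]
```

The resulting matrix G is symmetric, with larger weights for lexically closer paragraph pairs and zero for paragraphs sharing no informative terms.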
Our model is illustrated in Figure 1, which follows the encoder-decoder architecture (Bahdanau et al., 2015). The encoder is composed of several token-level transformer encoding layers and paragraph-level graph encoding layers which can be stacked freely. The transformer encoding layer follows the Transformer architecture introduced in Vaswani et al. (2017), encoding contextual information for tokens within each paragraph. The graph encoding layer extends the Transformer architecture with a graph attention mechanism to incorporate explicit graph representations into the encoding process. Similarly, the decoder is composed of a stack of graph decoding layers. They extend the Transformer with a hierarchical graph attention mechanism to utilize explicit graph structure to guide the summary decoding process. In the following, we will focus on the graph encoding layer and graph decoding layer of our model.

Graph Encoding Layer
As shown in Figure 1, based on the output of the token-level transformer encoding layers, the graph encoding layer is used to encode all documents globally. Most existing neural work only utilizes attention mechanism to learn latent graph representations of documents where the graph edges are attention weights (Liu and Lapata, 2019a;Niculae et al., 2018;Fernandes et al., 2018). However, much work in traditional MDS has shown that explicit graph representations are very beneficial to MDS. Different types of graphs capture different kinds of semantic relations (e.g. lexical relations or discourse relations), which can help the model focus on different facets of the summarization task. In this work, we propose to incorporate explicit graph representations into the neural encoding process via a graph-informed attention mechanism. It takes advantage of the explicit relations in graphs to learn better inter-paragraph relations. Each paragraph can collect information from other related paragraphs to capture global information from the whole input.
Graph-informed Self-attention The graph-informed self-attention extends the self-attention mechanism to consider the pairwise relations in explicit graph representations. Let x_i^{l-1} denote the output of the (l-1)-th graph encoding layer for paragraph P_i, where x_i^0 is just the input paragraph vector. For each paragraph P_i, the context representation u_i can be computed as a weighted sum of linearly transformed paragraph vectors:

e_{ij} = ((x_i^{l-1} W_Q)(x_j^{l-1} W_K)^T) / sqrt(d)   (1)
α_{ij} = softmax(e_{ij} + R_{ij})
u_i = Σ_{j=1}^{L} α_{ij} (x_j^{l-1} W_V)

where W_K, W_Q and W_V ∈ R^{d×d} are parameter weights, and e_{ij} denotes the latent relation weight between paragraphs P_i and P_j. The main difference of our graph-informed self-attention is the additional pairwise relation bias R_{ij}, which is computed as a Gaussian bias of the weights of the graph representation matrix G:

R_{ij} = -(1 - G[i][j])^2 / (2σ^2)   (2)

where σ denotes the standard deviation that represents the influence intensity of the graph structure. We set it empirically by tuning on the development dataset. The Gaussian bias R_{ij} ∈ (-inf, 0] measures the tightness between paragraphs P_i and P_j. Due to the exponential operation in the softmax function, adding the Gaussian bias approximates multiplying the latent attention distribution by a weight ∈ (0, 1].
In our graph-attention mechanism, the term e_{ij} in Equation 1 keeps the ability to model latent dependencies between any two paragraphs, while the term R_{ij} incorporates explicit graph representations as prior constraints into the encoding process. This way, our model can learn better and richer inter-paragraph relations and obtain more informative paragraph representations.
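The graph-informed attention described above can be sketched in a few lines. This is a simplified single-head version in which the projections W_Q, W_K and W_V are dropped (treated as identity) to keep the sketch short; it is not the paper's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def graph_informed_attention(x, G, sigma=2.0):
    """One graph-informed self-attention pass over paragraph vectors x.
    x: list of L paragraph vectors (each of dimension d)
    G: L x L graph representation matrix with weights in [0, 1]
    sigma: influence intensity of the graph structure."""
    L, d = len(x), len(x[0])
    out = []
    for i in range(L):
        scores = []
        for j in range(L):
            # latent relation weight (scaled dot product, projections omitted)
            e_ij = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            # Gaussian graph bias: tighter graph relation -> smaller penalty
            r_ij = -((1.0 - G[i][j]) ** 2) / (2.0 * sigma ** 2)
            scores.append(e_ij + r_ij)
        alpha = softmax(scores)
        # context representation: weighted sum of paragraph vectors
        out.append([sum(alpha[j] * x[j][k] for j in range(L)) for k in range(d)])
    return out
```

Because the bias is added before the softmax, each output remains a convex combination of the input vectors, with the graph acting as a soft prior rather than a hard mask.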
Then, a two-layer feed-forward network with ReLU activation and a highway layer normalization are applied to obtain the vector of each paragraph x_i^l:

x_i^l = LayerNorm(u_i + ReLU(u_i W_{f1} + b_1) W_{f2} + b_2)

where W_{f1} ∈ R^{d×d_ff} and W_{f2} ∈ R^{d_ff×d} are learnable parameters, and d_ff is the hidden size of the feed-forward layer.

Graph Decoding Layer
Graphs can also contribute to the summary generation process. The relations between textual units can help generate more coherent or concise summaries. For example, Christensen et al. (2013) propose to leverage an approximate discourse graph to help generate coherent extractive summaries, where the discourse relations between sentences are used to help order summary sentences. In this work, we propose to incorporate explicit graph structure into the end-to-end summary decoding process. Graph edges are used to guide the summary generation process via a hierarchical graph attention, which is composed of a global graph attention and a local normalized attention. As the other components of the graph decoding layer are similar to the Transformer architecture, we focus on the extension of the hierarchical graph attention.

Global Graph Attention
The global graph attention is developed to capture the paragraph-level context information in the encoder part. Different from the context attention in Transformer, we utilize the explicit graph structure to regularize the attention distributions so that graph representations of documents can be used to guide the summary generation process. Let y_t^{l-1} denote the output of the (l-1)-th graph decoding layer for the t-th token in the summary. We assume that each token will align with several related paragraphs, one of which is at the central position. Since the prediction of the central position depends on the corresponding query token, we apply a feed-forward network to transform y_t^{l-1} into a positional hidden state, which is then mapped into a scalar s_t by a linear projection:

s_t = L · sigmoid(U_p^T tanh(W_p y_t^{l-1}))

where W_p ∈ R^{d×d} and U_p ∈ R^d denote weight parameters. s_t indicates the central position of the paragraphs that are aligned with the t-th summary token.
With the central position, other paragraphs are determined by the graph structure. Then an attention distribution over all paragraphs, under the regularization of the graph structure, can be obtained:

β_{tj} = softmax(e_{tj} + R_{s_t,j}),  with  R_{s_t,j} = -(1 - G[s_t][j])^2 / (2σ^2)

where e_{tj} denotes the attention weight between the token vector y_t^{l-1} and the paragraph vector x_j, which is computed similarly to Equation 1. The global context vector can then be obtained as a weighted sum of paragraph vectors:

g_t = Σ_{j=1}^{L} β_{tj} x_j

In our decoder, graphs are also modeled as a Gaussian bias. Different from the encoder, a central mapping position is first decided, and then the graph relations corresponding to that position are used to regularize the attention distribution β_{tj}. This way, the relations in graphs help align the information between the source input and the summary output globally, thus guiding the summary decoding process.
Local Normalized Attention Then, a local normalized attention is developed to capture the token-level context information within each paragraph.
The local attention is applied to each paragraph independently and normalized by the global graph attention. This way, our model can process longer inputs effectively.
Let γ_{t,ji} denote the local attention distribution of the t-th summary token over the i-th token in the j-th input paragraph. The normalized attention is computed by:

γ̂_{t,ji} = γ_{t,ji} · β_{tj}   (6)

and the local context vector can be computed as a weighted sum of token vectors in all paragraphs:

l_t = Σ_{j=1}^{L} Σ_{i=1}^{n} γ̂_{t,ji} x_{ji}

Finally, the output of the hierarchical graph attention component is computed by concatenating and linearly transforming the global and local context vectors:

o_t = [g_t ; l_t] U_d

where U_d ∈ R^{2d×d} is a weight matrix. Through combining the local and global context, the decoder can utilize the source information more effectively.
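A minimal sketch of the hierarchical combination of global and local attention for one decoding step, assuming the attention distributions β (over paragraphs) and γ (over tokens within each paragraph) are already computed; the final projection U_d is omitted for brevity.

```python
def hierarchical_context(beta, gamma, token_vecs, para_vecs):
    """Combine global and local attention for one decoding step.
    beta:       global attention over L paragraphs (sums to 1)
    gamma:      per-paragraph local attention (each inner list sums to 1)
    token_vecs: token_vecs[j][i] is the vector of token i in paragraph j
    para_vecs:  para_vecs[j] is the vector of paragraph j
    Returns the concatenation of the global and local context vectors."""
    L = len(beta)
    d = len(para_vecs[0])
    # global context: weighted sum of paragraph vectors
    g = [sum(beta[j] * para_vecs[j][k] for j in range(L)) for k in range(d)]
    # local context with normalized attention: gamma_hat[j][i] = gamma[j][i] * beta[j]
    l = [0.0] * d
    for j, tokens in enumerate(token_vecs):
        for i, tv in enumerate(tokens):
            w = gamma[j][i] * beta[j]
            for k in range(d):
                l[k] += w * tv[k]
    return g + l  # concatenation of global and local context
```

Since each local distribution sums to one, the normalized weights γ̂ sum to one over all tokens of all paragraphs, so the local context is a proper convex combination of token vectors.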

Combined with Pre-trained LMs
Our model can be easily combined with pre-trained LMs. Pre-trained LMs are mostly based on sequential architectures which are more effective on short text. For example, both BERT (Devlin et al., 2019) and RoBERTa are pre-trained with a maximum of 512 tokens. Liu and Lapata (2019b) propose to utilize BERT on single document summarization tasks, truncating the input documents to 512 tokens on most tasks. However, thanks to graph modeling, our model can process much longer inputs. A natural idea is therefore to combine our graph model with pre-trained LMs so as to combine the advantages of both. Specifically, the token-level transformer encoding layer of our model can be replaced by a pre-trained LM like BERT. In order to take full advantage of both our graph model and pre-trained LMs, the input documents are formatted so that each paragraph is prepended with a special "[CLS]" token. The paragraphs are then encoded by a pre-trained LM, and the output vector of the "[CLS]" token is used as the vector of the corresponding paragraph. Finally, all paragraph vectors are fed into our graph encoder to learn global representations, and our graph decoder is used to generate the summaries.
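The paragraph-encoding step can be sketched as below; `lm_encode` is a placeholder for a real pre-trained encoder (e.g. BERT or RoBERTa via a library), which we do not reproduce here.

```python
def encode_paragraphs(paragraphs, lm_encode, cls_token="[CLS]"):
    """Prepend a [CLS] token to each tokenized paragraph, run the (placeholder)
    pre-trained LM over it, and keep the output at the [CLS] position as the
    paragraph vector that is later fed into the graph encoder."""
    para_vecs = []
    for tokens in paragraphs:
        hidden = lm_encode([cls_token] + tokens)  # one output vector per input token
        para_vecs.append(hidden[0])               # "[CLS]" position summarizes the paragraph
    return para_vecs
```

Encoding each paragraph independently keeps every LM call within the 512-token limit while the graph encoder models the cross-paragraph relations on top.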

Experimental Setup
Graph Representations We experiment with three well-established graph representations: similarity graph, topic graph and discourse graph. The similarity graph is built based on tf-idf cosine similarities between paragraphs to capture lexical relations. The topic graph is built based on LDA topic model (Blei et al., 2003) to capture topic relations between paragraphs. The edge weights are cosine similarities between the topic distributions of the paragraphs. The discourse graph is built to capture discourse relations based on discourse markers (e.g. however, moreover), co-reference and entity links as in Christensen et al. (2013). Other types of graphs can also be used in our model. In our experiments, if not explicitly stated, we use the similarity graph by default as it has been most widely used in previous work.

WikiSum Dataset
We follow Liu et al. (2018) and Liu and Lapata (2019a) in treating the generation of lead Wikipedia sections as an MDS task. The source documents are the reference webpages of the Wikipedia article and the top 10 search results returned by Google, while the summary is the Wikipedia article's first section. As the source documents are very long and messy, they are split into multiple paragraphs by line-breaks. Further, the paragraphs are ranked by the title, and top-ranked paragraphs are selected as input for MDS systems. We directly utilize the ranking results from Liu and Lapata (2019a), and the top-40 paragraphs are used as source input. The average length of each paragraph and the target summary are 70.1 tokens and 139.4 tokens, respectively. For the seq2seq baselines, paragraphs are concatenated as a sequence in the ranking order, and the lead tokens are used as input. The dataset is split into 1,579,360 instances for training, 38,144 for validation and 38,205 for testing, similar to Liu and Lapata (2019a). We build similarity graph representations over paragraphs on this dataset.

MultiNews Dataset
The average length of the source documents and output summaries are 2103.5 tokens and 263.7 tokens, respectively. For the seq2seq baselines, we truncate the N input documents to L tokens by taking the first L/N tokens from each source document. Then we concatenate the truncated source documents into a sequence in the original order. Similarly, for our graph model, the input documents are truncated to M paragraphs by taking the first M/N paragraphs from each source document. We build all three types of graph representations on this dataset to explore the influence of graph types on MDS.
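One plausible reading of the paragraph-level truncation strategy described above, sketched with hypothetical helper names:

```python
def truncate_inputs(docs, max_paras):
    """Take roughly the first max_paras // N paragraphs from each of the N
    source documents, preserving the original document order.
    (An assumed, simplified reading of the truncation described above.)"""
    per_doc = max(1, max_paras // len(docs))
    selected = []
    for doc in docs:
        selected.extend(doc[:per_doc])
    return selected[:max_paras]
```

Taking the lead paragraphs of every document, rather than the lead of the concatenation, keeps all source documents represented in the truncated input.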
Training Configuration We train all models with maximum likelihood estimation, and use label smoothing (Szegedy et al., 2016) with smoothing factor 0.1. The optimizer is Adam (Kingma and Ba, 2015) with learning rate 2, β1 = 0.9 and β2 = 0.998. We also apply learning rate warmup over the first 8,000 steps and decay as in Vaswani et al. (2017). Gradient clipping with maximum gradient norm 2.0 is also utilized during training. All models are trained on 4 GPUs (Tesla V100) for 500,000 steps with gradient accumulation every four steps. We apply dropout with probability 0.1 before all linear layers in our models. The number of hidden units in our models is set as 256, the feed-forward hidden size is 1,024, and the number of heads is 8. The numbers of transformer encoding layers, graph encoding layers and graph decoding layers are set as 6, 2 and 8, respectively. The parameter σ is set as 2.0 after tuning on the validation dataset. During decoding, we use beam search with beam size 5 and length penalty with factor 0.6. Trigram blocking is used to reduce repetitions.
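Trigram blocking can be sketched as a simple check over the generated prefix; the exact integration with beam search (discarding vs. penalizing candidates) is implementation-specific.

```python
def has_repeated_trigram(tokens):
    """Return True if the token sequence contains any trigram twice.
    During beam search, a candidate extension that would create such a
    repetition is blocked, reducing repeated phrases in the summary."""
    seen = set()
    for i in range(len(tokens) - 2):
        tri = tuple(tokens[i:i + 3])
        if tri in seen:
            return True
        seen.add(tri)
    return False
```

A decoder would call this on each hypothesis extended by a candidate token and skip candidates for which it returns True.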
For the models with pretrained LMs, we apply different optimizers for the pretrained part and other parts as in Liu and Lapata (2019b). Two Adam optimizers with β1 = 0.9 and β2 = 0.999 are used for the pretrained part and the other parts, respectively. The learning rate and warmup steps are set as 0.002 and 20,000 for the pretrained part, and 0.2 and 10,000 for the other parts. Other model configurations are in line with the corresponding pretrained LMs. We choose the base versions of BERT, RoBERTa and XLNet in our experiments.

Evaluation Results
We evaluate our models on both the WikiSum and MultiNews datasets to validate their effectiveness on different types of corpora. The summarization quality is evaluated using ROUGE F1 (Lin and Och, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) between system summaries and gold references as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency. Table 6 summarizes the evaluation results on the WikiSum dataset. Several strong extractive and abstractive baselines are also evaluated and compared with our models. The first block in the table shows the results of the extractive methods Lead and LexRank (Erkan and Radev, 2004). The second block shows the results of abstractive methods: (1) FT (Flat Transformer), a transformer-based encoder-decoder model on a flat token sequence; (2) T-DMCA, the best performing model of Liu et al. (2018); (3) HT (Hierarchical Transformer), a model with a hierarchical transformer encoder and a flat transformer decoder, proposed by Liu and Lapata (2019a). We report their results following Liu and Lapata (2019a). The last block shows the results of our models, which are fed with 30 paragraphs (about 2400 tokens) as input. The results show that all abstractive models outperform the extractive ones. Compared with FT, T-DMCA and HT, our model GraphSum achieves significant improvements on all three metrics, which demonstrates the effectiveness of our model. Furthermore, we develop several strong baselines which combine the Flat Transformer with pre-trained LMs. We replace the encoder of FT with the base versions of pre-trained LMs, obtaining BERT+FT, XLNet+FT and RoBERTa+FT. For these models, the source input is truncated to 512 tokens. The results show that the pre-trained LMs significantly improve the summarization performance.
As RoBERTa boosts the summarization performance most significantly, we also combine it with our GraphSum model, namely GraphSum+RoBERTa. The results show that GraphSum+RoBERTa further improves the summarization performance on all metrics, demonstrating that our graph model can be effectively combined with pre-trained LMs. The significant improvements over RoBERTa+FT also demonstrate the effectiveness of our graph modeling even with pre-trained LMs. Table 7 summarizes the evaluation results on the MultiNews dataset. Similarly, the first block shows two popular extractive baselines, and the second block shows several strong abstractive baselines. We report the results of Lead, LexRank, PG-BRNN, HiMAP and FT following Fabbri et al. (2019). The last block shows the results of our models. The results show that our model GraphSum consistently outperforms all baselines, which further demonstrates the effectiveness of our model on different types of corpora. We also compare the performance of RoBERTa+FT and GraphSum+RoBERTa, which shows that our model significantly improves all metrics. The above evaluation results on both the WikiSum and MultiNews datasets validate the effectiveness of our model. The proposed method of modeling graphs in an end-to-end neural model greatly improves the performance of MDS.

Model Analysis
We further analyze the effects of graph types and input length on our model, and validate the effectiveness of different components of our model by ablation studies.

Effects of Graph Types
To study the effects of graph types, the results of GraphSum+RoBERTa with similarity graph, topic graph and discourse graph are compared on the MultiNews test set. The last block in Table 7 summarizes the comparison results, which show that the topic graph achieves better performance than similarity graph on ROUGE-1 and ROUGE-2, and the discourse graph achieves the best performance on ROUGE-2 and ROUGE-L. The results demonstrate that graphs with richer relations are more helpful to MDS.
Effects of Input Length Different lengths of input may seriously affect the summarization performance of Seq2Seq models, so most of them restrict the length of the input and only feed the model with hundreds of lead tokens. As stated by Liu and Lapata (2019a), the FT model achieves the best performance when the input length is set to 800 tokens, while longer input hurts performance. To explore the effectiveness of our GraphSum model on different lengths of input, we compare it with HT on 500, 800, 1600, 2400 and 3000 tokens of input, respectively. Table 3 summarizes the comparison results, which show that our model outperforms HT on all input lengths. More importantly, the advantages of our model on all three metrics tend to become larger as the input becomes longer. The results demonstrate that modeling graphs in the end-to-end model enables it to process much longer inputs with better performance.

Ablation Studies Table 4 summarizes the results of ablation studies aiming to validate the effectiveness of individual components. Our experiments confirm that incorporating well-known graphs into the encoding process by our graph encoder (see w/o graph enc) and utilizing graphs to guide the summary decoding process by our graph decoder (w/o graph dec) are both beneficial for MDS.

Human Evaluation
In addition to the automatic evaluation, we also assess system performance by human evaluation. We randomly select 50 test instances from the WikiSum test set and 50 from the MultiNews test set, and invite 3 annotators to assess the outputs of different models independently. Annotators assess the overall quality of summaries by ranking them, taking into account the following criteria: (1) Informativeness: does the summary convey important facts of the input? (2) Fluency: is the summary fluent and grammatical? (3) Succinctness: does the summary avoid repeating information? Annotators are asked to rank all systems from 1 (best) to 5 (worst). Rankings may tie for different systems if they have similar quality. For example, the ranking of five systems could be 1, 2, 2, 4, 5 or 1, 2, 3, 3, 3. Systems get scores of 2, 1, 0, -1, -2 for rankings 1, 2, 3, 4, 5, respectively. The rating of each system is computed by averaging the scores on all test instances. The ranking results and overall ratings are reported in Table 5.

Table 5: Ranking results of system summaries by human evaluation. 1 is the best and 5 is the worst. A larger rating denotes better summary quality. R.B. and G.S. are abbreviations of RoBERTa and GraphSum, respectively. * indicates that the overall ratings of the corresponding model are significantly (by Welch's t-test with p < 0.01) outperformed by our models GraphSum and GraphSum+RoBERTa.

The results demonstrate that GraphSum and GraphSum+RoBERTa are able to generate higher quality summaries than other models. Specifically, the summaries generated by GraphSum and GraphSum+RoBERTa usually contain more salient information, and are more fluent and concise than those of other models. The human evaluation results further validate the effectiveness of our proposed models.
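The rank-to-rating conversion described above is straightforward to express in code:

```python
def overall_rating(rankings):
    """Convert per-instance rankings (1 = best .. 5 = worst) into the overall
    rating by mapping ranks 1..5 to scores 2, 1, 0, -1, -2 and averaging."""
    score = {1: 2, 2: 1, 3: 0, 4: -1, 5: -2}
    return sum(score[r] for r in rankings) / len(rankings)
```

Under this scheme, a system ranked in the middle (rank 3) on every instance receives an overall rating of 0.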

Conclusion
In this paper we explore the importance of graph representations in MDS and propose to leverage graphs to improve the performance of neural abstractive MDS. Our proposed model is able to incorporate explicit graph representations into the document encoding process to capture richer relations within long inputs, and utilize explicit graph structure to guide the summary decoding process to generate more informative, fluent and concise summaries. We also propose an effective method to combine our model with pre-trained LMs, which further improves the performance of MDS significantly. Experimental results show that our model outperforms several strong baselines by a wide margin. In the future we would like to explore other more informative graph representations such as knowledge graphs, and apply them to further improve the summary quality.