Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks

Text summarization aims to compress a textual document into a short summary while keeping salient information. Extractive approaches are widely used in text summarization because of their fluency and efficiency. However, most existing extractive models hardly capture inter-sentence relationships, particularly in long documents. They also often ignore the effect of topical information on capturing important contents. To address these issues, this paper proposes a graph neural network (GNN)-based extractive summarization model, which captures inter-sentence relationships efficiently via a graph-structured document representation. Moreover, our model integrates a joint neural topic model (NTM) to discover latent topics, which provide document-level features for sentence selection. The experimental results demonstrate that our model not only achieves state-of-the-art results on the CNN/DM and NYT datasets but also considerably outperforms existing approaches on scientific paper datasets consisting of much longer documents, indicating its better robustness across document genres and lengths. Further discussions show that topical information can help the model preselect salient contents from an entire document, which explains its effectiveness in long document summarization.


Introduction
Text summarization is an important task in natural language processing that helps people rapidly acquire important information from large numbers of documents. Previous summarization approaches can mainly be classified into two categories: abstractive and extractive. Neural abstractive models usually use a seq2seq framework (Sutskever et al., 2014) to generate a summary word by word after encoding the full document. By contrast, extractive models directly select important sentences from the original document and then aggregate them into a summary. Abstractive models are generally more flexible but may produce disfluent or ungrammatical summary texts (Liu and Lapata, 2019b), whereas extractive models have advantages in factuality and efficiency (Cao et al., 2018).
Despite their success, modeling long-range inter-sentence relationships for summarization remains a challenge (Xu et al., 2019b). Hierarchical networks are usually applied to tackle this problem by modeling a document as a sequence of sequences (Cohan et al., 2018; Zhang et al., 2019). However, empirical observations (Liu and Lapata, 2019a) showed that such a paradigm for modeling inter-sentence relationships does not provide much performance gain for summarization. Hierarchical approaches are also slow to train and tend to overfit (Xiao and Carenini, 2019). Most recently, graph neural networks (GNNs) have been widely explored to model cross-sentence relationships for the summarization task. The critical step of this framework is building an effective document graph. Several studies (Xu et al., 2019a; Yasunaga et al., 2017) built document graphs based on discourse analysis. However, this approach depends on external tools and may lead to other problems, such as semantically fragmented outputs. Wang and Liu (2020) built a word-sentence document graph based on word occurrence, but such a statistical graph-building approach hardly captures semantic-level relationships. Therefore, how to effectively model a document as a graph for summarization remains an open question.
Another critical point of summarization is modeling global information, which plays a key role in sentence selection (Xiao and Carenini, 2019). Pre-trained language models can considerably boost the performance of summarization (Liu and Lapata, 2019a;Zhang et al., 2019) since they effectively capture context features. However, they are poor at modeling document-level information, particularly for long documents, because most of them are designed for sentences or a short paragraph (Xu et al., 2019b).
To tackle the abovementioned weaknesses, this paper proposes a novel graph-based extractive summarization model. First, we encode an entire document with a pre-trained BERT (Devlin et al., 2019) to learn contextual sentence representations and discover latent topics with a joint neural topic model (NTM; Miao et al., 2017; Srivastava and Sutton, 2017). Second, we build a heterogeneous document graph consisting of sentence and topic nodes and simultaneously update their representations with a modified graph attention network (GAT; Veličković et al., 2017). Third, the representations of sentence nodes are extracted to compute the final labels. Intuitively, our topic-sentence document graph has the following advantages: 1) During graph propagation, sentence representations can be enriched by topical information, which can be considered a kind of document-level feature and helps our model distil important contents from an entire document. 2) Topic nodes can act as intermediaries to bridge long-distance sentences; hence, our model can efficiently capture inter-sentence relationships. We evaluate our model on four standard datasets, including news articles and scientific papers. The experimental results show its effectiveness and superiority. To summarize, our contributions are threefold.
• We conduct a quantitative exploration of the effect of latent topics on document summarization and provide an intuitive understanding of how topical information helps summarize documents.
• We propose a novel graph-based neural extractive summarization model, which innovatively incorporates latent topics into graph propagation via a joint neural topic model. To the best of our knowledge, we are the first to propose applying NTM to the extractive text summarization task.
• The experimental results demonstrate that our proposed model not only achieves competitive results compared with state-of-the-art extractive models on news datasets but also considerably outperforms existing approaches on scientific paper datasets consisting of much longer documents, indicating its better robustness across document genres and lengths.

Related Work
Neural Extractive Summarization Neural networks have achieved remarkable results in extractive summarization. Existing works mainly regard extractive summarization as a sequence labeling task (Nallapati et al., 2017;Zhang et al., 2018;Dong et al., 2018) or sentence ranking task (Narayan et al., 2018). Pre-trained language models have provided substantial performance gain for summarization (Liu and Lapata, 2019a;Zhang et al., 2019;Xu et al., 2019). In the current work, we further model inter-sentence relationships with a graph encoder and enrich sentence representations with topical information after a BERT encoder.

Graph-based Summarization
Early works, such as TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004), built document graphs on the basis of inter-sentence similarity and extracted summary sentences in an unsupervised manner. Recently, the application of GNNs to document summarization has attracted considerable interest (Yasunaga et al., 2017; Xu et al., 2019b; Fernandes et al., 2018; Wang and Liu, 2020). Existing GNN-based summarization models build document graphs on the basis of only words or sentences. In contrast, we explore the effects of higher-level semantic units, i.e., latent topics.

Topic Modeling for Summarization
Topic modeling is a powerful approach to learning document features. However, it has rarely been applied to document summarization. Wei et al. (2012) proposed to build a document graph consisting of word, sentence, and topic nodes and to learn the graph with a Markov chain. Zheng et al. (2019) proposed to summarize multiple documents by mining cross-document subtopics. Narayan et al. (2018) recommended enriching word representations with topical information. Unlike them, we discover latent topics with a neural topic model jointly with summarization. To the best of our knowledge, NTMs have never been applied to the extractive summarization task.

Model
This section describes our model, namely, the topic-aware graph neural network for document summarization (Topic-GraphSum). Figure 1 presents the overall architecture. Given a document D = {s_1, s_2, ..., s_n} that consists of n sentences, the objective of our model is to learn a sequence of binary labels {y_1, y_2, ..., y_n}, where y_i ∈ {0, 1} indicates whether the i-th sentence should be included in the summary. Our model generally consists of three parts: the 1) document encoder, 2) neural topic model, and 3) graph attention layer. Given the input document, the document encoder learns contextual representations of each sentence with a pre-trained BERT. The NTM learns the document topic distribution and a group of topic representations. The graph attention layer builds a heterogeneous document graph over topics and sentences and then simultaneously updates their node representations. After graph encoding, sentence representations are further combined with topics and then sent to a sentence classifier to compute the final labels. We elucidate each part below.

Document Encoder
BERT is a bidirectional Transformer encoder pre-trained on a large corpus. Similar to previous works (Xu et al., 2019b; Liu and Lapata, 2019a), we employ a modified version of BERT to generate local context-aware hidden representations of sentences. Specifically, we insert [CLS] and [SEP] tokens at the beginning and end of each sentence, respectively. Then, we feed all tokens into the BERT layer and learn their hidden states.
where w_{i,j} denotes the j-th word of the i-th sentence, w_{i,0} and w_{i,*} denote the [CLS] and [SEP] tokens of the i-th sentence, and h_{i,j} denotes the hidden state of the corresponding token. After BERT encoding, we regard the hidden states of the [CLS] tokens, H^s = {h_{1,0}, ..., h_{n,0}}, as the contextual sentence representations, which will be further enriched by topical information.
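As an illustrative sketch (not the paper's implementation; shapes and names are our own assumptions), gathering sentence vectors from BERT's token hidden states reduces to indexing the rows at the [CLS] positions:

```python
import numpy as np

def sentence_representations(hidden_states, cls_positions):
    """Gather the hidden state at each [CLS] position; with one [CLS]
    inserted per sentence, row i is the i-th sentence representation."""
    return hidden_states[np.asarray(cls_positions)]

# toy example: 10 tokens, hidden size 4; sentences start at tokens 0, 4, 7
rng = np.random.default_rng(0)
H = rng.normal(size=(10, 4))          # stand-in for BERT token hidden states
reps = sentence_representations(H, [0, 4, 7])
assert reps.shape == (3, 4)           # one vector per sentence
```

In the full model these gathered vectors initialize the sentence nodes of the document graph.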

Neural Topic Model
The NTM is based on the Variational Autoencoder (VAE; Kingma and Welling, 2013) framework. It learns latent topics via an encoding-decoding process. Let x ∈ R^{|V|} be the bag-of-words representation of a given document, where V is the vocabulary. In the encoder, we have μ = f_μ(x) and σ = f_σ(x), where μ and σ are the prior parameters for parameterizing the topic distribution in the decoder networks. Functions f_μ and f_σ are linear transformations with ReLU activation. The decoder can be regarded as a three-step document generation process. First, we employ the Gaussian softmax (Miao et al., 2017) to draw the topic distribution, i.e., z ∼ N(μ, σ²), θ = softmax(f_θ(z)), where z is the latent topic variable, θ ∈ R^K is the topic distribution, and K is the predefined topic number. Second, we compute the probability of the predicted words p_w ∈ R^{|V|} through p_w = softmax(φθ).
φ ∈ R^{|V|×K} is analogous to the topic-word distribution matrix in LDA-style topic models, and φ_{i,j} represents the relevance between the i-th word and the j-th topic. Finally, we draw each word from p_w to reconstruct the input x. We leave out further details and refer readers to Miao et al. (2017). Considering that the intermediate parameters φ and θ encode topical information, we further use them to build topic representations as follows: T = f_T(φ^⊤) (Eq. 2) and t_D = θ^⊤T (Eq. 3), where T ∈ R^{K×d_t} represents a group of topic representations with a predefined dimension d_t, and f_T is a linear transformation with ReLU activation. t_D ∈ R^{d_t} is the weighted sum of the topic representations, which can be regarded as the overall topic representation of the document. T and t_D are used in the graph attention layer to enrich sentence representations. Other summarization approaches with topical information (Zheng et al., 2019; Narayan et al., 2018) learn topics as fixed features from an external model. In comparison, the latent topics of our model are learned via a neural approach and can be dynamically updated with the entire network.
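The encode-decode pass described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (random matrices stand in for the trained linear layers, the latent dimension equals K for brevity), not the authors' code:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def ntm_forward(x, W_mu, W_sigma, W_theta, phi, rng):
    """One encode-decode pass of the Gaussian-softmax NTM (sketch).
    x: bag-of-words vector (|V|,); phi: topic-word matrix (|V|, K)."""
    mu = np.maximum(W_mu @ x, 0)            # encoder mean, ReLU-activated linear map
    log_sigma = np.maximum(W_sigma @ x, 0)  # encoder log-variance parameter
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)  # reparameterization trick
    theta = softmax(W_theta @ z)            # topic distribution theta in R^K
    p_w = softmax(phi @ theta)              # predicted word probabilities in R^{|V|}
    return theta, p_w

# toy dimensions: vocabulary |V| = 6, topics K = 3
rng = np.random.default_rng(1)
V, K = 6, 3
x = rng.integers(0, 3, size=V).astype(float)
theta, p_w = ntm_forward(x, rng.normal(size=(K, V)), rng.normal(size=(K, V)),
                         rng.normal(size=(K, K)), rng.normal(size=(V, K)), rng)
assert np.isclose(theta.sum(), 1.0) and np.isclose(p_w.sum(), 1.0)
```

In training, p_w would be scored against x for the reconstruction loss while the KL term regularizes (μ, σ).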

Graph Propagation
We initialize the vectors of sentence nodes with H^s learned from the document encoder and those of topic nodes with T learned from the NTM (Eq. 2). Then, we update the node representations with a graph attention network, which can be denoted as:

z_{ij} = LeakyReLU(W_a [W_q h_i ; W_k h_j]),
α_{ij} = exp(z_{ij}) / Σ_{l∈N_i} exp(z_{il}),
h'_i = ‖_{m=1}^{M} σ(Σ_{j∈N_i} α_{ij} W_v h_j),   (Eq. 4)

where h_i is the representation of the i-th node, N_i denotes its neighbor nodes, and ‖ denotes the concatenation of M attention heads.
W_a, W_q, W_k, and W_v are trainable parameters. The vanilla GAT is designed for homogeneous graphs. However, our document graph is heterogeneous because sentences and topics should be considered different semantic units; hence, some adaptation is needed. Inspired by Hu et al. (2019), we adopt a convenient approach that projects the topic and sentence representations into an implicit common space, in which we calculate the attention weights. Let h_i^s be the i-th sentence node and h_j^t be the j-th topic node. We modify Eq. 4 by replacing the shared matrices with type-specific projection functions:

z_{ij} = LeakyReLU(W_a [f_s(h_i^s) ; f_t(h_j^t)]),   (Eq. 5)

where f_s and f_t are nonlinear transformation functions that project sentence and topic nodes into a common vector space, respectively.
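The adapted attention step can be sketched in NumPy as follows. Here W_s, W_t, and a are hypothetical stand-ins for the type-specific projections f_s, f_t and the attention vector, and the single-head, topics-to-sentences direction is a simplification of the full bidirectional multi-head layer:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def hetero_gat_update(h_s, h_t, W_s, W_t, a):
    """Single-head sketch of the modified GAT step: project sentence and
    topic nodes into a common space with type-specific matrices, score
    every (sentence, topic) pair, and aggregate topics into sentences."""
    p_s = h_s @ W_s.T                       # (n, d') projected sentence nodes
    p_t = h_t @ W_t.T                       # (K, d') projected topic nodes
    n, K = p_s.shape[0], p_t.shape[0]
    pairs = np.concatenate([np.repeat(p_s, K, axis=0),
                            np.tile(p_t, (n, 1))], axis=1)    # (n*K, 2d')
    z = leaky_relu(pairs @ a).reshape(n, K)                   # attention logits
    alpha = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax over topics
    return alpha @ p_t                      # topic-enriched sentence vectors

# toy sizes: 4 sentences (dim 5), 3 topics (dim 6), common space dim 2
rng = np.random.default_rng(2)
out = hetero_gat_update(rng.normal(size=(4, 5)), rng.normal(size=(3, 6)),
                        rng.normal(size=(2, 5)), rng.normal(size=(2, 6)),
                        rng.normal(size=4))
assert out.shape == (4, 2)
```

The symmetric update (sentences aggregated into topic nodes) follows the same pattern with the roles of the two node types swapped.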
The graph attention layer builds semantic relationships between sentences and topics. For example, during graph propagation, sentences can enrich their representations with topical information, which can be regarded as a global feature, while topics can capture their related sentences and distil salient contents from an entire document through their different topical relevance. Meanwhile, topic nodes can act as intermediaries that help build inter-sentence relationships because they are high-level semantic units spanning sentences.
After graph encoding, we obtain topic-sensitive sentence representations. We concatenate them with the overall topic representation t_D (Eq. 3) to further capture their topical relevance to the document. Then, we choose a single feed-forward layer as the sentence classifier[1] to predict the final labels, i.e., ŷ_i = σ(W_o [h_i ; t_D] + b_o), where σ(·) is the sigmoid function.
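A minimal sketch of this classification step, with W_o and b_o as assumed names for the feed-forward parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_labels(H, t_d, W_o, b_o):
    """Score each sentence: concatenate its graph-encoded vector with the
    overall topic vector t_d, then apply one feed-forward layer + sigmoid."""
    feats = np.concatenate([H, np.tile(t_d, (H.shape[0], 1))], axis=1)
    return sigmoid(feats @ W_o + b_o)       # (n,) selection probabilities

# toy sizes: 5 sentences of dim 4, overall topic vector of dim 3
rng = np.random.default_rng(3)
probs = predict_labels(rng.normal(size=(5, 4)), rng.normal(size=3),
                       rng.normal(size=7), 0.1)
assert probs.shape == (5,) and np.all((probs > 0) & (probs < 1))
```

At inference time, the highest-scoring sentences are extracted as the summary.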

Joint Training
We jointly train the NTM and the sentence classifier. For the NTM, the objective function is the negative evidence lower bound:

L_NTM = D_KL(q(z|x) ‖ p(z)) − E_{q(z|x)}[log p(x|z)],

where the first term is the Kullback-Leibler divergence loss and the second term is the reconstruction loss; q(z|x) and p(x|z) denote the encoder and decoder networks, respectively. The binary cross-entropy loss of the sentence classifier is:

L_cls = −Σ_{i=1}^{n} (y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)).

The final loss of our model is a linear combination of the two, with hyperparameter γ balancing their weights: L = L_cls + γ·L_NTM (Eq. 11).

Experimental Setup

Datasets
We conduct experiments on four datasets covering two document types: news articles and scientific papers. The summarization of news articles has been widely explored, but that of much longer scientific papers is more challenging since accurately encoding long texts for summarization is a known difficulty (Vaswani et al., 2017; Frermann and Klementiev, 2019). Therefore, we conduct experiments on scientific paper datasets to verify the generalization capability of our model for long documents. The detailed statistics of the four datasets are summarized in Table 1. CNN/DM (Hermann et al., 2015) is the most widely used standard dataset for document summarization. We use standard splits and preprocess the data in accordance with previous works (See et al., 2017; Liu and Lapata, 2019a; Wang and Liu, 2020). NYT (Sandhaus, 2008) is another popular summarization dataset, collected from the New York Times Annotated Corpus. We preprocess and divide this dataset according to Durrett et al. (2016). arXiv and PubMed (Cohan et al., 2018) are two newly constructed datasets for long document summarization, collected from arXiv.org and PubMed.com, respectively. Xiao and Carenini (2019) created oracle labels for the two datasets. We use the same split as Cohan et al. (2018).

[1] We also tried adding more advanced classifiers (e.g., CNN and RNN) on top of the GAT layer. However, the performance shows no substantial gain, indicating that our model has already learned sufficient features.

Models for Comparison
NeuSum is a neural extractive model based on the seq2seq framework with an attention mechanism.
BanditSum (Dong et al., 2018) regards sentence selection as a contextual bandit problem and trains the model with policy gradient methods. JECS (Xu and Durrett, 2019) is a compression-based summarization model that selects sentences and compresses them by pruning a dependency tree to reduce redundancy. BERTSUM (Liu and Lapata, 2019a) inserts multiple segmentation tokens into the document to obtain each sentence representation. It is the first BERT-based extractive summarization model, and we employ its framework as the basic document encoder of our model.
HiBERT (Zhang et al., 2019) modifies BERT into a hierarchical structure and designs an unsupervised method to pre-train it.
DISCOBERT (Xu et al., 2019b) is a state-of-the-art BERT-based extractive model that encodes documents with BERT and then updates sentence representations with a graph encoder. DISCOBERT builds a document graph with only sentence units based on discourse analysis, whereas our model incorporates latent topics into the document graph and produces a heterogeneous bipartite graph.

Implementation Details Hyperparameters
For the document encoder, we use "bert-base-uncased" as our pre-trained BERT version and fine-tune it in all experiments. To compare fairly with baseline approaches that lack pre-trained language models, we also implement a non-BERT version of our model by replacing the pre-trained BERT with a Bi-GRU (Chung et al., 2014) layer with a hidden size of 768. For the NTM, we set the topic number K = 50 and the dimension of topic representations to 512. We implement the GNNs with DGL (Wang et al., 2019b) and set the number of GAT layers to 2. We use 4 attention heads for topic nodes and 6 for sentence nodes, each with a hidden size of 128, to keep the dimension of node representations unchanged. We train our model for 500 epochs on 2 NVIDIA V100 cards with a batch size of 8. Except for the pre-trained BERT encoder, all parameters are randomly initialized and optimized using Adam (Kingma and Ba, 2014).
The balance hyperparameter γ (Eq. 11) is set to 0.85 to weight the losses of topic modeling and sentence selection. All hyperparameters are selected via grid search on the validation set with Rouge-2 as the metric.

Training Strategy
We adopt empirical training strategies similar to those of Cui et al. (2019) to make our model converge efficiently. Specifically, we pre-train the NTM for 200 epochs with a learning rate of 1e-3, since its convergence is much slower than that of general neural networks. In joint training, the NTM parameters, being relatively stable, are trained with a learning rate of 5e-4, while the learning rate of the other parameters is set to 1e-3.

Result and Analysis
This section reports our experimental results. We evaluate our model on two criteria: 1) Can it achieve state-of-the-art results? 2) What benefits do latent topics contribute to summarization? To this end, we first compare our model with state-of-the-art approaches on two widely used benchmark datasets, CNN/DM and NYT. Then, we evaluate our model on two scientific paper datasets to verify whether discovering latent topics can help summarize long documents. Lastly, we present ablation and case studies for further analysis.

Overall Performance

Table 2 presents the Rouge F1 results of different models on the CNN/DM and NYT datasets. The first section reports Lead-3 and Oracle; the second section reports approaches without pre-trained language models; the third section reports BERT-based models; and the last section reports our models. From the results, we make the following observations. (1) Without a pre-trained language model, the Bi-GRU version of our model outperforms all non-BERT baseline models and obtains competitive results compared with the basic BERT on both datasets. (2) Our model achieves state-of-the-art results on the NYT dataset, and its performance on CNN/DM is on par with DISCOBERT, a state-of-the-art BERT-based extractive summarization model. It is worth mentioning that DISCOBERT relies on external discourse analysis to model long-range dependencies, whereas our model achieves highly competitive results without external tools, which demonstrates its inherent advantage.

Table 2: Rouge F1 results on the test sets of the CNN/DM and NYT datasets. The results of comparison models are obtained from their respective papers; "-" indicates that the corresponding result is not reported.

Long Document Summarization
Long documents typically cover multiple topics (Xiao and Carenini, 2019), and we hypothesize that explicitly modeling these topics can improve the summarization performance. To verify this hypothesis, we conduct additional experiments on long-form documents. Table 3 presents the results of our model and state-of-the-art public summarization systems on the arXiv and PubMed datasets. The first section includes traditional approaches and Oracle; the second and third sections include abstractive and extractive models, respectively. From Table 3, our model substantially outperforms baseline models by a large margin even without pre-trained BERT, and the gaps further increase when BERT is added. We note that the discourse-aware model (Cohan et al., 2018) slightly outperforms ours on R-L of the PubMed dataset; a possible reason is that it explicitly leverages the section information (e.g., introduction and conclusion) of papers, which may provide strong clues for selecting summary sentences. Our model achieves state-of-the-art performance on the scientific paper datasets without additional features, indicating that discovering latent topics can indeed help summarize long documents, consistent with the aforementioned analysis.

Table 3: Results on the arXiv and PubMed datasets. Results marked with + are taken from Xiao and Carenini (2019); the remaining baseline results are taken from Cohan et al. (2018).

Ablation Study
To analyze the relative contributions of different modules, we compare our full model with three ablated variants: 1) w/o NTM, which removes the NTM module and builds a document graph with fully connected sentence nodes, and can thus be regarded as performing self-attention on top of BERT; 2) w/o GAT, which removes the graph attention layer, directly concatenates each sentence representation with the overall topic vector (Eq. 3), and sends them to the sentence classifier; and 3) LDA Version, which replaces the NTM with standard LDA and randomly initializes each topic representation. Figure 2 shows the results of the variants on the four datasets, from which we make the following observations. 1) Our full model outperforms all variants on all four datasets, which shows that each module is necessary and that combining them yields the best performance. 2) When the NTM module is removed or replaced with LDA, the performance on the arXiv and PubMed datasets declines dramatically, whereas on CNN/DM and NYT the results remain competitive with our full model. A possible reason is that news documents are relatively short, which leads to data sparsity and thus reduces the effect of topic models. 3) Similarly, when GAT is removed, the performance on the scientific paper datasets decreases more significantly than on the news datasets, indicating that inter-sentence relationships are especially important for summarizing long documents. 4) The LDA topic model can also boost performance, but its gain is much smaller than that of the NTM for long documents; a possible reason is that LDA and the neural networks are inevitably disconnected, whereas the NTM can be jointly optimized with the document encoder and graph networks, allowing the modules to mutually improve each other (Wang et al., 2019).

Analysis of Latent Topics
In this subsection, we conduct experiments to better understand how latent topics help summarize documents. To this end, we define the topical weight of a sentence as the weighted sum of the attention scores between each topic and the sentence, i.e., tw_i = Σ_j θ_j · α_{j,i}, where tw_i is the topical weight of the i-th sentence, θ is the document topic distribution learned by the NTM described in Section 3.2, θ_j is the weight of the j-th topic in the document, and α_{j,i} (Eq. 5) is the attention score from the j-th topic node to the i-th sentence node.

Figure 3: Visualized results of sentence topical weights. The degree of highlighting represents the overall relevance of a sentence to all topics. Underlined sentences are the model-selected summary. The left document is from the PubMed dataset, and the right is from CNN/DM.

Figure 3 shows two examples of visualized sentence topical weights. The ground-truth summary sentences have relatively high topical weights, and the final selected sentences highly overlap with these topical sentences. This observation gives an intuitive understanding of how our model works. First, our model learns sentence representations and discovers latent topics individually. Second, the graph attention layer builds semantic relationships between sentences and topics and roughly selects important contents on the basis of topical information. Finally, our model accurately selects summary sentences by integrating all features, such as topical relevance to the document, context information, and inter-sentence relationships. This process may explain why our model is effective for long documents: latent topics help it preselect salient texts, so further selection can focus on these fragments rather than the entire document.
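This weighting can be sketched directly; theta and alpha below are toy stand-ins for the learned topic distribution and attention scores:

```python
import numpy as np

def topical_weights(theta, alpha):
    """tw_i = sum_j theta_j * alpha[j, i]: each topic-to-sentence attention
    score is weighted by that topic's share of the document distribution."""
    return theta @ alpha                   # theta: (K,), alpha: (K, n) -> (n,)

# toy example: 3 topics over 4 sentences (each alpha row sums to 1)
theta = np.array([0.5, 0.3, 0.2])
alpha = np.array([[0.40, 0.30, 0.20, 0.10],
                  [0.10, 0.40, 0.40, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
tw = topical_weights(theta, alpha)
assert tw.shape == (4,)
```

Sentences with large tw values are those the dominant topics attend to most, which is what the highlighting in Figure 3 visualizes.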

Conclusion and Future Work
In this paper, we systematically explore the effects of latent topics on document summarization and propose a novel graph-based extractive summarization model, which jointly learns latent topics and leverages them to enrich sentence representations via a heterogeneous graph neural network. The experimental results on four well-studied datasets demonstrate that our model not only achieves results on par with state-of-the-art summarization models on news article datasets but also significantly outperforms existing approaches on scientific paper datasets, indicating its strong robustness across document genres and lengths. In future work, we will explore incorporating more types of semantic units (e.g., keywords and entities) into the document graph to further enhance summarization performance.