Heterogeneous Graph Neural Networks for Extractive Document Summarization

As a crucial step in extractive document summarization, learning cross-sentence relations has been explored by a plethora of approaches. An intuitive way is to model them with a graph-based neural network, whose richer structure can capture inter-sentence relationships. In this paper, we present a heterogeneous graph-based neural network for extractive summarization (HETERSUMGRAPH), which contains semantic nodes of different granularity levels apart from sentences. These additional nodes act as intermediaries between sentences and enrich the cross-sentence relations. Moreover, our graph structure extends naturally from the single-document setting to multi-document via the introduction of document nodes. To our knowledge, we are the first to introduce different types of nodes into graph-based neural networks for extractive document summarization and to perform a comprehensive qualitative analysis of their benefits. The code will be released on Github.


Introduction
Extractive document summarization aims to extract relevant sentences from the original documents and reorganize them as the summary. Recent years have seen a resounding success in the use of deep neural networks on this task (Cheng and Lapata, 2016; Narayan et al., 2018; Arumae and Liu, 2018; Zhong et al., 2019a; Liu and Lapata, 2019b). These existing models mainly follow the encoder-decoder framework, in which each sentence is encoded by neural components of various forms.
To effectively extract the summary-worthy sentences from a document, a core step is to model the cross-sentence relations. Most current models capture cross-sentence relations with recurrent neural networks (RNNs) (Cheng and Lapata, 2016; Nallapati et al., 2017). However, RNN-based models usually struggle to capture sentence-level long-distance dependencies, especially for long documents or multi-document input. A more intuitive way is to model the relations between sentences with a graph structure. Nevertheless, it is challenging to find an effective graph structure for summarization, and efforts have been made in various ways. Early traditional work uses inter-sentence cosine similarity to build a connectivity graph, as in LexRank (Erkan and Radev, 2004) and TextRank (Mihalcea and Tarau, 2004). Recently, some works account for inter-sentential discourse relationships when building summarization graphs, such as the Approximate Discourse Graph (ADG) with sentence personalization features (Yasunaga et al., 2017) and the Rhetorical Structure Theory (RST) graph. However, these usually rely on external tools and must contend with the error propagation problem. A more straightforward way is to create a sentence-level fully-connected graph. To some extent, the Transformer encoder (Vaswani et al., 2017) used in recent work (Zhong et al., 2019a; Liu and Lapata, 2019b) can be classified into this type, as it learns the pairwise interaction between sentences. Despite their success, how to construct an effective graph structure for summarization remains an open question.
In this paper, we propose a heterogeneous graph network for extractive summarization. Instead of building graphs solely on sentence-level nodes, we introduce additional semantic units as nodes in the graph to enrich the relationships between sentences. These additional nodes act as intermediaries connecting sentences; namely, each additional node can be viewed as a special relationship between the sentences containing it. During message passing over the heterogeneous graph, these additional nodes are iteratively updated along with the sentence nodes.
Although more advanced features could be used (e.g., entities or topics), for simplicity we use words as the semantic units in this paper. Each sentence is connected to the words it contains, and there are no direct edges between sentence pairs or word pairs. The constructed heterogeneous word-sentence graph has the following advantages: (a) Different sentences can interact with each other in consideration of the explicit overlapping word information. (b) The word nodes can also aggregate information from sentences and get updated; unlike ours, existing models usually keep the words unchanged as an embedding layer. (c) Different granularities of information can be fully used through multiple message passing processes. (d) Our heterogeneous graph network is extendable to more types of nodes; for example, we can introduce document nodes for multi-document summarization.
We highlight our contributions as follows: (1) To our knowledge, we are the first to construct a heterogeneous graph network for extractive document summarization that models the relations between sentences and contains not only sentence nodes but also other semantic units. Although we only use word nodes in this paper, richer semantic units (e.g., entities) can be incorporated.
(2) Our proposed framework is flexible and can be easily adapted from single-document to multi-document summarization tasks.
(3) Our model outperforms all existing competitors on three benchmark datasets without pre-trained language models. Ablation studies and qualitative analysis show the effectiveness of our models.

Related Work
Extractive Document Summarization With the development of neural networks, great progress has been made in extractive document summarization. Most approaches focus on the encoder-decoder framework and use recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2017) or Transformer encoders (Zhong et al., 2019b; Wang et al., 2019a) for sentential encoding. Recently, pre-trained language models have also been applied in summarization for contextual word representations (Zhong et al., 2019a; Liu and Lapata, 2019b; Zhong et al., 2020).
Another intuitive structure for extractive summarization is the graph, which can better utilize the statistical or linguistic information between sentences. Early works focus on document graphs constructed from the content similarity among sentences, like LexRank (Erkan and Radev, 2004) and TextRank (Mihalcea and Tarau, 2004). Some recent works aim to incorporate a relational prior into the encoder via graph neural networks (GNNs) (Yasunaga et al., 2017). Methodologically, these works use only one type of node, formulating each document as a homogeneous graph.

Heterogeneous Graph for NLP Graph neural networks and their associated learning methods (i.e. message passing (Gilmer et al., 2017) and self-attention (Velickovic et al., 2017)) were originally designed for homogeneous graphs, where the whole graph shares a single node type. However, graphs in real-world applications usually come with multiple types of nodes (Shi et al., 2016), i.e., they are heterogeneous. To model such structures, recent works have made preliminary explorations. Tu et al. (2019) introduced a heterogeneous graph neural network to encode documents, entities and candidates together for multi-hop reading comprehension. Linmei et al. (2019) focused on semi-supervised short text classification and constructed a topic-entity heterogeneous neural graph.
For summarization, Wei (2012) proposes a heterogeneous graph consisting of topic, word and sentence nodes and uses a Markov chain model for the iterative update. Another line of work modifies TextRank for a graph of keywords and sentences, putting forward HeteroRank. Inspired by the success of heterogeneous graph-based neural networks on other NLP tasks, we introduce them to extractive text summarization to learn better node representations.

Figure 1: Model Overview. The framework consists of three major modules: graph initializers, the heterogeneous graph layer and the sentence selector. Green circles and blue boxes represent word and sentence nodes respectively. Orange solid lines denote the edge feature (TF-IDF) between word and sentence nodes, and their thickness indicates the weight. The representations of sentence nodes are finally used for summary selection.

Methodology

Following recent work (Liu and Lapata, 2019b), we formulate extractive summarization as a sequence labeling task. Our goal is to predict a sequence of labels y_1, ..., y_n (y_i ∈ {0, 1}) for the sentences, where y_i = 1 means the i-th sentence should be included in the summary. The ground-truth labels, which we call ORACLE, are extracted using the greedy approach introduced by Nallapati et al. (2016) with the automatic metric ROUGE (Lin and Hovy, 2003). Generally speaking, our heterogeneous summarization graph consists of two types of nodes: basic semantic nodes (e.g. words, concepts) as relay nodes and other units of discourse (e.g. phrases, sentences, documents) as supernodes. Each supernode connects with the basic nodes it contains and takes the importance of the relation as their edge feature. Thus, high-level discourse nodes can establish relationships with each other via the basic nodes.
In this paper, we use words as the basic semantic nodes for simplicity. HETERSUMGRAPH in Section 3.1 is a special case that contains only one type of supernode (sentences) for classification, while HETERDOCSUMGRAPH in Section 3.5 uses two (documents and sentences). Based on our framework, other types of supernodes (such as paragraphs) can also be introduced; the only difference lies in the graph structure.
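As a concrete illustration of how such ORACLE labels can be derived, the sketch below greedily adds the sentence that most improves a simplified ROUGE score. Note this is illustrative only: `rouge1_f` is a unigram-F1 stand-in for the full ROUGE metrics, and all names are our own, not the authors' code.

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    # Simplified unigram-overlap F1, a stand-in for full ROUGE.
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, reference, max_sents=3):
    """Greedily add the sentence that most improves the score against
    the reference summary; stop when no sentence helps."""
    ref_toks = reference.split()
    selected, best_score = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            toks = [t for j in selected + [i] for t in sentences[j].split()]
            gains.append((rouge1_f(toks, ref_toks), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best_score:   # no improvement: stop early
            break
        best_score = score
        selected.append(idx)
    # Binary labels y_i over the document's sentences.
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

In practice the stopping criterion and the exact ROUGE combination vary between implementations; the greedy structure is the common core.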

Document as a Heterogeneous Graph
Given a graph G = {V, E}, where V stands for the node set and E represents the edges between nodes, our undirected heterogeneous graph can be formally defined as V = V_w ∪ V_s and E = {e_11, ..., e_mn}. Here, V_w = {w_1, ..., w_m} denotes the m unique words of the document and V_s = {s_1, ..., s_n} corresponds to the n sentences in the document. E is a real-valued edge weight matrix, and e_ij ≠ 0 (i ∈ {1, ..., m}, j ∈ {1, ..., n}) indicates that the j-th sentence contains the i-th word. Figure 1 presents the overview of our model, which mainly consists of three parts: graph initializers for nodes and edges, the heterogeneous graph layer and the sentence selector. The initializers first create the nodes and edges of the document graph and encode them. Then the heterogeneous graph layer updates the node representations by iteratively passing messages between word and sentence nodes via a Graph Attention Network (GAT) (Velickovic et al., 2017). Finally, the representations of sentence nodes are extracted to predict the summary labels.
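To make the graph definition concrete, here is a minimal sketch (illustrative, not the released implementation) that builds V_w, V_s and the raw edge count matrix for a toy document:

```python
def build_word_sentence_graph(sentences, stopwords=frozenset()):
    """Build the bipartite word-sentence graph: V_w are the unique
    (non-stopword) words, V_s the sentences, and edges[i][j] counts how
    often word i occurs in sentence j (0 means no edge)."""
    tokenized = [s.lower().split() for s in sentences]
    words = sorted({w for toks in tokenized for w in toks} - stopwords)
    word_id = {w: i for i, w in enumerate(words)}
    m, n = len(words), len(sentences)
    edges = [[0] * n for _ in range(m)]
    for j, toks in enumerate(tokenized):
        for w in toks:
            if w in word_id:
                edges[word_id[w]][j] += 1
    return words, edges
```

The raw counts here would then be converted to TF-IDF edge weights during initialization, as described below.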

Graph Initializers
Let X_w ∈ R^{m×d_w} and X_s ∈ R^{n×d_s} represent the input feature matrices of word and sentence nodes respectively, where d_w is the dimension of the word embedding and d_s is the dimension of each sentence representation vector. Specifically, we first use Convolutional Neural Networks (CNN) (LeCun et al., 1998) with different kernel sizes to capture the local n-gram feature l_j of each sentence and then use a bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) layer to get the sentence-level feature g_j. The concatenation of the CNN local feature and the BiLSTM global feature is used as the sentence node feature X_{s_j} = [l_j ; g_j].
To further encode the importance of the relationship between a word and a sentence node, we infuse TF-IDF values into the edge weights. The term frequency (TF) is the number of times w_i occurs in s_j, and the inverse document frequency (IDF) is an inverse function of the out-degree of w_i, i.e., of the number of sentences containing it.
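The TF-IDF edge weighting described above can be sketched as follows; the exact IDF variant (log of the sentence count over the word node's out-degree) is our assumption:

```python
import math

def tfidf_edge_weights(edges):
    """Turn raw word-in-sentence counts into TF-IDF edge weights.
    TF is the count of word i in sentence j; IDF is log(N / df_i),
    where df_i is the word node's out-degree, i.e. the number of
    sentences it appears in, and N the number of sentences."""
    n = len(edges[0])
    weights = []
    for row in edges:
        df = sum(1 for c in row if c > 0)
        idf = math.log(n / df) if df else 0.0
        weights.append([c * idf for c in row])
    return weights
```

Words that occur in every sentence get zero weight under this variant, which matches the intuition that ubiquitous words carry little discriminative information.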

Heterogeneous Graph Layer
Given a constructed graph G with node features X w ∪ X s and edge features E, we use graph attention networks (Velickovic et al., 2017) to update the representations of our semantic nodes.
We refer to h_i ∈ R^{d_h}, i ∈ {1, ..., m+n}, as the hidden states of the input nodes. The graph attention (GAT) layer is designed as follows:

z_ij = LeakyReLU(W_a [W_q h_i ; W_k h_j]),  (1)
α_ij = exp(z_ij) / Σ_{l∈N_i} exp(z_il),  (2)
u_i = σ(Σ_{j∈N_i} α_ij W_v h_j),  (3)

where W_a, W_q, W_k, W_v are trainable weights and α_ij is the attention weight between h_i and h_j. The multi-head attention can be denoted as:

u_i = ||_{k=1}^{K} σ(Σ_{j∈N_i} α_ij^k W_v^k h_j),  (4)

where || denotes concatenation over the K heads. Besides, we also add a residual connection to avoid gradient vanishing after several iterations. Therefore, the final output can be represented as:

h_i' = u_i + h_i.  (5)

We further modify the GAT layer to infuse the scalar edge weights e_ij, which are mapped to a multi-dimensional embedding space e_ij ∈ R^{d_e}. Thus, Equation 1 is modified as follows:

z_ij = LeakyReLU(W_a [W_q h_i ; W_k h_j ; e_ij]).  (6)

After each graph attention layer, we introduce a position-wise feed-forward (FFN) layer consisting of two linear transformations, just as in the Transformer (Vaswani et al., 2017).
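A single-head version of the edge-aware attention in Equations 1-3 and 6 can be sketched in numpy as follows. This is an illustrative reimplementation, not the authors' code: the edge embedding tensor `e` doubles as the adjacency mask, and `wa` plays the role of W_a collapsed to a vector.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_edge_layer(h_q, h_kv, e, Wq, Wk, Wv, wa):
    """Single-head GAT with edge features.
    h_q: (Nq, d) query nodes; h_kv: (Nk, d) key/value nodes;
    e: (Nq, Nk, d_e) edge embeddings (all-zero means no edge)."""
    q = h_q @ Wq                      # projected queries
    k = h_kv @ Wk                     # projected keys
    v = h_kv @ Wv                     # projected values
    # z_ij = LeakyReLU(wa^T [q_i ; k_j ; e_ij])   (Eq. 6)
    feat = np.concatenate(
        [np.repeat(q[:, None, :], k.shape[0], axis=1),
         np.repeat(k[None, :, :], q.shape[0], axis=0),
         e], axis=-1)
    z = leaky_relu(feat @ wa)         # (Nq, Nk) attention logits
    mask = np.abs(e).sum(-1) > 0
    z = np.where(mask, z, -1e30)      # attend only along existing edges
    a = np.exp(z - z.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)   # α_ij  (Eq. 2)
    return a @ v                      # u_i   (Eq. 3, identity σ)
```

A full model would add multiple heads, the residual connection of Equation 5 and the FFN sub-layer on top of this.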
Iterative updating To pass messages between word and sentence nodes, we define the information propagation as in Figure 2. Specifically, after the initialization, we update sentence nodes with their neighboring word nodes via the above GAT and FFN layers:

U^1_{s←w} = GAT(H^0_s, H^0_w, H^0_w),  (7)
H^1_s = FFN(U^1_{s←w} + H^0_s),  (8)

where H^0_s is used as the attention query and H^0_w as the key and value.

Figure 2: Orange edges indicate the current information flow direction. First, for sentence s_1, words w_1 and w_3 are used to aggregate word-level information in (a). Next, w_1 is updated by the new representations of s_1 and s_2 in (b), the sentences in which it occurs. See Section 3.3 for details on the notation.
After that, we obtain new representations for the word nodes using the updated sentence nodes, and further update the sentence nodes iteratively. Each iteration contains a sentence-to-word and a word-to-sentence update process. For the t-th iteration, the process can be represented as:

U^{t+1}_{w←s} = GAT(H^t_w, H^t_s, H^t_s),  (9)
H^{t+1}_w = FFN(U^{t+1}_{w←s} + H^t_w),  (10)
U^{t+1}_{s←w} = GAT(H^t_s, H^{t+1}_w, H^{t+1}_w),  (11)
H^{t+1}_s = FFN(U^{t+1}_{s←w} + H^t_s).  (12)

As Figure 2 shows, word nodes can aggregate document-level information from sentences. For example, a high degree of a word node indicates that the word occurs in many sentences and is likely to be a keyword of the document. Regarding sentence nodes, those with more important words tend to be selected for the summary.
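The full update schedule then reduces to a short loop; in this sketch `gat(query, keyvalue, edges)` and `ffn(x)` are stand-ins for the trained layers, and `e` is the (n_words, n_sents) edge weight matrix:

```python
import numpy as np

def iterate_graph(h_w, h_s, e, gat, ffn, t_iters=1):
    """Message-passing schedule of the heterogeneous graph (a sketch):
    sentences first attend to their words, then each iteration does a
    sentence-to-word and a word-to-sentence update, with residuals."""
    h_s = ffn(gat(h_s, h_w, e.T) + h_s)       # initial sentence update
    for _ in range(t_iters):
        h_w = ffn(gat(h_w, h_s, e) + h_w)     # word <- sentence
        h_s = ffn(gat(h_s, h_w, e.T) + h_s)   # sentence <- word
    return h_w, h_s
```

The loop body mirrors the alternating update order described above; plugging in a real attention layer and FFN recovers the full model.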

Sentence Selector
Finally, we need to extract sentence nodes included in the summary from the heterogeneous graph. Therefore, we do node classification for sentences and cross-entropy loss is used as the training objective for the whole system.
Trigram blocking Following Paulus et al. (2017) and Liu and Lapata (2019b), we use Trigram Blocking for decoding, a simple but powerful version of Maximal Marginal Relevance (Carbonell and Goldstein, 1998). Specifically, we rank sentences by their scores and discard those that have trigram overlaps with their predecessors.

Figure 3: HETERDOCSUMGRAPH for multi-document summarization (extending Figure 1). Green, blue and orange boxes represent word, sentence and document nodes respectively. d_1 consists of s_11 and s_12 while d_2 contains s_21 and s_22. With word nodes as relays, document-document, sentence-sentence and sentence-document relations can all be built through common word nodes. For example, sentences s_11, s_12 and s_21 share the same word w_1, which connects them across documents.
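The trigram blocking procedure described above can be implemented in a few lines; this sketch is one plausible reading of the decoding step, with hypothetical names:

```python
def trigram_blocking(sentences, scores, k=3):
    """Rank sentences by score; skip any sentence whose trigrams
    overlap an already-selected one, up to k selections."""
    def trigrams(s):
        toks = s.lower().split()
        return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}
    chosen, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tg = trigrams(sentences[idx])
        if tg & seen:        # redundant with an earlier pick: block it
            continue
        chosen.append(idx)
        seen |= tg
        if len(chosen) == k:
            break
    return sorted(chosen)    # restore document order
```

Blocking on trigrams rather than unigrams keeps near-duplicate sentences out of the summary without penalizing ordinary word reuse.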

Multi-document Summarization
For multi-document summarization, the documentlevel relation is crucial for better understanding the core topic and most important content of this cluster. However, most existing neural models ignore this hierarchical structure and concatenate documents to a single flat sequence Fabbri et al., 2019). Others try to model this relation by attention-based full-connected graph or take advantage of similarity or discourse relations (Liu and Lapata, 2019a).
Our framework can establish document-level relationships in the same way as sentence-level ones by simply adding supernodes for documents (as in Figure 3), which means it can be easily adapted from single-document to multi-document summarization. The heterogeneous graph is then extended to three types of nodes: V = V_w ∪ V_s ∪ V_d, where V_d = {d_1, ..., d_l} and l is the number of source documents. We name this model HETERDOCSUMGRAPH.
As we can see in Figure 3, word nodes become the bridges between sentences and documents. Sentences containing the same words connect with each other regardless of their distance across documents, while documents establish relationships based on their similar contents.
Document nodes can be viewed as a special type of sentence nodes: a document node connects with contained word nodes and the TF-IDF value is used as the edge weight. Besides, document nodes also share the same update process as sentence nodes.
The differences lie in the initialization, where the document node takes the mean-pooling of its sentence node features as its initial state. During the sentence selection, the sentence nodes are concatenated with the corresponding document representations to obtain the final scores for multi-document summarization.
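The document-node initialization (mean-pooling over sentence features) and the concatenated scoring described above can be sketched as follows; `w` is a plain weight vector standing in for the trained scorer:

```python
import numpy as np

def init_document_nodes(h_s, doc_spans):
    """Initialize each document node as the mean of its sentence-node
    features; doc_spans maps each document to its sentence indices."""
    return np.stack([h_s[idxs].mean(axis=0) for idxs in doc_spans])

def sentence_scores(h_s, h_d, doc_of_sent, w):
    """Score each sentence from [sentence ; its document] features;
    doc_of_sent[i] is the index of the document containing sentence i."""
    feats = np.concatenate([h_s, h_d[doc_of_sent]], axis=1)
    return feats @ w
```

In the full model the score would go through a learned classifier rather than a single dot product; the concatenation of sentence and document representations is the point being illustrated.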

Experiment
We evaluate our models on both single- and multi-document summarization tasks. Below, we start our experiments with a description of the datasets. NYT50 NYT50 is also a single-document summarization dataset, collected from the New York Times Annotated Corpus (Sandhaus, 2008) and preprocessed by Durrett et al. (2016). It contains 110,540 articles with summaries and is split into 100,834 for training and 9,706 for test.

Settings and Hyper-parameters
For both single-document and multi-document summarization, we limit the vocabulary to 50,000 tokens and initialize them with 300-dimensional GloVe embeddings (Pennington et al., 2014). We filter stop words and punctuation when creating word nodes and truncate the input document to a maximum of 50 sentences. To get rid of noisy common words, we further remove the 10% of the vocabulary with the lowest TF-IDF values over the whole dataset. We initialize sentence nodes with d_s = 128 and the edge features e_ij in GAT_e with d_e = 50. Each GAT layer has 8 heads and a hidden size of d_h = 64, while the inner hidden size of the FFN layers is 512.
During training, we use a batch size of 32 and apply the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-4. Early stopping is performed when the validation loss does not descend for three consecutive epochs. We select the number of iterations t = 1 based on performance on the validation set. For decoding, we select the top 3 sentences for the CNN/DailyMail and NYT50 datasets and the top 9 for Multi-News, according to the average length of their human-written summaries.

Models for Comparison
Ext-BiLSTM An extractive summarizer with a BiLSTM encoder learns cross-sentence relations by regarding a document as a sequence of sentences. For simplicity, we directly take the initialization of sentence nodes for classification, which includes a CNN encoder at the word level and a 2-layer BiLSTM at the sentence level. This model can also be viewed as an ablation of our HETERSUMGRAPH without the graph updating of sentence nodes.

Ext-Transformer Extractive summarizers with a Transformer encoder learn the pairwise interaction between sentences (Vaswani et al., 2017) in a purely data-driven way with a fully connected prior. Following Liu and Lapata (2019b), we implement a Transformer-based extractor as a baseline, which contains the same word-level encoder followed by 12 Transformer encoder layers for sentences. Ext-Transformer can be regarded as a sentence-level fully connected graph.

Single-document Summarization
We evaluate our single-document model on CNN/DailyMail and NYT50 and report the unigram, bigram and longest-common-subsequence overlap with reference summaries as R-1, R-2 and R-L. Due to limited computational resources, we do not apply pre-trained contextualized encoders (e.g., BERT (Devlin et al., 2018)) to our models, which we leave for future work. Therefore, we only compare with models without BERT for the sake of fairness. Table 1 shows the results on CNN/DailyMail. The first part contains the LEAD-3 baseline and the ORACLE upper bound, while the second part includes other summarization models. We present our models (described in Section 4.3) in the third part. Compared with Ext-BiLSTM, our heterogeneous graphs achieve improvements of more than 0.6/0.51/0.7 on R-1, R-2 and R-L, which indicates that the cross-sentence relationships learned by our sentence-word-sentence structure are more powerful than the sequential structure. Besides, our models also outperform Ext-Transformer, which is based on fully connected relationships. This demonstrates that our graph structure effectively prunes unnecessary connections between sentences and thus improves sentence node classification.

Results on CNN/DailyMail
Compared with the second block of Table 1, we observe that HETERSUMGRAPH outperforms all previous non-BERT-based summarization systems, and trigram blocking leads to a great improvement on all ROUGE metrics. Among them, HER (Luo et al., 2019) is a comparable competitor to our HETERSUMGRAPH; it formulated extractive summarization as a contextual-bandit problem and solved it with reinforcement learning. Since reinforcement learning and our trigram blocking play similar roles in reorganizing sentences into a summary (Zhong et al., 2019a), we additionally compare HER without policy gradient against HETERSUMGRAPH. Our HETERSUMGRAPH achieves a 0.61 improvement in R-1 over HER without policy for sentence scoring, and HETERSUMGRAPH with trigram blocking outperforms HER by 0.65 for the reorganized summaries.

Table 2 (excerpt):
Model                                   R-1    R-2    R-L
First sentence (Durrett et al., 2016)   28.60  17.30  -
First k words (Durrett et al., 2016)

Results on NYT50

Results on NYT50 are summarized in Table 2. Note that we use limited-length ROUGE recall as in Durrett et al. (2016): the selected sentences are truncated to the length of the human-written summaries, and recall scores are reported instead of F1. The first two lines are baselines given by Durrett et al. (2016), and the next two lines are our baselines for extractive summarization. The second and third parts report the performance of other non-BERT-based work and of our models respectively. Again, we observe that our cross-sentence relationship modeling performs better than BiLSTM and Transformer. Our models also hold strong advantages over other non-BERT-based approaches on NYT50. Meanwhile, we find that trigram blocking does not work as well as on CNN/DailyMail, which we attribute to the special formation of summaries in the CNN/DailyMail dataset.

Ablation on CNN/DailyMail

To better understand the contribution of different modules to the performance, we conduct an ablation study of our proposed HETERSUMGRAPH model on the CNN/DailyMail dataset. First, we remove the filtering mechanism for low-TF-IDF words and the edge weights respectively. We also remove the residual connections between GAT layers; as compensation, we concatenate the initial sentence features with the messages updated from the neighboring word nodes instead of summing them in Equation 8. Furthermore, we set the iteration number t = 0, which removes the word update and uses the sentence representation H^1_s for classification. Finally, we remove the BiLSTM layer in the initialization of sentence nodes.
As Table 3 shows, the removal of low-TF-IDF words leads to increases on R-1 and R-L but a drop on R-2. We suspect that filtering noisy words enables the model to better focus on useful word nodes, at the cost of losing some bigram information. The residual connection plays an important role in combining the original representation with the updated message from the other type of nodes, and cannot be replaced by concatenation. Besides, the introduction of edge features, the word update and the BiLSTM initialization of sentences also show their effectiveness.

Multi-document Summarization
We first take the concatenation of the First-k sentences from each source document as the baseline and use the code and model outputs released by Fabbri et al. (2019) for the other models.
To explore the adaptability of our model to multi-document summarization, we concatenate the multi-source documents into a single mega-document and apply HETERSUMGRAPH as a baseline. For comparison, we extend HETERSUMGRAPH to the multi-document setting as HETERDOCSUMGRAPH, described in Section 3.5. Our results are presented in Table 4. Specifically, we observe that both HETERSUMGRAPH and HETERDOCSUMGRAPH outperform previous methods, with HETERDOCSUMGRAPH achieving the larger improvement. This demonstrates that the introduction of document nodes can better model document-document relationships and is beneficial for multi-document summarization. As mentioned above, trigram blocking does not work for the Multi-News dataset, since its summaries are written as a whole instead of as concatenations of summary bullets for each source document.

Qualitative Analysis
We further design several experiments to probe how our HETERSUMGRAPH and HETERDOCSUMGRAPH help single- and multi-document summarization.

Figure 4: Relationship between the average degree of the word nodes of a document (x-axis) and R̄, the mean of R-1, R-2 and R-L (lines, left y-axis), and ΔR̄, the difference in R̄ between HETERSUMGRAPH and Ext-BiLSTM (histograms, right y-axis).
Degree of word nodes In HETERSUMGRAPH, the degree of a word node indicates its occurrence across sentences and thus can measure the redundancy of the document to some extent. Meanwhile, words with a high degree can aggregate information from multiple sentences, which means that they can benefit more from the iteration process. Therefore, it is important to explore the influence of the node degree of words on the summarization performance.
We first calculate the average degree of the word nodes for each example based on the constructed graph. The test set of CNN/DailyMail is then divided into 5 intervals accordingly (x-axis in Figure 4). We evaluate the performance of HETERSUMGRAPH and Ext-BiLSTM on each interval and draw the mean score of R-1, R-2 and R-L as lines (left y-axis, R̄). ROUGE increases with the average degree of the word nodes in the document, which means that articles with high redundancy are easier for neural models to summarize.
To make ΔR̄ between the models more visible, we draw it with histograms (right y-axis). From Figure 4, we can observe that HETERSUMGRAPH performs much better on documents with a higher average word-node degree. This indicates that the benefit brought by word nodes lies in aggregating information from sentences and propagating their global representations.

Number of source documents
We also investigate how the number of source documents influences the performance of our model. To this end, we divide the test set of Multi-News into different parts by the number of source documents and discard parts with less than 100 examples. Then, we take First-3 as the baseline, which concatenates the top-3 sentences of each source document as the summary.
In Figure 5, we can observe that the lead baseline rises while the performance of both our models degrades, until they converge to the baseline. This is because it becomes more challenging for models to extract a limited number of sentences that cover the main ideas of all source documents as the number of documents increases, whereas the First-3 baseline is forced to take sentences from each document, which ensures coverage. Besides, increasing the number of documents widens the performance gap between HETERSUMGRAPH and HETERDOCSUMGRAPH. This indicates that the benefit of document nodes becomes more significant for more complex document-document relationships.

Conclusion
In this paper, we propose a heterogeneous graph-based neural network for extractive summarization. The introduction of more fine-grained semantic units into the summarization graph helps our model build more complex relationships between sentences. It is also convenient to adapt our single-document graph to the multi-document setting via document nodes. Furthermore, our models achieve the best results on CNN/DailyMail among non-BERT-based models, and we will take pre-trained language models into account for better node representations in the future.