Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization

In this paper, we propose a multi-granularity interaction network for extractive and abstractive multi-document summarization, which jointly learns semantic representations for words, sentences, and documents. The word representations are used to generate an abstractive summary, while the sentence representations are used to produce an extractive summary. We employ attention mechanisms to interact between the different granularities of semantic representations, which helps to capture multi-granularity key information and improves the performance of both abstractive and extractive summarization. Experimental results show that our proposed model substantially outperforms all strong baseline methods and achieves the best results on the Multi-News dataset.


Introduction
Document summarization aims at producing a fluent, condensed summary for given documents. Single-document summarization has shown promising results with sequence-to-sequence models that encode a source document and then decode it into a summary (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Çelikyilmaz et al., 2018). Multi-document summarization requires producing a summary from a cluster of thematically related documents, where the given documents complement and overlap each other. It involves identifying important information and filtering out redundant information from multiple input sources.
There are two primary methodologies for multi-document summarization: extractive and abstractive. Extractive methods directly select important sentences from the original documents and are relatively simple. Cao et al. (2015) rank sentences with a recursive neural network. Yasunaga et al. (2017) employ a Graph Convolutional Network (GCN) to incorporate sentence relation graphs and improve the performance of extractive summarization. Abstractive methods can generate new words and new sentences, but they are technically more difficult than extractive methods. Some works on multi-document summarization simply concatenate multiple source documents into a long flat sequence and model multi-document summarization as a long sequence-to-sequence task (Liu et al., 2018; Fabbri et al., 2019). However, these approaches do not take the hierarchical structure of document clusters into account, and overly long inputs often lead to degradation in document summarization (Cohan et al., 2018; Liu and Lapata, 2019). Recently, hierarchical frameworks have shown their effectiveness on multi-document summarization (Zhang et al., 2018; Liu and Lapata, 2019). These approaches usually use multiple encoders to model hierarchical relationships in the discourse structure, but other methods of incorporating structural semantic knowledge have not been explored. The combination of extractive and abstractive methods has been explored in single-document summarization. Chen and Bansal (2018) use the extracted sentences as the input of the abstractive summarization. Subramanian et al. (2019) concatenate the extracted summary to the original document as the input of the abstractive summarization.
In this work, we treat documents, sentences, and words as semantic units of different granularities and connect these semantic units within a three-granularity hierarchical relation graph. With the multi-granularity hierarchical structure, we can unify extractive and abstractive summarization into one architecture. Extractive summarization operates on the sentence granularity and directly supervises the sentence representations, while abstractive summarization operates on the word granularity and directly supervises the word representations. We propose a novel multi-granularity interaction network that enables both supervision signals to promote the learning of representations at all granularities. We employ attention mechanisms to encode the relationships within the same semantic granularity and the hierarchical relationships between different semantic granularities, respectively, and we use a fusion gate to integrate the various relationships when updating the semantic representations. The decoding part consists of a sentence extractor and a summary generator. The sentence extractor utilizes the sentence representations to select sentences, while the summary generator utilizes the word representations to generate a summary. The two tasks are trained in a unified architecture to simultaneously promote the recognition of important information.
We evaluate our model on the recently released Multi-News dataset, and our proposed architecture brings substantial improvements over several strong baselines. We explore the influence of semantic units of different granularities, and the ablation study shows that jointly learning extractive and abstractive summarization in a unified architecture improves the performance of both.
In summary, we make the following contributions in this paper:
• We establish multi-granularity semantic representations for documents, sentences, and words, and propose a novel multi-granularity interaction network to encode multiple input documents.
• Our approach can unify the extractive and abstractive summarization into one architecture with interactive semantic units and promote the recognition of important information in different granularities.
• Experimental results on the Multi-News dataset show that our approach substantially outperforms several strong baselines and achieves state-of-the-art performance. Our code is publicly available at https://github.com/zhongxia96/MGSum.

Related Work
The methods for multi-document summarization can generally be categorized into extractive and abstractive ones. The extractive methods produce a summary by extracting and merging sentences from the input documents, while the abstractive methods generate a summary using arbitrary words and expressions based on an understanding of the documents. Due to the lack of available training data, most previous multi-document summarization methods were extractive (Erkan and Radev, 2004; Christensen et al., 2013; Yasunaga et al., 2017). Since neural abstractive models have achieved promising results on single-document summarization (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Çelikyilmaz et al., 2018), some works trained abstractive summarization models on a single-document dataset and adjusted the model to the multi-document summarization task. Zhang et al. (2018) added a document set encoder into a single-document summarization framework and tuned the pre-trained model on the multi-document summarization dataset. Lebanoff et al. (2018) combined an extractive summarization algorithm (MMR) for sentence extraction to reweight the original sentence importance distribution learned by a single-document abstractive summarization model. Recently, two large-scale multi-document summarization datasets have been proposed: one with very long inputs, aimed at generating Wikipedia (Liu et al., 2018), and another dedicated to generating comprehensive summaries of multiple real-time news articles (Fabbri et al., 2019). Liu et al. (2018) concatenated multiple source documents into a long flat text and introduced a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures. Liu and Lapata (2019) introduced intermediate document representations and simply added the document representations to word representations to model cross-document relationships. Compared with our proposed multi-granularity method, Liu and Lapata (2019) incline toward the traditional bottom-up hierarchical method and do not effectively utilize the hierarchical representations, while ignoring the hierarchical relationships of sentences. Fabbri et al. (2019) incorporated MMR into a hierarchical pointer-generator network to address information redundancy in multi-document summarization.

Our Approach
Our model consists of a multi-granularity encoder, a sentence extractor, and a summary generator. Firstly, the multi-granularity encoder reads multiple input documents and learns multi-granularity representations for words, sentences, and documents. Self-attention mechanisms are employed for capturing semantic relationships between representations of the same granularity, while cross-attention mechanisms are employed for information interaction between representations of different granularities. Fusion gates are used for integrating the information from the different attention mechanisms. Then the sentence extractor scores sentences according to the learned sentence representations. Meanwhile, the summary generator produces the abstractive summary by attending to the word representations. In the following sections, we describe the multi-granularity encoder, the sentence extractor, and the summary generator, respectively.

Multi-Granularity Encoder
Given a cluster of documents, we establish explicit representations for documents, sentences, and words, and connect them within a hierarchical semantic relation graph. The multi-granularity encoder is a stack of $L_1$ identical layers. Each layer has two sub-layers: the first is the multi-granularity attention layer, and the second is a set of fully connected feed-forward networks. The multi-granularity attention sub-layer transfers semantic information both across different granularities and within the same granularity, while the feed-forward networks further aggregate the multi-granularity information. We employ multi-head attention to encode multi-granularity information and use a fusion gate to propagate semantic information between granularities. Figure 1 shows the overview of the multi-granularity encoder layer, and Figure 2 illustrates how the semantic representations are updated, taking the sentence representation as an example.
Let $w_{i,j,k}$ be the $k$-th word of the sentence $s_{i,j}$ in the document $d_i$. At the bottom of the encoder stack, each input word $w_{i,j,k}$ is converted into the vector representation $e_{i,j,k}$ by learned embeddings. We assign positional encodings to indicate the position of the word $w_{i,j,k}$; three positions need to be considered, namely $i$ (the rank of the document), $j$ (the position of the sentence within the document), and $k$ (the position of the word within the sentence). We concatenate the three position embeddings $PE_i$, $PE_j$, and $PE_k$ to get the final position embedding $p_{i,j,k}$. The input word representation is obtained by simply adding the word embedding $e_{i,j,k}$ and the position embedding $p_{i,j,k}$:

$$h^0_{w_{i,j,k}} = e_{i,j,k} + p_{i,j,k}$$

where the definition of the positional encoding $PE$ is consistent with the Transformer (Vaswani et al., 2017). For convenience, we denote the output of the $l$-th multi-granularity encoder layer as $h^l$ and the input of the first layer as $h^0$. Symbols with subscripts $w_{i,j,k}$, $s_{i,j}$, and $d_i$ denote the word, sentence, and document granularities, respectively. Both the sentence representations $h^0_{s_{i,j}}$ and the document representations $h^0_{d_i}$ are initialized to zeros.

In each multi-granularity attention sub-layer, the word representation is updated with information from the word granularity and the sentence granularity. We perform multi-head self-attention across the word representations in the same sentence, $h^{l-1}_{w_{i,j,*}} = \{h^{l-1}_{w_{i,j,k}} \mid w_{i,j,k} \in s_{i,j}\}$, to get the context representation $\tilde{h}^l_{w_{i,j,k}}$. In order to propagate semantic information from the sentence granularity to the word granularity, we duplicate the sentence-aware representation $\overleftarrow{h}^l_{w_{i,j,k}}$ from the corresponding sentence $s_{i,j}$ and employ a fusion gate to integrate the two:

$$\tilde{h}^l_{w_{i,j,k}} = \mathrm{MHAtt}\big(h^{l-1}_{w_{i,j,k}},\, h^{l-1}_{w_{i,j,*}}\big)$$
$$f^l_{w_{i,j,k}} = \mathrm{Fusion}\big(\tilde{h}^l_{w_{i,j,k}},\, \overleftarrow{h}^l_{w_{i,j,k}}\big)$$

where MHAtt denotes the multi-head attention proposed in Vaswani et al. (2017) and Fusion denotes the fusion gate; $h^{l-1}_{w_{i,j,k}}$ is the query, and $h^{l-1}_{w_{i,j,*}}$ are the keys and values for the attention. The fusion gate works as

$$\mathrm{Fusion}(a, b) = z \odot a + (1 - z) \odot b, \qquad z = \sigma\big(W_z [a; b] + b_z\big)$$

where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

Figure 2: The multi-granularity encoder layer for updating the sentence representation. The sentence representation is updated by using two fusion gates to integrate the information from different granularities.

The sentence representation is updated from three sources: (1) we take the sentence representation $h^{l-1}_{s_{i,j}}$ as the query and the word representations $h^{l-1}_{w_{i,j,*}}$ as the keys and values to perform multi-head cross-attention, obtaining a word-aware context representation; (2) we perform multi-head self-attention across the sentence representations of the same document to capture the relationships between sentences; (3) in order to propagate document-granularity semantic information to the sentence, we duplicate the document-aware representation from the corresponding document $d_i$. The semantic representations from the three sources are fused by two fusion gates to get the updated sentence representation $f^l_{s_{i,j}}$.
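As a concrete reference, below is a minimal PyTorch sketch of the fusion gate described above. The class name `FusionGate` and the packing of representations into plain tensors are illustrative choices; the paper defines the gate only at the level of the Fusion equation.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Gated fusion of two representations of the same shape (sketch).

    Assumed parameterization, matching the Fusion equation above:
    z = sigmoid(W_z [a; b] + b_z), output = z * a + (1 - z) * b.
    """
    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # z in (0, 1) decides, per dimension, how much of each source to keep.
        z = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return z * a + (1.0 - z) * b
```

In the word update, `a` would be the self-attention context $\tilde{h}^l_{w_{i,j,k}}$ and `b` the duplicated sentence representation; the sentence update applies two such gates to combine its three sources.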
The feed-forward network FFN is used to further transform the multi-granularity semantic information. To construct a deep network, we use residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) to connect adjacent layers:

$$h^l = \mathrm{LayerNorm}\big(f^l + \mathrm{FFN}(f^l)\big)$$

where $l \in [1, L_1]$ and FFN consists of two linear transformations with a ReLU activation in between. Note that we use different FFN and LayerNorm parameters for the different granularities. The final sentence representations $h^{L_1}_s$ are fed to the sentence extractor, while the final word representations $h^{L_1}_w$ are fed to the summary generator. For convenience, we denote $h^{L_1}_s$ as $o_s$ and $h^{L_1}_w$ as $o_w$.
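A minimal sketch of this feed-forward sub-layer follows; placing the LayerNorm after the residual addition (post-norm) is an assumption consistent with the original Transformer.

```python
import torch
import torch.nn as nn

class GranularityFFN(nn.Module):
    """Position-wise feed-forward sub-layer with residual connection and
    layer normalization (sketch). One separate instance per granularity
    (words, sentences, documents), since parameters are not shared."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048) -> None:
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # h^l = LayerNorm(f^l + FFN(f^l))
        return self.norm(f + self.ffn(f))
```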

Sentence Extractor
We build a classifier to select sentences based on the sentence representations $o_s$ from the multi-granularity encoder. The classifier uses a linear transformation layer with the sigmoid activation function to get the prediction score for each sentence:

$$\tilde{y}_{s_{i,j}} = \sigma\big(W_s\, o_{s_{i,j}} + b_s\big)$$

These scores are used to sort the sentences of the multiple documents and produce the extracted summary.
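A minimal sketch of the sentence extractor, assuming the sentence representations are packed into a single tensor; the class name and batching scheme are illustrative.

```python
import torch
import torch.nn as nn

class SentenceExtractor(nn.Module):
    """Scores sentences from their final encoder representations o_s:
    a linear layer followed by a sigmoid, as described above."""
    def __init__(self, d_model: int = 512) -> None:
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, o_s: torch.Tensor) -> torch.Tensor:
        # o_s: (num_sentences, d_model) -> salience scores in (0, 1)
        return torch.sigmoid(self.scorer(o_s)).squeeze(-1)
```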

Summary Generator
The summary generator in our model is also a stack of $L_2$ identical layers. Each layer consists of three parts: a masked multi-head self-attention mechanism, a multi-head cross-attention mechanism, and a fully connected feed-forward network. As the input and output of multi-document summarization are generally long, multi-head attention degenerates as the length increases (Liu and Lapata, 2019). Following the idea of Zhao et al. (2019), we adopt a sparse attention mechanism in which each query attends only to the top-$k$ values according to the weights calculated with the keys, rather than to all values as in the original attention (Vaswani et al., 2017), where $k$ is a hyper-parameter. This ensures that the generator focuses on critical information in the input and ignores much of the irrelevant information. We denote the multi-head sparse attention as MSAttn.
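To make the sparse attention concrete, here is a single-head sketch of top-k attention. Masking the non-top-k logits to $-\infty$ before the softmax is one common way to realize the idea of Zhao et al. (2019); the function name is illustrative.

```python
import torch

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor, topk: int = 5) -> torch.Tensor:
    """Single-head top-k sparse attention (sketch): each query attends only
    to its `topk` highest-scoring keys; all other logits are masked out
    before the softmax. Shapes: q (T_q, d), k and v (T_k, d)."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (T_q, T_k)
    # Threshold at the k-th largest score per query (ties may keep extras).
    kth = scores.topk(min(topk, scores.size(-1)), dim=-1).values[..., -1:]
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(masked, dim=-1) @ v
```

MSAttn would apply this per head after the usual query, key, and value projections.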
Similar to the multi-granularity encoder, we add the positional encodings of the words in the summary to the input embeddings at the bottom of the decoder stack. We denote the output of the $l$-th layer as $g^l$ and the input of the first layer as $g^0$. The self-attention sub-layer with a masking mechanism is used to encode the decoded information; the masking ensures that the prediction at position $t$ depends only on the known outputs at positions before $t$:

$$\tilde{g}^l = \mathrm{LayerNorm}\big(g^{l-1} + \mathrm{MSAttn}(g^{l-1}, g^{l-1})\big) \quad (9)$$

The cross-attention sub-layer takes the self-attention output $\tilde{g}^l$ as the queries and the multi-granularity encoder output $o_w$ as the keys and values to perform multi-head sparse attention. The feed-forward network is used to further transform the outputs.
The generation distribution $p^g_t$ over the target vocabulary is calculated by feeding the output $g^{L_2}_t$ to a softmax layer:

$$p^g_t = \mathrm{softmax}\big(g^{L_2}_t W_g + b_g\big)$$

where $W_g \in \mathbb{R}^{d_{model} \times d_{vocab}}$, $b_g \in \mathbb{R}^{d_{vocab}}$, and $d_{vocab}$ is the size of the target vocabulary. The copy mechanism (Gu et al., 2016) is employed to tackle the problem of out-of-vocabulary (OOV) words. We compute the copy attention $\varepsilon_t$ with the decoder output $g^{L_2}$ and the input representations $o_w$, and further obtain the copy distribution $p^c_t$:
$$p^c_t = \sum_{i,j,k} \varepsilon^{(i,j,k)}_t\, z_{i,j,k}$$

where $z_{i,j,k}$ is the one-hot indicator vector for $w_{i,j,k}$ and $b_\varepsilon \in \mathbb{R}^{d_{vocab}}$ is a bias term used in computing the copy attention. A gate over the decoder output $g^{L_2}$ controls whether to generate words from the vocabulary or to copy words directly from the source text. The final distribution $p_t$ is the "mixture" of the two distributions $p^g_t$ and $p^c_t$:

$$p_t = \gamma_t\, p^c_t + (1 - \gamma_t)\, p^g_t, \qquad \gamma_t = \sigma\big(w_c^\top g^{L_2}_t + b_c\big)$$

where $\sigma$ is the sigmoid function and $w_c$, $b_c$ are learnable parameters.
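The following sketch shows one way to realize this mixing: the copy attention $\varepsilon_t$ is scattered onto vocabulary ids (the effect of the one-hot indicators) and combined with the generation distribution. The tensor names (`p_gen`, `copy_attn`, `src_ids`, `gate`) and the direction of the gate are assumptions for illustration.

```python
import torch

def mix_distributions(p_gen: torch.Tensor, copy_attn: torch.Tensor,
                      src_ids: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Combine the vocabulary distribution with the copy distribution (sketch).

    p_gen:     (T, V) softmax over the target vocabulary (p^g_t)
    copy_attn: (T, S) attention weights over source positions (epsilon_t)
    src_ids:   (S,)   long tensor of vocabulary ids for the source tokens
    gate:      (T, 1) copy gate in (0, 1); 1 means copy, 0 means generate
    """
    # Scatter copy attention onto vocabulary ids: p_copy[t, src_ids[s]] += attn.
    p_copy = torch.zeros_like(p_gen)
    p_copy.scatter_add_(1, src_ids.unsqueeze(0).expand_as(copy_attn), copy_attn)
    return gate * p_copy + (1.0 - gate) * p_gen
```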

Objective Function
We train the sentence extractor and the summary generator in a unified architecture in an end-to-end manner. We use cross entropy as both the extractor loss and the generator loss:

$$\mathcal{L}_{ext} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i,j} \big[\, y_{s_{i,j}} \log \tilde{y}_{s_{i,j}} + (1 - y_{s_{i,j}}) \log (1 - \tilde{y}_{s_{i,j}}) \,\big]$$

$$\mathcal{L}_{gen} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t} \log p_t(y_{w_t})$$

where $y_s$ is the ground-truth extraction label, $y_w$ is the ground-truth summary, and $N$ is the number of samples in the corpus. The final loss is

$$\mathcal{L} = \mathcal{L}_{gen} + \lambda\, \mathcal{L}_{ext}$$

where $\lambda$ is a hyper-parameter.
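A minimal sketch of the joint objective. Which of the two terms $\lambda$ scales is an assumption; the text only states that the final loss combines the two cross-entropy losses with a hyper-parameter $\lambda$ (set to 2 in the experiments).

```python
import torch
import torch.nn.functional as F

def joint_loss(sent_scores: torch.Tensor, sent_labels: torch.Tensor,
               token_logits: torch.Tensor, token_targets: torch.Tensor,
               lam: float = 2.0, pad_id: int = 0) -> torch.Tensor:
    """Joint extractive + abstractive objective (sketch):
    generator cross entropy plus lambda times the extractor
    binary cross entropy."""
    l_ext = F.binary_cross_entropy(sent_scores, sent_labels.float())
    l_gen = F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        token_targets.view(-1),
        ignore_index=pad_id,
    )
    return l_gen + lam * l_ext
```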

Implementation Details
We set our model parameters based on preliminary experiments on the development set. We prune the vocabulary to 50k words and use the source-document word with the maximum copy-attention weight to replace each unknown word in the generated summary. We set the dimension of word embeddings and hidden units $d_{model}$ to 512 and the feed-forward units to 2048. We use 8 heads for multi-head self-attention, masked multi-head sparse self-attention, and multi-head sparse cross-attention. We set the number of multi-granularity encoder layers $L_1$ to 5 and the number of summary decoder layers $L_2$ to 6. We set the dropout (Srivastava et al., 2014) rate to 0.1 and use the Adam optimizer with an initial learning rate $\alpha = 0.0001$, momentum $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay $10^{-5}$. When the validation loss on the development set increases for two consecutive epochs, the learning rate is halved. We use a mini-batch size of 10 and set the hyper-parameters $k = 5$ and $\lambda = 2$. Given the salience scores predicted by the sentence extractor, we apply a simple greedy procedure to select sentences: we select sentences in descending order of salience score and append them to the extracted summary until the summary reaches 300 words. We disallow repeating the same trigram (Paulus et al., 2018; Edunov et al., 2019) and use beam search with a beam size of 5 for the summary generator.
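The greedy selection procedure for the extractive summary can be sketched as follows; counting words by whitespace splitting is a simplification.

```python
def select_sentences(sentences, scores, max_words: int = 300):
    """Greedy extractive inference (sketch): take sentences in descending
    order of predicted salience and stop once the summary reaches the
    word budget (300 words in the paper)."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    summary, n_words = [], 0
    for idx in order:
        summary.append(sentences[idx])
        n_words += len(sentences[idx].split())
        if n_words >= max_words:
            break
    return summary
```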

Metrics and Baselines
We use ROUGE (Lin, 2004) to evaluate the produced summaries in our experiments. Following previous work, we report ROUGE F1 on the Multi-News dataset.

Table 2: ROUGE F1 scores on the Multi-News test set.

Model                                     R-1    R-2    R-SU4
Lead-3                                    39.41  11.77  14.51
LexRank (Erkan and Radev, 2004)           38.27  12.70  13.20
TextRank (Mihalcea and Tarau, 2004)       38.44  13.10  13.50
MMR (Carbonell and Goldstein, 1998)       38.77  11.98  12.91
HIBERT (Zhang et al., 2019)               43.86  14.62  18.34
PGN (See et al., 2017)                    41.85  12.91  16.46
CopyTransformer (Gehrmann et al., 2018)   43.57  14.03  17.37
Hi-MAP (Fabbri et al., 2019)              43.47  14.89  17.41
HT (Liu and Lapata, 2019)                 43…    15.60  …
MGSum-ext (ours)                          44.75  15.75  19.30
MGSum-abs (ours)                          46.00  16.81  20.09

We compare our model with several typical baselines as well as several recently proposed ones. Lead-3 is an extractive baseline which concatenates the first 3 sentences of each source document as a summary. LexRank (Erkan and Radev, 2004) is an unsupervised graph-based method for computing relative sentence importance in extractive summarization. TextRank (Mihalcea and Tarau, 2004) is also an unsupervised algorithm; sentence importance scores are computed based on eigenvector centrality within weighted graphs for extractive sentence summarization. MMR (Carbonell and Goldstein, 1998) extracts sentences from a ranked list of candidate sentences based on relevance and redundancy. HIBERT (Zhang et al., 2019) first encodes each sentence using a sentence Transformer encoder and then encodes the whole document using a document Transformer encoder; it is a single-document summarization model that cannot handle the hierarchical relationships between documents, so we migrate it to multi-document summarization by concatenating the multiple source documents into one long sequence. These extractive methods are set to give an output of 300 tokens. PGN (See et al., 2017) is an RNN-based model with an attention mechanism that allows the system to copy words from the source text via pointing for abstractive summarization. CopyTransformer (Gehrmann et al., 2018) augments the Transformer with one attention head chosen randomly as the copy distribution. Hi-MAP (Fabbri et al., 2019) expands the pointer-generator network into a hierarchical network and integrates an MMR module to calculate sentence-level scores; it is trained on the Multi-News corpus. The baselines above have been compared and reported in Fabbri et al.
(2019), which releases the Multi-News dataset, and we directly cite the results of the above methods from that paper. HT (Liu and Lapata, 2019) is a Transformer-based model with an attention mechanism that shares information across documents for abstractive multi-document summarization. It was originally used to generate Wikipedia, and we reproduce the method for multi-document news summarization.

Automatic Evaluation
Following previous work, we report ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-SU4 (skip bigrams with a maximum distance of 4 words) scores as the metrics for automatic evaluation (Lin and Hovy, 2003). In Table 2, we report the results on the Multi-News test set and our proposed multi-granularity model (denoted as MGSum) outperforms various previous models. Our abstractive method achieves scores of 46.00, 16.81, and 20.09 on the three ROUGE metrics while our extractive method achieves scores of 44.75, 15.75, and 19.30 on the three ROUGE metrics. We can also see that the abstractive methods perform better than the extractive methods. We attribute this result to the observation that the gold summary of this dataset tends to use new expressions to summarize the original input documents.
Owing to the characteristics of news, Lead-3 is superior to all unsupervised extractive methods. Our extractive method achieves about a 1.13-point improvement on ROUGE-2 F1 compared with HIBERT. We attribute the improvement to two aspects. Firstly, with the multi-granularity interaction network, the abstractive objective can promote the recognition of important sentences for the extractive model. Besides, since the extractive gold-label sequences are obtained by greedily optimizing ROUGE-2 F1 against the gold-standard summary, the gold labels may not be accurate, and joint learning of the two objectives may correct some of the biases caused by the inaccurate labels. We calculate the oracle result based on the gold-label extractive sequences, which achieves a score of 29.78 on ROUGE-2 F1, 14.03 points higher than the score of our extractive method. Given this large gap between our model and the oracle, more effort can be made to improve extractive performance.
Among the abstractive baselines, CopyTransformer performs much better than PGN, achieving a 1.12-point improvement on ROUGE-2 F1, which demonstrates the superiority of the Transformer architecture. Our abstractive model gains an improvement of 2.78 points compared with CopyTransformer, 1.92 points compared with Hi-MAP, and 1.21 points compared with HT on ROUGE-2 F1, which verifies the effectiveness of the proposed multi-granularity interaction network for summary generation.

Figure 3: Human evaluation. The compared system summaries are rated on a Likert scale of 1 (worst) to 5 (best).

Human Evaluation
To evaluate the linguistic quality of generated summaries, we carry out a human evaluation. We focus on three aspects: fluency, informativeness, and non-redundancy. The fluency indicator focuses on whether the summary is well-formed and grammatical. The informativeness indicator reflects whether the summary covers salient points from the input documents. The non-redundancy indicator measures whether the summary avoids repeating information. We sample 100 instances from the Multi-News test set and employ 5 graduate students to rate each summary. Each judge evaluates all outputs of the different systems for the same sample. Three human judgments are obtained for every sample, and the final scores are averaged across the different judges.
Results are presented in Figure 3. We can see that our model performs much better than all baselines. On the fluency indicator, our model achieves a score of 3.22, higher than the 2.98 of CopyTransformer and the 3.07 of HT, indicating that our model reduces grammatical errors and improves the readability of the summary. On the informativeness indicator, our model scores 0.32 higher than HT, indicating that our model can effectively capture the salient information. On the non-redundancy indicator, MGSum outperforms all baselines by a large margin, which indicates that the multi-granularity semantic information and joint learning with extractive summarization do help to avoid repeated information in the generated summary.

Ablation Study
We perform an ablation study on the development set to investigate the influence of different modules in our proposed MGSum model. Modules are tested in four ways: (1) we remove the sentence extractor and only train the generator to verify the effectiveness of joint learning for abstractive summarization; (2) we remove the summary generator and only train the sentence extractor to verify the effectiveness of joint learning for extractive summarization; (3) we remove the document representations and use only the sentence and word representations to verify the effectiveness of document-granularity semantic information; (4) we remove the document and sentence representations and use only the word representations to further verify the importance of the sentence representations. Since there are no interactions between the sentences of different documents without document representations, we establish connections between all sentences after the document representations are removed. Furthermore, we also establish connections between all words after the sentence representations are removed, and the model degenerates into a Transformer at this point.

Table 3 presents the results. We find that the ROUGE-2 F1 score of extractive summarization drops by 0.31 after the summary generator is removed, which indicates that the joint learning method helps extractive summarization benefit from abstractive summarization. The ROUGE-2 F1 score of abstractive summarization drops by 0.6 after the sentence extractor is removed, which indicates that extractive summarization does help abstractive summarization identify important sentences during the interactive encoding phase. The ROUGE-2 F1 score of extractive summarization drops by 0.4, while the ROUGE-2 F1 score of abstractive summarization drops by 0.3, after the document representations are removed, indicating that establishing the document representations to model the relationships between documents is necessary to improve the performance of both extractive and abstractive summarization. The ROUGE-2 F1 score drops by 1.61 compared with MGSum and by 1.01 compared with the generator-only variant after removing both the document and sentence representations; moreover, without sentence representations there is no extractive summarization to co-promote the recognition of important information for abstractive summarization. This indicates that sentence-granularity semantic information is of great importance for encoding multiple documents.

Table 4: Example summaries. "Human" denotes the ground-truth summary; HT is the baseline of Liu and Lapata (2019); MGSum-ext and MGSum-abs are our extractive and abstractive outputs.

Human: -it ' s a race for the governor ' s mansion in 11 states today , and the gop could end the night at the helm of more than two-thirds of the 50 states . the gop currently controls 29 of the country ' s top state offices ; it ' s expected to keep the three republican ones that are up for grabs ( utah , north dakota , and indiana ) , and wrest north carolina from the dems . that brings its toll to 30 , with the potential to take three more , reports npr . races in montana , new hampshire , and washington are still too close to call , and in all three , democrat incumbents aren ' t seeking reelection . the results could have a big impact on health care , since a supreme court ruling grants states the ability to opt out of obamacare ' s medicaid expansion . " a romney victory would dramatically empower republican governors , " said one analyst . click for npr ' s state-by-state breakdown of what could happen .

HT: -delaware , new hampshire , and missouri are expected to notch safe wins in 11 states , reports npr . the state ' s top state of the state has seen its top state offices , and it ' s expected to be more than twothirds of the nation ' s state , reports the washington post . the top 10 : montana , montana , and rhode island . indiana : missouri : the state is home to the top of the list of state offices . new hampshire : montana : incumbent john kasich : he ' s the first woman to hold a state seat in the state , notes the huffington post . north carolina : the only state to win gop-held seats in vermont and delaware . new jersey : the biggest state in the history of the year has seen a population of around 40 % of the population , reports ap . montana : new hampshire and missouri : a state department of emergency has been declared a state of emergency . click for the full list , or check out a list of the states that voted tonight .

MGSum-ext: gop eyes gains as voters in 11 states pick governors enlarge this image toggle caption jim cole/ap jim cole/ap voters in 11 states will pick their governors tonight , and republicans appear on track to increase their numbers by at least one , with the potential to extend their hold to more than two-thirds of the nation ' s top state offices . and that ' s health care , says political scientist thad kousser , co-author of the power of american governors . " republicans currently hold 29 governorships , democrats have 20 , and rhode island ' s gov . lincoln chafee is an independent . eight of the gubernatorial seats up for grabs are now held by democrats ; three are in republican hands . polls and race analysts suggest that only three of tonight ' s contests are considered competitive , all in states where incumbent democratic governors aren ' t running again : montana , new hampshire and washington .

MGSum-abs: -voters in 11 states will pick their governors tonight , and republicans appear on track to increase their numbers by at least one , with the potential to extend their hold to more than two-thirds of the nation ' s top state offices . republicans currently hold 29 governorships , democrats have 20 , and rhode island ' s gov . lincoln chafee is an independent . the seat is expected to be won by former charlotte mayor walter dalton , who won his last election with 65 % of the vote , reports the washington post . democrats are expected to hold on to their seats in west virginia and missouri , and democrats are likely to hold seats in vermont and delaware , reports npr . polls and race analysts say that only three of tonight ' s contests are considered competitive , and all in states where incumbent democratic governors aren ' t running again . " no matter who wins the presidency , national politics is going to be stalemated on the affordable care act , " says one political scientist .

Case Study
In Table 4, we present example summaries generated by the strong baseline HT and by our extractive and abstractive methods. The output of our model has the highest overlap with the ground truth. Moreover, our extractive and abstractive summaries show consistent behavior, with high overlap between them, which further indicates that the two methods jointly promote the recognition of important information. Compared with the extracted summary, the generated summary is more concise and coherent.

Conclusion and Future Work
In this work, we propose a novel multi-granularity interaction network to encode semantic representations for documents, sentences, and words. It unifies extractive and abstractive summarization by utilizing the word representations to generate the abstractive summary and the sentence representations to extract sentences. Experimental results show that the proposed method significantly outperforms all strong baseline methods and achieves the best results on the Multi-News dataset.
In the future, we will introduce more tasks like document ranking to supervise the learning of the multi-granularity representations for further improvement.