Adapting Neural Single-Document Summarization Model for Abstractive Multi-Document Summarization: A Pilot Study

To date, neural abstractive summarization methods have achieved great success in single document summarization (SDS). However, due to the lack of large-scale multi-document summaries, such methods can hardly be applied to multi-document summarization (MDS). In this paper, we investigate neural abstractive methods for MDS by adapting a state-of-the-art neural abstractive summarization model for SDS. We propose an approach to extend the neural abstractive model trained on large-scale SDS data to the MDS task. Our approach makes use of only a small number of multi-document summaries for fine-tuning. Experimental results on two benchmark DUC datasets demonstrate that our approach can outperform a variety of baseline neural models.


Introduction
Document summarization is the task of automatically producing a summary for given documents. Unlike Single Document Summarization (SDS), which generates a summary for each given document, Multi-Document Summarization (MDS) aims to generate a summary for a set of topic-related documents. Previous approaches to document summarization can generally be categorized into extractive methods and abstractive methods. Extractive methods produce a summary by extracting and merging sentences from the original document(s), while abstractive methods generate a summary using arbitrary words and expressions based on an understanding of the document(s). Due to the difficulty of natural language understanding and generation, previous research on document summarization has focused more on extractive methods (Yao et al., 2017). However, extractive methods suffer from the inherent drawbacks of discourse incoherence and long, redundant sentences, which hamper their application in practice (Tan et al., 2017). Recently, with the success of sequence-to-sequence (seq2seq) models in natural language generation tasks including machine translation (Bahdanau et al., 2014) and dialog systems (Mou et al., 2016), abstractive summarization methods have received increasing attention. Given a large-scale corpus of human summaries, it is possible to train an abstractive summarization model in an end-to-end framework. Neural abstractive summarization models (See et al., 2017; Tan et al., 2017) have surpassed the performance of extractive methods on the single document summarization task, where training data is abundant.
Unfortunately, the extension of seq2seq models to MDS is not straightforward. Neural abstractive summarization models are usually trained on hundreds of thousands of gold summaries, but very few human summaries are available for the MDS task. More specifically, in the news domain, only a few hundred multi-document summaries are provided by the DUC and TAC conferences in total, which is far from sufficient for training neural abstractive models. Apart from insufficient training data, neural models for abstractive MDS also face the challenge of much more input content, and their study is still at a preliminary stage.
In this study, we investigate applying seq2seq models to the MDS task. We attempt various ways of extending neural abstractive summarization models pre-trained on SDS data to the MDS task, and reveal that neural abstractive summarization models do not transfer well to a different dataset. We then study the factors that affect the transfer performance, and propose methods to adapt the pre-trained model to the MDS task. We also study how to leverage the small amount of MDS training data to further improve the pre-trained model. We conduct experiments on the benchmark DUC datasets, and the experimental results demonstrate that our approach achieves considerable improvement over a variety of neural baselines.
The contributions of this study are summarized as follows:
• To the best of our knowledge, our work is one of the very few pioneering works to investigate adapting neural abstractive summarization models of single document summarization to the task of multi-document summarization.
• We propose a novel approach to adapt the neural model trained on the SDS data to the MDS task, and leverage the few MDS training data to further improve the pre-trained model.
• Evaluation results demonstrate the efficacy of our proposed approach, which outperforms a variety of neural baselines.
We organize the paper as follows. In Section 2 we introduce related work. In Section 3 we describe the previous neural abstractive summarization model. Then we introduce our proposed approach in Section 4. Experiment results and discussion are presented in Section 5. Finally, we conclude this paper in Section 6.
Related Work

Extractive Summarization Methods
The study of MDS was pioneered by McKeown and Radev (1995), and early notable works also include McKeown et al. (1999) and Radev et al. (2000). Extractive summarization systems, which compose a summary from a number of important sentences in the source documents, are by far the most popular solution for MDS (Avinesh and Meyer, 2017). Redundancy is one of the biggest problems for extractive methods (Gambhir and Gupta, 2017), and Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) is a well-known algorithm for reducing redundancy.
In the past years various models under the extractive framework have been proposed (Tao et al., 2008; Wan and Yang, 2008; Wang et al., 2011; Tan et al., 2015). One important architecture models MDS as a budgeted maximum coverage problem, including the original approach (McDonald, 2007) and improved models (Woodsend and Lapata, 2012; Li et al., 2013; Boudin et al., 2015). There are still recent studies under the traditional extractive framework (Peyrard and Eckle-Kohler, 2017; Avinesh and Meyer, 2017).

Abstractive Summarization Methods
Abstractive summarization methods aim at generating the summary based on an understanding of the original documents. Sequence-to-sequence models with an attention mechanism have been applied to the abstractive summarization task. Successful attempts have been made on sentence summarization (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016) and single document summarization (Tan et al., 2017; See et al., 2017; Paulus et al., 2017), which have abundant gold summaries for training an end-to-end system. Only very recently have there been attempts at abstractive multi-document summarization under the seq2seq framework. The lack of sufficient training examples is the major obstacle here. To address this, Liu et al. (2018) study the task of generating English Wikipedia articles from the viewpoint of multi-document summarization. They construct a large corpus with reference summaries, so that end-to-end training of a seq2seq model becomes feasible. Their study reveals that a seq2seq model works when there is abundant training data for MDS. Very recently, Baumel et al. (2018) try to apply a pre-trained abstractive SDS model to the query-focused summarization task. They sort the input documents and then iteratively apply the SDS model to summarize each single document until the length limit is reached. Their major concern is incorporating query information into the abstractive model or using the query to filter the original documents, which differs from our work, which focuses on generic multi-document summarization. Moreover, the intuitive idea of using the SDS model to summarize each single document in the multi-document set is also adopted in our baseline models for comparison.

Preliminaries
In this work we investigate an abstractive MDS approach based on the state-of-the-art neural abstractive model of Tan et al. (2017). Compared with another neural abstractive model in See et al. (2017), Tan et al. (2017) adopt a hierarchical encoder-decoder framework, which we found to scale better to more and longer input documents. The model is named SinABS in this paper. SinABS uses a hierarchical encoder-decoder framework like Li et al. (2015), where a PageRank (Page et al., 1999) based attention mechanism is proposed to identify salient sentences in the original documents. The SinABS model is illustrated in Figure 1. We introduce the SinABS model following Tan et al. (2017).
Figure 1: SinABS model. The figure is borrowed from Tan et al. (2017).

Encoder
The goal of the encoder is to encode the input documents into vector representations. SinABS adopts a hierarchical encoder framework, where a word encoder $enc_{word}$ encodes a sentence into a sentence representation from its words, as $h_{i,k} = enc_{word}(h_{i,k-1}, e_{i,k})$, where $h_{i,k}$ represents the hidden state when the LSTM receives word $e_{i,k}$. Then a sentence encoder $enc_{sent}$ encodes an input document into the document representation from its sentences, as $h_i = enc_{sent}(h_{i-1}, x_i)$, where $x_i$ is the last hidden state when the word encoder has received the whole sentence $i$. The input to the word encoder is the word sequence of a sentence, appended with an "<eos>" token indicating the end of the sentence. The last hidden state after the word encoder receives "<eos>" is used as the embedding representation of the sentence. The sentence encoder sequentially receives the embeddings of the sentences. A pseudo sentence consisting of an "<eod>" token is appended at the end of the document to indicate the end of the whole document. The hidden state after the sentence encoder receives "<eod>" is treated as the representation of the input document, denoted as $d$. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is used as both the word encoder $enc_{word}$ and the sentence encoder $enc_{sent}$.
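The two-level encoding described above can be illustrated with a minimal NumPy sketch. It is not the SinABS implementation: a plain tanh recurrent cell stands in for the LSTM, and the names `rnn_step`, `encode_sequence`, `encode_document`, the random parameters, and the dimensions are all hypothetical, chosen only to show how word-level states roll up into sentence vectors and then into a document vector.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x):
    """One step of a plain tanh recurrent cell (a stand-in for the LSTM)."""
    return np.tanh(W_h @ h_prev + W_x @ x)

def encode_sequence(inputs, W_h, W_x):
    """Run the cell over a sequence of vectors and return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x)
    return h

def encode_document(sentences, emb, word_params, sent_params, eos_id, eod_vec):
    """Hierarchical encoding: words -> sentence vectors -> document vector.

    sentences: list of lists of word ids (without the <eos> token).
    eod_vec:   hypothetical vector for the pseudo sentence "<eod>".
    """
    sent_vecs = []
    for word_ids in sentences:
        # Append <eos>; the state after reading it is the sentence embedding.
        word_embs = [emb[w] for w in word_ids + [eos_id]]
        sent_vecs.append(encode_sequence(word_embs, *word_params))
    # Append the <eod> pseudo sentence; the final state represents the document.
    doc_vec = encode_sequence(sent_vecs + [eod_vec], *sent_params)
    return sent_vecs, doc_vec

# Toy setup with random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
vocab, emb_dim, hid = 10, 4, 6
emb = rng.normal(size=(vocab, emb_dim))
word_params = (rng.normal(size=(hid, hid)) * 0.1, rng.normal(size=(hid, emb_dim)) * 0.1)
sent_params = (rng.normal(size=(hid, hid)) * 0.1, rng.normal(size=(hid, hid)) * 0.1)
eod_vec = rng.normal(size=hid)

sent_vecs, doc_vec = encode_document([[1, 2, 3], [4, 5]], emb, word_params,
                                     sent_params, eos_id=0, eod_vec=eod_vec)
print(len(sent_vecs), doc_vec.shape)
```

The sentence vectors returned here are the quantities over which the attention mechanism later ranks sentences, while the document vector initializes the decoder.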

Decoder
Mirroring the hierarchical encoder, the sentence decoder $dec_{sent}$ receives the document representation $d$ as the initial state $h_0 = d$, and predicts the sentence representations sequentially, by $h_j = dec_{sent}(h_{j-1}, x_{j-1})$, where $x_{j-1}$ is the encoded representation of the previously generated sentence $s_{j-1}$. The word decoder $dec_{word}$ receives a sentence representation $h_j$ as the initial state $h_{j,0} = h_j$, and predicts the word representations sequentially, by $h_{j,k} = dec_{word}(h_{j,k-1}, e_{j,k-1})$, where $e_{j,k-1}$ is the embedding of the previously generated word. The predicted word representations are first concatenated with the context vector $c_j$, then mapped to vectors of vocabulary-size dimension by a projection layer, and finally normalized by a softmax layer to give the probability distribution over the words in the vocabulary. The word decoder stops when it generates the "<eos>" token, and similarly the sentence decoder stops when it generates the "<eod>" token.

Attention Mechanism
The attention mechanism used in SinABS sets a different context vector $c_j$ when generating the words of sentence $j$, by $c_j = \sum_i \alpha_i^j h_i$. The graph-based attention mechanism in Tan et al. (2017) adopts the topic-sensitive PageRank algorithm to compute the attention weights, by
$$f = (1 - \lambda)\,(I - \lambda W D^{-1})^{-1}\, y, \tag{1}$$
where $f = [f_1, \ldots, f_n] \in \mathbb{R}^n$ denotes the rank scores of the $n$ original sentences. $D$ is a diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th column of $W$. $W(i, j) = h_i^T P h_j$, where $P$ is a parameter matrix to be learned. $\lambda$ is a damping factor and is set to 0.9. $y \in \mathbb{R}^n$ is a one-hot vector with only $y_0 = 1$. The rank scores are then integrated with a distraction mechanism, and the attention weights are finally computed as:
$$\alpha_i^j = \frac{\max(f_i^j - f_i^{j-1}, 0)}{\sum_l \max(f_l^j - f_l^{j-1}, 0)}, \tag{2}$$
where $f_i^j$ denotes the rank score of sentence $i$ when decoding sentence $j$.

Our Approach
In this section we introduce our approach. Our abstractive MDS model is an extension of the single document summarization model SinABS. It is an encoder-decoder framework, which takes all the documents of a document set as input, encodes the documents into a document set representation, and then generates the summary with a decoder. To adapt SinABS to the MDS task, our model differs from SinABS in the encoder and the attention mechanism, and it is also tuned on the MDS dataset. The framework of our model is illustrated in Figure 2.

Multi-Document Encoder
The major difference of MDS is that we need to generate a summary for multiple input documents. So our system needs to deal with the multiple input documents although SinABS is trained to generate a summary for one document. Considering that the decoder generates the summary from the representation vector encoded by the encoder, we can generate a summary for a document set if the document set is encoded to a representation vector containing its key information. In our approach, we achieve this by adding a document set encoder, to encode a set of document representation vectors into a document set representation. Thus the hierarchical encoder structure becomes three levels.
The document set encoder $enc_{docset}$ takes the document vectors $\{d_m\}$, $m \in [1, M]$, as input, where $M$ is the number of documents in a document set, and produces a new document set vector $\tilde{d}$; $\tilde{d}$ is then provided to the decoder to generate the summary for the document set. The decoder is a two-level hierarchical framework similar to that in Tan et al. (2017). Since there is no order or dependency relationship between different documents in a document set, it is not reasonable to use an LSTM as the document set encoder. Instead, we define the document set encoder as:
$$\tilde{d} = \sum_{m=1}^{M} w_m d_m, \tag{3}$$
where $w = [w_1, \ldots, w_M] \in \mathbb{R}^M$ is a weight vector to merge the document vectors into a document set representation. The weight vector $w$ can be fixed as $w = [1/M, \ldots, 1/M]$, but in our system we wish to assign a different $w_m$ to each $d_m$, since different documents may contribute differently to the overall summary. However, it is unreasonable to treat $w$ as a parameter vector and learn it directly, because the weight $w_m$ for $d_m$ should be based on $d_m$. The position of a document should not affect its weight since there is no order in a document set.
In our system the weight for a document is decided based on the document itself and its contribution to the representation of the overall document set. Therefore, we define:
$$w_m = \frac{q^T [d_m; d_\Sigma]}{\sum_{m'} q^T [d_{m'}; d_\Sigma]}, \tag{4}$$
where $d_\Sigma = \sum_m d_m$ and $[d_m; d_\Sigma]$ is the concatenation of $d_m$ and $d_\Sigma$. The intuitive explanation of Eq. 4 is that the weight of $d_m$ is decided by its relationship (modeled by a parameterized dot product) with the representation of the whole document set $d_\Sigma$. $q$ is the parameter vector to be learned, whose dimension is twice the dimension of $d_m$ or $d_\Sigma$.
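As a rough illustration of the document set encoder, the sketch below merges document vectors with content-dependent weights. It assumes each weight is the score $q^T [d_m; d_\Sigma]$ divided by the sum of all such scores; this linear normalization is an assumption of the sketch (a softmax would be an alternative), and the function name `document_set_vector` and the toy values are hypothetical.

```python
import numpy as np

def document_set_vector(doc_vecs, q):
    """Merge M document vectors into one document set vector.

    doc_vecs: array of shape (M, dim), one row per document vector d_m.
    q:        parameter vector of shape (2 * dim,).
    Assumption: raw scores q^T [d_m; d_sum] are normalised by their sum.
    """
    d_sum = doc_vecs.sum(axis=0)                      # d_Sigma
    # Concatenate every d_m with d_Sigma: shape (M, 2 * dim).
    pairs = np.concatenate(
        [doc_vecs, np.tile(d_sum, (len(doc_vecs), 1))], axis=1)
    scores = pairs @ q                                # one score per document
    weights = scores / scores.sum()                   # weights sum to 1
    return weights, weights @ doc_vecs                # (w, weighted sum of d_m)

# Toy example with three 2-dimensional document vectors.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([0.5, 0.5, 0.1, 0.1])
weights, d_tilde = document_set_vector(doc_vecs, q)
print(weights, d_tilde)
```

Because each weight depends only on the document vector and the order-independent sum, shuffling the documents permutes the weights but leaves the document set vector unchanged, which is the property that motivates avoiding an LSTM here.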

Attention
The decoder receives the document set vector $\tilde{d}$ as its initial state and generates the output summary from the document set representation. The difference from the SinABS decoder is that the attention distribution must now be computed over all the sentences in a document set. Not only does the number of original sentences become larger, but the original sentences also come from different documents. Nevertheless, we believe the topic-sensitive PageRank attention mechanism is still able to identify salient sentences, since a similar idea in the LexRank and TextRank methods achieves good performance on MDS. Therefore, the attention distribution is now computed over all the input sentences, by running the topic-sensitive PageRank algorithm in Eq. 1 and Eq. 2 on all the original sentences. However, a problem arises because the number of original sentences is much larger than in the single document summarization task. Even though the graph-based attention mechanism is still able to rank the relevance and salience of the original sentences, the attention distribution becomes too dispersed and uniform. As a result, too many sentences contribute to the context vector, so that the context vector contains too much information. We believe a more concentrated attention distribution is preferable. Therefore, when computing the attention weights, only the top $K$ ranked sentences receive attention weights. This can be easily realized by setting the rank scores of sentences outside the $K$ largest to the minimum value and re-normalizing the attention weights. $K$ is a hyper-parameter.
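The concentration step can be sketched in a few lines of NumPy. This is an illustration rather than the authors' code: `pagerank_scores` uses the standard closed-form PageRank solution $f = (1 - \lambda)(I - \lambda W D^{-1})^{-1} y$ and assumes a non-negative similarity matrix, `top_k_attention` takes "minimum value" to mean zero, the distraction mechanism is omitted, and the random matrix `W` is only a placeholder for learned sentence similarities.

```python
import numpy as np

def pagerank_scores(W, y, lam=0.9):
    """Closed-form PageRank: f = (1 - lam) * (I - lam * W D^-1)^-1 y.

    W: (n, n) non-negative similarity matrix; D holds its column sums.
    y: (n,) preference vector (one-hot in the topic-sensitive setting).
    """
    n = W.shape[0]
    D_inv = np.diag(1.0 / W.sum(axis=0))
    return (1.0 - lam) * np.linalg.solve(np.eye(n) - lam * W @ D_inv, y)

def top_k_attention(scores, k):
    """Keep the k largest scores, zero the rest, and renormalise to sum to 1."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    keep = np.argsort(scores)[-k:]        # indices of the k largest scores
    masked = np.zeros_like(scores)
    masked[keep] = scores[keep]
    return masked / masked.sum()

# Toy example: 40 sentences, attention restricted to the 15 best ranked.
rng = np.random.default_rng(0)
n = 40
W = rng.random((n, n))                    # placeholder similarities
y = np.zeros(n)
y[0] = 1.0                                # one-hot preference vector
f = pagerank_scores(W, y)
alpha = top_k_attention(f, k=15)
print(np.count_nonzero(alpha), alpha.sum())
```

Masking before re-normalizing leaves the relative weights among the retained sentences unchanged; only the mass previously spread over low-ranked sentences is redistributed.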

Model Tuning
SinABS is trained on a single document summarization corpus, CNN/DailyMail. Although both the CNN/DailyMail corpus and the DUC datasets are news data, their reference summaries differ considerably. In order to better adapt the SinABS model to the MDS task, we fine-tune the pre-trained SinABS model, even though we have only a few reference summaries for the MDS task. In our approach we tune the decoders of the model. The tuned parameters are the LSTM parameters of the word and sentence decoders, and the weight vector $q$ in the document set encoder. The loss function and the optimization algorithm are the same as those of the original SinABS model: we use the cross-entropy loss and the Adam (Kingma and Ba, 2014) algorithm to train the model. To prevent overfitting, training is stopped when performance begins to decrease.

Implementation
We implement our approach based on the source code and the model pre-trained on the CNN/DailyMail corpus provided by Tan et al. (2017). We process the DUC datasets in the same way as Tan et al. (2017), including tokenizing and lowercasing the text, replacing all digit characters with the "#" symbol, and labeling all named entities with the CoreNLP toolkit. The "#" symbols are mapped back to the original digits after decoding according to the context. We also implement our model in Theano based on the SinABS model. K is set to 15 based on tuning on the training set.
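The lowercasing and digit-masking steps can be sketched as follows. The regular-expression tokenizer is only a simple stand-in for the CoreNLP tokenizer, named entity labeling is not shown, and the function name `preprocess` is hypothetical.

```python
import re

def preprocess(text):
    """Lowercase, replace each digit with '#', and split into tokens.

    The regex tokenizer is a simple stand-in, not the CoreNLP tokenizer.
    """
    text = re.sub(r"\d", "#", text.lower())
    # Tokens: words (optionally with a leading apostrophe, as in "'s"),
    # runs of '#', or single punctuation marks.
    return re.findall(r"'?\w+|#+|[^\w\s]", text)

tokens = preprocess("Cambodia's parties met on Nov. 13, 1998.")
print(tokens)
```

Masking digits keeps the vocabulary small, at the cost of needing the context-based step described above to restore the original numbers after decoding.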

Evaluation Metric
ROUGE: We use the ROUGE-1.5.5 (Lin and Hovy, 2003) toolkit and report the Rouge-1, Rouge-2 and Rouge-SU4 F1-scores, which have been widely adopted by DUC and TAC for automatic summary quality evaluation. ROUGE measures summary quality by counting overlapping units such as n-grams, word sequences and word pairs between the candidate summary and the reference summary.
Edit distance: In order to test whether our model is truly abstractive, instead of simply copying relevant fragments verbatim from the input documents, we compute the word edit distance between each generated sentence $s_i$ and its most similar original sentence, as $ed_i$, and report the average $ED = \frac{1}{n} \sum_{i=1}^{n} ed_i$. Considering the significant difference in length between sentences, we also divide the word edit distance of each generated sentence by its number of words $w_i$, as $ED/w = \frac{1}{n} \sum_{i=1}^{n} ed_i / w_i$.
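A minimal sketch of the two edit-distance statistics is given below. It uses the standard Levenshtein distance over word tokens and assumes that the "most similar" original sentence is the one with the smallest edit distance; the function names and the toy sentences are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def edit_distance_scores(generated, originals):
    """Return (ED, ED/w) averaged over the generated sentences.

    Assumption: the most similar original sentence is the one with the
    smallest word edit distance to the generated sentence.
    """
    eds = [min(word_edit_distance(g, o) for o in originals) for g in generated]
    n = len(generated)
    ed = sum(eds) / n
    ed_per_word = sum(e / len(g) for e, g in zip(eds, generated)) / n
    return ed, ed_per_word

originals = ["the two parties formed a coalition government".split(),
             "the king praised the agreement".split()]
generated = ["the parties formed a new government".split(),
             "the king praised the agreement".split()]
ed, ed_per_word = edit_distance_scores(generated, originals)
print(ed, ed_per_word)
```

A sentence copied verbatim contributes zero to both averages, so higher values indicate output that departs further from the source sentences.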

Baselines
To verify the effectiveness of our approach, we investigate various strategies for adapting SinABS to the MDS task for comparison. Since SinABS takes one document as input but there are multiple input documents in the MDS task, we explore four possible approaches to address this ("ex." indicates an extractive method and "ab." indicates an abstractive method; SinABS is denoted as ∆).
Single-ab.: One representative document of every document set is selected as the input document to the SinABS model. This is the most straightforward way to adapt single document summarization model to the MDS task. The representative document is chosen by conducting the PageRank (Page et al., 1999) algorithm on every document set. This baseline is denoted as P.R.+∆.
Single-ex.+Merge+Single-ab.: Instead of selecting one representative document, we also investigate constructing a pseudo document as the input to SinABS. We achieve this by first using an extractive single document summarization method to summarize every input document, and then concatenating these summaries to form a new document. The motivation of this strategy is to keep only the important content of the original documents, so that the input both carries the key information and is suitable for SinABS to handle. The methods for extractive summarization are Lead, LexRank, TextRank and Centroid. These four baselines are denoted as Lead/Lex./Text./Cent.+∆ respectively.
Single-ab.+Merge+Single-ab.: Generate the abstractive summary for every original document with SinABS. Then the abstractive summaries are concatenated to form a pseudo document, which is given as input to SinABS again. The difference from Single-ex.+Merge+Single-ab. is that no extractive methods are required. This baseline is denoted as ∆+∆.
Single-ab.+Multi-ex.: Generate the summary for every original document, then summarize these summaries using some extractive MDS method instead of SinABS to get the final summary. The extractive MDS methods used are Lead, LexRank, TextRank, Centroid and Coverage.
Note that Coverage is specially designed for the MDS task, therefore it is not used in Single-ex.+Merge+Single-ab. baselines.
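How the four strategies compose SinABS with extractive components can be sketched as follows. The callables `sin_abs`, `ext_sds`, `ext_mds` and `pick_representative` are hypothetical placeholders for the trained SinABS model, an extractive single-document summarizer, an extractive multi-document summarizer, and the PageRank-based document selector; the toy stand-ins at the bottom exist only to make the sketch executable and do not summarize anything meaningfully.

```python
def single_ab(docs, sin_abs, pick_representative):
    """Single-ab.: summarize one representative document."""
    return sin_abs(pick_representative(docs))

def single_ex_merge_single_ab(docs, sin_abs, ext_sds):
    """Single-ex.+Merge+Single-ab.: extract per document, concatenate, abstract."""
    pseudo_doc = [sent for doc in docs for sent in ext_sds(doc)]
    return sin_abs(pseudo_doc)

def single_ab_merge_single_ab(docs, sin_abs):
    """Single-ab.+Merge+Single-ab.: abstract per document, concatenate, abstract again."""
    pseudo_doc = [sent for doc in docs for sent in sin_abs(doc)]
    return sin_abs(pseudo_doc)

def single_ab_multi_ex(docs, sin_abs, ext_mds):
    """Single-ab.+Multi-ex.: abstract per document, then extract across the summaries."""
    return ext_mds([sin_abs(doc) for doc in docs])

# Toy stand-ins (hypothetical): documents are lists of sentence strings.
sin_abs = lambda doc: doc[:2]                       # "summary" = first two sentences
ext_sds = lambda doc: doc[:1]                       # "summary" = first sentence
ext_mds = lambda summaries: [s[0] for s in summaries]
pick_representative = lambda docs: max(docs, key=len)

docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
print(single_ab(docs, sin_abs, pick_representative))
print(single_ex_merge_single_ab(docs, sin_abs, ext_sds))
print(single_ab_merge_single_ab(docs, sin_abs))
print(single_ab_multi_ex(docs, sin_abs, ext_mds))
```

Written this way, the strategies differ only in where the abstractive model is applied: once on a selected or constructed document, twice in sequence, or once per document before an extractive merge.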
We introduce the extractive MDS methods used in previous baselines as follows. These extractive methods themselves can also be the baselines for comparison.
Lead: This baseline method takes the leading sentences one by one from a single document, or from the first document in the document collection, where documents in the collection are assumed to be ordered by name.
Coverage: It takes the first sentence one by one from the first document to the last document in the document collection.
LexRank: LexRank (Erkan and Radev, 2004) computes sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences.
TextRank: TextRank (Mihalcea and Tarau, 2004) builds a graph in which each sentence is a vertex, and the overlap between two sentences is treated as the relation that connects them. A graph-based ranking algorithm is then applied until convergence. Sentences are sorted based on their final scores, and a greedy algorithm is employed to impose a diversity penalty on each sentence and select the summary sentences.
Centroid: In centroid-based summarization (Radev et al., 2000) method, a pseudo-sentence of the document called centroid is calculated. The centroid consists of words with TF-IDF scores above a predefined threshold. The score of each sentence is defined by summing the scores based on different features including cosine similarity of sentences with the centroid, position weight and cosine similarity with the first sentence.
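The two position-based baselines, Lead and Coverage, are simple enough to sketch directly. The word budget `max_words` and the choice to stop at the first sentence that would exceed it are assumptions of this sketch, as are the function names; documents are represented as lists of sentence strings.

```python
def lead(docs, max_words):
    """Lead: take sentences in order from the first document until the budget is used."""
    summary, used = [], 0
    for sent in docs[0]:
        n = len(sent.split())
        if used + n > max_words:
            break
        summary.append(sent)
        used += n
    return summary

def coverage(docs, max_words):
    """Coverage: take the first sentence of each document, from first to last."""
    summary, used = [], 0
    for doc in docs:
        if not doc:
            continue  # skip empty documents
        n = len(doc[0].split())
        if used + n > max_words:
            break
        summary.append(doc[0])
        used += n
    return summary

docs = [["a b c", "d e"], ["f g", "h"], ["i j k l"]]
print(lead(docs, max_words=5))
print(coverage(docs, max_words=5))
```

Lead draws on a single document only, whereas Coverage touches every document in turn, which is why Coverage applies only in the multi-document setting.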

The comparison results with the baselines are presented in Table 1 and Table 2, which report R-1, R-2, R-SU4, ED and ED/w for each method. As seen from Table 1 and Table 2, selecting one document as the representative of a document set (Single-ab.) performs poorly. This indicates that considering the information of all documents is necessary for the MDS task. Generally, generating the abstractive summary for every document first and then merging these summaries with extractive MDS methods (i.e. Single-ab.+Multi-ex.) performs slightly better than constructing a pseudo single document by extractive summarization methods (i.e. Single-ex.+Merge+Single-ab.). A possible explanation is that Single-ab.+Multi-ex. keeps the integrity of each document, so that the SinABS model performs better. Similarly, Single-ab.+Merge+Single-ab. does not perform well because the constructed document is very different from a real one. Our system achieves the best performance on both datasets, since our model both keeps the integrity of all original documents and takes into consideration only the salient sentences, by ranking all original sentences in the attention mechanism.
The edit distance results verify that our method produces sentences that are quite different from original sentences, indicating the property of abstractive summarization.

Model Validation
We conduct ablation experiments to verify the effectiveness of our model. We make three extensions to the SinABS model, namely the learned weights in the document set encoder, the attention mechanism and the tuning of the model. We validate their effect with three baseline models, each of which changes one of the three parts. The differences among the three baselines are listed in Table 3. Model-1 is the simplest model without tuning, which uses a fixed weight vector $w = [1/M, \ldots, 1/M]$ and the raw attention mechanism in Tan et al. (2017). Model-2 verifies the effectiveness of concentrating the attention distribution on the 15 most salient sentences. Model-3 verifies the effect of tuning the decoder but not the document set encoder. Compared with Model-3, our model further learns different weights for different documents in the document set encoder. Results are presented in Table 4 and Table 5.

Human Evaluation
We also conduct a human evaluation to assess the linguistic quality of the generated abstractive summaries, and compare with some significant baselines. We randomly sample 10 document sets from the DUC 2002 dataset and another 10 document sets from the DUC 2004 dataset for human evaluation. Three volunteers who are fluent in English were asked to perform manual ratings on three dimensions: Coherence, Non-Redundancy (N.R. for short) and Readability. The ratings are numerical scores from 1 to 5 (not necessarily integral), with higher scores denoting better quality.
The average results are shown in Table 6. It can be observed that our system also outperforms other abstractive summarization approaches in human evaluation, achieving good coherence and readability.

Case Study
We show the abstractive summaries generated for an example from the DUC 2004 test set in Figure 3. It can be seen that the abstractive summaries generally read well, and have the potential to better convey the key information of the original documents.

Conclusion and Future Work
Abstractive Multi-Document Summarization (MDS) is still a challenging and open problem. Although sequence-to-sequence models have achieved great progress in single document summarization, their demand for a large amount of training data makes it hard to apply them to the MDS task. In this paper, we address this problem from another direction: we investigate leveraging a successful pre-trained single document summarization model for the MDS task. We propose a framework to realize this goal by adding a document set encoder into the hierarchical framework, and we propose three strategies to further improve the model performance. Experimental results demonstrate that our approach is able to achieve promising results on standard MDS datasets. Our study is still a preliminary effort towards abstractive MDS. Future work includes alleviating the requirement of a good pre-trained abstractive summarization model, designing better attention mechanisms for MDS, and investigating our approach based on other model architectures.

Lead+∆: politics , opposition leader hun sen and the prime minister were ousted <eos> in the u.s. khmer rouge , the government 's prime minister 's ruling party has had a lengthy majority of its leader in cambodia 's human rights record . <eos> of the country 's opposition party leaders and opposition members , the government have become prime minister <eos> of parliament with its prime minister , the presidency of the khmer rouge has been ruled out by the government 's leading opposition <eos> two political parties previously clashed with the government 's top two parties <eod>
∆+Text: king hun sen on tuesday praised by cambodia 's top two political parties, a coalition government led by prime minister <eos> in a short letter sent to news agencies, the king said he had received copies of fiscal and his cambodian people 's party in the government. <eos> cambodia 's leading opposition party ruled out sharing the top position in the presidency of parliament with its opposition <eos> in talks between the two party opposition bloc and the cambodian people 's party to form a new government. <eod>
Our System: opposition leader cambodian people 's party won the election. <eos> in the u.s. , they were arrested in bangkok and charged with a lengthy coup of human rights . <eos> leading opposition party , the top position in parliament with its political rights , was arrested in bangkok , insisting it would lead to the presidency of thailand 's leading government . <eos> prime minister , political parties won a three -month agreement and agreed to a coalition government . <eos> the government would not end in a new coup vote and his arrest was rejected by the parties of parliament . <eod>
Figure 3: Example of generated abstractive summary by our system.