Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization

Generating a text abstract from a set of documents remains a challenging task. The neural encoder-decoder framework has recently been exploited to summarize single documents, but its success can in part be attributed to the availability of large parallel data automatically acquired from the Web. In contrast, parallel data for multi-document summarization are scarce and costly to obtain. There is a pressing need to adapt an encoder-decoder model trained on single-document summarization data to work with multiple-document input. In this paper, we present an initial investigation into a novel adaptation method. It exploits the maximal marginal relevance method to select representative sentences from multi-document input, and leverages an abstractive encoder-decoder model to fuse disparate sentences to an abstractive summary. The adaptation method is robust and itself requires no training data. Our system compares favorably to state-of-the-art extractive and abstractive approaches judged by automatic metrics and human assessors.


Introduction
Neural abstractive summarization has primarily focused on summarizing short texts written by single authors. For example, sentence summarization seeks to reduce the first sentence of a news article to a title-like summary (Rush et al., 2015; Takase et al., 2016; Song et al., 2018); single-document summarization (SDS) focuses on condensing a news article to a handful of bullet points (Paulus et al., 2017; See et al., 2017). These summarization studies are empowered by large parallel datasets automatically harvested from online news outlets, including Gigaword (Rush et al., 2015), CNN/Daily Mail (Hermann et al., 2015), NYT (Sandhaus, 2008), and Newsroom (Grusky et al., 2018).
Table 1:
DATASET | SOURCE | SUMMARY | #PAIRS
Gigaword (Rush et al., 2015) | the first sentence of a news article | 8.3 words, title-like | 4 Million
CNN/Daily Mail (Hermann et al., 2015) | a news article | 56 words, multi-sent | 312 K
TAC (08-11) (Dang et al., 2008) | 10 news articles related to a topic | 100 words, multi-sent | 728
DUC (03-04) (Over and Yen, 2004) | 10 news articles related to a topic | 100 words, multi-sent | 320
To date, multi-document summarization (MDS) has not yet fully benefited from the development of neural encoder-decoder models. MDS seeks to condense a set of documents likely written by multiple authors to a short and informative summary. It has practical applications, such as summarizing product reviews (Gerani et al., 2014), student responses to post-class questionnaires (Luo and Litman, 2015; Luo et al., 2016), and sets of news articles discussing certain topics (Hong et al., 2014). State-of-the-art MDS systems are mostly extractive (Nenkova and McKeown, 2011). Despite their promising results, such systems cannot perform text abstraction, e.g., paraphrasing, generalization, and sentence fusion (Jing and McKeown, 1999). Further, annotated MDS datasets are often scarce, containing only hundreds of training pairs (see Table 1). The cost to create ground-truth summaries from multiple-document inputs can be prohibitive. The MDS datasets are thus too small to be used to train neural encoder-decoder models with millions of parameters without overfitting. A promising route to generating an abstractive summary from a multi-document input is to apply a neural encoder-decoder model trained for single-document summarization to a "mega-document" created by concatenating all documents in the set at test time. Nonetheless, such a model may not scale well for two reasons.
First, identifying important text pieces from a mega-document can be challenging for the encoder-decoder model, which is trained on single-document summarization data where the summary-worthy content is often contained in the first few sentences of an article. This is not the case for a mega-document. Second, redundant text pieces in a mega-document can be repeatedly used for summary generation under the current framework. The attention mechanism of an encoder-decoder model (Bahdanau et al., 2014) is position-based and lacks an awareness of semantics. If a text piece has been attended to during summary generation, it is unlikely to be used again. However, the attention value assigned to a similar text piece in a different position is not affected. The same content can thus be repeatedly used for summary generation. These issues may be alleviated by improving the encoder-decoder architecture and its attention mechanism (Cheng and Lapata, 2016; Tan et al., 2017). However, in these cases the model has to be re-trained on large-scale MDS datasets that are not available at the current stage. There is thus an increasing need for a lightweight adaptation of an encoder-decoder model trained on SDS datasets to work with multi-document inputs at test time.
In this paper, we present a novel adaptation method, named PG-MMR, to generate abstracts from multi-document inputs. The method is robust and requires no MDS training data. It combines a recent neural encoder-decoder model (PG for Pointer-Generator networks; See et al., 2017) that generates abstractive summaries from single-document inputs with a strong extractive summarization algorithm (MMR for Maximal Marginal Relevance; Carbonell and Goldstein, 1998) that identifies important source sentences from multi-document inputs. The PG-MMR algorithm iteratively performs the following. It identifies a handful of the most important sentences from the mega-document. The attention weights of the PG model are directly modified to focus on these important sentences when generating a summary sentence. Next, the system re-identifies a number of important sentences, but the likelihood of choosing certain sentences is reduced based on their similarity to the partially-generated summary, thereby reducing redundancy. Our research contributions include the following:
• we present an investigation into a novel adaptation method of the encoder-decoder framework from single- to multi-document summarization. To the best of our knowledge, this is the first attempt to couple the maximal marginal relevance algorithm with pointer-generator networks for multi-document summarization;
• we demonstrate the effectiveness of the proposed method through extensive experiments on standard MDS datasets. Our system compares favorably to state-of-the-art extractive and abstractive summarization systems measured by both automatic metrics and human judgments.

Related Work
Popular methods for multi-document summarization have been extractive. Important sentences are extracted from a set of source documents and optionally compressed to form a summary (Daume III and Marcu, 2002; Zajic et al., 2007; Galanis and Androutsopoulos, 2010; Berg-Kirkpatrick et al., 2011; Li et al., 2013; Thadani and McKeown, 2013; Wang et al., 2013; Yogatama et al., 2015; Filippova et al., 2015; Durrett et al., 2016). In recent years neural networks have been exploited to learn word/sentence representations for single- and multi-document summarization (Cheng and Lapata, 2016; Cao et al., 2017; Isonuma et al., 2017; Yasunaga et al., 2017; Narayan et al., 2018). These approaches remain extractive; and despite encouraging results, summarizing a large quantity of texts still requires sophisticated abstraction capabilities such as generalization, paraphrasing and sentence fusion. Prior to deep learning, abstractive summarization has been investigated (Barzilay et al., 1999; Carenini and Cheung, 2008; Ganesan et al., 2010; Gerani et al., 2014; Fabbrizio et al., 2014; Pighin et al., 2014; Bing et al., 2015; Liao et al., 2018). These approaches construct domain templates using a text planner or an open-IE system and employ a natural language generator for surface realization. Limited by the availability of labelled data, experiments are often performed on small domain-specific datasets.
Neural abstractive summarization utilizing the encoder-decoder architecture has shown promising results but studies focus primarily on single-document summarization (Kikuchi et al., 2016; Chen et al., 2016; Miao and Blunsom, 2016; Tan et al., 2017; Zeng et al., 2017; Zhou et al., 2017; Paulus et al., 2017; See et al., 2017; Gehrmann et al., 2018). The pointing mechanism (Gu et al., 2016) allows a summarization system to both copy words from the source text and generate new words from the vocabulary. Reinforcement learning is exploited to directly optimize evaluation metrics (Paulus et al., 2017; Kryściński et al., 2018; Chen and Bansal, 2018). These studies focus on summarizing single documents in part because the training data are abundant.
The works of Baumel et al. (2018) and Zhang et al. (2018) are related to ours. In particular, Baumel et al. (2018) propose to extend an abstractive summarization system to generate query-focused summaries; Zhang et al. (2018) add a document set encoder to their hierarchical summarization framework. With these few exceptions, little research has been dedicated to investigating the feasibility of extending the encoder-decoder framework to generate abstractive summaries from multi-document inputs, where available training data are scarce.
This paper presents some first steps towards the goal of extending the encoder-decoder model to a multi-document setting. We introduce an adaptation method combining the pointer-generator (PG) networks (See et al., 2017) and the maximal marginal relevance (MMR) algorithm (Carbonell and Goldstein, 1998). The PG model, trained on SDS data and detailed in §3, is capable of generating document abstracts by performing text abstraction and sentence fusion. However, when the model is applied at test time to summarize multi-document inputs, it faces limitations. Our PG-MMR algorithm, presented in §4, guides the PG model to effectively recognize important content from the input documents, hence improving the quality of abstractive summaries, all without requiring any training on multi-document inputs.

Limits of the Encoder-Decoder Model
The encoder-decoder architecture has become the de facto standard for neural abstractive summarization (Rush et al., 2015). The encoder is often a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) converting the input text to a set of hidden states $\{h^e_i\}$, one for each input word, indexed by $i$. The decoder is a unidirectional LSTM that generates a summary by predicting one word at a time. The decoder hidden states are represented by $\{h^d_t\}$, indexed by $t$. For sentence and single-document summarization (Paulus et al., 2017; See et al., 2017), the input text is treated as a sequence of words, and the model is expected to capture the source syntax inherently.
The attention weight $\alpha_{t,i}$ measures how important the $i$-th input word is to generating the $t$-th output word (Eq. (1-2)). Following See et al. (2017), $\alpha_{t,i}$ is calculated by measuring the strength of interaction between the decoder hidden state $h^d_t$, the encoder hidden state $h^e_i$, and the cumulative attention $\tilde{\alpha}_{t,i}$ (Eq. (3)). $\tilde{\alpha}_{t,i}$ denotes the cumulative attention that the $i$-th input word receives up to time step $t-1$. A large value of $\tilde{\alpha}_{t,i}$ indicates that the $i$-th input word has been used prior to time $t$ and is unlikely to be used again for generating the $t$-th output word.
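For reference, Eq. (1-3) can be written out in the coverage-based attention form of See et al. (2017). This is a hedged reconstruction rather than the authors' typesetting: the parameter names $v$, $W_e$, $b_e$, the tilde on the cumulative attention, and the equation numbering are assumptions chosen to agree with the description above.

```latex
\begin{align}
e_{t,i} &= v^{\top} \tanh\big(W_e\,[h^d_t \,\|\, h^e_i \,\|\, \tilde{\alpha}_{t,i}] + b_e\big) \tag{1}\\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{i'} \exp(e_{t,i'})} \tag{2}\\
\tilde{\alpha}_{t,i} &= \sum_{t'=0}^{t-1} \alpha_{t',i} \tag{3}
\end{align}
```

Here $[\cdot \,\|\, \cdot]$ denotes vector concatenation, and the softmax in (2) runs over all input positions $i'$.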
A context vector ($c_t$) is constructed (Eq. (4)) to summarize the semantic meaning of the input; it is a weighted sum of the encoder hidden states. The context vector and the decoder hidden state ($[h^d_t \| c_t]$) are then used to compute the vocabulary probability $P_{vcb}(w)$ measuring the likelihood of a vocabulary word $w$ being selected as the $t$-th output word (Eq. (5)). In many encoder-decoder models, a "switch" is estimated ($p_{gen} \in [0,1]$) to indicate whether the system has chosen to select a word from the vocabulary or to copy a word from the input text (Eq. (6)). The switch is computed using a feedforward layer with $\sigma$ activation over $[h^d_t \| c_t \| y_{t-1}]$, where $y_{t-1}$ is the embedding of the output word at time $t-1$. The attention weights ($\alpha_{t,i}$) are used to compute the copy probability (Eq. (7)). If a word $w$ appears once or more in the input text, its copy probability ($\sum_{i: w_i = w} \alpha_{t,i}$) is the sum of the attention weights over all its occurrences. The final probability $P(w)$ is a weighted combination of the vocabulary probability and the copy probability. A cross-entropy loss function can often be used to train the model end-to-end.
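The mixing of vocabulary and copy probabilities can be illustrated with a small sketch. It assumes the usual pointer-generator form, in which the switch weights the vocabulary probability and its complement weights the copy probability; the function name `final_distribution` and the toy inputs are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def final_distribution(p_gen, p_vocab, attention, source_words):
    """Combine generation and copy probabilities for one decoding step.

    p_gen        : float in [0, 1], the "switch"
    p_vocab      : dict mapping word -> vocabulary probability (sums to 1)
    attention    : list of attention weights, one per source word (sums to 1)
    source_words : list of source words aligned with `attention`
    """
    # Copy probability of a word: sum of attention over all its occurrences.
    copy_prob = defaultdict(float)
    for word, weight in zip(source_words, attention):
        copy_prob[word] += weight

    # Assumed mixing rule: p_gen * P_vcb(w) + (1 - p_gen) * copy(w).
    words = set(p_vocab) | set(copy_prob)
    return {
        w: p_gen * p_vocab.get(w, 0.0) + (1.0 - p_gen) * copy_prob.get(w, 0.0)
        for w in words
    }


# Toy example: "sulawesi" is out of vocabulary and can only be copied.
dist = final_distribution(
    p_gen=0.6,
    p_vocab={"the": 0.5, "plane": 0.3, "crashed": 0.2},
    attention=[0.4, 0.2, 0.3, 0.1],
    source_words=["plane", "crashed", "plane", "sulawesi"],
)
print(sorted(dist.items(), key=lambda kv: -kv[1]))
```

Because "plane" occurs twice in the toy source, its copy probability pools the attention from both positions, which is the behavior described for words appearing more than once.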
To thoroughly understand the aforementioned encoder-decoder model, we divide its model parameters into four groups. They include
• parameters of the encoder and the decoder;
• $\{w_z, b_z\}$ for calculating the "switch" (Eq. (6));
By training the encoder-decoder model on single-document summarization (SDS) data containing a large collection of news articles paired with summaries (Hermann et al., 2015), these model parameters can be effectively learned.
Figure 1: System framework. The PG-MMR system uses K highest-scored source sentences (in this case, K=2) to guide the PG model to generate a summary sentence. All other source sentences are "muted" in this process. Best viewed in color.
However, at test time, we wish for the model to generate abstractive summaries from multidocument inputs. This brings up two issues. First, the parameters are ineffective at identifying salient content from multi-document inputs. Humans are very good at identifying representative sentences from a set of documents and fusing them into an abstract. However, this capability is not supported by the encoder-decoder model. Second, the attention mechanism is based on input word positions but not their semantics. It can lead to redundant content in the multi-document input being repeatedly used for summary generation. We conjecture that both aspects can be addressed by introducing an "external" model that selects representative sentences from multi-document inputs and dynamically adjusts the sentence importance to reduce summary redundancy. This external model is integrated with the encoder-decoder model to generate abstractive summaries using selected representative sentences. In the following section we present our adaptation method for multi-document summarization.

Our Method
Maximal marginal relevance. Our adaptation method incorporates the maximal marginal relevance algorithm (MMR; Carbonell and Goldstein, 1998) into pointer-generator networks (PG; See et al., 2017) by adjusting the network's attention values. MMR is one of the most successful extractive approaches and, despite its straightforwardness, performs on par with state-of-the-art systems (Luo and Litman, 2015; Yogatama et al., 2015). At each iteration, MMR selects one sentence from the document ($D$) and includes it in the summary ($S$) until a length threshold is reached. The selected sentence ($s_i$) is the most important one amongst the remaining sentences and it has the least content overlap with the current summary. In the equation below, $\mathrm{Sim}_1(s_i, D)$ measures the similarity of the sentence $s_i$ to the document. It serves as a proxy of sentence importance, since important sentences usually show similarity to the centroid of the document. $\max_{s_j \in S} \mathrm{Sim}_2(s_i, s_j)$ measures the maximum similarity of the sentence $s_i$ to each of the summary sentences, acting as a proxy of redundancy. $\lambda$ is a balancing factor.
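For reference, the MMR selection criterion of Carbonell and Goldstein (1998) can be written in the notation of this section as follows. This is a reconstruction of the standard form, not a quotation of the authors' typesetting.

```latex
\operatorname*{argmax}_{s_i \in D \setminus S}
\Big[ \lambda\, \mathrm{Sim}_1(s_i, D)
      \;-\; (1-\lambda) \max_{s_j \in S} \mathrm{Sim}_2(s_i, s_j) \Big]
```

The first term rewards importance and the second penalizes redundancy. With $\lambda = 1$ the criterion reduces to ranking by importance alone.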
Our PG-MMR describes an iterative framework for summarizing a multi-document input to a summary consisting of multiple sentences. At each iteration, PG-MMR follows the MMR principle to select the K highest-scored source sentences; they serve as the basis for PG to generate a summary sentence. After that, the scores of all source sentences are updated based on their importance and redundancy. Sentences that are highly similar to the partial summary receive lower scores. Selecting K sentences via the MMR algorithm helps the PG system to effectively identify salient source content that has not been included in the summary.
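The iterative select-generate-rescore loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable stands in for the PG decoder, `word_overlap` is a hypothetical redundancy measure used only to keep the example self-contained, and the values of `k`, `lam`, and `max_sentences` are arbitrary.

```python
def word_overlap(sentence, summary_sentences):
    """Hypothetical redundancy proxy: share of sentence words already in the summary."""
    words = sentence.split()
    seen = {w for s in summary_sentences for w in s.split()}
    return sum(w in seen for w in words) / len(words) if words else 0.0


def pg_mmr_loop(sentences, importance, redundancy, generate,
                k=2, lam=0.5, max_sentences=3):
    """Select the K best source sentences, generate from them, then rescore.

    importance : list of floats in [0, 1], one per source sentence
    redundancy : callable (sentence, summary_sentences) -> float in [0, 1]
    generate   : callable (selected_sentences) -> one summary sentence
                 (a stand-in for the pointer-generator decoder)
    """
    # The summary is empty at first, so no redundancy penalty applies yet.
    scores = [lam * imp for imp in importance]
    summary = []
    for _ in range(max_sentences):
        ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
        selected = [sentences[i] for i in ranked[:k]]
        summary.append(generate(selected))
        # Rescore every source sentence against the partial summary.
        scores = [
            lam * importance[i] - (1.0 - lam) * redundancy(sentences[i], summary)
            for i in range(len(sentences))
        ]
    return summary


# Toy run: the "decoder" simply echoes the top-ranked sentence.
source = ["a b c", "a b d", "x y z"]
result = pg_mmr_loop(source, [0.9, 0.8, 0.5], word_overlap,
                     generate=lambda sel: sel[0], k=1, max_sentences=2)
print(result)
```

In the toy run the second source sentence has the second-highest importance, yet it is skipped in the second iteration because it overlaps heavily with the sentence already in the summary.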
Muting. To allow the PG system to effectively utilize the K source sentences without retraining the neural model, we dynamically adjust the PG attention weights ($\alpha_{t,i}$) at test time. Let $S_k$ represent a selected sentence. The attention weights of the words belonging to $\{S_k\}_{k=1}^{K}$ are calculated as before (Eq. (2)). However, words in other sentences are forced to receive zero attention weights ($\alpha_{t,i}=0$), and all $\alpha_{t,i}$ are renormalized (Eq. (8)). This means that the remaining sentences are "muted" in this process. In this variant, the sentence importance does not affect the original attention weights, other than muting.
In an alternative setting, the sentence salience is multiplied with the word salience and renormalized (Eq. (9)). PG uses the reweighted attention values to predict the next summary word.
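Both reweighting variants can be sketched in a few lines. The code assumes that muting zeroes the attention of words outside the selected sentences, that the alternative setting scales each surviving weight by a per-sentence salience score, and that both then renormalize to sum to one. The function name, argument layout, and toy values are hypothetical.

```python
def reweight_attention(attention, sentence_ids, selected, sentence_scores=None):
    """Mute words outside the selected sentences and renormalize.

    attention       : list of attention weights, one per source word
    sentence_ids    : list giving the source-sentence index of each word
    selected        : set of indices of the K selected sentences
    sentence_scores : optional dict sentence index -> salience score; when
                      given, each surviving weight is scaled by its
                      sentence's score before renormalizing
    """
    reweighted = []
    for weight, sid in zip(attention, sentence_ids):
        if sid not in selected:
            reweighted.append(0.0)            # muted word
        elif sentence_scores is None:
            reweighted.append(weight)         # muting-only variant
        else:
            reweighted.append(weight * sentence_scores[sid])
    total = sum(reweighted)
    if total == 0.0:
        return reweighted                     # nothing left to normalize
    return [w / total for w in reweighted]


attention = [0.1, 0.2, 0.3, 0.4]
sentence_ids = [0, 0, 1, 2]                   # words 0-1 belong to sentence 0, etc.
muted = reweight_attention(attention, sentence_ids, selected={0, 2})
scaled = reweight_attention(attention, sentence_ids, selected={0, 2},
                            sentence_scores={0: 0.5, 2: 1.0})
print(muted)
print(scaled)
```

In both calls the word from the unselected sentence 1 receives zero attention. In the second call the word from the higher-scored sentence 2 takes a larger share of the attention mass than under muting alone.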
Sentence Importance. To estimate sentence importance $\mathrm{Sim}_1(s_i, D)$, we introduce a supervised regression model in this work. Importantly, the model is trained on single-document summarization datasets where training data are abundant. At test time, the model can be applied to identify important sentences from multi-document input. Our model determines sentence importance based on four indicators, inspired by how humans identify important sentences from a document set. They include (a) sentence length, (b) its absolute and relative position in the document, (c) sentence quality, and (d) how close the sentence is to the main topic of the document set. These features are considered to be important indicators in previous extractive summarization frameworks (Galanis and Androutsopoulos, 2010; Hong et al., 2014).
Regarding the sentence quality (c), we leverage the PG model to build the sentence representation. We use the bidirectional LSTM encoder to encode any source sentence to a vector representation, which is the concatenation of the last hidden states of the forward and backward passes. A document vector is the average of all sentence vectors. We use the document vector and the cosine similarity between the document and sentence vectors as indicator (d). A support vector regression model is trained on (sentence, score) pairs where the training data are obtained from the CNN/Daily Mail dataset. The target importance score is the ROUGE-L recall of the sentence compared to the ground-truth summary. Our model architecture leverages neural representations of sentences and documents; they are data-driven and not restricted to a particular domain.
Algorithm 1 (steps that do not survive in the source are marked "[...]"):
  I(S_i) and R(S_i) are the importance and redundancy scores of the source sentence S_i
  I(S_i) ← SVR(S_i) for all source sentences
  MMR(S_i) ← λ I(S_i) for all source sentences
  Summary ← {}
  t ← index of summary words
  while t < L_max do
    [...]
    if w_t is the period symbol then
      [...]
    end if
  end while
Sentence Redundancy. To calculate the redundancy of the sentence ($\max_{s_j \in S} \mathrm{Sim}_2(s_i, s_j)$), we compute the ROUGE-L precision, which measures the longest common subsequence between a source sentence and the partial summary (consisting of all sentences generated thus far by the PG model), divided by the length of the source sentence. A source sentence yielding a high ROUGE-L precision is deemed to have significant content overlap with the partial summary. It will receive a low MMR score and hence is less likely to serve as the basis for generating future summary sentences.
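The two ROUGE-L quantities used in this section, recall as the regression target for importance and precision as the redundancy score, both rest on the longest common subsequence (LCS). The sketch below treats each text as a single whitespace-tokenized sequence, which is a simplifying assumption: the official ROUGE toolkit handles multi-sentence references differently. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(source_sentence, partial_summary):
    """LCS with the partial summary, divided by the source sentence length."""
    src, summ = source_sentence.split(), partial_summary.split()
    return lcs_length(src, summ) / len(src) if src else 0.0


def rouge_l_recall(sentence, reference_summary):
    """LCS with the reference summary, divided by the reference length."""
    sent, ref = sentence.split(), reference_summary.split()
    return lcs_length(sent, ref) / len(ref) if ref else 0.0


partial = "the plane crashed in a mountainous region"
print(rouge_l_precision("the plane crashed", partial))          # fully covered
print(rouge_l_precision("a ferry sank off the coast", partial)) # mostly new
```

A source sentence whose tokens all appear, in order, in the partial summary gets a precision of 1.0 and would therefore be penalized most heavily when MMR scores are updated.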
Alg. 1 provides an overview of the PG-MMR algorithm and Fig. 1 is a graphical illustration. The MMR scores of source sentences are updated after each summary sentence is generated by the PG model. Next, a different set of highest-scored sentences is used to guide the PG model to generate the next summary sentence. "Muting" the remaining source sentences is important because it helps the PG model to focus its attention on the most significant source content. The code for our model is publicly available to further MDS research.

Experimental Setup
Datasets. We investigate the effectiveness of the PG-MMR method by testing it on standard multi-document summarization datasets (Over and Yen, 2004; Dang and Owczarzak, 2008). These include DUC-03, DUC-04, TAC-08, TAC-10, and TAC-11, containing 30/50/48/46/44 topics respectively. The summarization system is tasked with generating a concise, fluent summary of 100 words or fewer from a set of 10 documents discussing a topic. All documents in a set are chronologically ordered and concatenated to form a mega-document serving as input to the PG-MMR system. Sentences that start with a quotation mark or do not end with a period are excluded (Wong et al., 2008). Each system summary is compared against 4 human abstracts created by NIST assessors. Following convention, we report results on DUC-04 and TAC-11 datasets, which are standard test sets; DUC-03 and TAC-08/10 are used as a validation set for hyperparameter tuning.
The PG model is trained for single-document summarization using the CNN/Daily Mail (Hermann et al., 2015) dataset, containing single news articles paired with summaries (human-written article highlights). The training set contains 287,226 articles. An article contains 781 tokens on average; and a summary contains 56 tokens (3.75 sentences). During training we use the hyperparameters provided by See et al. (2017). At test time, the maximum/minimum decoding steps are set to 120/100 words respectively, corresponding to the max/min lengths of the PG-MMR summaries. Because the focus of this work is on multi-document summarization (MDS), we do not report results for the CNN/Daily Mail dataset.
Baselines. We compare PG-MMR against a broad spectrum of baselines, including state-of-the-art extractive ('ext-') and abstractive ('abs-') systems. They are described below.
• ext-SumBasic (Vanderwende et al., 2007) is an extractive approach assuming words occurring frequently in a document set are more likely to be included in the summary;
• ext-KL-Sum (Haghighi and Vanderwende, 2009) greedily adds source sentences to the summary if doing so leads to a decrease in KL divergence;
• ext-LexRank (Erkan and Radev, 2004) uses a graph-based approach to compute sentence importance based on eigenvector centrality in a graph representation;
• ext-Centroid (Hong et al., 2014) computes the importance of each source sentence based on its cosine similarity with the document centroid;
• ext-ICSISumm leverages the ILP framework to identify a globally-optimal set of sentences covering the most important concepts in the document set;
• ext-DPP (Taskar, 2012) selects an optimal set of sentences per the determinantal point processes that balance the coverage of important information and the sentence diversity;
• abs-Opinosis (Ganesan et al., 2010) generates abstractive summaries by searching for salient paths on a word co-occurrence graph created from source documents;
• abs-Extract+Rewrite (Song et al., 2018) is a recent approach that scores sentences using LexRank and generates a title-like summary for each sentence using an encoder-decoder model trained on Gigaword data;
• abs-PG-Original (See et al., 2017) introduces an encoder-decoder model that encourages the system to copy words from the source text via pointing, while retaining the ability to produce novel words through the generator.
Table 2: ROUGE results on the DUC-04 dataset.
System | R-1 | R-2 | R-SU4
SumBasic (Vanderwende et al., 2007) | 29.48 | 4.25 | 8.64
KLSumm (Haghighi et al., 2009) | 31.04 | 6.03 | 10.23
LexRank (Erkan and Radev, 2004) | 34.44 | 7.11 | 11.19
Centroid (Hong et al., 2014) | 35.49 | 7.80 | 12.02
ICSISumm | 37.31 | 9.36 | 13.12
DPP (Taskar, 2012) | 38.78 | 9.47 | 13.36
Extract+Rewrite (Song et al., 2018) | 28.90 | 5.33 | 8.76
Opinosis (Ganesan et al., 2010) | 27.07 | 5.03 | 8.63
PG-Original (See et al., 2017) | 31[...] | [...] | [...]

Results
Having described the experimental setup, we next compare the PG-MMR method against the baselines on standard MDS datasets, evaluated by both automatic metrics and human assessors.
ROUGE (Lin, 2004). This automatic metric measures the overlap of unigrams (R-1), bigrams (R-2) and skip bigrams with a maximum distance of 4 words (R-SU4) between the system summary and a set of reference summaries. ROUGE scores of various systems are presented in Tables 2 and 3 for the DUC-04 and TAC-11 datasets, respectively. We explore variants of the PG-MMR method. They differ in how the importance of source sentences is estimated and how the sentence importance affects word attention weights. "w/ Cosine" computes the sentence importance as the cosine similarity score between the sentence and document vectors, both represented as sparse TF-IDF vectors under the vector space model. "w/ SummRec" estimates the sentence importance as the predicted R-L recall score between the sentence and the summary. A support vector regression model is trained on sentences from the CNN/Daily Mail datasets (≈33K) and applied to DUC/TAC sentences at test time (see §4). "w/ BestSummRec" obtains the best estimate of sentence importance by calculating the R-L recall score between the sentence and reference summaries. It serves as an upper bound for the performance of "w/ SummRec." For all variants, the sentence importance scores are normalized to the range of [0,1]. "w/ SentAttn" adjusts the attention weights using Eq. (9), so that words in important sentences are more likely to be used to generate the summary. The weights are otherwise computed using Eq. (8).
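The "w/ Cosine" variant can be sketched with sparse TF-IDF vectors held in plain dictionaries. Several details are assumptions made for illustration and are not stated in the text: whitespace tokenization, the smoothed IDF formula, building the document vector from term counts over all sentences, and min-max scaling as the normalization to [0,1].

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_importance(sentences):
    """Score each sentence by TF-IDF cosine similarity to the whole input."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for sent in tokens for term in set(sent))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}  # assumed IDF form

    sent_vecs = [{t: tf * idf[t] for t, tf in Counter(sent).items()} for sent in tokens]
    doc_counts = Counter(term for sent in tokens for term in sent)
    doc_vec = {t: tf * idf[t] for t, tf in doc_counts.items()}

    raw = [cosine(vec, doc_vec) for vec in sent_vecs]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [1.0] * len(raw)               # all sentences equally important
    return [(r - lo) / (hi - lo) for r in raw]  # assumed min-max scaling


scores = cosine_importance([
    "plane crashed in indonesia",
    "plane crashed near sulawesi in indonesia",
    "ferry sank",
])
print(scores)
```

In this toy input the second sentence shares the most weighted vocabulary with the document as a whole and receives the top score, while the off-topic third sentence is scaled to zero.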
As seen in Tables 2 and 3, our PG-MMR method surpasses all unsupervised extractive baselines, including SumBasic, KLSumm, and LexRank. On the DUC-04 dataset, ICSISumm and DPP show good performance, but these systems are trained directly on MDS datasets, which are not utilized by the PG-MMR method. PG-MMR exhibits superior performance compared to existing abstractive systems. It outperforms Opinosis and PG-Original by a large margin in terms of R-2 F-scores (Opinosis/PG-Original/PG-MMR: 5.03/6.03/8.73 for DUC-04 and 5.12/6.40/10.92 for TAC-11). In particular, PG-Original is the original pointer-generator network applied to multi-document inputs at test time. Compared to it, PG-MMR is more effective at identifying summary-worthy content from the input. "w/ Cosine" is used as the default PG-MMR and it shows better results than "w/ SummRec." This suggests that the sentence and document representations obtained from the encoder-decoder model (trained on CNN/DM) are suboptimal, possibly due to a vocabulary mismatch, where certain words in the DUC/TAC datasets do not appear in CNN/DM and their embeddings are thus not learned during training. Finally, we observe that "w/ BestSummRec" yields the highest performance on both datasets. This finding suggests that there is great potential for improving the PG-MMR method, as its "extractive" and "abstractive" components can be separately optimized.
Location of summary content. We are interested in understanding why PG-MMR outperforms PG-Original at identifying summary content from the multi-document input. We ask the question: where, in the source documents, does each system tend to look when generating its summaries? Our findings indicate that PG-Original gravitates towards early source sentences, while PG-MMR searches beyond the first few sentences.
In Figure 2 we show the median location of the first occurrences of summary n-grams, where the n-grams can come from the 1st to 5th summary sentence. For PG-Original summaries, n-grams of the 1st summary sentence frequently come from the 1st and 2nd source sentences, corresponding to the lower/higher quartiles of source sentence indices. Similarly, n-grams of the 2nd summary sentence come from the 2nd to 7th source sentences. For PG-MMR summaries, the patterns are different. The n-grams of the 1st and 2nd summary sentences come from source sentences of the range (2, 44) and (6, 53), respectively. Our findings suggest that PG-Original tends to treat the input as a single document and identifies summary-worthy content from the beginning of the input, whereas PG-MMR can successfully search a broader range of the input for summary content. This capability is crucial for multi-document input where important content can come from any article in the set.

Table 6: Example summaries.

Human Abstract
• The plane was Adam Air flight KI-574, departing at 12:59 pm from Surabaya on Java bound for Manado in northeast Sulawesi.
• The plane crashed in a mountainous region in Polewali, west Sulawesi province.
• There were three Americans on board, it is not know if they survived.
• The cause of the crash is not known at this time but it is possible bad weather was a factor.

Extract+Rewrite Summary
• Plane with 102 people on board crashes.
• Three Americans among 102 on board plane in Indonesia.
• Rescue team arrives in Indonesia after plane crash.
• Plane with 102 crashes in West Sulawesi, killing at least 90.
• No word on the fate of Boeing 737-400.
• Plane carrying 96 passengers loses contact with Makassar.
• Plane crashes in Indonesia , killing at least 90.
• Indonesian navy sends two planes to carry bodies of five.
• Indonesian lawmaker criticises slow deployment of plane.
• Hundreds of kilometers plane crash.

PG-Original Summary
• Adam Air Boeing 737-400 crashed Monday after vanishing off air traffic control radar screens between the Indonesian islands of Java and Sulawesi.
• Up to 12 people were thought to have survived, with rescue teams racing to the crash site near Polewali in West Sulawesi , some 180 kilometres north of the South Sulawesi provincial capital Makassar.
• It was the worst air disaster since Sept. 5, 2005, when a Mandala Airline's Boeing 737-200 crashed shortly after taking off from the North Sumatra's airport, killing 103 people.
• Earlier on Friday, a ferry carrying 628 people sank off the Java coast.

PG-MMR Summary
• The Adam Air Boeing 737-400 crashed Monday afternoon, but search and rescue teams only discovered the wreckage early Tuesday.
• The Indonesian rescue team arrived at the mountainous area in West Sulawesi province where a passenger plane with 102 people onboard crashed into a mountain in Polewali, West Sulawesi province.
• Air force rear commander Eddy Suyanto told-Shinta radio station that the plane -operated by local carrier Adam Air -had crashed in a mountainous region in Polewali province on Monday.
• There was no word on the fate of the remaining 12 people on board the boeing 737-400.

Degree of extractiveness. We examine the percentages of summary n-grams (or entire sentences) appearing in the multi-document input. PG-Original and PG-MMR summaries both show a high degree of extractiveness, and similar findings have been revealed by See et al. (2017). Because PG-MMR relies on a handful of representative source sentences and mutes the rest, it appears to be marginally more extractive than PG-Original. Both systems encourage generating summary sentences by stitching together source sentences, as about 52% and 41% of the summary sentences do not appear in the source, but about 90% of the n-grams do. The Extract+Rewrite summaries (§5), generated by rewriting selected source sentences to title-like summary sentences, exhibit a high degree of abstraction, close to that of human abstracts.
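The extractiveness measure discussed here, the share of summary n-grams or whole sentences that also occur in the source, can be sketched as follows. Lowercasing and whitespace tokenization are assumptions made for illustration, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(summary, source, n):
    """Percentage of summary n-grams that also occur in the source."""
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source.lower().split(), n))
    hits = sum(gram in source_grams for gram in summary_grams)
    return 100.0 * hits / len(summary_grams)


def sentence_in_source(sentence, source):
    """True if the whole sentence occurs as a contiguous token span in the source."""
    tokens = sentence.lower().split()
    if not tokens:
        return False
    return tuple(tokens) in set(ngrams(source.lower().split(), len(tokens)))


source = ("the plane crashed in a mountainous region . "
          "rescue teams arrived on tuesday .")
summary = "the plane crashed on tuesday"
for n in (1, 2):
    print(n, ngram_overlap(summary, source, n))
print(sentence_in_source(summary, source))
```

The toy summary stitches together two source fragments, so every unigram and most bigrams are found in the source while the sentence as a whole is not, which mirrors the pattern reported for the PG systems.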
Linguistic quality. To assess the linguistic quality of various system summaries, we employ Amazon Mechanical Turk human evaluators to judge the quality of summaries from four systems: PG-MMR, LexRank, PG-Original, and Extract+Rewrite. A turker is asked to rate each system summary on a scale of 1 (worst) to 5 (best) based on three evaluation criteria: informativeness (to what extent is the meaning expressed in the ground-truth text preserved in the summary?), fluency (is the summary grammatical and well-formed?), and non-redundancy (does the summary successfully avoid repeating information?). Human summaries are used as the ground-truth. The turkers are also asked to provide an overall ranking for the four system summaries. Results are presented in Table 5. We observe that the LexRank summaries are highest-rated on fluency. This is because LexRank is an extractive approach, where summary sentences are directly taken from the input. PG-MMR is rated as the best on both informativeness and non-redundancy. Regarding overall system rankings, PG-MMR summaries are frequently ranked as the 1st- and 2nd-best summaries, outperforming the others.
Example summaries. In Table 6 we present example summaries generated by various systems. PG-Original cannot effectively identify important content from the multi-document input. Extract+Rewrite tends to generate short, title-like sentences that are less informative and carry substantial redundancy. This is because the system is trained on the Gigaword dataset (Rush et al., 2015) where the target summary length is 7 words. PG-MMR generates summaries that effectively condense the important source content.

Conclusion
We describe a novel adaptation method to generate abstractive summaries from multi-document inputs. Our method combines an extractive summarization algorithm (MMR) for sentence extraction and a recent abstractive model (PG) for fusing source sentences. The PG-MMR system demonstrates competitive results, outperforming strong extractive and abstractive baselines.