Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

We introduce “extreme summarization”, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question “What is the article about?”. We collect a real-world, large-scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC). We propose a novel abstractive model which is conditioned on the article’s topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures long-range dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans.

SUMMARY: A man and a child have been killed after a light aircraft made an emergency landing on a beach in Portugal.DOCUMENT: Authorities said the incident took place on Sao Joao beach in Caparica, south-west of Lisbon.The National Maritime Authority said a middleaged man and a young girl died after they were unable to avoid the plane.
[6 sentences with 139 words are abbreviated from here.]Other reports said the victims had been sunbathing when the plane made its emergency landing.
[Another 4 sentences with 67 words are abbreviated from here.]Video footage from the scene carried by local broadcasters showed a small recreational plane parked on the sand, apparently intact and surrounded by beachgoers and emergency workers.
[Last 2 sentences with 19 words are abbreviated.]Figure 1: An abridged example from our extreme summarization dataset showing the document and its oneline summary.Document content present in the summary is color-coded.
Neural approaches to NLP and their ability to learn continuous features without recourse to pre-processing tools or linguistic annotations have driven the development of large-scale document summarization datasets (Sandhaus, 2008;Hermann et al., 2015;Grusky et al., 2018).However, these datasets often favor extractive models which create a summary by identifying (and subsequently concatenating) the most important sentences in a document (Cheng and Lapata, 2016;Nallapati et al., 2017;Narayan et al., 2018b).Abstractive approaches, despite being more faithful to the actual summarization task, either lag behind extractive ones or are mostly extractive, exhibiting a small degree of abstraction (See et al., 2017;Tan and Wan, 2017;Paulus et al., 2018;Pasunuru and Bansal, 2018;Celikyilmaz et al., 2018).
In this paper we introduce extreme summarization, a new single-document summarization task which is not amenable to extractive strategies and requires an abstractive modeling approach.The idea is to create a short, one-sentence news summary answering the question "What is the article about?".An example of a document and its extreme summary are shown in Figure 1.As can be seen, the summary is very different from a headline whose aim is to encourage readers to read the story; it draws on information interspersed in various parts of the document (not only the beginning) and displays multiple levels of abstraction including paraphrasing, fusion, synthesis, and inference.We build a dataset for the proposed task by harvesting online articles from the British Broadcasting Corporation (BBC) that often include a firstsentence summary.
We further propose a novel deep learning model which we argue is well-suited to the extreme summarization task.Unlike most existing abstractive approaches (Rush et al., 2015;Chen et al., 2016;Nallapati et al., 2016;See et al., 2017;Tan and Wan, 2017;Paulus et al., 2018;Pasunuru and Bansal, 2018;Celikyilmaz et al., 2018) which rely on an encoder-decoder architecture modeled by recurrent neural networks (RNNs), we present a topic-conditioned neural model which is based entirely on convolutional neural networks (Gehring et al., 2017b).Convolution layers capture long-range dependencies between words in the document more effectively compared to RNNs, allowing to perform document-level inference, abstraction, and paraphrasing.Our convolutional encoder associates each word with a topic vector capturing whether it is representative of the document's content, while our convolutional decoder conditions each word prediction on a document topic vector.
Experimental results show that when evaluated automatically (in terms of ROUGE) our topicaware convolutional model outperforms an oracle extractive system and state-of-the-art RNN-based abstractive systems.We also conduct two human evaluations in order to assess (a) which type of summary participants prefer and (b) how much key information from the document is preserved in the summary.Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete.Our contributions in this work are three-fold: a new single document summarization dataset that encourages the development of abstractive systems; corroborated by analysis and empirical results showing that extractive approaches are not well-suited to the extreme summarization task; and a novel topicaware convolutional sequence-to-sequence model for abstractive summarization.

The XSum Dataset
Our extreme summarization dataset (which we call XSum) consists of BBC articles and accompanying single sentence summaries.Specifically, each article is prefaced with an introductory sentence (aka summary) which is professionally written, typically by the author of the article.The summary bears the HTML class "storybody introduction," and can be easily identified and extracted from the main text body (see Figure 1 for an example summary-article pair).
We followed the methodology proposed in Hermann et al. (2015) to create a large-scale dataset for extreme summarization.Specifically, we collected 226,711 Wayback archived BBC articles ranging over almost a decade (2010 to 2017) and covering a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts).Each article comes with a unique identifier in its URL, which we used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) set.Table 1 compares XSum with the CNN, DailyMail, and NY Times benchmarks.As can be seen, XSum contains a substantial number of training instances, similar to DailyMail; documents and summaries in XSum are shorter in relation to other datasets but the vocabulary size is sufficiently large, comparable to CNN.
Table 2 provides empirical analysis supporting our claim that XSum is less biased toward extractive methods compared to other summarization datasets.We report the percentage of novel n-grams in the target gold summaries that do not appear in their source documents.There are 36% novel unigrams in the XSum reference summaries compared to 17% in CNN,17% in DailyMail,and 23% in NY Times.and followed Narayan et al. (2018b) to preprocess them.For NY Times (Sandhaus, 2008), we used the splits and pre-processing steps of Paulus et al. (2018).For the vocabulary, we lowercase tokens.We further evaluated two extractive methods on these datasets.LEAD is often used as a strong lower bound for news summarization (Nenkova, 2005) and creates a summary by selecting the first few sentences or words in the document.We extracted the first 3 sentences for CNN documents and the first 4 sentences for DailyMail (Narayan et al., 2018b).Following previous work (Durrett et al., 2016;Paulus et al., 2018), we obtained LEAD summaries based on the first 100 words for NY Times documents.For XSum, we selected the first sentence in the document (excluding the one-line summary) to generate the LEAD.Our second method, EXT-ORACLE, can be viewed as an upper bound for extractive models (Nallapati et al., 2017;Narayan et al., 2018b).It creates an oracle summary by selecting the best possible set of sentences in the document that gives the highest ROUGE (Lin and Hovy, 2003) with respect to the gold summary.For XSum, we simply selected the single-best sentence in the document as summary.
Table 2 reports the performance of the two extractive methods using ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) with the gold sum-maries as reference.The LEAD baseline performs extremely well on CNN, DailyMail and NY Times confirming that they are biased towards extractive methods.EXT-ORACLE further shows that improved sentence selection would bring further performance gains to extractive approaches.Abstractive systems trained on these datasets often have a hard time beating the LEAD, let alone EXT-ORACLE, or display a low degree of novelty in their summaries (See et al., 2017;Tan and Wan, 2017;Paulus et al., 2018;Pasunuru and Bansal, 2018;Celikyilmaz et al., 2018).Interestingly, LEAD and EXT-ORACLE perform poorly on XSum underlying the fact that it is less biased towards extractive methods.
In line with our findings, Grusky et al. ( 2018) have recently reported similar extractive biases in existing datasets.They constructed a new dataset called "Newsroom" which demonstrates a high diversity of summarization styles.XSum is not diverse, it focuses on a single news outlet (i.e., BBC) and a unifrom summarization style (i.e., a single sentence).However, it is sufficiently large for neural network training and we hope it will spur further research towards the development of abstractive summarization models.

Convolutional Sequence-to-Sequence Learning for Summarization
Unlike tasks like machine translation and paraphrase generation where there is often a one-to- one semantic correspondence between source and target words, document summarization must distill the content of the document into a few important facts.This is even more challenging for our task, where the compression ratio is extremely high, and pertinent content can be easily missed.
Recently, a convolutional alternative to sequence modeling has been proposed showing promise for machine translation (Gehring et al., 2017a,b) and story generation (Fan et al., 2018).We believe that convolutional architectures are attractive for our summarization task for at least two reasons.Firstly, contrary to recurrent networks which view the input as a chain structure, convolutional networks can be stacked to represent large context sizes.Secondly, hierarchical features can be extracted over larger and larger contents, allowing to represent long-range dependencies efficiently through shorter paths.
Our model builds on the work of Gehring et al. (2017b) who develop an encoder-decoder architecture for machine translation with an attention mechanism (Sukhbaatar et al., 2015) based exclusively on deep convolutional networks.We adapt this model to our summarization task by allow-ing it to recognize pertinent content (i.e., by foregrounding salient words in the document).In particular, we improve the convolutional encoder by associating each word with a vector representing topic salience, and the convolutional decoder by conditioning each word prediction on the document topic vector.
Model Overview At the core of our model is a simple convolutional block structure that computes intermediate states based on a fixed number of input elements.Our convolutional encoder (shown at the top of Figure 2) applies this unit across the document.We repeat these operations in a stacked fashion to get a multi-layer hierarchical representation over the input document where words at closer distances interact at lower layers while distant words interact at higher layers.The interaction between words through hierarchical layers effectively captures long-range dependencies.
Analogously, our convolutional decoder (shown at the bottom of Figure 2) uses the multi-layer convolutional structure to build a hierarchical representation over what has been predicted so far.Each layer on the decoder side determines useful source context by attending to the encoder representation before it passes its output to the next layer.This way the model remembers which words it previously attended to and applies multihop attention (shown at the middle of Figure 2) per time step.The output of the top layer is passed to a softmax classifier to predict a distribution over the target vocabulary.
Our model assumes access to word and document topic distributions.These can be obtained by any topic model, however we use Latent Dirichlet Allocation (LDA; Blei et al. 2003) in our experiments; we pass the distributions obtained from LDA directly to the network as additional input.This allows us to take advantage of topic modeling without interfering with the computational advantages of the convolutional architecture.The idea of capturing document-level semantic information has been previously explored for recurrent neural networks (Mikolov and Zweig, 2012;Ghosh et al., 2016;Dieng et al., 2017), however, we are not aware of any existing convolutional models.
Topic Sensitive Embeddings Let D denote a document consisting of a sequence of words (w 1 , . . ., w m ); we embed D into a distributional space x = (x 1 , . . ., x m ) where x i ∈ R f is a column in embedding matrix M ∈ R V ×f (where V is the vocabulary size).We also embed the absolute word positions in the document p = (p 1 , . . ., p m ) where p i ∈ R f is a column in position matrix P ∈ R N ×f , and N is the maximum number of positions.Position embeddings have proved useful for convolutional sequence modeling (Gehring et al., 2017b), because, in contrast to RNNs, they do not observe the temporal positions of words (Shi et al., 2016).Let t D ∈ R f ′ be the topic distribution of document D and t ′ = (t ′ 1 , . . ., t ′ m ) the topic distributions of words in the document (where During encoding, we represent document D via e = (e 1 , . . ., e m ), where e i is: and ⊗ denotes point-wise multiplication.The topic distribution t ′ i of word w i essentially captures how topical the word is in itself (local context), whereas the topic distribution t D represents the overall theme of the document (global context).The encoder essentially enriches the context of the word with its topical relevance to the document.
For every output prediction, the decoder estimates representation g = (g 1 , . . ., g n ) for previously predicted words (w ′ 1 , . . ., w ′ n ) where g i is: x ′ i and p ′ i are word and position embeddings of previously predicted word w ′ i , and t D is the topic distribution of the input document.Note that the decoder does not use the topic distribution of w ′ i as computing it on the fly would be expensive.However, every word prediction is conditioned on the topic of the document, enforcing the summary to have the same theme as the document.
Multi-layer Convolutional Structure Each convolution block, parametrized by W ∈ R 2d×kd and b w ∈ R 2d , takes as X ∈ R k×d which is the concatenation of k adjacent elements embedded in a d dimensional space, applies one dimensional convolution and returns an output element Y ∈ R 2d .We apply Gated Linear Units (GLU, v : R 2d → R d , Dauphin et al. 2017) on the output of the convolution Y .Subsequent layers operate over the k output elements of the previous layer and are connected through residual connections (He et al., 2016) to allow for deeper hierarchical representation.We denote the output of the ℓth layer as h ℓ = (h ℓ 1 , . . ., h ℓ n ) for the decoder network, and z ℓ = (z ℓ 1 , . . ., z ℓ m ) for the encoder network.
Multi-hop Attention Our encoder and decoder are tied to each other through a multi-hop attention mechanism.For each decoder layer ℓ, we compute the attention a ℓ ij of state i and source element j as: , is the decoder state summary combining the current decoder state h ℓ i and the previous output element embedding g i .The vector z u is the output from the last encoder layer u.The conditional input c ℓ i to the current decoder layer is a weighted sum of the encoder outputs as well as the input element embeddings e j : The attention mechanism described here performs multiple attention "hops" per time step and considers which words have been previously attended to.It is therefore different from single-step attention in recurrent neural networks (Bahdanau et al., 2015), where the attention and weighted sum are computed over z u only.
Our network uses multiple linear layers to project between the embedding size (f + f ′ ) and the convolution output size 2d.They are applied to e before feeding it to the encoder, to the final encoder output z u , to all decoder layers h ℓ for the attention score computation, and to the final decoder output h L before the softmax.We pad the input with k − 1 zero vectors on both left and right side to ensure that the output of the convolutional layers matches the input length.During decoding, we ensure that the decoder does not have access to future information; we start with k zero vectors and shift the covolutional block to the right after every prediction.The final decoder output h L is used to compute the distribution over the target vocabulary T as: We use layer normalization and weight initialization to stabilize learning.
Our topic-enhanced model calibrates longrange dependencies with globally salient content.As a result, it provides a better alternative to vanilla convolutional sequence models (Gehring et al., 2017b) and RNN-based summarization models (See et al., 2017) for capturing cross-document inferences and paraphrasing.At the same time it retains the computational advantages of convolutional models.Each convolution block operates over a fixed-size window of the input sequence, allowing for simultaneous encoding of the input, ease in learning due to the fixed number of non-linearities and transformations for words in the input sequence.

Experimental Setup
In this section we present our experimental setup for assessing the performance of our Topic-aware Convolutional Sequence to Sequence model which we call T-CONVS2S for short.We discuss implementation details and present the systems used for comparison with our approach.
Comparison Systems We report results with various systems which were all trained on the XSum dataset to generate a one-line summary given an input news article.
We compared T-CONVS2S against three extractive systems: a baseline which randomly selects a sentence from the input document (RANDOM), a baseline which simply selects the leading sentence from the document (LEAD), and an oracle which selects a singlebest sentence in each document (EXT-ORACLE).The latter is often used as an upper bound for extractive methods.We also compared our model against the RNN-based abstractive systems introduced by See et al. (2017).In particular, we experimented with an attention-based sequence to sequence model (SEQ2SEQ), a pointer-generator model which allows to copy words from the source text (PTGEN), and a pointer-generator model with a coverage mechanism to keep track of words that have been summarized (PTGEN-COVG).Finally, we compared our model against the vanilla convolution sequence to sequence model (CONVS2S) of Gehring et al. (2017b).

Model Parameters and Optimization
We did not anonymize entities but worked on a lowercased version of the XSum dataset.court,murder,police,arrest,guilty,sentence,boy,bail,space,crown,trial T2: church,abuse,bishop,child,catholic,gay,pope,school,christian,priest,cardinal T3: council,people,government,local,housing,home,house,property,city,plan,authority T4: clinton,party,trump,climate,poll,vote,plaid,election,debate,change,candidate,campaign T5: country,growth,report,business,export,fall,bank,security,economy,rise,global,inflation T6: hospital,patient,trust,nhs,people,care,health, service, staff, report, review, system, child ing and at test time the input document was truncated to 400 tokens and the length of the summary limited to 90 tokens.The LDA model (Blei et al., 2003) was trained on XSum documents (training portion).We therefore obtained for each word a probability distribution over topics which we used to estimate t ′ ; the topic distribution t D can be inferred for any new document, at training and test time.We explored several LDA configurations on held-out data, and obtained best results with 512 topics.Table 3 shows some of the topics learned by the LDA model.
For SEQ2SEQ, PTGEN and PTGEN-COVG, we used the best settings reported on the CNN and DailyMail data (See et al., 2017). 2 All three models had 256 dimensional hidden states and 128 dimensional word embeddings.They were trained using Adagrad (Duchi et al., 2011) with learning rate 0.15 and an initial accumulator value of 0.1.We used gradient clipping with a maximum gradient norm of 2, but did not use any form of regularization.We used the loss on the validation set to implement early stopping.
For CONVS2S 3 and T-CONVS2S, we used 512 dimensional hidden states and 512 dimensional word and position embeddings.We trained our convolutional models with Nesterov's accelerated gradient method (Sutskever et al., 2013) using a momentum value of 0.99 and renormalized gradients if their norm exceeded 0.1 (Pascanu et al., 2013).We used a learning rate of 0.10 and once the validation perplexity stopped improving, we 2 We used the code available at https://github.com/abisee/pointer-generator.
reduced the learning rate by an order of magnitude after each epoch until it fell below 10 −4 .We also applied a dropout of 0.2 to the embeddings, the decoder outputs and the input of the convolutional blocks.Gradients were normalized by the number of non-padding tokens per mini-batch.We also used weight normalization for all layers except for lookup tables.
All neural models, including ours and those based on RNNs (See et al., 2017) had a vocabulary of 50,000 words and were trained on a single Nvidia M40 GPU with a batch size of 32 sentences.Summaries at test time were obtained using beam search (with beam size 10).

Results
Automatic Evaluation We report results using automatic metrics in Table 4.We evaluated summarization quality using F 1 ROUGE (Lin and Hovy, 2003).Unigram and bigram overlap (ROUGE-1 and ROUGE-2) are a proxy for assessing informativeness and the longest common subsequence (ROUGE-L) represents fluency. 4 On the XSum dataset, SEQ2SEQ outperforms the LEAD and RANDOM baselines by a large margin.PTGEN, a SEQ2SEQ model with a "copying" mechanism outperforms EXT-ORACLE, a "perfect" extractive system on ROUGE-2 and ROUGE-L.This is in sharp contrast to the performance of these models on CNN/DailyMail (See et al., 2017) and Newsroom datasets (Grusky et al., 2018), where they fail to outperform the LEAD.The result provides further evidence that XSum is a good testbed for abstractive summarization.PTGEN-COVG, the best performing abstractive system on the CNN/DailyMail datasets, does not do well.We believe that the coverage mechanism is more useful when generating multi-line summaries and is basically redundant for extreme summarization.
CONVS2S, the convolutional variant of SEQ2SEQ, significantly outperforms all RNN-based abstractive systems.We hypothesize that its superior performance stems from the ability to better represent document content (i.e., by capturing long-range dependencies).Table 4 shows several variants of T-CONVS2S including an encoder network enriched with information about how topical a word is on its own (enc t ′ ) or in the document (enc (t ′ ,t D ) ).We also experimented with various decoders by conditioning every prediction on the topic of the document, basically encouraging the summary to be in the same theme as the document (dec t D ) or letting the decoder decide the theme of the summary.Interestingly, all four T-CONVS2S variants outperform CONVS2S.T-CONVS2S performs best when both encoder and decoder are constrained by the document topic (enc (t ′ ,t D ) ,dec t D ).In the remainder of the paper, we refer to this variant as T-CONVS2S.
We further assessed the extent to which various models are able to perform rewriting by generating genuinely abstractive summaries.Table 5 shows the proportion of novel n-grams for LEAD, EXT-ORACLE, PTGEN, CONVS2S, and T-CONVS2S.As can be seen, the convolutional models exhibit the highest proportion of novel n-grams.We should also point out that the summaries being evaluated have on average comparable lengths; the summaries generated by PTGEN contain 22.57 words, those generated by CONVS2S and T-EXT-ORACLE Caroline Pidgeon is the Lib Dem candidate, Sian Berry will contest the election for the Greens and UKIP has chosen its culture spokesman Peter Whittle.[34.1, 20.5, 34 (See et al., 2017).This result further strengthens our hypothesis that XSum is a good testbed for abstractive methods.
Human Evaluation In addition to automatic evaluation using ROUGE which can be misleading when used as the only means to assess the informativeness of summaries (Schluter, 2017), we also evaluated system output by eliciting human judgments in two ways.
In our first experiment, participants were asked to compare summaries produced from the EXT-ORACLE baseline, PTGEN, the best performing system of See et al. (2017), CONVS2S, our topic-aware model T-CONVS2S, and the humanauthored gold summary (GOLD).We did not include extracts from the LEAD as they were significantly inferior to other models.
The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling (BWS; Louviere and Woodworth 1991; Louviere et al. 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017).Participants were presented with a document and summaries generated from two out of five systems and were asked to decide which summary was better and which one was worse in order of informativeness (does the summary capture important information in the document?)and fluency (is the summary written in well-formed English?system summaries are shown in Table 6.We randomly selected 50 documents from the XSum test set and compared all possible combinations of two out of five systems for each document.We collected judgments from three different participants for each comparison.The order of summaries was randomized per document and the order of documents per participant.The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst.The scores range from -1 (worst) to 1 (best) and are shown in Table 7.Perhaps unsurprisingly human-authored summaries were considered best, whereas, T-CONVS2S was ranked 2nd followed by EXT-ORACLE and CONVS2S.PTGEN was ranked worst with the lowest score of −0.218.We carried out pairwise comparisons between all models to assess whether system differences are statistically significant.GOLD is significantly different from all other systems and T-CONVS2S is significantly different from CONVS2S and PTGEN (using a one-way ANOVA with posthoc Tukey HSD tests; p < 0.01).All other differences are not statistically significant.
For our second experiment we used a questionanswering (QA) paradigm (Clarke and Lapata, 2010;Narayan et al., 2018b) to assess the degree to which the models retain key information from the document.We used the same 50 documents as in our first elicitation study.We wrote two fact-based questions per document, just by reading the summary, under the assumption that it highlights the most important content of the news article.Questions were formulated so as not to reveal answers to subsequent questions.We created 100 questions in total (see Table 6 for examples).Participants read the output summaries and answered the questions as best they could without access to the document or the gold summary.The more questions can be answered, the better the corresponding system is at summarizing the document as a whole.Five participants answered ques-tions for each summary.
We followed the scoring mechanism introduced in Clarke and Lapata (2010).A correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise.The final score for a system is the average of all its question scores.Answers again were elicited using Amazon's Mechanical Turk crowdsourcing platform.We uploaded the data in batches (one system at a time) to ensure that the same participant does not evaluate summaries from different systems on the same set of questions.
Table 7 shows the results of the QA evaluation.Based on summaries generated by T-CONVS2S, participants can answer 46.05% of the questions correctly.Summaries generated by CONVS2S, PTGEN and EXT-ORACLE provide answers to 30.90%, 21.40%, and 15.70% of the questions, respectively.Pairwise differences between systems are all statistically significant (p < 0.01) with the exception of PTGEN and EXT-ORACLE.EXT-ORACLE performs poorly on both QA and rating evaluations.The examples in Table 6 indicate that EXT-ORACLE is often misled by selecting a sentence with the highest ROUGE (against the gold summary), but ROUGE itself does not ensure that the summary retains the most important information from the document.The QA evaluation further emphasizes that in order for the summary to be felicitous, information needs to be embedded in the appropriate context.For example, CONVS2S and PTGEN will fail to answer the question "Who has resigned?"(see Table 6 second block) despite containing the correct answer "Dick Advocaat" due to the wrong context.T-CONVS2S is able to extract important entities from the document with the right theme.

Conclusions
In this paper we introduced the task of "extreme summarization" together with a large-scale dataset which pushes the boundaries of abstractive methods.Experimental evaluation revealed that models which have abstractive capabilities do better on this task and that high-level document knowledge in terms of topics and long-range dependencies is critical for recognizing pertinent content and generating informative summaries.In the future, we would like to create more linguistically-aware encoders and decoders incorporating co-reference and entity linking.

Table 2 :
Corpus bias towards extractive methods in the CNN, DailyMail, NY Times, and XSum datasets.We show the proportion of novel n-grams in gold summaries.We also report ROUGE scores for the LEAD baseline and the extractive oracle system EXT-ORACLE.Results are computed on the test set.

Table 3 :
Example topics learned by the LDA model on XSum documents (training portion).

Table 4 :
′ ,t D ) , dect D ) 31.89 11.54 25.75 ROUGE results on XSum test set.We report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F 1 scores.Extractive systems are in the upper block, RNN-based abstractive systems are in the middle block, and convolutional abstractive systems are in the bottom block.

Table 5 :
Proportion of novel n-grams in summaries generated by various models on the XSum test set.

Table 6 :
Example output summaries on the XSum test set with [ROUGE-1, ROUGE-2 and ROUGE-L] scores, goldstandard reference, and corresponding questions.Words highlighted in blue are either the right answer or constitute appropriate context for inferring it; words in red lead to the wrong answer.

Table 7 :
). Examples of System ranking according to human judgments and QA-based evaluation.