Inducing Document Structure for Aspect-based Summarization

Automatic summarization is typically treated as a 1-to-1 mapping from document to summary. Documents such as news articles, however, are structured and often cover multiple topics or aspects, and readers may be interested in only some of them. We tackle the task of aspect-based summarization, where, given a document and a target aspect, our models generate a summary centered around the aspect. We induce latent document structure jointly with an abstractive summarization objective, and train our models in a scalable synthetic setup. In addition to improvements in summarization over topic-agnostic baselines, we demonstrate the benefit of the learnt document structure: we show that (a) our models learn to accurately segment documents by aspect; (b) they can leverage the structure to produce both abstractive and extractive aspect-based summaries; and (c) structure is particularly advantageous for summarizing long documents. All results transfer from synthetic training documents to natural news articles from CNN/Daily Mail and RCV1.


Introduction
Abstractive summarization systems typically treat documents as unstructured, and generate a single generic summary per document (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017). In this work we argue that incorporating document structure into abstractive summarization systems is beneficial for at least three reasons. First, the induced structure increases model interpretability, and can be leveraged for other purposes such as document segmentation. Second, structure-aware models help alleviate performance bottlenecks associated with summarization of long documents by learning to focus only on the segments relevant to the topic of interest. Third, they can adapt more flexibly to the demands of a user who, faced with a long document or a document collection, might be interested only in some of its topics.
For example, given a set of reviews of a smartphone, one user might be interested in a summary of opinions on battery life while another may care more about its camera quality; or, given a news article about a body builder running for governor, a reader might care about the effect on his sports career, or about the political consequences (cf. Figure 1 (bottom) for another example). Throughout this paper, we will refer to such topics or perspectives collectively as aspects. We develop models for aspect-based summarization: given a document and a target aspect, our systems generate a summary specific to the aspect.
We extend recent neural models (See et al., 2017) for abstractive summarization, making the following contributions:
• We propose and compare models for aspect-based summarization incorporating different aspect-driven attention mechanisms in both the encoder and the decoder.
• We propose a scalable synthetic training setup and show that our models generalize from synthetic to natural documents, sidestepping the data sparsity problem and outperforming recent aspect-agnostic summarization models in both cases.
• We show that our models induce meaningful latent structure, which allows them to generate abstractive and extractive aspect-driven summaries, segment documents by aspect, and generalize to long documents. We argue that associating model attention with aspects also improves model interpretability.
Our models are trained on documents paired with aspect-specific summaries. A sizable data set of this kind does not exist, so we adopt a scalable, synthetic training setup (Choi, 2000; Krishna and Srinivasan, 2018). We leverage aspect labels (such as news or health) associated with each article in the CNN/Daily Mail dataset (Hermann et al., 2015), and construct synthetic multi-aspect documents by interleaving paragraphs of articles pertaining to different aspects, and pairing them with the original summary of one of the included articles. Although assuming one aspect per source article may seem crude, we demonstrate that our model trained on this data picks up subtle aspect changes within natural news articles. Importantly, our setup requires no supervision such as pre-trained topics (Krishna and Srinivasan, 2018) or aspect-segmentation of documents.
A script to reproduce the synthetic data set presented in this paper can be found at https://github.com/ColiLea/aspect_based_summarization.
Our evaluation shows that the generated summaries are more aspect-relevant and meaningful than those of aspect-agnostic baselines. It also demonstrates several advantages of the inferred latent aspect representations: they yield accurate document segmentations, they allow our models to produce both extractive and abstractive summaries of high quality, and they do so for long documents. We also show, through both automatic and human evaluation, that our models, trained on synthetic documents, generalize to natural documents from the Reuters and CNN/Daily Mail corpora.

Related Work
Aspect-based summarization has previously been considered in the customer feedback domain (Hu and Liu, 2004; Zhuang et al., 2006; Titov and McDonald, 2008; Lu et al., 2009; Zhu et al., 2009), where a typical system discovers a set of relevant aspects (product properties), and extracts sentiment and information along those aspects. In contrast, we induce latent aspect representations under an abstractive summarization objective. Gerani et al. (2016) consider discourse and topical structure to abstractively summarize product reviews using a micro planning pipeline for text generation rather than building on recent advances in end-to-end modeling. Yang et al. (2018) propose an aspect- and sentiment-aware neural summarization model in a multi-task learning setup. Their model is geared towards the product domain and requires document-level category labels, as well as sentiment and aspect lexica.
In query-based summarization, sets of documents are summarized with respect to a natural language input query (Dang, 2005; Daumé III and Marcu, 2006; Mohamed and Rajasekaran, 2006; Liu et al., 2012; Wang et al., 2014; Baumel et al., 2018). Our systems generate summaries with respect to abstract input aspects (akin to topics in a topic model), whose representations are learnt jointly with the summarization task.
We build on neural encoder-decoder architectures with attention (Nallapati et al., 2016; Cheng and Lapata, 2016; Chopra et al., 2016; See et al., 2017; Narayan et al., 2017), and extend the pointer-generator architecture of See et al. (2017) to our task of aspect-specific summarization. Narayan et al. (2018) use topic information from a pre-trained LDA topic model to generate ultra-short (single-topic) summaries, by scoring words in their relevance to the overall document. We learn topics jointly within the summarization system, and use them to directly drive summary content selection.
Our work is most related to Krishna and Srinivasan (2018) (KS), who concurrently developed models for topic-oriented summarization in the context of artificial documents from the CNN/Daily Mail data. Our work differs from theirs in several important ways. KS use pointer-generator networks directly, whereas we develop novel architectures involving aspect-driven attention mechanisms (Section 3). As such, we can analyze the representations learnt by different attention mechanisms, whereas KS re-purpose attention which was designed with a different objective (coverage). KS use pre-trained topics to pre-select articles from CNN/Daily Mail whose summaries are highly separable in topic space, whereas we do not require such resources nor do we pre-select our data, resulting in a simpler and more realistic setup (Section 4). In addition, our synthetic data set is more complex (ours: 1-4 aspects per document, selected from a set of 6 global aspects; KS: 2 aspects per document, unknown total number of aspects). We extensively evaluate the benefit of latent document structure (Sections 5.1-5.3), and apply our method to human-labeled multi-aspect news documents from the Reuters corpus (Section 5.4).

Aspect-specific Summarization
In this section we formalize the task of aspect-specific document summarization, and present our models. Given an input document x and a target aspect a, our model produces a summary of x with respect to a such that the summary (i) contains only information relevant to a; and (ii) states this information in a concise way (cf. examples in Figure 1).
Our model builds on the pointer-generator networks (PG-net; See et al. (2017)), an encoder-decoder architecture for abstractive summarization. Unlike traditional document summarization, a model for aspect-based summarization needs to include aspects in its input document representation in order to select and compress relevant information. We propose three extensions to PG-net which allow the resulting model to learn to detect aspects. We begin by describing PG-net before we describe our extensions. Our models are trained on documents paired with aspect-specific summaries (cf. Section 4). Importantly, all proposed extensions treat aspect segmentation as latent, and as such learn to segment documents by aspects without exposure to word- or sentence-level aspect labels at train time. Figure 2 visualizes our models.
PG-net. PG-net (See et al., 2017) is an encoder-decoder abstractive summarization model, consisting of two recurrent neural networks. The encoder network is a bi-directional LSTM which reads in the article $x = \{w_i\}_1^N$, token by token, and produces a sequence of hidden states $h = \{h_i\}_1^N$. This sequence is accessed by the decoder network, also an LSTM, which incrementally produces a summary by sequentially emitting words. At each step $t$ the decoder produces word $y_t$ conditioned on the previously produced word $y_{t-1}$, its own latent LSTM state $s_t$ and a time-specific representation of the encoder states $h^*_t$. This time-specific representation is computed through Bahdanau attention (Bahdanau et al., 2015) over the encoder states,
$$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b) \quad (1)$$
$$a^t = \mathrm{softmax}(e^t) \quad (2)$$
$$h^*_t = \sum_i a_i^t h_i \quad (3)$$
where $v$, $W_h$, $W_s$ and $b$ are model parameters. Given this information, the decoder learns to either generate a word from a fixed vocabulary or copy a word from the input. This procedure is repeated until either the maximum output sequence length is reached, or a special <STOP> symbol is produced.
Loss. The loss of PG-net, and all proposed extensions, is the average negative log-likelihood of all words in the summary, $L = -\frac{1}{T} \sum_{t=1}^{T} \log p(y_t)$, where $T$ is the number of words in the summary.
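The attention step and the loss described above can be illustrated with a small pure-Python sketch. It follows the standard Bahdanau-style formulation (score each encoder state, normalize with a softmax, take the weighted sum); the function names, the list-based vectors and the toy dimensions are illustrative assumptions and not taken from the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(h, s_t, v, W_h, W_s, b):
    """Return (a_t, h_star): attention weights and the context vector
    for encoder states h and decoder state s_t."""
    Ws_s = matvec(W_s, s_t)
    e = []
    for h_i in h:
        Wh_h = matvec(W_h, h_i)
        hidden = [math.tanh(x + y + z) for x, y, z in zip(Wh_h, Ws_s, b)]
        e.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    a = softmax(e)
    dim = len(h[0])
    h_star = [sum(a[i] * h[i][d] for i in range(len(h))) for d in range(dim)]
    return a, h_star

def average_nll(word_probs):
    """Average negative log-likelihood of the reference summary words,
    given the probability the model assigned to each of them."""
    return -sum(math.log(p) for p in word_probs) / len(word_probs)
```

Here `word_probs` stands for the probabilities the decoder assigns to the reference words, however they arise (generation or copying); that interface is a simplification for illustration.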

Aspect-aware summarization models
Our proposed models embed all words {w} ∈ x into a latent space, shared between the encoder and the decoder. We also embed the input aspect a (a 1-hot indicator) into the same latent space, treating aspects as additional items of the vocabulary. The embedding space is randomly initialized and updated during training.
Decoder aspect attention. As a first extension, we modify the decoder attention mechanism to depend on the target summary aspect a (Figure 2, left). To this end, we learn separate attention weights and biases for each possible input aspect, and use the parameters specific to target aspect a during decoding, replacing equation (1) with its aspect-specific counterpart. Intuitively, the model can now focus on parts of the input not only conditioned on its current decoder state, but also depending on the aspect the summary should reflect.
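One way to write the aspect-specific counterpart of equation (1), consistent with the description of separate attention weights and biases per aspect, is given below. The superscript notation is an assumption, as is the choice to keep $v$ shared across aspects; the authors' exact formulation may differ.

```latex
e_i^t = v^\top \tanh\left( W_h^{a} h_i + W_s^{a} s_t + b^{a} \right)
```

Under this reading, decoding for a target aspect $a$ simply selects the parameter set $(W_h^{a}, W_s^{a}, b^{a})$ and otherwise proceeds as in PG-net.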
Figure 1 (top), synthetic multi-aspect news article from the MA-News corpus (abridged): a father spent #10,000 on private detectives after police failed to track down the thug who killed his daughter's kitten with an air rifle. neil tregarthen devoted six weeks to gathering information [...] roma 's players were the latest to face the wrath of angry fans following thursday 's capitulation against italian rivals fiorentina. serie a pundit matteo bonetti tweeted that [...] having a demanding job can help stave off dementia later in life, a study has found. keeping your brain active throughout your lifetime, both at work and by enjoying stimulating hobbies, can delay mental decline [...] the europa league had offered the last realistic chance [...] the most beneficial hobbies included reading, having an active social life and using a computer regularly. aylish was horrified to find farah lying in a pool of blood after limping home wounded and crippled with pain last september. [...]
Aspect-specific model summaries:
news: neil tregarthen spent #10,000 on private detectives after police failed to track down thug who killed the thug who killed his daughter's kitten with an air rifle. the kitten was shot near her owner's home in exeter.
health: having a demanding job can help stave off dementia later in life. doctors have long said training your brain in later years can prevent dementia but but this is the first time mental activity earlier in life.
sport: mats francesco totti also spoke to fans despite being an unused substitute. roma captain francesco totti also spoke to fans despite being an unused substitute.

Figure 1 (bottom), news article from the Reuters RCV1 corpus (abridged): steffi graf reluctantly paid 1.3 million marks to charity last month as part of a settlement with german prosecutors who dropped their tax evasion investigation [...] spiegel magazine said graf had 'agreed with a heavy heart' to the bargain with prosecutors because she wanted to put the 'media circus' about her tax affairs behind her and concentrate on tennis. [...] prosecutors dropped their investigation last month after probing graf's finances for nearly two years when she agreed to their offer to pay a sum to charity. [...] german prosecutors often use the charity donation procedure , with the agreement of the accused, to end a case which they do not believe merits a lengthy legal process. [...] the seven-times wimbledon champion, who has not played since the semifinals [...]
Aspect-specific model summaries:
sport: seven-times wimbledon champion could make a return to the court at the end of april in the german open . former family tax adviser joachim eckardt received two and a half years for complicity .
news: prosecutors dropped their investigation last month after probing graf 's finances for nearly two years when she agreed to their offer to pay a sum to charity last month as part of a settlement with german prosecutors who dropped their tax evasion investigation of the tennis player , a news magazine
tvshowbiz: steffi graf reluctantly paid 1.3 million marks $ 777,000 ) to charity last month as part of a settlement with german prosecutors who dropped their tax evasion investigation of the tennis player . the player said she had entrusted financial matters to her father and his advisers from an early age .

Figure 1: Two news articles with color-coded encoder attention-based document segmentations, and selected words for illustration (left), the abridged news article (top right) and associated aspect-specific model summaries (bottom right). Top: Article from our synthetic corpus with aspects sport, tvshowbiz and health. The true boundaries are known, and indicated by black lines in the plot and in the article. Bottom: Article from the RCV1 corpus with document-level human-labeled aspects sports, news and tvshowbiz (gold segmentation unknown).

Encoder attention. Intuitively, all information about aspects is present in the input, independently of the summarization mechanism, and as such should be accurately reflected in the latent document representation. We formalize this intuition by adding an attention mechanism to the encoder (Figure 2, center). After LSTM encoding, we attend over the LSTM states $h = \{h_i\}_1^N$ conditioned on the target aspect: each state $h_i$ is rescaled to $\tilde{h}_i$ by a sigmoid-transformed weight computed from $h_i$ and the embedded target aspect $e_a$, where $W_{\tilde{a}}$ and $b_{\tilde{a}}$ are the parameters of the weight function. The decoder will now attend over $\tilde{h}$ instead of $h$ in equations (1)-(3). Intuitively, we calculate a weight for each token-specific latent representation, and scale each latent representation independently by passing the weight through a sigmoid function. Words irrelevant to aspect a should be scaled down by the sigmoid transformation.
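The gating idea behind the encoder attention can be sketched as follows. Only the sigmoid rescaling of each encoder state is taken from the description above; the particular scoring function used here (a bilinear score between the aspect embedding and the encoder state, plus a scalar bias) and all names are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def aspect_gate(h, e_a, W, b):
    """Rescale each encoder state h_i by a sigmoid gate conditioned on
    the aspect embedding e_a.

    Assumed scoring function: score_i = e_a . (W h_i) + b.
    Returns (gated_states, gates).
    """
    gated, gates = [], []
    for h_i in h:
        Wh = [sum(w * x for w, x in zip(row, h_i)) for row in W]
        score = sum(e * p for e, p in zip(e_a, Wh)) + b
        g = sigmoid(score)
        gates.append(g)
        gated.append([g * x for x in h_i])  # every dimension shares one gate
    return gated, gates
```

With such a gate, states of tokens that score low under the target aspect are shrunk towards zero before the decoder attends over them, which matches the intuition that aspect-irrelevant words are scaled down.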
Source-factors. Our final extension uses the original PG-net, and modifies its input by treating the target aspect as additional information (factor), which gets appended to our input document (Figure 2, right). We concatenate the aspect embedding $e_a$ to the embedding of each word $w_i \in x$. The target summary aspect, not the word's true aspect (which is latent and unknown), is utilized. Through the lexical signal from the target summary, we expect the model to learn to up- or downscale the latent token representations, depending on whether they are relevant to target aspect a. Note that this model does not provide us with aspect-driven attention, and as such cannot be used for document segmentation.
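As a minimal illustration of the source-factor input described above, the following sketch concatenates one aspect embedding to every word embedding. The function name and the toy two-dimensional embeddings are hypothetical.

```python
def append_source_factor(word_embeddings, aspect_embedding):
    """Concatenate the target-aspect embedding to every word embedding."""
    return [list(w) + list(aspect_embedding) for w in word_embeddings]

# Toy example: three words and one aspect, all embedded in the same 2-d space.
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
e_a = [1.0, 0.0]
factored = append_source_factor(words, e_a)
```

Every token receives the same appended vector, since the factor encodes the target aspect of the summary rather than the (unknown) aspect of the token itself.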

A Multi-Aspect News Dataset
To train and evaluate our models, we require a data set of documents paired with aspect-specific summaries. Several summarization datasets consisting of long and multifaceted documents have been proposed recently (Cohan et al., 2018). These datasets do not include aspect-specific summaries, however, and as such are not applicable to our problem setting. We synthesize a dataset fulfilling our requirements from the CNN/Daily Mail (CNN/DM) dataset (Hermann et al., 2015). Our dataset, MA-News, is a set D of data points d = (x, y, a), where x is a multi-aspect document, a is an aspect of x, and y is a summary of x with respect to aspect a. We assemble synthetic multi-aspect documents, leveraging the article-summary pairs from the CNN/DM corpus, as well as the URL associated with each article, which indicates its topic category. We select six categories as our target aspects, optimizing for diversity and sufficient coverage in the CNN/DM corpus: A = {tvshowbiz, travel, health, sciencetech, sports, news}.
We then create multi-aspect documents by interleaving paragraphs of documents belonging to different aspects. For each document d, we first sample its number of aspects $n_d \sim U(1, 4)$. Then, we sample $n_d$ aspects from A without replacement, and randomly draw a document for each aspect from the CNN/DM corpus (train, validation and test documents are assembled from non-overlapping sets of articles). We randomly interleave paragraphs of the documents, maintaining each input document's chronological order. Since paragraphs are not marked in the input data, we draw paragraph length between 1 and 5 sentences. The six aspects are roughly uniformly distributed in the resulting dataset, and the distribution of the number of aspects per document is slightly skewed towards more aspects (number of aspects/proportion: 1/0.107, 2/0.203, 3/0.297, 4/0.393). Finally, we create $n_d$ data points from the resulting document, by pairing the document once with each of its $n_d$ components' reference summaries. We construct 284,701 documents for training and use 1,000 documents each for validation and test.
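The construction procedure above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' released script: the `corpus` layout (aspect mapped to a list of `(sentences, summary)` pairs) is hypothetical, the interleaving picks uniformly among articles that still have paragraphs left (an assumption), and the length filters and the train/validation/test separation are omitted.

```python
import random

ASPECTS = ["tvshowbiz", "travel", "health", "sciencetech", "sports", "news"]

def split_paragraphs(sentences, rng):
    """Chunk a sentence list into pseudo-paragraphs of 1 to 5 sentences."""
    paragraphs, i = [], 0
    while i < len(sentences):
        size = rng.randint(1, 5)
        paragraphs.append(sentences[i:i + size])
        i += size
    return paragraphs

def interleave(paragraph_lists, rng):
    """Randomly merge paragraph lists while keeping each list's own order.
    Returns (source_index, paragraph) pairs."""
    queues = [list(p) for p in paragraph_lists]
    merged = []
    while any(queues):
        k = rng.choice([j for j, q in enumerate(queues) if q])
        merged.append((k, queues[k].pop(0)))
    return merged

def make_datapoints(corpus, rng):
    """Build one synthetic document and its n_d (document, summary, aspect)
    data points, plus gold sentence-level aspect labels."""
    n_d = rng.randint(1, 4)                    # number of aspects, U(1, 4)
    aspects = rng.sample(ASPECTS, n_d)         # sampled without replacement
    articles = [rng.choice(corpus[a]) for a in aspects]
    merged = interleave([split_paragraphs(s, rng) for s, _ in articles], rng)
    document = [sent for _, para in merged for sent in para]
    labels = [aspects[k] for k, para in merged for _ in para]
    points = [(document, summary, a) for a, (_, summary) in zip(aspects, articles)]
    return points, labels
```

Because every document with $n_d$ aspects yields $n_d$ data points, documents with more aspects contribute more data points, which is in line with the skew towards more aspects reported above.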
In order to keep training and evaluation fast, we only consider CNN/DM documents of length 1000 words or less, and restrict the length of assembled MA-News documents to up to 1500 words. Note that the average MA-News article (1350 words) is longer than CNN/DM (770 words), increasing the difficulty of the summarization task, and emphasizing the importance of learning a good segmentation model, which allows the summarizer to focus on relevant parts of the input. We present evidence for this in Section 5.3.

Evaluation
This section evaluates whether our models generate concise, aspect-relevant summaries for synthetic multi-aspect documents (Section 5.1), as well as natural documents (Sections 5.3, 5.4). We additionally explore the quality of the induced latent aspect structure, by (a) evaluating our models on document segmentation (Section 5.2), and (b) demonstrating the benefit of structure for summarizing long natural documents (Section 5.3).
Model parameters. We extend the implementation of pointer-generator networks, and use their training parameters. We set the maximum encoder steps to 2000 because our interleaved training and test documents are longer on average than the original CNN/DM articles. We use the development set for early stopping. We do not use coverage (See et al., 2017) in any of our models to minimize interaction with the aspect-attention mechanisms. We also evaluated systems trained with all combinations of our three aspect-awareness mechanisms, but we did not observe systematic improvements over the single-mechanism systems. Hence, we will only report results on those.

Summarization
This section evaluates the quality of produced summaries using the Rouge metric (Lin, 2004).
Model Comparison. We compare the aspect-aware models with decoder aspect attention (dec-attn), encoder attention (enc-attn), and source factors (sf) introduced in Section 3.1 against a baseline which extracts a summary as the first three sentences in the article (lead-3). We expect any lead-n baseline to be weaker for aspect-specific summarization than for classical summarization, where the first n sentences typically provide a good generic summary. We also apply the original pointer-generator network (PG-net), which is aspect-agnostic. In addition to the abstractive summarization setup, we also derive extractive summaries from the aspect-based attention distributions of two of our models (enc-attn-extract and dec-attn-extract). We iteratively extract the sentences in the input which received the highest attention until a maximum length of 100 words (the same threshold as for abstractive summaries) is reached. Sentence attention $a_s$ is computed as the average word attention $a_w$ over the words in $s$: $a_s = \frac{1}{|s|} \sum_{w \in s} a_w$. Finally, as an upper bound, we train our models on the subset of the original CNN/DM documents from which the MA-News documents were created (prefixed with ub-).
Table 1 (top) presents results of models trained and tested on the synthetic multi-aspect dataset. All aspect-aware models beat both baselines by a large margin. For classical summarization, the lead-3 baseline remains a challenge to beat even for state-of-the-art systems; on multi-aspect documents, too, we observe that PG-net, unlike our systems, performs worse than lead-3. Unsurprisingly, the extractive aspect-aware models outperform their abstractive counterparts in terms of ROUGE, and the decoder attention distributions are more amenable to extraction than encoder attention scores. Overall, our structured models enable both abstractive and extractive aspect-aware summarization at a quality clearly exceeding structure-agnostic baselines.
To assess the impact of the synthetic multi-aspect setup, we apply all models to the original CNN/DM documents from which MA-News was assembled (Table 1, bottom). Both baselines show a substantial performance boost, suggesting that they are well-suited for general summarization but do not generalize well to aspect-based summarization. The performance of our own models degrades more gracefully. Note that some of our aspect-aware methods outperform the PG-net on natural documents, showing that our models can pick up and leverage their less pronounced structure (compared to synthetic documents) as well.
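The attention-based extraction step can be sketched as follows: score each sentence by its average word attention, then add sentences in order of decreasing score within a 100-word budget. How the budget is enforced at the boundary is not specified in the text; this sketch stops before the first sentence that would exceed it and returns sentences in score order, both of which are assumptions, as are the function and parameter names.

```python
def extract_summary(sentences, word_attention, max_words=100):
    """Select high-attention sentences within a word budget.

    sentences: list of non-empty token lists.
    word_attention: parallel list of per-token attention weights.
    """
    # Sentence attention = average attention over the sentence's words.
    scores = [sum(att) / len(att) for att in word_attention]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, n_words = [], 0
    for i in ranked:
        if n_words + len(sentences[i]) > max_words:
            break  # assumed stopping rule: do not exceed the budget
        chosen.append(i)
        n_words += len(sentences[i])
    return [sentences[i] for i in chosen]
```

Averaging rather than summing the word attention keeps long sentences from being favoured merely because they contain more tokens.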
Aspect-based summarization requires models to leverage topical document structure to produce relevant summaries, and as such a baseline focusing on the beginning of the article, which typically summarizes its main content, is no longer viable.

Segmentation
The model attention distribution over the input document, conditioned on a target aspect, allows us to qualitatively inspect the model's aspect representation, and to derive a document segmentation. Since we know the true aspect segmentations for documents in our synthetic dataset, we can evaluate our models on this task, using all test documents with > 1 aspect (896 in total). We decode each test document multiple times conditioned on each of its aspects, and use the attention distributions over the input document under different target aspects to derive a document segmentation. Figure 1 visualizes induced segmentations of two documents. We omit the source-factor model in this evaluation, because it does not provide us with a latent document representation.
Table 2: Text segmentation results: Segmentation metrics $P_k$ and windiff (WD; lower is better), aspect label accuracies (acc w, acc s), and the ratio of system to summary segments (ratio). Three majority baselines (global-max, word-max, sent-max), and a topic model (LDA) and classification baseline (MNB). The majority baselines assign the same aspect to all words (sentences) in a doc, so that $P_k$ and WD scores are identical.

For the encoder attention model, we obtain $n_d$ attention distributions (one per input aspect), and assign each word the aspect under which it received highest attention. For the decoder aspect attention model, we obtain $n_d \times T$ attention distributions, one for each decoder step $t$ and input aspect. For each aspect we assign each word the maximum attention it received over the $T$ decoder steps. Since our gold standard provides us with sentence-level aspect labels, we derive sentence-level aspect labels as the most prevalent word-level aspect in the sentence.
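The labeling procedure just described can be sketched as three small helpers: collapsing decoder-step attention by taking the maximum over steps, assigning each word its highest-attention aspect, and deriving sentence labels by majority vote. The data layout (a dict from aspect to per-word weights) and the function names are assumptions for illustration; tie-breaking is left to Python's defaults.

```python
from collections import Counter

def max_over_steps(step_attention):
    """Collapse T decoder-step distributions into one weight per word
    by taking the maximum attention each word received."""
    return [max(column) for column in zip(*step_attention)]

def word_labels(attention_by_aspect):
    """Assign each word the aspect under which it received highest attention.
    attention_by_aspect maps aspect -> list of per-word weights."""
    aspects = list(attention_by_aspect)
    n_words = len(attention_by_aspect[aspects[0]])
    return [max(aspects, key=lambda a: attention_by_aspect[a][i])
            for i in range(n_words)]

def sentence_labels(labels, sentence_lengths):
    """Label each sentence with its most prevalent word-level aspect."""
    out, start = [], 0
    for length in sentence_lengths:
        counts = Counter(labels[start:start + length])
        out.append(counts.most_common(1)[0][0])
        start += length
    return out
```

For the decoder attention model, `max_over_steps` would be applied once per aspect before calling `word_labels`; for the encoder attention model the per-aspect weights can be passed directly.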
Baselines. global-max assigns each word to the globally most prevalent aspect in the corpus. A second baseline assigns each word to the document's most prevalent aspect on word level (word-max) or sentence level (sent-max). An unsupervised topic model baseline (LDA) is trained on the training portion of our synthetic data set (K = 6; topics were mapped manually to aspects). At decode time, we assign each word its most likely topic and derive sentence labels as the topic assigned to most of its words. Finally, a supervised classification baseline (multinomial naive Bayes; MNB) is trained to classify sentences into aspects.
Metrics. We either consider the set of aspects present in a document (Table 2 center) or all possible aspects in the data set (Table 2 bottom). We measure traditional segmentation metrics $P_k$ (Beeferman et al., 1999) and windiff (WD; Pevzner and Hearst (2002)) (lower is better) which estimate the accuracy of segmentation boundaries, but do not evaluate whether a correct aspect has been assigned to any segment. Hence, we also include aspect label accuracy on the word level (acc w) and sentence level (acc s) (higher is better). We also compute the ratio of the true number of segments to the predicted number of segments (ratio). The attention-aware summarization models outperform all baselines across the board (Table 2). LDA outperforms the most basic global-max baseline, but not the more informed per-document majority baselines. Unsurprisingly, MNB as a supervised model trained specifically to classify sentences performs competitively. Overall, the performance drops when considering the larger set of all six aspects (bottom) compared to only aspects present in the document (between 2 and 4; center).
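To make the boundary-oriented nature of $P_k$ concrete, the sketch below implements one common formulation: slide a window of width k over the sentence sequence and count how often reference and hypothesis disagree on whether the two window ends fall in the same segment. Setting k to half the average reference segment length is a widespread convention assumed here; the paper does not state which k was used, and the function names are illustrative.

```python
def segment_ids(labels):
    """Turn a label sequence into segment ids; a new segment starts
    whenever the label changes."""
    ids, current = [], 0
    for i, label in enumerate(labels):
        if i > 0 and label != labels[i - 1]:
            current += 1
        ids.append(current)
    return ids

def p_k(ref_labels, hyp_labels, k=None):
    """P_k segmentation error (lower is better). Assumes both sequences
    have the same length n and that n > k."""
    ref, hyp = segment_ids(ref_labels), segment_ids(hyp_labels)
    n = len(ref)
    if k is None:
        n_segments = ref[-1] + 1
        k = max(1, round(n / (2 * n_segments)))  # half the mean segment length
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```

Note that the score depends only on where labels change, not on which labels are used, which is why label accuracy is reported separately.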

Long Documents
Accurately encoding long documents is a known challenge for encoder-decoder models. We hypothesize that access to a structured intermediate document representation would help alleviate this issue. To this end, we compare our models against the aspect-agnostic PG-net on natural average and long documents from CNN/DM. All models are trained on the multi-aspect data set. We construct two test datasets: (i) the CNN/DM documents underlying our test set (up to 1000 words; avg), and (ii) CNN/DM documents which are at least 2000 words long (long) and are tagged with one of our target aspects. The total number of average and long documents is 527 and 4560, respectively. Results (Figure 3) confirm that our aspect-aware [...]
We finally explore our aspect-aware models on the task of aspect-agnostic summarization, decoding test documents under all possible aspects and selecting the aspect with the highest-scoring summary in terms of ROUGE (avg best and long best, respectively). In this setup, all our models outperform the PG-net baseline by a large margin, both on long and average documents.

rand   max    LDA    MNB    enc-attn   dec-attn
0.34   0.71   0.40   0.53   0.75       0.37

Evaluation on Reuters News
Finally, we evaluate our models on documents with multiple gold-annotated aspects, using the Reuters RCV1 dataset (Lewis et al., 2004). Our target aspects sport, health, sciencetech and travel are identically annotated in the Reuters data set. We map the remaining tags tvshowbiz and news to their most relevant Reuters counterparts. We obtain 792 documents (with an average length of 12.2 sentences) which were labeled with two or more aspects. Figure 1 (bottom) shows an example of generated summaries for a multi-aspect Reuters document.
Automatic evaluation. We evaluate how well our models recover aspects actually present in the documents. We use the approach described in Section 5.2 to assign aspects to sentences in a document, then collect all of the aspects we discover in each document. We compare aspect to document assignment accuracy against two baselines, one assigning random aspects to sentences (rand), and one always assigning the globally most prominent aspect in the corpus (max). Note that we do not include PG-net or the source-factor model because neither can assign aspects to input tokens.

Table 4: Human evaluation: aspect label accuracy (acc), aspect label diversity for two summaries (diversity), and fluency and informativeness (info) scores. Systems performing significantly better than the lead-2 baseline are marked with a * (p < 0.05, paired t-test; Dror et al. (2018)).
The global majority baseline shows that the gold aspect distribution in the RCV1 corpus is peaked (the most frequent aspect, news, occurs in about 70% of the test documents), and majority class assignment leads to a strong baseline.
Human evaluation. We measure the quality and aspect diversity in aspect-specific summaries of RCV1 articles through human evaluation, using Amazon Mechanical Turk. We randomly select a subset of 50 articles with at least two aspects from the Reuters RCV1 data, and present Turkers with a news article and two summaries. We ask the Turkers to (1) select a topic for each summary from the set of six target topics; (2) rate the summary with respect to its fluency (0=not fluent, 1=somewhat fluent, 2=very fluent); and (3) analogously rate its informativeness. We evaluate the extractive and abstractive versions of our three aspect-aware models. We do not include the original PG-net, because it is incapable of producing distinct, aspect-conditioned summaries for the same document. As in our automatic summarization evaluation, we include a lead baseline. Since the annotators are presented with two summaries for each article, we adopt a lead-2 baseline, and present the first two sentences of a document as a summary each (lead-2). This baseline has two advantages over our systems: first, it extracts summaries as single, complete sentences which are typically semantically coherent units; second, the two sentences (i.e., summaries) do not naturally map to a gold aspect each. We consider both mappings, and score the best.
Results are displayed in Table 4. As expected, the extractive models score higher on fluency, and consequently on aspect-agnostic informativeness. Our abstractive models, however, outperform all other systems in terms of aspect-labeling accuracy (acc), and annotators more frequently assign distinct aspects to two summaries of an article (diversity). The results corroborate our conclusion that the proposed aspect-aware summarization models produce aspect-focused summaries with a distinguishable and human-interpretable focus.

Conclusions
This paper presented the task of aspect-based summarization, where a system summarizes a document with respect to a given input aspect of interest. We introduced neural models for abstractive, aspect-driven document summarization. Our models induce latent document structure, to identify aspect-relevant segments of the input document. Treating document structure as latent allows for efficient training with no need for subdocument level topic annotations. The latent document structure is induced jointly with the summarization objective.
Sizable datasets of documents paired with aspect-specific summaries do not exist and are expensive to create. We proposed a scalable synthetic training setup, adapting an existing summarization data set to our task. We demonstrated the benefit of document structure aware models for summarization through a diverse set of evaluations. Document structure was shown to be particularly useful for long documents. Evaluation further showed that models trained on synthetic data generalize to natural test documents.
An interesting challenge, and open research question, concerns the extent to which synthetic training impacts the overall model generalizability. The aspects considered in this work, as well as the creation process of synthetic data by interleaving documents which are maximally distinct with respect to the target aspects, leave room for refinement. Ideas for incorporating more realistic topic structure in artificial documents include leveraging more fine-grained (or hierarchical) topics in the source data; or adopting a more sophisticated selection of article segments to interleave by controlling for confounding factors like author, time period, or general theme. We believe that