Exploring Content Selection in Summarization of Novel Chapters

We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides. This is a harder task than the news summarization task, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries. We focus on extractive summarization, which requires the creation of a gold-standard set of extractive summaries. We present a new metric for aligning reference summary sentences with chapter sentences to create gold extracts and also experiment with different alignment methods. Our experiments demonstrate significant improvement over prior alignment approaches for our task as shown through automatic metrics and a crowd-sourced pyramid analysis.


Introduction
When picking up a novel one is reading, it would be helpful to be reminded of what happened last. To address this need, we develop an approach to generate extractive summaries of novel chapters. This is much harder than the news summarization tasks on which most of the summarization field focuses (e.g., Cheng and Lapata, 2016; Grusky et al., 2018; Paulus et al., 2017); chapters are on average seven times longer than news articles. There is no one-to-one correspondence between summary and chapter sentences, and the summaries in our dataset use extensive paraphrasing, while news summaries copy most of their information from the words used in the article. We focus on the task of content selection, taking an initial, extractive summarization approach given the task difficulty. As the reference summaries are abstractive, training our model requires creating a gold-standard set of extractive summaries. We present a new approach for aligning chapter sentences with the abstractive summary sentences, incorporating weighting into the ROUGE (Lin, 2004) and METEOR (Lavie and Denkowski, 2009) metrics to enable the alignment of salient words between them. We also experiment with BERT (Devlin et al., 2018) alignment.
We use a stable matching algorithm to select the best alignments, and show that enforcing one-to-one alignments between reference summary sentences and chapter sentences is the best alignment method of those used in earlier work.
We obtain a dataset of summaries from five study guide websites paired with chapter text from Project Gutenberg. Our dataset consists of 4,383 unique chapters, each of which is paired with two to five human-written summaries.
We experiment with generating summaries using our new alignment method within three models that have been developed for single document news summarization (Chen and Bansal, 2018;Kedzie et al., 2018;Nallapati et al., 2017). Our evaluation using automated metrics as well as a crowd-sourced pyramid evaluation shows that using the new alignment method produces significantly better results than prior work.
We also experiment with extraction at different levels of granularity, hypothesizing that extracting constituents will work better than extracting sentences, since summary sentences often combine information from several different chapter sentences. Here, our results are mixed and we offer an explanation for why this might be the case.
Our contributions include a new, challenging summarization task, experimentation that reveals potential problems with previous methods for creating extracts, and an improved method for creating gold standard extracts.

Related Work
Relatively little work has been done in summarization of novels, but early work (Mihalcea and Ceylan, 2007) provided a dataset of novel/summary pairs drawn from CliffsNotes and GradeSaver and developed an unsupervised system based on MEAD (Radev et al., 2001) and TextRank (Mihalcea and Tarau, 2004) that showed promise. More recently, Zhang et al. (2019) developed an approach for summarizing characters within a novel. We hypothesize that our proposed task is more feasible than summarizing the full novel.
Previous work has summarized documents using Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) to extract elementary discourse units (EDUs) for compression and more content-packed summaries (Daumé III and Marcu, 2002; Li et al., 2016; Arumae et al., 2019). Some abstractive neural methods propose attention to focus on phrases within a sentence to extract (Gehrmann et al., 2018). Fully abstractive methods are not yet appropriate for our task due to extensive paraphrasing and generalization.
While previous work on semantic textual similarity is relevant to the problem of finding alignments between chapter and summary text, the data available (Cer et al., 2017;Dolan and Brockett, 2005) is not suitable for our domain, and the alignments we generated from this data were of a poorer quality than the other methods in our paper.

Data
We collect summary-chapter pairs from five online study guides: BarronsBookNotes (BB), BookWolf (BW), CliffsNotes (CN), GradeSaver (GS) and NovelGuide (NG). 2 We select summaries from these sources for which the complete novel text can be found on Project Gutenberg.
Our initial dataset, for summaries with two or more sources, includes 9,560 chapter/summary pairs for 4,383 chapters drawn from 79 unique books. As our analysis shows a very long tail, two rounds of filtering were applied. First, we remove reference texts with >700 sentences, as these are too large to fit into mini-batches (~10% of data). Second, we remove summaries with a compression ratio of <2.0, as such wordy summaries often contain a lot of commentary (i.e., phrases that have no correspondence in the chapter; ~5%). This results in 8,088 chapter/summary pairs, and we randomly assign each book to train, development and test splits (6,288/938/862 pairs respectively). After filtering, chapters are on average seven times longer than news articles from CNN/DailyMail (5,165 vs. 761 words), and chapter summaries are eight times longer than news summaries (372 vs. 46 words).

(Footnote 2: We do not have the rights to redistribute the data. To allow others to replicate the dataset, we provide a list of novel chapters we used at https://github.com/manestay/novel-chapter-dataset)
Train split statistics are given in Table 1. These statistics reveal the large variation in length. Furthermore, we calculate word overlap, the proportion of vocabulary that overlaps between the summary and chapter. For novels, this is 33.7%; for CNN/DailyMail news, this is 68.7%. This indicates the large amount of paraphrasing in the chapter summaries in relation to the original chapter.
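The word-overlap statistic above can be computed as below. The paper does not specify the exact direction of the overlap, so this is a sketch under one reading: the fraction of the summary's vocabulary (unique word types) that also appears in the chapter. The function name is ours.

```python
def vocab_overlap(summary_tokens, chapter_tokens):
    """Fraction of the summary's vocabulary (unique word types)
    that also appears in the chapter text.

    Assumption: overlap is measured from the summary side; the paper
    only says vocabulary 'overlaps between the summary and chapter'."""
    summ_vocab = set(summary_tokens)
    chap_vocab = set(chapter_tokens)
    if not summ_vocab:
        return 0.0
    return len(summ_vocab & chap_vocab) / len(summ_vocab)
```

A low overlap (33.7% for novels vs. 68.7% for CNN/DailyMail) indicates heavy paraphrasing: most summary words never appear in the chapter.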
In Figure 1, we show the first three sentences of a reference summary for Chapter 11 of The Awakening, which are paraphrased from several non-consecutive chapter sentences shown near the bottom of the figure. We also show a portion of the summaries from two other sources, which convey the same content and illustrate the extreme level of paraphrasing as well as differences in detail. We show the full chapter and three full reference summaries in Appendix A.2.

Alignment Experiments
To train models for content selection, we need saliency labels for each chapter segment that serve as proxy extract labels, since there are no gold extracts. In news summarization, these are typically produced by aligning reference summaries to the best matching sentences from the news article. Here, we align the reference summary sentences with sentences from the chapter.
We address two questions for aligning chapter and summary sentences to generate gold-standard extracts: 1) Which similarity metric works best for alignment (Section 4.1)? and 2) Which alignment method works best (Section 4.2)?

GS: In this chapter Mr. and Mrs. Pontellier participate in a battle of wills. When Mr. Pontellier gets back from the beach, he asks his wife to come inside. She tells him not to wait for her, at which point he becomes irritable and more forcefully tells her to come inside.
NG: Mr. Pontellier is surprised to find Edna still outside when he returns from escorting Madame Lebrun home. ... although he asks her to come in to the house with him, she refuses, and remains outside, exercising her own will.
BW: Leonce urges Edna to go to bed, but she is still exhilarated and decides to stay outside in the hammock...
Chapter sentences: He had walked up with Madame Lebrun and left her at the house. "Do you know it is past one o'clock? Come on," and he mounted the steps and went into their room. "Don't wait for me," she answered. "You will take cold out there," he said, irritably. "What folly is this? Why don't you come in?"

Figure 1: Excerpts from three reference summaries and the chapter sentences they paraphrase.

Similarity Metrics
ROUGE is commonly used as a similarity metric to align the input document and the gold-standard summary to produce gold extracts (Chen and Bansal, 2018; Nallapati et al., 2017; Kedzie et al., 2018). One drawback to using ROUGE as a similarity metric is that it weights all words equally. We want, instead, to assign a higher weight to the salient words of a particular sentence. To achieve this, we incorporate a smooth inverse frequency weighting scheme (Arora et al., 2017) to compute word weights. The weight of a given word w_i is computed as:

weight(w_i) = α / (α + p(w_i))

where p(w_i) is estimated from the chapter text and α is a smoothing parameter (here α = 1e-3). N-gram and Longest Common Subsequence (LCS) weights are derived by summing the weights of each of the individual words in the N-gram/LCS. We take the average of ROUGE-1, 2, L using this weighting scheme as the metric for generating extracts, R-wtd, incorporating a stemmer to match morphological variants (Porter, 1980).

Similarity Metrics Results: We compare R-wtd against ROUGE-L (Chen and Bansal, 2018) (R-L), and ROUGE-1 with stop-word removal and stemming (Kedzie et al., 2018) (R-1), for sentence alignment. To incorporate paraphrasing, we average METEOR (Banerjee and Lavie, 2005) scores with ROUGE-1,2,L for both un-weighted (RM) and weighted (RM-wtd) scores. Given the recent success of large, pre-trained language models on downstream NLP tasks, we also experiment with BERT (Devlin et al., 2019) to compute alignments, using cosine similarity between averaged chapter segment and summary segment vectors. We compare the generated gold extracts using R-L F1 against reference summaries to determine a shortlist for human evaluation (to save costs).
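To make the weighting scheme concrete, the sketch below computes SIF word weights from chapter token counts and uses them in a weighted unigram (ROUGE-1-style) F1 score. The function names and the default weight of 1.0 for unseen words are our own illustrative choices, not the paper's exact implementation; the full R-wtd metric additionally averages weighted ROUGE-1, 2, and L and applies stemming.

```python
from collections import Counter

ALPHA = 1e-3  # smoothing parameter alpha, as in Arora et al. (2017)

def sif_weights(chapter_tokens, alpha=ALPHA):
    """Smooth inverse frequency weight alpha / (alpha + p(w)) for each
    word, with p(w) estimated from the chapter's token frequencies."""
    counts = Counter(chapter_tokens)
    total = sum(counts.values())
    return {w: alpha / (alpha + c / total) for w, c in counts.items()}

def weighted_rouge1_f1(summary_tokens, chapter_tokens, weights):
    """ROUGE-1 F1 where each matched unigram contributes its SIF weight
    instead of counting equally (rare, salient words count more)."""
    s, c = Counter(summary_tokens), Counter(chapter_tokens)
    overlap = sum(min(s[w], c[w]) * weights.get(w, 1.0) for w in s)
    wt_s = sum(n * weights.get(w, 1.0) for w, n in s.items())
    wt_c = sum(n * weights.get(w, 1.0) for w, n in c.items())
    if wt_s == 0 or wt_c == 0:
        return 0.0
    p, r = overlap / wt_c, overlap / wt_s
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note that frequent words such as "the" receive much lower weight than rare content words, so alignments are driven by salient vocabulary rather than function words.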
For the human evaluation, we ask crowd workers to measure content overlap between the generated alignments and the reference summary on a subset of the validation data. For each summary reference, they are shown a generated alignment and asked to indicate whether it conveys each of up to 12 summary reference sentences. An example task is shown in Appendix Figure 7. We then compute precision and recall based on the number of summary sentences conveyed in the extract. Table 2 shows that humans prefer alignments generated using R-wtd by a significant margin. Sample alignments generated by R-wtd in comparison to the baseline are shown in Figure 2.

Alignment Methods
Some previous work in news summarization has focused on iteratively picking the best article sentence with respect to the summary, in order to get the gold extracts (Nallapati et al., 2017;Kedzie et al., 2018), using ROUGE between the set of selected sentences and the target summary. In contrast, others have focused on picking the best article sentence with respect to each sentence in the summary (Chen and Bansal, 2018). We investigate which approach yields better alignments. We refer to the former method as summary-level alignment and the latter method as sentence-level alignment.
For sentence-level alignment, we note that the problem of finding optimal alignments is similar to a stable matching problem. We wish to find a set of alignments such that there exists no chapter segment a and summary segment x where both a and x would prefer to be aligned with each other over their current alignment match. We compute alignments based on the Gale-Shapley algorithm (1962) for stable matching and compare it with the greedy approach from prior work (Chen and Bansal, 2018).
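A minimal sketch of this sentence-level stable matching, assuming a precomputed similarity matrix sim[i][j] between summary sentence i and chapter sentence j (function and variable names are ours). Summary sentences "propose" to chapter sentences in order of decreasing similarity, and a chapter sentence keeps only its best proposer so far; since both sides rank by the same similarity scores, the result contains no pair that would prefer each other over their assigned matches.

```python
def stable_alignments(sim):
    """Gale-Shapley stable matching between summary sentences (rows of
    sim) and chapter sentences (columns of sim).

    Returns a dict {summary_index: chapter_index} of one-to-one
    alignments; summary sentences that exhaust all chapter sentences
    remain unmatched."""
    n_summ, n_chap = len(sim), len(sim[0])
    # Each summary sentence's chapter-sentence preferences, best first.
    prefs = [sorted(range(n_chap), key=lambda j: -sim[i][j])
             for i in range(n_summ)]
    next_prop = [0] * n_summ         # next preference index to propose to
    match_of_chap = [None] * n_chap  # chapter j -> matched summary i
    free = list(range(n_summ))       # summary sentences still unmatched
    while free:
        i = free.pop()
        if next_prop[i] >= n_chap:
            continue  # exhausted all chapter sentences; stays unmatched
        j = prefs[i][next_prop[i]]
        next_prop[i] += 1
        cur = match_of_chap[j]
        if cur is None:
            match_of_chap[j] = i          # chapter j was free: accept
        elif sim[i][j] > sim[cur][j]:
            match_of_chap[j] = i          # chapter j prefers i: swap
            free.append(cur)
        else:
            free.append(i)                # rejected: i proposes again later
    return {i: j for j, i in enumerate(match_of_chap) if i is not None}
```

By contrast, the greedy per-sentence approach of prior work takes an independent argmax for each summary sentence, so two summary sentences can claim the same chapter sentence; stable matching resolves such conflicts globally.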
For summary-level alignment (Nallapati et al., 2017;Kedzie et al., 2018), we compare two variants: selecting sentences until we reach the reference word count (WL summary), and selecting sentences until the ROUGE score no longer increases (WS summary).
Crowd-sourced evaluation results (Table 3) show that sentence-level stable matching is significantly better. We use this in the remainder of this work. These differences in alignments affect earlier claims about the performance of summarization systems, as they were not measured, yet have a significant impact.

Table 3: Crowd-sourced evaluation on content overlap for summary- vs. sentence-level alignment on the validation set.

Ref summary: He says he will, as soon as he has finished his last cigar.
R-L greedy: "You will take cold out there," he said, irritably.
R-L stable: He drew up the rocker, hoisted his slippered feet on the rail, and proceeded to smoke a cigar.
R-wtd stable: "Just as soon as I have finished my cigar."

Figure 2: A reference summary sentence and its alignments. R-L greedy and R-L stable are incorrect because they weight all words equally (e.g. said, cigar, '.').

(Footnote 4: Bold text indicates statistical significance with p < 0.05.)

Summarization Experiments
In order to assess how alignments impact summarization, we train three extractive systems: a hierarchical CNN-LSTM extractor (Chen and Bansal, 2018) (CB), seq2seq with attention (Kedzie et al., 2018) (K), and an RNN-based extractor (Nallapati et al., 2017) (N). The target word length of generated summaries is based on the average summary length of similarly long chapters from the training set. We also experiment with aligning and extracting at the constituent level, given our observation during data analysis that summary sentences often draw from two different chapter sentences. We create syntactic constituents by taking sub-trees from constituent parse trees for each sentence (Manning et al., 2014) rooted with S-tags. To ensure that constituents are long enough to be meaningful, we take the longest S-tag when one S-tag is embedded within others (see Appendix A.5).
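The "take the longest S when S nodes are nested" rule can be sketched as follows. This is an illustrative reimplementation over bracketed parse strings (the format produced by parsers such as CoreNLP), not the paper's full Algorithm 1; it covers only the outermost-S rule and omits the relative-clause and VP cases listed in Appendix A.5, and all function names are ours.

```python
def parse_sexpr(s):
    """Minimal reader for Penn-Treebank-style bracketed parse strings.
    Returns (label, children) tuples; leaf tokens are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        children, i = [], i + 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    return read(0)[0]

def leaves(node):
    """All leaf tokens under a node, in order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def extract_s_constituents(node, spans=None):
    """Collect each outermost S subtree (the 'longest' S when S nodes
    are embedded within one another) as a span of text."""
    if spans is None:
        spans = []
    if isinstance(node, tuple):
        if node[0] == "S":
            spans.append(" ".join(leaves(node)))  # stop: skip nested Ss
        else:
            for child in node[1]:
                extract_s_constituents(child, spans)
    return spans
```

Because recursion stops at the first S encountered on each path, an S embedded inside another S is never extracted separately, matching the longest-S rule described above.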
Summary quality is evaluated on F1 scores for R-{1,2,L} and METEOR. Each chapter has 2-5 reference summaries, and we evaluate the generated summary against all of the reference summaries. Part of a generated summary of extracted constituents for Chapter 11 of The Awakening is shown in Figure 3. The full generated summaries for this chapter (both extracted constituents and extracted sentences) are shown in Appendix A.2.

Generated Summary: |I thought I should find you in bed , " ||said her husband , |when he discovered her |lying there . |He had walked up with Madame Lebrun and left her at the house . ||She heard him moving about the room ; |every sound indicating impatience and irritation .|

Figure 3: System-generated summary, with extracted constituents shown in teal and separated by |.

Results
We compare our method for generating extractive targets (ROUGE weighted, with stable matching at the sentence level) against the baseline method for generating extractive targets for each of the systems. Table 4 shows three rows for each summarization system: using the original target summary labels, and using either constituent or sentence segments. We see that our proposed alignment method performs significantly better for all models. ROUGE-L in particular increases 10% to 18% relative over the baselines. Moreover, it would seem at first glance that the K and N baseline models perform better than the CB baseline; however, this difference has nothing to do with the architecture choice. When we use our extractive targets, all three models perform similarly, suggesting that the differences are mainly due to small, but important, differences in their methods for generating extractive targets.

Model  Seg    Method    R-1   R-2   R-L   METEOR
CB     sent   baseline  33.1  5.5   30.0  13.9
CB     sent   R-wtd     35.8  6.9   33.4  15.2
CB     const  R-wtd     36.2  6.9   35.4  15.2
K      sent   baseline  34.3  6.4   31.6  14.6
K      sent   R-wtd     35.6  6.9   33.2  15.0
K      const  R-wtd     36.2  6.9   35.2  15.1
N      sent   baseline  34.6  6.4   31.9  14.6
N      sent   R-wtd     35.7  7.0   33.3  15.1
N      const  R-wtd     35.9  7.0   35.2  15.0

Table 4: ROUGE-F1 and METEOR for generated summaries. "Baseline" is the alignment method originally used for that model.
Human Evaluation: Given questions about the reliability of ROUGE (Novikova et al., 2017; Chaganty et al., 2018), we perform a human evaluation to assess which system is best at content selection. We use a lightweight, sampling-based approach to pyramid analysis that relies on crowd-sourcing, proposed by Shapira et al. (2019), which correlates well with the original pyramid method (Nenkova et al., 2007). We ask the crowd workers to indicate which of the sampled reference summary content units are conveyed in the generated summary. We evaluated our best system and alignment on extraction of sentences and of constituents (CB R-wtd), along with a baseline system (CB K-align), using the crowd-sourced pyramid evaluation method. To produce readable summaries for extracted constituents, each extracted constituent is included along with the context of the containing sentence (black text in Figure 3). Table 5 shows that CB Sent R-wtd has significantly higher content overlap with reference summaries.

Discussion and Conclusion
We present a new, challenging task for summarization of novel chapters. We show that sentence-level, stable-matched alignment is better than the summary-level alignment used in previous work, and that our proposed R-wtd method for creating gold extracts is better than other similarity metrics. The resulting system is a first step towards addressing this task.

System           Pyramid Score
CB K-align       17.9
CB Sent R-wtd    18.9
CB Const R-wtd   18.1

Table 5: Crowd-sourced pyramid evaluation scores.

While both human evaluation and automated metrics concur that summaries produced with our new alignment approach outperform previous approaches, they contradict each other on whether extraction is better at the constituent or the sentence level. We hypothesize that because we use ROUGE to score summaries of extracted constituents without context, the selected content is packed into the word budget; there is no potentially irrelevant context to count against the system. In contrast, we do include sentence context in the pyramid evaluation in order to make the summaries readable for humans, and thus fewer constituents make it into the generated summary for the human evaluation. This could account for the increased score on automated metrics.
It is also possible that smaller constituents are matched by metrics such as ROUGE to phrases within the summary when they should not actually count as matches. In future work, we plan to experiment further with this, examining how we can combine constituents to make fluent sentences without including potentially irrelevant context.
We would also like to further experiment with abstractive summarization to re-examine whether large, pre-trained language models (Liu and Lapata, 2019) can be improved for our domain. We suspect these models are problematic for our documents because the documents are, on average, an order of magnitude longer than the input used for pre-training the language model (512 tokens). Another issue is that the pre-trained language models are very large and take up a substantial amount of GPU memory, which limits how long the input document can be. While truncation of a document may not hurt performance in the news domain due to the heavy lede bias, in our domain, truncation can hurt the performance of the summarizer.

A.1 Acknowledgments
We would like to thank Spandana Gella for her contributions to the project. We would like to thank Jonathan Steuck, Alessandra Brusadin, and the rest of the AWS AI Data team for their invaluable feedback in the data annotation process. We would finally like to thank Christopher Hidey, Christopher Kedzie, Emily Allaway, Esin Durmus, Fei-Tzin Lee, Feng Nan, Miguel Ballesteros, Ramesh Nallapati, and the anonymous reviewers for their valuable feedback on this paper.

A.2 Example Chapter and Summaries
We show the full text of Chapter 11 of The Awakening by Kate Chopin in Figure 4. We show three reference summaries in Figure 5, and two generated summaries using our best alignment method in Figure 6. While there are differences in length and level of detail, there are also clearly similarities in covered content.

A.3 Target Word Length for Summaries
The target word length for generated summaries is a function of the input chapter word count (wc_chapter). We divide the train set into 10 quantiles by chapter word count, and associate each quantile (or bin) with its mean compression ratio (CR), where the compression ratio of the i-th training item is:

CR_i = wc_chapter,i / wc_refsumm,i

with wc_refsumm the word count of the reference summary. The target word length for the generated summary (wc_gen_summ) is then:

wc_gen_summ = wc_chapter / CR_bin(wc_chapter)

where CR_bin(wc_chapter) is the mean compression ratio of the chapter's bin. Generated summaries are created by extracting segments with the highest model probability until this budget is reached (without truncation). Oracle summaries also use this target word length, but may be shorter if the original summary had few segments (as we extract one chapter segment for each summary segment).
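Under our reading of the formulas above, the budget computation can be sketched as: bin the training pairs into chapter-length quantiles, record each bin's mean compression ratio, and divide a new chapter's length by its bin's ratio. Function names and the bin-boundary handling are our own illustrative assumptions (we assume at least n_bins training pairs).

```python
def fit_bins(train_pairs, n_bins=10):
    """train_pairs: (chapter_word_count, ref_summary_word_count) tuples.
    Returns per-bin upper bounds on chapter length and each bin's mean
    compression ratio CR = wc_chapter / wc_refsumm."""
    pairs = sorted(train_pairs)  # sort by chapter word count
    chunks = [pairs[round(k * len(pairs) / n_bins):
                    round((k + 1) * len(pairs) / n_bins)]
              for k in range(n_bins)]
    chunks = [c for c in chunks if c]  # guard against empty quantiles
    bounds, ratios = [], []
    for chunk in chunks:
        bounds.append(chunk[-1][0])  # largest chapter length in this bin
        ratios.append(sum(c / s for c, s in chunk) / len(chunk))
    return bounds, ratios

def target_length(chapter_wc, bounds, ratios):
    """Target summary word count = chapter length / its bin's mean CR."""
    for bound, cr in zip(bounds, ratios):
        if chapter_wc <= bound:
            return round(chapter_wc / cr)
    return round(chapter_wc / ratios[-1])  # longer than any training chapter
```

For example, a 5,000-word chapter falling in a bin whose summaries compress chapters 10:1 would receive a 500-word budget.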

A.4 SCU Evaluation Task Setup
To obtain the distractors, we sample 2 SCUs from different chapters from the same book. We insert one of them, the positive distractor, into the generated summary, as well as into the list of statements, so it will always be correct. We insert the other, the negative distractor, only into the list of statements, so it will always be incorrect.

A.5 Constituent Extraction algorithm
Algorithm 1 extracts subtrees from a constituent parse tree. These subtrees are constituents, and break down sentences into meaningful spans of text. Constituents are one of:
1. A relative clause
2. The highest-level S or SBAR node in its subtree with (NP, VP) children
3. The highest-level VP node above 2)
4. The remaining nodes in the tree that were not extracted with 1), 2) or 3)

Figure 4: Full chapter text of Chapter 11 of The Awakening. Note that this chapter is short at 847 words, as the median chapter length is 3,168 words. [Chapter text omitted.]

Figure 5: Three reference summaries for Chapter 11, including the NovelGuide summary, illustrating differences in length and level of detail. [Summary text omitted.]

Figure 6: Generated summaries for Chapter 11 using our best alignment method, with extracted constituents (Constituent R-wtd) and extracted sentences (Sentence R-wtd) separated by |. [Summary text omitted.]

Figure 7: Example SCU evaluation task. Reading the summary, the worker should answer "Present" for both questions shown; there can be up to 12 questions, which we omit for brevity. Note that in our evaluation, we counted both "Present" and "Partially Present" as a match.