Content Planning for Neural Story Generation with Aristotelian Rescoring

Long-form narrative text generated from large language models manages a fluent impersonation of human writing, but only at the local sentence level, and lacks structure or global cohesion. We posit that many of the problems of story generation can be addressed via high-quality content planning, and present a system that focuses on how to learn good plot structures to guide story generation. We utilize a plot-generation language model along with an ensemble of rescoring models that each implement an aspect of good story-writing as detailed in Aristotle's Poetics. We find that stories written with our more principled plot-structure are both more relevant to a given prompt and higher quality than baselines that do not content plan, or that plan in an unprincipled way.


Introduction
Despite many recent advances in Natural Language Generation, successful creative narrative composition remains elusive. Current neural approaches struggle to master structure, veer between topics, and lack long-range cohesion. They successfully imitate the fluency and style of human writing, but on closer inspection sentences do not fit together to form a whole, and the reader is left with the impression that the generation has no content (See et al., 2019). This lack of structure also degrades the relevance of generations conditioned on a prompt or other source text: a strong language model will repeat key phrases from a given prompt but will not remain on topic. These issues are illustrated in the Naive Generated Story in Table 1, where many of the sentences individually are fine, but they do not fit together as one story, and do not all relate to the prompt.
We hypothesise that this problem can be addressed with a focus on deeper latent narrative structures. In Aristotle's Poetics, one of the most enduring treatises on the craft of writing good stories, the philosopher lays out the elements of story in order of importance: 1) event choice and arrangement, 2) character, 3) relevant content, and 4) diction. An amateur masters skills later in the list, but mastery of event choice and event arrangement is what distinguishes a good writer (Aristotle). Next is character, then relevance, and only finally do style and diction matter.
This philosophical framework fits remarkably well with the traditional Natural Language Generation pipeline approach that emphasizes Content Planning (Reiter and Dale, 1997). The pipeline divides generation into three steps: Content Planning, Microplanning, and Surface Realization, where at each step the input is modified and refined, getting closer to the final textual output. Incorporating a plot in order to generate stories can then be viewed as a proxy for Content Planning and Microplanning, before a language model converts the plot into readable and grammatically correct natural language output (Surface Realization).
Inspired by both the Aristotelian and Content Planning frameworks, we develop a novel system for story generation. We focus on developing a system that can learn to expertly select events, characters, and relevant content, and write good plot structures. After the work on the plot is complete, a large language model can then do what it does best and fill in the descriptions, details, and local specifics of each story.

Table 1: Our proposed plot and story generation structure. We generate a Naive Plot, revise it with Aristotelian rescorers, then generate a story. In plots, <V> denotes verbs while <A{0,1,2}> denote arguments; ent{0...n} are entities. We removed the newline symbol <P> from the generated stories and detokenized them for better display.
For plot generation, we employ several event-choice and event-arrangement rescoring models which assist in building the arc and cohesion of the plot, a character rescoring model that helps select which characters appear where, and a relevance model that is responsible for keeping the plot structure and the story on topic. As both improving plot generation via rescoring and using an Aristotelian framework for neural generation are novel concepts, there is no previous work on how to implement them in practice.
Our contributions are: 1) we propose to leverage the principled Aristotelian framework for content planning, 2) we propose an implementation of the framework using a revision-based approach via several rescoring models, and 3) we show strong experimental results against four baselines.

Background
Existing work in neural story generation has established the strength of adding a content planning stage to structure the generated content (Yao et al., 2019; Fan et al., 2019) (discussed in more detail in Section 7). Specifically, this line of work trains a pipeline with one model that generates from prompt → plot and another that generates from prompt + plot → story. It modifies the standard conditional generation task with a source x = x_1 ... x_n (in this case, a prompt) and target y = y_1 ... y_n (in this case, a story) to condition also on an abstract intermediate representation z. Note that the approach is not truly modelling p(y|x), since that would involve summing over all z. Instead, it models p(y, z|x) = p(z|x) p(y|z, x), but only shows the generated story y at inference time.
This is a more controllable task than open-domain generation conditioned on only a prompt x, provided that a good interim structure z can be learnt. We follow this line and explore ways to improve plot planning, to close the gap between stories generated from gold plots and those from model-generated plots.

Plot Representation. As there are no large datasets with parallel gold-standard plots and stories, all work on plot generation depends on silver-standard plots extracted from stories. We follow Fan et al. (2019) to represent plots in Semantic Role Labelling (SRL) format. We run coreference resolution to identify entities, and use a compression algorithm to discard less salient information.
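To make the plot-extraction step concrete, here is a minimal sketch of how SRL output can be linearised into the <A0>/<V>/<A1> plot format described above. The AllenNLP model archive URL and the tag-merging logic are illustrative assumptions; the paper's own compression and coreference-based entity anonymisation (Appendix A.6) are not reproduced here.

```python
# pip install allennlp allennlp-models
from allennlp.predictors.predictor import Predictor

# The model archive is an assumption: any AllenNLP SRL archive will do.
srl = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "structured-prediction-srl-bert.2020.12.15.tar.gz"
)

def sentence_to_plot(sentence: str) -> str:
    """Linearise one sentence's SRL frames into <A0> ... <V> ... <A1> ... form."""
    out = srl.predict(sentence=sentence)
    words, pieces = out["words"], []
    for frame in out["verbs"]:                        # one frame per predicate
        spans = {}                                    # role -> words, in order
        for word, tag in zip(words, frame["tags"]):   # BIO tags like B-ARG0, I-V
            if tag == "O":
                continue
            role = tag.split("-", 1)[1]               # e.g. "ARG0" or "V"
            spans.setdefault(role, []).append(word)
        for role in ("ARG0", "V", "ARG1", "ARG2"):
            if role in spans:
                marker = "<V>" if role == "V" else f"<A{role[-1]}>"
                pieces.append(f"{marker} {' '.join(spans[role])}")
    return " ".join(pieces)

print(sentence_to_plot("The knight rescued the dragon from the princess."))
# e.g. "<A0> The knight <V> rescued <A1> the dragon"
```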

Approach
We focus on learning the best interim plot structure between the input prompt and the story surface realisation. As such, we learn the plot model p(z|x) and the story model p(y|z, x) by fine-tuning a pretrained conditional language model, BART (Lewis et al., 2020), on 1) pairs of prompts and extracted silver-standard plot structures, and 2) pairs of prompts + plots and stories, respectively. Full implementation details can be found in Appendix A.4. We propose to modify the decoding objective to incorporate input from each Aristotelian rescoring model a ∈ A (the complete set of rescoring models, detailed further in Section 3.1) and re-rank the original, or "naive", plot model hypotheses, bringing the plot representation closer to each rescoring model's specialty and desirable story attribute. A diagram of our final system in Figure 1 shows each step of the generation process. The modified decoding objective becomes:

f_λ(x, z) = log p(z|x) + Σ_{a_j ∈ A} λ_j · a_j(x, z)    (1)

where λ_j is the learned weight of the score given by a_j, as detailed in Section 3.3.
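As a concrete illustration, the following is a minimal sketch of rescorer-weighted re-ranking of naive plot hypotheses, under the assumption that each rescorer is a callable returning a score a_j(x, z) where higher is better; the names and the hypothesis interface are illustrative, not the paper's actual code.

```python
from typing import Callable, List, Tuple

# Each hypothesis is (plot_text, log_prob_under_plot_model).
Hypothesis = Tuple[str, float]

def rerank(
    prompt: str,
    hypotheses: List[Hypothesis],
    rescorers: List[Callable[[str, str], float]],  # a_j(x, z)
    weights: List[float],                          # learned lambda_j (Section 3.3)
) -> List[Hypothesis]:
    """Re-rank naive plot-model hypotheses by Equation 1:
    f(x, z) = log p(z|x) + sum_j lambda_j * a_j(x, z)."""
    def f(hyp: Hypothesis) -> float:
        plot, log_p = hyp
        return log_p + sum(w * a(prompt, plot) for w, a in zip(weights, rescorers))
    return sorted(hypotheses, key=f, reverse=True)
```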

Aristotelian Rescoring Models
For all of our rescoring models, we train classifiers to distinguish positive examples (the silver extracted plots) from negative examples (plots that are worse with regard to the aspect that we desire to encode), given the prompt x. The intuition is that if the rescoring model can learn a particular principle, it can assist the plot-generation language model in creating content that encapsulates that principle. (We used AllenNLP (Gardner et al., 2018) to run the SRL model and coreference model, and determined our own compression algorithm experimentally; further details are in Appendix A.6.) Mathematically, the classifiers learn

p(l|x, z) = p(x, z, l) / p(x, z)

and we use the probability of the plot being a positive example (a more plausible plot) as our Aristotelian rescoring model:

a_j(x, z) = p(l = 1 | x, z)

What differs for each model a_j that specialises in a different Aristotelian aspect is the set of negative examples that we generated to capture the type of information it has learnt to discriminate, and the features it learns. We give more details about each Aristotelian rescorer as follows.

Event Rescorers. The SRL-extracted plots provide us with a structure that is very similar to event representations in the event extraction literature. We thus consider an Event to be composed of an action-based verb and its subject and object (a verb, subject, object tuple). We experiment with three different ways to construct positive and negative event examples. SRL-based plots are structured, and a random shuffle would be trivial to distinguish, so we need more nuanced ways to learn good event choice and arrangement. We try:

• inter-sentence shuffled events: we permute all sentences as full chunks, keeping all events within a sentence together.
• intra-sentence shuffled events: we permute the event tuples within a sentence, but keep each verb and its arguments together.
• verb-shuffled events: we permute only the event verbs within a sentence, leaving their arguments and contexts unchanged.
Each of these has specific strengths and weaknesses. Inter-sentence shuffling is closest to work on Narrative Event Chains (Chambers and Jurafsky, 2008) and script learning, which represent the fact that certain events are more likely to causally follow other events than to precede them. However, since inter-sentence noising scopes globally over the entire plot structure, it is a harder task, and it may be difficult for the model to discriminate patterns. Intra-sentence shuffling is the same task restricted to a more local scale, which makes the patterns clearer and more learnable but cannot capture long-distance Event Chains inter-sententially. It is also more sensitive to the style of a given story, as stories have a variable number of events per sentence. Finally, verb-shuffling focuses on verbs as the salient element of an event; it should teach both principles of verb ordering and of verb suitability for context, and avoid artifacts from reordering arguments. However, since verbs are shuffled naively, the task can in some cases be too easy due to differences in verb selectional preferences. (A minimal sketch of these noising functions is given at the end of this subsection.)

Character Rescorers. We represent character trajectory by distinguishing which character should appear at what point in the story. We create training examples by taking each entity and all the preceding plot tokens up until that entity, and having the rescoring model choose between the true entity and a randomly sampled entity. The character rescorer must then distinguish between points in the plot when a pre-existing entity should reappear (and if so, which one) and points where a new entity should be introduced. The intuition is that this should encode typical patterns of a character's actions and relationships in particular contexts.

Relevance Rescorers. We take an approach similar to prior work on learning to discriminate between random and true continuations of story sentences (Holtzman et al., 2018). We consider pairs of prompts and plots, where a positive example is the true plot and the negative is a randomly selected plot from elsewhere in the training data. This prompt-and-plot pairing is a much more difficult task than pairing context and continuation sentences, but once trained, this rescorer is expected to tell which kinds of plot words, verbs, and SRL patterns belong with which kinds of prompts.
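To illustrate the three event-noising schemes, here is a minimal sketch operating on a plot represented as a list of sentences, each a list of (verb, arguments) event tuples. This representation and the function names are assumptions for illustration; the paper operates on the linearised <V>/<A*> SRL format.

```python
import random
from typing import List, Tuple

Event = Tuple[str, List[str]]   # (verb, arguments)
Plot = List[List[Event]]        # sentences, each a list of events

def inter_sentence_shuffle(plot: Plot) -> Plot:
    """Permute whole sentences; events within each sentence stay together."""
    sentences = plot[:]
    random.shuffle(sentences)
    return sentences

def intra_sentence_shuffle(plot: Plot) -> Plot:
    """Permute event tuples within each sentence; each verb keeps its arguments."""
    noised = []
    for sentence in plot:
        events = sentence[:]
        random.shuffle(events)
        noised.append(events)
    return noised

def verb_shuffle(plot: Plot) -> Plot:
    """Permute only the verbs within a sentence; arguments stay in place."""
    noised = []
    for sentence in plot:
        verbs = [v for v, _ in sentence]
        random.shuffle(verbs)
        noised.append([(v, args) for v, (_, args) in zip(verbs, sentence)])
    return noised
```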

Rescoring Model Architecture
There is an inherent tension in training a useful rescoring model: discrimination tasks on which even simple models perform well may have inherent data artifacts and therefore not be helpful for modelling plots; however, discrimination tasks that are so hard that all models have low accuracy are also likely to be unhelpful. We experiment with three different architectures for our rescorers. We start with n-gram baseline models to better judge the baseline difficulty of a given task and take artifacts of data creation into account; this is more informative than random-chance accuracy. We also experiment with augmented versions of the CNN architectures used in Holtzman et al. (2018), and with RoBERTa models (Liu et al., 2019), and find RoBERTa to have the best performance for each Aristotelian concept.
1. XGBoost with n-grams: We used n-grams in the range (1,4) as features and trained an XGBoost model with 200 gradient-boosted trees and a maximum tree depth of 5.

2. CNN with max-pooling: We used a CNN-based architecture (Holtzman et al., 2018) but augmented it with BART position and subword encodings, because our event tasks are ordered, so pooled or averaged representations that do not represent permutations differently would make the classes indistinguishable.

3. RoBERTa-large: RoBERTa (Liu et al., 2019) has shown excellent performance on various sentence-pair classification tasks. We expect this large pre-trained language model to be more effective at discriminating between a well-formed sequence of words and a poorer one. To this end we fine-tune RoBERTa-large with a classification-specific final layer as the final option for building rescorer models.
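As an illustration of the n-gram baseline (item 1), here is a minimal sketch using scikit-learn and XGBoost. The feature pipeline is an assumption consistent with the hyperparameters stated above (n-grams of size 1 through 4, 200 trees, depth 5), not the paper's exact code, and the two example strings are toy data.

```python
# pip install scikit-learn xgboost
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# Toy data: "prompt + plot" strings; label 1 = true plot, 0 = noised negative.
texts = ["prompt text <A0> ent0 <V> ran <A1> home",
         "prompt text <V> ran <A0> ent0 <A1> home"]
labels = [1, 0]

vectorizer = CountVectorizer(ngram_range=(1, 4))    # unigrams through 4-grams
features = vectorizer.fit_transform(texts)

clf = XGBClassifier(n_estimators=200, max_depth=5)  # as specified above
clf.fit(features, labels)

# Probability that a new plot is a positive (more plausible) example:
new = vectorizer.transform(["prompt text <A0> ent1 <V> hid <A1> home"])
print(clf.predict_proba(new)[:, 1])
```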
Accuracies for the different rescorer architectures, by aspect, are shown in Table 2. As we hypothesised from the nature of the tasks (Section 3.1), the inter-sentence shuffled task is more difficult because the noising is global; this is reflected in the barely-above-chance scores of the n-gram baseline. The high n-gram baseline performance on verb-shuffling shows that our suspicion that this task is easier was also correct. Intra-sentence shuffling was the only surprise: it turned out to be more difficult than we expected, and to have the largest gap between n-gram baseline and CNN performance. RoBERTa scores are high across the board, so we use RoBERTa for all models in the final system.

Mixture Weight Training and Ablations
We learn optimal weights for the rescorers online during decoding using a held-out validation set V, and use these weights during inference via sampling. We minimize the Margin Ranking Loss of the negative log probability of each validation sample between the gold (z) and hypothesised (ẑ) plot structures:

L(λ) = (1/n) Σ_{i=1}^{n} max(0, f_λ(x, ẑ)_i - f_λ(x, z)_i + μ)    (2)

where i indexes the word position, n denotes the plot length, μ is the margin, and f_λ is the same as in Equation 1, applied per word position; we train the λ weights with this objective. We train mixture weights both for combinations of rescorers and for ablations using each rescorer individually, to isolate the contribution of each one. Mixture weight training accuracy is in Table 3, which we report as Ranking Accuracy, the number of samples where the generation has higher probability than the gold. There we also include our automated plot metrics on the validation set for each ablation (further detail on those metrics is in Section 4.2). As Table 3 shows, Inter-event is the strongest of the individual rescorers, though all five together achieve the best performance. This seems to indicate that each method of creating negative event examples encodes a separate helpful piece of information, rather than any one of them alone being the best approach.
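A minimal sketch of this weight-learning step in PyTorch is shown below, assuming per-position language model log-probabilities and rescorer scores are available as tensors; the toy data, optimiser choice, and margin value are illustrative assumptions.

```python
import torch

n_rescorers = 5
lambdas = torch.zeros(n_rescorers, requires_grad=True)   # the mixture weights
optimizer = torch.optim.Adam([lambdas], lr=1e-2)
ranking_loss = torch.nn.MarginRankingLoss(margin=1.0)    # margin is an assumption

def f(lm_logprob, rescorer_scores):
    """Equation 1 at each word position: log p + sum_j lambda_j * a_j."""
    # lm_logprob: (n,) per-position log-probs; rescorer_scores: (n, n_rescorers)
    return lm_logprob + rescorer_scores @ lambdas

# Toy validation pair (gold plot, hypothesised plot), 8 positions each.
validation_pairs = [
    ({"lm": torch.randn(8), "scores": torch.randn(8, n_rescorers)},
     {"lm": torch.randn(8), "scores": torch.randn(8, n_rescorers)}),
]

for gold, hyp in validation_pairs:
    gold_score = f(gold["lm"], gold["scores"])
    hyp_score = f(hyp["lm"], hyp["scores"])
    # target = 1 means: the first argument should rank higher than the second.
    target = torch.ones_like(gold_score)
    loss = ranking_loss(gold_score, hyp_score, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```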

Experimental Setup
Dataset. We use the Writing Prompts dataset (Fan et al., 2018), a large collection of user-generated stories along with their associated prompts from Reddit, to benchmark our models. It is particularly suited to this task since it is both hierarchical (it contains pairs of titles and stories, which enables the use of a plot as an interim step) and contains many diverse long-form stories that are very challenging to learn to structure.

Baselines
We compare our generations to the two strongest recent story generation systems as well as two ablated versions of our own system.

Targeted Common Sense Grounding Model. Mao et al. (2019) propose a multi-task learning scheme to achieve quantitatively better common sense reasoning in pre-trained language models by leveraging auxiliary training signals from datasets designed to provide common sense grounding.
Knowledge-Enhanced Commonsense Model. To further capture causal and temporal dependencies between sentences in a reasonable story, Guan et al. (2020) employ multi-task learning that combines a discriminative objective to distinguish true and fake stories during fine-tuning.

Prompt to Story. This fine-tunes the BART model directly on the prompt and story pairs, without access to a plot structure.

Naive Plot. This utilizes a plot structure to write a story, but does not incorporate the Aristotelian Rescorer ensemble.

Metrics
We use automatic metrics to evaluate 1000 randomly selected prompts and their associated plots and stories. We report Unique Verbs and % Diverse Verbs for comparability to Fan et al. (2019); as they do, we identify verbs via https://spacy.io/. We again use the Vocab:Token ratio as a rough diversity metric. We also report inter-story and intra-story trigram repetition rates (Yao et al., 2019). The former is a diversity metric: if stories look fine but inter-story repetition is high, the language model has learned to tell only very similar stories even when conditioned on diverse prompts. Intra-story trigram repetition is a fluency metric, and measures the proportion of trigrams within a single story that are repeated. For all of these metrics, there is a tension between diversity metrics, which bring a generation closer to human quality, and fluency metrics, which can degrade as diversity increases.

Human Metrics. We run two separate experiments to measure improvement in our target areas of Relevance and Overall Quality. We postprocess stories for human review by detokenizing (with MosesDetokenizer; Koehn et al., 2007), removing special end-of-sentence tokens, and truncating to 250 (whitespace-separated) words. We have Mechanical Turk workers rate all systems' outputs on the same prompt comparatively on a Likert scale (1-5) across both metrics. (Workers were paid $12/hr and instructed to disregard punctuation and spelling; the surveys, as well as further details on compensation rates, are in Appendix C.) But since Likert scores are well known to exhibit a central-tendency bias, they are likely to be unreliable for distinguishing between systems that are close in performance, particularly as reading five long generations introduces significant cognitive load. Therefore, we further conduct pairwise comparisons between the top 3 systems in the Likert experiment.

Test Data Selection. In contrast to previous work on this dataset, our 110 human titles are randomly sampled from a filtered version of the test set. Writing Prompts has a one-to-many relationship between the prompts and stories: of the 303,358 prompts, only about a third (107,665) are unique. The dataset also contains an artifact of the sort of topic that is upvoted on Reddit (mostly aliens), so many test prompts are minor variations on the same topic. We hypothesise that some of the gap between reported performance in papers on this dataset and performance in the wild is due to the artificially high similarity between training and test prompts, so we randomly sample from the test set, but exclude prompts with extremely high lexical overlap with training prompts. (72% of prompts were excluded. We used SequenceMatcher, https://docs.python.org/3/library/difflib.html, and spaCy vector similarity, https://spacy.io/, to exclude prompts with a similarity of 1 to any prompt in the training data when stopwords are removed; the two methods gave identical results.)
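For concreteness, a minimal sketch of the trigram repetition metrics described above follows; the exact normalisation in the original papers may differ, so treat these formulas as an assumption.

```python
from collections import Counter

def trigrams(tokens):
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def intra_story_repetition(story_tokens):
    """Fluency proxy: proportion of a story's trigrams that are repeats."""
    grams = trigrams(story_tokens)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def inter_story_repetition(stories):
    """Diversity proxy: proportion of distinct trigrams shared across stories."""
    counts = Counter(g for s in stories for g in set(trigrams(s)))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / max(sum(counts.values()), 1)
```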

Results
Plots. As shown in Table 4, the Aristotelian plot brings the generated plot structure closer to the gold plot's Vocab:Token ratio, though there is still a large gap. This improvement comes at a slight expense in the number of entities per plot.

Stories. The extremely high intra-story repetition rates of the Guan et al. (2020) and Mao et al. (2019) systems are a result of those models sometimes deteriorating into nonsensical text, as both of those systems have extremely low human judgement scores. (Examples from each baseline system for the prompt in Table 1 are included in the appendix. Though we do observe a similar-magnitude increase in Diverse Verbs with the introduction of plots, our baseline Prompt-to-Story model has a higher percentage than their best model, reflecting the recent performance improvements in pretrained language models. We report Unique Verbs for comparability, but do not find it to be a useful metric, since it is not normalised by length or token count.) The Likert scores favor the Aristotelian Plot system with regard to Relevance but favor the Prompt to Story baseline for Overall Quality, though as can be seen from the variance in Figure 2a, all three BART systems are too close together to be reliably distinguished via Likert metrics. The pairwise comparisons in Figure 2b do differentiate the three systems with strong statistical significance, showing the superiority of Aristotelian Plot over both the Prompt to Story and Naive Plot systems with respect to both the Relevance and the Overall Quality of the final stories.

Analysis
We analysed the patterns in reported user confidence and found that 8% of prompts have low user confidence (<3) and 8% have high confidence (>4.5), so we look further into these as examples where the top three systems are minimally and maximally distinguishable. We include examples of outputs for these prompts for all three models in Appendix A.3.
For Overall Quality, there are no low confidence titles, and the Aristotelian plot system is preferred for all of the high confidence ones. In Relevance, the low confidence prompts have split win-rates across all models (essentially random), and the prompts show that all systems struggle with meta-level concepts. The Writing Prompts dataset varies from the concrete (A story of a cat who saves a girl from her depression) to prompts requiring other types of meta-knowledge. Prototypical low confidence titles are of two forms: 1) A day in this life in a world where everything is written in poetry, and 2) Write a story where I can't tell whether the protagonist is a hero or a villain. The type 1 stylistic-instruction prompt requires knowledge of the distinction between instructions for style and instructions for content, and all models fail. The type 2 meta-content prompt requires a finer level of control over the structure of the plot and story than the Aristotelian system or any of the other models can manage. This kind of prompt presents an interesting case for future story generation work, as it is simple enough for human authors to be popular on a forum, but far beyond the capabilities of current language models. On the high confidence Relevance prompts, the Aristotelian system wins all but one. Those stories highlight the way that adding and then improving on plot structures assists relevance by keeping a story on topic throughout an entire generation (see Appendix A.3, Table 8).
To assess the cases where the Aristotelian Plot did not improve over the BART baselines, we measure both word and verb incorporation rates, in Table 6. These measure the Levenshtein distance over the sequence of plot words (excluding verbs), or of plot verbs, that are included within the final story. (Yao et al. (2019) also use word incorporation rates, but theirs are not comparable to ours: we both report verbs as their own separate metric, due to their importance in our structure, and make this an ordered metric rather than a set intersection, which is necessary because our generated stories are much longer.) While the high incorporation rates show that the story model does utilize the plot, there is a gap between the current utilisation and the possible upper bound. The focus of our work is on improving plot generation, but we hypothesise that modifications to the story model to improve incorporation rates would further widen the performance gap between the three systems, as they would give the plot more influence over the story surface realisation and ensure that plot improvements appear downstream.
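The following is a minimal sketch of an ordered incorporation metric of the kind described above, using difflib's longest-matching-blocks machinery as a stand-in for the Levenshtein computation; the exact normalisation is an assumption.

```python
from difflib import SequenceMatcher

def incorporation_rate(plot_tokens, story_tokens):
    """Proportion of plot tokens that appear in the story in the same order."""
    matcher = SequenceMatcher(a=plot_tokens, b=story_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(plot_tokens), 1)

plot = ["ent0", "grabbed", "sword", "ent0", "fled", "castle"]
story = ["ent0", "grabbed", "the", "sword", "and", "fled", "the", "castle"]
print(incorporation_rate(plot, story))  # ordered overlap, not set intersection
```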
Related Work

Story Generation without Plots. Diverse efforts have focused on generating stories. Fan et al. (2018) re-purpose an approach from Neural Machine Translation to translate from a prompt to a story via Convolutional Seq2Seq models. Guan et al. (2020) and Mao et al. (2019) use a similar approach, but incorporate structured commonsense knowledge from external datasets or knowledge bases to improve a story generated from a prompt.

Story Generation with Plots. Riedl and Young (2010) use refinement search as a technique to balance between character and plot in solving the narrative generation problem. Li et al. (2013) use plot graphs for story generation that model the intended logical flow of events in the virtual world as a set of precedence constraints between plot events. Martin et al. (2018) decompose the problem of story generation into generation of successive events (event2event) followed by generation of natural language sentences from events (event2sentence). Yao et al. (2019) improve their LSTM-generated ROCStories by extracting keyword-based plot-like structures, or storylines, and using these in a pipeline to generate storylines and then stories. Fan et al. (2019) experiment with numerous techniques for representing story plots on the WritingPrompts dataset, and find Semantic Role Labelling with Entity Anonymization and Compression to work best. More recently, a latent variable model has been proposed to learn how to generate outlines for neural story generation.

Learning Story Aspects via Rescoring. Holtzman et al. (2018) generate continuation sentences from context sentences, introduce a mix of collaborative discriminators that each learn one Gricean maxim of conversation, and use them to rerank their RNN story model's output. Goldfarb-Tarrant et al. (2019) use those discriminators with the system of Yao et al. (2019) as part of a collaborative story-writing task with an LSTM and human writers. However, none of this work applies rescorers to the plot. There is no other work that uses discriminators or rescorers for plot structures, or that trains them based on different principles.

Conclusion and Future Work
We have shown that Content Planning via an interim plot structure representation can be combined with the use of rescoring models to inject Aristotelian story-writing principles into the plot structure. We found that this results in generated stories that are both more relevant and of higher quality than stories that are generated directly from prompts or that use plots without Aristotelian rescoring. Our findings also suggest future work on additional ways to incorporate story principles into plot generation. Although our Aristotelian plots improve over the naive plot, there remain gaps in quality between generated and gold plot structures. There is also further work to be done in investigating which story models are best able to utilise improved plots, which would enable plot improvements to be even more effective.

A.1 Data Quality in Mechanical Turk Studies
We require that Turkers doing pairwise story comparison report their confidence in their decisions, and we make clear that this has no effect on their remuneration; we then use patterns of confidence scores to find areas where models are very distinguishable and areas where they are not.
We additionally create manual True-False questions for each prompt used in human evaluation, to be used as attention checks. We found experimentally that extractive attention checks (this keyword is in the title) were ineffective, but manually created simple ones, such as The protagonist is a child, were very effective at filtering out poor quality responses. 15% of all responses failed attention checks; we excluded that data and reran those human evaluations until we had a full dataset. We additionally verified that none of our questions were overly difficult by manually reviewing all attention checks that fewer than 80% of respondents passed.

A.2 Dataset Statistics
The Writing Prompts dataset contains 303,358 pairs of prompts and stories. Stories are already tokenized in the available dataset (Fan et al., 2018). Like Fan et al. (2018), we truncate to a maximum length of 1000 words per story (the average is 700 words) and replace words that appear fewer than 10 times with an UNK token.
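A minimal sketch of this preprocessing, under the assumption that stories arrive as whitespace-tokenized strings, is shown below.

```python
from collections import Counter

MAX_LEN, MIN_COUNT = 1000, 10

def preprocess(stories):
    """Truncate stories to 1000 words and UNK rare words (count < 10)."""
    truncated = [s.split()[:MAX_LEN] for s in stories]
    counts = Counter(w for story in truncated for w in story)
    return [[w if counts[w] >= MIN_COUNT else "UNK" for w in story]
            for story in truncated]
```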
We split the dataset into Train/Validation/Test splits of 65%/10%/10% for fine-tuning the language models, with the remaining 15% split into 10% for fine-tuning the rescoring models and 5% for training mixture weights and generally running metrics and tests.

A.3 Example Outputs For High and Low Confidence Scores
All examples in Tables 7, 8, and 9 are unedited, save for the removal of special separation tokens and detokenization. Line breaks are added by the model.

A.4 Implementation Details
We fine-tune BART with the same hyperparameters mentioned in the fairseq BART summarization README (https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md), with the exception of MAX-TOKENS (the size of each mini-batch, in terms of the number of tokens).

Prompt: Earth has gained a new season, unlike anything currently known

Aristotelian Plot System Generated Story: The Earth is more beautiful than it thought! It's the most beautiful planet. The sun was shining bright, and the air was crisp, yet the air itself was cold. Everything was so bright. It was the perfect day for earth to be beautiful. Everyone on the planet Earth had lived a long and happy life, this planet has been home to the human race for thousands of years. The humans were a species that had been able to survive through Earth's harsh environment. the humans could survive in these harsh conditions. But one day Earth had a new life, and we were the first to notice it. In the middle of the night, we could see something, a small, dark cloud. A small, white cloud, it's size was almost invisible. Then that cloud grew. And it grew... The sun, shining bright.. There was no sun. All the light was black and it covered the sky. Every day the sky was a dark blue. No matter the time of day, it never seemed like the perfect time, it was a beautiful day. For a new season to begin. Yet no one could know why. Why did we not have a new cycle?"

Naive Plot System Generated Story: "It's not a new day, John. " " What? What is it? You ca n't see?" The man was silent. He said, "It'd be the first time we 've all seen its as we know it, but it's the first we're seen in over a million years. It is a new world, one with no sun. " A few seconds of silence." Its not a New Day." the man said. "It're a new season," said the woman. We' ve seen them all. I mean, we all know by now. You know, I 've-seen, you' ve always seen-" "We'll see us. But you'll be the First, you were the first one to see, we'll all be the Last, we will all be, we have all been the first of our species. There is no one more, no one is better than me." "The woman was right.

Prompt to Story Generated Story: I was in my bed, in my own room, in the middle of a forest, when the world stopped. It was the first time in my life that I had felt I had a sense of peace. My family and me, myself, were gone, we had all been taken from us. The only thing our family had left were the three children and my world, but my wife and I were the only survivors. There were we, the three of us. But ourselves, our children were all gone. And the only reason why me and our family survived was because ours had become ourselves. It was as if both of us were one with the world. As all the children were gone and the adults had left myself to fend for myself, you and your family. No longer were you and I to be each other. Now we were just two. All our memories of us, all ours. Nothingness. Only all of us, and my children, each of us...