Generating Narrative Text in a Switching Dynamical System

Early work on narrative modeling used explicit plans and goals to generate stories, but the language generation itself was restricted and inflexible. Modern methods use language models for more robust generation, but often lack an explicit representation of the scaffolding and dynamics that guide a coherent narrative. This paper introduces a new model that integrates explicit narrative structure with neural language models, formalizing narrative modeling as a Switching Linear Dynamical System (SLDS). An SLDS is a dynamical system in which the latent dynamics of the system (i.e., how the state vector transforms over time) are controlled by top-level discrete switching variables. The switching variables represent narrative structure (e.g., sentiment or discourse states), while the latent state vector encodes information on the current state of the narrative. This probabilistic formulation allows us to control generation, and the model can be learned in a semi-supervised fashion using both labeled and unlabeled data. Additionally, we derive a Gibbs sampler for our model that can "fill in" arbitrary parts of the narrative, guided by the switching variables. Our filled-in (English language) narratives outperform several baselines on both automatic and human evaluations.


Introduction
A narrative is a textualized sequence of events that serves as a coherent outline for an actual story (Prince, 2003). Effective narratives are typically built on top of higher-level narrative scaffolds (we use the term scaffold as an umbrella term covering the many types of plans and structures that underlie stories), which specify at an abstract level how the story should evolve along different dimensions. Example scaffolds include descriptions of the emotional trajectory of a story (Vonnegut, 1981; Reagan et al., 2016), the goals of characters throughout
* Author now at Microsoft.
Tom didn't know why his internet speed was so slow.
Tom wasn't sure what to do with his computer.
He thought he would fix it himself.
Tom was surprisingly good.
Tom was happy to be surfing the internet again.

Table 1: A sample filled-in narrative generated by our SLDS model given the first and last sentences as input (grayed out); the middle 3 sentences are imputed by our model (bold).
the story (Meehan, 1977; Turner, 1993), or the abstract types of events that may occur (Martin et al., 2018). The parts of a scaffold are generic and, like Propp's originally proposed narrative functions (Propp, 1928), can be reused across stories. To be fully reusable, one needs to go beyond just identifying what the elements of the narrative scaffold are, and also indicate how each scaffold element changes the properties of the current story state. We refer to the explication of these transformations/transitions as the narrative dynamics.
Prior work on automatic narrative generation has a rich history of modeling both narrative scaffolds and narrative dynamics (Meehan, 1977; Lebowitz, 1985; Turner, 1993; Riedl and Young, 2006, 2010a). Modeling both scaffold and dynamics often imbued these systems with a greater degree of user control in generating stories, allowing users to flexibly specify desired outcomes or plot points (or, more generally, states of the narrative) that should be achieved at certain sections of the story. Constrained generation (for example, constraining the story to start and end with particular sentences, as in Table 1) is an ability these systems often gained for free through the modeling of dynamics.
Though successful in this regard, these systems realized their success only in closed domains, where the narrative scaffolds can be specified in a limited ontology and the dynamics operations can be written by hand (e.g., the action schemata of Riedl and Young (2010a)). Neural generation has since helped scale to open domains (Roemmele and Gordon, 2015; Khalifa et al., 2017), but not with the same level of control over the narrative. Several recent works have looked at adding the narrative scaffolding component back into neural text generation systems (Martin et al., 2018; Yao et al., 2019; Xu et al., 2018; Fan et al., 2019). These systems, however, still do not utilize an explicit model of narrative dynamics, and are thus restricted in controllability.
In this work, we show how the insight of modeling the structure of a narrative along with general-purpose dynamics can be combined with modern neural language models. We do this by explicitly modeling the narrative state with a latent vector, and modeling how this state transforms over time as a Switching Linear Dynamical System (SLDS). We show how this formulation captures the concepts of narrative dynamics and scaffolds in a way compatible with current neural generation systems. Finally, we show that, by explicitly modeling these dynamics, our models gain the ability to "fill in" narratives, without being explicitly trained to do so and with no further training required. We evaluate our model with both human evaluation and several automatic measures (all evaluation in this paper is done on English text data) and show that our model outperforms several strong baselines.

A Switching Dynamical System for Narrative Generation
In this section, we give a brief overview of Switching Dynamical Systems and how they can be used to capture both a scaffold of the narrative and the narrative dynamics. We then describe in detail the components of our model and its relation to existing models.

Narrative Dynamics in a Dynamical System
The specifics of the narrative (characters, setting, etc.) will differ between stories, but as Propp (1928) notes, the way they transition to the next point in the narrative (what we refer to as "narrative dynamics") is often shared. Let's say that, as is often done, we represent the 'narrative specifics' at time step i with a latent vector Z_i. A natural way to explicitly model how this state evolves over time, in keeping with the above observation, is as a Linear Dynamical System:

Z_{i+1} = A Z_i + ε,  ε ∼ N(0, Σ)

where A is a matrix shared across all narratives, and Σ is a noise term that accounts for the idiosyncrasies different narratives will have. The fact that the shared transition matrix A is linear means that narratives will have linearly analogous trajectories through time, despite having different details (comparable to stories with different settings but matching structures, such as Ran/King Lear, Ulysses/Odyssey, etc.). Of course, the fatal flaw of this model is that it assumes there exists only one transition matrix, and thus only one possible way to transition through a narrative!
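As a concrete sketch, the LDS transition above can be simulated in a few lines of NumPy. The matrix A, noise covariance Σ, and dimensionality here are toy placeholders, not the trained model's parameters:

```python
import numpy as np

def lds_rollout(A, Sigma, z0, T, rng):
    """Roll out Z_{i+1} = A Z_i + eps, eps ~ N(0, Sigma), for T steps."""
    zs = [np.asarray(z0, dtype=float)]
    d = len(z0)
    for _ in range(T - 1):
        eps = rng.multivariate_normal(np.zeros(d), Sigma)  # per-narrative noise
        zs.append(A @ zs[-1] + eps)                        # shared linear dynamics
    return np.stack(zs)
```

Every narrative simulated this way follows the same linear transformation A, differing only through the noise term, which is exactly the limitation noted above.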

Narrative Scaffolds as Switching Variables
A more fitting model would thus be a Switching Linear Dynamical System (Ackerson and Fu, 1970; Chang and Athans, 1978; Murphy, 1998). In an SLDS, we assume there exists a set of K different sets of dynamics. At time step i+1, one of these sets of dynamics is used, depending on the value of a discrete variable at time step i+1 called the switching variable, S_{i+1} ∈ {1, ..., K}:

Z_{i+1} = A_{S_{i+1}} Z_i + ε,  ε ∼ N(0, Σ_{S_{i+1}})

There is a switching variable S_i associated with each time step. The switching variable value itself evolves over time by a prior Markov process, P(S_{i+1} | S_i). This top-level chain of switching variables thus forms our narrative scaffold, indicating what transitions we must go through in the narrative, with the dynamics matrices indicating how we transition.
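The generative story above (sample the scaffold from the Markov prior, then apply the corresponding dynamics) can be sketched as follows; the matrices, covariances, and transition prior are again toy placeholders:

```python
import numpy as np

def slds_rollout(As, Sigmas, P, z0, s0, T, rng):
    """Sample a scaffold S_1..S_T from the Markov prior P(S_{i+1}|S_i), then
    latent states Z_{i+1} = A_{S_{i+1}} Z_i + eps, eps ~ N(0, Sigma_{S_{i+1}})."""
    d = len(z0)
    zs, ss, s = [np.asarray(z0, dtype=float)], [s0], s0
    for _ in range(T - 1):
        s = rng.choice(len(P), p=P[s])          # switching variable moves first
        eps = rng.multivariate_normal(np.zeros(d), Sigmas[s])
        zs.append(As[s] @ zs[-1] + eps)         # dynamics chosen by the scaffold
        ss.append(s)
    return np.stack(zs), ss
```

Unlike the single-matrix LDS, each step can now follow one of K different transformations, selected by the scaffold.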

Narrative Scaffold: Emotional Trajectory
What the switching variables actually represent can be chosen by the user. Straightforward narrative scaffolds include event sequences (Martin et al., 2018), keywords (Yao et al., 2019), or latent template ids (Wiseman et al., 2018). More complex but potentially more informative scaffolds may be created using concepts such as story grammar nonterminals (Lakoff, 1972; Thorndyke, 1977), or character actions taken throughout a story (Riedl and Young, 2010b).
In our work, we use the sentiment trajectory of the narrative as the scaffold. That is, each S_i for a sentence indicates the overall coarse sentiment of the sentence (Positive, Negative, or Neutral). Though simple, the overall sentiment trajectory of a narrative is important in defining the high-level 'shape' of a narrative, often shared among different narratives (Vonnegut, 1981; Reagan et al., 2016). Furthermore, the sentiment trajectory has been shown to be fairly useful in story understanding tasks (Chaturvedi et al., 2017; Liu et al., 2018). We discuss in the conclusion future directions for using different types of scaffolds.

The Full Model
The final component of the model is a conditional language model that generates sentence i conditioned on the current Z_i and all previous sentences, X_{:i}. Generation continues until an <eos> is reached. This conditional language model may be parameterized as desired, but in this work we parameterize it as an RNN language model.

Figure 1: SLDS generative model. S_i is a discrete state (the sentiment of a sentence in a multi-sentence narrative). Z_i is a continuous latent vector that is conditioned on to generate the ith sentence in the narrative, X_i. The dynamics of the narrative are completely captured in the dynamical system controlling the latent vector Z. How to transition from Z_i to Z_{i+1} is determined by the state variable S_{i+1}. Arrows from X_i to X_{i+2} have been left out for clarity.
The graphical model for our SLDS is pictured in Figure 1. The model consists of three sets of variables: (1) switching variables S_1, ..., S_N; (2) latent state variables Z_1, ..., Z_N, capturing the details of the narrative at each sentence; (3) the sentences themselves, X_1, ..., X_N, where each sentence X_i has n_i words, x_i^1, ..., x_i^{n_i}. The joint over all variables factorizes into the following components (X_{:i} stands for all sentences before X_i):

P(S, Z, X) = ∏_i P(S_i | S_{i−1}) P(Z_i | Z_{i−1}, S_i) P(X_i | Z_i, X_{:i})

Narrative Scaffold Planner: The factor P(S_i | S_{i−1}) is a transition matrix, which we calculate via count-based statistics from training. It is fed in as prior knowledge and fixed.
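A minimal sketch of the count-based scaffold planner: estimating P(S_i | S_{i−1}) from tagged training sequences. The add-one smoothing here is our assumption for illustration; the paper does not specify a smoothing scheme:

```python
def transition_matrix(tag_seqs, K, smoothing=1.0):
    """Estimate P(S_i | S_{i-1}) from scaffold tag sequences by counting
    adjacent pairs, with additive smoothing (an illustrative choice)."""
    counts = [[smoothing] * K for _ in range(K)]
    for seq in tag_seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # normalize each row into a conditional distribution
    return [[c / sum(row) for c in row] for row in counts]
```

The resulting matrix is fixed before training the rest of the model, matching the description above.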
Narrative Dynamics Network: The factor P(Z_i | Z_{i−1}, S_i) is determined like a switching linear dynamical system:

Z_i = A_{S_i} Z_{i−1} + B_{S_i} ε,  ε ∼ N(0, I)

which is equivalent to drawing Z_i from a Normal distribution with mean A_{S_i} Z_{i−1} and variance B_{S_i} B_{S_i}^T.

Conditional Language Model: The factor P(X_i | Z_i, X_{:i}) is parameterized by an RNN language model conditioned on the latent Z_i.

Learning and Posterior Inference
Due to the conditionals parameterized by neural networks, we use amortized variational inference in a manner similar to Variational Autoencoders (Kingma and Welling, 2013), both to learn an approximate posterior q(S, Z|X) and to learn the generative model parameters by maximizing a lower bound on the data likelihood (ELBO). We assume that the approximate posterior factorizes as follows:

q(S, Z | X) = ∏_i q(S_i | X) q(Z_i | Z_{i−1}, S_i, X)

As in VAEs, computing these individual factors is done through a parameterized function called the inference or recognition network, whose parameters are trained jointly with the generative model. In our case there are two forms for the factors in our posterior: (1) the first form, q(S_i | X) = q_{S_i}, is parameterized by a classifier that takes in the set of sentences X and outputs a categorical distribution over the switching variables; (2) the second form, q(Z_i | Z_{i−1}, S_i, X) = q_{Z_i}, is parameterized by networks that output the mean and variance, respectively, of a Gaussian over Z_i.
Borrowing terminology from VAEs, the approximate posterior (the factors given above) acts as an 'encoder', while the generative model from the previous section can be seen as the 'decoder'. This type of training has been previously used in (Krishnan et al., 2015, 2017; Fraccaro et al., 2016, 2017; Karl et al., 2017).

Lower bound formula & exact training algorithm
As mentioned previously, we optimize all parameters (including the variational factor functions) by optimizing a lower bound on the data likelihood. The model may be trained either with supervision labels for the switching states (in our case, sentiment labels) or without supervised labels. If one is training without the sentiment labels, then the lower bound on the marginal likelihood (and thus our optimization objective) may be written as follows:

log P(X) ≥ E_q [ Σ_i log P(X_i | Z_i, X_{:i}) ] − E_q [ Σ_i KL( q(Z_i | Z_{i−1}, S_i, X) || P(Z_i | Z_{i−1}, S_i) ) ] − E_q [ Σ_i KL( q(S_i | X) || P(S_i | S_{i−1}) ) ]

The derivation for this objective is identical to that found in Krishnan et al. (2017) and Fraccaro et al. (2016), and simply relies on properties of iterated expectations. All expectations are estimated with Monte Carlo samples.
If training with the sentiment labels S_1, ..., S_N, then the objective is similar (but without the sampling of the switching states), and is augmented with an additional supervision objective, as done in Kingma et al. (2014). The final training procedure for a single narrative is: (1) For each sentence (starting from the first), sample the switching state S_i from q(S_i | X).
(2) For each sentence (starting from the first), sample the latent Z_i from q(Z_i | S_i, Z_{i−1}, X). (3) Evaluate the data likelihood and KL term(s) with these samples. (4) Take the gradients of the objective w.r.t. all parameters, using the reparameterization trick for q_{Z_i} (Kingma and Welling, 2013) or the Gumbel-Softmax trick for q_{S_i} (Jang et al., 2017), and optimize.
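For illustration, a NumPy sketch of the Gumbel-Softmax relaxation used for q_{S_i} (forward sampling only; a real implementation would live inside the autodiff framework so gradients flow through the relaxed sample):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed categorical sample via the Gumbel-Softmax trick (Jang et al., 2017)."""
    u = rng.uniform(low=1e-10, high=1.0, size=len(logits))
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (np.asarray(logits) + g) / tau       # temperature tau controls sharpness
    e = np.exp(y - y.max())                  # numerically stable softmax
    return e / e.sum()
```

As tau approaches 0 the output approaches a one-hot sample; larger tau gives a smoother, lower-variance relaxation.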

Interpolations via Gibbs Sampling
One of the benefits of a probabilistic formulation is the possibility (provided an inference procedure can be found) of generating narratives with specific constraints, where the constraints may be specified as clamped variables in the model. In this section, we show how narratives may be generated conditioned on arbitrary bits and pieces of the narrative already filled in, using approximate Gibbs sampling. This allows one to, for example, interpolate a narrative given the first and the last sentence (similar to how earlier story-generation systems were able to generate with a given end goal in mind). Some example interpolations generated by our system can be found in Table 3. We give the equations and summarize the algorithm in the next sections.

Conditionals for Gibbs Sampling
For our Gibbs sampling algorithm, we give the narrative scaffold (switching variables) S_1, ..., S_T ∈ S and a set of observed sentences X^+ as inputs to the system. This may be any set of sentences (the first and last, just the second sentence, etc.). We wish to find values for the unobserved sentences in the set X^− by sampling from P(X^−, Z_1, ..., Z_T | S, X^+). We perform this sampling via Gibbs sampling, for which two different forms of conditionals need to be derived: one over Z_i conditioned on everything else, and one over X_i conditioned on everything else.
One can show (see the derivation in the supplemental material) that the distribution over Z_i conditioned on everything else is approximately proportional to a Gaussian obtained from the product of the two dynamics factors involving Z_i, P(Z_i | Z_{i−1}, S_i) P(Z_{i+1} | Z_i, S_{i+1}). To find the second conditional, one can use the d-separation properties of the graph to find that it is proportional to P(X_i | X_{:i}, Z_i) ∏_{j>i} P(X_j | X_{:j}, Z_j). These terms are simply factors of our conditional language model, and both can thus be evaluated easily. In theory, one could use this fact to sample the original conditional via Metropolis-Hastings. Unfortunately, we found this approach too slow in practice. We observed that the simple heuristic of deterministically assigning X_i to the greedy decoded output of the conditional language model P(X_i | X_{:i}, Z_i) works well, as evidenced by the empirical results. We leave it for future work to research different conditional language model parameterizations that allow easy sampling from this conditional.
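The Gaussian conditional over Z_i can be computed in closed form as a product of two Gaussians. A small NumPy sketch, with generic placeholder matrices standing in for the trained dynamics:

```python
import numpy as np

def z_conditional(z_prev, z_next, A_in, S_in, A_out, S_out):
    """Gaussian over Z_i proportional to N(Z_i; A_in z_prev, S_in) times the
    likelihood of z_next under N(A_out Z_i, S_out), via completing the square."""
    P_in, P_out = np.linalg.inv(S_in), np.linalg.inv(S_out)
    prec = P_in + A_out.T @ P_out @ A_out                  # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (P_in @ (A_in @ z_prev) + A_out.T @ P_out @ z_next)
    return mean, cov
```

With identity dynamics and unit covariances, the result reduces to averaging the two neighboring states, which is a useful sanity check on the algebra.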

Gibbs Sampling Interpolation Overview
The variables in the Gibbs sampler are first initialized using some heuristics (see the supplemental material). After initialization, performing interpolations with Gibbs sampling follows a two-step process: first, for each Z_i, sample a value Z′ from equation (1) and set Z_i to Z′. Then, for each X_i in X^−, find a new value for X_i by running greedy decoding with the conditional language model.
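The two-step procedure above can be sketched as the following outer loop, where `sample_z` and `decode_x` are hypothetical callables standing in for the model's Gaussian conditional and greedy decoder:

```python
def gibbs_interpolate(zs, xs, observed, n_iters, sample_z, decode_x):
    """Alternate between resampling every latent Z_i from its conditional and
    greedily re-decoding each missing sentence; clamped sentences stay fixed."""
    for _ in range(n_iters):
        for i in range(len(zs)):
            zs[i] = sample_z(i, zs, xs)        # Z_i | everything else
        for i in range(len(xs)):
            if i not in observed:              # only resample missing sentences
                xs[i] = decode_x(i, zs, xs)    # greedy decode of X_i
    return zs, xs
```

Clamping the observed sentences is what lets the sampler interpolate between, e.g., a given first and last sentence.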

Dataset and Preprocessing
We use the ROCStories corpus introduced in Mostafazadeh et al. (2016). It contains 98,159 short commonsense stories in English for training, and 1,570 stories each for validation and test. Each story in the dataset has five sentences and captures causal and temporal commonsense relations. We limit our vocabulary size to 16,983 based on a per-word frequency cutoff of 5. For sentiment tags, we automatically tag the entirety of the corpus with the rule-based sentiment tagger Vader (Hutto and Gilbert, 2014), and bucket Vader's polarity scores into three tags: neutral, negative, and positive. These tags form the label set of the S variables in our SLDS model. We tokenize the stories with the spaCy tokenizer (Honnibal and Montani, 2017). Each sentence in the input narrative has an <eos> tag, except for the S2S model discussed below.
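The bucketing step can be sketched as below; the ±0.05 cutoffs on Vader's compound score follow Vader's commonly suggested defaults, which is our assumption here rather than a detail stated in the paper:

```python
def bucket_sentiment(compound, pos_thresh=0.05, neg_thresh=-0.05):
    """Bucket a Vader compound polarity score in [-1, 1] into three tags.
    The threshold values are illustrative assumptions, not the paper's."""
    if compound >= pos_thresh:
        return "positive"
    if compound <= neg_thresh:
        return "negative"
    return "neutral"
```

Applied to every sentence in the corpus, this yields the label set for the S variables.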

Switching Linear Dynamical System (SLDS)
The SLDS has RNN encoder and decoder networks with single-layer GRU cells of hidden size 1024 and an input embedding size of 300. We train the model using Adam with the PyTorch defaults. We stop training when the validation loss does not decrease for 3 consecutive epochs. Training details for all models and baselines are the same as above unless otherwise mentioned.

Baselines
Language Model (LM): We train a two-layer recurrent neural language model with GRU cells of hidden size 512.

Sequence-to-Sequence Attention Model (S2S)
We train a two-layer neural sequence-to-sequence model equipped with a bilinear attention function, with GRU cells of hidden size 512. Sentiment tags for a narrative (one for each sentence) are given as input to the model, and the corresponding sentences are concatenated together as the output, with only one <eos> tag at the end. This model is trained with dropout of 0.1. It is comparable to the static model of Yao et al. (2019), and to other recent works employing a notion of scaffolding in neural generation (albeit adapted for our setting).

Linear Dynamical System (LDS)
We also train a linear dynamical system, as discussed in Section 2.1, as one of our baselines for fair comparison. Apart from having just a single transition matrix, this model has the same architectural details as SLDS.

Semi-Supervised SLDS (SLDS-X%)
To gauge the utility of semi-supervision, we also train semi-supervised SLDS models with varying amounts of labelled sentiment tags, unlike the original model, which uses 100% tagged data. We refer to these as SLDS-X%, where X is the percentage of labelled data used for training: 1%, 10%, 25%, and 50%.

Evaluations
As described above, our model is able to perform narrative interpolations via an approximate Gibbs sampling procedure. At the core of our evaluations is thus a fill-in-the-sentences task: we provide 1 or 2 sentences and require the model to generate the rest of the narrative. We evaluate this via automatic evaluations as well as crowd-sourced human evaluations. We also report perplexity to evaluate the models' ability to fit the data. Lastly, we check whether the transitions learned by the SLDS models capture what they are intended to capture: does using the transition matrix associated with a sentiment tag (positive/negative/neutral) lead to a generated sentence with that sentiment?

Generating the Interpolations
For the SLDS models, the interpolations are generated via the Gibbs sampling algorithm described earlier.
In all experiments for the SLDS models, we draw 50 samples (including burn-in samples) and output the interpolation that maximizes the probability of the given sentence(s). Since the baselines do not have a mechanism for interpolation, we simulate 'interpolations' for them: we draw 1000 samples using top-k truncated sampling (with k=15), conditioned on the given initial sentences if available. We then output the sample that maximizes the probability of the clamped sentences around which we are interpolating. We allow the S2S model access to the gold sentiment tags.
To give a lower bound on the performance of the SLDS model, we do not provide it with gold tags. We instead provide it with the semi-noisy tags output from q(S_i | X).

Automatic Evaluation of Interpolations
We automatically evaluate four different types of interpolations (where different combinations of sentences are removed and the model is forced to regenerate them), using the ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) metrics with the true sentences as targets. Table 2 shows the automatic evaluation results for interpolations from our proposed models and baselines. The #Sent(s) column indicates which sentence(s) were removed and then regenerated by the model. We gave the baselines a slight edge over SLDS, as they pick the best of 1000 samples while SLDS picks the best of only 50. The SLDS models see their largest gain over the baseline models when at least the first sentence is given as input; the baseline models do better when the first and second sentences need to be imputed. This is likely because access to the earlier sentences allows a better initialization for the Gibbs sampler. Surprisingly, the semi-supervised variants of the SLDS models achieve higher scores. The reasons for this are discussed below in the Perplexity section.

Annotation Scheme
As automatic evaluation metrics are not sufficient to assess the quality of a creative task such as narrative generation, we measure the quality of the generations through human evaluation of 200 stories on the Amazon Mechanical Turk platform. We provided Turkers with two generated narratives from two different models, each with five sentences. The first and last sentences were fed to each model as input, and the middle three sentences were generated. Each pair of narratives is graded by 3 users, each with two tasks: (1) to rank on a scale of 0-3 each of the sentences except the first one on the basis of its coherency with the previous sentence(s), and (2) to compare and rank the two narratives based on their overall coherency, i.e., how well the story connects the starting/ending sentences.

Table 2: F1 scores for ROUGE-1, 2, and L and METEOR (M) (default mode score) for 500 stories randomly sampled from the test set. The #Sent(s) column represents the "fill in" sentence(s) that the models generated using Gibbs sampling. Our SLDS models pick the best of 50 samples; the baseline models pick the best of 1000 samples.

Human Evaluation Results
[Table 3: sample interpolations generated by the models, given the first and last sentences of each narrative as input.]

Language Modeling Perplexity Score
As our models are essentially language models, we evaluated their per-sentence negative log-likelihood and per-word perplexity scores, which can be viewed as an indirect measure of how well a system works as a generative model of narrative text. For the SLDS and LDS models these scores are approximations, an upper bound (the negative of the ELBO) on the actual values. For the other two models the scores are exact. A good model should assign low perplexity scores to its test set. In Table 5, SLDS achieves the lowest scores, implying that it is able to model the data distribution well. In Table 6 we also calculate the perplexity scores for the semi-supervised SLDS models to assess the effectiveness of semi-supervised training. Surprisingly, the models with less supervision scored better in terms of perplexity. One possibility for this might be the use of the soft Gumbel-Softmax in the semi-supervised models.
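For reference, per-word perplexity is just the exponentiated average negative log-likelihood (in nats) over the test set:

```python
import math

def perplexity(total_nll_nats, n_words):
    """Per-word perplexity from a corpus-level negative log-likelihood in nats."""
    return math.exp(total_nll_nats / n_words)
```

For SLDS and LDS, plugging in the negative ELBO in place of the exact negative log-likelihood yields an upper bound on the true perplexity, as noted above.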

Evaluation of Transition Dynamics
One matter of interest is whether or not the transitions are capturing what they are supposed to capture: the appropriate sentiment. Since we used the sentiment tagger Vader for training tags, we again utilize it to evaluate whether using transitions of a certain sentiment actually leads the model to produce outputs with that sentiment. To perform this evaluation, we give as input to our models (and the S2S baseline) the sentiment tags for a sentence and allow them to generate a sentence conditioned on these sentiment tags. We then tag the generated sentences with Vader and see if the sentiment tags match the originals. We calculate the F1 score across all sentiment tags and report the macro average. In Table 7 we see that having labels is incredibly important for meaningful transitions.
There is a large drop in F1 as the amount of labels given to the model is decreased. The SLDS model trained with 100% of the labels performs slightly better than even S2S, despite not having direct access to the sentiment labels (SLDS only uses the sentiment labels to decide which transition to use, while the S2S model attends directly to the sentiment labels).
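The macro-averaged F1 over sentiment tags can be computed as follows; this is a generic sketch, not tied to any particular tagger's output format:

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1: per-label F1 scores averaged with equal weight."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Macro averaging weights the three sentiment classes equally, so a model cannot score well by predicting only the majority (neutral) tag.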

Related Work
Story/narrative generation has a rich history in the field of AI. Many early systems were based on structured formalisms for describing common narrative structures (Lakoff, 1972; Thorndyke, 1977; Meehan, 1977), many being inspired by the initial work of Propp (1928). There has been a swath of recent work looking to add some semblance of a 'narrative scaffold' back into generation methods (Martin et al., 2018; Yao et al., 2019; Xu et al., 2018). Many of these methods work as conditional LMs (conditioned directly on the scaffold). This line of work may be combined with our formalization as well, by additionally conditioning the generation on the switching state, as done in the model of Barber (2006). Recent work by Tambwekar et al. (2019) has goals similar to ours in permitting more controllability in generation systems, developing an RL-based system that allows users to specify an end goal for a story (by specifying the event class that is desired to appear at the end). Their work differs from ours in that it does not deal with text directly, modeling only the sequences of events in the narrative. It may be possible to utilize this model as the scaffolding component in our model (utilizing their RL policy for the scaffold planner, rather than the simple count-based transition matrix used here).

Conclusion and Future Work
In this paper, we formulated the problem of narrative generation as a switching dynamical system. We showed how this formulation captures notions important in narrative generation, such as narrative dynamics and scaffolds. We developed an approximate Gibbs sampling algorithm for the model that permits the system to generate interpolations conditioned on arbitrary parts of the narrative, and evaluated these interpolations using both human and automatic evaluations. Though in this work we used sentiment tags for our scaffolds/switching variables, future work may look at utilizing different kinds of information to guide the generation of narratives. Utilizing the main predicate of a sentence as a scaffold would be a logical next step, and may prove more informative than the sentiment trajectory. A scaffold such as this can take on many more possible values than a sentiment tag, and as such, it may prove difficult to assign a set of dynamics to each value. Another avenue for future work would deal with this possible problem. One potential solution could be to associate each switching variable value with a (learned) vector in a probability simplex, and use this vector to combine a small set of "primitive" dynamics matrices in order to get that value's associated set of dynamics.

Table 8: F1 scores for ROUGE-1, 2, and L and METEOR (M) (default mode score) for 500 stories randomly sampled from the test set. The #Sent(s) column represents the "fill in" sentence(s) that the models generated using Gibbs sampling. Our SLDS models pick the best of 50 samples; the baseline models pick the best of 1000 samples.

Derivation of the Gibbs Conditional for Z_i
We can write the distribution of some Z_i given everything else as proportional to the product of the two Gaussian factors involving Z_i, P(Z_i | Z_{i−1}, S_i) and P(Z_{i+1} | Z_i, S_{i+1}). The rest can be derived by taking the PDFs of the two Gaussian densities above, getting rid of constants that don't depend on Z_i, multiplying them together, and completing the square to obtain the numerator of a Gaussian over Z_i (such that Z_i appears nowhere else in the equation).
This numerator can then be multiplied by the normalizing constant (which does not depend on Z_i) to obtain exactly a Gaussian PDF with the mean and variance given below (writing Σ_S = B_S B_S^T for the dynamics covariance):

Σ* = (Σ_{S_i}^{−1} + A_{S_{i+1}}^T Σ_{S_{i+1}}^{−1} A_{S_{i+1}})^{−1}

μ* = Σ* (Σ_{S_i}^{−1} A_{S_i} Z_{i−1} + A_{S_{i+1}}^T Σ_{S_{i+1}}^{−1} Z_{i+1})

Manual Error Analysis of Generations
We manually evaluate the sentences of 20 generated narratives that were not considered coherent by Turkers. We find that, at the broader story level, 16 of the 20 stories fail to adequately connect the ending with the previous sentences. In 14 of the 20 stories, mildly off-topic sentences are introduced: they align with the main topic of the story but break local coherency (i.e., coherency with the previous and next sentences). At the narrower sentence level, only 9 out of 60 generated sentences are ungrammatical to the point of failing to deliver their meaning.

Initializing the Gibbs Sampler
Initializing Z: We first initialize the Z variables in the sampler, in increasing order of i. If sentence X_i is provided as input, then we sample from the approximate posterior q_{Z_i} in order to initialize Z_i. If X_i is missing, then we sample using the dynamics distribution P(Z_i | Z_{i−1}, S_i). Since we initialize in increasing order, we are guaranteed to already have Z_{i−1}.
Initializing X: We next initialize the missing text. Initializing a missing sentence X_i is done by greedy decoding from the language model, conditioned on Z_i and the previous sentences.

Automatic Evaluation Results with Gold Labels
Below, in Table 8, we provide the results for the automatic experiments with gold labels fed to our method. We find a sizable increase in performance when providing the gold labels; in the main experiments we therefore use the automatic labels as noisy proxies in order to establish a lower bound on performance.