Modeling Protagonist Emotions for Emotion-Aware Storytelling

Emotions and their evolution play a central role in creating a captivating story. In this paper, we present the first study on modeling the emotional trajectory of the protagonist in neural storytelling. We design methods that generate stories that adhere to given story titles and desired emotion arcs for the protagonist. Our models include Emotion Supervision (EmoSup) and two Emotion-Reinforced (EmoRL) models. The EmoRL models use special rewards designed to regularize the story generation process through reinforcement learning. Our automatic and manual evaluations demonstrate that these models are significantly better at generating stories that follow the desired emotion arcs compared to baseline methods, without sacrificing story quality.


Introduction
Stories are an integral part of human culture. They allow us to express emotions, share knowledge, and shape our perspective of the world (McKee, 2003). Stories are made interesting through emotions that connect the characters, their motivations, goals, and achievements (Vonnegut, 1981).
Cognitive scientists have pinpointed the central role of emotions in storytelling (Parkinson and Manstead, 1993; Hogan, 2011). Early automatic storytelling systems based on symbolic planning also showed that addressing character emotions for plot construction resulted in more diverse and interesting stories (Theune et al., 2004; Pérez y Pérez, 2007; Méndez et al., 2016). However, these studies were rule-based and limited to small-scale data. The advent of deep learning has shifted computational storytelling efforts towards neural methods (Martin et al., 2018; Yao et al., 2019). However, despite the broad recognition of its importance, neural story generation methods have not explored the modeling of emotional trajectory.

Figure 1: An example story generated by our model.
Title (input): Raw burger
Emotion arc (input): joy → anger → sadness
Story (output): Tom went to a burger place with his friends. He ordered a burger. When he got it, he noticed that it was raw. Tom yelled at the waiter for it being raw. He was really disappointed.

In this paper, we present the first study to take into account the emotional trajectory of the protagonist in neural story generation. Research in cognitive science has shown that while comprehending narratives, readers closely monitor the protagonist's emotional states (Komeda and Kusumi, 2006; Gernsbacher et al., 1992). However, emotions experienced by the protagonist might differ from the general emotions expressed in the story. For example, the general emotion of "My boss was very angry and decided to fire me." is anger, but the narrator's emotional reaction would be to feel upset. At any point in a story, we represent the protagonist's emotions using a set of basic emotions. The theory of basic emotions is well-accepted in psychology, but there is little consensus about the precise number of basic emotions. Plutchik (1982) proposed 8 primary emotions, and Ekman (1992) first proposed 7 and then changed to 6 basic emotions. Following recent theories (Jack et al., 2014; Gu et al., 2016), we choose anger, fear, joy, and sadness to describe the protagonist's emotions. We additionally include neutral to account for cases with no strong emotions. We refer to these 5 emotions as basic emotions.
Moreover, emotions evolve within a narrative. For modeling the evolving emotions of the protagonist, we define an emotion arc for a story. Our definition is inspired by Prince's change-of-state formalization (Prince, 2009), which asserts that stories are about change. According to this theory, a story has three components: a starting state; an ending state; and events that translate the starting into the ending state. Motivated by this, we define the emotion arc as a sequence of three basic emotions that describe the starting, body, and ending emotional states of the protagonist.
Given a story title and the emotion arc of the protagonist as inputs, our goal is to generate a story about the title that adheres to the given emotion arc. Fig. 1 shows an example story generated by our model, where the protagonist's emotion evolves from joy to anger and then sadness.
To address this problem, we present three models based on GPT-2 (Radford et al., 2019) that incorporate the protagonist's emotion arc as a controllable attribute while preserving content quality: an Emotion Supervision (EmoSup) model and two Emotion-Reinforced (EmoRL) models based on reinforcement learning. The EmoRL models use two Emotion-Consistency rewards, EC-EM and EC-CLF. EC-EM uses semantically enhanced emotion matching to encourage the model to adhere to the given emotion arc; it infers the protagonist's emotions in the generated stories using Commonsense Transformers, COMET (Bosselut et al., 2019). EC-CLF achieves the same goal using a classifier to infer the protagonist's emotions.
In the absence of a training corpus of stories labeled with the protagonist's emotions, we automatically annotate a large-scale story corpus using COMET. Our automatic and manual evaluations show that our models can not only express the desired emotion arcs but also produce fluent and coherent stories. Our contributions are:
• We present the first study on modeling the emotion arc of the protagonist for neural story generation.
• We propose two Emotion-Consistency rewards designed to enforce the desired emotion arcs using reinforcement learning.
• We track the protagonist's emotions in a story (1) using a pipeline based on a commonsense knowledge model; and (2) through an emotion classifier trained using transfer learning from out-of-domain data.
• We empirically demonstrate that our models can effectively generate stories that follow the desired emotion arc. We also illustrate how these models can find novel applications.

Related Work
Early story generation systems relied on symbolic planning (Lebowitz, 1987; Pérez y Pérez and Sharples, 2001; Porteous and Cavazza, 2009; Riedl and Young, 2010) or case-based reasoning (Gervás et al., 2005). Although these systems could ensure long-term coherence, they could only operate in predefined domains and required manual engineering. These problems have been somewhat alleviated by recent seq2seq storytelling models (Roemmele, 2016; Jain et al., 2017), some of which are based on intermediate representations (Martin et al., 2018; Xu et al., 2018; Fan et al., 2018b; Yao et al., 2019; Fan et al., 2019). Recent approaches have also used large-scale language models (LMs) based on Transformers (Vaswani et al., 2017), such as GPT-2 (Radford et al., 2019). Being trained on large amounts of data, these models can generate highly fluent text and find applications in story generation (Qin et al., 2019; Guan et al., 2020) and dialogue systems (Budzianowski and Vulić, 2019; Wolf et al., 2019). However, they lack the ability to impose any auxiliary objective on the generated text, such as expressing specific attributes.
To address this, approaches such as conditional training or weighted decoding have been proposed to control different properties of the generated text such as sentiment, tense, speaker style, and length (Kikuchi et al., 2016; Hu et al., 2017; Ghazvininejad et al., 2017; Wang et al., 2017; Fan et al., 2018a). Tambwekar et al. (2019) use reinforcement learning for generating goal-driven story plots, which are sequences of event tuples. Dathathri et al. (2020) propose PPLM, which uses an attribute classifier to steer text generation without further training the LM.
More closely related to our work is generating automatic responses with a specific sentiment or emotion (Zhou and Wang, 2018; Song et al., 2019). Modeling characters (Bamman et al., 2013, 2014; Vala et al., 2015; Kim and Klinger, 2019; Krishnan and Eisenstein, 2015), their relationships (Agarwal et al., 2013; Iyyer et al., 2016; Chaturvedi et al., 2016; Srivastava et al., 2016; Chaturvedi et al., 2017a), and sentiment trajectory (Chaturvedi et al., 2017b; Chen et al., 2019) has been shown to be useful for story understanding in general. However, there is limited work on incorporating characters or sentiment into story generation. Previous work models characters but not sentiment (Clark et al., 2018; Liu et al., 2020). Weber et al. (2020) is a contemporary work that incorporates sentiment while "filling in" a narrative. Peng et al. (2018) and Luo et al. (2019) control the overall sentiment for story ending generation. These works are limited to coarse-grained sentiments and/or only target the ending sentence. Instead, we model the emotional trajectory of the protagonist as the story progresses, which is more central to storytelling than the overall sentiment.

Emotion-aware Storytelling
We first explain how our models track the protagonist's emotional trajectory (§3.1). We then define the problem statement (§3.2), followed by an introduction to our base storytelling model (§3.3), which is used as the backbone of our three proposed models (§3.4 and §3.5).

Tracking Protagonist's Emotions
In this work, we define the protagonist as the most frequently occurring character in a narrative (Morrow, 1985). Our two rewards for the EmoRL models (§3.5) need to track the protagonist's emotions to guide the generation. For this, we obtain their emotions at various stages in the story using one of the following two approaches.

Commonsense Transformer: Our EC-EM reward uses a commonsense knowledge model to reason about the implicit emotional states of the protagonist. We use COMET (Bosselut et al., 2019), a knowledge base construction model trained on ATOMIC if-then knowledge triples (Sap et al., 2019). It contains information about everyday events and their causes and effects. Given an event and a relation, COMET can generate commonsense inferences about the relation. For tracking emotions, we use the relations xReact and oReact, which correspond to emotional reactions to events (more details on this in §4.1).

Emotion Classifier: Our EC-CLF reward captures the protagonist's emotions using an emotion classifier. For this, we adapt the pre-trained BERT-large model for multi-label classification over 5 basic emotions: anger, fear, joy, sadness, and neutral. Following Devlin et al. (2019), we use a fully-connected layer over the final hidden representation corresponding to the special classification token ([CLS]). We train this classifier in two steps.
First, we train this classifier on a human-annotated dataset for emotion identification in tweets (Mohammad et al., 2018), consisting of 6,857 tweets with binary labels for 11 emotions, of which we focus only on our basic emotions. On this dataset, the classifier achieves performance better than or comparable to state-of-the-art results (Kant et al., 2019) (see Appendix B.1 for detailed results).
Next, in order to identify the protagonist's emotions from a given story-text, we further fine-tune the classifier on story training data that is automatically annotated with the protagonist's emotions using the pipeline described in §4.1. To evaluate the classifier, we obtain manual annotations for the protagonist's emotions on Amazon Mechanical Turk for a subset of 50 randomly selected stories (250 sentences) from our story corpus. Each sentence was annotated by 3 judges. Workers agreed with our emotion classifier 70% of the time (random agreement would be 20%). See Appendix B.2 for more details about these annotations.
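To make the multi-label setup concrete, here is a minimal sketch of how such a classifier head's logits can be decoded into emotion labels. The 0.5 threshold and the neutral fallback are illustrative assumptions, not details taken from the paper.

```python
import math

BASIC_EMOTIONS = ("anger", "fear", "joy", "sadness", "neutral")

def emotions_from_logits(logits, labels=BASIC_EMOTIONS, threshold=0.5):
    """Decode a multi-label classifier head: sigmoid each logit and keep
    every label whose probability clears the threshold, falling back to
    neutral when none do."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    chosen = [label for label, p in zip(labels, probs) if p >= threshold]
    return chosen or ["neutral"]
```

Unlike single-label softmax classification, this decoding can emit several emotions for one sentence, which matches the dataset's binary per-emotion labels.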

Problem Statement
We formulate the emotion-aware storytelling task as follows: given a story title as a sequence of tokens $t=\{t_1, t_2, \ldots, t_m\}$ and an emotion arc for the protagonist as a sequence of basic emotions $a=\{e_1, e_2, e_3\}$, the task is to generate a story as a sequence of tokens $y=\{y_1, y_2, \ldots, y_n\}$ that adheres to the title and the emotion arc.

Transformer-based Storytelling Model
Our models are built upon a base storytelling model that can generate a story consistent with a given prompt (e.g., title). We choose GPT-2 (medium) (Radford et al., 2019) because our initial experiments demonstrated that it generally outperforms other state-of-the-art story generation models (§5.1). GPT-2 uses multiple Transformer blocks of multi-head self-attention and fully connected layers (the left box in Fig. 2). Since it was trained on a broad range of domains, we fine-tune it on a dataset of stories (§4.1) by minimizing the negative conditional log-likelihood:

$$\mathcal{L}_{\text{LM}} = -\sum_{i=1}^{n} \log p(y_i \mid y_{<i}, t_1, \ldots, t_m),$$

where m and n denote the number of tokens in the title and story, respectively. $h_i^l$ is the l-th layer's output at the i-th position, computed through a Transformer block with the masked multi-head self-attention mechanism, and $h_i^0$ is the summation of the token embedding $W_i$ and the position embedding $P_i$ for the i-th token. $y_{<i}$ denotes the left context.
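The fine-tuning objective is a standard negative log-likelihood over the story tokens. A toy sketch, with hand-picked per-token probabilities standing in for the model's output distribution:

```python
import math

def neg_log_likelihood(token_probs):
    """Negative conditional log-likelihood of a story, where token_probs
    holds the model's probability p(y_i | y_<i, t) for each story token."""
    return -sum(math.log(p) for p in token_probs)

# A model that assigns high probability to the reference story tokens
# incurs a lower fine-tuning loss than an uncertain one.
confident = neg_log_likelihood([0.9, 0.8, 0.9])
uncertain = neg_log_likelihood([0.2, 0.1, 0.3])
```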

Emotion Supervision (EmoSup) Model
The underlying idea behind our Emotion Supervision (EmoSup) model is to provide the emotion arc as an additional input, similar to conditional training (Fan et al., 2018a; Kikuchi et al., 2016). Specifically, each title has the corresponding emotion arc prepended at the beginning, separated by a delimiter token (<$>). This way, emotion arcs receive special treatment (Kobus et al., 2017): they are propagated to the whole story, and the model learns to maximize $p(y_i \mid y_{<i}, t, a)$.
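The input construction for EmoSup can be sketched as a one-liner; the exact whitespace and tokenization conventions here are illustrative assumptions.

```python
def build_emosup_input(title, arc, delimiter="<$>"):
    """Conditional-training input: the emotion arc is prepended to the
    title, separated by a delimiter token."""
    return " ".join(arc) + " " + delimiter + " " + title

example = build_emosup_input("Raw burger", ["joy", "anger", "sadness"])
```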

Emotion-Reinforced (EmoRL) Models
The emotion arc guides the generation in EmoSup as an initial input. However, we want to continually supervise the model during the generation process. This motivates us to use a reinforcement learning framework. To deal with exposure bias, many previous works have optimized the evaluation measures (e.g., BLEU, ROUGE, CIDEr) as rewards (Rennie et al., 2017;Paulus et al., 2018). Here, we propose two Emotion Consistency rewards, EC-EM and EC-CLF, which optimize adherence to the desired emotion arc.
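The self-critical recipe with a mixed loss (the general scheme of Rennie et al., 2017 and Paulus et al., 2018, detailed below) can be sketched in plain Python; the gamma value here is illustrative, not the paper's setting.

```python
def self_critical_loss(sample_logprobs, r_sample, r_greedy):
    """Self-critical REINFORCE term: the greedily decoded story's reward
    r_greedy is the baseline, so sampled stories that beat it have their
    log-probabilities pushed up by gradient descent on this loss."""
    return -(r_sample - r_greedy) * sum(sample_logprobs)

def mixed_loss(lm_loss, rl_loss, gamma=0.9):
    """Mixed objective trading off emotion reward against fluency."""
    return gamma * rl_loss + (1.0 - gamma) * lm_loss
```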

EC-EM Reward
This reward quantifies the alignment of the emotion arc of the generated story to the desired arc using the commonsense knowledge model, COMET. For an $N$-sentence-long generated story, we use COMET to obtain the protagonist's emotional reaction for each sentence, resulting in a sequence of emotion-phrases $a_g=\{g_1, g_2, \ldots, g_N\}$ ($N=5$ for our dataset; the $g_i$s are phrases representing emotional reactions, and details on obtaining them during training are provided in Appendix A.1). We then define the reward as a modified Levenshtein distance (Levenshtein, 1966) between the generated reactions $a_g$ and the desired emotion arc $a^*=\{e_1, e_2, e_3\}$. This modification allows only two operations: (1) deletion of an emotion-phrase (in $a_g$); and (2) replacement of an emotion-phrase with a basic emotion at a cost based on the semantic similarity between the two (e.g., happy to help and joy). Semantic similarities are computed using cosine similarity between averaged GloVe embeddings (Pennington et al., 2014). The reward is defined as:

$$r_{\text{EC-EM}}(y^s) = -\,lev(a_g, a^*),$$

where $lev$ denotes the modified Levenshtein distance. We refer to the model that uses this reward as RL-EM.

EC-CLF Reward
This reward infers the protagonist's emotions in a given text using our emotion classifier (§3.1). We first divide the generated story into segments: beginning, body, and ending. Then, for each segment, we use the classifier to obtain the probability of the desired emotion. The reward is defined as the probabilities of the desired emotions averaged across the segments:

$$r_{\text{EC-CLF}}(y^s) = \frac{1}{k} \sum_{j=1}^{k} p_{\text{clf}}(e^*_j \mid x_j),$$

where k is the number of emotions in the arc (here, k=3), and $e^*_j$ denotes the desired emotion for the j-th segment $x_j$. We refer to the model that uses this reward as RL-CLF.

Policy Gradient
For training, we use the REINFORCE algorithm (Williams, 1992) to learn a generation policy $p_\theta$ of the storytelling model with parameters $\theta$.
Here, the model generates a sample story $y^s$ from the model's output distribution, and the goal is to minimize the negative expected reward, which is approximated by:

$$\mathcal{L}_{\text{RL}} = -\big(r(y^s) - r(\hat{y})\big) \sum_{i} \log p_\theta(y^s_i \mid y^s_{<i}, t, a).$$

We follow the self-critical training approach (Rennie et al., 2017) and take the reward of the greedily decoded story $\hat{y}$ as the baseline reward $r(\hat{y})$. This ensures that, with better exploration, the model learns to generate stories $y^s$ with higher rewards than the baseline $\hat{y}$. Optimizing only with the RL loss above using the emotion-consistency rewards may increase the expected rewards, but at the cost of the fluency and readability of the generated story. Therefore, we optimize the following mixed loss (Paulus et al., 2018):

$$\mathcal{L} = \gamma \mathcal{L}_{\text{RL}} + (1 - \gamma) \mathcal{L}_{\text{LM}},$$

where $\gamma$ is a hyper-parameter balancing the two loss functions. Our emotion-reinforced storytelling framework is depicted in Fig. 2.

Annotating Stories with Emotion Arcs

For training our models, we need stories annotated with the emotion arcs of their protagonists. We annotated the stories in our dataset automatically using the multi-step annotation pipeline shown in Fig. 3. In step 1, we identify all characters and their mentions in a story using coreference resolution. In step 2, we identify the character with the most mentions as the protagonist (e.g., 'Iris', who is mentioned in 4 sentences). Then, in step 3, in each sentence of the story, we identify the protagonist's role as Agent or Other using the sentence's dependency parse. The protagonist is an Agent if he/she is the subject of the main verb in the sentence, and Other otherwise (e.g., Iris's role is Other in sentence 4 and Agent in all other sentences). Next, in step 4, we obtain the emotional reaction of the protagonist in each sentence using COMET. Given a context, c, and relation type, r, COMET can yield the emotional reaction of the Agent (r=xReact) and Others (r=oReact). Depending on the protagonist's role in the sentence, we use the appropriate relation to get their emotional reaction, g, and COMET's confidence in the prediction, $\phi_g$.
In sentences without an explicit mention of the protagonist, his/her role is assigned as Other, and we use oReact since the event in that sentence will affect all characters of the story, including the protagonist (e.g., sentence 4 in Fig. 3).
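The role assignment and relation choice described above amount to a simple decision rule, sketched here (function names are ours, for illustration):

```python
def protagonist_role(mentioned, is_subject_of_main_verb):
    """Step 3 of the pipeline: the protagonist is the Agent only when
    explicitly mentioned as the subject of the main verb; otherwise
    (including sentences with no explicit mention) the role is Other."""
    return "Agent" if mentioned and is_subject_of_main_verb else "Other"

def comet_relation(role):
    """Step 4: query COMET with xReact for the Agent's own reaction and
    oReact for the reactions of other affected characters."""
    return "xReact" if role == "Agent" else "oReact"
```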
Step 4 gives the protagonist's emotions for each sentence of the story, but the emotion arc has to represent them for the three segments: beginning, body, and end. The stories in our corpus are five sentences long, and following previous work on this corpus (Chaturvedi et al., 2017b), we segment them in a 1:3:1 ratio. For the protagonist's emotion in the body (the middle 3 sentences), we take the emotion of the sentence in which COMET was most confident (e.g., 'annoyed' for the body of the running example in Step 5).
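The 1:3:1 arc construction just described can be sketched directly; confidence values here stand in for COMET's $\phi_g$.

```python
def arc_from_sentence_emotions(emotions, confidences):
    """Collapse per-sentence protagonist emotions of a 5-sentence story
    into a beginning/body/ending arc using the 1:3:1 split; the body
    emotion comes from the middle sentence with the highest confidence."""
    assert len(emotions) == 5 and len(confidences) == 5
    body_idx = max(range(1, 4), key=lambda i: confidences[i])
    return [emotions[0], emotions[body_idx], emotions[4]]
```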
Note that since COMET's outputs, the g's, are open-ended emotion-phrases, in step 6 we need to map these phrases to one of the 5 basic emotions using the NRC Affect Intensity Lexicon (Mohammad, 2018). The lexicon is a list of words with their real-valued intensities for the 4 non-neutral basic emotions. We represent the likelihood of g getting mapped to each basic emotion, e, as score_g(e). For mapping, we first tokenize, lemmatize, and filter stop words from g. Then we find exact matches of g's tokens to words in the lexicon (along with the match intensities). For each match, we increase score_g(e) by the match intensity. Finally, g is mapped to the basic emotion with the maximum score. An emotion-phrase with no matching tokens is mapped to neutral.
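This mapping step can be sketched as follows. The tiny lexicon and stop-word list are illustrative stand-ins for the NRC Affect Intensity Lexicon, and lemmatization is skipped for brevity.

```python
def map_to_basic_emotion(phrase, lexicon, stop_words=frozenset({"to", "of", "the"})):
    """Map an open-ended emotion-phrase to a basic emotion by summing
    affect intensities of matching tokens and taking the argmax; phrases
    with no match fall back to neutral."""
    scores = {"anger": 0.0, "fear": 0.0, "joy": 0.0, "sadness": 0.0}
    for token in phrase.lower().split():
        if token in stop_words:
            continue
        for emotion, intensity in lexicon.get(token, ()):
            scores[emotion] += intensity
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

# Toy lexicon entries (word -> [(emotion, intensity)]), purely illustrative.
toy_lexicon = {"happy": [("joy", 0.8)], "annoyed": [("anger", 0.6)],
               "scared": [("fear", 0.7)]}
```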
Note that we also experimented with letting the emotional reactions generated by COMET constitute the emotion arc directly, without mapping them to basic emotions. However, with more than 500 unique emotional reactions, the space of possible arcs became too large, with too few training examples for each arc, which prevented the models from effectively learning the pattern. The smaller set of basic emotions also made it more natural and manageable for the user to provide a desired emotion arc as input.
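For reference, the modified Levenshtein distance underlying the EC-EM reward (§3.5) can be sketched with dynamic programming. The similarity function below is a toy stand-in for GloVe cosine similarity, and the deletion cost of 1 is an illustrative assumption.

```python
def modified_lev(gen_phrases, desired_arc, sim):
    """Modified Levenshtein distance allowing only (1) deletion of a
    generated emotion-phrase and (2) replacement of a phrase by a basic
    emotion at a cost that shrinks with semantic similarity."""
    n, m = len(gen_phrases), len(desired_arc)
    INF = float("inf")
    # d[i][j] = cost of aligning the first i phrases to the first j emotions
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == INF:
                continue
            if i < n:  # delete generated phrase i
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + 1.0)
            if i < n and j < m:  # replace phrase i with arc emotion j
                cost = 1.0 - sim(gen_phrases[i], desired_arc[j])
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
    return d[n][m]

def toy_sim(a, b):
    """Stand-in similarity: exact match 1.0, shared token 0.5, else 0.0."""
    if a == b:
        return 1.0
    return 0.5 if set(a.split()) & set(b.split()) else 0.0
```

A perfectly matching arc costs 0, while each unrelated extra phrase adds one deletion.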

Implementation Details
We follow the training and inference settings of medium-size GPT-2 as in Radford et al. (2019) (for completion, we provide full details in Appendix A.2). Our models are implemented with the Texar toolkit (Hu et al., 2019).

Evaluation Measures
Automatic We adopt several automatic measures to evaluate the generated stories both on content quality and emotion faithfulness.
For evaluating content quality, we use the following measures: (1) Perplexity, as an indicator of fluency; a smaller value is better. (2) BLEU, which is based on n-gram overlaps (Papineni et al., 2002). Following Guan et al. (2020), since BLEU scores become extremely low for large n, we used n=1, 2.
(3) Distinct-n (with n=1, 2, 3) measures the percentage of unique n-grams (Li et al., 2016); a high ratio indicates a high level of lexical diversity. (4) Repetition-4 is the percentage of generated stories that repeat at least one 4-gram (Shao et al., 2019); a high value indicates redundancy in the generated text.
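Both diversity measures are straightforward to compute; a sketch assuming whitespace tokenization:

```python
def distinct_n(stories, n):
    """Percentage of unique n-grams across all generated stories."""
    ngrams = []
    for story in stories:
        tokens = story.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 100.0 * len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def repetition_4(stories):
    """Percentage of stories that repeat at least one 4-gram."""
    def has_repeat(story):
        tokens = story.split()
        grams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
        return len(grams) != len(set(grams))
    return 100.0 * sum(map(has_repeat, stories)) / len(stories)
```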
For evaluating the emotion faithfulness of a generated story, we adapt the lexical measures (1) Seg-word and (2) Arc-word (Song et al., 2019). Given a desired emotion arc for a story, Seg-word is the percentage of the story's segments that contain emotion words corresponding to the desired emotion. Correspondingly, Arc-word for a story is a binary score indicating whether all of its segments contain emotion words corresponding to the desired emotions. We also define (3) Seg-acc and (4) Arc-acc for a generated story. Seg-acc is the fraction of generated segments for which the emotion (as determined by the emotion classifier) exactly matches the desired emotion. Similarly, Arc-acc is a binary score indicating whether the story's emotion arc (as determined by the emotion classifier) exactly matches the desired emotion arc. We also use the reward functions, (5) EC-CLF and (6) EC-EM, to score a generated story. For all these measures, we report scores averaged across all stories generated by a model.

Manual We also conduct a manual evaluation of generated stories using Amazon Mechanical Turk. Following Song et al. (2019), workers are asked to evaluate pairs of stories on a 0-3 scale (3 being very good) from two different perspectives: (1) emotion faithfulness, to assess whether a story follows the desired emotion arc for the protagonist; and (2) content quality, to indicate whether a story is fluent, logically coherent, and on-topic (related to the given title). Workers were also asked to indicate their overall preference by choosing the better story of the two while considering both aspects, or to indicate that the two are of equal quality. More details about the evaluation measures are provided in Appendix A.3.
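Seg-acc and Arc-acc reduce to simple per-story comparisons between the classifier-inferred arc and the desired arc; a minimal sketch:

```python
def seg_and_arc_acc(predicted_arcs, desired_arcs):
    """Seg-acc: fraction of segments whose inferred emotion matches the
    desired one; Arc-acc: 1 only when all segments match. Both are
    averaged over stories."""
    seg_scores, arc_scores = [], []
    for pred, gold in zip(predicted_arcs, desired_arcs):
        matches = [p == g for p, g in zip(pred, gold)]
        seg_scores.append(sum(matches) / len(matches))
        arc_scores.append(1.0 if all(matches) else 0.0)
    n = len(seg_scores)
    return sum(seg_scores) / n, sum(arc_scores) / n
```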

Results and Discussion
We first describe our experiments on choosing the base storytelling model ( §5.1) followed by evaluation of the proposed models ( §5.2).

Base Storytelling Model Results
As noted before, our models build upon a base storytelling model (GPT-2). We compared GPT-2 with state-of-the-art story generation models, given the title as input. Using the various evaluation measures described earlier, our experiments showed that fine-tuned GPT-2 outperforms all baselines, indicating that it can serve as a good base storytelling model. This is in line with the observations of Guan et al. (2020). Since this is not our focus, we report full results and details in Appendix B.3.

Emotion-Aware Storytelling Results
Baselines We use the following baselines in our experiments: (1) GPT-2+FT, our base GPT-2 model fine-tuned on the ROCStories corpus, for which emotion arcs are not provided as inputs; (2) Fusion+Emo and (3) Plan&Write+Emo, two of the strongest storytelling baselines (we prepended emotion arcs to titles in our experiments); and (4) PPLM (Dathathri et al., 2020), which can be extended to accept emotion arcs for controlling story generation. PPLM-3 and PPLM-5 indicate 3 and 5 iterations, respectively. (We use the HuggingFace implementation: https://github.com/huggingface/transformers/tree/master/examples/text-generation/pplm. For a fair comparison, we used GPT-2 fine-tuned on stories as the underlying generation model.)

Automatic Evaluation For automatic evaluation, we used the titles and automatically extracted emotion arcs of the stories in our test set as input.
The evaluation results on content quality are shown in the top half of Table 1. Interestingly, even though the proposed models only aim to control emotion arc, they outperform GPT-2+FT on perplexity indicating better fluency. Among the proposed models, EmoSup obtains the best perplexity score mainly because that is what its loss function optimizes (as opposed to the mixed loss in EmoRL models). Overall, all of our proposed models outperform all baselines. In particular, RL-CLF has the highest BLEU scores, and RL-EM has the highest diversity and lowest repetition scores. All improvements over baselines are statistically significant (approximate randomization (Noreen, 1989), p < 0.05).
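The significance test used throughout is approximate randomization (Noreen, 1989), which shuffles paired system outputs to estimate how often a difference as large as the observed one arises by chance. A minimal sketch over paired per-story scores:

```python
import random

def approx_randomization_pvalue(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test: randomly swap paired scores and
    count how often the absolute mean difference meets or exceeds the
    observed one. Returns a smoothed p-value estimate."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        a = b = 0.0
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                x, y = y, x
            a += x
            b += y
        if abs(a - b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```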
The evaluation results on emotion faithfulness are shown in the bottom half of Table 1. We see that, as expected, all models outperform GPT-2+FT, which is not provided the emotion arcs as inputs. Our proposed models also achieve significant improvements over all baselines (app. randomization, p < 0.05). In particular, RL-CLF achieves the best performance on almost all measures.
We also compare various models on the most common emotion arcs in our corpus. Fig. 4 shows the Arc-acc of various models on the 10 most common arcs. We can see that all models perform very well on "joy → joy → joy" compared to other emotion arcs. This is because it is the most common emotion arc (34% of the training data) in our corpus, which results in the availability of a significant number of training examples for this arc. Nevertheless, for all arcs, RL-CLF consistently outperforms all other models, indicating better control over the desired emotion arc.

Table 2: Manual evaluation results. For each criterion, we report the average improvements as well as the absolute scores for the two models, separated by a comma. RL-CLF is preferred over other methods (p < 0.05).
These results indicate that while all proposed models can control the emotion arc of generated stories, RL-CLF achieves a good balance between content and emotion quality.

Manual Evaluation Since concerns have been raised about the automatic evaluation of language generation, we also conduct a manual evaluation on Amazon Mechanical Turk. For this, we randomly sampled titles and emotion arcs of 100 instances from our test set and generated stories using the models being evaluated. We compared five models, so overall there were 500 stories. We conducted pairwise comparisons of generated stories, and each pair was evaluated by 3 judges. Table 2 reports the average improvements as well as absolute scores for emotion faithfulness and content quality (evaluated independently), and also the overall preference of the judges. We first compare our two EmoRL models (Row 1). We see that RL-CLF improves over RL-EM on both emotion faithfulness and content quality. Overall, it is judged to be better than RL-EM 52.33% of the time and worse in only 38.00% of cases. We then compare the better of the two, RL-CLF, with the uncontrolled GPT-2+FT (Row 2). We see that, on average, the RL-CLF model is not only better at adhering to the emotion arc by +0.76 points but also generates better content (an improvement of +0.25 points), and its stories are preferred 60% of the time by humans. We observe similar results for comparisons with EmoSup and PPLM-5. All improvements are statistically significant (app. randomization, p < 0.05).
Case Studies Since the proposed models can generate stories conditioned on the protagonist's emotion arc, they can be used to unfold a story in diverse situations for a given title. We demonstrate this capability in Table 3. It shows two examples where for the same title, our model (RL-CLF here) can generate stories that follow different emotion arcs for the protagonists.
Alternatively, given a story, the models can also be used to generate another story with a similar emotion arc (after automatically extracting the protagonist's emotion arc in the given story using the pipeline described in §4.1). For example, in Table 4 we show how RL-CLF can be used to generate a novel story in which the protagonist follows the same emotion arc as in the 'Cinderella' story. Note that the goal here is not necessarily to generate a similar narrative but a story that follows the same emotional trajectory.
We provide more qualitative examples in Appendix Figure 8.

Table 3: Stories generated by RL-CLF for the same title with different emotion arcs.

Title: fire injuries
Arc: joy → sadness → joy. My friends and I went camping this summer. We got in my van and went to the woods. We decided to light a campfire. While driving around, our tire popped and the fire started. We had to call the fire department for help and they were able to put out the fire.
Arc: sadness → sadness → joy. The fire department was called to a house in the woods. The house was engulfed in flames. There were two people inside. One person was taken to the hospital by air ambulance. Luckily, the other person was treated for non-life threatening injuries.

Title: dance
Arc: fear → joy → joy. Kelly was worried about her dance recital. She had practiced her dance for weeks. She decided to try out for the school's dance team. Kelly was nervous but knew she could do well. She was so excited she gave her best impression!
Arc: sadness → joy → joy. I was very depressed. I went to a dance class with a friend of mine. We tried out some different moves. We got stuck dancing for a long time. The next day I tried out some new moves and got a standing ovation.

Conclusion
In this paper, we proposed the emotion-aware storytelling task for modeling the emotion arc of the protagonist. To this end, we designed two emotion-consistency rewards using a commonsense transformer and an emotion classifier. Experiments demonstrated that our approach improved both content quality and emotion faithfulness of the generated stories. We also presented two case studies, which show interesting use cases of our model. In general, such models can have educational applications by enabling children to explore creative writing at an early age and by addressing the literary learning needs of learners with disabilities. This paper is a step towards future research directions on planning emotional trajectory while generating stories. Using commonsense inferences about the effect of events on the emotional states of various characters of the story has the potential to generate more coherent, realistic, and engaging stories. In this work, we focused only on the protagonist, but future work can explore modeling the motivations, goals, achievements, and emotional trajectories of all characters. Our approach is general, provides a blueprint for similar works going forward, and can be used outside emotion-aware storytelling, e.g., for generating other emotional content or text with other attributes or properties.

Table 4: Generating a story whose protagonist follows the same emotion arc as in the 'Cinderella' story.
Input story: There was a girl called Cinderella who did all the work for her mean, ugly step sisters. One day, she got an invitation to go to a ball at the palace. A fairy Godmother appeared and made her a beautiful dress and a lovely carriage. After Cinderella left the ball, the prince looked everywhere for her. He eventually found her and they got married and lived happily ever after.
Automatically extracted emotion arc: sadness → joy → joy
Input title: The wedding
Output story: Ryan had been feeling really lonely lately. He decided he needed a way to make a friend. He decided to go to a wedding. When he got there he met a beautiful girl. Ryan had made a new friend that day!
The various assumptions and choices made in this paper and the specific characteristics of the dataset we chose can introduce biases and errors. For example, COMET is a discourse-agnostic model, and separately extracting emotional reactions for each sentence may fail to maintain emotional consistency with the rest of the narrative. Such sources of errors and biases need further systematic investigation.

A.1 Obtaining Emotion Reactions During RL-EM Training
During self-critical training, for each training instance, two stories are generated: y^s, which is sampled from the model's probability distribution, and ŷ, which is greedily decoded. The reward is then computed for each of the two stories. For computing the EC-EM reward, we need to identify and track the protagonist and obtain his/her emotional reaction for each sentence of the sampled or greedy story. This requires identifying the protagonist and determining his/her role, Agent or Other, in every sentence so that the appropriate argument for COMET (xReact or oReact) can be chosen. In principle, this can be done using the annotation pipeline described in §4.1 of the main paper. However, this is computationally prohibitive during training, as the pipeline requires running dependency parsing on each sentence and coreference resolution on every sampled or greedy story. Therefore, after analyzing our corpus and some generated stories, we devised several heuristics that approximate these tasks (i.e., identifying the protagonist and his/her role) with high accuracy. We describe these heuristics below.
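The self-critical update described above can be sketched as follows; this is a minimal toy re-implementation for illustration, not our training code, and the function name and plain-list representation are our own:

```python
# Self-critical policy-gradient loss: the greedily decoded story's
# reward acts as a baseline for the sampled story's reward.
def self_critical_loss(logp_sampled, reward_sampled, reward_greedy):
    """logp_sampled: per-instance summed log-probabilities of the
    sampled story y^s; reward_* are scalar rewards for y^s and y-hat.
    Minimizing this loss increases the probability of sampled stories
    whose reward beats the greedy baseline."""
    losses = [
        -(r_s - r_g) * lp  # negative advantage-weighted log-likelihood
        for lp, r_s, r_g in zip(logp_sampled, reward_sampled, reward_greedy)
    ]
    return sum(losses) / len(losses)
```

When the sampled story's reward exceeds the greedy reward, the advantage is positive, so gradient descent pushes the sampled story's log-probability up; otherwise it pushes it down.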
For identifying the protagonist, we use the following heuristics. The first is based on the observation that if the narrator features in a story, the story primarily focuses on the narrator and his/her experiences, making him/her the protagonist. So if first-person pronouns (I, We) appear in the story, the narrator is considered the protagonist. Our second heuristic is based on the observation that the protagonist is usually introduced fairly early in the story. In our case, where stories are five sentences long, the protagonist mostly appears in the first couple of sentences. With this in mind, we define the first noun that appears in a lexicon of common protagonists as the protagonist of the story. This lexicon consists of terms for Male_Char, Female_Char, Social_Group, Generic_People, and the NLTK names corpus. Examples of these terms are shown in Table 5. The lexicon also lets us identify the gender of the protagonist (using the lexicon category that the protagonist belongs to) and hence the pronouns that will be used in the following sentences to refer to him/her. This combination of the protagonist's mentions and pronouns lets us track him/her throughout the story.
For identifying the role of the protagonist, if the first noun that occurs in a sentence matches the protagonist or his/her corresponding pronoun, we assume that the protagonist's role is Agent; otherwise the role is Other. Depending on this role, we use xReact or oReact when obtaining the protagonist's emotional reaction for that sentence using COMET.
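The two heuristics above can be sketched as follows. This is an illustrative simplification: the lexicon below is a toy stand-in for the real lexicon of Table 5, the function names are our own, and token-level matching approximates the noun detection described in the text:

```python
FIRST_PERSON = {"i", "we"}
# Toy stand-in for the lexicon of common protagonists (cf. Table 5),
# mapping each term to the pronoun used to track it.
PROTAGONIST_LEXICON = {
    "tom": "he", "tina": "she", "sara": "she",
    "boy": "he", "girl": "she", "friends": "they",
}

def find_protagonist(sentences):
    """Return (mention, pronoun) for the story's protagonist."""
    tokens = [t.strip(".,!?").lower() for s in sentences for t in s.split()]
    # Heuristic 1: a first-person narrator is the protagonist.
    if any(t in FIRST_PERSON for t in tokens):
        return "narrator", "i"
    # Heuristic 2: the first token found in the protagonist lexicon.
    for t in tokens:
        if t in PROTAGONIST_LEXICON:
            return t, PROTAGONIST_LEXICON[t]
    return None, None

def protagonist_role(sentence, mention, pronoun):
    """Agent if the protagonist (or pronoun) is the sentence's first
    content word; 'first noun' is approximated by 'first non-determiner
    token'. Agent -> COMET xReact, Other -> COMET oReact."""
    for t in (w.strip(".,!?").lower() for w in sentence.split()):
        if t in (mention, pronoun):
            return "Agent"
        if t.isalpha() and t not in {"the", "a", "an"}:
            return "Other"  # some other content word appears first
    return "Other"
```

For example, in "Tom yelled at the waiter." the role is Agent (use xReact), whereas in "The waiter apologized to Tom." it is Other (use oReact).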

A.2 Training Hyper-parameters
Our proposed models follow the setting of medium-sized GPT-2 (Radford et al., 2019) (345 million parameters), which uses a 24-layer decoder-only transformer, 1024-dimensional hidden states, and 16 attention heads. The stories are encoded using BPE with a vocabulary size of 50,257. We set the maximum sequence length to 128 tokens, which is large enough to contain complete stories and the additional inputs. We use Adam optimization (Kingma and Ba, 2015) with an initial learning rate of 10^-5 and a minibatch size of 4. For stability, we first pre-train the models with teacher forcing until convergence, then fine-tune them with the mixed loss. The hyper-parameter γ = 0.97 is tuned manually on the validation set. All models were trained until there was no improvement in validation set performance. We used an NVIDIA GTX 1080 Ti GPU to train our models. At inference time, we generate stories using the top-k sampling scheme (Fan et al., 2018b) with k = 40 and a softmax temperature of 0.7. Generating stories for our test set of 9,796 instances took about 3 hours.
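Top-k sampling with temperature, as used at inference time, can be sketched as follows; this is a toy re-implementation over a plain list of logits (the actual decoding operates on the model's vocabulary distribution):

```python
import math, random

def top_k_sample(logits, k=40, temperature=0.7, rng=random):
    """Sample a token index from the k highest-scoring logits after
    dividing by the temperature (lower temperature -> peakier)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                               # for numerical stability
    weights = [math.exp(s - m) for s in scaled]   # unnormalized softmax
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]
```

With k = 1 this reduces to greedy decoding; larger k and higher temperature trade determinism for diversity.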
For generating commonsense inferences about the protagonist's emotions using COMET, we use the greedy decoding algorithm, since it has been shown to have superior performance as evaluated by humans (Bosselut et al., 2019).

A.3 Evaluation Measures
In this section we provide details about the automatic and manual measures used to evaluate our models.
Automatic: To compute the Arc-word and Seg-word measures, we use the NRC Affect Intensity Lexicon (Mohammad, 2018). This lexicon contains words with corresponding emotion intensities for different basic emotions. To find emotionally expressive words in a given piece of text (e.g., a story segment), we create, for each of our basic emotions, a dictionary of words with emotion intensity higher than 0.5.

Manual: During the manual evaluation, we conducted pairwise comparisons of the models on Amazon Mechanical Turk (AMT). To ensure high-quality evaluation, we selected turkers who had an approval rate greater than 97%, had at least 1,000 approved HITs, and were located in the U.S. For each pairwise annotation, we showed the inputs (title and emotion arc) and the two stories generated by the two models being compared. To avoid biases, we randomly shuffled the order in which the stories from the two models were shown to the turkers. We provided instructions explaining the annotations and also provided examples. Following this process, each pair of stories was annotated by three turkers. Fig. 6 shows a screenshot of our setup on AMT.
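The dictionary construction for the automatic measures can be sketched as follows; the mini-lexicon below is illustrative (not actual NRC data), and the function names are our own:

```python
# Toy lexicon: word -> {emotion: intensity}, standing in for the
# NRC Affect Intensity Lexicon.
LEXICON = {
    "furious": {"anger": 0.94},
    "annoyed": {"anger": 0.45},   # below threshold, excluded
    "thrilled": {"joy": 0.88},
    "terrified": {"fear": 0.91},
}

def emotion_dictionary(lexicon, threshold=0.5):
    """Map each basic emotion to its strongly expressive words
    (intensity > threshold)."""
    out = {}
    for word, scores in lexicon.items():
        for emotion, intensity in scores.items():
            if intensity > threshold:
                out.setdefault(emotion, set()).add(word)
    return out

def expressive_words(text, emo_dict):
    """Emotionally expressive words found in a story segment, per emotion."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return {e: sorted(tokens & words) for e, words in emo_dict.items()}
```

A segment then counts toward an emotion if it contains any word from that emotion's dictionary.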

B.1 Emotion Classification
Our EC-CLF reward captures the protagonist's emotions using our emotion classifier. In this section we provide details about its evaluation.
We first evaluate the classifier on the tweets corpus (Mohammad et al., 2018), comparing it against several strong baselines (Kant et al., 2019). For this comparison, we trained all models on the training set of the corpus and tested them on a held-out test set. The models were evaluated using Jaccard-index-based accuracy, and micro and macro F1 scores. This evaluation set-up (train-validation-test splits and choice of evaluation metrics) follows the challenge that provided the corpus (the SemEval Task 1: E-c challenge). The results of this comparison are shown in the top half of Table 6. Our emotion classifier, BERT-large, is superior or competitive with the other models.
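For reference, Jaccard-index-based accuracy for multi-label classification averages, over examples, the overlap between predicted and gold label sets; a minimal sketch (function name is our own):

```python
def jaccard_accuracy(preds, golds):
    """Mean of |pred ∩ gold| / |pred ∪ gold| over examples; each
    element of preds/golds is a set of emotion labels."""
    scores = []
    for p, g in zip(preds, golds):
        p, g = set(p), set(g)
        if not p and not g:
            scores.append(1.0)  # both empty counts as perfect agreement
        else:
            scores.append(len(p & g) / len(p | g))
    return sum(scores) / len(scores)
```

For example, predicting {anger, sadness} against gold {anger} scores 1/2 for that instance.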
The results reported above show that the model performs well for emotion classification in tweets. However, our goal is to design a model that can track the protagonist's emotions in stories. As described in the main paper, we therefore further fine-tuned the classifier on our automatically annotated story corpus (described in §4.1 of the paper). We also evaluated the classifier on a held-out portion of this corpus consisting of about 1,201 stories (6,005 sentences in total). The results are reported in the last row of Table 6. The classifier achieves a (Jaccard index) accuracy of 61.75% and micro and macro F1 scores of 0.650 and 0.557, respectively. Note that this differs from the evaluation reported in the paper, which was conducted on a subset of stories annotated by humans.

B.2 Manual Annotation for Protagonist's Emotions
As described in the paper (§3.1), the emotion classifier was also evaluated on a subset of 50 randomly selected stories (250 sentences) manually annotated for the emotions experienced by their protagonists. This annotation was done on Amazon Mechanical Turk. To ensure good annotation quality, we selected turkers who had an approval rate greater than 97%, had at least 1,000 approved HITs, and were located in the U.S. Fig. 7 shows a screenshot of our setup. The Fleiss kappa (inter-annotator agreement) was moderate (κ = 0.55). We also analyzed the annotations to identify the major sources of disagreement between the turkers. Most disagreements occurred between neutral and joy, and also between sadness and anger. The overall label distribution for this human-annotated set is shown in Fig. 5.
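For completeness, the Fleiss kappa agreement statistic used above can be computed as follows; this is a standard textbook sketch, not our analysis code, and the count-matrix representation is our own:

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of annotators who assigned category j
    to item i (rows must sum to the same number of raters)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed per-item agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    total = n_items * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators per sentence, as in our setup, each row of the count matrix sums to 3.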

B.3 Base Storytelling Model
Baselines: We compare our base storytelling model, GPT-2, with several state-of-the-art baseline models. All of these models are trained, validated, and tested on the same data splits described in §4.1.

Results
The models are evaluated on content quality using the automatic measures described in §4.3 of the paper. For comparison with the baselines, we compute word-level perplexity (PPL) for GPT-2+FT. That is, we normalize the total negative log probability of the target text by the number of word-level tokens (as in the baselines), not the number of BPE tokens. Table 7 shows the results of this evaluation. Fine-tuned GPT-2 performs better than all baselines on all measures, demonstrating that it is a good base storytelling model upon which to build our models.
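The normalization above can be made concrete with a short sketch (function name is our own): the total negative log-likelihood is accumulated over BPE tokens but exponentiated per word-level token, which makes the PPL comparable across models with different subword vocabularies.

```python
import math

def word_level_ppl(total_nll_nats, n_word_tokens):
    """Word-level perplexity: the total negative log-likelihood of the
    target text (in nats, summed over BPE tokens) divided by the number
    of word-level tokens, then exponentiated."""
    return math.exp(total_nll_nats / n_word_tokens)
```

Dividing by the (smaller) word-token count rather than the BPE-token count yields a higher, but baseline-comparable, perplexity.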

B.4 Emotion-Aware Storytelling
Supplementary qualitative examples: We provide more qualitative examples in Fig. 8, which shows stories generated by our model for a given title and emotion arc for the protagonist.

Title: A new outfit
Emotion arc: joy → joy → joy
Story: Sara went to the mall. She wanted to find something cute to wear. She noticed a beautiful red dress she liked. Sara bought the dress. She was happy that she found something nice.

Title: Wedding
Emotion arc: joy → anger → joy
Story: Ben is getting married. He is having fun. Suddenly, Ben gets into a huge fight with a guy. Ben gets out of the wedding early. Ben is so happy to have his wedding over.

Title: Study
Emotion arc: fear → fear → sadness
Story: Tina was so nervous for her big exam. She did not study at all. She was so afraid she would fail. She ended up failing. Tina felt very sad.

Title: The big storm
Emotion arc: joy → joy → sadness
Story: It was a beautiful day in the city. The sky was clear and the weather was mild. The school was hosting an evening celebration. A big storm rolled through and the students were worried. The storm destroyed all the decorations and ruined the party.

Title: Delayed flight
Emotion arc: joy → anger → sadness
Story: Tom was going on a vacation. He decided to go to Florida. Unfortunately his flight was delayed. He was so frustrated he called his airline. His airline cancelled his flight.

Title: The new pet
Emotion arc: neutral → joy → joy
Story: Sam was walking around the neighborhood. She saw a cute little dog. She decided to take him home. He got along well with everyone. Sam was glad to have a companion.

Title: Larry practice yoga
Emotion arc: fear → joy → joy
Story: Larry has always felt nervous about yoga. He has tried many times to practice but has never gotten the hang of it. He decides to take a yoga class at his local yoga studio. He is amazed by the benefits and feels confident about his yoga practice. Larry is happy he learned to enjoy yoga.

Figure 8: Qualitative examples of generated stories given a title and an emotion arc.