“My Way of Telling a Story”: Persona based Grounded Story Generation

Visual storytelling is the task of generating stories based on a sequence of images. Inspired by recent work in neural generation on controlling the form of text, this paper explores the idea of generating these stories in different personas. One of the main challenges of this task is the lack of a dataset of visual stories in different personas. There are, however, independent datasets for visual storytelling and for sentences annotated with various personas. In this paper we describe an approach that overcomes this gap by taking labelled persona data from a different task and leveraging those annotations to perform persona based story generation. We inspect various ways of incorporating personality in both the encoder and the decoder representations to steer the generation in the target direction. To this end, we propose five models which are incremental extensions to the baseline model. In our experiments we use five different personas to guide the generation process. We find that the models based on our hypotheses perform better at capturing words while generating stories in the target persona.


Introduction
Storytelling through pictures dates back to prehistoric times: around 30,000 years ago, paintings of herds of animals like bison, rhinos and gazelles were made in a cave in Southern France. However, these were not merely paintings; they were stories about the heroic adventures of humans. Since then visual storytelling has evolved from paintings to photography to motion pictures to video games. With respect to this timeline, neural generative storytelling has gained traction only recently. Recent research has focused on challenges in generating longer documents (Wiseman et al., 2017; Lau and Baldwin, 2016) as well as on predicting the next events in the story (Martin et al., 2018). Contemporary research has focused on using deep generative models to capture high-level plots and structures in stories (Fan et al., 2018). Recent years have also seen some work hinging on event structures and scripts (Rishes et al., 2013). Generating an appropriate ending of a story was also studied by Guan et al. (2018) and Sharma et al. (2018). Research on generating stories from a sequence of images is new (Lukin et al., 2018; Kim et al., 2018; Hsu et al., 2018; Gonzalez-Rico and Fuentes-Pineda, 2018). Cavazza et al. (2009) have stressed the importance of expressing emotions for the believability of an automated storytelling system. Adopting a personality trait hence becomes crucial to capture and maintain the interest of the audience. Associating the narrative with a personality instigates a sense of empathy and relatedness. Although there has been research in generating persona based dialog responses and generating stylistic sentences (Shuster et al., 2018; Fu et al., 2018; Prabhumoye et al., 2018), generating persona based stories with different personality types narrating them has been unexplored. In this paper, we focus on generating a story from a sequence of images as if the agent belongs to a particular personality type.
* Both authors contributed equally to this work.
Specifically, we perform our experiments on visual story telling (Huang et al., 2016). This paper introduces a novel approach to generating visual stories in five different personality types. A key challenge to this end is the lack of large scale persona annotated stories. We address this by transferring knowledge from annotated data in the dialog domain to the storytelling domain. We base our visual story generator model on Kim et al. (2018) and propose multiple techniques to induce the personalities in the latent representations of both the encoder and the decoder. The goal of our work is to learn the mapping between the latent representations of the images and the tokens of the story such that we encourage our generative model to generate tokens of a particular personality. We evaluate our generative models using the automatic metric ROUGE (Lin, 2004), which takes into account the sentence level similarity in structure and thus roughly evaluates the matching of content. We acknowledge that there is a drop in this metric since our model is not trying to optimize generation alone but also to adapt personality from a different dataset. We also evaluate the success of generating the story in the target personality type using automatic and qualitative analysis. The automatic metrics comprise the classification accuracies obtained from classifiers trained on the annotated data. We observe that one of the proposed models (LEPC, described in Section 3) performs slightly better at classification accuracies for most of the personas while retaining similar ROUGE scores.
The main contribution of this paper is showing simple yet effective approaches to narrating visual stories in different personality types. The paper also displays an effective way of using annotated data in the dialog domain to guide the generative models to a specified target personality.

Related Work
Visual Story Telling: The last decade witnessed enormous interest in research at the intersection of multiple modalities, especially vision and language. Mature efforts in image captioning (Hossain et al., 2019) paved the way to more advanced tasks like visual question answering (Wu et al., 2017) and visual dialog (Das et al., 2017; Mostafazadeh et al., 2017). An obvious next step from single shot image captioning is the task of describing a sequence of images which are related to one another to form a story like narrative. This task was introduced as visual story telling by Huang et al. (2016), differentiating descriptions of images in isolation (image captions) from stories in sequences. The baseline model that we leverage for personality conditioned story generation is based on the model proposed by Kim et al. (2018) for the visual story telling challenge. Another simple yet effective technique is the late fusion model by Smilevski et al. (2018). In addition to static images, Gella et al. (2018) have also collected a dataset of stories describing videos uploaded on social media. Chandu et al. (2019) recently introduced a dataset for generating textual cooking recipes from a sequence of images and proposed two models to incorporate structure in procedural text generation from images.
Style Transfer: One line of research that is closely related to our task is style transfer in text. Recently generative models have gained popularity in attempting to solve style transfer in text with non-parallel data (Hu et al., 2017; Li et al., 2018). Some of this work has also focused on transferring author attributes (Prabhumoye et al., 2018), transferring multiple attributes (Lample et al., 2019; Logeswaran et al., 2018) and collecting a parallel dataset for formality (Rao and Tetreault, 2018). Although our work can be viewed as another facet of style transfer, our stories are strongly grounded in the sequence of images.
Persona Based Dialog: Persona based generation of responses has been studied by the NLP community in the dialog domain. Li et al. (2016) encoded personas of individuals in contextualized embeddings that capture the background information and style to maintain consistency in the responses given. The embeddings for the speaker information are learnt jointly with the word embeddings. Following this work, Zhou et al. (2018) proposed the Emotional Chatting Machine, which generates responses in an emotional tone in addition to conditioning on the content. The key difference between the former and the latter work is that the latter captures dynamic change in emotion as the conversation proceeds, while the user persona remains the same in the former case. A huge dataset of conversations conditioned on the persona of the two people interacting has also been released. This work shows that conditioning on the profile information improves the dialogues, as measured by next utterance prediction. In these works, the gold value of the target response was known. For our work, we do not have gold values of stories in different personas. Hence we leverage annotated data from a different task and transfer that knowledge to steer our generation process.
Multimodal domain: With the interplay between visual and textual modalities, an obvious downstream application for persona based text generation is image captioning. Chandrasekaran et al. (2018) worked on generating witty captions for images by both retrieving and generating with an encoder-decoder architecture. This work used external resources to gather from the web a list of words related to puns, which the decoder attempts to generate conditioned on phonological similarity. Wang and Wen (2015) studied the statistical correlation of words associated with specific memes. These ideas have also recently penetrated into the visual dialog setting. Shuster et al. (2018) have collected a grounded conversational dataset with 202k dialogs where humans are asked to portray a personality in the collection process. They have also set up various baselines with different techniques to fuse the modalities, including a multimodal sum combiner and a multimodal attention combiner. We use this dataset to learn personas which are adapted to our storytelling model.

Models
We have a dataset of visual stories $S = \{S_1, \ldots, S_n\}$. Each story $S_i$ is a sequence of five images and the corresponding text of the story, $S_i = \{(I_i^{(1)}, x_i^{(1)}), \ldots, (I_i^{(5)}, x_i^{(5)})\}$. Our task is to generate the story based not only on the sequence of images but also closely following the narrative style of a personality type. We have five personality types (described in Section 4), $P = \{p_1, \ldots, p_5\}$, and each story is assigned one of these five personalities as its target persona. Here, each $p_i$ represents the one-hot encoding of the target personality for a story, i.e. $p_1 = [1, 0, 0, 0, 0]$ and so on till $p_5 = [0, 0, 0, 0, 1]$. Hence, we create a dataset such that for each story we also have a specified target personality type, $S_i = \{(I_i^{(1)}, x_i^{(1)}), \ldots, (I_i^{(5)}, x_i^{(5)}); p_i\}$. The inputs to our models are the sequence of images and the target personality type. We build generative models such that they are able to generate stories in the specified target personality type from the images. In this section, we first briefly describe classifiers that are trained discriminatively to identify each of the personalities and then move on to the story generation models that make use of these classifiers.
Here is an overview of the differences in the six models that we describe next.
1. The baseline model (Glocal) is a sequence to sequence model with global and local contexts for generating the story sentence corresponding to each image.
2. The Multitask Personality Prediction (MPP) model is equipped with predicting the personality in addition to generating the sentences of the story. This model also incorporates binary encoding of personality.
3. The Latent Encoding of Personality in Context (LEPC) model incorporates an embedding of the personality as opposed to binary encoding.
4. The Latent Encoding of Personality in Decoder (LEPD) model augments personality embedding at each step in the decoder, where each step generates a token.
5. Stripped Encoding of Personality in Context (SEPC) is similar to LEPC but encodes personality embedding after stripping the mean of the story representation.
6. Stripped Encoding of Personality in Decoder (SEPD) is similar to LEPD but encodes personality embedding after stripping the mean of the story representation. This is similar to the intuition behind SEPC.

Classification
We use a convolutional neural network (CNN) architecture to train our classifiers. We train five separate binary classifiers, one for each of the personality types. Each classifier is trained in a supervised manner to predict whether a sentence belongs to a particular personality or not, which requires labeled data. Each sample of text $x$ in the respective datasets of the five personality types has a label in the set {0, 1}. Let $\theta^{p_j}_C$ denote the parameters of the classifier for personality $p_j$, where $j \in \{1, \ldots, 5\}$. Each classifier is trained with the following objective: $\min_{\theta^{p_j}_C} L^{p_j}_C(\theta^{p_j}_C)$ (1). We use cross entropy loss to calculate $L^{p_j}_C$ for each of the five classifiers. The classifiers accept continuous representations of tokens as input.
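As an illustration of the loss each binary classifier minimizes, the following minimal sketch computes the mean cross entropy between predicted probabilities and labels in {0, 1}. The function name, the clamping constant `eps` and the example probabilities are assumptions made for illustration; the CNN that would produce the probabilities is not shown.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross entropy between predicted P(label = 1) and labels in {0, 1}."""
    total = 0.0
    for q, y in zip(probs, labels):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
    return total / len(labels)

# Hypothetical outputs of the classifier for one personality p_j on four sentences.
probs = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 0]
loss = binary_cross_entropy(probs, labels)
```

A separate loss of this form would be computed for each of the five classifiers, each on its own dataset.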

Story Generation
We present five extensions to incorporate personality based features in the generation of stories.
(1) Baseline model (Glocal): We first describe the baseline model that is used for visual story telling. This is based on the model (Kim et al., 2018) that attained better scores on human evaluation metrics. It follows an encoder-decoder framework translating a sequence of images into a story. From here on, we refer to this model as glocal through the rest of the paper owing to the global and local features in the generation of story sequence at each step (described in this section).
The image features for each of the steps are extracted with a ResNet-152 after resizing the images to 224 × 224. The features are taken from the penultimate layer of this pretrained model and the gradients are not propagated through this layer during optimization. These features are passed through a fully connected layer to obtain the final image features. In order to obtain an overall context of the story, the sequence of image features is passed through a Bi-LSTM. This represents the global context of the story. For each step in the generation of the story, the local context corresponding to the specificity of that particular image is obtained by augmenting the image features (local context) to the context features from the Bi-LSTM (global context). These glocal features are used to decode the story sentence at each step. This concludes the encoder part of the story. The decoder of each step in the story also uses an LSTM, which takes the same glocal feature for that particular step at each time step. Hence there are 5 glocal features, one per step of the story, each feeding into every time step of the decoder for its sentence.
For simplicity in understanding, we use the following notations throughout the model descriptions to represent the mathematical formulation of the generation models. Subscript k indicates the k-th step or sentence in a story. Subscript i indicates the i-th story example. The story encoder is represented as Encoder, which comprises the features extracted from the penultimate layer of ResNet-152 concatenated with the global context features from the Bi-LSTM. The entirety of this representation in the encoder and the glocal features obtained is represented using $z_k$ for the k-th step or sentence in the story.
Now, the generation of a sentence in the story is represented as follows: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid \hat{x}_k^{<t}, z_k)$. The generated sentence $\hat{x}_k$ is obtained from each of the output words $\hat{x}_k^t$, which is generated by conditioning on all of the prior words $\hat{x}_k^{<t}$ and the glocal feature $z_k$.
Personality based Generation: In the rest of the section, we are going to describe the incremental extensions to the baseline to adapt the model to perform persona based story generation.
(2) Multitask Personality Prediction (MPP): The intuition behind the hypothesis here is to provide the personality information to the model and also enable it to predict the personality along with the generation of the story. The obvious extension to provide personality information is to incorporate the one-hot encoding $p_i \in P$ of the five personas in the context before the decoder. The visual story telling data is split into five predetermined personalities as described in Section 4. For each story, the corresponding personality is encoded in a one-hot representation and is augmented to the glocal context features. These features are then given to the decoder to produce each step in the story. The model is enabled to perform two tasks: the primary task is to generate the story and the secondary task is to predict the personality of the story. The classifiers described in Section 3.1 are used to perform personality prediction. Formally, the generation process is represented by: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid \hat{x}_k^{<t}, z_k, p_i)$. Here, we condition the generation of each word on the glocal context features $z_k$, the binary encoding of the personality $p_i$ and the words generated till that point.
The cross entropy loss for generation is $L_g$ and the loss for the prediction of each of the personalities is $L^{p_j}_C$, given by Eq. 1. The overall loss optimized for this model is: $L_{total} = \alpha \cdot L_g + \frac{1 - \alpha}{5} \sum_{j=1}^{5} L^{p_j}_C$. We use cross entropy loss for each of the individual losses. We give a higher weight α to story generation and distribute the remaining (1−α) equally among the 5 personalities.
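The weighting described above can be sketched as a small function. The function name and the loss values are hypothetical, and α = 0.5 follows the setting reported in the experiments section, so each of the five personality losses receives a weight of 0.1.

```python
def total_loss(gen_loss, persona_losses, alpha=0.5):
    """Weight alpha on generation; (1 - alpha) split equally over persona losses."""
    per_persona_weight = (1.0 - alpha) / len(persona_losses)
    return alpha * gen_loss + per_persona_weight * sum(persona_losses)

# Hypothetical loss values for one batch (illustrative numbers only).
loss = total_loss(gen_loss=2.0, persona_losses=[0.6, 0.7, 0.5, 0.8, 0.4], alpha=0.5)
```

With these illustrative numbers the generation term contributes 1.0 and the five personality terms together contribute about 0.3.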
(3) Latent Encoding of Personality in Context (LEPC): This model is an incremental improvement over the MPP model. The key difference is the incorporation of personality as an embedding that captures more centralized traits in the words belonging to that particular personality. For each of the five personality types, we have a latent representation of the personality (P), as opposed to the binary encoding in the MPP model. Similar to the earlier setting, this average personality feature vector is concatenated with the glocal context vector. The generation step is formally represented as: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid \hat{x}_k^{<t}, [z_k; P], p_i)$. This means that $z_k$ is concatenated with P to give a personality informed representation, and the generation of each word is conditioned on these concatenated features, the binary encoding of the personality $p_i$ and the words generated so far.
(4) Latent Encoding of Personality in Decoder (LEPD): Instead of augmenting the personality traits to the context as done in the LEPC model, they could be explicitly used in each step of decoding. The latent representation of the personality (P) is concatenated with the word embedding for each time step in the decoder: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid [\hat{x}_k^{<t}; P], z_k, p_i)$.
The generation of each of the words is conditioned on the words generated so far, which are already concatenated with the average vector for the corresponding personality, and on the glocal features along with the binary encoding of the personality.
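The difference between LEPC and LEPD is where the personality vector is concatenated. The sketch below contrasts the two with plain Python lists; the function names and the tiny toy vectors are assumptions for illustration, and the LSTM decoder that would consume these inputs is not shown.

```python
def concat(u, v):
    """Concatenate two feature vectors given as lists."""
    return list(u) + list(v)

def lepc_context(z_k, P):
    """LEPC: append the personality vector to the glocal context, once per sentence."""
    return concat(z_k, P)

def lepd_decoder_inputs(word_embeddings, P):
    """LEPD: append the personality vector to the word embedding at every time step."""
    return [concat(e, P) for e in word_embeddings]

# Toy vectors (hypothetical values and sizes).
z_k = [0.1, 0.2, 0.3]             # glocal context for step k
P = [1.0, -1.0]                   # average personality representation
words = [[0.5, 0.5], [0.0, 1.0]]  # embeddings of the words generated so far

context = lepc_context(z_k, P)
inputs = lepd_decoder_inputs(words, P)
```

In LEPC the decoder sees the personality once through the context, whereas in LEPD every decoder input carries it.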
(5) Stripped Encoding of Personality in Context (SEPC): In order to orient the generation more towards the personality, we need to go beyond simple augmentation of personality. Deriving motivation from neural storytelling, we use a similar approach to subtract central characteristics of words in a story and add the characteristics of the personality. Along the same lines of calculating an average representation for each of the personalities, we also obtain an average representation of the story S. This average representation S intuitively captures the style of the story. Essentially, the story style is stripped off the context and the personality style is incorporated. The modified glocal feature that is given to the decoder is obtained as $m = z_k - S + P$. The generation process is now conditioned on m instead of $z_k$: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid \hat{x}_k^{<t}, m, p_i)$. Hence, the generation of each word in decoding is conditioned on the words generated so far ($\hat{x}_k^{<t}$), the binary encoding of the personality ($p_i$) and the modified representation of the context features (m). Here, note that the context features obtained thus far are from the visual data, and performing this operation attempts to associate the visual data with the central textual representations of the personalities and the stories.
(6) Stripped Encoding of Personality in Decoder (SEPD): This model is similar to SEPC with the modification of performing the stripping at each word embedding in the decoder as opposed to the context level stripping. The stripping is performed at the sentence level in SEPC and at the word level in the SEPD model. The LSTM based decoder decodes one word at a time. At each of these time steps, the word embedding feature E is modified as $e_k = E - S + P$. This modification is performed in each step of the decoding process. These modified features are used to generate each sentence in the full story. The model is trained to generate a sentence in the story as described below: $\hat{x}_k \sim \prod_t \Pr(\hat{x}_k^t \mid e_k^{<t}, z_k, p_i)$. The generation of each word is conditioned on the word embeddings modified using the aforementioned transformation ($e_k^{<t}$), the binary encodings of the personalities ($p_i$) and the glocal context features.
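The stripping operation shared by SEPC (applied to the context feature) and SEPD (applied to each word embedding) is an elementwise subtraction and addition. The following is a minimal sketch under the assumption that all three vectors already have the same dimensionality; the function name and toy values are hypothetical.

```python
def strip_and_add(feature, story_mean, persona_mean):
    """Return feature - S + P elementwise (story style removed, persona style added)."""
    if not (len(feature) == len(story_mean) == len(persona_mean)):
        raise ValueError("all three vectors must share one dimensionality")
    return [f - s + p for f, s, p in zip(feature, story_mean, persona_mean)]

# Toy vectors (hypothetical values).
z_k = [1.0, 2.0, 3.0]   # context feature (SEPC) or word embedding E (SEPD)
S = [0.5, 0.5, 0.5]     # average story representation
P = [0.0, 1.0, 0.0]     # average personality representation

m = strip_and_add(z_k, S, P)
```

The same function would be called once per sentence for SEPC and once per decoded word for SEPD.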

Datasets
Coalescing the segments of personality and sequential generation together, our task is to generate a grounded sequential story from the view of a personality. To bring this to action, we describe the two sources of data we use to generate personality based stories in this section. The first source of data is focussed on generic story generation from a sequence of images and the second source of data includes annotations for personality types for sentences. We tailor a composition of these two sources to obtain a dataset for personality based visual storytelling. Here, we note that the techniques described above can be applied for unimodal story generation as well.
Visual Story Telling: Visual Storytelling is the task of generating stories from a sequence of images. A dataset for this grounded sequential generation problem was collected by Huang et al. (2016) and an effort for a shared task was led in 2018. The dataset includes 40,155 training sequences of stories. It comprises sequences of images, descriptions of images in isolation and stories of images in sequences. We randomly divide the dataset into 5 segments (comprising 8,031 stories each) and each segment is associated with a personality.
Personality Dialog: Shuster et al. (2018) have provided a dataset of 401k dialog utterances, each of which belongs to one of 215 different personalities. The dataset was collected through image grounded human-human conversations. Humans were asked to play the role of a given personality. This makes the dataset very pertinent for our task, as it was collected through engaging image chat between two humans enacting their personalities.
For our task, we wanted to choose a set of five distinct personality types. Let the set of utterances that belong to each personality type be $U_p = \{u_p^1, \ldots, u_p^n\}$ where $p \in \{1, \ldots, 215\}$. We first calculate the pooled BERT representation (Devlin et al., 2018) of each of the utterances. To get the representation of the personality P, we simply average the BERT representations of all the utterances that belong to that personality. The representation of each personality is given by: $P = \frac{1}{n} \sum_{k=1}^{n} \mathrm{BERT}(u_p^k)$ (9). This representation is calculated only on the train set of Shuster et al. (2018). Since our goal is to pick the five most distinct personality types, we have the daunting task of filtering the 215 personality types down to 5. To make our task easier we want to group similar personalities together. Hence, we use K-Means Clustering to cluster the representations of the personalities into 40 clusters. We get well formed and meaningful clusters which look like [Impersonal, Aloof (Detached, Distant), Apathetic (Uncaring, Disinterested), Blunt, Cold, Stiff]; [Practical, Rational, Realistic, Businesslike]; [Empathetic, Sympathetic, Emotional]; [Calm, Gentle, Peaceful, Relaxed, Mellow (Soothing, Sweet)] etc. We then build a classifier using the technique described in Section 3.1 to classify the utterances as belonging to one of the 40 clusters. We pick the top five clusters that give the highest accuracy for the 40-way classification.
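The random division into five equally sized segments, each tied to one target personality, could look like the sketch below. The function name, the fixed seed and the use of integer story ids are assumptions for illustration; stories left over when the count is not divisible by five would stay unassigned in this sketch.

```python
import random

def assign_personas(story_ids, num_personas=5, seed=0):
    """Shuffle stories and give each equal-sized segment a one-hot target persona."""
    ids = list(story_ids)
    random.Random(seed).shuffle(ids)
    segment = len(ids) // num_personas
    assignment = {}
    for j in range(num_personas):
        one_hot = [1 if k == j else 0 for k in range(num_personas)]
        for sid in ids[j * segment:(j + 1) * segment]:
            assignment[sid] = one_hot
    return assignment

# 40,155 training stories split into 5 segments of 8,031 stories each.
assignment = assign_personas(range(40155))
```

Each story id maps to the one-hot vector of its segment, which plays the role of the target personality encoding used by the models.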
The five personality clusters selected are:
• Cluster 1 (C1): Arrogant, Conceited, Egocentric, Lazy, Money-minded, Narcissistic, Pompous and Resentful
• Cluster 2 (C2): Skeptical and Paranoid
• Cluster 3 (C3): Energetic, Enthusiastic, Exciting, Happy, Vivacious, Excitable
• Cluster 4 (C4): Bland and Uncreative
• Cluster 5 (C5): Patriotic
We build five separate classifiers, one for each personality cluster. Note that these clusters are also associated with personalities and hence are later referred to as P followed by the cluster id in the following sections. To build the five binary classifiers, we create label balanced datasets for each cluster, i.e. we randomly select as many negative samples from the remaining 4 clusters as there are positive samples in that cluster. We use the train, dev and test split as is from Shuster et al. (2018). The dataset statistics for each of the five clusters are provided in Table 1. Note that all the datasets have a balanced distribution of labels 0 and 1. For our experiments it does not matter that the number of samples differs across clusters, because we build a separate classifier for each cluster and their outputs are treated as independent from one another.
We finally calculate the representation P for each of the five clusters and the representation S of stories using equation 9. Note that S is calculated over the visual story telling dataset. These representations are used by our generative models LEPC, LEPD, SEPC, and SEPD.
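The label balanced construction described above can be sketched as follows. The function name, the seed and the toy utterances are hypothetical; `random.Random.sample` raises `ValueError` if the other clusters contain fewer utterances than the target cluster, which this sketch does not handle.

```python
import random

def balanced_binary_dataset(utterances_by_cluster, target, seed=0):
    """Positives from the target cluster plus equally many sampled negatives."""
    positives = [(u, 1) for u in utterances_by_cluster[target]]
    pool = [u for c, us in utterances_by_cluster.items() if c != target for u in us]
    rng = random.Random(seed)
    negatives = [(u, 0) for u in rng.sample(pool, len(positives))]
    data = positives + negatives
    rng.shuffle(data)
    return data

# Toy utterances per cluster (hypothetical text).
toy = {
    "C1": ["i deserve the best seat", "look at my new watch"],
    "C2": ["are you sure that is safe", "i do not trust this"],
    "C3": ["what a wonderful day", "this is so exciting"],
}
data = balanced_binary_dataset(toy, "C1")
```

Repeating this once per cluster gives five independent binary datasets, one for each classifier.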
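Both averaged representations are plain means over vectors. The sketch below assumes the utterance (or story sentence) vectors have already been computed by some encoder; short toy vectors stand in for the 768-dimensional BERT outputs, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

def persona_representations(vectors_by_persona):
    """Map each persona (or cluster) to the mean of its utterance vectors."""
    return {p: mean_vector(vs) for p, vs in vectors_by_persona.items()}

# Toy stand-ins for pooled utterance representations (hypothetical values).
toy_vectors = {
    "C3": [[1.0, 3.0], [3.0, 5.0]],
    "C5": [[0.0, 2.0]],
}
reps = persona_representations(toy_vectors)
```

The story representation S would be obtained with the same `mean_vector` call over vectors computed from the visual story telling text.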

Experiments and Results
This section presents the experimental setup for the models described in Section 3. Each of the models is an incremental extension over the baseline glocal model. The hyperparameters used for the baseline are as follows.
Hyperparameters: The hidden size of the Bi-LSTM encoder of the story to capture context is 1024. The dimensionality of the glocal context vector $z_k$ is 2048. Dropout of 50% is applied after the fully connected layer that produces the image features and after the global features obtained from the Bi-LSTM, which has 2 layers. The word embedding dimension used is 256. The learning rate is 1e-3 with a weight decay of 1e-5. The Adam optimizer is used with batch normalization and a momentum of 0.01. The loss functions are weighted differently in order to penalize the model more if the decoding is at fault as compared to not predicting the personality of the story. α is set to 0.5 and each of the individual personality losses is weighted by a factor of 0.1.
The remaining five models use the same hyperparameter settings, with the exception of the word embedding dimension. The average personality (P) and the average story (S) representations are obtained from a pre-trained BERT model. Hence each is a 768 dimensional vector. In order to perform the stripping of the story feature and the addition of the personality features on the word embeddings in the decoder, the word embedding dimension is matched to 768 in the SEPD model.

Quantitative Results
Figure 1: Example story from the original data and the stories generated by each model for the same image sequence.
Original: grandma loves when all the kids come over to visit . she will pick them them up and put them on her lap even though it <unk> . the kids love each other as well giving lots of hugs and love . grandma can not forget her little girl and gives her some love as well . grandpa says it 's time for cake .
Glocal: the family is having a great time . they are playing with each other . he is happy to see his grandson . she is being silly the birthday girl is eating a cake .
MPP: [ male ] and his friends are having a great time . they are all smiles for the camera . everyone is enjoying their new family . [ female ] is so excited to be there . she is very happy about her birthday .
LEPC: the family was having a great time . they were so happy to be together . they were having a good time with grandson . she was very excited to play with a kid . he was surprised by all of his friends .
LEPD: the family was ready to see a lot of a party . they had a great time . they were having a lot of fun . we had a great day . he was happy to eat cake .
SEPC: the parade was very beautiful . there were a lot of people there . we were so happy to be a great time . i had a great time . this was a picture of a little girl .
SEPD: the family is a great time . it was a lot of a big . there were a lot . i had a picture . they were a very .

Table 4: ROUGE L scores for the generated stories by each of our models.

We perform two sets of experiments: (1) evaluating the performance of the models on capturing the personalities in the story and (2) evaluating the performance of story generation. The former evaluation is performed using the classifiers (Section 3.1) pre-trained on the personality dataset. We calculate the classification accuracy of the generated stories of the test set for the desired target personality. However, we need to note that the classification error of the trained classifiers is reflected in this result as well. This evaluation is done at the sentence level, i.e. accuracy is calculated over each sentence of the story (each sentence of the story has the same target personality as that of the entire story). The performance of the generation is evaluated using the ROUGE score. Although this captures the generic aspect of generation, the metric does not explicitly evaluate whether the story is generated in the conditioned personality. In the future, we would also like to look at automatic evaluation of the generated stories with respect to the incorporation of personalities. Table 3 shows the results of classification accuracy for each of the five personalities. Table 4 shows the results of the ROUGE L evaluation. We acknowledge that there would be a deviation in this automatic score since optimizing the gold standard generation of the story from training data is not our end goal. Rather, our models make use of two distinct datasets and learn to transfer the traits annotated in the personality dialog dataset into the visual story telling dataset.
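To make the two automatic measures concrete, the sketch below computes a sentence-level ROUGE-L F-score from the longest common subsequence (LCS) of whitespace tokens, and a sentence-level accuracy from binary classifier decisions. This is an illustrative implementation, not the evaluation script used for the reported numbers; the function names and the choice of `beta = 1.0` are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-score between two whitespace-tokenized sentences."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

def sentence_level_accuracy(decisions):
    """Fraction of generated sentences accepted (1) by the target-persona classifier."""
    return sum(decisions) / len(decisions)

score = rouge_l("the family is having a great time .",
                "the family was having a great time .")
```

Because the score depends only on word overlap with the reference, a story rewritten in a different persona can lose ROUGE-L even when it remains grounded in the same images.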
Despite this, we notice that the LEPC model gives results comparable to those of the glocal model in terms of story generation. The LEPC model also gives a slight improvement in classification accuracy for most of the clusters (each cluster representing a personality). However, this result is insufficient to conclude that incorporating personality at the context level performs better than at the word level, since the opposite trend is observed in the SEPC and SEPD models. We plan to investigate this further by performing ablations and examining which operation is causing these models to perform weakly. Note that the SEPC model performs the best at incorporating personality in three of the five personality types, but this model takes a hit in the automatic score. This is because our generative models deal with competing losses of reconstruction and classification.

Qualitative Results
We present an example of the story generated by each of the proposed models in Figure 1. This example belongs to the persona in cluster C3. The words corresponding to this cluster are highlighted in blue in the persona conditioned generation of the stories. The main observation is that, for each of the MPP, LEPC and LEPD models, all five sentences in the story contain a word relevant to happiness. The SEPC and SEPD models capture these happiness features in only two and one sentences respectively. The glocal model does not cater explicitly to the personality, while our proposed models attempt to capture the persona tone in generation. This is observed in the fourth generated sentence in the sequence by each of our proposed models. While the glocal model uses the word 'silly', our models capture the tone and generate 'excited' and 'great'. Similarly for the fifth sentence, MPP, LEPC and LEPD generate 'happy', 'surprised' and 'happy' respectively.
It is observed that in most generated stories, the language model has taken a rough hit in the SEPD model. This is also substantiated in Figure 1. This seems to be due to stripping away the essential word embedding features that contribute to linguistic priors or language model. This could be potentially corrected by retaining the word embedding feature as is and augmenting it with the stripped features. Having presented these results, we notice that there is a significant scope for improving the generation of the story while capturing high level persona traits in generation.

Conclusions and Future Work
Automatic storytelling is a creative writing task that has long been the dream of text generation models. The voice conveying this story is the narrative style and this can be attributed to different personalities, moods, situations etc. In the case of persona based visual storytelling, this voice not only is aware of the grounded content to be conveyed in the images, but also has a model to steer the words in the narrative to characterize the persona.
A key challenge here is that there is no targeted data for this specific task. Hence we leverage annotations of persona from an external persona based dialog dataset and apply them to the visual storytelling dataset. We address this task of attributing a personality while generating a grounded story through simple techniques of incorporating persona information in our encoder-decoder architecture. We propose five simple incremental extensions to the baseline model that capture the personality. Quantitatively, our results show that the LEPC model improves upon the accuracy while at the same time not dropping the automatic scores. We also observe that the persona induced models generate at least one word per sentence in the story that belongs to that particular persona. While automatically evaluating this can be tricky, we adopt a classification based evaluation of whether the generated output belongs to the persona class or not. In the future, we hope to also perform human evaluations for measuring both the target personality type of the generated story and its coherence.
There is yet a lot of scope in incorporating the persona in the word embeddings. This is an ongoing work and we plan on investigating the relatively poor ROUGE performance of the SEPC and SEPD models and rectify them by equipping them with language model information. We also plan to work towards a stable evaluation protocol for this task in the future.