Learning Personas from Dialogue with Attentive Memory Networks

The ability to infer persona from dialogue can have applications in areas ranging from computational narrative analysis to personalized dialogue generation. We introduce neural models to learn persona embeddings in a supervised character trope classification task. The models encode dialogue snippets from IMDB into representations that can capture the various categories of film characters. The best-performing models use a multi-level attention mechanism over a set of utterances. We also utilize prior knowledge in the form of textual descriptions of the different tropes. We apply the learned embeddings to find similar characters across different movies, and cluster movies according to the distribution of the embeddings. The use of short conversational text as input, and the ability to learn from prior knowledge using memory, suggest these methods could be applied to other domains.


Introduction
Individual personality plays a deep and pervasive role in shaping social life. Research indicates that it can relate to the professional and personal relationships we develop (Barrick and Mount, 1993), (Shaver and Brennan, 1992), the technological interfaces we prefer (Nass and Lee, 2000), the behavior we exhibit on social media networks (Selfhout et al., 2010), and the political stances we take (Jost et al., 2009).
With increasing advances in human-machine dialogue systems, and widespread use of social media in which people express themselves via short text messages, there is growing interest in systems that have an ability to understand different personality types. Automated personality analysis based on short text analysis could open up a range of potential applications, such as dialogue agents that sense personality in order to generate more interesting and varied conversations.
* The first two authors contributed equally to this work.
We define persona as a person's social role, which can be categorized according to their conversations, beliefs, and actions. To learn personas, we start with the character tropes data provided in the CMU Movie Summary Corpus by Bamman et al. (2014). It consists of 72 manually identified commonly occurring character archetypes and examples of each. In the character trope classification task, we predict the character trope based on a batch of dialogue snippets.
In their original work, the authors use Wikipedia plot summaries to learn latent variable models that provide a clustering from words to topics and topics to personas; their persona clusterings were then evaluated by measuring similarity to the ground-truth character trope clusters. We asked the question: could personas also be inferred through dialogue? Because we use quotes as a primary input and not plot summaries, we believe our model is extensible to areas such as dialogue generation and conversational analysis.
Our contributions are:
1. Data collection of IMDB quotes and character trope descriptions for characters from the CMU Movie Summary Corpus.
2. Models that greatly outperform the baseline model in the character trope classification task. Our experiments show the importance of multi-level attention over words in dialogue, and over a set of dialogue snippets.
3. We also examine how prior knowledge in the form of textual descriptions of the persona categories may be used. We find that a 'Knowledge-Store' memory initialized with descriptions of the tropes is particularly useful. This ability may allow these models to be used more flexibly in new domains and with

Related Work
Prior to data-driven approaches, personalities were largely measured by asking people questions and assigning traits according to some fixed set of dimensions, such as the Big Five traits of openness, conscientiousness, extraversion, agreeableness, and neuroticism (Tupes and Christal, 1992). Computational approaches have since advanced to infer these personalities based on observable behaviors such as the actions people take and the language they use (Golbeck et al., 2011). Our work builds on recent advances in neural networks that have been used for natural language processing tasks such as reading comprehension (Sukhbaatar et al., 2015) and dialogue modeling and generation (Vinyals and Le, 2015; Li et al., 2016; Shang et al., 2015). This includes the growing literature in attention mechanisms and memory networks (Bahdanau et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2016).
The ability to infer and model personality has applications in storytelling agents, dialogue systems, and psychometric analysis. In particular, personality-infused agents can help "chit-chat" bots avoid repetitive and uninteresting utterances (Walker et al., 1997; Mairesse and Walker, 2007; Li et al., 2016; Zhang et al., 2018). The more recent neural models do so by conditioning on a 'persona' embedding; our model could help produce those embeddings.
Finally, in the field of literary analysis, graphical models have been proposed for learning character personas in novels (Flekova and Gurevych, 2015; Srivastava et al., 2016), folktales (Valls-Vargas et al., 2014), and movies (Bamman et al., 2014). However, these models often use more structured inputs than dialogue to learn personas.

Datasets
Characters in movies can often be categorized into archetypal roles and personalities. To understand the relationship between dialogue and personas, we utilized three different datasets for our models: (a) the Movie Character Trope dataset, (b) the IMDB Dialogue Dataset, and (c) the Character Trope Description Dataset. We collected the IMDB Dialogue and Trope Description datasets, and these datasets are made publicly available 1 .

Character Tropes Dataset
The CMU Movie Summary dataset provides tropes commonly occurring in stories and media (Bamman et al., 2014). There are a total of 72 tropes, which span 433 characters and 384 movies. Each trope contains between 1 and 25 characters, with a median of 6 characters per trope. Tropes and canonical examples are shown in Table 1.

IMDB Dialogue Snippet Dataset
To obtain the utterances spoken by the characters, we crawled the IMDB Quotes page for each movie. Though not every single utterance spoken by the character may be available, as the quotes are submitted by IMDB users, many quotes from most of the characters are typically found, especially for the famous characters found in the Character Tropes dataset. The distribution of quotes per trope is displayed in Figure 1. Our models were trained on 13,874 quotes and validated and tested on a set of 1,734 quotes each.
We refer to each IMDB quote as a (contextualized) dialogue snippet, as each quote can contain several lines between multiple characters, as well as italicized text giving context to what might be happening when the quote took place. Figure 2 shows a typical dialogue snippet. 70.3% of the quotes are multi-turn exchanges, with a mean of 3.34 turns per multi-turn exchange. While the character's own lines alone can be highly indicative of the trope, our models show that accounting for the context and the other characters' lines improves performance. The context, for instance, can give clues to typical scenes and actions that are associated with certain tropes, while the other characters' lines give further detail into the relationship between the character and his or her environment.

Character Trope Description Dataset
We also incorporate descriptions of each of the character tropes by using the corresponding descriptions scraped from TVTropes 2. Each description contains several paragraphs describing the typical characteristics, actions, personalities, etc. of the trope. As we demonstrate in our experiments, the use of these descriptions improves classification performance. This could allow our model to be applied more flexibly beyond the movie character tropes. As one example, we could store descriptions of personalities based on the Big Five traits in our Knowledge-Store memory.

Problem Formulation
Our goal is to train a model that can take a batch of dialogue snippets from the IMDB dataset and predict the character trope. Formally, let $N_P$ be the total number of character tropes in the character tropes dataset. Each character $C$ is associated with a corresponding ground-truth trope category $P$. Let $S = (D, E, O)$ be a dialogue snippet associated with a character $C$, where $D = [w^D_1, w^D_2, ..., w^D_T]$ refers to the character's own lines, $E = [w^E_1, w^E_2, ..., w^E_T]$ is the contextual information, and $O = [w^O_1, w^O_2, ..., w^O_T]$ denotes the other characters' lines. We define all three components of $S$ to have fixed sequence length $T$ and pad when necessary. Let $N_S$ be the total number of dialogue snippets for a trope. We sample a set of $N_{diag}$ (where $N_{diag} \le N_S$) snippets from the $N_S$ snippets related to the trope as inputs to our model.
2 http://tvtropes.org
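The input construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the `PAD` token, the function names, and the choice to truncate components longer than $T$ are all assumptions (the text only states that components are padded when necessary).

```python
import random

PAD = "<pad>"  # hypothetical padding token, not specified in the paper


def pad_or_truncate(tokens, T):
    """Force a token list to exactly T entries (truncation is an assumption)."""
    return (tokens + [PAD] * T)[:T]


def sample_snippets(snippets, n_diag, T, rng=random):
    """Sample n_diag snippets S = (D, E, O) and fix every component to length T."""
    chosen = rng.sample(snippets, min(n_diag, len(snippets)))
    return [tuple(pad_or_truncate(part, T) for part in s) for s in chosen]
```

Each returned snippet is a triple of equal-length token lists, which is the shape the snippet encoder expects.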

Attentive Memory Network
The Attentive Memory Network consists of two major components: (a) Attentive Encoders, and (b) a Knowledge-Store Memory Module. Figure 3 outlines the overall model. We describe the components in the following sections.

Attentive Encoders
Not every piece of dialogue may be reflective of a latent persona. In order to learn to ignore words and dialogue snippets that are not informative about the trope, we use a multi-level attentive encoder that operates at (a) the individual snippet level, and (b) across multiple snippets.

Attentive Snippet Encoder
The snippet encoder extracts features from a single dialogue snippet $S$, with attention over the words in the snippet. A snippet $S = (D, E, O)$ is fed to the encoder to extract features from each of these textual inputs and encode them into an embedding space. We use a recurrent neural network as our encoder, explained in detail in Section 5.1.1. In order to capture the trope-reflective words from the input text, we augment our model with a self-attention layer which scores each word in the given text for its relevance. Section 5.1.2 explains how the attention weights are computed. The output of this encoder is an encoded snippet embedding $S^e = (D^e, E^e, O^e)$.

Attentive Inter-Snippet Encoder
Figure 3: Illustration of the Attentive Memory Network. The network takes dialogue snippets as input and predicts their associated character trope. In this example, dialogue snippets associated with the character trope "Bruiser with a Soft Corner" are given as input to the model.

As shown in Figure 3, the $N_{diag}$ snippet embeddings $S^e$ from the snippet encoder are fed to our inter-snippet encoder. This encoder captures inter-snippet relationships using recurrence over the $N_{diag}$ snippet embeddings for a given trope and determines their importance. Some of the dialogue snippets may not be informative about the trope, and the model learns to assign low attention scores to such snippets. The resulting attended summary vector from this phase is the persona representation $z$, defined as:

$$z = \gamma^s_D D^s + \gamma^s_E E^s + \gamma^s_O O^s \qquad (1)$$

where $\gamma^s_D$, $\gamma^s_E$, $\gamma^s_O$ are learnable weight parameters. $D^s$, $E^s$, $O^s$ refer to the summary vectors of the $N_{diag}$ character's lines, contextual information, and other characters' lines, respectively. In Section 7, we experiment with models that have $\gamma^s_E$ and $\gamma^s_O$ set to 0 to understand how the contextual information and other characters' lines contribute to the overall performance.
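Reading the persona representation as a weighted sum of the three summary vectors, a small sketch makes the ablation setting concrete. The function name and the default weights are hypothetical; in the model the three weights are learned rather than fixed.

```python
def persona_representation(D_s, E_s, O_s, gamma_D=1.0, gamma_E=1.0, gamma_O=1.0):
    """Weighted sum of the character, context, and other-lines summary vectors."""
    return [gamma_D * d + gamma_E * e + gamma_O * o
            for d, e, o in zip(D_s, E_s, O_s)]
```

Setting `gamma_E` and `gamma_O` to 0 reduces the representation to the character's own lines, which corresponds to the ablation described in the text.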

Encoder
Given an input sequence $(x_1, x_2, ..., x_T)$, we use a recurrent neural network to encode the sequence into hidden states $(h_1, h_2, ..., h_T)$. In our experiments, we use a gated recurrent unit (GRU) (Chung et al., 2014) rather than an LSTM (Hochreiter and Schmidhuber, 1997) because the latter is more computationally expensive. We use bidirectional GRUs and concatenate the forward and backward hidden states to get $\overleftrightarrow{h}_t$ for $t = 1, ..., T$.

Attention
We define an attention mechanism $Attn$ that computes a summary vector $s$ from the resultant hidden states $\overleftrightarrow{h}_t$ of a GRU by learning to generate weights $\alpha_t$. Each weight can be interpreted as the relative importance given to a hidden state $h_t$ in forming an overall summary vector for the sequence. Formally, we define it as:

$$a_t = f_{attn}(h_t), \qquad \alpha_t = \frac{\exp(a_t)}{\sum_{k=1}^{T} \exp(a_k)}, \qquad s = \sum_{t=1}^{T} \alpha_t h_t$$

where $f_{attn}$ is a two-layer fully connected network in which the first layer projects $h_t \in \mathbb{R}^{d_h}$ to an attention hidden space $g_t \in \mathbb{R}^{d_a}$, and the second layer produces a relevance score for every hidden state at timestep $t$.
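The attention pooling described above can be illustrated with a dependency-free sketch. It assumes a softmax normalization of the relevance scores and a tanh non-linearity between the two layers of the scorer; neither detail is confirmed by the text, and the weight names `W1` and `w2` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def f_attn(h, W1, w2):
    """Two-layer scorer: project h to the attention space g, then to a scalar."""
    g = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W1]  # tanh is assumed
    return sum(w * x for w, x in zip(w2, g))


def attend(hidden_states, W1, w2):
    """Return (alpha, s): attention weights and the weighted summary vector."""
    alpha = softmax([f_attn(h, W1, w2) for h in hidden_states])
    dim = len(hidden_states[0])
    s = [sum(a * h[i] for a, h in zip(alpha, hidden_states)) for i in range(dim)]
    return alpha, s
```

The same routine applies at both levels of the encoder: over word-level hidden states within a snippet, and over snippet-level hidden states across the $N_{diag}$ snippets.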

Memory Modules
Our model consists of a read-only 'Knowledge-Store' memory, and we also test a recent read-write memory. External memories have been shown to help on natural language processing tasks (Sukhbaatar et al., 2015; Kumar et al., 2016; Kaiser and Nachum, 2017), and we find similar improvements in learning capability.

Knowledge-Store Memory
The main motivation behind the Knowledge-Store memory module is to incorporate prior domain knowledge. In our work, this knowledge refers to the trope descriptions described in Section 3.3. Related works have initialized their memory networks with positional encoding using word embeddings (Sukhbaatar et al., 2015; Kumar et al., 2016; Miller et al., 2016). To incorporate the descriptions, we represent them with skip-thought vectors (Kiros et al., 2015) and use them to initialize the memory keys $K_M \in \mathbb{R}^{N_P \times d_K}$, where $N_P$ is the number of tropes, and $d_K$ is set to the size of the embedded trope description $R^D$, i.e. $d_K = \|R^D\|$.
The values in the memory represent learnable embeddings of the corresponding trope categories $V_M \in \mathbb{R}^{N_P \times d_V}$, where $d_V$ is the size of the trope category embeddings. The network learns to use the persona representation $z$ from the encoder phase to find relevant matches in the memory. This corresponds to calculating similarities between $z$ and the keys $K_M$. Formally, this is calculated as: We iteratively combine our mapped persona representation $z_M$ with information from the memory $r_{out}$. The above process is repeated $n_{hop}$ times. The memory mapped persona representation $z_M$ is updated as follows: where $z^0_M = z_M$, and $f_r: \mathbb{R}^{d_V} \to \mathbb{R}^{d_K}$ is a fully-connected layer. Finally, we transform the resulting $z^{n_{hop}}_M$ using another fully-connected layer,
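A multi-hop read over a key-value memory of this kind might look like the following sketch. It is one plausible reading, not the paper's exact formulation: the dot-product similarity, the softmax over keys, and the additive update of the query are all assumptions, and `f_r` is passed in as an arbitrary callable standing in for the fully-connected layer.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ks_memory_read(z_M, keys, values, f_r, n_hop=1):
    """Query an N_P x d_K key matrix and read from an N_P x d_V value matrix.

    f_r maps a d_V-dimensional read vector back to d_K dimensions.
    Returns the updated query and the match weights of the last hop.
    """
    p = None
    for _ in range(n_hop):
        p = softmax([dot(z_M, k) for k in keys])        # assumed: dot-product similarity
        r_out = [sum(p_j * v[i] for p_j, v in zip(p, values))
                 for i in range(len(values[0]))]        # weighted read of the values
        z_M = [a + b for a, b in zip(z_M, f_r(r_out))]  # assumed: additive update
    return z_M, p
```

Because the keys are initialized from trope descriptions, the weights `p` can be read as a soft assignment of the query to tropes.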

Read-Write Memory
We also tested a Read-Write Memory following Kaiser et al. (Kaiser and Nachum, 2017), which was originally designed to remember rare events. In our case, these 'rare' events might be key dialogue snippets that are particularly indicative of latent persona. It consists of keys, which are activations of a specific layer of the model, i.e. the persona representation $z$, and values, which are the ground-truth labels, i.e. the trope categories. Over time, it is able to facilitate predictions based on past data with similar activations stored in the memory. For every new example, the network writes to memory for future lookup. A memory with memory size $N_M$ is defined as:

$$M = (K_{N_M \times d_h}, V_{N_M}, A_{N_M})$$

Memory Read
We use the persona embedding $z$ as a query to the memory. We calculate the cosine similarities between $z$ and the keys in $M$, take the softmax on the top-k neighbors, and compute a weighted embedding $\hat{z}_M$ using those scores.
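The read step can be sketched as follows. The lookup table `label_emb`, which maps each stored trope label to an embedding, is a hypothetical stand-in: the text does not say exactly what is averaged to form the weighted embedding, so treating it as a weighted sum of label embeddings is an assumption.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def memory_read(z, keys, values, label_emb, k=2):
    """Return (z_hat, neighbours) for query z.

    keys: stored persona representations; values: their trope labels.
    neighbours is a list of (memory index, weight) for the top-k keys.
    """
    sims = [cosine(z, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] for i in top])
    dim = len(next(iter(label_emb.values())))
    z_hat = [sum(w * label_emb[values[i]][d] for i, w in zip(top, weights))
             for d in range(dim)]
    return z_hat, list(zip(top, weights))
```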

Memory Write
We update the memory in a similar fashion to the original work of Kaiser and Nachum (2017), which takes into account the maximum age of items as stored in $A_{N_M}$.

Objective Losses
To train our model, we utilize the different objective losses described below.

Classification Loss
We calculate the probability of a character belonging to a particular trope category $P$ through Equation 11, where $f_P: \mathbb{R}^{d_h} \to \mathbb{R}^{N_P}$ is a fully-connected layer, and $z$ is the persona representation produced by the multi-level attentive encoders described in Equation 1. We then optimize the categorical cross-entropy loss between the predicted and true tropes as in Equation 12, where $N_P$ is the total number of tropes, $q_j$ is the predicted probability that the input character falls under trope $j$, and $p_j \in \{0, 1\}$ denotes the ground truth of whether the input snippets come from characters from the $j$-th trope.
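As a concrete illustration of this loss, the sketch below applies a softmax to a vector of trope scores and computes the categorical cross-entropy against a one-hot target. It assumes the scores are the output of the fully-connected layer applied to $z$; the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, p):
    """Return (loss, q): cross-entropy -sum_j p_j log q_j and the predicted distribution q."""
    q = softmax(logits)
    loss = -sum(p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)
    return loss, q
```

With a one-hot target, the sum reduces to the negative log-probability assigned to the true trope.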

Trope Description Triplet Loss
In addition to using trope descriptions to initialize the Knowledge-Store Memory, we also test learning from the trope descriptions through a triplet loss (Hoffer and Ailon, 2015). We again use the skip-thought vectors to represent the descriptions. Specifically, we want to maximize the similarity of representations obtained from dialogue snippets with their corresponding description, and minimize their similarity with negative examples. We implement this as: where $f_D: \mathbb{R}^{d_h} \to \mathbb{R}^{\|R^D\|}$ is a fully-connected layer. The triplet ranking loss is then Equation 14, where $\alpha_T$ is a learnable margin parameter and $s(\cdot, \cdot)$ denotes the similarity between trope embeddings ($R^P$), positive ($R^D_p$) and negative ($R^D_n$) trope descriptions.
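A triplet ranking loss of this kind is commonly written as a hinge on the gap between the positive and negative similarities. The sketch below uses that standard form with cosine similarity; both choices are assumptions rather than details confirmed by the text, and the margin is a fixed argument here although the paper learns it.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def triplet_loss(r_p, desc_pos, desc_neg, margin=0.2):
    """Hinge loss: zero once the positive description beats the negative by the margin."""
    return max(0.0, margin - cosine(r_p, desc_pos) + cosine(r_p, desc_neg))
```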

Trope Description Triplet Loss with Memory Module
If a memory module is used, we compute a new triplet loss in place of the one described in Equation 14. Models that use a memory module should learn a representation $\hat{z}_M$, based on either the prior knowledge stored in the memory (as in Knowledge-Store memory) or the top-k key matches (as in Read-Write memory), that is similar to the representation of the trope descriptions. This is achieved by replacing the persona embedding $z$ in Equation 13 with the memory output $\hat{z}_M$ as shown in Equation 15, where $f_{D_M}: \mathbb{R}^{d_h} \to \mathbb{R}^{\|R^D\|}$ is a fully-connected layer. To compute the new loss, we combine the representations obtained from Equations 13 and 15 through a learnable parameter $\gamma$ that determines the importance of each representation. Finally, we utilize this combined representation $\hat{R}^P$ to calculate the loss as shown in Equation 17.

Read-Write Memory Losses
When the Read-Write memory is used, we use two extra loss functions. The first is a Memory Ranking Loss $J_{MR}$ as done in (Kaiser and Nachum, 2017), which learns based on whether a query with the persona embedding $z$ returns nearest neighbors with the correct trope. The second is a Memory Classification Loss $J_{MCE}$ that uses the values returned by the memory to predict the trope. The full details for both are found in Supplementary Section A.

Overall Loss
We combine the above losses through:

$$J = \beta_{CE} J_{CE} + \beta_{MCE} J_{MCE} + \beta_T J_T + \beta_{MR} J_{MR}$$

where $\beta = [\beta_{CE}, \beta_{MCE}, \beta_T, \beta_{MR}]$ are learnable weights such that $\sum_i \beta_i = 1$. Depending on which variant of the model is being used, the list $\beta$ is modified to contain only relevant losses. For example, when the Knowledge-Store memory is used, we set $\beta_{MR} = \beta_{MCE} = 0$ and $\beta$ is modified to $[\beta_{CE}, \beta_T]$. We discuss different variants of our model in the next section.
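One way to keep learnable loss weights summing to 1 is to parameterize them as a softmax over unconstrained scores, restricted to the losses a given model variant actually uses. The text does not say how the constraint is enforced, so this parameterization, the function name, and the dictionary keys are all assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_losses(losses, logits):
    """Weighted sum of the active losses.

    losses: dict of loss name -> value for the losses this variant uses.
    logits: dict of loss name -> unconstrained score; extra names are ignored.
    Returns (total loss, dict of the normalized weights actually used).
    """
    names = sorted(losses)
    betas = softmax([logits[n] for n in names])
    total = sum(b * losses[n] for b, n in zip(betas, names))
    return total, dict(zip(names, betas))
```

For a Knowledge-Store variant, only the classification and triplet losses would be passed in, so the memory ranking and memory classification terms drop out of both the sum and the normalization.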

Experiments
We experimented with combinations of our various modules and losses. The experimental results and ablation studies are described in the following sections, and the experimental details are described in Supplementary Section B. The different model permutation names in Table 2, e.g. "attn 3 tropetrip ks-mem ndialog16", are defined as follows:
• baseline vs. attn: The 'baseline' model uses only one dialogue snippet $S$ to predict the trope, i.e. $N_{diag} = 1$. Hence, the inter-snippet encoder is not used. The 'attn' model operates on $N_{diag}$ dialogue snippets using the inter-snippet encoder to assign an attention score for each snippet $S_i$.
• char vs. 3: 'char' models use only the character's own lines $D$, while '3' models use all three inputs $(D, E, O)$.
• tropetrip: Indicates that the triplet loss on the trope descriptions was used. If '-500' is appended to 'tropetrip', then the 4800-dimensional skip-thought embeddings representing the descriptions in Equations 15 and 17 are projected to 500 dimensions using a fully connected layer.
• ks-mem vs. rw-mem: 'ks-mem' refers to the Knowledge-Store memory, and 'rw-mem' refers to the Read-Write memory.
• ndialog: The number of dialogue snippets $N_{diag}$ used as input for the attention models. Any attention model without the explicit $N_{diag}$ listed uses $N_{diag} = 8$.

Ablation Results
Baseline vs. Attention Model. The attention model shows a large improvement over the baseline models. This matches our intuition that not every quote is strongly indicative of character trope. Some may be largely expository or 'chit-chat' pieces of dialogue. Example attention scores are shown in Section 7.2.
Though our experiments showed only marginal improvement between using the 'char' data and the '3' data, we found that using all 3 inputs performed better for models with the triplet loss and read-only memory. This is likely because the others' lines and context capture more of the social dynamics and situations that are described in the trope descriptions. Subsequent results are shown only for the 'attn 3' models.
Trope Description Triplet Loss. Adding the trope description loss alone provided relatively small gains in performance, though we see greater gains when combined with memory. Both use the descriptions, but a possible explanation is that the Knowledge-Store memory matches an embedding against all the tropes, whereas the trope triplet loss is only provided information from one positive and one negative example.
Memory Modules. The Knowledge-Store memory in particular was helpful. Initialized with the trope descriptions, this memory can 'sharpen' queries toward one of the tropes. The Read-Write memory had smaller gains in performance. It may be that more data is required to take advantage of the write capabilities.
Combined Trope Description Triplet Loss and Memory Modules. Using the triplet loss with memory modules led to greater performance when compared to the attn 3 model, but the performance is close to that of using either the triplet loss only or the memory only. However, when we increase $N_{diag}$ to 16 or 32, we find a jump in performance. This is likely the case because the model has both increased learning capacity and a larger sample of data at every batch, which means at least some of the $N_{diag}$ quotes should be informative about the trope.

Attention Scores
Because the inter-snippet encoder provides such a large gain in performance compared to the baseline model, we provide an example illustrating the weights placed on a batch of $N_{diag}$ snippets. Figure 4 shows the attention scores for the character's lines in the "byronic hero" trope. Matching what we might expect for an antihero personality, we find the top weighted line to be full of confidence and heroic bluster, while the middle lines hint at the character's personal turmoil. We also find the low-weighted sixth and seventh lines to be largely uninformative (e.g. "I heard things."), and the last line to be perhaps too pessimistic and negative for a hero, even a byronic one.

Purity scores of character clusters
Finally, we measure our ability to recover the trope 'clusters' (with one trope being a cluster of its characters) with our embeddings through the purity score used in (Bamman et al., 2014). Equation 19 measures the amount of overlap between two clusterings, where $N$ is the total number of characters, $g_i$ is the $i$-th ground truth cluster, and $c_j$ is the $j$-th predicted cluster.
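For readers who want to compute such a score, the sketch below gives one common form of purity: for each ground-truth cluster, take its largest overlap with any predicted cluster, sum these overlaps, and divide by the number of characters. Whether Equation 19 sums over ground-truth or predicted clusters is not recoverable from the text, so the direction used here is an assumption.

```python
def purity(gold_clusters, predicted_clusters):
    """Purity of two clusterings given as lists of sets of character ids.

    Sums, over ground-truth clusters, the best overlap with a predicted cluster.
    The direction of the sum and max is an assumption.
    """
    n = sum(len(g) for g in gold_clusters)
    overlap = sum(max(len(g & c) for c in predicted_clusters)
                  for g in gold_clusters)
    return overlap / n
```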
We use a simple agglomerative clustering method on our embeddings with a parameter $k$ for the number of clusters. The methods in (Bamman et al., 2014) contain a similar hyper-parameter for the number of persona clusters. We note that the metrics are not completely comparable because not every character in the original dataset was found on IMDB. The results are shown in Table 3, where AMN is our Attentive Memory Network. It might be expected that our model would perform better because we use the character tropes themselves as training data. However, dialogue may be noisier than the movie summary data; their better performing Persona Regression (PR) model also uses useful metadata features such as the movie genre and character gender. We simply note that our scores are comparable or higher.

Application: Narrative Analysis
We collected IMDB quotes for the top 250 movies on IMDB. For every character, we calculated a character embedding by taking the average embedding produced by passing all the dialogues through our model. We then calculated movie embeddings by taking the weighted sum of all the character embeddings in the movie, with the weight as the percentage of quotes they had in the movie. By computing distances between pairs of character or movie embeddings, we could potentially unearth notable similarities. We note some of the interesting clusters below.
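The aggregation described above is simple enough to show directly. In this sketch, each character maps to a list of per-dialogue embeddings already produced by the model; the function names and the dictionary layout are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]


def movie_embedding(char_dialogue_embeddings):
    """Weighted sum of character embeddings, weighted by each character's share of quotes.

    char_dialogue_embeddings: dict of character name -> list of dialogue embeddings.
    """
    total = sum(len(v) for v in char_dialogue_embeddings.values())
    dim = len(next(iter(char_dialogue_embeddings.values()))[0])
    movie = [0.0] * dim
    for dialogues in char_dialogue_embeddings.values():
        weight = len(dialogues) / total
        character = mean_vector(dialogues)
        movie = [m + weight * c for m, c in zip(movie, character)]
    return movie
```

Distances between the resulting character or movie vectors can then be compared with any standard metric.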

Conclusion
We used the character trope classification task as a test bed for learning personas from dialogue. Our experiments demonstrate that the use of a multi-level attention mechanism greatly outperforms a baseline GRU model. We were also able to leverage prior knowledge in the form of textual descriptions of the trope. In particular, using these descriptions to initialize our Knowledge-Store memory helped improve performance. Because we use short text and can leverage domain knowledge, we believe future work could use our models for applications such as personalized dialogue systems.