Learning Similarity between Movie Characters and Its Potential Implications on Understanding Human Experiences

While many different aspects of human experience have been studied by the NLP community, none captures its full richness. We propose a new task to capture this richness based on an unlikely setting: movie characters. We sought to capture theme-level similarities between movie characters that were community-curated into 20,000 themes. By introducing a two-step approach that balances performance and efficiency, we achieved a 9-27% improvement over recent paragraph-embedding based methods. Finally, we demonstrate how the thematic information learnt from movie characters can potentially be used to understand themes in the experiences of people, as expressed in Reddit posts.


Introduction
What makes a person similar to another? While there is no definitive answer, some aspects that have been investigated in the NLP community are personality (Gjurković and Šnajder, 2018; Conway and O'Connor, 2016), demographics (Nguyen et al., 2016) as well as personal beliefs and intents (Sap et al., 2019). While each of these aspects is valuable on its own, each also seems somewhat lacking for sketching a complete picture of a person. Researchers who recognise such limitations seek to ameliorate them by jointly modelling multiple aspects at the same time (Benton et al., 2017). Yet, we intuitively know that as humans, we are more than the sum of the multiple aspects that constitute our individuality. Our human experiences are marked by many different aspects that interact in ways we cannot anticipate. What then can we do to better capture the degree of similarity between different people?
Finding similar movie characters can be an interesting first step to understanding humans better. Many characters are inspired by and related to true stories of people, so understanding how to identify similarities between character descriptions might ultimately help us to better understand similarities in human characteristics and experiences. One way of defining what makes movie character descriptions similar is when community-based contributors on All The Tropes classify them into the same theme (also known as a trope), with an example from the trope "Driven by Envy" shown in Table 1. Other themes (tropes) include "Parental Neglect", "Fallen Hero", and "A Friend in Need".
Such community-based curation allows All The Tropes to reap the same advantages as Wikipedia and open-source software: a large catalog can be created with high internal consistency given the in-built self-correction mechanisms. This approach allowed us to collect a dataset of >100 thousand characters labelled with >20,000 themes without requiring any annotation cost. Based on this dataset, we propose a model that can be used to identify similar movie characters precisely yet efficiently. While movie characters may not be a perfect reflection of human experience, we ultimately show that they are good enough proxies, given that collecting a dataset of similar scale with real people would be extremely expensive.
Our key contributions are as follows: 1. We conduct a pioneering study on identifying similar movie character descriptions using weakly supervised learning, with potential implications on understanding similarities in human characteristics and experiences.
2. We propose a two-step generalizable approach that can be used to identify similar movie characters precisely yet efficiently and demonstrate that our approach performs at least 9-27% better than methods employing recent paragraph embedding-based approaches.
3. We show that our model, which is trained on identifying similar movie characters, can be related to themes in human experience found in Reddit posts.

Table 1: Example character descriptions from the trope "Driven by Envy".

Conduit
Superman's 1990s enemy Conduit. Conduit hates Superman because he knows if Superman wasn't around he would be humanity's greatest hero instead ...

Loki
Loki's constant scheming against Thor in his efforts to one-up him gave Odin and the rest of Asgard more and more reasons to hate Loki ...
2 Related Work

Analysis of characters in film and fiction
Characters in movies and novels have been computationally analyzed by many researchers. Bamman et al. (2013, 2014) attempted to cluster various characters into prototypes based on topic modelling techniques (Blei et al., 2003). On the other hand, Frermann and Szarvas (2017) and Iyyer et al. (2016) sought to cluster fictional characters alongside the relationships between them using recurrent neural networks and matrix factorization. While preceded by this prior literature, our work is novel in framing character analysis as a supervised rather than an unsupervised learning problem. Specifically, we formulate it as a similarity learning task between characters. Tapping fan-curated movie-character labels (i.e. tropes) provides valuable information concerning character similarity, which previous literature did not use. A perceptible effect of this change in task formulation is that it allows movie characters to be finely distinguished amongst > 20,000 themes versus < 200 in prior literature. Such differences in task formulation can contribute a fresh perspective to this research area and inspire subsequent research.
Furthermore, the corpus we use differs significantly from those used in existing research. We use highly concise character descriptions of around 200 words, whereas existing research mostly uses movie/book-length character mentions. Concise character descriptions can exemplify specific traits/experiences of characters. This allows the differences between characters to be more discriminative compared to a longer description, which might include more points of commonality (going to school/work, eating and having a polite conversation). This means that such concise descriptions can eventually prove more helpful in understanding the characteristics and experiences of humans.

Congruence between themes in real-life experiences and movie tropes
Mostly researched in the field of psychology, real-life experiences are often analyzed by asking individuals to document and reflect upon their experiences. Trained analysts then seek to classify such writing into predefined categories. Demorest et al. (1999) interpreted an individual's experience in the form of three key stages: an individual's wish, the response from the other and the response from the self in light of the response from the other. Each stage consists of around ten predefined categories such as wanting to be autonomous (Stage 1), being denied that autonomy (Stage 2) and developing an enmity against the other (Stage 3). Thorne and McLean (2001) organized their analysis in terms of central themes. These central themes include experiences of interpersonal turmoil, having a sense of achievement and surviving a potentially life-threatening event/illness.
Both studies above code individuals' personal experiences into categories/themes that greatly resemble movie tropes. Because of this congruence, it is very likely that identifying similarity between characters in the same trope can inform us about similarity between people in real life. A common drawback of Demorest et al. (1999) and Thorne and McLean (2001) lies in their relatively small sample sizes (fewer than 200 people classified into tens of themes/categories). Comparatively, our study uses > 100,000 characters labelled by fans at a fine granularity into > 20,000 tropes. As a result, this study has the potential to support a better understanding of tropes, which we have shown to be structurally similar to themes in real-life experiences.

Candidate selection in information retrieval
Many information retrieval pipelines involve first identifying likely candidates and then post-processing these candidates to determine which among them are most suitable. The most widely-used class of approaches for this purpose is Shingling with Locality-Sensitive Hashing (Leskovec et al., 2020; Rodier and Carter, 2020). Such approaches first represent documents as Bag-of-Ngrams before hashing this representation into shorter integer-vector signatures. These signatures contain information on n-gram overlap between documents and hence encode lexical features that characterize similar documents. However, such approaches are unable to identify documents that are similar based on abstract semantic features rather than superficial lexical similarities. Recent progress in language modelling has enabled the semantic meaning of short paragraphs to be encoded beyond lexical features (Peters et al., 2018; Devlin et al., 2019; Howard and Ruder, 2018; Raffel et al., 2019). This has reaped substantial gains in text similarity tasks including entailment tasks (Bowman et al., 2015; Williams et al., 2018), duplicate questions tasks (Sharma et al., 2019; Nakov et al., 2017) and others (Cer et al., 2017; Dolan and Brockett, 2005). Yet, such progress has yet to enable better candidate selection based on semantic similarities. As a result, relatively naive approaches such as exhaustive pairwise comparisons and distance-based measures continue to be the dominant approach to identifying similar documents encoded into dense contextualized embeddings (Reimers and Gurevych, 2019). To address this gap in knowledge, this study proposes and validates a candidate selection method that is compatible with recent progress in text representation.

Task formulation
There is a set of unique character descriptions from All The Tropes (Character_0, Character_1, ..., Character_n), each associated with a non-unique trope (theme) (Trope_0, Trope_1, ..., Trope_p). Given this set, find the k (where k = 1, 5 or 10) most similar character(s) to each character without making explicit use of the trope association of each character. In doing so, the goal is to maximize the proportion of the most similar character(s) that share the same tropes.
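As a toy illustration of this objective (with a hypothetical helper name, not code from the paper), the task can be scored by checking what fraction of each character's predicted top-k neighbours share its trope:

```python
def score_neighbours(tropes, neighbours, k):
    """Proportion of each character's top-k predicted neighbours
    that share its trope, averaged over all characters.

    tropes: tropes[i] is the trope label of character i
    neighbours: neighbours[i] is a ranked list of predicted
                most-similar character indices for character i
    """
    hits = 0
    for i, ranked in enumerate(neighbours):
        hits += sum(1 for j in ranked[:k] if tropes[j] == tropes[i])
    return hits / (k * len(neighbours))
```

For example, with three characters where only the first two share a trope, `score_neighbours(["A", "A", "B"], [[1, 2], [0, 2], [0, 1]], k=1)` gives 2/3, since two of the three top-1 predictions land in the right trope.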

Methods
In this section, we first discuss how we prepared the dataset and trained a BERT Next Sentence Prediction (NSP) model to identify similar characters. Based on this model, we present a two-step Select and Refine approach, which can be used to find the most similar characters quickly yet effectively.

Dataset
Character descriptions from All The Tropes were used. We downloaded all character descriptions with more than 100 words because character descriptions that are too short are unlikely to provide sufficient textual information for comparing similarity with other character descriptions. We then filtered our data to retain only tropes that contain more than one character description. Character descriptions were then randomly split into training and evaluation sets (evaluation set = 20%). Inspired by the BERT NSP dataset construction of Devlin et al. (2019), we generated all possible combination-pairs of character descriptions classified under each trope (i.e. an unordered set) and gave each text-pair the label IsSimilar. For each IsSimilar pair in the training set, we took the first item, randomly selected a character description that is not in the same trope as the first item and gave the new pair the label NotSimilar.
Descriptive statistics are available in Table 2.
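The pair construction described above can be sketched as follows (function and variable names are hypothetical; this assumes at least two tropes so that a cross-trope negative always exists):

```python
import itertools
import random

def build_pairs(trope_to_chars, rng=random.Random(0)):
    """Build IsSimilar / NotSimilar text pairs: every unordered
    within-trope pair is IsSimilar; for each such pair, its first
    item is paired with a random out-of-trope character to form a
    NotSimilar example."""
    all_chars = [(c, t) for t, chars in trope_to_chars.items() for c in chars]
    pairs = []
    for trope, chars in trope_to_chars.items():
        for a, b in itertools.combinations(chars, 2):
            pairs.append((a, b, "IsSimilar"))
            # negative: same first item, random character from another trope
            neg, neg_trope = rng.choice(all_chars)
            while neg_trope == trope:
                neg, neg_trope = rng.choice(all_chars)
            pairs.append((a, neg, "NotSimilar"))
    return pairs

pairs = build_pairs({"Driven by Envy": ["Conduit ...", "Loki ..."],
                     "Fallen Hero": ["Character A ...", "Character B ..."]})
```

With two tropes of two characters each, this yields two IsSimilar pairs and two NotSimilar pairs, i.e. a balanced training set as in the paper's construction.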

Training BERT Next Sentence Prediction model
We trained a BERT Next Sentence Prediction model (English-base-uncased) with the pre-trained weights used as an initialization. As this model was trained to perform pair-wise character comparison instead of next sentence prediction, we hereafter name it the Character Comparison Model (CCM). All hyper-parameters used to train the model were kept at their defaults, except for adjusting the maximum sequence length to 512 tokens (to adapt to the paragraph-length text), the batch size per GPU to 8 and the number of epochs to 2, as recommended by Devlin et al. (2019). Within the training set, 1% was separated as a validation set during the training process. We also used the default pre-trained BERT English-base-uncased tokenizer because only a small proportion of words (< 0.5%) in the training corpus were out-of-vocabulary, of which most were names. Training took 3 days on 4 Nvidia Tesla P100 GPUs.

Figure 1 (caption, partial): 2) top_n characters are then selected using cosine similarity based on the Character Encoding Model or using a Siamese-BERT model, which has been omitted from the illustration for clarity (Section 4.3.1). This selection is then refined using the Character Comparison Model to create a similarity matrix, which can then be sorted to identify the most similar characters.

Select and Refine
To address the key limitation of exhaustive pairwise comparison in practice, namely its impractically long computation time (≈ 10 thousand GPU-hours on Nvidia Tesla P100), we propose a two-step Select and Refine approach. The Select step first identifies a small set of likely candidates in a coarse but computationally efficient manner. Then, the Refine step re-ranks these candidates using a precise but computationally expensive model. In doing so, it combines their strengths to precisely identify similar characters while being computationally efficient. While the Select and Refine approach is designed for identifying similar characters, this novel approach can also be directly applied to other tasks involving semantic similarities between pairs of texts.

Select
Characters that are likely to be similar to each character are first selected using a variant of our CCM model, named the Character Encoding Model (hereafter CEM). This model differs from the CCM in that it does not use the final classifier layer. It can therefore process a character description individually (instead of in pairs) to output an embedding that represents the character. The weights shared with the CCM mean that it encodes semantic information in a similar way. As a result, the character descriptions whose embeddings are most cosine-similar are likely to have high (but not necessarily the highest) character-pair similarity.
Beyond the CEM, any model capable of efficiently generating candidates for similar character description texts in O(n) time can also be used for this Select step, allowing immense flexibility in the application of the Select and Refine approach. To demonstrate this, we also test a Siamese-BERT model for the Select step, with the details of its preparation in Section 5.2.
In this step, we effectively reduce the search space for the most similar characters. We choose the top_n candidate characters most similar to each character, forming top_n most similar character-pairs. top_n is a hyper-parameter that can range from 1 to 500. Strictly speaking, this step requires O(n²) comparisons to find the top_n most similar character-pairs. However, each cosine similarity calculation is significantly less computationally demanding than each BERT NSP operation (note that the CCM is trained from an NSP model). This also applies to the Siamese-BERT model because character embeddings can be cached, meaning that only a single classification-layer operation needs to be repeated O(n²) times. Consequently, computational runtime is dominated by the O(n) BERT NSP operations in the subsequent Refine step, given the huge constant factor of BERT NSP operations. Overall, this step took 0.25 GPU-hours.
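A minimal sketch of the Select step over cached embeddings (assuming a precomputed embedding matrix; the function name is illustrative, not from the paper's code) could look like this:

```python
import numpy as np

def select_candidates(embeddings, top_n):
    """Select step: for each character, pick the top_n most
    cosine-similar other characters from cached embeddings.

    embeddings: (n_chars, dim) array, one row per character
    returns: (n_chars, top_n) array of candidate indices
    """
    # normalise rows so that dot products equal cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T           # O(n^2) entries, but each is cheap
    np.fill_diagonal(sims, -np.inf)    # exclude trivial self-matches
    # sort each row descending and keep the top_n columns
    return np.argsort(-sims, axis=1)[:, :top_n]
```

Each entry of the similarity matrix is a single dot product, which is why the quadratic number of comparisons here remains negligible next to the BERT NSP forward passes in the Refine step.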

Refine
The initial selection of candidates for the most similar characters to each character is then refined using the CCM. This step is more computationally demanding (0.25 · top_n GPU-hours) but can more effectively determine the extent to which characters are similar. The CCM is only applied to the top_n most similar candidate character-pairs, reducing the number of operations for each character from the total number of characters (n_chars) to only top_n. As a consequence, the runtime complexity of the overall operation is reduced from O(n_chars²) to O(top_n · n_chars) = O(n_chars), given that top_n is a constant.
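The Refine step can be sketched as follows, with a generic `pair_score` callable standing in for the BERT-based CCM (the names here are hypothetical):

```python
def refine(candidates, pair_score, k):
    """Refine step: re-rank each character's top_n candidates with
    an expensive pairwise scorer (the CCM in the paper), keeping k.

    candidates: dict mapping query id -> list of candidate ids (from Select)
    pair_score: callable (query, candidate) -> similarity score;
                a stand-in here for the BERT-based CCM
    """
    results = {}
    for query, cands in candidates.items():
        # only top_n expensive comparisons per query, not n_chars
        ranked = sorted(cands, key=lambda c: pair_score(query, c), reverse=True)
        results[query] = ranked[:k]
    return results
```

Because `pair_score` is only called top_n times per query, the number of expensive model invocations grows linearly with the number of characters, matching the complexity argument above.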

Evaluation
In this section, we first present evaluation metrics and then present the preparation of baseline models including state-of-the-art paragraph-level embedding models. Finally, we analyze the performance of our models relative to baseline models.

Evaluation metrics
Recall @ k considers the proportion of all ground-truth pairs found within the k (1, 5 or 10) most similar characters to each character (Manning et al., 2008). Normalized Discounted Cumulative Gain @ k (nDCG @ k) is a precision metric that considers the proportion of the predicted k most similar characters to each character that are in the ground-truth character-pairs. It also takes into account the order amongst the top k predicted most similar characters (Wang et al., 2013). Mean reciprocal rank (MRR) identifies the rank of the first correctly predicted most similar character for each character and averages the reciprocals of these ranks (Voorhees, 2000). Higher is better for all metrics.
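For a single query's ranked prediction list, these metrics can be sketched as below (a standard textbook formulation with binary relevance, not the paper's exact evaluation code):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the ground-truth similar characters found in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG over binary relevance, normalised by the ideal ordering,
    so earlier correct predictions are rewarded more."""
    dcg = sum(1 / math.log2(i + 2)
              for i, c in enumerate(ranked[:k]) if c in relevant)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal

def reciprocal_rank(ranked, relevant):
    """1/rank of the first correct prediction; 0 if none is found.
    MRR averages this quantity over all queries."""
    for i, c in enumerate(ranked):
        if c in relevant:
            return 1 / (i + 1)
    return 0.0
```

For instance, if the only ground-truth match appears at rank 2 of a query's predictions, recall @ 3 is 1.0 while the reciprocal rank is 0.5, illustrating how the metrics reward coverage and early ranking differently.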
Baseline models

The Google Universal Sentence Encoder-large model (USE) on TensorFlow Hub was used to obtain a 512-dimensional vector representation of each character description. Bag of Words (BoW) was implemented by lowercasing all words and counting the number of times each word occurred in each character description. BERT embeddings of 768 dimensions were obtained by average-pooling all the word embeddings of tokens in the second-to-last layer, as recommended by Xiao (2018). The English-base-uncased version was used. For each type of embedding, the most similar characters were obtained by finding the other characters whose embeddings are most cosine-similar.
Siamese-BERT was obtained by training a Siamese model architecture connected to a BERT base model on the training set from Section 4.1. We follow the optimal model configuration for sentence-pair classification tasks described in Reimers and Gurevych (2019), which involves taking the mean of all token embeddings in the final layer. With the mean embedding for each character description, the absolute difference between them was taken. The mean embedding for character A, the mean embedding for character B and their absolute difference were then fed into a feedforward neural network, which makes the prediction. Siamese-BERT was chosen as a baseline due to its outstanding performance in sentence-pair classification tasks such as Semantic Textual Similarity (Cer et al., 2017) and Natural Language Inference (Bowman et al., 2015; Williams et al., 2018). For this baseline, the characters most similar to a character are those with the highest likelihood of being predicted IsSimilar with it.

Suitability of Siamese-BERT and CEM for Step 1: Select

While the prohibitively high computational demands of exhaustive pairwise comparison (≈ 10 thousand GPU-hours) prevent a full-scale evaluation of the adequacy of Siamese-BERT and CEM for Step 1: Select, we conducted a small-scale experiment on 100 randomly chosen characters from the test set. First, an exhaustive pairwise comparison was conducted between these randomly chosen characters and all characters in the test set. From this, the 100 characters with the highest CCM similarity value with each of the randomly chosen characters were identified. Next, the various methods in Table 3 were each used to identify the 500 characters with the highest cosine similarity with the randomly chosen characters. Finally, the proportion of overlap between CCM and each method was calculated. Results demonstrate that Siamese-BERT and CEM have the greatest overlap; hence, using Siamese-BERT and CEM selects the greatest number of highly similar characters to be refined by the CCM.

Selecting hyper-parameter top_n for Step 2: Refine

Based on Figure 2, the ideal top_n for the Select and Refine model with Siamese-BERT varies between 7 and 25 depending on the metric that is optimised for. In general, a lower value of top_n is preferred when optimizing for Recall@k and nDCG@k with smaller values of k. The metrics reported in Table 4 consist of the optimal value for each metric across these values of top_n. On the other hand, there is no ideal value of top_n when using the Select and Refine model with CEM. Instead, the metrics continue to improve over large values of top_n, albeit at a gradually reduced rate. However, due to practical considerations relating to GPU computation time, we terminated our search at top_n = 500 and report metrics for that value of top_n.
Together, this means that the Select and Refine model using Siamese-BERT achieves peak performance with significantly fewer computational resources than the one using CEM (2-6 GPU-hours vs. 125 GPU-hours).

Comparing Select and Refine models with baseline models
As shown in Table 4, the highest value for all metrics lies below 40%, suggesting that identifying similar characters is a novel and challenging task. This is because there are only very few correct answers (characters from the same trope) out of 27,000 possible characters. The poor performance of the Bag-of-Words baseline also demonstrates that abstract semantic similarity between characters is significantly different from superficial lexical similarity. In the face of such challenges, the Select and Refine model using Siamese-BERT performed 9-27% better on all metrics than the best performing paragraph-embedding-based baseline. This suggests the importance of refining the initial selection of candidates instead of using them directly, even when the baseline model has relatively good performance.
Table 4 (header): Recall @ k (in %) and nDCG @ k (in %) at k = 1, 5, 10, and MRR (in %), for the Select and Refine models and baselines.

Comparing the Select and Refine models, Siamese-BERT performed much better than CEM while requiring a significantly lower top_n, which means that fewer computational resources are required. The superior performance and efficiency of Siamese-BERT mean that it is more suitable for Step 1: Select. This is likely caused by the higher performance of Siamese-BERT as a baseline model. While it was surprising that using Siamese-BERT outperformed CEM, which directly shares weights with the CCM, such an observation also shows the relatively low coupling between the Select and Refine steps. This means that the Select and Refine approach that we propose can continue to be relevant when model architectures that are more optimized for each step are introduced in the future.
The significantly higher performance of the Select and Refine models can be attributed to the ability of the underlying BERT NSP architecture in our CCM to consider complex word relationships across the two character descriptions. A manual examination of correct pairs captured only by Select and Refine models but not baseline models revealed that these pairs often contain words relating to multiple common aspects. As an example, one character description contains "magic, enchanter" and "training, candidate, learn" while the other character in the ground-truth pair contains "spell, wonder, sphere" and "researched, school". Compressing these word-level aspects into a fixed-length vector would cause some important semantic information, such as the inter-relatedness between aspects, to be lost (Conneau et al., 2018). As a result, capturing similarities between these pairs proves difficult for the baseline models, leading to sub-optimal ranking of the most similar characters.
6 Implications for understanding themes in real-life experiences

6.1 Relating movie characters to Reddit posts

To demonstrate the potential applications of this study in understanding human experiences, we designed a task that shows how the model can be used with zero-shot transfer learning. Specifically, we used our model to identify the movie characters that best fit a description of people's life experiences. To do this, we collected 50 posts describing people's real-life experiences from r/OffMyChest, a forum on Reddit where people share their life experiences with strangers online. Then, we used our models to identify the 10 movie characters (from our test set) that best fit each post. For each of the 10 movie characters suggested by a model, three graduate students independently rated whether the character matches the concepts, ideas and themes expressed in each post, while blind to which model generated the characters. Because the extent of similarity between a movie character and a Reddit post can be ambiguous, a binary annotation was chosen over a Likert scale for clarity of annotation. Annotators were instructed to annotate "similar" when they could specify at least one area of overlap between the concepts, ideas and themes of a Reddit post and a movie character. Examples of some characters annotated as "similar" to two posts are shown in Appendix A. Annotators agreed on 94.2% of labels (Cohen's κ = 0.934). Where the annotators disagreed, the majority opinion out of three was taken. From these annotations, Precision @ k (in %) at k = 1, 5 and 10 is calculated as the proportion of the k identified characters that are labelled "similar" (Manning et al., 2008).

Table 5 (header): Precision @ k (in %) at k = 1, 5, 10 for each model.
In Table 5, the performance of our Select and Refine models reflects an extent of improvement similar to that on our main learning task. This shows that a model trained to disambiguate movie character similarity can also determine the extent of similarity between movie characters and people's life experiences. Beyond the relative performance gains, the Select and Refine model on this task also demonstrates an excellent absolute performance of precision @ 1 = 98.00%. This means that our model can be used on this task without any fine-tuning.
Illustrating the difference in performance of the various models in Table 6, the better performing models on this task are generally better at capturing thematic similarities in terms of the abstract sense of recollection and memory, which are thematically more related to the Reddit post. Our Select and Refine model (with Siamese-BERT) is particularly effective at capturing both a sense of recollection as well as a sense of reverence towards a respected figure (a historical figure and a father respectively). In contrast, the poorer performing models capture phrase-level semantic overlap (USE: picture with facial recognition; BoW: killed and passed away; eyes and recognize) but fail to capture thematic resemblance. This suggests our learning of similarities between movie characters of the same trope can effectively transfer onto thematic similarities between written human experiences and movie characters.

Future directions
We are excited about the diversity of research directions that this study can complement. One possible area is social media analysis (Zirikly et al., 2019;Amir et al., 2019;Hauser et al., 2019). Researchers can make use of movie characters with known experiences (e.g. mental health, personal circumstances or individual interests) to identify similar experiences in social media when collecting large amounts of text labelled with such experiences directly is difficult.
Another area would be personalizing dialogue agents (Tigunova et al., 2020; Zhang et al., 2018). In the context of limited personality-related training data, movie characters with personalities similar to a desired dialogue agent can be found. Using this, a dialogue agent can be trained with movie subtitle language data (involving the identified movie characters). Thereby, the augmented linguistic data enables the dialogue agent to have a well-defined, distinct and consistent personality.
A final area that can benefit from this study is media recommendations (Rafailidis et al., 2017). Users might be suggested media content based on the extent to which movie characters resonate with their own/friends' experiences. Additionally, with social environments being formed in games (particularly social simulation games such as Animal Crossing, The Sims and Pokemon) as well as in virtual reality (Chu et al., 2020), participants can even assume the identity of movie characters that they are similar to, so as to have an interesting and immersive experience.

Reddit post
My father passed away when I was 6 so I didn't really remember much of him but the fact that I didn't recognize his picture saddens me.

Select and Refine models

Siamese-BERT: Sisko in Star Trek: Deep Space Nine (Past Tense). When he encountered an entry about the historical figure, passed comment about how closely Sisko resembled a picture of him (the picture, of course, being that of Sisko).

CEM: Roxas in Kingdom Hearts: Chain of Memories. His memories are wiped by Ansem the Wise and placed in a simulated world with a completely new identity.

Baseline models

Siamese-BERT: Audrina in My Sweet Audrina by V.C. Andrews is a girl living in the constant shadow of her elder sister who had died nine years before she was born.

CEM: Macsen Wledig in The Mabinogion. An amazing memory was an important necessity to the job, but remembering many long stories was much more important than getting one right after days of wandering around madly muttering.

BERT: Kira in Push is made to think that her entire relationship with Nick was a false memory that she gave him and she's been pushing his thoughts the entire time they were together.

USE: EyeRobot in Fallout: New Vegas can recognize your face and voice with advanced facial and auditory recognition technology.

BoW: Magneto took Ron the Death Eater Up to Eleven to show him as he "truly" was in Morrison's eyes, and ended with him (intended as) Killed Off for Real.

Table 6: Most similar character predicted by each model to a post from Reddit r/OffMyChest. Excerpts of the Reddit post mildly paraphrased to protect anonymity.