Movie Script Summarization as Graph-based Scene Extraction

In this paper we study the task of movie script summarization, which we argue could enhance script browsing, give readers a rough idea of the script's plotline, and speed up reading time. We formalize the process of generating a shorter version of a screenplay as the task of finding an optimal chain of scenes. We develop a graph-based model that selects a chain by jointly optimizing its logical progression, diversity, and importance. Human evaluation based on a question-answering task shows that our model produces summaries which are more informative compared to competitive baselines.


Introduction
Each year, about 50,000 screenplays are registered with the WGA, the Writers Guild of America (a collective term representing US TV and film writers). Only a fraction of these make it through to be considered for production and an even smaller fraction to the big screen. How do producers and directors navigate through this vast number of scripts available? Typically, production companies, agencies, and studios hire script readers, whose job is to analyze screenplays that come in, sorting the hopeful from the hopeless. Having read the script, a reader will generate a coverage report consisting of a logline (one or two sentences describing the story in a nutshell), a synopsis (a two- to three-page long summary of the script), comments explaining its appeal or problematic aspects, and a final verdict as to whether the script merits further consideration. A script excerpt from "Silence of the Lambs", an American thriller released in 1991, is shown in Figure 1.
[Figure 1: Script excerpt from "Silence of the Lambs".]
Although there are several screenwriting tools for authors (e.g., Final Draft is a popular application which automatically formats scripts to industry standards, keeps track of revisions, allows insertion of notes, and writing collaboratively online), there is a lack of any kind of script reading aids. Features of such a tool could be to automatically grade the quality of the script (e.g., thumbs up or down), generate synopses and loglines, identify main characters and their stories, or facilitate browsing (e.g., "show me every scene where there is a shooting"). In this paper we explore whether current NLP technology can be used to address some of these tasks. Specifically, we focus on script summarization, which we conceptualize as the process of generating a shorter version of a screenplay, ideally encapsulating its most informative scenes. The resulting summaries can be used to enhance script browsing, give readers a rough idea of the script's content and plotline, and speed up reading time.
So, what makes a good script summary? According to modern film theory, "all films are about nothing - nothing but character" (Monaco, 1982). Beyond characters, a summary should also highlight major scenes representative of the story and its progression. With this in mind, we define a script summary as a chain of scenes which conveys a narrative and smooth transitions from one scene to the next. At the same time, a good chain should incorporate some diversity (i.e., avoid redundancy), and focus on important scenes and characters. We formalize the problem of selecting a good summary chain using a graph-theoretic approach. We represent scripts as (directed) bipartite graphs with vertices corresponding to scenes and characters, and edge weights to their strength of correlation. Intuitively, if two scenes are connected, a random walk starting from one would reach the other frequently. We find a chain of highly connected scenes by jointly optimizing logical progression, diversity, and importance.
Our contributions in this work are three-fold: we introduce a novel summarization task, on a new text genre, and formalize scene selection as the problem of finding a chain that represents a film's story; we propose several novel methods for analyzing script content (e.g., identifying important characters and their interactions); and perform a large-scale human evaluation study using a question-answering task. Experimental results show that our method produces summaries which are more informative compared to several competitive baselines.

Related Work
Computer-assisted analysis of literary text has a long history, with the first studies dating back to the 1960s (Mosteller and Wallace, 1964). More recently, the availability of large collections of digitized books and works of fiction has enabled researchers to observe cultural trends, address questions about language use and its evolution, study how individuals rise to and fall from fame, perform gender studies, and so on (Michel et al., 2010). Most existing work focuses on low-level analysis of word patterns, with a few notable exceptions. Elson et al. (2010) analyze 19th century British novels by constructing a conversational network with vertices corresponding to characters and weighted edges corresponding to the amount of conversational interaction. Elsner (2012) analyzes characters and their emotional trajectories, whereas Nalisnick and Baird (2013) identify a character's enemies and allies in plays based on the sentiment of their utterances. Other work (Bamman et al., 2013, 2014) automatically infers latent character types (e.g., villains or heroes) in novels and movie plot summaries.
Although we are not aware of any previous approaches to summarize screenplays, the field of computer vision is rife with attempts to summarize video (see Reed 2004 for an overview). Most techniques are based on visual information and rely on low-level cues such as motion, color, or audio (e.g., Rasheed et al. 2005). Movie summarization is a special type of video summarization which poses many challenges due to the large variety of film styles and genres. A few recent studies (Weng et al., 2009; Lin et al., 2013) have used concepts from social network analysis to identify lead roles and role communities in order to segment movies into scenes (containing one or more shots) and create more informative summaries. A surprising fact about this line of work is that it does not exploit the movie script in any way. Characters are typically identified using face recognition techniques and scene boundaries are presumed unknown and are automatically detected. A notable exception is Sang and Xu (2010), who generate video summaries for movies while taking into account character interaction features which they estimate from the corresponding screenplay.
Our own approach is inspired by work in egocentric video analysis. An egocentric video offers a first-person view of the world and is captured from a wearable camera focusing on the user's activities, social interactions, and interests. Lu and Grauman (2013) present a summarization model which extracts subshot sequences while finding a balance of important subshots that are both diverse and provide a natural progression through the video, in terms of prominent visual objects (e.g., bottle, mug, television). We adapt their technique to our task, and show how to estimate character-scene correlations based on linguistic analysis. We also interpret movies as social networks and extract a rich set of features from character interactions and their sentiment which we use to guide the summarization process.

ScriptBase: A Movie Script Corpus
We compiled ScriptBase, a collection of 1,276 movie scripts, by automatically crawling websites which host or link entire movie scripts (e.g., imsdb.com). The retrieved scripts were then cross-matched against Wikipedia and IMDB and paired with corresponding user-written summaries, plot sections, loglines and taglines (taglines are short snippets used by marketing departments to promote a movie). We also collected meta-information regarding the movie's genre, its actors, the production year, etc. ScriptBase contains movies comprising 23 genres; each movie is on average accompanied by 3 user summaries, 3 loglines, and 3 taglines. The corpus spans years 1909-2013. Some corpus statistics are shown in Figure 2. The scripts were further post-processed with the Stanford CoreNLP pipeline (Manning et al., 2014) to perform tagging, parsing, named entity recognition and coreference resolution. They were also annotated with semantic roles (e.g., ARG0, ARG1), using the MATE tools (Björkelund et al., 2009). [...] for training/development and 65 movies for testing.

The Scene Extraction Model
As mentioned earlier, we define script summarization as the task of selecting a chain of scenes representing the movie's most important content. We interpret the term scene in the screenplay sense. A scene is a unit of action that takes place in one location at one time (see Figure 1). We therefore need not be concerned with scene segmentation; scene boundaries are clearly marked, and constitute the basic units over which our model operates.
Let $M = (S, C)$ represent a screenplay consisting of a set $S = \{s_1, s_2, \ldots, s_n\}$ of scenes, and a set $C = \{c_1, \ldots, c_m\}$ of characters. We are interested in finding a list $S' = \{s_i, \ldots, s_k\}$ of ordered, consecutive scenes subject to a compression rate $m$ (see the example in Figure 3). A natural interpretation of $m$ in our case is the percentage of scenes from the original script retained in the summary. The extracted chain should contain (a) important scenes (i.e., critical for comprehending the story and its development); (b) diverse scenes that cover different aspects of the story; and (c) scenes which highlight the story's progression from beginning to end. We therefore find the chain $S'$ maximizing the objective function $Q(S')$, which is the weighted sum of three terms, the story progression $P$, scene diversity $D$, and scene importance $I$:

$$S^{*} = \arg\max_{S'} Q(S')$$

$$Q(S') = \lambda_P P(S') + \lambda_D D(S') + \lambda_I I(S') \qquad (2)$$

In the following, we define each of the three terms.
Scene-to-scene Progression The first term in the objective is responsible for selecting chains representing a logically coherent story. Intuitively, this means that if our chain includes a scene where a character commits an action, then scenes involving affected parties or follow-up actions should also be included.
We operationalize this idea of progression in a story in terms of how strongly the characters in a selected scene $s_i$ influence the transition to the next scene $s_{i+1}$:

$$P(S') = \sum_{i=1}^{|S'|-1} \sum_{c \in C_{s_i}} \mathrm{INF}(s_i, s_{i+1}, c) \qquad (3)$$

where $C_{s_i}$ is the set of characters present in scene $s_i$ and INF is the character influence score defined below. We represent screenplays as weighted, bipartite graphs connecting scenes and characters:

$$B = (V, E), \qquad V = C \cup S$$

The set of vertices $V$ corresponds to the union of characters $C$ and scenes $S$. We therefore add to the bipartite graph one node per scene and one node per character, and two directed edges for each scene-character and character-scene pair. An example of a bipartite graph is shown in Figure 4. We further assume that two scenes $s_i$ and $s_{i+1}$ are tightly connected in such a graph if a random walk with restart (RWR; Tong et al. 2006; Kim et al. 2014) which starts in $s_i$ has a high probability of ending in $s_{i+1}$. In order to calculate the random walk stationary distributions, we must estimate the weights between a character and a scene. We are interested in how important a character is generally in the movie, and specifically in a particular scene. For $w_{c,s}$, we consider the probability of a character being important, i.e., of them belonging to the set of main characters:

$$w_{c,s} = P(c \in \mathrm{main}(M)) \qquad (4)$$

where $P(c \in \mathrm{main}(M))$ is some probability score associated with $c$ being a main character in script $M$.
For $w_{s,c}$, we take the number of interactions a character is involved in relative to the total number of interactions in a specific scene as indicative of the character's importance in that scene. Interactions refer to conversational interactions as well as relations between characters (e.g., who does what to whom):

$$w_{s,c} = \frac{\text{number of interactions in } s \text{ involving } c}{\text{total number of interactions in } s} \qquad (5)$$

We defer discussion of how we model probability $P(c \in \mathrm{main}(M))$ and obtain interaction counts to Section 5. Weights $w_{s,c}$ and $w_{c,s}$ are normalized, and we calculate the stationary distributions of a random walk on a transition matrix $T$, enumerating over all vertices $v$ (i.e., characters and scenes) in the bipartite graph $B$.
We measure the influence individual characters have on scene-to-scene transitions as follows. The stationary distribution $r_k$ for a RWR walker starting at node $k$ is a vector that satisfies:

$$r_k = (1 - \epsilon)\, T\, r_k + \epsilon\, e_k$$

where $T$ is the transition matrix of the graph, $e_k$ is a seed vector, with all elements 0, except for element $k$ which is set to 1, and $\epsilon$ is a restart probability parameter. In practice, our vectors $r_k$ and $e_k$ are indexed by the scenes and characters in a movie, i.e., they have length $|S| + |C|$, and their $n$th element corresponds either to a known scene or character. In cases where graphs are relatively small, we can compute $r$ directly by solving:

$$r_k = \epsilon \left(\mathbf{1} - (1 - \epsilon)\, T\right)^{-1} e_k$$

where $\mathbf{1}$ denotes the identity matrix. The $l$th element of $r$ then equals the probability of the random walker being in state $l$ in the stationary distribution. Let $r^{c}_{k}$ be the same as $r_k$, but with the character node $c$ of the bipartite graph being turned into a sink, i.e., all entries for $c$ in the transition matrix $T$ are 0. We can then define how a single character influences the transition between scenes $s_i$ and $s_{i+1}$ as:

$$\mathrm{INF}(s_i, s_{i+1}, c) = r_{s_i}[s_{i+1}] - r^{c}_{s_i}[s_{i+1}]$$

where $r_{s_i}[s_{i+1}]$ is shorthand for that element in the vector $r_{s_i}$ that corresponds to scene $s_{i+1}$. We use the INF score directly in Equation (3) to determine the progress score of a candidate chain.
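The random walk with restart and the influence score can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the transition matrix is taken to be column-stochastic (entry `T[i][j]` is the probability of stepping from node `j` to node `i`), power iteration replaces the direct solve, and "turning a character into a sink" is read as zeroing both the row and the column of that character. The function names and the toy graph are hypothetical.

```python
def rwr(T, k, eps=0.5, iters=200):
    """Random walk with restart from node k, computed by power iteration.

    Assumption: T is column-stochastic, i.e. T[i][j] is the probability
    of moving from node j to node i.
    """
    n = len(T)
    r = [1.0 if i == k else 0.0 for i in range(n)]
    for _ in range(iters):
        r = [(1 - eps) * sum(T[i][j] * r[j] for j in range(n))
             + (eps if i == k else 0.0)
             for i in range(n)]
    return r


def make_sink(T, c):
    """Copy of T in which every entry in row c and column c is 0 (assumed reading)."""
    n = len(T)
    return [[0.0 if (i == c or j == c) else T[i][j] for j in range(n)]
            for i in range(n)]


def influence(T, s_i, s_next, c, eps=0.5):
    """Drop in the probability of reaching s_next from s_i once c is a sink."""
    return rwr(T, s_i, eps)[s_next] - rwr(make_sink(T, c), s_i, eps)[s_next]


# Toy graph (hypothetical): nodes 0, 1 are scenes; nodes 2, 3 are characters.
# Scene 0 contains character 2 only; scene 1 contains characters 2 and 3.
TOY_T = [
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
]
```

In the toy graph, character 2 is the only link between the two scenes, so `influence(TOY_T, 0, 1, 2)` is larger than `influence(TOY_T, 0, 1, 3)`.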
Diversity The diversity term $D(S')$ in our objective should encourage chains which consist of more dissimilar scenes, thereby avoiding redundancy. The diversity of chain $S'$ is the sum of the diversities of its successive scenes:

$$D(S') = \sum_{i=1}^{|S'|-1} d(s_i, s_{i+1})$$

The diversity $d(s_i, s_{i+1})$ of two scenes $s_i$ and $s_{i+1}$ is estimated taking into account two factors: (a) do they have any characters in common, and (b) does the sentiment change from one scene to the next. We write $d_{char}(s_i, s_{i+1})$ and $d_{sen}(s_i, s_{i+1})$ for the character and sentiment dissimilarity between scenes, respectively. Specifically, $d_{char}(s_i, s_{i+1})$ is based on the relative character overlap between scenes $s_i$ and $s_{i+1}$: $d_{char}$ will be 0 if two scenes share the same characters and 1 if no characters are shared. Analogously, $d_{sen}$ is based on the sentiment overlap between two scenes, where the sentiment $\mathrm{sen}(s)$ of scene $s$ is the aggregate sentiment score of all interactions in $s$. We explain how interactions and their sentiment are computed in Section 5. Again, $d_{sen}$ is larger if two scenes have a less similar sentiment. The term $\mathrm{dif}(s_i, s_{i+1})$ becomes 1 if the sentiments are identical, and increasingly smaller for more dissimilar sentiments. The sigmoid-like function in Equation (15) scales $d_{sen}$ within range [0, 1] to take smaller values for larger sentiment differences (factor $k$ adjusts the curve's smoothness).
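To make the diversity term concrete, the following Python sketch computes a character dissimilarity and a sentiment difference with the boundary behaviour described above. The exact functional forms are assumptions: Jaccard distance is one natural reading of "relative character overlap", and the reciprocal form of `dif` is merely one function that equals 1 for identical sentiments and shrinks as they diverge. The sigmoid-like scaling with factor k is not reproduced here.

```python
def d_char(chars_a, chars_b):
    """Character dissimilarity as Jaccard distance (assumed form).

    0.0 when both scenes have the same characters, 1.0 when none are shared.
    """
    a, b = set(chars_a), set(chars_b)
    if not a and not b:
        return 0.0  # assumption: two empty scenes count as identical
    return 1.0 - len(a & b) / len(a | b)


def dif(sen_a, sen_b):
    """Sentiment closeness (assumed form): 1.0 if equal, smaller otherwise."""
    return 1.0 / (1.0 + abs(sen_a - sen_b))


def chain_diversity(chain, d):
    """Sum of pairwise diversities d over successive scenes in the chain."""
    return sum(d(chain[i], chain[i + 1]) for i in range(len(chain) - 1))
```

For example, `chain_diversity([{"A"}, {"B"}, {"B"}], d_char)` adds one fully dissimilar transition and one identical transition.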
Importance The score $I(S')$ captures whether a chain contains important scenes. We define $I(S')$ as the sum of all scene-specific importance scores $\mathrm{imp}(s_i)$ of scenes contained in the chain:

$$I(S') = \sum_{i=1}^{|S'|} \mathrm{imp}(s_i)$$

The importance $\mathrm{imp}(s_i)$ of a scene $s_i$ is the proportion of lead characters among all characters within that scene:

$$\mathrm{imp}(s_i) = \frac{\sum_{c:\, c \in C_{s_i} \wedge\, c \in \mathrm{main}(M)} 1}{\sum_{c:\, c \in C_{s_i}} 1} \qquad (19)$$

where $C_{s_i}$ is the set of characters present in scene $s_i$, and $\mathrm{main}(M)$ is the set of main characters in the movie. $\mathrm{imp}(s_i)$ is 0 if a scene does not contain any main characters, and 1 if it contains only main characters (see Section 5 for how $\mathrm{main}(M)$ is inferred).
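The importance score translates almost directly into code. The sketch below is a minimal Python rendering; treating a scene without any characters as having importance 0 is an assumption, since that case is not discussed in the text.

```python
def imp(scene_chars, main_chars):
    """Share of a scene's characters that are main characters."""
    scene_chars = set(scene_chars)
    if not scene_chars:
        return 0.0  # assumption: a scene with no characters scores 0
    return len(scene_chars & set(main_chars)) / len(scene_chars)


def chain_importance(chain, main_chars):
    """Sum of scene importance scores over all scenes in the chain."""
    return sum(imp(scene, main_chars) for scene in chain)
```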

Optimal Chain Selection
We use Linear Programming to efficiently find a good chain. The objective is to maximize Equation (2), i.e., the sum of the terms for progress, diversity and importance, subject to their weights $\lambda$. We add a constraint corresponding to the compression rate, i.e., the number of scenes to be selected, and enforce their linear order by disallowing non-consecutive combinations. We use GLPK to solve the linear problem.
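The selection step can also be illustrated without an LP solver. The sketch below is not the linear programming formulation used here; it is a dynamic programming stand-in that relies on the assumption that the objective decomposes into a per-scene score (e.g., weighted importance) and a score for each pair of successive chain scenes (e.g., weighted progression plus diversity). The function name and the callback interface are hypothetical, and the sketch assumes 1 <= k <= n.

```python
def best_chain(n, k, node_score, pair_score):
    """Select k of n scenes, in script order, maximising the total score.

    node_score(i) scores scene i; pair_score(p, i) scores scene i directly
    following scene p in the chain (p < i).
    """
    neg_inf = float("-inf")
    # best[j][i]: best score of a chain of j + 1 scenes ending at scene i
    best = [[neg_inf] * n for _ in range(k)]
    back = [[None] * n for _ in range(k)]
    for i in range(n):
        best[0][i] = node_score(i)
    for j in range(1, k):
        for i in range(j, n):
            for p in range(j - 1, i):
                cand = best[j - 1][p] + pair_score(p, i) + node_score(i)
                if cand > best[j][i]:
                    best[j][i], back[j][i] = cand, p
    # Pick the best final scene, then follow back-pointers to the start.
    end = max(range(k - 1, n), key=lambda i: best[k - 1][i])
    chain, j = [end], k - 1
    while back[j][chain[-1]] is not None:
        chain.append(back[j][chain[-1]])
        j -= 1
    return chain[::-1], best[k - 1][end]
```

With a compression rate of 20%, `k` would be set to roughly 0.2 times the number of scenes in the script.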

Implementation
In this section we discuss several aspects of the implementation of the model presented in the previous section. We explain how interactions are extracted and how sentiment is calculated. We also present our method for identifying main characters and estimating the weights w s,c and w c,s in the bipartite graph.
Interactions The notion of interaction underlies many aspects of the model defined in the previous section. For instance, interaction counts are required to estimate the weights $w_{s,c}$ in the bipartite graph of the progression term (see Equation (5)), and in defining diversity (see Equations (15)-(17)). As we shall see below, interactions are also important for identifying main characters in a screenplay. We use the term interaction to refer to conversations between two characters, as well as their relations (e.g., if a character kills another). For conversational interactions, we simply need to identify the speaker generating an utterance and the listener. Speaker attribution comes for free in our case, as speakers are clearly marked in the text (see Figure 1). Listener identification is more involved, especially when there are multiple characters in a scene. We rely on a few simple heuristics. We assume that the previous speaker in the same scene, who is different from the current speaker, is the listener. If there is no previous speaker, we assume that the listener is the closest character mentioned in the speaker's utterance (e.g., via a coreferring proper name or a pronoun). In cases where we cannot find a suitable listener, we assume the current speaker is the listener.
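The listener heuristics can be sketched as a small Python function. This is a simplified illustration: character mentions are found by plain case-insensitive name matching rather than coreference resolution, "closest" is read as the earliest mention in the utterance, and the mention fallback is also applied when earlier turns exist but all belong to the current speaker. The function name and the toy dialogue are hypothetical.

```python
def assign_listeners(turns, scene_chars):
    """Pair each turn's speaker with a guessed listener.

    turns: list of (speaker, utterance) tuples in scene order.
    scene_chars: names of the characters present in the scene.
    """
    pairs = []
    for idx, (speaker, text) in enumerate(turns):
        listener = None
        # 1) Most recent earlier speaker who differs from the current one.
        for prev_speaker, _ in reversed(turns[:idx]):
            if prev_speaker != speaker:
                listener = prev_speaker
                break
        # 2) Otherwise, the character whose name appears first in the utterance.
        if listener is None:
            upper = text.upper()
            mentions = [(upper.find(c.upper()), c)
                        for c in scene_chars if c != speaker]
            mentions = [(pos, c) for pos, c in mentions if pos >= 0]
            if mentions:
                listener = min(mentions)[1]
        # 3) Last resort: the speaker is treated as their own listener.
        if listener is None:
            listener = speaker
        pairs.append((speaker, listener))
    return pairs
```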
We obtain character relations from the output of a semantic role labeler. Relations are denoted by verbs whose ARG0 and ARG1 roles are character names. We extract relations from the dialogue but also from scene descriptions. For example, in Figure 1 the description Suddenly, [...] he clubs her over the head contains the relation clubs(MAN,CATHERINE). Pronouns are resolved to their antecedent using the Stanford coreference resolution system (Lee et al., 2011).
Sentiment We labeled lexical items in screenplays with sentiment values using the AFINN-96 lexicon (Nielsen, 2011), which is essentially a list of words scored with sentiment strength within the range [−5, +5]. The list also contains obscene words (which are often used in movies) and some Internet slang. By summing over the sentiment scores of individual words, we can work out the sentiment of an interaction between two characters, the sentiment of a scene (see Equation (17)), and even the sentiment between characters (e.g., who likes or dislikes whom in the movie in general).
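Summing lexicon scores is straightforward to illustrate. In the Python sketch below, `TOY_LEXICON` is a made-up stand-in with invented scores, not entries taken from AFINN, and the tokenizer is a simple regular expression chosen for illustration.

```python
import re

# Hypothetical scores for illustration only; not taken from the AFINN list.
TOY_LEXICON = {"love": 3, "good": 3, "bad": -3, "kill": -3}


def sentiment(text, lexicon):
    """Sum the lexicon scores of all words in text (unknown words score 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def scene_sentiment(interactions, lexicon):
    """Aggregate sentiment over the utterances of all interactions in a scene."""
    return sum(sentiment(utterance, lexicon) for utterance in interactions)
```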

Main Characters
The progress term in our summarization objective crucially relies on characters and their importance (see the weight $w_{c,s}$ in Equation (4)). Previous work (Weng et al., 2009; Lin et al., 2013) extracts social networks where nodes correspond to roles in the movie, and edges to their co-occurrence. Leading roles (and their communities) are then identified by measuring their centrality in the network (i.e., number of edges terminating in a given node).
It is relatively straightforward to obtain a social network from a screenplay. Formally, for each movie we define a weighted and undirected graph where vertices correspond to movie characters, and edges denote character-to-character interactions. Figure 5 shows an example of a social network for "The Silence of the Lambs". Due to lack of space, only main characters are displayed; however, the actual graph contains all characters (42 in this case). Importantly, edge weights are not normalized, but directly reflect the strength of association between different characters. We do not solely rely on the social network to identify main characters.
We estimate P(c ∈ main(M)), the probability of c being a leading character in movie M, using a Multi Layer Perceptron (MLP) and several features pertaining to the structure of the social network and the script text itself. A potential stumbling block in treating character identification as a classification task is obtaining training data, i.e., a list of main characters for each movie. We generate a gold-standard by assuming that the characters listed under Wikipedia's Cast section (or an equivalent section, e.g., Characters) are the main characters in the movie.
Examples of the features we used for the classification task include the barycenter of a character (i.e., the sum of its distance to all other characters), PageRank (Page et al., 1999), an eigenvector-based centrality measure, absolute/relative interaction weight (the sum of all interactions a character is involved in, divided by the sum of all interactions in the network), absolute/relative number of sentences uttered by a character, number of times a character is described by other characters (e.g., He is a monster or She is nice), number of times a character talks about other characters, and type-token ratio of sentences uttered by the character (i.e., rate of unique words in a character's speech). Using these features, the MLP achieves an F1 of 79.0% on the test set. It outperforms other classification methods such as Naive Bayes or logistic regression. Using the full feature set, the MLP also obtains performance superior to any individual measure of graph connectivity.
Aside from Equation (4), lead characters also appear in Equation (19), which determines scene importance. We assume a character c ∈ main(M) if it is predicted by the MLP with a probability ≥ 0.5.
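Two of the simpler ingredients of this section, the relative interaction weight feature and the 0.5 threshold on the classifier output, can be sketched in Python as follows. The function names are hypothetical, interactions are represented as plain character pairs, and the probabilities passed to `main_characters` stand in for the MLP output, which is not modelled here.

```python
from collections import Counter


def relative_interaction_weight(interactions):
    """Share of all interactions in the network that involve each character.

    interactions: list of (char_a, char_b) pairs, one per interaction.
    """
    total = len(interactions)
    if total == 0:
        return {}
    involved = Counter()
    for a, b in interactions:
        for c in {a, b}:  # a set, so a self-interaction counts once
            involved[c] += 1
    return {c: count / total for c, count in involved.items()}


def main_characters(probs, threshold=0.5):
    """Characters whose predicted main-character probability meets the threshold."""
    return {c for c, p in probs.items() if p >= threshold}
```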

Experimental Setup
Gold Standard Chains The development and tuning of the chain extraction model presented in Section 4 necessitates access to a gold standard of key scene chains representing the movie's most important content. Our experiments concentrated on a sample of 95 movies (comedies and thrillers) from the ScriptBase corpus (Section 3). Performing the scene selection task for such a big corpus manually would be both time consuming and costly. Instead, we used distant supervision based on Wikipedia to automatically generate a gold standard.
Specifically, we assume that Wikipedia plots are representative of the most important content in a movie. Using the alignment algorithm presented in Nelken and Shieber (2006), we align script sentences to Wikipedia plot sentences and assume that scenes with at least one alignment are part of the gold chain of scenes. We obtain many-to-many alignments using features such as lemma overlap and word stem similarity. When evaluated on four movies (from the training set) whose content was manually aligned to Wikipedia plots, the aligner achieved a precision of .53 at a recall rate of .82 at deciding whether a scene should be aligned. Scenes are ranked according to the number of alignments they contain. When creating gold chains at different compression rates, we start with the best-ranked scenes and then successively add lower ranked ones until we reach the desired compression rate.
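The ranking step that turns alignment counts into a gold chain can be sketched in Python. The sketch assumes that the per-scene alignment counts are already available (the aligner itself is not modelled), that ties are broken in favour of earlier scenes, and that the target length is the rounded product of compression rate and script length; the function name is hypothetical.

```python
def gold_chain(alignment_counts, rate):
    """Pick the top-ranked scenes for a given compression rate.

    alignment_counts[i] is the number of plot sentences aligned to scene i.
    Returns the selected scene indices in script order.
    """
    n = len(alignment_counts)
    k = round(rate * n)  # assumed rounding of the target chain length
    # Stable sort: among equally ranked scenes, earlier ones come first.
    ranked = sorted(range(n), key=lambda i: -alignment_counts[i])
    return sorted(ranked[:k])
```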
System Comparison In our experiments we compared our scene extraction model (SceneSum) against three baselines. The first baseline was based on the minimum overlap (MinOv) of characters in consecutive scenes and corresponds closely to the diversity term in our objective. The second baseline was based on the maximum overlap (MaxOv) of characters and approximates the importance term in our objective. The third baseline selects scenes at random (averaged over 1,000 runs). Parameters for our models were tuned on the training set; weights for the terms in the objective were optimized to the following values: $\lambda_P = 1.0$, $\lambda_D = 0.3$, and $\lambda_I = 0.1$. We set the restart probability of our random walker to $\epsilon = 0.5$, and the sigmoid scaling factor in our diversity term to $k = -1.2$.
Evaluation We assessed the output of our model (and comparison systems) automatically against the gold chains described above. We performed experiments with compression rates in the range of 10% to 50% and measured performance in terms of F1. In addition, we also evaluated the quality of the extracted scenes as perceived by humans, which is necessary, given the approximate nature of our gold standard. We adopted a question-answering (Q&A) evaluation paradigm which has been used previously to evaluate summaries and document compression (Morris et al., 1992;Mani et al., 2002;Clarke and Lapata, 2010). Under the assumption that the summary is to function as a replacement for the full script, we can measure the extent to which it can be used to find answers to questions which have been derived from the entire script and are representative of its core content. The more questions a hypothetical system can answer, the better it is at summarizing the script as a whole.
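For the automatic part of the evaluation, F1 between an extracted chain and a gold chain can be computed over sets of scene indices. The Python sketch below assumes scenes are identified by their index in the script; the function name is hypothetical.

```python
def chain_f1(predicted, gold):
    """F1 between predicted and gold chains, treated as sets of scene indices."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```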
Two annotators were independently instructed to read scripts (from our test set) and create Q&A pairs. The annotators generated questions relating to the plot of the movie and the development of its characters, requiring an unambiguous answer. They compared and revised their Q&A pairs until a commonly agreed-upon set of five questions per movie was reached (see Table 1 for an example). In addition, for every movie we asked subjects to name the main characters, and summarize its plot (in no more than four sentences). Using Amazon Mechanical Turk (AMT), we elicited answers for eight scripts (four comedies and four thrillers) in four summarization conditions: using our model, the two baselines based on minimum and maximum character overlap, and the random system. All models were assessed at the same compression rate of 20%, which seems realistic in an actual application environment, e.g., computer-aided summarization. The scripts were preselected in an earlier AMT study where participants were asked to declare whether they had seen the movies in our test set (65 in total). We chose the screenplays which had received the fewest viewings so as to avoid eliciting answers based on familiarity with the movie. A total of 29 participants, all self-reported native English speakers, completed the Q&A task. The answers provided by the subjects were scored against an answer key. A correct answer was marked with a score of one, and zero otherwise. In cases where more answers were required per question, partial scores were awarded to each correct answer (e.g., 0.5). The score for a summary is the average of its question scores.

Results
Table 2 shows the performance of SceneSum, our scene extraction model, and the three comparison systems (MaxOv, MinOv, Random) on the automatic gold standard at five compression rates. As can be seen, MaxOv performs best in terms of F1, followed by SceneSum. We believe this is an artifact due to the way the gold standard was created. Scenes with large numbers of main characters are more likely to figure in Wikipedia plot summaries and will thus be more frequently aligned. A chain based on maximum character overlap will focus on such scenes and will agree with the gold standard better compared to chains which take additional script properties into account.
We further analyzed the scenes selected by SceneSum and the comparison systems with respect to their position in the script, considering the average percentage of scenes selected from the beginning, middle, and end of the movie (based on an equal division of the number of scenes in the screenplay). As can be seen, the number of selected scenes tends to be evenly distributed across the entire movie. SceneSum has a slight bias towards the beginning of the movie which is probably natural, since leading characters appear early on, as well as important scenes introducing essential story elements (e.g., setting, points of view).
The results of our human evaluation study are summarized in Table 4. We observe that SceneSum summaries are overall more informative compared to those created by the baselines. In other words, AMT participants are able to answer more questions regarding the story of the movie when reading SceneSum summaries. In two instances ("A Nightmare on Elm Street 3" and "Mumford"), the overlap models score better; in these cases, however, the movies largely consist of scenes with the same characters and relatively little variation ("A Nightmare on Elm Street 3"), or the camera follows the main lead in his interactions with other characters ("Mumford"). Since our model is not so character-centric, it might be thrown off by non-character-based terms in its objective, leading to the selection of unfavorable scenes. Table 4 also presents a breakdown of the different types of questions answered by our participants. Again, we see that in most cases a larger percentage is answered correctly when reading SceneSum summaries.
Overall, we observe that SceneSum extracts chains which encapsulate important movie content across the board. We should point out that although our movies are broadly classified as comedies and thrillers, they have very different structure and content. For example, "Little Athens" has a very loose plotline, "Living in Oblivion" has multiple dream sequences, whereas "While She was Out" contains only a few characters and a series of important scenes towards the end. Despite this variety, SceneSum performs consistently better in our task-based evaluation.

Conclusions
In this paper we have developed a graph-based model for script summarization. We formalized the process of generating a shorter version of a screenplay as the task of finding an optimal chain of scenes, which are diverse, important, and exhibit logical progression. A large-scale evaluation based on a question-answering task revealed that our method produces more informative summaries compared to several baselines. In the future, we plan to explore model performance in a wider range of movie genres as well as its applicability to other NLP tasks (e.g., book summarization or event extraction). We would also like to automatically determine the compression rate which should presumably vary according to the movie's length and content. Finally, our long-term goal is to be able to generate loglines as well as movie plot summaries.