Movie Plot Analysis via Turning Point Identification

According to screenwriting theory, turning points (e.g., change of plans, major setback, climax) are crucial narrative moments within a screenplay: they define the plot structure, determine its progression and segment the screenplay into thematic units (e.g., setup, complications, aftermath). We propose the task of turning point identification in movies as a means of analyzing their narrative structure. We argue that turning points and the segmentation they provide can facilitate processing long, complex narratives, such as screenplays, for summarization and question answering. We introduce a dataset consisting of screenplays and plot synopses annotated with turning points and present an end-to-end neural network model that identifies turning points in plot synopses and projects them onto scenes in screenplays. Our model outperforms strong baselines based on state-of-the-art sentence representations and the expected position of turning points.


Introduction
Computational literary analysis works at the intersection of natural language processing and literary studies, aiming to evaluate various theories of storytelling (e.g., by examining a collection of works within a single genre, by an author, or topic) and to develop tools which aid in searching, visualizing, or summarizing literary content.
Within natural language processing, computational literary analysis has mostly targeted works of fiction such as novels, plays, and screenplays. Examples include analyzing characters, their relationships, and emotional trajectories (Chaturvedi et al., 2017; Iyyer et al., 2016; Elsner, 2012), identifying enemies and allies (Nalisnick and Baird, 2013), villains or heroes (Bamman et al., 2013, 2014), measuring the memorability of quotes (Danescu-Niculescu-Mizil et al., 2012), characterizing gender representation in dialogue (Agarwal et al., 2015; Ramakrishna et al., 2015; Sap et al., 2017), and identifying perpetrators in crime series.

Table 1: The five turning points (TPs) and their definitions.

  1. Opportunity:        Introductory event that occurs after the presentation of the setting and the background of the main characters.
  2. Change of Plans:    Event where the main goal of the story is defined. From this point on, the action begins to increase.
  3. Point of No Return: Event that pushes the main character(s) to fully commit to their goal.
  4. Major Setback:      Event where everything falls apart (temporarily or permanently).
  5. Climax:             Final event of the main story; its moment of resolution.
In this paper we are interested in the automatic analysis of narrative structure in screenplays. Narrative structure, also referred to as a storyline or plotline, describes the framework of how one tells a story and has its origins in Aristotle, who defined the basic triangle-shaped plot structure representing the beginning (protasis), middle (epitasis), and end (catastrophe) of a story (Pavis, 1998). The German novelist and playwright Gustav Freytag modified Aristotle's structure by transforming the triangle into a pyramid (Freytag, 1896). In his scheme, there are five acts (introduction, rising movement, climax, return, and catastrophe). Several variations of Freytag's pyramid are used today in film analysis and screenwriting (Cutting, 2016).
In this work, we adopt a variant commonly employed by screenwriters as a practical guide for producing successful screenplays (Hague, 2017). According to this scheme, there are six stages (acts) in a film, namely the setup, the new situation, progress, complications and higher stakes, the final push, and the aftermath, separated by five turning points (TPs). TPs are narrative moments from which the plot goes in a different direction (Thompson, 1999), and by definition they occur at the junctions of acts. Aside from changing narrative direction, TPs define the movie's structure, tighten the pace, and prevent the narrative from drifting. The five TPs and their definitions are given in Table 1.

Figure 1: Synopsis of the movie "Panic Room", annotated with the five turning points (TP1-TP5):

Recently divorced Meg Altman and her 11-year-old daughter Sarah have just purchased a four-story brownstone in New York City. The house's previous owner installed an isolated panic room used to protect the house's occupants from intruders. On the night the two move into the home, it is broken into by Junior, the previous owner's grandson; Burnham, an employee of the residence's security company; and Raoul, a ski-mask-wearing gunman. The three are after $3 million in bearer bonds, which are locked inside a floor safe in the panic room... As they begin the robbery, Meg wakes up and happens to see the intruders on the video monitors in the panic room. Before the three can reach them, Meg and Sarah run into the panic room and close the door behind them, only to find that the burglars have disabled the telephone. Intending to force them out of the room, Burnham introduces propane gas into the room's air vents... Meg then taps into the main telephone line and gets through to her ex-husband Stephen, before the burglars cut them off... Stephen arrives at the home and is taken hostage by Burnham and Raoul, who severely beats him. To make matters worse, Sarah, who has diabetes, suffers a seizure. Her glucagon syringe is in a refrigerator outside the panic room. After using an unconscious Stephen to trick Meg into momentarily leaving the panic room, Burnham enters it, finding Sarah motionless on the floor... After Burnham gives Sarah the injection, Sarah thanks him. Having earlier received a call from Stephen, two policemen arrive, which prompts Raoul to threaten Sarah's life. Sensing the potential danger to her daughter, Meg lies to the officers and they leave. Meanwhile, Burnham opens the safe and removes the $22 million in bearer bonds inside. As the robbers attempt to leave, using Sarah as a hostage, Meg hits Raoul with a sledgehammer and Burnham flees. After a badly injured Stephen shoots at Raoul and misses, Raoul disables him and prepares to kill Meg with the sledgehammer, but Burnham, upon hearing Sarah's screams of pain, returns to the house and shoots Raoul dead, stating, "You'll be okay now", to Meg and her daughter before leaving. The police, alerted by Meg's suspicious behavior earlier, arrive in force and capture Burnham. Later, Meg and Sarah, having recovered from their harrowing experience, begin searching the newspaper for a new home.
We propose the task of turning point identification in movies as a means of analyzing their narrative structure. TP identification provides a sequence of key events in the story and segments the screenplay into thematic units. Common approaches to summarization and QA over long or multiple documents (Chen et al., 2017; Kratzwald and Feuerriegel, 2018; Elgohary et al., 2018) include a retrieval system as a first step, which selects a subset of relevant passages for further processing. However, Kočiský et al. (2018) demonstrate that these approaches do not perform equally well on extended narratives, since individual passages are very similar and the same entities are referred to throughout the story. We argue that this challenge can be addressed by TP identification, which finds the most important events and segments the narrative into thematic units. Downstream processing for summarization or question answering can then focus on those segments that are relevant to the task.
Problematically for modeling purposes, TPs are latent in screenplays: there are no scriptwriting conventions (like character cues or scene headings) to denote where TPs occur, and their exact manifestation varies across movies (depending on genre and length), although there are some rules of thumb indicating where to expect a TP (e.g., the Opportunity occurs after the first 10% of a screenplay, the Change of Plans approximately 25% in). To enable automatic TP identification, we develop a new dataset which consists of screenplays, plot synopses, and turning point annotations. To save annotation time and render the labeling task feasible, we collect TP annotations at the plot synopsis level (synopses are a few paragraphs long, compared to screenplays, which are on average 120 pages long). An example is given in Figure 1. We then project the TP annotations via distant supervision onto screenplays and propose an end-to-end neural network model which identifies TPs in full-length screenplays.
Our contributions can be summarized as follows: (a) we introduce TP identification as a new task for the computational analysis of screenplays that can benefit applications such as QA and summarization; (b) we create and make publicly available the TuRnIng POint Dataset (TRIPOD)1, which contains 99 movies (3,329 synopsis sentences and 13,403 screenplay scenes) annotated with TPs; and (c) we present an end-to-end neural network model that identifies turning points in plot synopses and projects them onto scenes in screenplays, outperforming strong baselines based on state-of-the-art sentence representations and the expected position of TPs.

Related Work
Recent years have seen increased interest in the automatic analysis of long and complex narratives. Specifically, Machine Reading Comprehension (MRC) and Question Answering (QA) tasks are transitioning from single short and clean articles or queries (Rajpurkar et al., 2016; Nguyen et al., 2016; Trischler et al., 2016) to large-scale datasets that consist of complex stories (Tapaswi et al., 2016; Frermann et al., 2018; Kočiský et al., 2018; Joshi et al., 2017) or require reasoning across multiple documents (Welbl et al., 2018; Wang et al., 2018; Dua et al., 2019). Tapaswi et al. (2016) introduce a multi-modal dataset consisting of questions over 140 movies, while Frermann et al. (2018) attempt to answer a single question, namely who the perpetrator is in 39 episodes of the well-known crime series CSI, again based on multi-modal information. Finally, Kočiský et al. (2018) recently introduced a dataset consisting of question-answer pairs over 1,572 movie screenplays and books.
Previous approaches have focused on fine-grained story analysis, such as inducing character types (Bamman et al., 2013, 2014) or understanding relationships between characters (Iyyer et al., 2016; Chaturvedi et al., 2017). Various approaches have also attempted to analyze the goal and structure of narratives: Black and Wilensky (1979) evaluate the functionality of story grammars in story understanding, Elson and McKeown (2009) develop a platform for representing and reasoning over narratives, and Chambers and Jurafsky (2009) learn fine-grained chains of events.
In the context of movie summarization, Gorinski and Lapata (2018) automatically generate an overview of the movie's genre, mood, and artistic style based on screenplay analysis. Gorinski and Lapata (2015) summarize full length screenplays by extracting an optimal chain of scenes via a graph-based approach centered around the characters of the movie. A similar approach has also been adopted by Vicol et al. (2018), who introduce the MovieGraphs dataset consisting of 51 movies and describe video clips with character-centered graphs. Other work creates animated story-boards using the action descriptions of screenplays (Ye and Baldwin, 2008), extracts social networks from screenplays (Agarwal et al., 2014a), or creates xkcd movie narrative charts (Agarwal et al., 2014b).
Our work also aims to analyze the narrative structure of movies, but we adopt a high-level approach. We advocate TP identification as a precursor to more fine-grained analysis that unveils character attributes and their relationships. Our approach identifies key narrative events and segments the screenplay accordingly; we argue that this type of preprocessing is useful for applications which might perform question answering and summarization over screenplays. Although our experiments focus solely on the textual modality, turning point analysis is also relevant for multimodal tasks such as trailer generation and video summarization.

The TRIPOD Dataset
The TRIPOD dataset contains 99 screenplays, accompanied by cast information (according to IMDb) and Wikipedia plot synopses annotated with turning points. The movies were selected from the Scriptbase dataset (Gorinski and Lapata, 2015) based on the following criteria: (a) maintaining variation across movie genres (e.g., action, romance, comedy, drama) and narrative types (e.g., flashbacks, time shifts); and (b) including screenplays that are as faithful as possible to the released movies and their synopses. In Table 2, we present various statistics of the dataset.

Our motivation for obtaining TP annotations at the synopsis level (coarse-grained), instead of at the screenplay level (fine-grained), was twofold. Firstly, on account of being relatively short, synopses are easier to annotate than full-length screenplays, allowing us to scale the dataset in the future. Secondly, we would expect synopsis-level annotations to be more reliable and the degree of inter-annotator agreement higher; asking annotators to identify precisely where a turning point occurs in a screenplay might seem like looking for a needle in a haystack. An example of a synopsis with TP annotations is shown in Figure 1 for the movie "Panic Room". Each TP is colored differently, and both the chain of key events (colored text) and the resulting segmentation (§) are illustrated.

In an initial pilot study, the three authors acted as annotators for identifying TPs in movie synopses. They selected exactly one sentence per TP, under the assumption that all TPs are present. Based on the pilot, annotation instructions were devised and an annotation tool was created which allows annotators to label synopses with TPs sentence-by-sentence. After piloting the annotation scheme on 30 movies, two new annotators were trained using our instructions and, in a second study, doubly annotated five movies. The remaining movies in the dataset were then singly annotated by the new annotators.
We computed inter-annotator agreement using two different metrics: (a) total agreement (TA), i.e., the percentage of TPs for which two annotators select the exact same sentence; and (b) annotation distance, i.e., the distance d[p_i, tp_i] between two annotations for a given TP, normalized by synopsis length:

    d[p_i, tp_i] = |p_i - tp_i| / N,    (1)

where N is the number of synopsis sentences and tp_i and p_i are the indices of the sentences labeled with TP i by the two annotators. The mean annotation distance D is then computed by averaging the distances d[p_i, tp_i] across all annotated TPs. The TA between the two annotators in our second study was 64.00% and the mean annotation distance was 4.30% (StDev 3.43%). The annotation distance per TP is presented in Table 5 (last line), where it is compared with the automatic TP identification results (to be explained later).
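The synopsis-level agreement metrics can be sketched as follows (a minimal illustration; the function name is ours, not from the paper):

```python
def synopsis_agreement(ann_a, ann_b, n_sentences):
    """ann_a, ann_b: lists of sentence indices (one per TP) chosen by two
    annotators. Returns total agreement (TA) and mean annotation
    distance (D), both in percent."""
    exact = sum(p == tp for p, tp in zip(ann_a, ann_b))
    ta = 100.0 * exact / len(ann_a)                       # exact-match TPs
    dists = [abs(p - tp) / n_sentences for p, tp in zip(ann_a, ann_b)]
    d = 100.0 * sum(dists) / len(dists)                   # Equation (1), averaged
    return ta, d

ta, d = synopsis_agreement([2, 8, 15, 21, 27], [2, 9, 15, 22, 27], 30)
# ta = 60.0: three of the five TPs land on the exact same sentence
```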
We also asked our annotators to annotate the screenplays (rather than synopses) for a subset of 15 movies. This subset serves as our goldstandard test set. Annotators were given synopses annotated with TPs and were instructed to indicate for each TP which scenes in the screenplay correspond to it. Six of the 15 movies were doubly annotated, so that we could measure agreement. Since annotators were allowed to choose a variable number of scenes for each TP, this slightly changes our agreement metrics.
Total Agreement (TA) is now the percentage of TP scenes the annotators agree on:

    TA = 100 * (sum_i |S_i ∩ G_i|) / (sum_i |S_i ∪ G_i|),    (2)

where the sums range over the TPs T and L identified by each annotator in a screenplay, and S_i and G_i are the sets of scene indices selected for TP i by the two annotators. Partial Agreement (PA) is the percentage of TPs for which the annotators overlap in at least one scene:

    PA = 100 * |{i : S_i ∩ G_i ≠ ∅}| / |T ∪ L|.    (3)

And the annotation distance D becomes the mean of the distances d[S_i, G_i] between the two annotators' scene sets, normalized by M, the length of the screenplay in scenes:

    D = (100 / |T ∪ L|) * sum_i d[S_i, G_i] / M,    (4)

where d[S_i, G_i] is the minimum index distance between any pair of scenes in S_i and G_i. The TA and PA between the two annotators were 35.48% and 56.67%, respectively. The mean annotation distance was 1.48% (StDev 2.93%). The low TA shows that the annotators rarely select exactly the same scenes, even when asked to annotate an event in the screenplay that is described by a specific synopsis sentence. However, they do identify scenes in close proximity in the screenplay, as PA and the annotation distance reveal. This analysis validates our assumption that annotating the synopses first limits the degree of overall disagreement.
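One plausible implementation of the scene-level metrics, treating d[S_i, G_i] as the minimum index distance between the two annotators' scene sets (the function name and dict-based interface are ours):

```python
def screenplay_agreement(ann_a, ann_b, n_scenes):
    """ann_a, ann_b: dicts mapping a TP index to the set of scene indices
    each annotator selected. Returns TA, PA, and D, all in percent."""
    tps = sorted(set(ann_a) | set(ann_b))
    inter = sum(len(ann_a.get(i, set()) & ann_b.get(i, set())) for i in tps)
    union = sum(len(ann_a.get(i, set()) | ann_b.get(i, set())) for i in tps)
    ta = 100.0 * inter / union                    # shared scenes over all selected scenes
    pa = 100.0 * sum(bool(ann_a.get(i, set()) & ann_b.get(i, set()))
                     for i in tps) / len(tps)     # TPs with at least one shared scene
    # distance between two scene sets: closest pair of selected scenes
    dists = [min(abs(s - g) for s in ann_a[i] for g in ann_b[i]) / n_scenes
             for i in tps if i in ann_a and i in ann_b]
    d = 100.0 * sum(dists) / len(dists)
    return ta, pa, d

ta, pa, d = screenplay_agreement({1: {10, 11}, 2: {40}},
                                 {1: {11, 12}, 2: {44}}, n_scenes=100)
# ta = 20.0, pa = 50.0, d = 2.0
```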

Turning Point Prediction Models
In this work, we aim to detect text segments which act as TPs. We first identify which sentences in plot synopses are TPs (Section 4.1); next, we identify which scenes in screenplays act as TPs via projection of goldstandard TP labels (Section 4.2); finally, we build an end-to-end system which identifies TPs in screenplays based on predicted TP synopsis labels (Section 4.3). All models we propose in this paper have the same basic structure; they take text segments i (sentences or scenes) as input and predict whether these act as TPs or not. Since the sequence, number, and labels of TPs are fixed (see Table 1), we treat TP identification as a binary classification problem (where 1 indicates that the text is a TP and 0 otherwise). Each segment is encoded into a multi-dimensional feature space x i which serves as input to a fully-connected layer with a single neuron representing the probability that i acts as a TP. In the following, we describe several models which vary in the way input segments are encoded.
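The shared classification head described above can be sketched as a single sigmoid neuron over the segment encoding (a minimal numpy illustration with random toy inputs; the encoders that would produce x_i are described in the following sections):

```python
import numpy as np

def tp_probability(x, w, b):
    """Single-neuron classification head: the probability that a segment
    with encoding x acts as a turning point."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# a toy 512-dimensional segment encoding and weight vector
rng = np.random.default_rng(0)
p = tp_probability(rng.standard_normal(512), rng.standard_normal(512) * 0.01, 0.0)
assert 0.0 < p < 1.0   # a valid probability
```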

Identifying Turning Points in Synopses
Context-Aware Model (CAM) A simple baseline model would compute the semantic representation of each sentence in the synopsis using a pre-trained sentence encoder. However, classifying segments in isolation, without considering the context in which they appear, might yield inferior semantic representations. We therefore obtain richer representations for sentences by modeling their surrounding context. We encode the synopsis with a Bidirectional Long Short-Term Memory (BiLSTM; Hochreiter and Schmidhuber 1997) network and obtain the contextualized representation cp_i for sentence x_i by concatenating the hidden states h_i^f and h_i^b of the forward and backward LSTMs, respectively: cp_i = [h_i^f; h_i^b] (for a more detailed description, see the Appendix). Representation cp_i is the input feature vector for our binary classifier. The model is illustrated in Figure 2a.
Topic-Aware Model (TAM) TPs by definition act as boundaries between different thematic units in a movie. Furthermore, long documents are usually comprised of topically coherent text segments, each of which contains a number of text passages such as sentences or paragraphs (Salton et al., 1996). Inspired by text segmentation approaches (Hearst, 1997) which measure the semantic similarity between sequential context windows in order to determine topic boundaries, we enhance our representations with a context interaction layer. The objective of this layer is to measure the similarity of the current sentence with its preceding and following context, thereby encoding whether it functions as a boundary between thematic sections. The enriched model with the context interaction layer is illustrated in Figure 2a.
After calculating the contextualized sentence representations cp_i, we compute the representations of the left (lc_i) and right (rc_i) contexts of sentence i (see Figure 2a, right-hand side). We select windows of fixed length l and calculate lc_i and rc_i by averaging the sentence representations within each window. Next, we compute the semantic similarity of the current sentence with each context representation. Specifically, we consider the element-wise product b_i, cosine similarity c_i, and pairwise distance u_i as similarity metrics (shown here for the left context):

    b_i = cp_i ⊙ lc_i,    (5)
    c_i = cos(cp_i, lc_i),    (6)
    u_i = ||cp_i - lc_i||.    (7)

The interaction representation fl_i of sentence cp_i with its left context is the concatenation of cp_i, lc_i, and the above similarity values: fl_i = [cp_i; lc_i; b_i; c_i; u_i]. The interaction representation fr_i for the right context rc_i is computed analogously. We obtain the final representation of sentence i by concatenating fl_i, fr_i, and cp_i: y_i = [fl_i; fr_i; cp_i].
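The context interaction layer can be sketched with numpy (a simplified version operating on precomputed sentence vectors; the function name is ours, and edge windows are zero-padded as one possible choice):

```python
import numpy as np

def context_interaction(sents, i, l=3):
    """Interaction representation y_i = [fl_i; fr_i; cp_i] for sentence i,
    comparing its encoding with averaged left/right windows of length l."""
    cp = sents[i]
    parts = []
    for window in (sents[max(0, i - l):i], sents[i + 1:i + 1 + l]):
        ctx = window.mean(axis=0) if len(window) else np.zeros_like(cp)
        b = cp * ctx                                              # element-wise product
        c = cp @ ctx / (np.linalg.norm(cp) * np.linalg.norm(ctx) + 1e-8)
        u = np.linalg.norm(cp - ctx)                              # pairwise distance
        parts.append(np.concatenate([cp, ctx, b, [c, u]]))        # fl_i, then fr_i
    return np.concatenate(parts + [cp])                           # y_i

sents = np.random.default_rng(1).standard_normal((10, 8))         # 10 sentence vectors
y = context_interaction(sents, 4)
# dimension: 2 * (3 * 8 + 2) + 8 = 60
```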

TP-Specific Information
Another variation of our model uses TP-specific encoders instead of a single one (see Figure 2b). In this case, we employ five different encoders, each calculating a different representation of the current synopsis sentence x_i with respect to a specific TP. These representations can be considered multiple views of the same sentence. We calculate the interaction of each view with the left and right context windows, as previously, via the context interaction layer. Finally, we compute the sentence representation y_i by concatenating its individual context-enriched TP representations.

Entity-Specific Information
We also enrich our model with information about entities. We first apply coreference resolution to the plot synopses using the Stanford CoreNLP toolkit (Manning et al., 2014) and substitute mentions of named entities whenever these are included in the IMDb cast list. We then obtain entity-specific sentence representations as follows. Our encoder uses a word embedding layer initialized with pre-trained entity embeddings and a BiLSTM for contextualizing word representations. We add an attention mechanism on top of the LSTM, which assigns a weight to each word representation. We compute the entity-specific representation e_i for synopsis sentence i as the weighted sum of its word representations (for more details, see the Appendix). Finally, entity-enriched sentence representations are obtained by concatenating the generic vectors x_i with the entity-specific ones e_i: [x_i; e_i].
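The attention-weighted sum producing e_i can be sketched as follows (a minimal numpy version scoring word states against a single learned vector v; this is one common attention formulation, assumed here rather than taken from the paper):

```python
import numpy as np

def attention_pool(word_states, v):
    """Entity-specific sentence vector e_i: softmax-weighted sum of the
    contextualized word representations, scored against a vector v."""
    scores = word_states @ v                 # one scalar score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return weights @ word_states             # weighted sum of word vectors

H = np.random.default_rng(2).standard_normal((12, 16))   # 12 words, 16-dim states
e = attention_pool(H, np.random.default_rng(3).standard_normal(16))
assert e.shape == (16,)
```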

Identifying Turning Points in Screenplays
Identifying TPs in synopses serves as a testbed for validating some of the assumptions put forward in this work, namely that turning points mark narrative progression and can be identified automatically based on their lexical makeup. Nevertheless, we are mainly interested in the real-world scenario where TPs are detected in longer documents such as screenplays. Screenplays are naturally segmented into scenes, each of which often describes a self-contained event that takes place in one location and revolves around a few characters. We therefore assume that scenes are suitable textual units for signaling TPs in screenplays. Unfortunately, we do not have any goldstandard information about TPs in screenplays. We provide distant supervision by constructing noisy labels based on goldstandard TP annotations in synopses (see the description below). Given sentences labeled as TPs in a synopsis, we identify scenes in the corresponding screenplay which are semantically similar to them. We formulate this task as a binary classification problem, where a sentence-scene pair is deemed either "relevant" or "irrelevant" for a given TP.

Distant Supervision
Based on the screenwriting scheme of Hague (2017), TPs are expected to occur in specific parts of a screenplay (e.g., the Climax is likely to occur towards the end). We exploit this knowledge as a form of distant supervision. We estimate the mean position of each TP using the goldstandard annotation of the plot synopses in our training set (normalized by synopsis length). The results are shown in Table 3, together with the TP positions postulated by screenwriting theory. We observe that our estimates agree well with the theoretical predictions, but also that some TPs (e.g., TP2 and TP3) are more variable in their position than others (e.g., TP1 and TP5). This leads us to the following hypothesis: each TP is situated within a specific window of a screenplay. Scenes that lie within the window are semantically related to the TP, whereas all other scenes are unrelated. In our experiments, we calculate a window µ ± σ for each TP based on our data (see Table 3).

We compute scene representations based on the sequence of sentences that comprise a scene using a BiLSTM equipped with an attention mechanism (see Section 4.1). The final scene representation s is the weighted sum of the representations of the scene's sentences. Next, the TP-scene interaction layer enriches scene representations with similarity values with respect to each marked TP synopsis sentence tp, as shown in Equations (5)-(7).
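The distant-supervision step can be sketched as follows: estimate each TP's expected position µ and spread σ from the training annotations, then mark scenes inside the window µ ± σ as noisy positives (a simplified illustration with made-up positions; function names are ours):

```python
import statistics

def tp_windows(train_positions):
    """train_positions: list of 5-tuples of normalized TP positions
    (sentence index / synopsis length), one tuple per training movie.
    Returns one (mu - sigma, mu + sigma) window per TP."""
    windows = []
    for tp_positions in zip(*train_positions):
        mu = statistics.mean(tp_positions)
        sigma = statistics.stdev(tp_positions)
        windows.append((mu - sigma, mu + sigma))
    return windows

def noisy_scene_labels(n_scenes, windows):
    """Mark a scene as a noisy positive for a TP if its normalized
    position falls inside that TP's window."""
    return [[1 if lo <= s / n_scenes <= hi else 0 for s in range(n_scenes)]
            for lo, hi in windows]

wins = tp_windows([(0.10, 0.24, 0.48, 0.72, 0.90),
                   (0.12, 0.30, 0.52, 0.78, 0.94),
                   (0.08, 0.26, 0.44, 0.70, 0.88)])
labels = noisy_scene_labels(100, wins)   # 5 x 100 noisy label matrix
```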
We again augment the above-described base model with contextualized sentence and scene representations using a synopsis encoder and a screenplay encoder. The synopsis encoder is the same one used for our sentence-level TP prediction task (see Section 4.1). The screenplay encoder works in a similar fashion over scene representations.
Topic-Aware Model (TAM) TAM enhances our screenplay encoder with information about topic boundaries. Specifically, we compute the representations of the left (lc_i) and right (rc_i) context windows of the i-th scene in the screenplay, as described in Section 4.1. Next, we compute the final representation z_i of scene sc_i by concatenating the representations of the context windows lc_i and rc_i and the current scene sc_i: z_i = [lc_i; sc_i; rc_i]. There is no need to compute the similarity between scenes and context windows here, since we now have goldstandard TP representations in the synopsis and employ the TP-scene interaction layer to compute the similarity between TPs and the enriched scene representations z_i. Hence, this layer directly calculates a scene-level feature vector that encodes information about the scene, its similarity to TP sentences, and whether these function as boundaries between topics in the screenplay.

Entity-Specific information
We can also employ an entity-specific encoder (see Section 4.1) for representing the synopsis and scene sentences. Again, generic and entity-specific representations are combined via concatenation.

End-to-end TP Identification
Our ultimate goal is to identify TPs in screenplays without assuming any goldstandard information about their position in the synopsis. We address this with an end-to-end model which first predicts the sentences that act as TPs in the synopsis (e.g., TAM in Section 4.1) and then feeds these predictions to a model which identifies the corresponding TP scenes (e.g., TAM in Section 4.2).

Experimental Setup
Training We used the Universal Sentence Encoder (USE; Cer et al. 2018) as a pre-trained sentence encoder for all models and tasks; its performance was superior to BERT (Devlin et al., 2018) and other related pre-trained encoders (for more details, see the Appendix). Since the binary labels in both prediction tasks are imbalanced, we apply class weights to the loss function of our models. We weight each class by its inverse frequency in the training set (for more implementation details, see the Appendix).
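Inverse-frequency class weighting can be sketched as follows (a minimal illustration of the weighting scheme, assuming a standard weighted binary loss; the function name is ours):

```python
def inverse_frequency_weights(labels):
    """Weight each class by its inverse frequency in the training set."""
    n = len(labels)
    pos = sum(labels)
    return {0: n / (n - pos), 1: n / pos}

# e.g., 5 TP sentences in a 50-sentence training set
w = inverse_frequency_weights([1] * 5 + [0] * 45)
# errors on the rare positive (TP) class are weighted ~9x more heavily
assert w[1] == 10.0
```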
Inference During inference in our first task (i.e., identification of TPs in synopses), we select one sentence per TP. Specifically, we track the five sentences with the highest posterior probability of being TPs and sequentially assign them TP labels based on their position. However, it is possible to have a cluster of neighboring sentences with high probability, even though they all belong to the same TP. We therefore constrain the sentence selection for each TP to the window of its expected position, as calculated for the distribution baseline (Section 4.2).
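A sketch of this constrained inference step, selecting the highest-probability sentence for each TP inside its expected-position window (the function name and toy probabilities are ours):

```python
def assign_tps(probs, windows):
    """probs: per-sentence posterior probabilities of being a TP.
    windows: per-TP (lo, hi) normalized position ranges.
    Returns one sentence index per TP, chosen within its window."""
    n = len(probs)
    picks = []
    for lo, hi in windows:
        candidates = [i for i in range(n) if lo <= i / n <= hi]
        picks.append(max(candidates, key=lambda i: probs[i]))
    return picks

probs = [0.1, 0.9, 0.2, 0.1, 0.8, 0.7, 0.1, 0.3, 0.2, 0.95]
windows = [(0.0, 0.2), (0.2, 0.5), (0.5, 0.7), (0.6, 0.8), (0.8, 1.0)]
tps = assign_tps(probs, windows)   # one sentence index per TP
# → [1, 4, 5, 7, 9]
```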
For models which predict TPs in screenplays, we obtain a probability distribution over all scenes in a screenplay indicating how relevant each is to the TPs of the corresponding plot synopsis. We find the peak of each distribution and select a neighborhood of scenes around this peak as TP-relevant. Based on the goldstandard annotation, each TP corresponds to 1.77 relevant scenes on average (StDev 1.23). We therefore consider a neighborhood of three relevant scenes per TP.

TP Identification in Synopses

Table 4a reports our results on the development set (we extracted 20 movies from the original training set), which compare various model instantiations for the TP identification task. Specifically, we report the performance of a baseline model which is neither context-aware nor utilizes topic boundary information against CAM and TAM. We also show two variants of TAM enhanced with TP-specific encoders (+ TP views) and entity-specific information (+ entities). Model performance is measured using the evaluation metrics of Total Agreement (TA) and annotation distance (D), normalized by synopsis length (Equation (1)).

The baseline model has the lowest performance among all variants, which suggests that state-of-the-art sentence representations on their own are not suitable for our task. Indeed, when contextualizing the synopsis sentences via a BiLSTM layer, we observe an absolute increase of 4.00% in TA. Moreover, the addition of a context interaction layer (see the TAM row in Table 4a) yields a further absolute TA improvement of 4.00% compared to CAM. Combining different TP views improves TA by another 3.00%, reaching 39.00%, and reduces D to 6.52%.

Table 4b shows our results on the test set. We compare TAM, our best-performing model, against two strong baselines. The first selects sentences that lie at the expected positions of TPs according to screenwriting theory, while the second selects sentences that lie at the peaks of the empirical TP distributions in the training set (Section 4.2). As we can see, TAM (+ TP views) achieves a TA of 38.57% compared to 22.00% for the distribution baseline. And although entity-specific information does not have much impact on the development set, it yields a 2.76% improvement on the test set. A detailed breakdown of results per TP is given in Table 5. Interestingly, our model resembles human behavior (see row Human agreement): TPs 1, 4, and 5 are easiest to distinguish, whereas TPs 2 and 3 are hardest and frequently placed at different points in the synopsis.
We also conducted a human evaluation experiment on Amazon Mechanical Turk (AMT). AMT workers were presented with a synopsis and "highlights", i.e., five sentences corresponding to TPs. We obtained highlights from goldstandard annotations, the distribution baseline, and TAM (+ TP views). AMT workers were asked to read the synopsis and rank the highlights from best to worst according to the following criteria: (1) the quality of the plotline that they form; (2) whether they include the most important events and plot twists of the movie; and (3) whether they provide some description of the events at the beginning and end of the movie. In Figure 4 we show, proportionally, how often our participants ranked each model 1st, 2nd, and so on. Perhaps unsurprisingly, goldstandard TPs were considered best (and ranked 1st 42% of the time). TAM is ranked best 30% of the time, followed by the distribution baseline, which was only ranked first 26% of the time. Overall, the average ranking positions for the goldstandard, TAM, and the baseline are 1.87, 1.98, and 2.16, respectively. Human evaluation therefore validates that our model outperforms the position-based baselines.

Table 6: Identification of TPs in screenplays; results are shown in percent using five-fold cross-validation (TA: mean Total Agreement; PA: Partial Agreement; D: annotation distance; standard deviation in brackets).
TP Identification in Screenplays Our results are summarized in Table 6. For this task, we performed five-fold cross-validation over our original goldstandard set to obtain a test-development split (recall we do not have goldstandard annotations for training). We report Total Agreement (TA), Partial Agreement (PA), and annotation distance D, normalized by screenplay length (Equations (2)-(4)). Aside from the theory- and distribution-based baselines, we also experimented with a common IR baseline which considers TP synopsis sentences as queries and retrieves a neighborhood of semantically similar scenes from the screenplay using tf*idf similarity. Specifically, we compute the maximum tf*idf similarity over all sentences included in the respective scene. We empirically observed that tf*idf's behavior can be erratic, selecting scenes in completely different sections of the screenplay, and therefore constrain it to selecting scenes only within the windows determined by the position distributions (µ ± σ) for each TP. As far as our own models are concerned, we report results with goldstandard TP labels for CAM and TAM, on their own and enriched with entity information.

Figure 5: Posterior probability, per model, that each scene is relevant to a specific TP; vertical dashed lines are goldstandard TP scenes.
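The tf*idf IR baseline can be sketched as follows (a simplified pure-Python tf*idf over whitespace tokens, scoring each scene by the maximum similarity of any of its sentences to the TP query sentence; all names and the toy scenes are ours, and the position-window constraint is omitted for brevity):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns a tf*idf dict per document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_scene(tp_sentence, scenes):
    """Score each scene by the max tf*idf similarity between the TP
    synopsis sentence and any sentence in the scene."""
    flat = [tp_sentence] + [s for scene in scenes for s in scene]
    vecs = tfidf_vectors([d.split() for d in flat])
    query, rest = vecs[0], vecs[1:]
    scores, k = [], 0
    for scene in scenes:
        scores.append(max(cosine(query, rest[k + j]) for j in range(len(scene))))
        k += len(scene)
    return max(range(len(scenes)), key=scores.__getitem__)

scenes = [["meg closes the door", "sarah sleeps"],
          ["burnham opens the safe", "the bonds are inside"],
          ["police arrive in force"]]
idx = best_scene("burnham opens the floor safe", scenes)
# → 1 (the scene mentioning the safe)
```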
We also built an end-to-end system based on TP predictions from TAM. As can be seen in Table 6, the tf*idf approaches perform worse than the position-related baselines. Overall, the similar vocabulary across scenes and mentions of the same entities throughout the screenplay make tf*idf approaches insufficient for our task. The best-performing model is TAM, confirming our hypothesis that TPs are not just isolated key events but also mark boundaries between thematic units, and that segmentation-inspired approaches can therefore be beneficial for the task. Results for entities are somewhat mixed: for CAM, entity-specific information improves TA and PA but increases D, while it does not seem to make much difference for TAM. The performance of the end-to-end TAM model drops slightly compared to the same model using goldstandard TP annotations. However, it remains competitive with the baselines, indicating that tracking TPs in screenplays fully automatically is feasible.
In Figure 5, we visualize the posterior distribution of various models over the scenes of the screenplay for the movie "Juno". The first panel shows the distribution baseline alongside goldstandard TP scenes (vertical lines). We observe that the distribution baseline provides a good approximation of relevant TP positions (which validates its use in the construction of noisy labels, Section 4.2), even though it is not always accurate. For example, TPs 1 and 3 lie outside the expected window in "Juno".
The second panel presents the TP predictions according to tf*idf similarity. We observe that scenes located in entirely different parts of the screenplay present high similarity scores with respect to a given TP due to vocabulary uniformity and mentions of the same entities throughout the screenplay. In the next panel we present the predictions of TAM. Adding synopsis and screenplay encoders yields smoother distributions increasing the probability of selecting TP scenes inside distinct regions of the screenplay, with sharper peaks and higher confidence.

Conclusions
We proposed the task of turning point identification in screenplays as a means of analyzing their narrative structure. We demonstrated that automatically identifying a sequence of key events and segmenting the screenplay into thematic units is feasible via an end-to-end neural network model. In future work, we will investigate the usefulness of TPs for summarization and question answering. We will also scale the TRIPOD dataset and move to a multi-modal setting where TPs are identified directly in video data.