Generating Animations from Screenplays

Automatically generating animation from natural language text finds application in a number of areas, e.g. movie script writing, instructional videos, and public safety. However, translating natural language text into animation is a challenging task. Existing text-to-animation systems can handle only very simple sentences, which limits their applications. In this paper, we develop a text-to-animation system which is capable of handling complex sentences. We achieve this by introducing a text simplification step into the process. Building on an existing animation generation system for screenwriting, we create a robust NLP pipeline to extract information from screenplays and map it to the system's knowledge base. We develop a set of linguistic transformation rules that simplify complex sentences. Information extracted from the simplified sentences is used to generate a rough storyboard and video depicting the text. Our sentence simplification module outperforms existing systems in terms of BLEU and SARI metrics. We further evaluated our system via a user study: 68% of participants believe that our system generates reasonable animation from input screenplays.


Introduction
Generating animation from texts can be useful in many contexts, e.g. movie script writing (Ma and Kevitt, 2006; Liu and Leung, 2006; Hanser et al., 2010), instructional videos (Lu and Zhang, 2002), and public safety (Johansson et al., 2004). Text-to-animation systems can be particularly valuable for screenwriting by enabling faster iteration, prototyping and proof of concept for content creators.
In this paper, we propose a text-to-animation generation system. Given an input text describing a certain activity, the system generates a rough animation of the text. We are addressing a practical setting, where we do not have any annotated data for training a supervised end-to-end system. The aim is not to generate a polished, final animation, but a pre-visualization of the input text. The purpose of the system is not to replace writers and artists, but to make their work more efficient and less tedious. We are aiming for a system which is robust and could be deployed in a production environment.
Existing text-to-animation systems for screenwriting ( §2) visualize stories by using a pipeline of Natural Language Processing (NLP) techniques for extracting information from texts and mapping them to appropriate action units in the animation engine. The NLP modules in these systems translate the input text into predefined intermediate action representations and the animation generation engine produces simple animation from these representations.
Although these systems can generate animation from carefully handcrafted simple sentences, translating real screenplays into coherent animation still remains a challenge. This can be attributed to the limitations of the NLP modules used with regard to handling complex sentences. In this paper, we try to address the limitations of the current text-to-animation systems. The main contributions of this paper are:
• We propose a screenplay parsing architecture which generalizes well on different screenplay formats ( §3.1).
• We develop a rich set of linguistic rules to reduce complex sentences into simpler ones to facilitate information extraction ( §3.2).
• We develop a new NLP pipeline to generate animation from actual screenplays ( §3).
The potential applications of our contributions are not restricted to just animating screenplays. The techniques we develop are fairly general and can be used in other applications as well e.g. in-

Related Work
Translating texts into animation is not a trivial task, given that neither the input sentences nor the output animations have a fixed structure. Prior work addresses this problem from different perspectives (Hassani and Lee, 2016).
CONFUCIUS (Ma and Kevitt, 2006) is a system that converts natural language to animation using the FDG parser (Tapanainen and Järvinen, 1997) and WordNet (Miller, 1995). ScriptViz (Liu and Leung, 2006) is another similar system, created for screenwriting. It uses the Apple Pie parser (Sekine, 1998) to parse input text and then recognizes objects via an object-specific reasoner. It is limited to sentences having conjunction between two verbs. SceneMaker (Hanser et al., 2010) adopts the same NLP techniques as proposed in CONFUCIUS (Ma and Kevitt, 2006) followed by a context reasoning module. Similar to previously proposed systems, we also use dependency parsing followed by linguistic reduction ( §3.2).
Recent advances in deep learning have pushed the state of the art results on different NLP tasks (Honnibal and Johnson, 2015;Wolf et al., 2018;He et al., 2017). We use pre-trained models for dependency parsing, coreference resolution and SRL to build a complete NLP pipeline to create intermediate action representations. For the action representation ( §3.4), we use a key-value pair structure inspired by the PAR architecture (Badler et al., 2000), which is a knowledge base of representations for actions performed by virtual agents.
Our work comes close to the work done in the area of Open Information Extraction (IE) (Niklaus et al., 2018). In particular, to extract information, Clause-Based Open IE systems (Del Corro and Gemulla, 2013;Angeli et al., 2015; Schmidek and Barbosa, 2014) reduce a complex sentence into simpler sentences using linguistic patterns. However, the techniques developed for these systems do not generalize well to screenplay texts, as these systems have been developed using well-formed and factual texts like Wikipedia, Penn TreeBank, etc. An initial investigation with the popular Open IE system OLLIE (Open Language Learning for Information Extraction) (Mausam et al., 2012) did not yield good results on our corpus.
Previous work related to information extraction for narrative technologies includes the CARDINAL system (Marti et al., 2018; Sanghrajka et al., 2018), as well as the conversational agent PICA (Falk et al., 2018). They focus on querying knowledge from stories. The CARDINAL system also generates animations from input texts. However, neither of the tools can handle complex sentences. We build on the CARDINAL system. We develop a new NLP module to support complex sentences and leverage the animation engine of CARDINAL.
Recently, a number of end-to-end image generation systems have been proposed (Mansimov et al., 2015; Reed et al., 2016). But these systems do not synthesize satisfactory images yet, and are not suitable for our application. It is hoped that the techniques proposed in this paper could be used for automatically generating labelled data (e.g. (text, video) pairs) for training end-to-end text-to-animation systems.

Text-to-Animation System
We adopt a modular approach for generating animations from screenplays. The general overview of our approach is presented in Figure 1. The system is divided into three modules:
Script Parsing Module: Given an input screenplay text, this module automatically extracts the relevant text for generating the animation ( §3.1).
NLP Module: It processes the extracted text to get relevant information. This has two submodules:
• Text Simplification Module: It simplifies complex sentences using a set of linguistic rules ( §3.2).
• Information Extraction Module: It extracts information from the simplified sentences into pre-defined action representations ( §3.4).
Animation Generation Module: It generates animation based on action representations ( §3.5).

Script Parsing Module
Typically, screenplays (also called movie scripts or scripts; we use the terms interchangeably) are made of several scenes, each of which corresponds to a series of consecutive motion shots. Each scene contains several functional components: Headings (time and location), Descriptions (scene description, character actions), Character Cues (character name before dialog), Dialogs (conversation content), Slug Lines (actions inserted into continuous dialog) and Transitions (camera movement). In many scripts, these components are easily identifiable by indentation, capitalization and keywords. We call these scripts well-formatted, and the remaining ones ill-formatted. We want to segment the screenplays into components and are mainly interested in the Descriptions component for animation generation.
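To make the idea of segmenting a well-formatted script by indentation, capitalization and keywords concrete, the following is a minimal sketch. It is not the parsing architecture of the system: the keyword patterns, the indentation thresholds (10 and 20 spaces) and the sample scene are all illustrative assumptions based on common screenplay conventions, and Slug Lines are not handled.

```python
import re
from typing import List, Optional

# Assumed keyword patterns; real scripts vary, so these are illustrative only.
HEADING_RE = re.compile(r"^(INT\./EXT\.|INT\.|EXT\.)")
TRANSITION_RE = re.compile(r"^(FADE IN:|FADE OUT\.?|CUT TO:|DISSOLVE TO:)$")


def classify_line(line: str, prev_label: Optional[str] = None) -> str:
    """Label one screenplay line using keywords, capitalization and indentation."""
    text = line.strip()
    if not text:
        return "Blank"
    indent = len(line) - len(line.lstrip(" "))
    if HEADING_RE.match(text):
        return "Heading"
    if TRANSITION_RE.match(text):
        return "Transition"
    # Assumed layout: character cues are upper-case and deeply indented.
    if text.isupper() and indent >= 20:
        return "Character Cue"
    # Assumed layout: dialog is indented and follows a cue or more dialog.
    if indent >= 10 and prev_label in ("Character Cue", "Dialog"):
        return "Dialog"
    return "Description"


def extract_descriptions(script: str) -> List[str]:
    """Group consecutive Description lines into Descriptions blocks."""
    blocks: List[str] = []
    current: List[str] = []
    prev_label: Optional[str] = None
    for line in script.splitlines():
        label = classify_line(line, prev_label)
        if label == "Description":
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
        prev_label = label
    if current:
        blocks.append(" ".join(current))
    return blocks


# Made-up sample scene (heading and dialog are invented for illustration).
SAMPLE = (
    "INT. KITCHEN - NIGHT\n"
    "\n"
    "The sophomore comes running through the kitchen.\n"
    "\n"
    "                    STIFLER\n"
    "          Hey, watch it!\n"
    "\n"
    "Stifler has a toothbrush hanging from his mouth.\n"
    "\n"
    "CUT TO:\n"
)

if __name__ == "__main__":
    for block in extract_descriptions(SAMPLE):
        print(block)
```

Fixed thresholds of this kind only suit well-formatted scripts; ill-formatted ones would need a different strategy.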

Text Simplification Module
In a typical text-to-animation system, one of the main tasks is to process the input text to extract the relevant information about actions (typically verbs) and participants (typically subject/object of the verb), which is subsequently used for generating animation. This works well for simple sentences having a single verb with one subject and one (optional) object. However, the sentences in a screenplay are complicated and sometimes informal. In this work, a sentence is said to be complicated if it deviates from easily extractable and simple subject-verb-object (and its permutations) syntactic structures and possibly has multiple actions mentioned within the same sentence with syntactic interactions between them. By syntactic structure we refer to the dependency graph of the sentence.

Table 1:
Syntactic Structure | Identify procedure | Transform procedure
Coordination | search if cc and conj in dependency tree | cut cc and conj link. If conj is verb, mark it as new root; else replace it with its sibling node.
Pre-Correlative Conjugation | locates position of keywords: "either", "both", "neither" | removed the located word from dependency tree
Appositive Clause | find appos token and its head (none) | glue appositive noun phrase with "to be"
Relative Clause | find relcl token and its head | cut appos link, then traverse from root. Then, if no "wh" word present, put head part after anchor part; else, we divide them into 5 subcases (
In the case of screenplays, the challenge is to process such complicated texts. We take the text simplification approach, i.e. the system first simplifies a complicated sentence and then extracts the relevant information. Simplification reduces a complicated sentence into multiple simpler sentences, each having a single action along with its participants, making it straightforward to extract necessary information.
Recently, end-to-end Neural Text Simplification (NTS) systems (Nisioi et al., 2017;Saggion, 2017) have shown reasonable accuracy. However, these systems have been trained on factual data such as Wikipedia and do not generalize well to screenplay texts. Our experiments with such a pretrained neural text simplification system did not yield good results ( §5.1). Moreover, in the context of text-to-animation, there is no standard labeled corpus to train an end-to-end system.
There has been work on text simplification using linguistic rule-based approaches. For example, Siddharthan (2011) proposes a set of rules to manipulate sentence structure to output simplified sentences using syntactic dependency parsing. Similarly, the YATS system (Ferrés et al., 2016) implements a set of rules in the JAPE language (Cunningham et al., 2000) to address six syntactic structures: Passive Constructions, Appositive Phrases, Relative Clauses, Coordinated Clauses, Correlated Clauses and Adverbial Clauses.
Most of the rules focus on case and tense correction, with only 1-2 rules for sentence splitting. We take inspiration from the YATS system, and our system incorporates modules to identify and transform sentence structures into simpler ones using a broader set of rules.
In our system, each syntactic structure is handled by an Analyzer, which contains two processes: Identify and Transform. The Identify process takes in a sentence and determines if it contains a particular syntactic structure. Subsequently, the Transform process focuses on the first occurrence of the identified syntactic structure and then splits and assembles the sentence into simpler sentences. Both Identify and Transform use Part-of-Speech (POS) tagging and dependency parsing (Honnibal and Montani, 2017) modules implemented in spaCy 2.0.
The simplification algorithm (Algorithm 1) starts with an input sentence and recursively processes it until no further simplification is possible. It uses a queue to manage intermediate simplified sentences, and runs in a loop until the queue is empty. For each sentence, the system applies each syntactic analyzer to Identify the corresponding syntactic structure in the sentence (line 14). If the result is positive, the sentence is processed by the Transform function to convert it to simple sentences (line 16). Each of the output sentences is pushed by the controller into the queue (line 19). The process is repeated with each of the Identify analyzers (line 13). If none of the analyzers can be applied, the sentence is assumed to be simple and it is pushed into the result list (line 21).
We summarize linguistic rules in Table 1 and examples are given in Table 2. Next, we describe the coordination linguistic rules. For details regarding other rules, please refer to Appendix B.

Table 2:
Suddenly the glare of headlights illuminates them.
Open Clausal | The sophomore comes running[xcomp] through the kitchen. | The sophomore runs through the kitchen. The sophomore comes.
Adjective | Stifler has a toothbrush hanging[acl] from his mouth. | A toothbrush hangs from Stifler's mouth. Stifler has a toothbrush.
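The queue-driven control loop of Algorithm 1, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the `Analyzer` tuple, the `simplify` function and the toy "and then" analyzer are hypothetical names, and the toy analyzer works on raw strings, whereas the actual analyzers operate on POS tags and dependency parses.

```python
from collections import deque
from typing import Callable, List, NamedTuple


class Analyzer(NamedTuple):
    """One syntactic structure: a detector (Identify) plus a rewriter (Transform)."""
    name: str
    identify: Callable[[str], bool]
    transform: Callable[[str], List[str]]


def simplify(sentence: str, analyzers: List[Analyzer]) -> List[str]:
    """Process sentences from a queue until no analyzer applies to any of them."""
    queue = deque([sentence])
    results: List[str] = []
    while queue:
        current = queue.popleft()
        for analyzer in analyzers:
            if analyzer.identify(current):
                # Outputs go back into the queue: they may still be complex.
                queue.extend(analyzer.transform(current))
                break
        else:
            # No analyzer matched, so the sentence is treated as simple.
            results.append(current)
    return results


# Placeholder analyzer for demonstration only (NOT one of the paper's rules).
MARKER = " and then "


def _identify_sequence(sentence: str) -> bool:
    return MARKER in sentence


def _transform_sequence(sentence: str) -> List[str]:
    head, tail = sentence.split(MARKER, 1)  # first occurrence only
    return [head.rstrip(".") + ".", tail[:1].upper() + tail[1:]]


TOY_ANALYZERS = [Analyzer("sequence", _identify_sequence, _transform_sequence)]

if __name__ == "__main__":
    print(simplify("He jumps and then he laughs and then he leaves.", TOY_ANALYZERS))
```

The loop terminates only if every Transform makes progress, i.e. each output contains fewer occurrences of the structure than its input; the toy analyzer satisfies this because each split removes one marker.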
Coordination: Coordination is used for entities having the same syntactic relation with the head and serving the same functional role (e.g. subj, obj, etc.). It is the most important component in our simplification system. The parser tags word units such as "and" and "as well as" with the dependency label cc, and the conjoined words as conj. Our system deals with coordination based on the dependency tag of the conjoined word.
In the case of coordination, the Identify function simply returns whether cc or conj is in the dependency graph of the input sentence. The Transform function manipulates the graph structure based on the dependency tags of the conjoined words as shown in Figure 2. If the conjoined word is a verb, then we mark it as another root of the sentence. Cutting the cc and conj edges in the graph and traversing from this new root results in a new sentence parallel to the original one. In other cases, such as the conjunction of nouns, we simply replace the noun phrases with their siblings and traverse from the root again.
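The coordination rule can be illustrated on a hand-built dependency tree. The sketch below avoids calling a parser so that it stays self-contained; the `Token` structure, the two example sentences and their annotations are made up for illustration. One step is an assumption beyond the text: when the conjoined verb has no subject of its own, the sketch copies the subject of the first verb so that the second sentence is complete.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int              # position in the sentence (also its list index)
    text: str
    pos: str              # coarse POS tag, e.g. "VERB", "NOUN"
    dep: str              # dependency label, e.g. "nsubj", "cc", "conj"
    head: Optional[int]   # idx of the head token; None for the root


def subtree(tokens: List[Token], root: int, cut: FrozenSet[int] = frozenset()) -> List[int]:
    """Indices reachable from `root`, never entering a node listed in `cut`."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in cut:
            continue
        found.append(node)
        stack.extend(t.idx for t in tokens if t.head == node)
    return sorted(found)  # sorting restores the original word order


def identify(tokens: List[Token]) -> bool:
    return any(t.dep in ("cc", "conj") for t in tokens)


def render(tokens: List[Token], indices: List[int]) -> str:
    return " ".join(tokens[i].text for i in indices) + "."


def transform(tokens: List[Token]) -> List[str]:
    root = next(t.idx for t in tokens if t.head is None)
    conj = next(t for t in tokens if t.dep == "conj")  # first occurrence only
    head = conj.head
    # The cc token may hang off either conjunct, depending on the parser scheme.
    ccs = frozenset(t.idx for t in tokens if t.dep == "cc" and t.head in (head, conj.idx))
    cut = ccs | {conj.idx}
    first = subtree(tokens, root, cut)  # original sentence minus cc/conj
    if conj.pos == "VERB":
        second = subtree(tokens, conj.idx, ccs)  # traverse from the new root
        if not any(tokens[i].dep == "nsubj" and tokens[i].head == conj.idx for i in second):
            # Assumption: borrow the subject of the first verb.
            subj = next((t.idx for t in tokens if t.dep == "nsubj" and t.head == head), None)
            if subj is not None:
                second = sorted(subtree(tokens, subj) + second)
    else:
        # Replace the first conjunct's phrase with its sibling phrase.
        replaced = set(subtree(tokens, head, cut))
        second = sorted((set(first) - replaced) | set(subtree(tokens, conj.idx, ccs)))
    return [render(tokens, first), render(tokens, second)]


# Hand-annotated toy parses (illustrative, not parser output).
VERB_CASE = [
    Token(0, "Ben", "PROPN", "nsubj", 1), Token(1, "sits", "VERB", "ROOT", None),
    Token(2, "and", "CCONJ", "cc", 1), Token(3, "opens", "VERB", "conj", 1),
    Token(4, "the", "DET", "det", 5), Token(5, "book", "NOUN", "dobj", 3),
]
NOUN_CASE = [
    Token(0, "Ben", "PROPN", "nsubj", 1), Token(1, "lifts", "VERB", "ROOT", None),
    Token(2, "the", "DET", "det", 3), Token(3, "cup", "NOUN", "dobj", 1),
    Token(4, "and", "CCONJ", "cc", 3), Token(5, "the", "DET", "det", 6),
    Token(6, "plate", "NOUN", "conj", 3),
]

if __name__ == "__main__":
    print(transform(VERB_CASE))
    print(transform(NOUN_CASE))
```

In a full pipeline the output strings would be parsed again before further analyzers run; the sketch stops at the surface strings.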

Lexical Simplification
In order to generate animation, actions and participants extracted from simplified sentences are mapped to existing actions and objects in the animation engine. Due to practical reasons, it is not possible to create a library of animations for all possible actions in the world. We limit our library to a predefined list of 52 actions/animations, expanded to 92 by a dictionary of synonyms ( §3.5).
We also have a small library of pre-uploaded objects (such as "campfire", "truck" and others).
To animate unseen actions not in our list, we use a word2vec-based similarity function to find the nearest action in the list. Moreover, we use WordNet (Miller, 1995) to exclude antonyms. This helps to map non-list actions (such as "squint at") to a similar action in the list (e.g. "look"). If we fail to find a match, we check for a mapping while including the verb's preposition or syntactic object. We also use WordNet to obtain hypernyms for further checks, when the similarity function fails to find a close-enough animation. Correspondingly, for objects, we use the same similarity function and WordNet's holonyms. (For the coordination rule in §3.2, an in-order traversal from the original root and from the new root yields the simplified sentences shown in Table 2.)
Our list of actions and objects is not exhaustive. Currently, we do not cover actions which may not be visual. For out of list actions, we give the user a warning that the action cannot be animated. Nevertheless, this is a work in progress and we are working on including more animations for actions and objects in our knowledge base.
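The mapping from an unseen verb to the nearest animation, with antonym exclusion and a warning for out-of-list actions, might look like the following sketch. Everything concrete in it is an assumption for illustration: the three-dimensional vectors stand in for word2vec embeddings, the `ANTONYMS` dictionary stands in for a WordNet lookup, the three-item animation list is a stand-in for the real one, and the threshold of 0.8 is arbitrary. The preposition, object and hypernym fallbacks are omitted.

```python
import math
import warnings
from typing import Dict, Optional, Set, Tuple

# Made-up toy vectors standing in for word2vec embeddings (assumption).
TOY_VECTORS: Dict[str, Tuple[float, float, float]] = {
    "look": (0.9, 0.1, 0.0),
    "talk": (0.0, 0.9, 0.1),
    "sit": (0.1, 0.0, 0.9),
    "squint": (0.8, 0.2, 0.1),
    "argue": (0.1, 0.8, 0.2),
    "stand": (0.1, 0.1, 0.9),
}
ANIMATION_LIST = ["look", "talk", "sit"]          # stand-in for the real list
ANTONYMS: Dict[str, Set[str]] = {"stand": {"sit"}}  # stand-in for WordNet


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_action(verb: str, threshold: float = 0.8) -> Optional[str]:
    """Return the nearest list action, or None (with a warning) if none is close."""
    if verb in ANIMATION_LIST:
        return verb
    vector = TOY_VECTORS.get(verb)
    # Antonyms can be close in embedding space, so they are excluded first.
    candidates = [a for a in ANIMATION_LIST if a not in ANTONYMS.get(verb, set())]
    if vector is not None and candidates:
        best = max(candidates, key=lambda a: cosine(vector, TOY_VECTORS[a]))
        if cosine(vector, TOY_VECTORS[best]) >= threshold:
            return best
    warnings.warn(f"Action '{verb}' cannot be animated.")
    return None


if __name__ == "__main__":
    for verb in ("squint", "argue", "stand"):
        print(verb, "->", map_action(verb))
```

The "stand" entry shows why the antonym filter matters: its toy vector is nearly identical to that of "sit", and without the filter it would be mapped to the opposite action.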

Action Representation Field (ARF): Information Extraction
For each of the simplified sentences, information is extracted and populated into a predefined key-value pair structure. We will refer to the keys of this structure as Action Representation Fields (ARFs). These are similar to entities and relations in Knowledge Bases. ARFs include: owner, target, prop, action, origin action, manner, modifier location, modifier direction, start-time, duration, speed, translation, rotation, emotion, partial start time (for more details see Appendix C). This structure is inspired by the PAR (Badler et al., 2000) architecture, but adapted to our needs.
To extract the ARFs from the simplified sentences, we use a Semantic Role Labelling (SRL) model in combination with some heuristics, for example creating word lists for duration, speed, translation, rotation, emotion. We use a pretrained Semantic Role Labelling model based on a Bi-directional LSTM network (He et al., 2017) with pre-trained ELMo embeddings (Peters et al., 2018). We map information from each sentence to the knowledge base of animations and objects.
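As an illustration of the key-value structure, the sketch below fills a few ARFs from a hand-written SRL-style frame. The field names follow the list above, but the field types, the role-to-field mapping (PropBank-style labels such as ARG0 and ARGM-MNR) and the word lists are assumptions made for this example; the frame is written by hand and is not the output of the SRL model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ActionRepresentation:
    """Key-value structure; keys follow the ARF list, types are assumed."""
    owner: Optional[str] = None
    target: Optional[str] = None
    prop: Optional[str] = None
    action: Optional[str] = None
    origin_action: Optional[str] = None
    manner: Optional[str] = None
    modifier_location: Optional[str] = None
    modifier_direction: Optional[str] = None
    start_time: Optional[float] = None
    duration: Optional[float] = None
    speed: Optional[str] = None
    translation: bool = False
    rotation: bool = False
    emotion: Optional[str] = None
    partial_start_time: Optional[float] = None


# Assumed mapping from PropBank-style role labels to ARF keys.
ROLE_TO_FIELD = {
    "ARG0": "owner",
    "ARG1": "target",
    "V": "action",
    "ARGM-MNR": "manner",
    "ARGM-LOC": "modifier_location",
    "ARGM-DIR": "modifier_direction",
}
# Hypothetical word lists for the heuristic fields.
SPEED_WORDS = {"quickly": "fast", "slowly": "slow"}
EMOTION_WORDS = {"angrily": "angry", "happily": "happy"}


def fill_arfs(frame: Dict[str, str]) -> ActionRepresentation:
    """Copy SRL roles into ARFs, then apply the word-list heuristics."""
    arf = ActionRepresentation()
    for role, text in frame.items():
        field_name = ROLE_TO_FIELD.get(role)
        if field_name is not None:
            setattr(arf, field_name, text)
    for word in " ".join(frame.values()).lower().split():
        if word in SPEED_WORDS:
            arf.speed = SPEED_WORDS[word]
        if word in EMOTION_WORDS:
            arf.emotion = EMOTION_WORDS[word]
    return arf


if __name__ == "__main__":
    # Hand-written frame for "Carl closes the door sharply".
    frame = {"ARG0": "Carl", "V": "closes", "ARG1": "the door", "ARGM-MNR": "sharply"}
    print(fill_arfs(frame))
```

Fields that neither the role mapping nor the word lists touch keep their defaults, which mirrors how many ARFs remain unset for a short sentence.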

Animation Generation
We use the animation pipeline of the CARDINAL system. We plug our NLP module into CARDINAL to generate animation. CARDINAL creates pre-visualizations of the text, both in storyboard form and animation. A storyboard is a series of pictures that demonstrates the sequence of scenes from a script. The animation is a 3-D animated video that approximately depicts the script. CARDINAL uses the Unreal game engine (Games, 2007) for generating pre-visualizations. It has a knowledge base of pre-baked animations (52 animations, plus a dictionary of synonyms, resulting in 92) and pre-uploaded objects (e.g. "campfire", "tent"). It also has 3-D models which can be used to create characters.

Text-to-Animation Corpus
We initially used a corpus of Descriptions components from ScreenPy (Winer and Young, 2017), in order to study linguistic patterns in the movie script domain. Specifically, we used the "heading" and "transition" fields from ScreenPy's published JSON output on 1068 movie scripts scraped from IMSDb. We also scraped screenplays from SimplyScripts and ScriptORama. After separating screenplays into well-formatted and ill-formatted, Descriptions components were extracted using our model ( §3.1). This gave a corpus of Descriptions blocks from 996 screenplays.
The corpus contains a total of 525,708 Descriptions components. The Descriptions components contain a total of 1,402,864 sentences. Out of all the Descriptions components, 49.45 % (259,973) contain at least one verb which is in the animation list (henceforth called "action verbs"). Descriptions components having at least one action verb have in total 920,817 sentences. Out of the-

Evaluation and Analysis
There are no standard corpora for text-to-animation generation. It is also not clear how such systems should be evaluated and what the most appropriate evaluation metric would be. Nevertheless, it is important to assess how our system is performing. We evaluate our system using two types of evaluation: Intrinsic and Extrinsic. Intrinsic evaluation assesses the NLP pipeline of our system using the BLEU and SARI metrics. Extrinsic evaluation is an end-to-end qualitative evaluation of our text-to-animation generation system, done via a user study.

Intrinsic Evaluation
To evaluate the performance of our proposed NLP pipeline, 500 Descriptions components from the test set were randomly selected. Three annotators manually translated these 500 Descriptions components into simplified sentences and extracted all the necessary ARFs from the simplified sentences. This is a time intensive process and took around two months. 30 % of the Descriptions blocks contain verbs not in the list of 92 animation verbs. There are approximately 1000 sentences in the test set, with average length of 12 words. Each Descriptions component is also annotated by the three annotators for the ARFs.
Taking inspiration from the text simplification community (Nisioi et al., 2017; Saggion, 2017), we use the BLEU score (Papineni et al., 2002) for evaluating our simplification and information extraction modules. For each simplified sentence $s_i$ we have 3 corresponding references $r_i^1$, $r_i^2$ and $r_i^3$. We also use the SARI score (Xu et al., 2016) to evaluate our text simplification module.

Sentence Simplification
Each action block $a$ is reduced to a set of simple sentences $S_a = \{s_1, s_2, \ldots, s_{n_a}\}$. For the same action block $a$, each annotator $t \in \{1, 2, 3\}$ produces a set of simplified sentences $R_a^t$. Since the simplification rules in our system may not maintain the original ordering of verbs, we do not have sentence-level alignment between elements in $S_a$ and $R_a^t$. For example, the action block $a$ = "He laughs after he jumps into the water" is reduced by our system into two simplified sentences, $S_a = \{s_1 = \text{He jumps into the water},\ s_2 = \text{He laughs}\}$, by the temporal heuristics, while annotator 3 gives us $R_a^3 = \{r_1^3 = \text{He laughs},\ r_2^3 = \text{He jumps into the water}\}$. In such cases, sequentially matching $s_i$ to $r_j$ results in a wrong (hypothesis, reference) alignment, namely $(s_1, r_1^3)$ and $(s_2, r_2^3)$. To address this problem, for each hypothesis $s_i \in S_a$, we take the corresponding reference $r_i^t \in R_a^t$ as the one with the least Levenshtein distance (Navarro, 2001) to $s_i$, i.e.,
$$r_i^t = \arg\min_{r \in R_a^t} \mathrm{LD}(s_i, r),$$
where $\mathrm{LD}$ denotes the Levenshtein distance. As per this alignment, in the above example, we will have the correct alignments $(s_1, r_2^3)$ and $(s_2, r_1^3)$. Thus, for each simplified sentence $s_i$ we have 3 corresponding references $r_i^1$, $r_i^2$ and $r_i^3$. The aligned sentences are used to calculate the corpus-level BLEU score and SARI score.
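The alignment step can be sketched in a few lines. The function names are hypothetical, and the distance is computed over characters, which is an assumption: the text does not state whether characters or words are the unit of comparison.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def align(hypotheses: List[str], references: List[str]) -> List[Tuple[str, str]]:
    """Pair each hypothesis with its closest reference from one annotator."""
    return [(s, min(references, key=lambda r: levenshtein(s, r))) for s in hypotheses]


# The example from the text: system output vs. annotator 3.
S_A = ["He jumps into the water", "He laughs"]
R_A3 = ["He laughs", "He jumps into the water"]

if __name__ == "__main__":
    for hypothesis, reference in align(S_A, R_A3):
        print(f"{hypothesis!r} <-> {reference!r}")
```

Because each hypothesis picks its reference independently, two hypotheses may be paired with the same reference; the sketch does not enforce a one-to-one matching.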
The evaluation results for text simplification are summarized in Table 4. We compare against YATS (Ferrés et al., 2016) and the neural end-to-end text simplification system NTS-w2v (Nisioi et al., 2017). YATS is also a rule-based text simplification system. As shown in Table 4, our system performs better than YATS on both metrics, which is indicative of the limitations of the YATS system. A manual examination of the results also showed the same trend. However, the key point to note is that we are not aiming for text simplification in the conventional sense. Existing text simplification systems tend to summarize text and discard some of the information. Our aim is to break a complex sentence into simpler ones while preserving the information. An example of a Descriptions component with BLEU-2 scores is given in Table 3. In the first simplified sentence, the space between Ellie and 's causes the drop in the score, but the output gives exactly the same answer as both annotators. In the second sentence, the system output is the same as annotator I's answer, so the BLEU-2 score is 1. In the last case, the score is low, as the annotators possibly failed to replace her with the actual Character Cue Ellie. Qualitative examination reveals, in general, that our system gives a reasonable result for the syntactic simplification module. As exemplified, BLEU is not the perfect metric to evaluate our system, and therefore in the future we plan to explore other metrics.

ARF Evaluation
We also evaluate the system's output for action representation fields against gold annotations. In our case, some of the fields can have multiple (2 or 3) words, such as owner, target, prop, action, origin action, manner, location and direction. We use BLEU-1 as the evaluation metric to measure the bag-of-words (BOW) similarity between system output and ground truth references. The results are shown in Table 5.
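For a single textual field, a BLEU-1 style score can be computed as clipped unigram precision multiplied by the brevity penalty of Papineni et al. (2002). The sketch below is a per-field, sentence-level illustration with whitespace tokenization and lowercasing, both of which are assumptions; it is not the corpus-level scoring script used for the reported numbers, and the example mentions are invented.

```python
import math
from collections import Counter
from typing import List


def bleu1(hypothesis: str, references: List[str]) -> float:
    """Clipped unigram precision times brevity penalty, for one field value."""
    hyp = hypothesis.lower().split()
    refs = [r.lower().split() for r in references]
    if not hyp or not refs:
        return 0.0
    # A hypothesis word counts at most as often as it occurs in any one reference.
    max_ref_counts: Counter = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in Counter(hyp).items())
    precision = clipped / len(hyp)
    # Brevity penalty uses the reference length closest to the hypothesis length.
    closest = min((len(r) for r in refs), key=lambda n: (abs(n - len(hyp)), n))
    penalty = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return penalty * precision


if __name__ == "__main__":
    # Invented example: a long system mention against shorter annotator mentions.
    print(bleu1("the tall sophomore", ["sophomore", "the sophomore", "sophomore"]))
```

A long system mention scored against short annotator mentions loses precision for every extra word, while a short mention scored against a long reference is reduced by the brevity penalty instead.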
In identifying owner, target and prop, the system tends to use a fixed long mention, while annotators prefer short mentions for the same character/object. The score of prop is relatively lower than all other fields, which is caused by a systematic SRL mapping error. The relatively high accuracy on the action field indicates the consistency between system output and annotator answers.
Annotation of the emotion ARF is rather subjective. Responses on this field are biased and noisy, and the BLEU-1 score on it is relatively low. For the other non-textual ARFs, we use precision and recall to measure the system's behavior. Results are shown in Table 6. These fields are subjective: annotators tend to give different responses for the same input sentence.
rotation and translation have Boolean values. Annotators agree on these two fields in most of the sentences. The system, on the other hand, fails to identify actions involving rotation. For example, in the sentence "Carl closes CARL 's door sharply" all four annotators think that this sentence involves rotation, which is not found by the system. This is due to the specificity of rules on identifying these two fields.
speed, duration and start time have high precision and low recall. This indicates the inconsistency in annotators' answers. For example, in the sentence "Woody runs around to the back of the pizza truck", two annotators give 2 seconds and another gives 1 second in duration. These fields are subjective and need the opinion of the script author or the director. In the future, we plan to involve script editors in the evaluation process.

Extrinsic Evaluation
We conducted a user study to evaluate the performance of the system qualitatively. The focus of the study was to evaluate (from the end user's perspective) the performance of the NLP component w.r.t. generating reasonable animations.
We developed a questionnaire consisting of 20 sentence-animation video pairs. The animations were generated by our system. The questionnaire was filled out by 22 participants. On average, it took around 25 minutes for a user to complete the study.
We asked users to evaluate, on a five-point Likert scale (Likert, 1932), whether the video shown was a reasonable animation for the text, how much of the text information was depicted in the video, and how much of the information in the video was present in the text (Table 7). 68.18% of the participants rated the overall pre-visualization as neutral or above. The rating was 64.32% (neutral or above) for the conservation of textual information in the video, which is reasonable, given limitations of the system that are not related to the NLP component. For the last question, 75.90% (neutral or above) agreed that the video did not have extra information. In general, there seemed to be reasonable consensus in the responses. Besides the limitations of our system, disagreement can be attributed to the ambiguity and subjectivity of the task.
We also asked the participants to describe qualitatively what textual information, if any, was missing from the videos. Most of the missing information was due to limitations of the overall system rather than the NLP component. Facial expression information was not depicted because the character 3-D models are deliberately designed without faces, so that animators can draw on them. Information was also missing in the videos if it referred to objects or actions that do not have a close enough match in the object list or animations list. Furthermore, the animation component only supports animations referring to a character or object as a whole, not to its parts (e.g. "Ben raises his head" is not supported).
However, there were some cases where the NLP component can be improved. For example, lexical simplification failed to map the verb "watches" to the similar animation "look". In one case, syntactic simplification created only two simplified sentences for a verb which had three subjects in the original sentence. In a few cases, lexical simplification successfully mapped to the most similar animation (e.g. "argue" to "talk") but the participants were not satisfied; they were expecting a more exact animation. We plan to address these shortcomings in future work.

Conclusion and Future Work
In this paper, we proposed a new text-to-animation system. The system uses linguistic text simplification techniques to map screenplay text to animation. Evaluating such systems is a challenge. Nevertheless, intrinsic and extrinsic evaluations show reasonable performance of the system. The proposed system is not perfect: for example, it does not take into account the discourse information that links the actions implied in the text, as it currently processes sentences independently. In the future, we would like to leverage discourse information by considering the sequence of actions which are described in the text (Modi and Titov, 2014; Modi, 2016). This would also help to resolve ambiguity in text with regard to actions (Modi, 2017). Moreover, our system can be used for generating training data which could be used for training an end-to-end neural system.