ScriptWriter: Narrative-Guided Script Generation

It is appealing to have a system that automatically generates a story or script from a storyline, even though this is still out of our reach. In dialogue systems, it would also be useful to drive dialogues by a dialogue plan. In this paper, we address a key problem involved in these applications: guiding a dialogue by a narrative. The proposed model, ScriptWriter, selects the best response among the candidates that fit the context as well as the given narrative. It keeps track of what in the narrative has been said and what is to be said. A narrative plays a different role than the context (i.e., previous utterances), which is generally used in current dialogue systems. Due to the unavailability of data for this new application, we construct a new large-scale data collection, GraphMovie, from a movie website where end users can upload their narratives freely when watching a movie. Experimental results on the dataset show that our proposed approach based on narratives significantly outperforms the baselines that simply use the narrative as a kind of context.


Introduction
Narrative is generally understood as a way to tell a story. WordNet defines it as "a message that tells the particulars of an act or occurrence or course of events; presented in writing or drama or cinema or as a radio or television program" 1. Narrative plays an important role in many natural language processing (NLP) tasks. For example, in storytelling, the storyline is a type of narrative, which helps generate coherent and consistent stories (Fan et al., 2018, 2019). In dialogue generation, narrative can be used to define a global plan for the whole conversation session, so as to avoid generating inconsistent and scattered responses (Xing et al., 2018; Tian et al., 2017).

Figure 1: An example of part of a script with a narrative extracted from our GraphMovie dataset. The checked lines are from a ground-truth session, while the unchecked responses are other candidates that are relevant but not coherent with the narrative.

In this work, we investigate the utilization of narratives in a special case of text generation: movie script generation. This special form of conversation generation is chosen due to the unavailability of data for a more general form of application. Yet it requires the same care in leveraging narratives as general conversation does, and hence can provide useful insight into a more general form of narrative-guided conversation. The dataset we use to support our study is collected from GraphMovie 2, where an end user retells the story of a movie by uploading descriptive paragraphs in their own words. More details about the dataset will be presented in Section 3.2. An example is shown in Figure 1, where the narrative is uploaded to retell several lines of a script in a movie. Our task is to generate/select the following lines by leveraging the narrative.
Our problem is closely related to dialogue generation that takes into account the context (Zhang et al., 2018; Zhou et al., 2018b). However, a narrative plays a different and more specific role than a general context. In particular, a narrative may cover the whole story (a part of a script), so a good conversation should also cover all the aspects mentioned in the narrative, which is not required with a general context. In this paper, we propose a new model called ScriptWriter to address the problem of script generation/selection with the help of a narrative. Through an updating mechanism, ScriptWriter keeps track of what in the narrative has been said and what remains, in order to select the next line. The matchings between the updated narrative, the context, and the response are then computed separately and finally aggregated into a matching score. As it is difficult to evaluate the quality of script generation, we frame our work in a more restricted case: selecting the right response among a set of candidates. This more limited form of conversation generation, retrieval-based conversation, has been widely used in previous studies (Zhou et al., 2018b), and it provides an easier way to evaluate the impact of narratives.
We conduct experiments on a dataset we collected and made publicly available (see Section 5). The experiments show that using a narrative to guide the generation/selection of a script is a much more appropriate approach than using it as part of the general context. Our work has three main contributions:
(1) To the best of our knowledge, this is the first investigation of movie script generation with a narrative. This task could be further extended to a more general text generation scenario when suitable data are available.
(2) We construct the first large-scale data collection, GraphMovie, to support research on narrative-guided movie script generation, and we make it publicly accessible.
(3) We propose a new model in which a narrative plays a specific role in guiding script generation. This is shown to be more appropriate than a general context-based approach.
2 Related Work

2.1 Narrative Understanding
It has been more than thirty years since researchers proposed "narrative comprehension" as an important ability of artificial intelligence (Rapaport et al., 1989). The ultimate goal is the development of a computational theory to model how humans understand narrative texts. Early explorations used symbolic methods to represent the narrative (Turner, 1994; Bringsjord and Ferrucci, 1999) or rule-based approaches to generate the narrative (Riedl and Young, 2010). Recently, deep neural networks have been used to tackle the problem (Bamman et al., 2019), and related problems such as generating coherent and cohesive text (Cho et al., 2019) and identifying relations in generated stories (Roemmele, 2019) have also been addressed. However, these studies only focused on how to understand a narrative itself (e.g., how to extract information from a narrative). They did not investigate how to utilize the narrative in an application task such as dialogue generation.

Dialogue Systems
Existing methods of open-domain dialogue can be categorized into two groups: retrieval-based and generation-based. Recent work on response generation is mainly based on the sequence-to-sequence structure with an attention mechanism (Shang et al., 2015; Vinyals and Le, 2015), with multiple extensions (Li et al., 2016; Zhou et al., 2018a, 2020). Retrieval-based methods try to find the most reasonable response from a large repository of conversational data, instead of generating a new one (Zhou et al., 2018b; Zhang et al., 2018). In general, the utterances in the previous turns are taken together as the context for selecting the next response. Retrieval-based methods are widely used in real conversation products due to their more fluent and diverse responses and better efficiency. In this paper, we focus on extending retrieval-based methods by using a narrative as a plan for a session. This is a new problem that has not been studied before.
Contrary to open-domain chatbots, task-oriented systems are designed to accomplish tasks in a specific domain (Seneff et al., 1998; Levin et al., 2000; Wang et al., 2011; Tur and Mori, 2011). In these systems, a dialogue state tracking component is designed for tracking what has happened in a dialogue (Williams and Young, 2007; Henderson et al., 2014; Xu and Rudnicky, 2000). This inspires us to track the remaining information in the narrative that has not been expressed by previous lines of conversation. However, existing methods cannot be applied to our task directly, as they are usually predefined for specific tasks and the state tracking is often framed as a classification problem.

Story Generation
Existing studies have also tried to generate a story. Early work relied on symbolic planning (Meehan, 1977; Cavazza et al., 2002) and case-based reasoning (Pérez y Pérez and Sharples, 2001; Gervás et al., 2005), while more recent work uses deep learning methods. Some of them focused on story ending generation (Peng et al., 2018; Guan et al., 2019), where the story context is given and the model is asked to select a coherent and consistent story ending. This is similar to the dialogue generation problem mentioned above. Besides, attempts have been made to generate a whole story from scratch (Fan et al., 2018, 2019). Compared with the former task, the latter is more challenging, since the story framework and storyline should all be controlled by the model. Some recent studies also tried to guide the generation of dialogues (Wu et al., 2019; Tang et al., 2019) or stories (Yao et al., 2019) with keywords: the next response is asked to include the keywords. This is a step towards guided response generation and bears some similarities with our study. However, a narrative is more general than keywords, and it provides a description of the dialogue session rather than imposing keywords on the next response.

Problem Formulation
Suppose that we have a dataset $D$, in which a sample is represented as $(y, c, p, r)$, where $c = \{s_1, \cdots, s_n\}$ represents a context formed by the preceding sentences/lines $\{s_i\}_{i=1}^{n}$; $p$ is a predefined narrative that governs the whole script session; $r$ is a next-line candidate (we refer to it as a response); and $y \in \{0, 1\}$ is a binary label indicating whether $r$ is a proper response for the given $c$ and $p$. Intuitively, a proper response should be relevant to the context, and be coherent and aligned with the narrative. Our goal is to learn a model $g(c, p, r)$ with $D$ to determine how suitable a response $r$ is to the given context $c$ and narrative $p$.
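To make the formulation concrete, the following is a minimal sketch of how a sample $(y, c, p, r)$ and the ranking use of $g(c, p, r)$ could be represented. The names `Sample`, `rank_candidates` and `toy_score` are hypothetical and introduced only for illustration; `toy_score` is a trivial word-overlap stand-in for $g$, not the ScriptWriter model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One example (y, c, p, r) from the dataset D (illustrative layout)."""
    label: int          # y: 1 if the response is the proper next line, else 0
    context: List[str]  # c: the preceding lines s_1 .. s_n
    narrative: str      # p: the narrative governing the whole session
    response: str       # r: a candidate next line


def rank_candidates(score: Callable[[List[str], str, str], float],
                    context: List[str], narrative: str,
                    candidates: List[str]) -> List[str]:
    """Sort candidates by a scoring function g(c, p, r), best first."""
    return sorted(candidates,
                  key=lambda r: score(context, narrative, r),
                  reverse=True)


def toy_score(context: List[str], narrative: str, response: str) -> float:
    """Toy stand-in for g: fraction of response words found in the narrative."""
    words = response.lower().split()
    narrative_words = set(narrative.lower().split())
    return sum(w in narrative_words for w in words) / max(len(words), 1)
```

Any learned model can be plugged in as `score`; the ranking step itself does not depend on how $g$ is computed.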

Data Collection and Construction
Data is a critical issue in research on story/dialogue generation. Unfortunately, no dataset has been created for narrative-guided story/dialogue generation. To fill the gap, we constructed a test collection from GraphMovie, where an editor or a user can retell the story of a movie by uploading descriptive paragraphs in their own words to describe screenshots selected from the movie. A movie on this website has, on average, 367 descriptions. A description paragraph often contains one to three sentences that summarize a fragment of a movie. It can be at different levels of detail, from retelling the same conversations to a high-level description. We consider these descriptions as narratives for a sequence of dialogues, which we call a session in this paper. Each dialogue in a session is called a line of script (or simply a line).
To construct the dataset, we use the top 100 movies in IMDB 3 as an initial list. For each movie, we collect its description paragraphs from GraphMovie. Then we hire annotators to watch the movie and annotate the start time and end time of the dialogues corresponding to each description paragraph through an annotation tool specifically developed for this purpose. According to the start and end time, the sequence of lines is extracted from the subtitle file and aligned with the corresponding description paragraph.
As viewers of a movie can upload descriptions freely, not all description paragraphs correspond to a narrative or are suitable for our task. For example, some uploaded paragraphs express one's subjective opinions about the movie or the actors, or simply copy the script. Therefore, we manually review the data and remove such non-narrative data. We also remove sessions that have fewer than two lines. Finally, we obtain 16,109 script sessions, each of which contains a description paragraph (narrative) and the corresponding lines of the script. As shown in Table 1, on average, a narrative has about 25 words, and a session has 4.7 lines. The maximum number of lines in a session is 34.
Our task is to select one response from a set of candidates at any point during the session. By moving the prediction point through the session, we obtain a set of micro-sessions, each of which has a sequence of previous lines as context at that point of time, the same narrative as the session, and the next line to predict. The candidates to be selected contain one ground-truth line (the one that is genuinely the next line), together with one (in the training set) or nine (in the validation/test set) other candidates retrieved with the previous lines by Solr 4. The above preparation of the dataset follows the practice in the literature for retrieval-based dialogue.
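The construction of micro-sessions described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released preprocessing code: `build_micro_sessions` is a hypothetical name, `retrieve_negatives` is a hypothetical callable standing in for the Solr retrieval step, and we assume the first prediction point comes after the first line of the session.

```python
def build_micro_sessions(narrative, lines, retrieve_negatives, num_negatives):
    """Slide the prediction point through a session.

    retrieve_negatives(context, k) is a hypothetical callable standing in for
    the retrieval step; it returns up to k distractor lines for the context.
    """
    micro_sessions = []
    # Assumption: the first line always serves as context, so prediction
    # points start at the second line.
    for t in range(1, len(lines)):
        context, truth = lines[:t], lines[t]
        negatives = [c for c in retrieve_negatives(context, num_negatives)
                     if c != truth]  # never label the true line as negative
        candidates = [(truth, 1)] + [(c, 0) for c in negatives]
        micro_sessions.append(
            {"narrative": narrative, "context": context, "candidates": candidates}
        )
    return micro_sessions
```

With `num_negatives=1` this mirrors the training setting (one positive, one negative), and with `num_negatives=9` the validation/test setting (ten candidates in total).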

Overview
A good response is required to be coherent with the previous lines, i.e., the context, and to be consistent with the given narrative. For example, "Just stay a little longer" can respond to "Mama's going to worry about me", and it has no conflict with the narrative in Figure 1. Furthermore, as our target is to generate all lines in the session successively, it is also required that the following lines convey the information that the former lines have not conveyed. Otherwise, only a part of the narrative is covered, and we will miss some other aspects specified in the narrative.
We propose an attention-based model called ScriptWriter to solve the problem. ScriptWriter follows a representation-matching-aggregation framework. First, the narrative, the context, and the response candidate are represented in multiple granularities by multi-level attentive blocks. Second, we propose an updating mechanism to keep track of which parts of a narrative have been expressed and explicitly lower their weights in the updated narrative, so that more emphasis can be put on the remaining parts. Third, matching features are extracted between different elements: between context and response to capture whether the response is a proper reply; between narrative and response to capture whether the response is consistent with the narrative; and between context and narrative to implicitly track what in the narrative has been expressed in the previous lines. Finally, the above matching features are concatenated together and a final matching score is produced by convolutional neural networks (CNNs) and a multi-layer perceptron (MLP).

Representation
To better handle the gap in words between two word sequences, we propose to use an attentive block, which is similar to that used in Transformer (Vaswani et al., 2017). The input of an attentive block consists of three sequences, namely query (Q), key (K), and value (V). The output is a new representation of the query and is denoted as AttentiveBlock(Q, K, V) in the remaining parts. This structure is used to represent a response, lines in the context, and a narrative.
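As an illustration of the interface AttentiveBlock(Q, K, V), the sketch below implements scaled dot-product attention followed by a residual connection on plain Python lists. It is a simplified assumption of what such a block computes: the layer normalization and feed-forward sublayer of a Transformer block are omitted, and the function names are ours.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value):
    """Scaled dot-product attention; each argument is a list of word vectors."""
    d = len(query[0])
    output = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        output.append([sum(w * v[i] for w, v in zip(weights, value))
                       for i in range(len(value[0]))])
    return output


def attentive_block(query, key, value):
    """Simplified block: attention output plus a residual connection to the query.

    Assumption: layer normalization and the feed-forward sublayer are omitted.
    """
    attended = attention(query, key, value)
    return [[a + q for a, q in zip(a_row, q_row)]
            for a_row, q_row in zip(attended, query)]
```

Calling `attentive_block(P, P, P)` gives a self-attention representation of a sequence `P`, while `attentive_block(P, S, S)` lets the words of `P` attend to the words of another sequence `S`.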
More specifically, given a narrative $p = (w^p_1, \cdots, w^p_{n_p})$, a line $s_i = (w^{s_i}_1, \cdots, w^{s_i}_{n_{s_i}})$ and a response candidate $r = (w^r_1, \cdots, w^r_{n_r})$, ScriptWriter first uses a pre-trained embedding table to map each word $w$ to a $d_e$-dimensional embedding $e$, i.e., $w \Rightarrow e$. Thus the narrative $p$, the line $s_i$ and the response candidate $r$ are represented by the matrices $P^0 = (e^p_1, \cdots, e^p_{n_p})$, $S^0_i = (e^{s_i}_1, \cdots, e^{s_i}_{n_{s_i}})$ and $R^0 = (e^r_1, \cdots, e^r_{n_r})$. Then ScriptWriter takes $P^0$, $\{S^0_i\}_{i=1}^{n}$ and $R^0$ as inputs and uses stacked attentive blocks to construct multi-level self-attention representations. The output of the $(l-1)$-th level of attentive block is input into the $l$-th level. The representations of $p$, $s_i$, and $r$ at the $l$-th level are defined as follows:

$$P^l = \text{AttentiveBlock}(P^{l-1}, P^{l-1}, P^{l-1}), \quad (1)$$
$$S^l_i = \text{AttentiveBlock}(S^{l-1}_i, S^{l-1}_i, S^{l-1}_i), \quad (2)$$
$$R^l = \text{AttentiveBlock}(R^{l-1}, R^{l-1}, R^{l-1}), \quad (3)$$

where $l$ ranges from 1 to $L$. Inspired by a previous study (Zhou et al., 2018b), we apply another group of attentive blocks, referred to as cross-attention, to capture the semantic dependency between $p$, $s_i$ and $r$. Considering $p$ and $s_i$ first, their cross-attention representations, denoted as $P^l_{s_i}$ and $S^l_{i,p}$, are obtained by attentive blocks in which one sequence serves as the query and the other as the key and value. Here, the words in the narrative can attend to all words in the line, and vice versa. In this way, some inter-dependent segment pairs, such as "stay" in the line and "go home later" in the narrative, become close to each other in the representations. Similarly, we compute cross-attention representations between $p$ and $r$ and between $r$ and $s_i$ at different levels, which are denoted as $P^l_r$, $R^l_p$, $S^l_{i,r}$ and $R^l_{s_i}$. These representations further provide matching information across different elements in the next step.

Figure 2: Updating mechanism in ScriptWriter. The representation of the narrative is updated by lines in the context one by one. The information that has been expressed is decayed. Thus the updated narrative focuses more on the remaining information.

Updating Mechanism
We design an updating mechanism to keep track of the coverage of the narrative by the lines so that the selection of the response will focus on the uncovered parts. The mechanism is illustrated in Figure 2. We update a narrative gradually by all lines in the context, one by one. For the $i$-th line $s_i$, we match $S_i$ and $P$ by their cosine similarity at all levels $l$ of attentive blocks:

$$T^l_{s_i,p}[j][k] = \cos(S^l_i[j], P^l[k]),$$

where $j$ and $k$ stand for the $j$-th word in $s_i$ and the $k$-th word in $p$, respectively. To summarize how much information in $p$ has been expressed by $s_i$, we compute a vector $D_i$ by summing along the vertical axis on each level of the matching map $T_{s_i,p}$. In the summation on the $l$-th level, $n_p$ and $n_{s_i}$ denote the numbers of words in $p$ and $s_i$, and $\gamma \in [0, 1]$ is a learned parameter that works as a gate controlling the degree to which the mentioned information decays. Finally, for the $i$-th line $s_i$ in the context, we update the narrative's representation by scaling it with $(1 - D^l)$. The initial representation $P^l_0$ is equal to $P^l$ defined in Equation (1). If there are $n$ lines in the context, this update is executed $n$ times, and $(1 - D^l)$ produces a continuous decaying effect.
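The decaying effect can be illustrated with a small sketch for a single level of representation. The normalization used here is an assumption made for illustration only: the column sum of cosine similarities is averaged over the words of the line and clipped to $[0, 1]$ before being gated by $\gamma$. The function names `coverage` and `update_narrative` are ours.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coverage(line_repr, narrative_repr, gamma):
    """D[k]: gated share of narrative word k already expressed by the line.

    Assumption: the column sum is averaged over the line's words and clipped
    to [0, 1]; the exact normalization is not taken from the paper.
    """
    n_s = len(line_repr)
    d = []
    for p_k in narrative_repr:
        mean_sim = sum(cosine(s_j, p_k) for s_j in line_repr) / n_s
        d.append(gamma * min(max(mean_sim, 0.0), 1.0))
    return d


def update_narrative(narrative_repr, context_reprs, gamma):
    """Apply the (1 - D) decay once per context line, in order."""
    updated = [list(row) for row in narrative_repr]
    for line_repr in context_reprs:
        d = coverage(line_repr, narrative_repr, gamma)
        updated = [[(1.0 - d_k) * x for x in row] for d_k, row in zip(d, updated)]
    return updated
```

Under these assumptions, a narrative word that is fully matched by a line keeps a fraction $1 - \gamma$ of its representation after that line, and $(1 - \gamma)^2$ after two such lines, while setting $\gamma = 0$ leaves the narrative unchanged.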

Matching
The matching between the narrative p and the line s i is conducted based on both their self-attention and cross-attention representations, as shown in Figure 3.
First, ScriptWriter computes dot-product matching maps on these two kinds of representations separately, at every level $l$ from 0 to $L$. Each element of a map is the dot product of the $j$-th word representation in $S^l_i$ (self-attention) or $S^l_{i,p}$ (cross-attention) and the $k$-th word representation in $P^l$ or $P^l_{s_i}$, respectively. Then the matching maps of the different levels are concatenated (we write $[;]$ for the concatenation operation). Finally, the matching features computed from the self-attention representations and from the cross-attention representations are fused into the matching matrix $M_{s_i,p}$. The matching matrices $M_{p,r}$ and $M_{s_i,r}$ for narrative-response and context-response are constructed in a similar way; for the sake of brevity, we omit the formulas. After concatenation, each cell in $M_{s_i,p}$, $M_{p,r}$ or $M_{s_i,r}$ has $2(L + 1)$ channels and contains matching information at different levels.
The matchings between narrative, context, and response serve different purposes. Context-response matching ($M_{s_i,r}$) serves to select a response suitable for the context. Context-narrative matching ($M_{s_i,p}$) helps the model "remember" how much information has been expressed and implicitly influences the selection of the next responses. Narrative-response matching ($M_{p,r}$) helps the model to select a response that is more consistent with the narrative. As the narrative keeps being updated along with the lines in the context, ScriptWriter tends to dynamically choose the response that matches what remains unexpressed in the narrative.
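The construction of a multi-channel matching matrix can be sketched as follows, assuming that the self-attention and cross-attention maps are fused by stacking them as channels, which is consistent with the $2(L + 1)$ channels per cell. The function names and the list-of-levels input layout are assumptions for illustration.

```python
def dot_map(line_repr, narrative_repr):
    """Matching map whose entry [j][k] is the dot product of word j and word k."""
    return [[sum(a * b for a, b in zip(s_j, p_k)) for p_k in narrative_repr]
            for s_j in line_repr]


def matching_tensor(self_line, self_narr, cross_line, cross_narr):
    """Stack per-level maps so that each cell [j][k] holds 2(L + 1) channels.

    Each argument is a list with one representation per level 0..L.
    """
    maps = [dot_map(s, p) for s, p in zip(self_line, self_narr)]
    maps += [dot_map(s, p) for s, p in zip(cross_line, cross_narr)]
    n_s, n_p = len(maps[0]), len(maps[0][0])
    return [[[m[j][k] for m in maps] for k in range(n_p)] for j in range(n_s)]
```

For a line of one word, a narrative of two words and $L = 1$, the result has shape 1 x 2 x 4: two self-attention channels followed by two cross-attention channels.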

Aggregation
To further use the information across two consecutive lines, ScriptWriter piles up all the context-narrative matching matrices and all the context-response matching matrices to construct two cubes, $\{M_{s_i,p}\}_{i=1}^{n}$ and $\{M_{s_i,r}\}_{i=1}^{n}$, where $n$ is the number of lines in the session. Then ScriptWriter employs 3D convolutions to distill important matching features from each whole cube. We denote the two resulting feature vectors as $f(c, p)$ and $f(c, r)$. For narrative-response matching, ScriptWriter conducts 2D convolutions on $M_{p,r}$ to distill matching features between the narrative and the response, denoted as $f(p, r)$.
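The three types of matching features are concatenated together, and the matching score $g(c, p, r)$ for ranking response candidates is computed by an MLP with a sigmoid activation function, which is defined as:

$$g(c, p, r) = \sigma\left(W^\top [f(c, p); f(c, r); f(p, r)] + b\right), \quad (13)$$

where $W$ and $b$ are parameters. ScriptWriter learns $g(c, p, r)$ by minimizing cross entropy with $D$. The objective function is formulated as:

$$\mathcal{L} = -\sum_{(y, c, p, r) \in D} \left[ y \log g(c, p, r) + (1 - y) \log\left(1 - g(c, p, r)\right) \right]. \quad (14)$$

5 Experiments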

Evaluation setup
As presented in Table 1, we randomly split the GraphMovie collection into training, validation and test sets. The split ratio is 18:1:1. We split the sessions into micro-sessions: given a session with n lines in the context, we split it into n micro-sessions with length varying from 1 to n. These micro-sessions share the same narrative. By doing this, the model is asked to learn to select one line as the response from a set of candidates at any point during the session, and the dataset, in particular for training, can be significantly enlarged.
We conduct two kinds of evaluation as follows:
Turn-level task asks a model to rank a list of candidate responses based on its given context and narrative for a micro-session. The model then selects the best response for the current turn. This setting is similar to the widely studied response selection task (Zhou et al., 2018b; Zhang et al., 2018). We follow these previous studies and employ recall at position k in n candidates ($R_n@k$) and mean reciprocal rank (MRR) (Voorhees, 1999) as evaluation metrics. For example, $R_{10}@1$ means recall at one when we rank ten candidates (one positive sample and nine negative samples). The final results are averages over all micro-sessions in the test set.
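The turn-level metrics can be computed as in the following sketch, where each micro-session is represented by the labels of its candidates in ranked order (1 for the ground-truth line, 0 otherwise). The function names and the default cut-offs in `ks` are illustrative assumptions.

```python
def recall_at_k(ranked_labels, k):
    """1.0 if the single positive candidate is ranked within the top k."""
    return 1.0 if 1 in ranked_labels[:k] else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the positive candidate (0.0 if it is absent)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def evaluate_turn_level(all_ranked_labels, ks=(1, 2, 5)):
    """Average R@k and MRR over all micro-sessions."""
    n = len(all_ranked_labels)
    results = {f"R@{k}": sum(recall_at_k(r, k) for r in all_ranked_labels) / n
               for k in ks}
    results["MRR"] = sum(reciprocal_rank(r) for r in all_ranked_labels) / n
    return results
```

With ten candidates per micro-session, `results["R@1"]` corresponds to $R_{10}@1$.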
Session-level task aims to predict all the lines in a session gradually. It starts with the first line of the session as the context, together with the given narrative, and predicts the best next line. The predicted line is then incorporated into the context to predict the next line. This process continues until the last line of the session is selected. Finally, we calculate precision over the whole original session and report averages over all sessions in the test set. Precision is defined as the number of correct selections divided by the number of lines in a session. We consider two measures: 1) $P_{strict}$, which accepts a right response at the right position; 2) $P_{weak}$, which accepts a right response at any position.
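The two session-level measures can be sketched as below. The sketch reflects our reading of the definitions: a prediction counts for $P_{strict}$ only if it equals the ground-truth line at the same position, and for $P_{weak}$ if it occurs anywhere among the ground-truth lines. We assume that `predicted` and `gold` list the same number of lines to be predicted; the function name is hypothetical.

```python
def session_precision(predicted, gold):
    """Return (P_strict, P_weak) for one session.

    predicted: lines selected by the model, in order.
    gold: ground-truth lines to be predicted, in order (same length assumed).
    """
    n = len(gold)
    strict = sum(p == g for p, g in zip(predicted, gold)) / n
    gold_set = set(gold)
    weak = sum(p in gold_set for p in predicted) / n
    return strict, weak
```

For example, if the model selects the right lines but swaps two of them, $P_{weak}$ stays at 1 while $P_{strict}$ drops, which is why the gap between the two measures reflects how well a model places responses in the right position.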

Baselines
As no previous work has been done on narrative-based script generation, no proper baseline exists. Nevertheless, some existing multi-turn conversation models based on context can be adapted to work with a narrative: the context is simply extended with the narrative. Two extension methods have been tested: in the first, the narrative is added into the context together with the previous lines; in the second, the narrative is used as a second context. In the latter case, two matching scores are obtained, one for context-response and one for narrative-response, and they are aggregated through an MLP to produce a final score. This second approach turns out to perform better. Therefore, we only report the results with this latter method 5.
(1) MVLSTM (Wan et al., 2016): it concatenates all previous lines as a context and uses an LSTM to encode the context and the response candidate. A matching score is determined by an MLP based on a map of cosine similarity between them. A matching score for narrative-response is produced similarly.
(2) DL2R (Yan et al., 2016): it encodes the context by an RNN followed by a CNN. The matching score is computed similarly to MVLSTM.
(3) SMN: it matches each line with the response sequentially to produce a matching vector with CNNs. The matching vectors are aggregated with an RNN.
(4) DAM (Zhou et al., 2018b): it represents a context and a response by using self-attention and cross-attention operation on them. It uses CNNs to extract features and uses an MLP to get a score. Different from our model, this model only considers the context-response matching and does not track what in the narrative has already been expressed by the previous lines, i.e., context.
(5) DUA (Zhang et al., 2018): it concatenates the last line with each previous line in the context and with the response, respectively. Then it performs a self-attention operation to get refined representations, based on which matching features are extracted with CNNs and RNNs.

Training Details
All models are implemented in TensorFlow 6. Word embeddings are pre-trained by Word2vec (Mikolov et al., 2013) on the training set with 200 dimensions. We test the stack number in {1, 2, 3} and report our results with three stacks. Due to limited resources, we cannot conduct experiments with a larger number of stacks, which could be tested in the future. The two 3D convolutional layers have 32 and 16 filters, respectively. They both use [3, 3, 3] as kernel size, and the max-pooling size is [3, 3, 3]. The two 2D convolutional layers on narrative-response matching have 32 and 16 filters with [3, 3] as kernel size. The max-pooling size is also [3, 3]. All parameters are optimized with the Adam optimizer (Kingma and Ba, 2015). The learning rate is 0.001 and is decreased during training. The initial value for γ is 0.5. The batch size is 64. We use the validation set to select the best models and report their performance on the test set. The maximum number of lines in the context is set to ten, and the maximum lengths of a line, a response, and a narrative sentence are all set to 50. All sentences are zero-padded to the maximum length. We also pad with zeros if the number of lines in a context is less than ten; otherwise, we keep the latest ten lines. The dataset and the source code of our model are available on GitHub 7.
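The padding and truncation rules above can be sketched as follows on lists of token ids. Two details are assumptions of this sketch rather than statements from the text: lines longer than the maximum length are cut to their first tokens, and all-zero padding lines are appended after the real lines. The function names are ours.

```python
def pad_tokens(tokens, max_len=50, pad_id=0):
    """Zero-pad (or cut, by assumption to the first tokens) one sentence."""
    return (tokens[:max_len] + [pad_id] * max_len)[:max_len]


def pad_context(lines, max_lines=10, max_len=50, pad_id=0):
    """Keep the latest max_lines lines and pad the context to a fixed shape."""
    kept = lines[-max_lines:]
    padded = [pad_tokens(line, max_len, pad_id) for line in kept]
    # Assumption: all-zero lines are appended when the context is short.
    padded += [[pad_id] * max_len for _ in range(max_lines - len(padded))]
    return padded
```

The result always has `max_lines` rows of `max_len` token ids, which gives the fixed input shape needed by the convolutional layers.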

Evaluation Results
The experimental results are reported in Table 2. The results on both turn-level and session-level evaluations indicate that ScriptWriter dramatically outperforms all baselines, including DAM and DUA, which are two state-of-the-art models on multi-turn response selection. All improvements are statistically significant (p-value ≤ 0.01). DAM performs better than other baselines, which confirms the effectiveness of the self- and cross-attention mechanism used in this model. The DUA model also uses the attention mechanism, and it outperforms the other baselines that do not use attention. Both observations confirm the advantage of using attention mechanisms over pure RNNs. Between the two session-level measures, we observe that our model is less affected when moving from $P_{weak}$ to $P_{strict}$. This shows that ScriptWriter can better select a response in the right position. We attribute this behavior to the utilization of narrative coverage.

Model Ablation
We conduct an ablation study to investigate the impact of different modules in ScriptWriter. First, we remove the updating mechanism by setting γ = 0 (i.e., the representation of the narrative is not updated but static). This model is denoted as $\text{ScriptWriter}_{static}$ in Table 2. Then we remove narrative-response, context-narrative, and context-response matching, respectively. These variants are denoted as ScriptWriter-PR, ScriptWriter-CP, and ScriptWriter-CR.
Model ablation results are shown in the second part of Table 2. We have the following findings: 1) ScriptWriter performs better than $\text{ScriptWriter}_{static}$, demonstrating the effectiveness of the updating mechanism for the narrative. The optimal value of γ is around 0.647 after training, which means that only about 35% of the information is kept when a line conveys it. 2) In both turn-level and session-level evaluations, the performance drops the most when we remove narrative-response matching. This indicates that the relevance of the response to the narrative is the most useful information in narrative-guided script generation. 3) When we remove context-narrative matching, the performance drops too, indicating that context-narrative matching may provide implicit and complementary information for controlling the alignment of response and narrative. 4) In contrast, when we remove the context-response matching, the performance also drops, but at a much smaller scale, especially on $P_{weak}$, than when narrative-response matching is removed. This contrast indicates that the narrative is a more useful piece of information than the context for determining what should be said next, and thus it should be taken into account with an adequate mechanism.

Figure 4: The performance of ScriptWriter (SW) and DUA on the test set with different types of narrative in session-level evaluation. (Plotted series: $P_{strict}$ of SW, $P_{weak}$ of SW, $P_{strict}$ of DUA, $P_{weak}$ of DUA; axis label: Percentage (%).)

Performance across Narrative Types
As we explained, narratives in our dataset are contributed by netizens, and they vary in style. Some narratives are detailed, while others are general. The question we analyze is how general vs. detailed narratives affect the performance of response selection. We use a simple method to roughly evaluate the degree of detail of a narrative: a narrative that has a high lexical overlap with the lines in the session is considered to be detailed. Narratives are put into six buckets depending on their level of detail, as shown in Figure 4. We plot the performance of ScriptWriter and DUA in session-level evaluation over different types of narratives. The first type, "0", means no word overlap between the narrative and the dialogue session. This is the most challenging case, representing extremely general narratives. It is not surprising to see that both ScriptWriter and DUA perform poorly on this type compared with other types in terms of $P_{strict}$. The performance tends to improve as the overlap ratio increases. This is consistent with our intuition: when a narrative is more detailed and better aligned with the session in wording, it is easier to choose the best responses. This plot also shows that ScriptWriter achieves better performance than DUA on all types of narratives, which further demonstrates the effectiveness of using the narrative to guide the dialogue.
We also observe that the buckets "[0, 0.2)" and "[0.2, 0.4)" contain the largest proportions of narratives. This indicates that most netizens do not use the original lines to retell a story. The problem we address in this paper is thus non-trivial.
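The bucketing by lexical overlap can be sketched as follows. The exact overlap measure is not specified in the text, so this sketch assumes it is the fraction of distinct narrative words that also appear in the lines of the session, with whitespace tokenization and lowercasing. Only the labels "0", "[0, 0.2)" and "[0.2, 0.4)" come from the text; the remaining bucket labels and the function names are assumptions.

```python
def overlap_ratio(narrative, lines):
    """Assumed measure: share of distinct narrative words found in the lines."""
    narrative_words = set(narrative.lower().split())
    line_words = {w for line in lines for w in line.lower().split()}
    if not narrative_words:
        return 0.0
    return len(narrative_words & line_words) / len(narrative_words)


def bucket(ratio):
    """Map an overlap ratio to one of six buckets."""
    if ratio == 0:
        return "0"
    labels = ["[0, 0.2)", "[0.2, 0.4)", "[0.4, 0.6)", "[0.6, 0.8)"]
    for upper, label in zip([0.2, 0.4, 0.6, 0.8], labels):
        if ratio < upper:
            return label
    return "[0.8, 1]"  # assumed label for the most detailed narratives
```

A narrative that shares one of its three distinct words with the session, for instance, has an overlap ratio of 1/3 and falls into the "[0.2, 0.4)" bucket.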

Conclusion and Future Work
Although story generation has been extensively studied in the literature, no existing work has addressed the problem of generating movie scripts following a given storyline or narrative. In this paper, we addressed this problem in the context of generating dialogues in a movie script. We proposed a model that uses the narrative to guide the dialogue generation/retrieval. Through an updating mechanism, we keep track of what in the narrative has already been expressed and what remains, in order to select the next line. The final selection of the next response is based on multiple matching criteria between context, narrative and response. We constructed a new large-scale data collection for narrative-guided script generation from movie scripts. This is the first public dataset available for testing narrative-guided dialogue generation/selection. Experimental results on the dataset showed that our proposed approach based on the narrative significantly outperforms the baselines that use a narrative as an additional context, and showed the importance of using the narrative in a proper manner. As a first investigation of the problem, our study has several limitations. For example, we have not considered the order in the narrative description, which could be helpful in generating dialogues in the correct order. Other methods to track the dialogue state and the coverage of the narrative can also be designed. Further investigations are thus required to fully understand how narratives can be effectively used in dialogue generation.