Composing a Picture Book by Automatic Story Understanding and Visualization

Pictures can enrich storytelling experiences. We propose a framework that automatically composes a picture book by understanding story text and visualizing it with painting elements, i.e., characters and backgrounds. For story understanding, we extract key information from a story at both the sentence level and the paragraph level, including characters, scenes and actions. These concepts are organized and visualized in a way that depicts the development of the story. We collect a set of Chinese stories for children and apply our approach to compose picture books for them. Extensive experiments on story event extraction for visualization demonstrate the effectiveness of our method.


Introduction
A story is an ordered sequence of steps, each of which can contain words, images, visualizations, video, or any combination thereof (Kosara and Mackinlay, 2013). Vast amounts of story material exist on the Internet, yet few of these stories include visual data. Among the few presented to audiences, some include illustrations that make the stories more vivid; others are converted to video forms such as cartoons and films, whose production consumes a lot of time and human effort. Although visualized stories are difficult to produce, they are more comprehensible, memorable and attractive. Thus, automatic story understanding and visualization has broad application prospects in storytelling.
As an initial study, we aim to analyze the events of a story and visualize them by combining painting elements, i.e., characters and backgrounds. Story understanding has long been a challenging task in Natural Language Processing (Charniak, 1972). In order to understand a story, we need to tackle the problem of event extraction. A story usually consists of several plots, in which characters appear and perform actions. We define the event keywords of a story as: scene (where), character (who, to whom) and action (what). We extract events from a story at both the sentence level and the paragraph level, so as to make use of the information in each sentence as well as the context of the full story.
As for story visualization, the most challenging problem is stage directing. We need to organize the events following certain spatial distribution rules. Although literary devices such as flashbacks might be used, the order of a story plot roughly corresponds with time (Kosara and Mackinlay, 2013). We arrange the extracted events on a screen along the story timeline. Positions of elements on the screen are determined according to both current and past events. Finally, with an audio track added, simple animations can be generated. These animations resemble storyboards, in which each image represents a major event that corresponds to a sentence or a group of consecutive sentences in the story text.
Regarding storytelling, we first need to know our audience and assess their level of domain knowledge and familiarity with visualization conventions (Ma et al., 2012). In this paper, our target is to understand and visualize Chinese stories for children. We collect children's stories from the Internet. (The sources are described in Section 7.1.) Then, we extract events and prepare visualization materials and a style suitable for children. The framework we propose, however, is widely extensible, since it does not depend on domain-specific knowledge. It could serve as an automatic picture book composition solution for other domains and target audiences.
Our contributions are threefold. 1) We propose an end-to-end framework to automatically generate a sequence of pictures that represent the major events in a story text. 2) We extend the formulation of story event extraction from the sentence level to the paragraph level so as to align events in temporal order. 3) We propose using a neural encoder-decoder model to extract story events and present empirical results with significant improvements over the baseline.
The paper is organized as follows: In Section 2 we introduce related work. We then formulate the problem and give an overview of our proposed solution in Section 3. Details of the different modules are provided in Sections 4, 5 and 6. We describe our data and experiments in Section 7. In Section 8 we conclude and present future work.
Related Work

Story Event Extraction
Event extraction aims to automatically identify from text what happened, when, where, to whom, and why (Zhou et al., 2014). Previous work on event extraction mainly focuses on sentence-level extraction driven by data or knowledge.
Data-driven event extraction methods rely on quantitative methods to discover relations (Hogenboom et al., 2011). Term frequency-inverse document frequency (TF-IDF) (Salton and McGill, 1986) and clustering (Tanev et al., 2008) are widely used. Okamoto et al. (2009) use hierarchical clustering to extract local events. Liu et al. (2008) employ weighted undirected bipartite graphs and clustering methods to extract events from news. Lei et al. (2005) propose using support vector machines for news event extraction.
Knowledge-driven approaches take advantage of domain knowledge, using lexical and syntactic parsers to extract target information. McClosky et al. (2011) convert text to a dependency tree and use dependency parsing to solve the problem. Aone et al. (2009) and Nishihara et al. (2009) focus on designed patterns to parse text. Zhou et al. (2014) propose a Bayesian model to extract structured representations of events from Twitter in an unsupervised way. Different frameworks are designed for specific domains, such as the work in (Yakushiji et al., 2000), (Cohen et al., 2009) and (Li et al., 2002). Although knowledge-driven approaches demand less training data, knowledge acquisition and pattern design remain difficult.
In order to deal with the disadvantages of both methods, researchers work on combining them.
At the training stage of data-driven methods, initial bootstrapping with a dependency parser (Lee et al., 2003) or clustering techniques (Piskorski et al., 2007) is used for better semantic understanding. Chun et al. (2004) combine a lexico-syntactic parser and term co-occurrences to extract biomedical events, while Jungermann et al. (2008) combine a parser with undirected graphs. The only attempt to apply neural networks to this task is the work of Tozzo et al. (2018), who employ an RNN with a dependency parser as training initialization.
We propose a hybrid encoder-decoder approach for story event extraction that avoids the need for hand-crafted knowledge and better utilizes the neural network. Moreover, previous work focuses on sentence-level event extraction, which cannot be directly applied to full-story visualization because event continuity is lost. Thus, we extend event extraction to the paragraph level so that a story can be visualized coherently along a time sequence.

Story Visualization
Previous work mainly focuses on narrative visualization (Segel and Heer, 2010), where the visualization intention is a deeper understanding of the data and the logic inside. Valls et al. (2017) extract story graphs, a formalism that captures the events (e.g., characters, locations) and their interactions in a story. Zitnick et al. (2013) and Zeng et al. (2009) interpret sentences and visualize scenes. There also exists a visual storytelling task (Huang et al., 2016).
The most relevant work to ours is that of Shimazu et al. (1988), who outlined a story-driven animation system and presented a story understanding mechanism for creating animations. They mainly targeted the interpretation of three kinds of actions: action causality checks, action continuity beyond a sentence, and hidden actions between neighbouring sentences. The key solution was a Truth Maintenance System (TMS) proposed in (Doyle, 1979), which relies on constraints pre-defined from human knowledge. Understanding a story with a TMS requires substantial manual effort. In light of this, we propose an approach to story understanding that automatically learns from labelled story data.
Different from previous work, we propose new story visualization techniques, including temporal and spatial arrangement for the screen view. Our framework generates story animations automatically from end to end. Moreover, it is based on event extraction on both the sentence level and the paragraph level.

Figure 1: A flowchart for story understanding and visualization.

Problem Formulation and System Overview
We formulate the problem as follows: the input is a story that contains m paragraphs, and each paragraph p contains n sentences, which are composed of several words. The output is a series of images that correspond to the story. An image I is composed of prepared painting elements (30 scenes, such as sea, and 600 characters, such as fox and rabbit, with different actions in our experiment). As it is costly to prepare painting elements, we want to visualize a given set of stories with as few prepared elements as possible. We show the flowchart of our proposed solution in Figure 1. Given a story text, we first split it into a sequence of narrative sentences. Story event extraction is conducted within each sentence independently. Events are then integrated at the paragraph level and fed into the visualization stage, where they are distributed temporally and spatially on the screen. The visualization part determines what should be shown on the screen, and when and where it should be arranged. Finally, painting elements are displayed on the screen and an audio track is added to produce a picture book with audio.

Sentence-Level Event Extraction
We start from event extraction at the sentence level. Given a sentence s = (x_1, x_2, ..., x_T) of length T, we intend to obtain a label sequence y = (y_1, y_2, ..., y_T), where y_i ∈ {scene, character, action, others}, i ∈ [1, T]. We propose using a neural encoder-decoder model to extract events from a story.
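To make the input and output format concrete, a minimal illustrative example follows (tokens are shown in English for readability, while the actual data are Chinese; the sentence and labels are hypothetical):

```python
# Hypothetical example of the sequence-labeling format used for event extraction:
# one label per token, drawn from {scene, character, action, others}.
sentence = ["The", "rabbit", "ran", "to", "the", "river"]
labels   = ["others", "character", "action", "others", "others", "scene"]
assert len(sentence) == len(labels)  # labels align one-to-one with tokens
```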

BiLSTM-CRF
BiLSTM-CRF is a state-of-the-art method for sequence labeling problems. Thus we apply this model to extract events at the sentence level.
We encode the story sentence with a Bidirectional LSTM (Graves and Schmidhuber, 2005), which processes each training sequence forwards and backwards. A Conditional Random Field (CRF) (Ratinov and Roth, 2009) layer is used as the decoder to overcome the label-bias problem.
Given a sentence s = (x_1, x_2, ..., x_T) of length T, we annotate each word and obtain a ground-truth label sequence l = (l_1, l_2, ..., l_T). Every word is embedded with a word-embedding dictionary pre-trained on a Chinese Wikipedia corpus. The sentence is then represented as E = (e_1, e_2, ..., e_T), where each e_i is padded to a fixed length. We set the embedding length to 100 in our experiment. The embedded sentence vector E is fed into a BiLSTM neural network. The hidden state h_i of the network is calculated in the same way as in (Graves and Schmidhuber, 2005).
Different from a standard LSTM, a Bidirectional LSTM introduces a second hidden layer that processes the data flow in the opposite direction. It is therefore able to extract information from both the preceding and the following context. Each final hidden state is the concatenation of the forward and backward hidden states:

$$h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$$

Instead of adding a softmax classification layer after the hidden states, we employ a CRF (Ratinov and Roth, 2009) to take label correlations into consideration. The hidden layer h = (h_1, h_2, ..., h_T) is fed into the CRF layer, from which we obtain the predicted label sequence y = (y_1, y_2, ..., y_T). The conditional probability is defined as:

$$P(y \mid h) = \frac{\exp\left(\sum_{i=1}^{T} W_{y_i}^{T} h_i + \sum_{i=1}^{T-1} A_{y_i, y_{i+1}}\right)}{\sum_{y'} \exp\left(\sum_{i=1}^{T} W_{y'_i}^{T} h_i + \sum_{i=1}^{T-1} A_{y'_i, y'_{i+1}}\right)}$$

where T is the length of the output sequence, W_{y_i} is the weight vector for label y_i, A_{y_i, y_{i+1}} represents the transition score from label y_i to label y_{i+1}, and y' ranges over all possible output label sequences. Our training objective is to minimize the negative log likelihood of P(y|h).
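As a minimal sketch of this tagger (not the authors' implementation), the snippet below assumes PyTorch together with the third-party pytorch-crf package; the dimensions follow the settings reported in the experiment setup, and all class and variable names are illustrative.

```python
# Minimal BiLSTM-CRF sketch for sentence-level event extraction.
# Assumes: torch and the `pytorch-crf` package (pip install pytorch-crf).
import torch
import torch.nn as nn
from torchcrf import CRF

TAGS = ["scene", "character", "action", "others"]

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100, num_tags=len(TAGS)):
        super().__init__()
        # In practice the embedding would be initialized from vectors
        # pre-trained on the Chinese Wikipedia corpus.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(hidden_dim, num_tags)  # scores W_{y_i}^T h_i
        self.crf = CRF(num_tags, batch_first=True)       # learns transition scores A

    def loss(self, token_ids, tags, mask):
        """Negative log likelihood; mask is a bool tensor marking real tokens."""
        h, _ = self.bilstm(self.embedding(token_ids))
        return -self.crf(self.emission(h), tags, mask=mask, reduction="mean")

    def predict(self, token_ids, mask):
        """Viterbi decoding of the most likely label sequence per sentence."""
        h, _ = self.bilstm(self.embedding(token_ids))
        return self.crf.decode(self.emission(h), mask=mask)
```

The dropout and optimizer settings reported later in the experiment setup would be applied on top of this module during training.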

Model Variants
Recently, the pre-trained model BERT has obtained new state-of-the-art results on a variety of natural language processing tasks (Devlin et al., 2018). We apply this model to our story event extraction. We input a sentence to the BERT base model released by Devlin et al. The last layer of BERT serves as the word embedding and the input to the BiLSTM model. The other parts of the model remain the same for comparison. We refer to this variant as BERT-BiLSTM-CRF. We also experiment with the IDCNN model (Strubell et al., 2017) under the same parameter settings for comparison. IDCNN leverages a convolutional neural network instead of a recurrent one to accelerate training.
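A hedged sketch of how BERT features could replace the static embeddings is shown below; it assumes the HuggingFace transformers package and the publicly released Chinese base model, and the function name is illustrative. Note that BERT's WordPiece tokenization may not align one-to-one with the word segmentation, which a full implementation would have to reconcile.

```python
# Sketch: contextual token features from BERT as input to the BiLSTM-CRF head.
# Assumes the `transformers` package and the released Chinese base model.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def bert_features(sentence: str) -> torch.Tensor:
    """Return one 768-d contextual vector per WordPiece token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():                     # embeddings used as fixed features here
        outputs = bert(**inputs)
    return outputs.last_hidden_state          # shape: (1, seq_len, 768)
```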

Paragraph-Level Event Integration
When generating a story animation, we need to take the full paragraph into consideration so that the events remain continuous in temporal order. (A story might consist of one or multiple paragraphs.) In this part, we integrate sentence-level story events into paragraph-level ones. Given a story paragraph p = (s_1, s_2, ..., s_n) of length n, where each sentence s = (x_1, x_2, ..., x_T) has a corresponding label sequence y = (y_1, y_2, ..., y_T), we integrate the label information and obtain a refined event keyword set for each sentence, denoted as ŷ = (scene, character, action). ŷ indicates the events in the current sentence.
A story paragraph example is presented in Table 1, together with the sentence-level detection results. Event detection results vary across sentences and are quite unbalanced. Only the 1st, the 8th and the 14th sentences have tokens indicating the story scenes. We need to infer that the first scene, "field", should cover the sentence span from the 1st to the 7th sentence. The scene then changes to "river" in the 8th sentence and remains until the 13th, after which it turns to "garden" and stays the same until the end of the story. Similarly, we have to decide which character and action should appear in a sentence time span according to the paragraph information, even if nothing is detected in a specific sentence.
We mainly consider scene and character detection. An action may last from when it emerges until the next action, such as running or driving; it may also be short and happen within a single sentence (e.g., "He sits down."). Determining action continuity requires significantly more human knowledge and is beyond the scope of this paper.
The extracted scene of a sentence is expanded to its neighbours in both forward and backward directions. At scene boundaries, we follow the newly detected scene. In this way, the story is divided into several scenes. We then deal with characters within scenes. Normally, a character emerges at the first sentence in which it is detected and remains on the screen until the current plot ends, as sketched below.
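The following is a minimal sketch of this integration heuristic, assuming per-sentence extraction results are available as Python lists (None meaning nothing was detected in that sentence); the function names and data layout are illustrative.

```python
# Sketch of paragraph-level integration of sentence-level events.

def integrate_scenes(sentence_scenes):
    """Expand detected scenes forward and backward so every sentence has a scene."""
    scenes = list(sentence_scenes)
    current = None
    for i, s in enumerate(scenes):            # forward pass: a scene covers the
        if s is not None:                     # following sentences until a new one
            current = s
        else:
            scenes[i] = current
    current = None
    for i in range(len(scenes) - 1, -1, -1):  # backward pass: sentences before the
        if scenes[i] is not None:             # first detected scene inherit it
            current = scenes[i]
        else:
            scenes[i] = current
    return scenes

def integrate_characters(scenes, sentence_characters):
    """A character stays on screen from its first detection until its scene ends."""
    on_screen, result, prev_scene = set(), [], None
    for scene, chars in zip(scenes, sentence_characters):
        if scene != prev_scene:               # a scene switch clears the stage
            on_screen, prev_scene = set(), scene
        on_screen |= set(chars)
        result.append(sorted(on_screen))
    return result

# Example: scenes detected only in the 2nd and 4th sentences.
print(integrate_scenes([None, "field", None, "river", None]))
# -> ['field', 'field', 'field', 'river', 'river']
```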

Story Visualization
In this part, we calculate positions on the screen for each element. We define the position as [left, top] in percentage relative to the top-left corner of the screen. Elements' positions are determined according to three constraints: 1) meta data of the painting elements for the characters; 2) character number and significance in the current time span; 3) history positions of the elements.
The painting meta data of all elements include the following information:
• (height, width): the size of an element.
The additional meta data of a painting scene are:
• horizon: the distance from the horizontal line in a scene to the scene bottom. We use it as a safe line to arrange the feet of our characters; otherwise, a bear might float above the grassland, for example.
We calculate the number of characters to show on the screen in a time span and evenly distribute their positions based on the painting element sizes and the horizon of the scene. Characters with high significance (talking ones or newly emerged ones) are placed near point A or B. If a character appeared in previous time spans, its position stays the same or changes by a minimal distance from its previous [left, top] position (Equations 6 and 7); if the element appears for the first time, these constraints are ignored. A sketch of the layout computation follows.
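The sketch below illustrates one way such a layout could be computed, assuming element sizes and the scene horizon are expressed in the same screen-percentage units; the names, the spacing rule and the percentage convention are illustrative rather than the authors' exact equations.

```python
# Sketch of the spatial layout step: characters are spread evenly along the
# scene's horizon line, and characters already on screen keep their slots.

def layout_characters(characters, element_sizes, scene_horizon, previous=None):
    """Return {character: (left, top)}, both in percent of the screen size.

    element_sizes: {character: (height, width)} in screen percent.
    scene_horizon: distance from the horizon line to the scene bottom (percent).
    previous:      positions from the previous time span, if any.
    """
    previous = previous or {}
    positions = {}
    n = len(characters)
    for k, name in enumerate(characters):
        if name in previous:                          # continuity: reuse the old slot
            positions[name] = previous[name]
            continue
        height, width = element_sizes[name]
        left = (k + 1) * 100.0 / (n + 1) - width / 2  # evenly spaced centers
        top = 100.0 - scene_horizon - height          # feet rest on the horizon line
        positions[name] = (left, top)
    return positions
```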
As to the orientation setting, we initialize each character with an orientation facing towards the middle of the screen. Those who are talking or interacting with each other are set face to face.
Finally, we add a timeline to the story. Each event in the text is assigned a start time and an end time, so that it appears on the screen accordingly. Along with an audio track, the static images are combined to generate a story animation. The characters are mapped to corresponding elements with the detected actions when such elements are available (e.g., we have dedicated elements for a character that is speaking). Dialogue boxes are added to show which character is speaking. The painting elements are prepared in a clip-art style to make it more flexible to change them, as shown in Figure 2.

Experiment Setup
Data Collection: We collect 3,680 Chinese stories for children from the Internet. Each story contains 47 sentences on average. We randomly sample 10,000 sentences from the stories and split them into three parts: training set (80%), testing set (10%), and development set (10%). We hire four experienced annotators to provide story event annotations. For each sentence, the annotators select event keywords and assign them a category label of scene, character, or action. Words other than event keywords are labeled as "others". We present the statistics of the collected corpus in Table 2.
Each sentence in the training and development sets was annotated by one annotator to save cost, while each sentence in the testing set was annotated by three annotators independently. We calculate Fleiss' Kappa (Viera et al., 2005) to evaluate the agreement among annotators. Each token in a sentence is annotated as y (y ∈ {scene, character, action, others}) by 3 annotators. The Fleiss' Kappa value is 0.580, which indicates moderate agreement among the annotations.
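As a reference, the snippet below shows one way the token-level Fleiss' Kappa could be computed under the setup described above (three labels per token from the four categories); the data layout and function name are illustrative.

```python
# Sketch: Fleiss' Kappa over token-level annotations with 3 raters per token.
from collections import Counter

CATEGORIES = ["scene", "character", "action", "others"]

def fleiss_kappa(token_labels, categories=CATEGORIES):
    """token_labels: one list of rater labels per token, e.g. 3 labels each."""
    n_raters = len(token_labels[0])
    n_items = len(token_labels)
    # counts[i][j]: how many raters assigned category j to token i
    counts = [[Counter(labels)[c] for c in categories] for labels in token_labels]
    # mean per-token observed agreement
    p_bar = sum((sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts) / n_items
    # chance agreement from the marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Example with two tokens, three raters each:
print(fleiss_kappa([["scene", "scene", "others"],
                    ["character", "character", "character"]]))
```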
For story visualization, we hire two designers to design elements for storytelling. The elements include story scenes and characters (with different actions). Each frame of an animation consists of several elements. This mechanism is flexible for element switching and story plot development. We prepare 30 scenes and 600 characters, which have high frequencies in the collected stories. Some example animation elements are shown in Table 3.
Training Details: In the neural methods, the word embedding size is 100. The LSTM model contains 100 hidden units and is trained with the Adam optimizer (Kingma and Ba, 2014) at a learning rate of 0.001. The batch size is set to 20 and 50% dropout is used to avoid overfitting. We train the model for 100 epochs, although it converges quickly.

Sentence-Level Evaluation
We compare the neural models with a parser-based baseline. We first conduct word segmentation with Jieba (Sun, 2012) and part-of-speech (POS) annotation using the Stanford CoreNLP Toolkit (Manning et al., 2014). Then we use a dependency parser to extract events. For scene extraction, we find that most scenes in the children's stories are common places with few specific names or actions. Thus, we construct a common-place dictionary with 778 scene tokens. We keep the NP, NR, NT and NN tokens (Klein and Manning, 2003) from the POS tagging results and filter them according to the scene dictionary. The dependency parser is employed to extract characters and actions. The subjects and objects in a sentence are denoted as the current story characters. The predicates (usually verbs or verb phrases) in the dependency tree are considered to contain the actions of the corresponding characters.
The mean evaluation results over the test sets are shown in Table 4. The BiLSTM-CRF method achieves an F1 score as high as 0.973 in scene extraction. The BERT-BiLSTM-CRF method achieves a 0.843 F1 score in character extraction, which is also high. Action extraction is the most difficult: even the best method, BERT-BiLSTM-CRF, achieves only a 0.499 F1 score, which is too low to use. We also conduct Tukey HSD significance tests over all method pairs. The results indicate that the neural methods are significantly better than the parser-based baseline in scene and character extraction. BERT-BiLSTM-CRF also significantly beats the parser baseline in action extraction. Among the three neural methods, BERT brings significant improvements over the BiLSTM-CRF method in character and action extraction; only in scene extraction is plain BiLSTM-CRF the best, and the differences are significant.
Table 5: Sample event extraction results (actions were denoted with underlines in the sentence column of the original table).
Sentence | Scene | Character
1. The chicken and duck walked happily by the lake. | lake | chicken, duck
2. The chicken and duck walked happily by the lake. | lake | chicken, duck
1. The rabbit's father and mother are in a hurry at home. | home | rabbit's father, mother
2. The rabbit's father and mother are in a hurry at home. | home | rabbit, father, mother
1. He walked into the big forest with his mother's words. | forest | He
2. He walked into the big forest with his mother's words. | forest | He, his mother
1. He said that he once donated money to mountain children. | / | he, children
2. He said that he once donated money to mountain children. | mountain | he, children
1. The rabbit walked and suddenly heard a deafening help. | / | rabbit
2. The rabbit walked and suddenly heard a deafening help. | / | rabbit
Table 5 illustrates sample event extraction results. Most of the story events are correctly extracted, but biases still exist. For example, some detected events do not actually happen and merely appear in imagination or dialogues (e.g., in the verb phrase "heard a deafening help", the action is "heard", not "deafening"). Some keywords serve as modifiers of a character (e.g., in the noun phrase "mountain children", "mountain" does not indicate the current scene but the children's hometown).
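For illustration, a minimal sketch of the dictionary-based scene extraction used in the parser baseline is shown below. The actual baseline relies on Stanford CoreNLP POS tags (NP/NR/NT/NN) and a dependency parser; for brevity this sketch uses Jieba's own noun tags, and the dictionary file name is hypothetical.

```python
# Sketch of the baseline's dictionary-based scene extraction.
# Assumes the `jieba` package; the real baseline uses Stanford CoreNLP POS tags.
import jieba.posseg as pseg

def load_scene_dict(path="scene_dict.txt"):
    """Load the common-place dictionary (778 scene tokens in the paper)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def extract_scenes(sentence, scene_dict):
    """Keep noun tokens that also appear in the common-place dictionary."""
    return [word for word, flag in pseg.cut(sentence)
            if flag.startswith("n") and word in scene_dict]
```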

Paragraph-Level Evaluation
In this evaluation, we focus on event switch detection. Take paragraph-level scene detection as an example: the story in Table 1 includes three scenes, field, river and garden, starting from the 1st, the 8th and the 14th sentence respectively. Paragraph-level event extraction is required to find both the correct switch time and the event content. We compare a simple extension of the sentence-level results with the paragraph-level event integration results (denoted as base and ours in Table 6).
We randomly select 20 stories from the collected corpus and manually annotate the scene and character spans. Scene keywords are mapped into the 30 categories of painting scenes. Sentence-level scene results are extended so that the first sentence containing a scene keyword is regarded as the start of the scene span and the sentence before the next scene is regarded as the span end. For paragraph-level scene integration, scene spans are extended in both forward and backward directions. Moreover, dialogue contexts are ignored, because a scene mentioned in a dialogue might not be the current one; it might be imagined or refer to the past or the future. Other event information is also utilized as a supplement, since character keywords might indicate specific scenes.
We calculate precision, recall and F1 for event detection; a correct hit must detect both the event switch time and the right content. The results are listed in Table 6. As we can see, about 87.8% of scene switches are correctly detected. After the story scene switch information is extracted, it is used in paragraph-level character detection. A character switch is defined as the appearance or disappearance of a single character. The first time a character keyword is detected is denoted as its appearance switch time, and a scene switch is used as an indication of the disappearance of the characters in that scene. Paragraph-level character detection reaches higher accuracy than sentence-level character detection, with an F1 score of over 0.91. T-test results indicate that our improvements are statistically significant.
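For clarity, the snippet below sketches the span-level scoring described above, assuming each event span is represented as a (switch sentence index, content) pair; the representation and function name are illustrative.

```python
# Sketch: precision/recall/F1 for event switch detection.
# A prediction counts as a hit only if both the switch time and the content match.

def prf(gold_spans, pred_spans):
    """gold_spans, pred_spans: iterables of (start_sentence_index, content) pairs."""
    gold, pred = set(gold_spans), set(pred_spans)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the story in Table 1 with one wrongly timed scene switch.
print(prf(gold_spans=[(1, "field"), (8, "river"), (14, "garden")],
          pred_spans=[(1, "field"), (9, "river"), (14, "garden")]))
```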

Visualization Demonstration
Using the prepared 30 painting scenes and 600 characters, we are able to generate picture books for the 3,680 collected stories, with 1.42 scenes and 2.71 characters per story on average. Figure 2 shows some story pictures and painting elements. More examples of video visualization results can be found on our website: https://github.com/StoryVisualization/Demos.

Conclusion and Future Work
In this paper, we propose a framework to address the problem of automatic story understanding and visualization. Story event extraction is extended from the sentence level to the paragraph level for continuous visualization. We collect children's stories from the Internet and apply our framework to generate simple story picture books with audio. Currently, our story events include scenes, characters and actions, and there is still room for improvement in event extraction. Furthermore, it is difficult to enumerate and compose intimate actions between characters, such as "hug", or complex actions, such as "kneeling on the ground". We plan to learn such actions from examples, such as movies, in the future.