Scene Restoring for Narrative Machine Reading Comprehension

This paper focuses on machine reading comprehension for narrative passages. Narrative passages usually describe a chain of events. When reading this kind of passage, humans tend to restore a scene according to the text with their prior knowledge, which helps them understand the passage comprehensively. Inspired by this human behavior, we propose a method that lets the machine imagine a scene while reading a narrative for better comprehension. Specifically, we build a scene graph by utilizing Atomic as the external knowledge and propose a novel Graph Dimensional-Iteration Network (GDIN) to encode the graph. We conduct experiments on ROCStories, a dataset for the Story Cloze Test (SCT), and CosmosQA, a multiple-choice dataset. Our method achieves state-of-the-art results.


Introduction
Machine Reading Comprehension (MRC) is an NLP task designed to evaluate a machine's ability to understand human language. This direction has recently drawn much attention due to the fast development of deep learning techniques and large-scale datasets. As a basic form of MRC, the comprehension of narrative has attracted long-standing interest (Mostafazadeh et al., 2017; Kočiskỳ et al., 2018; Cui et al., 2019). In this paper, we focus on this kind of MRC task.
Unlike other types of text, narratives usually present a series of events, which are related to a scene in real life. In the field of perception, the scene is a kind of information that flows from a physical environment into a perceptual system (Ruderman and Bialek, 1993). When reading a narrative instead of being in a physical environment, humans tend to restore the scene in their mind according to the text with their prior knowledge for better perception and comprehension (Bower and Morrow, 1990; Zwaan et al., 1995). The restored scene is an immediate association about the event, and it could be composed of the event itself, the state of the person roles, the possible cause and effect, and so on. To approach human intelligence, an MRC model is supposed to have a similar ability to restore the scene. However, previous work (Wang et al., 2016; Cui et al., 2019; Zhou et al., 2019a) pays little attention to this ability of narrative MRC models.

Figure 1: An example of Narrative MRC. Specifically, it is an example of the Story Cloze Test (SCT), where given the first four sentences (s_0, s_1, s_2, s_3) of the story, a model is required to select the suitable ending from the candidates (e_0, e_1). The middle part is a presumable description of the scene restored by a human reader.

Figure 1 is an example of Narrative MRC. While reading the story sentence by sentence, a human tends to restore a scene in his or her mind as described in the figure. Subsequently, the human reader can infer that the suitable ending is e_0 based on the information of the scene. Unfortunately, a machine reader is not endowed with the ability to associate the prior knowledge, and cannot restore the scene according to the original narrative. As a result, it cannot thoroughly understand what happens in the story and may make a wrong decision.

Figure 2: The basic overview of the proposed method.
To address this problem, we propose a novel method that restores the scene and utilizes it to understand narrative passages.
Firstly, we propose to employ external knowledge as the basic resource for restoring the scene. Some previous works (Guan et al., 2019) also employ external knowledge in Narrative MRC. However, most of them use the concept knowledge from ConceptNet (Singh, 2004), WordNet (Fellbaum, 1998), or other word-centered knowledge bases to obtain the association information for the noun phrases mentioned in the story. This kind of method helps the machine understand what is mentioned in the story, but not what happens in the story. For example, with those methods, given the sentence "The right side of his face was all covered in blood.", the machine understands the noun phrases "right side", "face", and "blood" better, but is still unable to know that a man is hurt, that he needs medical assistance, and that some others nearby might help him. To this end, we select an event-based knowledge graph, Atomic, as the source of external knowledge. Atomic is an atlas of everyday commonsense reasoning. Each center node of Atomic is an event like "PersonX's face is covered in blood", and the nodes associated with it are the cause, the effect, and the attribute of the roles of the event. Therefore, Atomic helps the machine know "what happens".
Secondly, we utilize a structured description to restore the scene. Specifically, we build a scene graph based on the original narrative and the knowledge from Atomic. Compared with unstructured text, graph data can represent the scene more intuitively. In MRC tasks, previous works (Kipf and Welling, 2016; Qiu et al., 2019) that utilize structured data generally regard the words or noun phrases as the nodes of the graph. Those methods are not tailored to Narrative MRC, where the events and the roles are the key factors. Therefore, we build the scene graph by taking the events, the persons, and the external knowledge of the events as the nodes. Meanwhile, we design the connections of the graph from both the perspectives of each event and the whole passage. Instead of the typical plane graph, we build a three-dimensional graph, which can not only model the relevance among the events in the passage but also retain the unique information of each event. To encode the graph in a targeted manner, we propose the Graph Dimensional-Iteration Network (GDIN). GDIN encodes the scene graph iteratively and thus obtains an integrated representation of the scene graph. As a result, the machine can understand the narrative more comprehensively and make the decision more precisely.
To summarise, inspired by human behaviors, we propose a novel method to restore the scene for narrative MRC. Specifically, we introduce event knowledge from Atomic, and build the scene graph to describe the scene. To encode the graph, we propose a novel graph neural network, GDIN. We conduct experiments on two datasets, ROCStories (Mostafazadeh et al., 2017) and CosmosQA (Huang et al., 2019). The results show that our method achieves state-of-the-art results.

Method
The overview of our method is shown in Figure 2. Our starting point is to let the machine restore the scene like a human while reading narrative passages and then utilize the information from the scene for better comprehension. As shown in Figure 2, given a narrative passage, we first obtain the knowledge for the events mentioned in the passage. Subsequently, we build a scene graph, a three-dimensional graph, whose nodes contain events, person roles, and event knowledge. The graph is composed of two kinds of plane graphs: one is the inner-event graph, which describes a single event; the other is the cross-event graph, which captures the relevance among the events. Meanwhile, we conduct a basic encoding for the narrative passage and the knowledge and then obtain their original representation. By utilizing our proposed Graph Dimensional-Iteration Network (GDIN), we encode the scene graph from the inner-event graph to the cross-event graph iteratively. In this way, we obtain the representation of the scene and then make a prediction based on it.

Knowledge Obtaining
To endow the model with the ability to associate the event-relevant description, we introduce external knowledge from Atomic. Atomic is an event-based knowledge graph. It contains 24,313 central nodes (i.e., base events) like "PersonX repels PersonY's attack". Each of them is linked to multiple types of knowledge nodes, such as the effect on PersonX (e.g., PersonX's heart races), the cause for PersonX (e.g., X wanted to protect himself), the effect on PersonY (e.g., Y gets hurt), and so on. As those knowledge nodes are also events, there are 877,108 ⟨event, relation, event⟩ triples in total.
Nevertheless, due to the diversity of real-world events, Atomic cannot cover all the events. Meanwhile, even if the coverage is acceptable for everyday events, the accuracy of event linking (linking a certain event text to Atomic) cannot be ensured either. Therefore, we employ the pre-training framework Comet (Bosselut et al., 2019), which was originally proposed for the task of knowledge base completion. Specifically, Comet is obtained by fine-tuning GPT (Radford et al., 2018) on Atomic. The training task takes the start event and the relation of a triple ⟨event, relation, event⟩ as input and generates the end event of the triple.
By employing Comet, we design the process of obtaining event knowledge, as shown in Figure 3. Given an event like "Jerry repels Tom's attack", to approximate the phrases in Atomic, we first annotate the person roles, that is, we replace the subject person with "PersonX" and the other person with "PersonY". Thus, we get "PersonX repels PersonY's attack". Secondly, we feed it to Comet and obtain the event knowledge. According to the demands of restoring the scene, we select four types of knowledge: "xIntend" (why does X cause the event), "xEffect" (what effects does the event have on X), "yEffect" (what effects does the event have on Y), and "xAttr" (how would X be described). For example, "xIntend" here could be "PersonX wanted to protect himself". Finally, we resolve the normalized person roles, that is, we replace "PersonX" and "PersonY" with the original person names. For example, "xIntend" will finally be "Jerry wanted to protect himself".
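The three steps above (annotate roles, query the generator, resolve roles) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_generate` is a hypothetical stand-in for Comet that returns canned strings, the "xAttr" output is invented for the example, and the function names are assumptions.

```python
import re

RELATIONS = ("xIntend", "xEffect", "yEffect", "xAttr")  # the four selected types


def normalize_roles(event, subject, other):
    """Replace the person names with the PersonX / PersonY placeholders."""
    event = re.sub(rf"\b{re.escape(subject)}\b", "PersonX", event)
    return re.sub(rf"\b{re.escape(other)}\b", "PersonY", event)


def resolve_roles(text, subject, other):
    """Map the placeholders back to the original person names."""
    return text.replace("PersonX", subject).replace("PersonY", other)


def obtain_knowledge(event, subject, other, generate):
    """Annotate roles, query the generator per relation, then resolve roles."""
    normalized = normalize_roles(event, subject, other)
    return {rel: resolve_roles(generate(normalized, rel), subject, other)
            for rel in RELATIONS}


def toy_generate(normalized_event, relation):
    """Hypothetical stand-in for Comet: a fixed lookup, not a real model."""
    canned = {
        "xIntend": "PersonX wanted to protect himself",
        "xEffect": "PersonX's heart races",
        "yEffect": "PersonY gets hurt",
        "xAttr": "PersonX is brave",  # invented example output
    }
    return canned[relation]


knowledge = obtain_knowledge("Jerry repels Tom's attack", "Jerry", "Tom", toy_generate)
print(knowledge["xIntend"])
```

In a real pipeline, `generate` would wrap a fine-tuned generation model, and the subject and other person would come from the dependency parse rather than being passed in by hand.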

Scene Graph Building
Having annotated the person roles and obtained relevant knowledge for every event, we build a graph, named "scene graph", to present a structured description for the scene. We believe that compared with the unstructured text, the graph can provide a more intuitive description from the perspective of the events for the scene.
Figure 4: Two types of plane graphs, (a) the inner-event graph and (b) the cross-event graph, which compose the three-dimensional scene graph. Nodes are indexed by event i ∈ {0, 1, 2, ..., n − 1}.

The "Scene Graph Building" part in Figure 2 shows the full view of a scene graph. It is a three-dimensional graph composed of two kinds of plane graphs: the inner-event graph and the cross-event graph, as shown in Figure 4. The nodes of event and person are the intersection between the two types of graphs. The inner-event graph describes a single event, and the cross-event graph captures the relevance among the events, including the narrative order and the person coreference. Accordingly, we build an inner-event graph for each event and a cross-event graph for the whole narrative passage.
The graph contains three kinds of nodes: event, person role, and event knowledge. The links of the graph are designed as follows:
(1) In each inner-event graph, a) we link every event knowledge to the event; b) the person roles are linked to the event; c) each knowledge is linked to its corresponding person.
(2) In the cross-event graph, a) we link each event to the adjacent event, which captures the narrative order; b) to pass information from the perspective of the role, we conduct coreference resolution and build a connection between two mentions of the same person across the events.
In this way, we obtain two kinds of adjacency matrices for those plane graphs. They are formulated as $A^{inner}_i \in \mathbb{R}^{7 \times 7}$ ($i \in \{0, 1, 2, \dots, n-1\}$) and $A^{cross} \in \mathbb{R}^{3n \times 3n}$, where $n$ is the total number of the events in the passage.
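The link rules above can be sketched as two small adjacency builders. This is an illustration under stated assumptions: the node ordering, the treatment of links as undirected, and the mapping of each knowledge type to its person (x-prefixed types to PersonX, "yEffect" to PersonY) are not specified in the text and are chosen here for the example.

```python
INNER_NODES = ["event", "personX", "personY", "xIntend", "xEffect", "yEffect", "xAttr"]
# Assumption: x-prefixed knowledge belongs to PersonX, y-prefixed to PersonY.
KNOWLEDGE_OWNER = {"xIntend": "personX", "xEffect": "personX",
                   "xAttr": "personX", "yEffect": "personY"}


def add_edge(matrix, a, b):
    """Add an undirected link between nodes a and b."""
    matrix[a][b] = 1
    matrix[b][a] = 1


def build_inner_adjacency():
    """7x7 matrix: persons and knowledge link to the event, knowledge to its person."""
    idx = {name: k for k, name in enumerate(INNER_NODES)}
    adj = [[0] * 7 for _ in range(7)]
    for person in ("personX", "personY"):
        add_edge(adj, idx[person], idx["event"])
    for knowledge, owner in KNOWLEDGE_OWNER.items():
        add_edge(adj, idx[knowledge], idx["event"])
        add_edge(adj, idx[knowledge], idx[owner])
    return adj


def build_cross_adjacency(n_events, coref_pairs):
    """3n x 3n matrix; coref_pairs holds ((event_i, role), (event_j, role)), role in {"x", "y"}."""
    offset = {"event": 0, "x": 1, "y": 2}
    adj = [[0] * (3 * n_events) for _ in range(3 * n_events)]
    for i in range(n_events - 1):          # narrative order: event i -- event i+1
        add_edge(adj, 3 * i, 3 * (i + 1))
    for (i, role_i), (j, role_j) in coref_pairs:   # coreference links between mentions
        add_edge(adj, 3 * i + offset[role_i], 3 * j + offset[role_j])
    return adj


a_inner = build_inner_adjacency()
a_cross = build_cross_adjacency(4, [((0, "x"), (1, "x")), ((1, "y"), (2, "x"))])
```

The coreference pairs in the usage lines are made up; in practice they would come from a coreference resolver.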

Basic Encoding
Before the process of graph encoding, we employ a pre-trained Language Model (LM) to conduct a basic encoding and obtain the original representation of the nodes. For a certain sequence (e.g., a sentence or a passage), the representation is calculated by

$$S^{seq}, S = \mathrm{LM}(\text{sequence}), \qquad S^{seq} \in \mathbb{R}^{L_s \times d},\ S \in \mathbb{R}^{d},$$

where $L_s$ denotes the word-level length of the sequence, and $d$ is the dimensional size of the representation. We take $S$ as the sequence representation. Thus, inputting the narrative passage to the LM, we can obtain $C^{seq} \in \mathbb{R}^{L_p \times d}$ and $C \in \mathbb{R}^{d}$. In a specific task, the passage will be concatenated with other text (e.g., question or candidate) as the input sequence, which will be detailed in 2.5.
In practice, we regard each sentence in the narrative passages as an event, and thus from $C^{seq}$ we can extract $E^{seq}_i \in \mathbb{R}^{L_e \times d}$, the representation of the words of the $i$-th event, according to its sentence span. Then, we merge it by max-pooling, and obtain $E^{(0)}_i \in \mathbb{R}^{d}$, the original representation of the $i$-th event. Meanwhile, the representation of the roles can be extracted from $C^{seq}$ based on their position as well. Therefore, for the subject person, PersonX, we have $P^{x(0)}_i \in \mathbb{R}^{d}$. For the other person, PersonY, we have $P^{y(0)}_i \in \mathbb{R}^{d}$. Specifically, for each role, we take the representation of its first word as its overall representation. Moreover, taking each knowledge as the input of the LM, we can get the representation for it. Hence, for "xIntend", "xEffect", "yEffect", and "xAttr", we obtain four knowledge representations for the $i$-th event, each in $\mathbb{R}^{d}$.
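The extraction of event and role representations described above can be sketched with plain lists standing in for the LM output. The toy values, the half-open span convention, and the function names are assumptions made for illustration.

```python
def max_pool(token_vectors):
    """Element-wise maximum over a list of equal-length token vectors."""
    return [max(column) for column in zip(*token_vectors)]


def event_representation(passage_tokens, span):
    """Pool the token vectors inside a half-open sentence span [start, end)."""
    start, end = span
    return max_pool(passage_tokens[start:end])


def role_representation(passage_tokens, first_token_index):
    """A role is represented by the vector of its first word."""
    return list(passage_tokens[first_token_index])


# Toy C^seq with 5 tokens and d = 3; the values are made up for illustration.
c_seq = [[0.1, 0.9, 0.0],
         [0.4, 0.2, 0.3],
         [0.7, 0.1, 0.5],
         [0.2, 0.8, 0.6],
         [0.3, 0.3, 0.9]]
e0 = event_representation(c_seq, (0, 3))   # first sentence covers tokens 0..2
p0 = role_representation(c_seq, 0)         # role whose first word is token 0
```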

Dimensional-Iteration Encoding
To encode the graph in a targeted manner and model the scene from both the perspectives of each event and the whole passage, we propose the Graph Dimensional-Iteration Network (GDIN) based on the Graph Convolutional Network (GCN) (Kipf and Welling, 2016). As shown in Figure 2, GDIN encodes the graph along the dimension of the inner-event graph and then along the dimension of the cross-event graph, which is an iterable process. As the original representation of every node has been obtained by the basic encoding, we conduct a dimensional-iteration encoding with GDIN as follows:

(1) Encoding along the dimension of the inner-event graph: At step $t$, for the $i$-th inner-event graph, we formulate the representation of its nodes as $H^{(t)}_i \in \mathbb{R}^{7d}$, the concatenation (denoted by the symbol ";") of the representations of its seven nodes: the event, the two person roles, and the four pieces of knowledge. Then we update the representation of all nodes by a graph convolution over $A^{inner}_i + I$, where $I$ is the identity matrix, $W^{in} \in \mathbb{R}^{7d \times 7d}$ is a trainable matrix, and $\sigma$ is the activation function. $D^{in}_{pp} = \sum_q (A^{inner}_i + I)_{pq}$ is the degree matrix.

(2) Encoding along the dimension of the cross-event graph: At step $t+1$, for the cross-event graph, we collect the nodes of person and event from the above inner-event graphs and formulate the representation of its nodes as the concatenation of their representations. Subsequently, we update the representation of the nodes of person and event by a graph convolution over $A^{cross} + I$, where $W^{cs} \in \mathbb{R}^{3nd \times 3nd}$ is a trainable matrix, and $D^{cs}_{pp} = \sum_q (A^{cross} + I)_{pq}$ is the degree matrix. Note that in this step the representation of the knowledge does not change. Taking the xEffect knowledge as an example, its representation at step $t+1$ equals its representation at step $t$.

Iterating: The nodes of event and person are the intersection between the two types of graphs. By iterating (1) and (2), the information passes across different dimensions along those nodes. Therefore, GDIN can model the three-dimensional scene graph from both the perspectives of each event and the whole passage. Assuming it iterates for $L$ loops, we obtain the final representation of the nodes.
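The alternation between the two dimensions can be sketched as below. This is a rough illustration, not the authors' implementation: it assumes the standard GCN propagation rule of Kipf and Welling, ReLU(D^-1/2 (A + I) D^-1/2 H W), treats node representations as a (nodes × d) matrix with a d × d weight (which differs from the shapes stated in the text), and uses toy graphs and identity weights.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_step(adj, h, w):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    a_hat = [[adj[p][q] + (1 if p == q else 0) for q in range(n)] for p in range(n)]
    deg = [sum(row) for row in a_hat]   # degree of A + I, as in the text
    norm = [[a_hat[p][q] / math.sqrt(deg[p] * deg[q]) for q in range(n)] for p in range(n)]
    return [[max(0.0, v) for v in row] for row in matmul(matmul(norm, h), w)]


def gdin(inner_adjs, cross_adj, nodes, w_in, w_cs, loops):
    """nodes[i] holds 7 vectors for event i: event, PersonX, PersonY, then 4 knowledge."""
    for _ in range(loops):
        # (1) inner-event dimension: update all 7 nodes of every event.
        nodes = [gcn_step(a, h, w_in) for a, h in zip(inner_adjs, nodes)]
        # (2) cross-event dimension: update only event and person nodes;
        #     knowledge vectors are carried over unchanged.
        shared = [vec for h in nodes for vec in h[:3]]
        shared = gcn_step(cross_adj, shared, w_cs)
        nodes = [shared[3 * i:3 * i + 3] + h[3:] for i, h in enumerate(nodes)]
    return nodes


identity = [[1.0, 0.0], [0.0, 1.0]]      # d = 2, identity weights for the toy run
inner = [[0] * 7 for _ in range(7)]
for k in range(1, 7):                    # toy inner graph: every node linked to the event
    inner[0][k] = inner[k][0] = 1
cross = [[0] * 6 for _ in range(6)]
cross[0][3] = cross[3][0] = 1            # event 0 -- event 1 (narrative order)
cross[1][4] = cross[4][1] = 1            # PersonX of event 0 -- PersonX of event 1
start = [[[1.0, 0.0]] * 7, [[0.0, 1.0]] * 7]
out = gdin([inner, inner], cross, start, identity, identity, loops=2)
```

In the toy run, event 0 starts with a zero second component and picks up a positive one through the cross-event link, which is the information flow the iteration is meant to provide.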

We merge the representations of all the events into $C^s \in \mathbb{R}^{d}$ with a trainable vector $w^p \in \mathbb{R}^{d}$. $C^s$ is the representation of the narrative passage built from the description of the scene. Subsequently, we obtain the final representation of the passage by a residual connection: $C^f = [C^s; C] \in \mathbb{R}^{2d}$.
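One plausible reading of this merge step is attention-style pooling over the event vectors followed by concatenation. The sketch below assumes that form (softmax weights from the dot product with the trainable vector); the exact merge equation is not recoverable from the text, so treat this as an illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def merge_events(event_vectors, w_p):
    """Assumed form: weights are softmax(w_p . E_i), output is the weighted sum."""
    weights = softmax([sum(w * e for w, e in zip(w_p, vec)) for vec in event_vectors])
    dim = len(w_p)
    return [sum(a * vec[k] for a, vec in zip(weights, event_vectors)) for k in range(dim)]


def final_representation(c_s, c):
    """C^f = [C^s; C], the concatenation stated in the text."""
    return list(c_s) + list(c)


events = [[1.0, 0.0], [0.0, 1.0]]         # toy final event vectors, d = 2
c_s = merge_events(events, [0.0, 0.0])    # zero vector gives uniform weights
c_f = final_representation(c_s, [0.2, 0.4])
```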

Task-Specific Input and Output
We evaluate our method on two types of MRC test: the story cloze test and multiple choice. Given a passage, the former requires the model to select a suitable ending from two candidates; the latter requires the model to select the answer to a certain question from four candidates. We prepare the input for the model following Devlin et al. (2018) and Radford et al. (2018). For the story cloze test, we concatenate each ending with the given passage as the input sequence of basic encoding. Then we can obtain an ending-aware passage representation $C^{seq}$ and $C$. For multiple choice, we concatenate each option with the question and the passage as the input sequence. Thus we get an option-question-aware passage representation $C^{seq}$ and $C$. After basic encoding and dimensional-iteration encoding, we have the final representation $C^f$. In both of the above tests, there is $C^f_j$, which is the passage representation for the $j$-th candidate. We then score each candidate with a trainable vector $w^s \in \mathbb{R}^{2d}$, where $score_j$ is the normalized selection score of the $j$-th candidate. Finally, we predict by taking the candidate with the highest score as the ending or the answer.
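The candidate scoring can be sketched as a softmax over dot products between the trainable vector and each candidate's passage representation. The softmax form is an assumption suggested by the phrase "normalized selection score"; the toy vectors are made up.

```python
import math


def score_candidates(candidate_reprs, w_s):
    """Assumed form: softmax over the dot products w_s . C^f_j."""
    logits = [sum(w * c for w, c in zip(w_s, rep)) for rep in candidate_reprs]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(candidate_reprs, w_s):
    """Index of the candidate with the highest normalized score."""
    scores = score_candidates(candidate_reprs, w_s)
    return max(range(len(scores)), key=scores.__getitem__)


# Two toy candidates with 2d = 4; the story cloze test has two endings.
reprs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_s = [2.0, 1.0, 0.0, 0.0]
scores = score_candidates(reprs, w_s)
choice = predict(reprs, w_s)
```

For multiple choice, the same functions apply with four candidate representations instead of two.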

Datasets and Metrics
The datasets we choose are ROCStories (Mostafazadeh et al., 2017) and CosmosQA (Huang et al., 2019). The passages of both datasets are narrative.
ROCStories: a popular dataset for the Story Cloze Test (SCT), annotated by Amazon Mechanical Turk (MTurk) workers based on a collection of short stories. In the development and test sets, each instance contains a four-sentence passage and two candidate endings, while the train set only provides the original five-sentence story containing the proper ending. Following previous works (Cai et al., 2017; Chaturvedi et al., 2017; Cui et al., 2019), we take the development set for training and evaluate the performance on the test set.
CosmosQA: a recently proposed dataset formulated as multiple choice. The narratives are collected from the Spinn3r Blog dataset (Burton et al., 2009) and annotated by MTurk. We train and validate the model on the train set and the development set, respectively. As the labels of the test set are not public, we evaluate our model by submitting the predictions to the official website.
Evaluation Metrics: As the targets of both of the above tests are making a choice among the candidates, we use the common metric, accuracy, for evaluation.

Implementation Details
In practice, we regard each sentence in the narrative passages as an event. When annotating the person roles in a particular sentence, we employ spaCy for dependency parsing. To link two mentions of the same person across the sentences while building a graph, we utilize Neural Coreference for coreference resolution.

| Method | Accuracy (%) |
|---|---|
| DSSM (Huang et al., 2013) | 58.5 |
| Conditional GAN (Wang et al., 2017a) | 60.9 |
| End Attn (Cai et al., 2017) | 74.7 |
| LR+RNNLM (Schwartz et al., 2017) | 75.2 |
| HCM (Chaturvedi et al., 2017) | 77.6 |
| SeqMANN (Li et al., 2018) | 84.7 |
| GPT-FT (Radford et al., 2018) | 86.5 |
| Concept | 87.6 |
| BERT-FT (Devlin et al., 2018) | 89.2 |
| BERT+Diff-Net (Cui et al., 2019) | 90.1 |
| Our method (BERT+GDIN) | 91.9 |

For a fair comparison with the state-of-the-art models, we employ the pre-trained language models BERT-large (Devlin et al., 2018) and ALBERT-xxlarge (Lan et al., 2020) for basic encoding, respectively. The optimizer we choose is Adam. The learning rates are $5 \times 10^{-6}$ for the model based on BERT and $1 \times 10^{-5}$ for that based on ALBERT. We train both models for three epochs with a dropout rate of 0.1.

Baselines
We present a series of previous works as baselines for each dataset. For brevity, we only detail the recently published advanced methods.
LM-FT: a kind of model that combines a task-specific output layer with the pre-trained language model, LM. The model is fine-tuned on ROCStories or CosmosQA. LM could be GPT (Radford et al., 2018), BERT, RoBERTa, or ALBERT. Note that the BERT model is BERT-large, which is the same as that in our method for ROCStories; the ALBERT model is ALBERT-xxlarge, which is the same as that in our method for CosmosQA.
Concept: a neural network model for SCT. This model employs a pre-trained language model, which is initialized from GPT, and introduces the external knowledge from ConceptNet.
BERT+Diff-Net: the state-of-the-art model for SCT. It employs the pre-trained language model BERT (BERT-large). In particular, it focuses on better modeling the differences of each ending and discriminates two endings in three semantic aspects.

Effectiveness of the Event Knowledge
As stated in 2.1, Knowledge Obtaining, we choose the event-relevant knowledge from Atomic instead of the concept knowledge from ConceptNet, WordNet, or other word-centered knowledge bases. To validate the effectiveness of the event knowledge, we employ GPT and RoBERTa for basic encoding. We combine them with GDIN, respectively, and conduct experiments on the two datasets. Table 3 shows the results of the experiments. Our model, GPT+GDIN, surpasses Concept, which utilizes GPT and the knowledge from ConceptNet. Meanwhile, the performance of RoBERTa+GDIN is better than that of K-Adapter, which employs RoBERTa and entity and syntax knowledge. To a certain extent, those pairs of comparison verify the effectiveness and suitability of the event knowledge we choose for narrative MRC.

| Dataset | Method | Accuracy (%) |
|---|---|---|
| ROCStories | Concept | 87.6 |
| ROCStories | GPT+GDIN | 88.3 |
| CosmosQA | K-Adapter (Wang et al., 2020) | 81.8 |
| CosmosQA | RoBERTa+GDIN | 82.5 |

Table 3: Comparison between the different sources of knowledge.

Figure note: In the cross-event graph, we omit some links among the persons for brevity, including $p^x_0$ to $p^y_2$, $p^x_0$ to $p^y_3$, $p^x_1$ to $p^y_3$, and $p^y_0$ to $p^x_3$.

Effectiveness of the Scene Graph
As stated in 2.2, Scene Graph Building, we propose a three-dimensional graph to describe the scene. To verify the advantages of this method, we build two baselines as follows:
BERT+Flat: a method that describes the scene by flattened unstructured text. Specifically, BERT+Flat attaches the knowledge sentences to their corresponding event text, and organizes each event as the sequence xAttr + and + xIntend + Event + xEffect + yEffect, where the subject name of xIntend is dropped for fluency. During the process of encoding, the passage joined with the event knowledge is encoded as a whole, and the ending-aware (or option-question-aware) passage representation $C$ is applied directly to predict. Figure 5 shows an example of the comparison between the unstructured description and the structured one for the scene.
BERT+Plane: a method that merges our proposed three-dimensional scene graph into a unified plane graph. Specifically, we put all of the inner-event graphs on a single plane and then build connections among them with the links of the cross-event graph, e.g., the link between e_0 and e_1. Because GDIN is not suitable for this plane graph, we encode it with a two-layer GCN instead. The other processes are the same as those in our proposed method.
The comparison results on the ROCStories dataset are shown in Table 4. Compared with BERT+Flat, the graph-based method BERT+GDIN shows significant advantages. The result further confirms our belief that structured data provides a more intuitive and exploitable description of the scene for the machine. Besides, BERT+GDIN surpasses BERT+Plane, which verifies the effectiveness of our proposed three-dimensional graph. From our point of view, during the process of encoding, the unified plane graph cannot retain the unique information of each event as well as the three-dimensional graph does.

Effectiveness of Iterable Encoding
As stated in 2.4, Dimensional-Iteration Encoding, we propose a novel neural network, GDIN, for encoding the three-dimensional scene graph in a targeted manner. To study the effectiveness of the iteration, we set different numbers of iteration steps for our model and conduct experiments on ROCStories. The results are shown in Table 5. On the one hand, when the number is 1, where the model does not actually iterate, the performance lags clearly behind that of 2 steps. This demonstrates the effectiveness of the iteration. On the other hand, as the step number increases, the performance rises rapidly and then drops slowly. This phenomenon indicates that in addition to enabling the iteration, it is also important to select a proper number of iteration steps. We deduce that the proper step is the balance point where each event retains its unique information and, at the same time, also gets the associated information from the whole passage.

Related Work
Machine Reading Comprehension: Due to the fast development of deep learning techniques and large-scale datasets, Machine Reading Comprehension (MRC) has gained increasingly wide attention over the past few years. Richardson et al. (2013) build the multiple-choice dataset MCTest, which encourages the early research of machine reading comprehension and inspires a strand of MRC models (Sachan et al., 2015; Narasimhan and Barzilay, 2015). Hermann et al. (2015) propose a cloze test dataset, CNN & Daily Mail, which is large-scale and more suitable than MCTest for deep learning methods. Based on this dataset, Hermann et al. (2015) propose an attention-based LSTM model named Attentive Reader, and Chen et al. (2016) simplify this model by directly utilizing the query-aware context representations to match the candidate answer. Moreover, Rajpurkar et al. (2016) release the span extraction dataset SQuAD, which has become the most popular MRC dataset over recent years. This dataset inspires many classical MRC models, like Bidirectional Attention Flow (BiDAF) (Seo et al., 2016) and R-Net (Wang et al., 2017b). Recently, there are some new trends in this field, such as multi-passage MRC (Campos et al., 2016), knowledge-based MRC (Ostermann et al., 2018), and multi-hop MRC (Min et al., 2019).
Narrative Comprehension: Understanding narrative is a challenging task in natural language understanding, for the passages contain rich cause and effect relations. A large body of previous work focuses on script learning (Schank and Abelson, 1977). Some previous works addressed script learning by focusing on the narrative cloze test (Chambers and Jurafsky, 2008). The Story Cloze Test (Mostafazadeh et al., 2017) was then introduced as a new evaluation framework, and has gained wide attention (Chaturvedi et al., 2017; Zhou et al., 2019b).
Besides, recent works present other test frameworks for narrative comprehension, such as multiple choice (Huang et al., 2019) and answer generation (Kočiskỳ et al., 2018). Compared with the other complex forms of test, e.g., answer generation, the test frameworks we choose (selecting an ending or an answer) are more focused on narrative comprehension itself.

Conclusion
In this paper, we focus on Narrative Machine Reading Comprehension. Inspired by human behaviors, we propose a novel method to restore the scene for the narrative passage. Specifically, we introduce the event knowledge from Atomic and build a three-dimensional graph to describe the scene. To encode the scene graph, we propose the Graph Dimensional-Iteration Network (GDIN). We conduct experiments on two relevant datasets, ROCStories and CosmosQA. The results show that our method achieves state-of-the-art results. Further experimental investigation shows that (1) compared with concept knowledge, the event knowledge we choose is more suitable for narrative MRC; (2) our proposed graph models the scene more effectively than the unstructured text and the unified plane graph do; (3) our proposed GDIN encodes the scene graph effectively by iterating multiple steps.