Narrative Modeling with Memory Chains and Semantic Supervision

Story comprehension requires a deep semantic understanding of the narrative, making it a challenging task. Inspired by previous studies on the ROC Story Cloze Test, we propose a novel method that tracks the semantic aspects of a story with external neural memory chains, encouraging each chain to focus on a distinct aspect. Evaluated on the task of story ending prediction, our model demonstrates superior performance to a collection of competitive baselines, setting a new state of the art.


Introduction
Narrative comprehension has been a long-standing challenge in artificial intelligence (Winograd, 1972; Turner, 1994; Schubert and Hwang, 2000). The difficulties of this task arise from the necessity of understanding not only narratives, but also commonsense and normative social behaviour (Charniak, 1972). Of particular interest in this paper is the work by Mostafazadeh et al. (2016) on understanding commonsense stories in the form of a Story Cloze Test: given a short story, we must predict the most coherent sentential ending from two options (e.g. see Figure 1).
Many attempts have been made to solve this problem, based either on linear classifiers with handcrafted features (Schwartz et al., 2017; Chaturvedi et al., 2017), or on representation learning via deep learning models (Mihaylov and Frank, 2017; Bugert et al., 2017; Mostafazadeh et al., 2017). Another widely used component of competitive systems is language model-based features, which require training on large corpora in the story domain.
Figure 1: An example Story Cloze Test instance.
Context: Sam loved his old belt. He matched it with everything. Unfortunately he gained too much weight. It became too small.
Coherent Ending: Sam went on a diet.
Incoherent Ending: Sam was happy.

The current state-of-the-art approach of Chaturvedi et al. (2017) is based on understanding the context from three perspectives: (1) event sequence, (2) sentiment trajectory, and (3) topic consistency. Chaturvedi et al. (2017) adopt external tools to recognise relevant aspect-triggering words, and manually design features to incorporate them into the classifier. While identifying triggers has been made easy by the use of various linguistic resources, crafting such features is time consuming and requires domain-specific knowledge along with repeated experimentation.
Inspired by the argument for tracking the dynamics of events, sentiment and topic, we propose to address the issues identified above with multiple external memory chains, each responsible for a single aspect. Building on Recurrent Entity Networks ("EntNets"), shown by Henaff et al. (2017) to be superior to LSTMs for reasoning-focused question answering and cloze-style reading comprehension, we introduce a novel multi-task learning objective which encourages each chain to focus on a particular aspect. While still making use of external linguistic resources, we do not extract or design features from them, but rather use such tools to generate labels. The generated labels are then used to guide training so that each chain focuses on tracking a particular aspect. At test time, our model is free of feature engineering: once trained, it can be easily deployed to unseen data without preprocessing. Moreover, our approach also differs in the lack of a ROC Stories language model component, eliminating the need for large, domain-specific training corpora.
Evaluated on the task of Story Cloze Test, our model outperforms a collection of competitive baselines, achieving state-of-the-art performance under more modest data requirements.

Methodology
In the story cloze test, given a story of length L, consisting of a sequence of context sentences c = {c_1, c_2, ..., c_L}, we are interested in predicting the coherent ending to the story out of two ending options o_1 and o_2. Following previous studies (Schwartz et al., 2017), we frame this problem as a binary classification task. Assuming o_1 and o_2 are the logical and inconsistent endings respectively, with associated labels y_1 and y_2, we assign y_1 = 1 and y_2 = 0. At test time, given a pair of possible endings, the system returns the one with the higher score as the prediction. In this section, we first describe the model architecture and then detail how we identify aspect-triggering words in text and incorporate them in training.

Proposed Model
First, to take neighbouring contexts into account, we process context sentences and ending options at the word level with a bi-directional gated recurrent unit ("Bi-GRU": Chung et al. (2014)):

$\overrightarrow{h}_i = \mathrm{GRU}(w_i, \overrightarrow{h}_{i-1})$

where w_i is the embedding of the i-th word in the story or ending option. The backward hidden representation ←h_i is computed analogously but with a different set of GRU parameters. We simply take the sum h_i = →h_i + ←h_i as the representation at time i. An ending option, denoted o, is represented by taking the sum of the final states of the same Bi-GRU over the ending option word sequence.
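As a concrete (toy) sketch of this encoder, the following NumPy code implements a minimal GRU cell and the bi-directional pass with summed states. All parameter names, the random initialisation, and the dimensions are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gru_params(d, rng, scale=0.1):
    # One parameter set per direction: update gate z, reset gate r, candidate h.
    return {k: rng.normal(0, scale, (d, d)) for k in
            ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def gru_step(p, w, h_prev):
    z = sigmoid(p["Wz"] @ w + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ w + p["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(p["Wh"] @ w + p["Uh"] @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

def bi_gru(words, p_fwd, p_bwd):
    d = words[0].shape[0]
    fwd, bwd = [], []
    h = np.zeros(d)
    for w in words:                       # left-to-right pass
        h = gru_step(p_fwd, w, h)
        fwd.append(h)
    h = np.zeros(d)
    for w in reversed(words):             # right-to-left pass
        h = gru_step(p_bwd, w, h)
        bwd.append(h)
    bwd.reverse()
    # h_i = forward state + backward state, as in the text
    return [f + b for f, b in zip(fwd, bwd)]

d = 8
words = [rng.normal(size=d) for _ in range(5)]   # 5 toy word embeddings
states = bi_gru(words, make_gru_params(d, rng), make_gru_params(d, rng))
```

The ending-option representation o would then be the sum of the two final states of the same Bi-GRU run over the ending's word sequence.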
This representation is then taken as input to a Recurrent Entity Network ("EntNet": Henaff et al. (2017)), capable of tracking the state of the world with external memory. Developed in the context of machine reading comprehension, EntNet maintains a number of "memory chains" (one for each entity) and dynamically updates their representations as it progresses through a story. Here, we borrow the concept of "entity" and use each chain to track a unique aspect. An illustration of EntNet is provided in Figure 2.

Figure 2: Illustration of EntNet with a single memory chain at time i. φ and σ represent Equations 1 and 2, while circled nodes L, C, ⊙ and + depict the location term, content term, Hadamard product, and addition, resp.

Memory update candidate. At every time step i, the internal memory update candidate for the j-th memory chain is computed as:

$\overrightarrow{\tilde{m}}^j_i = \phi(\overrightarrow{U}\,\overrightarrow{m}^j_{i-1} + \overrightarrow{V} k^j + \overrightarrow{W} h_i)$ (1)

where k^j is the embedding for the j-th entity (key), →U, →V and →W are trainable parameters, and φ is the parametric ReLU (He et al., 2015).
Memory update. The update of each memory chain is controlled by a gating mechanism:

$\overrightarrow{g}^j_i = \sigma(h_i^\top k^j + h_i^\top \overrightarrow{m}^j_{i-1})$ (2)

$\overrightarrow{\hat{m}}^j_i = \overrightarrow{m}^j_{i-1} + \overrightarrow{g}^j_i \odot \overrightarrow{\tilde{m}}^j_i$ (3)

where ⊙ denotes the Hadamard product (resulting in the unnormalised memory representation), and σ is the sigmoid function. The gate →g^j_i controls how much the memory chain should be updated, a decision factoring in two elements: (1) the "location" term, quantifying how related the current input h_i (the output of the Bi-GRU at time i) is to the key k^j of the entity being tracked; and (2) the "content" term, measuring the similarity between the input and the current state →m^j_{i-1} of the entity tracked by the j-th memory chain.
Normalisation. We normalise each memory representation after every update:

$\overrightarrow{m}^j_i = \overrightarrow{\hat{m}}^j_i / \|\overrightarrow{\hat{m}}^j_i\|$

In doing so, we allow the model to forget: since →m^j_i is constrained to lie on the surface of the unit sphere, adding the new information carried by the update candidate and then normalising inevitably causes forgetting, in that the cosine similarity between the original and updated memory decreases.
Bi-directionality. In contrast to EntNet, we apply the above steps in both directions, scanning a story both forward and backward, with arrows denoting the processing direction. ←m^j_i is computed analogously to →m^j_i but with a different set of parameters, and m^j_i = →m^j_i + ←m^j_i. We further define g^j_i, used below for semantic supervision, as the average of the gate values of both directions at time i for the j-th chain:

$g^j_i = \frac{1}{2}(\overrightarrow{g}^j_i + \overleftarrow{g}^j_i)$ (4)
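The per-chain update for one direction (candidate, gate, gated addition, renormalisation) can be sketched as follows. The toy dimensions, random parameter values, and the fixed PReLU slope are our own illustrative assumptions; in the model proper, the PReLU slope and all matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, a=0.25):                      # parametric ReLU, fixed slope here
    return np.where(x > 0, x, a * x)

U, V, W = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
k = rng.normal(size=d)                     # key embedding for this chain
m = rng.normal(size=d)
m /= np.linalg.norm(m)                     # current memory, on the unit sphere

def update(m, h):
    m_cand = prelu(U @ m + V @ k + W @ h)  # update candidate
    g = sigmoid(h @ k + h @ m)             # gate: location + content terms
    m_new = m + g * m_cand                 # gated update (unnormalised)
    return m_new / np.linalg.norm(m_new), g  # renormalise to the unit sphere

h = rng.normal(size=d)                     # Bi-GRU output at time i
m_next, gate = update(m, h)
```

Running this in both directions with separate parameters, then summing the memories and averaging the two gate values, gives the bi-directional quantities described above.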
Final classifier. The final prediction ŷ for an ending option given its context story is performed by incorporating the last states of all memory chains in the form of a weighted sum u:

$a^j = \mathrm{softmax}_j\big((m^j_T)^\top W_{att}\, o\big)$ (5)

$u = \sum_j a^j m^j_T$ (6)

where T denotes the total number of words in a story, o is the ending option representation, and W_att is a trainable weight matrix. We then transform u to get the final prediction:

$\hat{y} = \sigma\big(R^\top \phi(H u + o)\big)$ (7)

where R and H are trainable weight matrices.
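A toy sketch of this attention-and-score step follows, under the assumptions that the attention weights form a softmax over chains and the final score is squashed with a sigmoid; a plain ReLU stands in for the parametric ReLU, and all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_chains = 8, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m_T = rng.normal(size=(n_chains, d))   # last state of each memory chain
o = rng.normal(size=d)                 # ending-option representation
W_att = rng.normal(0, 0.1, (d, d))
H = rng.normal(0, 0.1, (d, d))
R = rng.normal(0, 0.1, d)              # maps the hidden layer to one score

att = softmax(m_T @ (W_att @ o))       # one attention weight per chain
u = att @ m_T                          # weighted sum of last memory states
score = sigmoid(R @ np.maximum(H @ u + o, 0))  # scalar ending score
```

At test time, the ending option with the higher score is returned as the prediction.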

Semantically Motivated Memory Chains
In order to encourage each chain to focus on a particular aspect, we supervise the activation of each update gate (Equation 2) with binary labels. Intuitively, for the sentiment chain, a sentiment-bearing word (e.g. amazing) receives a label of 1, allowing the model to open the gate and therefore change the internal state relevant to this aspect, while a neutral one (e.g. school) should not trigger the activation, and is assigned a label of 0. To achieve this, we use three memory chains to independently track each of: (1) event sequence, (2) sentiment trajectory, and (3) topical consistency.
To supervise the memory update gates of these chains, we design three sequences of binary labels l^j = {l^j_1, l^j_2, ..., l^j_T} for j ∈ [1, 3], representing event, sentiment, and topic, with l^j_i ∈ {0, 1}. The label at time i for the j-th aspect is assigned a value of 1 only if the word is a trigger for that particular aspect (l^j_i = 1); otherwise l^j_i = 0. During training, we utilise these sequences of binary labels to supervise the memory update gate activations of each chain. Specifically, each chain is encouraged to open its gate only when it sees a trigger bearing information semantically sensitive to that particular aspect. Note that at test time, we do not apply such supervision. This effectively becomes a binary tagging task for each memory chain independently, and results in individual memory chains being sensitive to only a specific set of triggers bearing one of the three types of signals. While largely inspired by Chaturvedi et al. (2017), our approach differs in how we extract and use such features. Also note that, while still making use of external linguistic resources to detect trigger words, our approach lets the memory chains decide how such words should influence the final prediction, as opposed to the handcrafted conditional probability features of Chaturvedi et al. (2017).
To identify the trigger words, we use external linguistic tools, and mark trigger words for each aspect with a label of 1. An example is presented in Table 1, noting that the same word can act as a trigger for multiple aspects.
Event sequence. We parse each sentence into its FrameNet representation with SEMAFOR (Das et al., 2010), and identify each frame target (word or phrase tokens evoking a frame).

Sentiment trajectory. Following Chaturvedi et al. (2017), we utilise a pre-compiled list of sentiment words (Liu et al., 2005). To take negation into account, we parse each sentence with the Stanford CoreNLP dependency parser (Chen and Manning, 2014) and include negation words as trigger words.
Topical consistency. We process each sentence with the Stanford CoreNLP POS tagger and identify nouns and verbs, following Chaturvedi et al. (2017).
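The label-generation scheme above can be sketched as follows. The real labels come from SEMAFOR and Stanford CoreNLP; here, tiny hand-written lexicons and a POS set stand in for those tools, purely for illustration.

```python
# Hypothetical stand-in lexicons; in the paper these come from SEMAFOR
# frame targets, a sentiment word list plus negation words, and POS tags.
SENTIMENT = {"loved", "unfortunately", "happy", "not"}   # incl. negation
EVENT = {"matched", "gained", "became", "went"}          # frame targets
TOPIC_POS = {"NOUN", "VERB"}                             # nouns and verbs

def label_sentence(tokens, pos_tags):
    """Return one binary label sequence per aspect for a tokenised sentence."""
    labels = {"event": [], "sentiment": [], "topic": []}
    for tok, pos in zip(tokens, pos_tags):
        w = tok.lower()
        labels["event"].append(int(w in EVENT))
        labels["sentiment"].append(int(w in SENTIMENT))
        labels["topic"].append(int(pos in TOPIC_POS))
    return labels

toks = ["Sam", "loved", "his", "old", "belt"]
tags = ["PROPN", "VERB", "PRON", "ADJ", "NOUN"]
labs = label_sentence(toks, tags)
```

Note how "loved" triggers both the sentiment and topic chains, illustrating that one word can act as a trigger for multiple aspects.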

Training Loss
In addition to the cross-entropy loss of the final prediction of right/wrong endings, we also take into account the memory update gate supervision of each chain by adding a second term. More formally, the model is trained to minimise the loss:

$\mathcal{L} = \mathrm{CE}(y, \hat{y}) + \alpha \sum_{j=1}^{3} \sum_{i=1}^{T} \mathrm{CE}(l^j_i, g^j_i)$

where CE denotes cross entropy, ŷ and g^j_i are defined in Equations 7 and 4 respectively, y is the gold label for the current ending option o, and l^j_i is the semantic supervision binary label at time i for the j-th memory chain. In our experiments, we empirically set α to 0.5 based on development data.
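The combined objective can be sketched numerically as below, with made-up gate activations and labels; the binary cross-entropy helper and the toy values are our own, not the paper's.

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross entropy, element-wise; clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(y, y_hat, gates, labels, alpha=0.5):
    # gates, labels: shape (n_supervised_chains, T); summed supervision term
    supervision = bce(labels, gates).sum()
    return bce(y, y_hat) + alpha * supervision

gates = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])  # averaged gate g_i^j
labels = np.array([[1, 0], [0, 1], [1, 0]])             # trigger labels l_i^j
loss = total_loss(1.0, 0.8, gates, labels)              # y = 1, y_hat = 0.8
```

Because the supervision term is non-negative, the total loss is always at least the ending-prediction cross entropy; at test time only the first term's prediction ŷ is used.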

Experimental Setup
Dataset. To test the effectiveness of our model, we employ the Story Cloze Test dataset of Mostafazadeh et al. (2016). The development and test sets each consist of 1,871 4-sentence stories, each with a pair of ending options. Consistent with previous studies, we split the development set into a training and validation set (for early stopping), resulting in 1,683 and 188 stories, resp. Note that while most current approaches make use of the much larger training set, comprising 100K 5-sentence ROC stories (with coherent endings only, also released as part of the dataset), to train a language model, we make no use of this data.
Model configuration. We initialise our model with word2vec embeddings (300-D, pre-trained on the 100B-word Google News corpus, not updated during training: Mikolov et al. (2013a,b)). In addition to the three supervised chains, we also add a "free" chain, unconstrained to any semantic aspect.
Training is carried out over 200 epochs with the FTRL optimiser (McMahan et al., 2013), a batch size of 128, and a learning rate of 0.1. We use the following hyper-parameters for weight matrices in both directions: R ∈ R^{300×1}; H, U, V, W are all of size R^{300×300}; and the hidden size of the Bi-GRU is 300. Dropout is applied to the output of φ in the final classifier (Equation 7) with a rate of 0.2. Moreover, we employ the technique introduced by Gal and Ghahramani (2016), where the same dropout mask is applied at every step to the input w_i to the Bi-GRU and the input h_i to the memory chains, with rates of 0.5 and 0.2 respectively. Lastly, to curb overfitting, we regularise the last layer (Equation 7) with an L2 penalty on its weights, λ‖R‖², where λ = 0.001.
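The Gal and Ghahramani-style dropout mentioned above can be sketched as follows: one mask is sampled per sequence and reused at every time step, rather than resampled each step. Dimensions and values here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

def locked_dropout_mask(dim, rate, rng):
    """Sample one inverted-dropout mask, to be reused across all time steps."""
    keep = 1.0 - rate
    return rng.binomial(1, keep, size=dim) / keep  # scale kept units by 1/keep

d, T, rate = 8, 5, 0.5
mask = locked_dropout_mask(d, rate, rng)           # sampled once per sequence...
inputs = [rng.normal(size=d) for _ in range(T)]    # toy per-step input vectors
dropped = [w * mask for w in inputs]               # ...applied at every step
```

Reusing the mask means the same input dimensions are zeroed at every time step, which is the key difference from naive per-step dropout in recurrent models.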
Evaluation. Following previous studies, we evaluate performance in terms of coherent ending prediction accuracy: #correct / #total. Here, we report the average accuracy over 5 runs with different random seeds. Echoing Melis et al. (2018), we also observe some variance in model performance even when training is carried out with the same random seed, largely due to the non-deterministic ordering of floating-point operations in our environment (TensorFlow (Abadi et al., 2015) with a single GPU). Therefore, to account for this randomness, we further train our model 5 times for each random seed and select the model with the best validation performance.
We benchmark against a collection of strong baselines, including the top-3 systems of the 2017 LSDSem workshop shared task (Mostafazadeh et al., 2017): MSAP (Schwartz et al., 2017), HCM (Chaturvedi et al., 2017), and TBMIHAYLOV (Mihaylov and Frank, 2017). The first two primarily rely on a simple logistic regression classifier, both taking linguistic features and probability features from a ROC Stories domain-specific neural language model. TBMIHAYLOV is LSTM-based; we also include DSSM (Mostafazadeh et al., 2016). We additionally implement a bi-directional EntNet (Henaff et al., 2017) with the same hyper-parameter settings as our model but no semantic supervision.

Results and Discussion
The experimental results are shown in Table 2. Our model outperforms all baselines, setting a new state of the art. Note that this is achieved without any external linguistic resources at test time or domain-specific language model-based probability features, highlighting the effectiveness of the proposed model.
EntNet vs. our model. We see clear improvements of the proposed method over EntNet, an absolute gain of 0.9% in accuracy. This validates the hypothesis that encouraging each memory chain to focus on a unique, well-defined aspect is beneficial.
Discussion. To better understand the benefits of the proposed method, we visualise the learned word representations (the Bi-GRU outputs h_i) and keys for both EntNet and our model in Figure 3. With EntNet, while sentiment words form a loose cluster, words bearing event and topic signal are placed in close proximity and are largely indistinguishable. With our model, on the other hand, the degree of separation is much clearer. The intersection of a small portion of the event and topic groups is largely due to the fact that both aspects include verbs. Another interesting perspective is the location of the automatically learned keys (displayed as diamonds): while all the keys with EntNet end up overlapping, indicating little difference among them, the keys learned by our method demonstrate semantic diversity, with each placed within its respective cluster. Note that the free key is learned to complement the event key, a difficult challenge given the two disjoint clusters.
Ablation study. We further perform an ablation study to analyse the utility of each component. The uni-directional variant of our model, which brings performance down to 77.8, is inferior to its bi-directional cousin. Removing all semantic supervision (resulting in 4 free memory chains) is also damaging to the accuracy (dropping to 77.6).
Among the different types of semantic supervision, we see varying degrees of performance degradation, with removing sentiment trajectory being the most detrimental, reflecting its value to the task. Interestingly, we observe an improvement when removing event sequence supervision. We suspect that this is mainly due to noise introduced by the rather inaccurate FrameNet representation output of SEMAFOR (F1 = 61.4% in frame identification, as reported in Das et al. (2010)). While replacing SEMAFOR with the SemLM approach (with the extended frame definition to include explicit discourse markers, e.g. but) of Peng and Roth (2016) and Chaturvedi et al. (2017) may potentially reduce the amount of noise, we leave this exercise for future work.

Conclusion
In this paper, we have proposed a novel model for tracking various semantic aspects with external memory chains. While requiring less domain-specific training data, our model achieves state-of-the-art performance on the task of ROC Story Cloze ending prediction, beating a collection of strong baselines.