Modeling Human Mental States with an Entity-based Narrative Graph

Understanding narrative text requires capturing characters’ motivations, goals, and mental states. This paper proposes an Entity-based Narrative Graph (ENG) to model the internal states of characters in a story. We explicitly model entities, their interactions and the context in which they appear, and learn rich representations for them. We experiment with different task-adaptive pre-training objectives, in-domain training, and symbolic inference to capture dependencies between different decisions in the output space. We evaluate our model on two narrative understanding tasks: predicting character mental states, and desire fulfillment, and conduct a qualitative analysis.


Introduction
Understanding narrative text requires modeling the motivations, goals and internal states of the characters described in it. These elements can help explain intentional behavior and capture causal connections between the characters' actions and their goals. While this is straightforward for humans, machine readers often struggle as a correct analysis relies on making long range common-sense inferences over the narrative text. Providing the appropriate narrative representation for making such inferences is therefore a key component. In this paper, we suggest a novel narrative representation model and evaluate it on two narrative understanding tasks, analyzing the characters' mental states and motivations (Abdul-Mageed and Ungar, 2017;Chen et al., 2020), and desire fulfillment (Chaturvedi et al., 2016;Rahimtoroghi et al., 2017).
We follow the observation that narrative understanding requires an expressive representation capturing the context in which events appear and the interactions between characters' states. To clarify, consider the short story in Fig. 1. The desire expression appears early in the story and provides the context explaining the protagonist's actions. Evaluating the fulfilment status of this expression, which tends to appear towards the end of the story, requires models that can reason over the desire expression ("trying something new"), its target ("apples") and the outcome of the protagonist's actions ("it's now her favorite apple dish!"). Capturing the interaction between the motivation underlying the desire expression (in Fig. 1, CURIOSITY) and the emotions (in Fig. 1, ANTICIPATION) likely to be invoked by the motivation can help ensure the consistency of this analysis and improve its quality.
Figure 1: Narrative Example.
Story: Cindy really likes apples. She wanted to try something new with them. She decided to try to make baked apples for the first time. She gathered everything she needed and began cooking. It's now her favorite apple dish!
Annotations: Desire Expression: try something new with them; Motivation (Reiss): Curiosity; Emotion (Plutchik): Joy, Anticipation. Desire Fulfilled! Motivation (Reiss): Independence; Emotion (Plutchik): Joy.
To meet this challenge, we suggest a graph-contextualized representation for entity states. Similar to contextualized word representations (Peters et al., 2018; Devlin et al., 2019), we suggest learning an entity-based representation which captures the narrative it is a part of. For example, in "She decided to try to make baked apples for the first time" the mental state of "she" would be represented differently given a different context, such as a different motivation for the action ("Her mother asked her to make an apple dish for a dinner party"). In this case, the contextualized representation would capture the different emotion associated with it (e.g., FEAR of disappointing her mother). Unlike contextualized word embeddings, entity-based contextualization needs to consider at least two levels of context: local text context and distant event context, which require more complicated modeling techniques to capture event semantics. Moreover, the context of event relationships can spread over a long narrative, exceeding the maximum sequence length of modern contextualized word embedding models such as BERT (Devlin et al., 2019).
In this paper, we propose an Entity-based Narrative Graph (ENG) representation of the text. Unlike other graph-based narrative representations (Lehnert, 1981; Goyal et al., 2010; Elson, 2012) which require intensive human annotation, we design our models around low-cost supervision sources and shift the focus from symbolic graph representations of nuanced information to their learned embedding. In ENG, each node is associated with an entity-event pair, representing an entity mention that is involved in an event. Edges represent observed relations between entities or events. We adapt the definition of event relationships introduced in Lee et al. (2020) to our entity-event scenario. For entity relationships, the CNext relationship connects two coreferent entity nodes. For event relationships, the Next relationship captures the sequential order of events as they appear in the text, and six discourse relation types from the Penn Discourse Tree Bank (PDTB) (Prasad et al., 2007) are used. These include Before, After, Sync., Contrast, Reason and Result. Note that these are extracted in a weakly supervised manner, without expensive human annotations.
To contextualize the entity embeddings over ENG, we apply a Relational Graph Convolution Network (R-GCN) (Schlichtkrull et al., 2018), a relational variant of the Graph Convolution Network architecture (GCN) (Kipf and Welling, 2016). R-GCNs create contextualized node representations by considering the graph structure through graph convolutions and learn a composition function. This architecture allows us to take into account the narrative structure and the different discourse relations connecting the entity-event nodes.
To further enhance our model, we investigate three possible pre-training paradigms: whole-word-masking, node prediction, and link prediction. All of them are constructed by automatically extracting noisy supervision and pre-training on a large-scale corpus. We show that choosing the right pre-training strategy can lead to significant performance enhancements in downstream tasks. For example, automatically extracting sentiment for entities can impact downstream emotion predictions. Finally, we explore the use of a symbolic inference layer to model relationships in the output space, and show that we can obtain additional gains in the downstream tasks that have strong correlation in the output space.
The evaluated downstream tasks include two challenging narrative analysis tasks, predicting characters' psychological states and desire fulfilment (Rahimtoroghi et al., 2017). Results show that our model can outperform competitive transformer-based representations of the narrative text, suggesting that explicitly modeling the relational structure of entities and events is beneficial. Our code and trained models are publicly available.

Related Work
Tracking entities and modeling their properties has proven successful in a wide range of tasks, including language modeling (Ji et al., 2017), question answering (Henaff et al., 2017) and text generation. In an effort to model complex story dynamics in text, a dataset was released for tracking the emotional reactions of characters in stories. In this dataset, each character mention is annotated with three types of mental state descriptors: Maslow's "hierarchy of needs" (Maslow, 1943); Reiss' "basic motives" (Reiss, 2004), which provide a more informative range of motivations; and Plutchik's "wheel of emotions" (Plutchik, 1980), comprising eight basic emotional dimensions (e.g., joy and sadness). The accompanying paper showed that neural models with explicit or latent entity representations achieve promising results on this task. Paul and Frank (2019) approached this task by extracting multi-hop relational paths from ConceptNet, while Gaonkar et al. (2020) leveraged the semantics of the emotional states by embedding their textual descriptions and modeling the correlation between different entity states. Rahimtoroghi et al. (2017) introduced a dataset for the task of desire fulfillment. They identified desire expressions in first-person narratives and annotated their fulfillment status. They showed that models that capture the flow of the narrative perform well on this task.
Representing the narrative flow of stories using graph structures and multi-relational embeddings has been studied in the context of script learning (Li et al., 2018; Lee and Goldwasser, 2019; Lee et al., 2020). In these cases, the nodes represent predicate-centric events, and entity mentions are added as context to the events. In this paper, we use an entity-centric narrative graph, where nodes are defined by entity mentions and their textual context. We encode the textual information in the nodes using pre-trained language models (Devlin et al., 2019; Liu et al., 2019), and the graph structure with a relational graph neural network (Schlichtkrull et al., 2018). To learn the representation, we incorporate a task-adaptive pre-training phase. Gururangan et al. (2020) showed that further specializing large pre-trained language models to domains and tasks within those domains is effective.
3 Entity-based Narrative Graph

Framework Overview
Many NLU applications require understanding entity states in order to make sophisticated inferences (Bosselut et al., 2019), and entity states are closely tied to the events in which the entity is involved. In this work, we propose a learning framework that models entities' internal states and their interactions with other entities' internal states through events. We include task-adaptive pre-training (TAPT) and downstream task training to train an entity-based narrative graph (ENG), a graph neural model designed to capture implicit states and interactions between entities. We extend the narrative graph proposed by Lee et al. (2020), which models event relationships; instead of learning node representations for events, we focus on entity mentions that are involved in events. This change is motivated by the high demand from NLU applications that require understanding the states of entity mentions in order to make sophisticated inferences.
Our framework consists of four main components: Node Encoder, Graph Encoder, Learning Objectives, and Symbolic Inference, outlined in Figure 2. The node encoder is a function used to extract event information about the target entity mention corresponding to the local node representation. The graph encoder uses a graph neural network to contextualize node representations with entity-events in the same document, generating entity-context-aware representations. The learning objectives use this representation for several learning tasks, such as node classification, link prediction, and document classification. Finally, we include a symbolic inference procedure to capture dependencies between output decisions.
We introduce a training pipeline, containing pre-training and downstream training, following recent evidence suggesting that task-adaptive pre-training is potentially useful for many NLU tasks (Gururangan et al., 2020). We experiment with three pre-training setups, including the common whole-word-masking pre-training (Liu et al., 2019), and two newly proposed unsupervised pre-training objectives based on ENG. We then evaluate two downstream tasks: StoryCommonsense and DesireDB (Rahimtoroghi et al., 2017). StoryCommonsense aims at predicting three sets of mental states based on psychological theories (Maslow, 1943; Reiss, 2004; Plutchik, 1980), while DesireDB's goal is to identify whether a target desire is satisfied or not. Solving these tasks requires understanding entities' mental states and their interactions.

Node Encoder
Each node in our graph captures the local context of a specific entity mention (or character mention). The way entity mentions are extracted is tied to the extraction of their edges, which is described in Sec. 3.3. Following Gaonkar et al. (2020), we format the input information to be fed into a pretrained language model. For a given character c and sentence s, the inputs to the node encoder consist of three components (s, ctx(c), L), where s is the sentence in which c appears, ctx(c) is the context of c (all the sentences that the character appears in), and L is a label sentence. The label sentence is an artificial sentence of the form "[entity name] is [label 1], [label 2], ..., [label k]." The k labels correspond to the target labels in the downstream task. For example, in StoryCommonsense, the Plutchik state prediction task has eight labels characterizing human emotions, such as joy, trust, and anger. Gaonkar et al. (2020) show that self-attention is an effective way to let the model take label semantics into account and improve performance. Our best model uses RoBERTa (Liu et al., 2019), a highly-optimized version of BERT (Devlin et al., 2019), to encode nodes. We convert the node input (s, ctx(c), L) to RoBERTa's two-sentence input format by treating s as the first sentence, and the concatenation of ctx(c) and L as the second sentence. After forward propagation, we take the pooled sentence representation (i.e., <s> for RoBERTa, [CLS] for BERT) as the node representation v. This is formulated as $v = f_{\text{roberta}}(s, \mathrm{ctx}(c), L)$.
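The construction of the node input (s, ctx(c), L) can be illustrated with a minimal sketch. It only assembles the two text segments that would be passed to a two-sentence encoder; the function names and the whitespace joining are assumptions made for illustration, not the authors' implementation, and no language model is invoked.

```python
# Hypothetical helpers: names and joining strategy are illustrative assumptions.
PLUTCHIK_LABELS = ["joy", "trust", "fear", "surprise",
                   "sadness", "disgust", "anger", "anticipation"]


def build_label_sentence(entity_name, labels):
    """Form the artificial sentence '[entity name] is [label 1], ..., [label k].'"""
    return f"{entity_name} is {', '.join(labels)}."


def build_node_input(sentence, character_context, entity_name, labels):
    """Return (first_segment, second_segment) for a two-sentence encoder.

    first_segment  : s, the sentence in which the character appears
    second_segment : ctx(c) concatenated with the label sentence L
    """
    label_sentence = build_label_sentence(entity_name, labels)
    second_segment = " ".join(list(character_context) + [label_sentence])
    return sentence, second_segment


story = [
    "Cindy really likes apples.",
    "She wanted to try something new with them.",
    "She decided to try to make baked apples for the first time.",
]
first, second = build_node_input(story[1], story, "Cindy", PLUTCHIK_LABELS)
```

In practice the two segments would be handed to a tokenizer that supports sentence pairs, which inserts the model-specific separator tokens.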

Graph Encoder
The ENG is defined as $ENG = (V, E)$, where $V$ is the set of encoded nodes in a document and $E$ is the set of edges capturing relationships between nodes. Each edge $e \in E$ is a triplet $(v_1, r, v_2)$, where $v_1, v_2 \in V$ and $r$ is an edge type ($r \in R$). Following Lee et al. (2020), we use eight relation types ($|R| = 8$) that have been shown to be useful for modeling narratives. NEXT denotes that two nodes appear in neighboring sentences. CNEXT expresses the next occurrence of a specific entity following its co-reference chain. Six discourse relation types, used by Lee et al. (2020) and defined in the Penn Discourse Tree Bank (PDTB) (Prasad et al., 2007), are also used in this work: BEFORE, AFTER, SYNC., CONTRAST, REASON, and RESULT. Their corresponding definitions in PDTB can be found in Table 1. Following Lee et al. (2020), we use the Stanford CoreNLP pipeline (Manning et al., 2014) to obtain co-reference links and dependency trees. We use them as heuristics to extract the above relations and identify entities for TAPT. Details of this procedure can be found in Lee et al. (2020). Note that although we share the same relation definitions, our nodes are defined over entities, instead of events.
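As a rough illustration of the edge triplets, the sketch below derives NEXT and CNEXT edges from a list of entity mentions. The `Mention` record and the extraction heuristic are assumptions for illustration only; the discourse relations are omitted because they require an external discourse parser, and the actual extraction procedure is the one described in Lee et al. (2020).

```python
from collections import defaultdict, namedtuple

# Hypothetical mention record: one graph node per entity mention.
Mention = namedtuple("Mention", ["node_id", "sentence_idx", "entity_id"])


def extract_edges(mentions):
    """Return a list of (source, relation, target) triplets."""
    edges = []
    # NEXT: connect nodes that appear in neighboring sentences.
    for a in mentions:
        for b in mentions:
            if b.sentence_idx == a.sentence_idx + 1:
                edges.append((a.node_id, "NEXT", b.node_id))
    # CNEXT: connect each mention to the next mention in its co-reference chain.
    chains = defaultdict(list)
    for m in sorted(mentions, key=lambda m: (m.sentence_idx, m.node_id)):
        chains[m.entity_id].append(m)
    for chain in chains.values():
        for a, b in zip(chain, chain[1:]):
            edges.append((a.node_id, "CNEXT", b.node_id))
    return edges


mentions = [
    Mention(0, 0, "cindy"),
    Mention(1, 1, "cindy"),
    Mention(2, 1, "apples"),
    Mention(3, 2, "cindy"),
]
edges = extract_edges(mentions)
```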
For encoding the graph, we use a Relational Graph Convolution Network (R-GCN) (Schlichtkrull et al., 2018), an architecture that is capable of modeling typed edges and is resilient to noise. R-GCN is defined as:
$$h_i^{l+1} = \mathrm{ReLU}\left(\sum_{r \in R} \sum_{u \in U_r(v_i)} \frac{1}{z_{i,r}} W_r^{l} h_u^{l}\right) \tag{1}$$
where $h_i^l$ is the hidden representation for the $i$-th node at layer $l$ and $h_i^0 = v_i$ (output of the node encoder); $U_r(v_i)$ represents $v_i$'s neighboring nodes connected by the relation type $r$; $z_{i,r}$ is for normalization; and $W_r^l$ represents trainable parameters. Our implementation of R-GCN propagates messages between entity nodes, emulating the interactions between their psychological states, and thus enriching node representations with context. Note that our framework is flexible, and alternative node and graph encoders could be used.
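A single relational graph convolution step can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the normalizer z_{i,r} is taken to be the number of neighbors under relation r, the non-linearity is ReLU, an edge (u, r, i) is read as "u is a neighbor of i under r", and no self-loop term is included.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(hidden, edges, weights):
    """One message-passing step over typed edges.

    hidden  : list of node vectors, hidden[i] is the state of node i
    edges   : list of (source, relation, target) triplets
    weights : dict mapping relation name to its weight matrix
    """
    out_dim = len(next(iter(weights.values())))
    # Group neighbors by (target node, relation).
    neighbors = {}
    for src, rel, dst in edges:
        neighbors.setdefault((dst, rel), []).append(src)

    updated = []
    for i in range(len(hidden)):
        acc = [0.0] * out_dim
        for rel, w_rel in weights.items():
            nbrs = neighbors.get((i, rel), [])
            if not nbrs:
                continue
            z = len(nbrs)  # assumed normalizer: neighbor count per relation
            for u in nbrs:
                message = matvec(w_rel, hidden[u])
                acc = [a + m / z for a, m in zip(acc, message)]
        updated.append([max(0.0, a) for a in acc])  # ReLU
    return updated


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, "NEXT", 1), (2, "NEXT", 1), (0, "CNEXT", 2)]
weights = {
    "NEXT": [[1.0, 0.0], [0.0, 1.0]],
    "CNEXT": [[2.0, 0.0], [0.0, -1.0]],
}
new_hidden = rgcn_layer(hidden, edges, weights)
```

Stacking two such layers, as in the configuration reported later in the paper, lets information travel across two hops of the narrative graph.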

Output Layers and Learning Objectives
We explore three learning problem types.
Node Classification For node classification, we use the contextualized node embeddings coming from the graph encoder, and plug in a k-layer feed-forward neural network on top (k = 2 in our case). The learning objectives could be either multi-class or multi-label. For multi-class classification, we use the weighted cross-entropy loss (CE). For multi-label classification, we use the binary cross-entropy (BCE) loss for each label:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \alpha_i \, y_i \log S(f(g(x_i))) \tag{2}$$
where $S(.)$ is the Softmax function, $f(.)$ is the graph encoder, $g(.)$ is the node encoder, $x_i$ is the input including the target node $i$ ((s, ctx(c), L)) and all other nodes in the same document (or ENG), $y_i$ is the label, $\alpha_i$ is the example weight based on the label distribution of the training set, and $N$ is the number of training examples.
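The weighted cross-entropy objective can be sketched as below. The inverse-frequency weighting is an assumption chosen for illustration; the text only states that the example weight is based on the label distribution of the training set.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting scheme: rarer classes receive larger weights."""
    n = len(labels)
    counts = [labels.count(c) for c in range(num_classes)]
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(logits_batch, labels, class_weights):
    """Mean of -alpha_i * log p(y_i) over the batch."""
    losses = []
    for logits, y in zip(logits_batch, labels):
        probs = softmax(logits)
        losses.append(-class_weights[y] * math.log(probs[y]))
    return sum(losses) / len(losses)


train_labels = [0, 0, 0, 1]
alpha = inverse_frequency_weights(train_labels, num_classes=2)
loss = weighted_cross_entropy([[2.0, 0.5], [0.1, 1.2]], [0, 1], alpha)
```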

Link Prediction
This objective tries to recover missing links in a given ENG. We sample a small portion of edges (20% in our case) as positive examples, based on the relation type distribution given in Table 1, taken from the training set. To obtain negative examples, we corrupt the positive examples by replacing one component of the edge triplet with a sampled component so that the resulting triplet does not exist in the original graph. For example, given a positive edge $(e_1, r, e_2)$, we can create negative edges $(e_1', r, e_2)$, $(e_1, r', e_2)$, or $(e_1, r, e_2')$. Following Schlichtkrull et al. (2018), we score each edge sample with DistMult (Chang et al., 2014):
$$D(h_i, r, h_j) = h_i^{\top} W_r h_j \tag{3}$$
where $W_r$ is a relation-specific trainable matrix (non-diagonal) and $h_i$ and $h_j$ are node embeddings coming from the graph encoder. A higher score indicates that the edge is more likely to be active. To learn this, we reward positive samples and penalize negative ones, using an adapted CE loss:
$$\mathcal{L} = -\frac{1}{|T|} \sum_{(h_i, r, h_j, y) \in T} \alpha_r \Big[ y \log \sigma\big(D(h_i, r, h_j)\big) + (1 - y) \log\big(1 - \sigma(D(h_i, r, h_j))\big) \Big] \tag{4}$$
where $T$ is the set of sampled edges, $y \in \{0, 1\}$, $\sigma(.)$ is the Sigmoid function, and $\alpha_r$ is the edge type weight, based on the edge sampling rate in Table 1.
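The negative sampling and the bilinear edge score can be sketched together. The function names, the uniform choice of which component to corrupt, and the retry limit are assumptions made for illustration; the score is the bilinear form of the two node embeddings with a relation-specific matrix.

```python
import random


def bilinear_score(h_i, w_r, h_j):
    """Compute h_i^T W_r h_j for two node embeddings and a relation matrix."""
    projected = [sum(w * x for w, x in zip(row, h_j)) for row in w_r]
    return sum(a * b for a, b in zip(h_i, projected))


def corrupt_edge(edge, nodes, relations, existing_edges, rng, max_tries=100):
    """Replace one component of a positive edge so the result is not in the graph."""
    src, rel, dst = edge
    for _ in range(max_tries):
        slot = rng.randrange(3)  # assumed: each component is equally likely
        if slot == 0:
            candidate = (rng.choice(nodes), rel, dst)
        elif slot == 1:
            candidate = (src, rng.choice(relations), dst)
        else:
            candidate = (src, rel, rng.choice(nodes))
        if candidate not in existing_edges:
            return candidate
    raise RuntimeError("could not sample a negative edge")


graph_edges = {(0, "NEXT", 1), (1, "NEXT", 2)}
positive = (0, "NEXT", 1)
negative = corrupt_edge(positive, [0, 1, 2], ["NEXT", "CNEXT"],
                        graph_edges, random.Random(0))
score = bilinear_score([1.0, 2.0], [[0.0, 1.0], [1.0, 0.0]], [3.0, 4.0])
```

Because the positive edge is itself in the graph, any accepted candidate differs from it in exactly one component.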

Document Classification
For document classification tasks, such as DesireDB, we aggregate the node representations from the entire ENG to form a single representation. To leverage the relative importance of each node, we add a self-attention layer on top of the graph nodes. We calculate the attention weights by attending on the query embedding (in DesireDB, this is the sentence embedding for the desire expression).
In the attention computation, $h_i$ is the $i$-th node representation, $h_t$ is the query embedding, $W_a$ and $b_a$ are trainable parameters, and $h_d$ is the final document representation. We then feed $h_d$ to a two-hidden-layer classifier to make predictions. We use the loss function specified in Eq. 2.

Task-Adaptive Pre-training
Recent studies demonstrate that downstream task performance can be improved by performing self-supervised pre-training on the text of the target domain (Gururangan et al., 2020), called Task-Adaptive Pre-Training (TAPT). To investigate whether different TAPT objectives can provide different insights for downstream tasks, we apply three possible pre-training paradigms and compare them on StoryCommonsense. We focus on StoryCommonsense given that the dataset was created by annotating characters' mental states on a subset of RocStories (Mostafazadeh et al., 2016), a corpus with 90K short common-sense stories. This provides us with a large unlabeled resource for investigating different pre-training methods. We run TAPT on all the RocStories text. We use the learning parameters suggested by Gururangan et al. (2020) and explore the following strategies:
Whole-Word Masking: Randomly masks a subset of words and asks the model to recover them from their context (Radford et al., 2019; Liu et al., 2019). We perform this task over RoBERTa, initialized with roberta-base.
ENG Link Prediction: Weakly-supervised TAPT over the ENG. The setup follows Sec. 3.4 (Link Prediction) to learn a model that can recover missing edges in the ENG.
ENG Node Sentiment Classification: Performs weakly-supervised sentiment TAPT. We use the Vader sentiment analysis (Hutto and Gilbert, 2014) tool to annotate the sentiment polarity for each node in the ENG, based on its sentence. The setup follows Sec. 3.4 (Node Classification).

Symbolic Inference
In addition to modeling the narrative structure in the embedding space, we add a symbolic inference procedure to capture structural dependencies in the output space for the StoryCommonsense task. To model these dependencies, we use DRaiL (Pacheco and Goldwasser, 2021), a neural-symbolic framework that allows us to define probabilistic logical rules on top of neural network potentials.
Decisions in DRaiL are modeled using rules, which can be weighted (i.e., soft constraints) or unweighted (i.e., hard constraints). Rules are formatted as Horn clauses: A ⇒ B, where A is a conjunction of observations and predicted values, and B is the output to be predicted. Each weighted rule is associated with a neural architecture, which is used as a scoring function to obtain the rule weight. The collection of rules represents the global decision, and the solution is obtained by performing MAP inference. Given that rules are written as Horn clauses, they can be expressed as linear inequalities corresponding to their disjunctive form, and thus MAP inference is defined as a linear program.
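The translation from a Horn clause to a linear inequality can be made concrete with a small check. A clause a_1 ∧ ... ∧ a_n ⇒ b is equivalent to the disjunction ¬a_1 ∨ ... ∨ ¬a_n ∨ b, which over 0/1 variables becomes (1 - a_1) + ... + (1 - a_n) + b ≥ 1. The sketch below verifies this standard encoding exhaustively; it is a generic illustration and does not use or reproduce the DRaiL API.

```python
from itertools import product


def horn_holds_logically(body, head):
    """A => B with A a conjunction: violated only if all body atoms hold and head fails."""
    return (not all(body)) or head


def horn_holds_linearly(body, head):
    """Disjunctive form as a linear inequality over 0/1 variables."""
    return sum(1 - int(a) for a in body) + int(head) >= 1


def encodings_agree(num_body_atoms):
    """Check both formulations on every truth assignment."""
    for values in product([False, True], repeat=num_body_atoms + 1):
        *body, head = values
        if horn_holds_logically(body, head) != horn_holds_linearly(body, head):
            return False
    return True
```

Since every rule reduces to a constraint of this linear form, the joint decision can be handed to an off-the-shelf linear programming solver.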
In DRaiL, parameters are trained using the structured hinge loss. This way, all neural parameters are updated to optimize the global objective. Additional details can be found in Pacheco and Goldwasser (2021). To score weighted rules, we use feed-forward networks over the node embeddings obtained by the objectives outlined in Sec. 3.4 and 3.5, without back-propagating to the full graph. We model the following rules:
Weighted rules We score each state, as well as state transitions, to capture the progression in a character's mental state throughout the story. In these rules, e_i and e_j are two different mentions of the same character, and HasNext is a relation between consecutive sentences. State can be either Maslow, Reiss or Plutchik.
Unweighted rules There is a dependency between Maslow's "hierarchy of needs" and Reiss' "basic motives". We introduce logical constraints to disallow mismatches in the Maslow and Reiss predictions for a given mention e_i. In addition to this, we model positive and negative sentiment correlations between Plutchik labels. To do this, we group labels into positive (e.g., joy, trust) and negative (e.g., fear, sadness). We refer to this set of rules as inter-label dependencies:
Maslow(e_i, m_i) ∧ ¬Align(m_i, r_i) ⇒ ¬Reiss(e_i, r_i)
Reiss(e_i, r_i) ∧ ¬Align(m_i, r_i) ⇒ ¬Maslow(e_i, m_i)
Plut(e_i, p_i) ∧ Pos(p_i) ∧ ¬Pos(p_j) ⇒ ¬Plut(e_i, p_j)
Given that the DesireDB task requires a single prediction for each narrative graph, we do not employ symbolic inference for this task.

Evaluation
Our evaluation includes two downstream tasks and a qualitative analysis. We report the results for different TAPT schemes and symbolic inference on StoryCommonsense. For the qualitative analysis, we visualize and compare the contextualized graph embeddings and contextualized word embeddings.

Data and Experiment Settings
For TAPT, we use RocStories, as it has a large number of documents (90K after excluding the validation and testing sets) that share the text style of StoryCommonsense. For all tasks, we use the train/dev/test splits used in previous work.
All the RoBERTa models used in this paper are initialized with roberta-base, and the BERT models with bert-base-uncased. The maximum sequence length for the language models is 160. If the input sequence exceeds this number, we keep the label sentence untouched and cut down the main sentence. For large ENGs, such as long narratives in DesireDB, we set the maximum number of nodes to 60; all the hidden layers have 128 hidden units; and the number of layers for R-GCN is 2. For learning parameters in TAPT, we set the batch size to 256 through gradient accumulation; the optimizer is Adam (Kingma and Ba, 2014) with an initial learning rate of 1e-4, ε = 1e-6, β = (0.9, 0.98), weight decay 0.01, and warm-up proportion 0.06. We run TAPT for 100 epochs. For the downstream tasks, we conduct a grid search of Adam's initial learning rate over {2e-3, 2e-4, 2e-5, 2e-6}, with 5000 warm-up steps and an early-stopping patience of 10. Model selection is done on the validation set. We report results for the best model. For learning the potentials for symbolic inference with DRaiL (Pacheco and Goldwasser, 2021), we use local normalization with a learning rate of 1e-3, and represent neural potentials using 2-layer Feed-Forward Networks over the ENG node embeddings. All hidden layers consist of 128 units. The parameters are learned using SGD with a patience of 5, tested against the validation set. For more details, refer to Pacheco and Goldwasser (2021). Note that while it would be possible to back-propagate to the whole graph, this is a computationally expensive procedure. We leave this exploration for future work.
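The truncation policy for over-long inputs can be sketched as follows. This is a simplified illustration: it operates on pre-tokenized lists, ignores the special tokens a real tokenizer would add, and assumes the main text is cut from its end, which the text does not specify.

```python
def truncate_input(main_tokens, label_tokens, max_len=160):
    """Keep the label sentence intact and trim the main text to fit max_len.

    If the label sentence alone exceeds max_len, it is returned unchanged
    (an assumed fallback for this sketch).
    """
    budget = max(0, max_len - len(label_tokens))
    return main_tokens[:budget] + label_tokens


main = ["tok"] * 200
label = ["Cindy", "is", "joy", "."]
truncated = truncate_input(main, label)
```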

Task: StoryCommonsense
StoryCommonsense consists of three subtasks: Maslow, Reiss, and Plutchik, introduced in Sec. 2. Each subtask is a multi-label classification task, where the input is a sentence-character pair in a given story, and the output is a set of mental state labels. Each story was annotated by three annotators and the final labels were determined through a majority vote. For Maslow and Reiss, the vote is count-based, i.e., if two out of three annotators flag a label, then it is an active label. For Plutchik, the vote is rating-based: each label has an annotated rating ranging from 0 to 5, and if the averaged rating is greater than or equal to 2, then it is an active label. This is the set-up given in the original paper. Some papers (Gaonkar et al., 2020) report results using only the count-based majority vote, resulting in scores that are not comparable to ours. Therefore, we re-implement two recent strong models proposed for this task. The Label Correlation model (LC; Gaonkar et al., 2020) uses label semantics as input and models the output space using a learned correlation matrix. The Self-Attention model (SA; Paul and Frank, 2019) uses attention over multi-hop knowledge paths extracted from an external corpus. We evaluate them under the same set of hyper-parameters and model selection strategies as our models.
Table 2: Results for the StoryCommonsense task, including three multi-label tasks (Maslow, Reiss, and Plutchik), for predicting human mental states (motivations or emotions). The star sign indicates that the result is from our re-implemented version of previous baselines.
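The two label aggregation schemes described above can be sketched as follows. The data layout (one label set or one rating dictionary per annotator) and the treatment of a missing rating as 0 are assumptions made for illustration.

```python
def count_based_labels(annotations, min_votes=2):
    """Maslow/Reiss style: a label is active if at least min_votes annotators flag it.

    annotations: list with one set of labels per annotator.
    """
    counts = {}
    for labels in annotations:
        for label in set(labels):
            counts[label] = counts.get(label, 0) + 1
    return {label for label, c in counts.items() if c >= min_votes}


def rating_based_labels(ratings, threshold=2.0):
    """Plutchik style: a label is active if its averaged rating is >= threshold.

    ratings: list with one {label: rating in [0, 5]} dict per annotator;
    a missing label counts as a rating of 0 (assumption).
    """
    all_labels = set().union(*ratings)
    n = len(ratings)
    return {label for label in all_labels
            if sum(r.get(label, 0) for r in ratings) / n >= threshold}


maslow_votes = [{"esteem", "love"}, {"esteem"}, {"love", "stability"}]
plutchik_ratings = [{"joy": 4, "fear": 1}, {"joy": 3}, {"joy": 2, "fear": 1}]
active_maslow = count_based_labels(maslow_votes)
active_plutchik = rating_based_labels(plutchik_ratings)
```

Applying the count-based rule to Plutchik as well would produce a different gold label set, which is why scores obtained under the two schemes are not directly comparable.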
We briefly explain all the baselines, as well as our model variants shown in Table 2. The first group (G1) are the baselines proposed in the task paper. TF-IDF uses TF-IDF features, trained on RocStories, to represent the target sentence s and character context ctx(c), and uses a Feed-Forward Net (FFN) classifier; GloVe encodes the sentences with the pretrained GloVe embeddings and uses an FFN; CNN (Kim, 2014) replaces the FFN with a Convolutional Neural Network; LSTM is a two-layer bi-directional LSTM; REN (Henaff et al., 2017) is a recurrent entity network that learns to encode information for memory cells; and NPN is an REN variant that includes a neural process network.
The second group (G2) of baselines are based on two recent publications-LC and SA-that showed strong performance on this task. We re-implement them and run the evaluation under the same setting as our proposed models. They originally use BERT and ELMo, respectively. To provide a fair comparison, we also train a RoBERTa variant for them (LC-RBERT and SA-RBERT). Note that the original paper of SA (Paul and Frank, 2019) reports an F1 of 59.81 on Maslow and 35.41 on Reiss, while LC (Gaonkar et al., 2020) reports 65.88 on Plutchik. However, these results are not directly comparable to ours. The discrepancy arises mainly from two points: (1) The rating-based voting, described in Sec. 4.2, is not properly applied, and (2) We do not optimize the hyper-parameter search space in our setting, given the relatively expensive pre-training. Our re-implemented versions give a better foundation for a fair comparison.
The third (G3) and fourth (G4) groups are our model variants. ENG is the model without TAPT; ENG+Mask, ENG+Link, and ENG+Sent are the models with Whole-Word-Masking (WM), Link Prediction (LP), and Node Sentiment (NS) TAPT, respectively. In the last group, ENG(Best) + IL and ENG(Best) + IL + ST are based on our best ENG model with TAPT, adding inter-label dependencies (IL) and state transitions (ST) using symbolic inference, described in Sec. 3.6. Table 2 reports all the results. We can see that Group 2 generally performs better than Group 1 on all three subtasks, suggesting that our implementation is reasonable. Even without TAPT, ENG outperforms all baselines, yielding a 2-3% absolute F1-score improvement. With TAPT, the performance is further strengthened. Moreover, we find that different TAPT tasks offer different levels of improvement for each subtask. WM helps the most in Maslow and Plutchik, while LP and NS excel in Reiss and Plutchik, respectively. This suggests that different TAPT objectives embed different information needed for solving each subtask. For example, the ability to add potential edges can be key to motivation reasoning (Reiss), while identifying sentiment polarities (NS) can help in emotion analysis (Plutchik). This observation suggests a direction of connecting different related tasks in a joint pipeline. We leave this for future work.
Lastly, we evaluate the impact of symbolic inference. We perform joint inference over the rules defined in Sec. 3.6. In Table 2, we can see the advantage of modeling these dependencies for predicting Plutchik labels. However, the same is not true for the other two subtasks, where symbolic inference increases recall at the expense of precision, resulting in no F1 improvement. Note that labels for Maslow and Reiss are sparser, accounting for 55% and 42% of the nodes, respectively. In contrast, Plutchik labels are present in 68% of the nodes.

Task: DesireDB
DesireDB (Rahimtoroghi et al., 2017) is the task of predicting whether a desire expression is fulfilled or not, given its prior and posterior context. It requires aggregating information from multiple parts of the document. For example, if a target desire is "I want to be rich", and the character's mental state changes from "sad" to "happy" over the course of the text, we can infer that their desire is likely to be fulfilled.
We use the baseline systems described in Rahimtoroghi et al. (2017), based on SkipThought (ST) and Logistic Regression (LR), with manually engineered lexical and discourse features. We train a stronger baseline by encoding the prior and posterior context, as well as the desire expression, using BERT. Then, we add an attention layer (Eq. 5) for the two contexts over the desire expression. The resulting three representations (the weighted prior and posterior representations, and the desire representation) are then concatenated. For ENG, we add an attention layer over the nodes to form the ENG document representation. We compare BERT and BERT+ENG document representations by feeding each of them into a two-layer FFN for classification, as described in Sec. 3.4 (Document Classification). Table 3 shows the results. The BERT baseline outperforms the other baselines by a large margin, a 4.27% absolute increase in the averaged F1-score. Furthermore, BERT+ENG forms a better document summary for the target desire, yielding a further 3.23% absolute increase in the averaged F1-score. These results illustrate that ENG can be used in various settings for modeling entity information.

Qualitative Analysis
We conduct a qualitative analysis by measuring and visualizing distances between event nodes corresponding to six verbs and their Maslow labels. We project the node embeddings, based on different encoders, to a 2-D space using t-SNE (Maaten and Hinton, 2008). We use shapes to represent verbs and colors to represent labels. In Fig. 3b and 3c, RoBERTa, pretrained with Whole-Word-Masking TAPT, is used. Nodes are word-contextualized, receiving the whole story (W-CTX-STORY) or the target sentence (W-CTX-SENT) as context. In these two cases, event nodes with the same verb (shape) tend to be closer. In Fig. 3a, we use ENG as the encoder to generate graph-contextualized embeddings (ENG-CTX). We observe that nodes with the same label (color) tend to be closer. In all cases, the embedding was trained using only the TAPT tasks, without task-specific data. The ENG embedding is better at capturing entities' mental states, rather than verb information, as the graph structure is entity-driven. Figure 4 makes this point quantitatively. We use 10-fold cross validation and report averaged results. The proximity between verbs and between labels is measured in two ways: cluster purity and KNN classification. For cluster purity (Manning et al., 2008), we cluster the events using K-Means (K = 5), and calculate the averaged cluster purity, defined as follows:
$$\mathrm{Purity}(C, D) = \frac{1}{N} \sum_{c \in C} \max_{d \in D} |c \cap d|$$
where $C$ is the set of clusters, $D$ is either the set of labels or verbs, and $N$ is the total number of events. For the graph contextualization, we can see that the labels have higher cluster purity than the verbs, while for the word contextualization, the verbs have higher cluster purity. This result aligns with our visualization. The KNN classification uses the learned embedding as a distance function. The KNN classifier performs better when classifying labels using the graph-contextualized embeddings, while it performs better using word-contextualized embeddings when classifying verbs. These results demonstrate that ENG can better capture the states of entities.
Table 3: Results for the DesireDB task: identifying if a desire described in the document is fulfilled or not.
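Cluster purity, as used in the analysis above, can be computed with a few lines of Python. The sketch follows the standard definition (each cluster is credited with its most frequent class, and the credits are divided by the number of items); the cluster assignments are taken as given, so the K-Means step itself is not shown.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_ids):
    """Fraction of items that belong to the majority class of their cluster.

    cluster_ids: cluster assignment per item (e.g., from K-Means)
    class_ids  : reference class per item (e.g., Maslow label or verb)
    """
    per_cluster = {}
    for cluster, cls in zip(cluster_ids, class_ids):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


clusters = [0, 0, 0, 1, 1, 1]
labels = ["esteem", "esteem", "love", "love", "love", "love"]
purity = cluster_purity(clusters, labels)
```

Computing the score once against the labels and once against the verbs, for the same cluster assignment, gives the two quantities compared in the analysis.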

Conclusions
We propose an ENG model that captures implicit information about the states of narrative entities using multi-relational graph contextualization. We study three types of weakly-supervised TAPTs for ENG and their impact on the performance of downstream tasks, as well as symbolic inference capturing the interactions between predictions. Our empirical evaluation was done over two narrative analysis tasks. The results show that ENG can outperform other strong baselines, and the contribution of different types of TAPT is task-dependent. In the future, we want to connect different TAPT schemes and downstream tasks, and explore constrained representations.