Modeling Event Salience in Narratives via Barthes’ Cardinal Functions

Events in a narrative differ in salience: some are more important to the story than others. Estimating event salience is useful for tasks such as story generation, and as a tool for text analysis in narratology and folkloristics. To compute event salience without any annotations, we adopt Barthes’ definition of event salience and propose several unsupervised methods that require only a pre-trained language model. Evaluating the proposed methods on folktales with event salience annotation, we show that the proposed methods outperform baseline methods and find fine-tuning a language model on narrative texts is a key factor in improving the proposed methods.


Introduction
Narratives (e.g., folktales, literary short stories) are the representations of a series of events (Abbott, 2008). Events, the essential components of narratives, differ in salience: some are more important to the story than others. Taking Cinderella as an example, The prince falls in love with Cinderella is a salient event; however, Cinderella draws water from a well is not. Estimating event salience is a fundamental task in analyzing and processing narratives, ranging from narrative analysis to automatic story generation (Ouyang and McKeown, 2015;Choubey et al., 2018;Papalampidi et al., 2019).
This study aims to estimate event salience in an unsupervised manner. Manually annotating event salience is prohibitively costly because it requires annotators to deeply understand the notion of event salience in narratology (Finlayson, 2015). In fact, despite a long history of research, very few narrative corpora are annotated with event salience. Thus, it is crucial to develop a method for estimating event salience that does not rely on event salience-annotated corpora.
In order to estimate event salience without annotated data, we adopt the definition of cardinal functions (CFs) introduced by Barthes (1966; 1975), the successor of the Proppian function (Propp, 1928): "cardinal functions are logically essential to the narrative action and cannot be eliminated without destroying its causal-chronological coherence" (Prince, 2003). This definition suggests a simple test for identifying event salience: an event is highly salient if removing it greatly reduces the story's coherence. We adopt this idea for two reasons. First, CFs are commonly used in narrative analysis (Abbott, 2008). Second, the idea of CFs can be directly operationalized without any annotated data. Computing event salience based on the idea of CFs requires measuring the coherence of narrative texts, and recent advances in discourse coherence models offer a solution to this difficulty. To date, a wide variety of discourse coherence models have been proposed (Barzilay and Lapata, 2008; Li and Jurafsky, 2017). See et al. (2019) reported that GPT-2 (Radford et al., 2019), a powerful left-to-right language model (LM), could accurately estimate the coherence of narrative texts and, importantly, could do so without any annotated data. Note that folkloristics and narratology offer another well-known concept of event salience, the motif, defined as "the smallest element in a tale having a power to persist in tradition" (Thompson, 1946); however, CFs are easier to operationalize given accurate discourse coherence models.

Related work
Numerous studies on the salience of text units (e.g., word, sentence) are related to our work. Here, we review two particularly relevant topics. First, the deletion test (Carlson and Marcu, 2001) aims to identify salient discourse segments in rhetorical structure theory (Mann and Thompson, 1987). In the deletion test, annotators check how much discourse coherence is reduced by removing the discourse unit of interest. Notably, Carlson and Marcu (2001) and Barthes (1966) use essentially the same idea, to "remove the textual unit of interest, and see how the whole structure changes," although their tasks are quite different. Second, extractive summarization is the task of identifying salient sentences in documents, which is formally very similar to our task. Despite the various existing approaches to extractive summarization (Mani, 2001; Gambhir and Gupta, 2017), it remains an open problem whether these methods can be directly applied to narrative texts. Extractive summarization conventionally focuses on domains with rigid structures, such as news articles or scientific papers, while narrative texts do not have such rigid structures (Kazantseva and Szpakowicz, 2010).
In the context of narrative processing in NLP, several methods have been proposed to identify particular kinds of salient events: suspenseful events in entertainment stories (Wilmot and Keller, 2020), turning points in a movie script (Papalampidi et al., 2019), and reportable events in personal narratives (Ouyang and McKeown, 2015). In contrast to studies focusing on a specific type of narrative (e.g., movie scripts, personal narratives), our method is potentially applicable to any type of narrative because Barthes' CFs are not specific to those particular kinds of narratives and because our methods require only a pre-trained language model.

Estimating event salience

Task setup
We identify event salience in the simplified setting introduced by Ouyang and McKeown (2015). That is, we estimate a sentence's salience rather than an event's salience; we score each sentence in a narrative according to the degree to which it contains a salient event. This simplification enables us to avoid the difficult subtask of identifying phrases and clauses that express events, while addressing the task of identifying sentences that express salient events. Moreover, this sentence-level identification can be easily applied to narrative processing and narrative analysis. Formally, given a narrative comprising $n$ sentences $S_{1:n} := \{S_1, \ldots, S_n\}$ and the target sentence $S_k \in S_{1:n}$, our goal is to predict the salience score of $S_k$ in $S_{1:n}$, denoted by $\sigma(S_k, S_{1:n}) \in \mathbb{R}$.

Proposed method
Overview In light of Barthes' definition (Section 1), we compute the salience score $\sigma(S_k, S_{1:n})$ as the amount of coherence lost when the events in $S_k$ are deleted from the original narrative $S_{1:n}$ (Figure 1). If a narrative's coherence is greatly reduced when the events in a sentence are removed, the sentence is considered to contain a highly salient event.
To this end, let $\tilde{S}_{1:n} := \{S_{1:k-1}, r(S_k), S_{k+1:n}\}$ be the modified narrative in which all events in $S_k$ are removed from the given narrative $S_{1:n}$, where $r$ is an event removal function introduced in the following paragraph. Let $c(S)$ be the coherence score of a given narrative $S$. Then, the salience score of $S_k$ can be estimated as follows:

$$\sigma(S_k, S_{1:n}) = c(S_{1:n}) - c(\tilde{S}_{1:n})$$

In the following, we describe the details of (i) the event removal function $r$ and (ii) the coherence evaluator $c$.
Removing events in a sentence: r We employ the following three functions r.
1. Sentence Deletion (SD): Removing the entire sentence
2. Verb Anonymization (VA): Replacing all verbs in the sentence with common verbs (e.g., "do", "does", "did") based on the POS tag of each verb
3. Predicate and Arguments Anonymization (PAA): Replacing all verbs with common verbs (as in VA) and their main arguments with an indefinite pronoun (e.g., "someone", "something")
We employ VA and PAA because predicates and their arguments are the main components of commonly used event representations (Chambers and Jurafsky, 2009; Pichotta and Mooney, 2016; Martin et al., 2018; Niklaus et al., 2018).
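As a rough illustration of the removal functions above, the sketch below implements SD and VA over pre-tagged tokens. It assumes Penn Treebank POS tags, and the `GENERIC_VERB` mapping (including the forms chosen for `VBG` and `VBN`) is a hypothetical choice rather than the mapping used in the experiments. PAA is omitted because it additionally needs semantic role annotations.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn Treebank tags are assumed

# Hypothetical mapping from verb tags to a generic verb form.
GENERIC_VERB: Dict[str, str] = {
    "VB": "do", "VBP": "do", "VBZ": "does",
    "VBD": "did", "VBG": "doing", "VBN": "done",
}


def sentence_deletion(sentence: List[Token]) -> List[Token]:
    """SD: remove the entire sentence."""
    return []


def verb_anonymization(sentence: List[Token]) -> List[Token]:
    """VA: replace each verb with a generic verb chosen by its POS tag."""
    return [(GENERIC_VERB.get(tag, word), tag) for word, tag in sentence]


sent = [("Cinderella", "NNP"), ("draws", "VBZ"), ("water", "NN"),
        ("from", "IN"), ("a", "DT"), ("well", "NN"), (".", ".")]
print(" ".join(word for word, _ in verb_anonymization(sent)))
# prints: Cinderella does water from a well .
```

The printed sentence shows the kind of unnatural output that verb replacement can produce.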
Computing narratives' coherence: c Following See et al. (2019), we compute the generation probability of a narrative using a pre-trained language model and regard it as the narrative's coherence score. Importantly, pre-trained LMs allow us to evaluate a narrative's coherence without any annotated data.
Here, the narrative's generation probability is the product of word probabilities, which is influenced by the number of words in the narrative. Thus, following Li and Jurafsky (2017), we estimate the coherence score by the average log-likelihood of all tokens. Moreover, we consider only the sentences after the target sentence, $S_{k+1:n}$, because with a left-to-right LM such as GPT-2, the only sentences whose generation probabilities change when the events in $S_k$ are removed are those in $S_{k+1:n}$. In summary, we estimate the coherence score $c(S_{1:n})$ as follows:

$$c(S_{1:n}) = \frac{1}{|S_{k+1:n}|} \sum_{i=k+1}^{n} \log P(S_i \mid S_{1:i-1})$$

where $|S_{k+1:n}|$ denotes the number of tokens in $S_{k+1:n}$. In order to compute the salience score for the last sentence of each narrative text, we add a special token that indicates the end of a text at the end of each narrative. The proposed methods can then compute the salience score for the last sentence using the generation probability of this special token. See Appendix B for further details.
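The scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `token_logprob` is a hypothetical interface standing in for a left-to-right LM such as GPT-2, `toy_logprob` is a toy replacement used only to make the example runnable, and the `<eot>` string is an assumed name for the end-of-text token.

```python
import math
from typing import Callable, List, Sequence

Sentence = List[str]
# Hypothetical LM interface: log P(token | context).
LogProbFn = Callable[[Sequence[str], str], float]
EOT = "<eot>"  # assumed name for the end-of-text token


def coherence(prefix: List[Sentence], suffix: List[Sentence],
              token_logprob: LogProbFn) -> float:
    """Average token log-likelihood of `suffix` given `prefix`."""
    context = [tok for sent in prefix for tok in sent]
    total, count = 0.0, 0
    for sent in suffix:
        for tok in sent:
            total += token_logprob(context, tok)
            context.append(tok)
            count += 1
    return total / count


def salience(sentences: List[Sentence], k: int,
             remove: Callable[[Sentence], Sentence],
             token_logprob: LogProbFn) -> float:
    """Coherence loss when the events in sentence k (0-based) are removed."""
    before, target = sentences[:k], sentences[k]
    # The end-of-text token is scored only when k is the last sentence.
    after = sentences[k + 1:] or [[EOT]]
    original = coherence(before + [target], after, token_logprob)
    modified = coherence(before + [remove(target)], after, token_logprob)
    return original - modified


def toy_logprob(context: Sequence[str], token: str) -> float:
    """Toy stand-in for an LM: tokens already seen are more likely."""
    return math.log(0.5) if token in context else math.log(0.1)


story = [["the", "prince", "loves", "cinderella"],
         ["cinderella", "marries", "the", "prince"]]
score = salience(story, 0, lambda s: [], toy_logprob)  # SD removal
print(round(score, 3))
```

With a real LM, `token_logprob` would return the model's log-probability of the next token given the preceding tokens; the rest of the computation would stay the same.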

Experiments
In this section, we provide empirical evidence that the proposed methods can evaluate event salience in narratives. Concretely, we applied the proposed method with three event-removal methods on a manually annotated folktale dataset and confirmed their performance.

Experimental setup
Dataset We used the ProppLearner corpus (Finlayson, 2015), which contains 15 Russian folktales. In the corpus, verbs corresponding to Proppian functions (i.e., salient events), the predecessor of CFs, are annotated. Following the task setup, our goal is to detect sentences that contain such verbs, i.e., salient events. The ProppLearner corpus includes POS and semantic role annotations, which are used by VA and PAA. Table 1 shows the statistics of the ProppLearner corpus.

Table 2: MAP scores for the proposed methods and the baseline methods. We report the MAP score for the random baseline method as the average over 10 seeds (standard deviation = 0.015). Values with a dagger mark are statistically significant improvements over the random baseline method, as tested using the Wilcoxon signed-rank test (Wilcoxon, 1945) with p < 0.05. The bold score is the best performance among our proposed methods alone. The bold italic score is the best performance among the combinations of our proposed methods and the TF-IDF baseline method.
Language model (fine-tuning) We used GPT-2 as a pre-trained language model for computing coherence scores. Note that See et al. (2019) reported that GPT-2 outperforms state-of-the-art story generation models in coherence evaluation.
Baselines We compared the proposed methods with the following baseline methods:
• Random baseline: This method assigns a random score in the range [0, 1) to each sentence.
• Sentence position baseline (ascending): This method assigns a score based on the position of each sentence. Here, we assumed that a sentence closer to the story's end has higher salience (Friedland and Allan, 2008).
• Sentence position baseline (descending): This method assigns scores in the opposite order to the sentence position baseline (ascending).
• TF-IDF baseline: This method assigns to each sentence the sum of the TF-IDF values of the words in the sentence.
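To make the TF-IDF baseline concrete, here is a minimal sketch. The exact TF-IDF variant is not specified in this section, so the code makes its own assumptions: each sentence of a story is treated as a document, term frequency is the raw count, and IDF is `log(N / df)` without smoothing. The function name `tfidf_sentence_scores` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def tfidf_sentence_scores(sentences: List[List[str]]) -> List[float]:
    """Sum of TF-IDF values of the words in each sentence.

    Assumption: sentences of one story act as the document collection.
    """
    n = len(sentences)
    # Document frequency: number of sentences containing each word.
    df = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        tf = Counter(sent)
        scores.append(sum(count * math.log(n / df[word])
                          for word, count in tf.items()))
    return scores


toy = [["cinderella", "draws", "water"],
       ["cinderella", "marries", "the", "prince"]]
print(tfidf_sentence_scores(toy))
```

Words that occur in every sentence (here "cinderella") contribute zero, so sentences with rarer words receive higher scores.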
Evaluation metric We cast salience estimation as a ranking problem following Liu et al. (2018), where each method ranks a sentence based on its salience score. We used mean average precision (MAP) as an evaluation metric (Manning et al., 2008). We calculated the average precision for each story and reported their macro average score. Table 2 shows the experimental results. The results show all proposed methods consistently outperform the random baseline method, and the proposed method (SD, ProppLearner) yields the best performance.
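The evaluation metric can be sketched as follows: average precision is computed per story from the ranking induced by the salience scores, and the per-story values are macro-averaged. The function names and the toy scores and labels are illustrative assumptions, and ties are broken by sentence order here, which may differ from the actual evaluation.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of ranking sentences by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(
        stories: List[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Macro average of per-story average precision."""
    return sum(average_precision(s, l) for s, l in stories) / len(stories)


stories = [
    ([0.9, 0.1, 0.5], [1, 0, 1]),  # both salient sentences ranked first
    ([0.1, 0.9], [1, 0]),          # the salient sentence ranked last
]
print(mean_average_precision(stories))  # prints 0.75
```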

Experimental results
Event removal methods We found that SD performed comparably to, or somewhat better than, VA and PAA. We employed VA and PAA with the aim of removing event information from the sentence in a more fine-grained way than SD. However, the experimental results show that these methods do not improve on SD. We suspect that the unnatural sentences produced by the operations in VA and PAA might negatively affect the language model's inference, which indicates room for improvement in how events are removed from a sentence.
Effect of fine-tuning GPT-2 Fine-tuning GPT-2 on the BookCorpus slightly but consistently improved the proposed methods with SD, VA, and PAA. We found that fine-tuning GPT-2 on the ProppLearner corpus (transductive setting) also improved the proposed methods with SD and PAA. In addition, we found that our methods' MAP scores and the LM's perplexity on the ProppLearner corpus were strongly correlated. For SD, VA, and PAA, the Spearman's rank correlation coefficient between the MAP scores in the three LM settings and the LM's perplexity was −1.0, −0.5, and −1.0, respectively. This result suggests that the better the LM fits the evaluation corpus, the better our methods perform.
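As a reading aid for the correlation coefficients above, the sketch below computes Spearman's rank correlation for three data points. The perplexity and MAP values are hypothetical placeholders, not the reported numbers. With only three untied points, the coefficient can take only the values 1, 0.5, −0.5, and −1, so −1.0 means the MAP ordering exactly reverses the perplexity ordering.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical values for three LM settings (not the reported results).
perplexity = [30.0, 25.0, 20.0]
map_a = [0.40, 0.42, 0.45]  # MAP rises as perplexity falls
map_b = [0.40, 0.45, 0.42]  # one pair of settings out of order

print(spearman(perplexity, map_a))  # prints -1.0
print(spearman(perplexity, map_b))  # prints -0.5
```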
Combining the proposed method and the baseline method We performed additional experiments in the same setting by combining each proposed method with the TF-IDF baseline method, which is the best baseline method. We normalized the salience scores of each proposed method and of the TF-IDF baseline method to [0, 1] within each story and then added them to obtain the final salience score. Results are shown in Table 2 as +TF-IDF. In all cases, the combination methods achieved higher MAP scores than either our proposed methods alone or the TF-IDF baseline method alone. The combination of the proposed method (SD, BookCorpus) with the TF-IDF baseline method and the combination of the proposed method (PAA, ProppLearner) with the TF-IDF baseline method achieved the best performance among all methods. The Wilcoxon signed-rank test comparing the best combination method (i.e., the combination of the proposed method (SD, BookCorpus) and the TF-IDF baseline method) with the TF-IDF baseline method resulted in a p-value of 0.21. These results suggest that TF-IDF-based salience cues are complementary to Barthes' CF-based cues and that merging them yields a better measure of event salience. Appendix C shows examples of salience evaluation results on a toy Cinderella story and a qualitative analysis of the behavior of our proposed method.
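The combination step can be sketched as below. The normalization method is not spelled out in this section, so min-max scaling within each story is an assumption, as are the function names and the toy numbers.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Scale the scores of one story to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant scores carry no ranking information
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine(proposed: Sequence[float], tfidf: Sequence[float]) -> List[float]:
    """Sum of the two normalized score lists for the same story."""
    return [a + b for a, b in zip(min_max(proposed), min_max(tfidf))]


# Toy scores for a three-sentence story (illustrative values only).
print(combine([0.0, 5.0, 10.0], [1.0, 1.0, 3.0]))  # prints [0.0, 0.5, 2.0]
```

Normalizing per story keeps either score from dominating the sum merely because of its scale.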

Discussion and future work
One promising direction for improving our proposed methods is to improve the narrative coherence evaluator. For more accurate coherence evaluation, the coherence evaluator needs world knowledge and common sense reasoning skills. Consider the story of Cinderella. To recognize that the absence of the event The prince falls in love with Cinderella reduces coherence, an ideal coherence evaluator needs to recognize that this event has a strong causal relation (in this case, a precondition) with the next event, Cinderella marries the prince. Recently, several techniques have been proposed to provide language models with more world knowledge (Guan et al., 2020) and to enhance the common sense reasoning skills of language models (Mao et al., 2019). Evaluating the coherence of a narrative using these LMs can potentially improve our proposed methods.

Conclusions
Inspired by Barthes' definition of cardinal functions in narratology, we have proposed methods to estimate event salience in a narrative in an unsupervised manner using an LM. In our proposed methods, we remove events from a narrative text and estimate event salience by comparing the coherence score of the original narrative text with that of the event-removed narrative text. Experiments on a folktale dataset have demonstrated that the proposed methods outperform baseline methods and that fine-tuning the LM on narrative texts is an effective way to improve the proposed methods.

A Preliminary experiments
In this section, we preliminarily assess the ability of GPT-2 to evaluate the coherence of texts. The results support the use of GPT-2 as a coherence evaluator in our methods.

A.1 Preliminary experiment 1: Assessing GPT-2 as a coherence evaluator
In this preliminary experiment, we evaluated GPT-2 in a sentence ordering task, which is a common task for evaluating discourse coherence models (Barzilay and Lapata, 2008;Li and Jurafsky, 2017). See et al. (2019) reported that GPT-2 (Radford et al., 2019) better captures narrative text's coherence compared to the state-of-the-art model in story generation. However, See et al. (2019) evaluated GPT-2 in the document ranking task, a slightly different task from sentence ordering task. Thus, we examined GPT-2's ability as a coherence evaluator in the sentence ordering (i.e., the common task for evaluating discourse coherence models) and provided further evidence that GPT-2 could accurately evaluate the coherence of the narrative texts used in our experiments.
Given a pair consisting of an original document and one of its permutations, the task is to assign a higher coherence score to the original one. In the evaluation, GPT-2 predicted that the text with the higher likelihood was more coherent. We report accuracy as the ratio of the model's correct predictions.
For each of the 15 narratives in ProppLearner, we generated 80 random permutations. We thus obtained 1,200 pairs, each consisting of an original narrative text and one of its permutations. Results showed that GPT-2 achieved 100% accuracy. Li and Jurafsky (2017) reported 87.3% accuracy on the same task. Our result supports the validity of using GPT-2's likelihood to compute narratives' coherence.
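The pairwise evaluation protocol above can be sketched as follows. The `score` argument is a hypothetical stand-in for a coherence scorer (in this study, that role is played by GPT-2's likelihood); `toy_score` only rewards sentences that appear in their original adjacent order and exists to make the example runnable. Skipping shuffles that reproduce the original order is a choice made for this sketch.

```python
import random
from typing import Callable, List


def ordering_accuracy(documents: List[List[str]],
                      score: Callable[[List[str]], float],
                      n_perm: int = 80, seed: int = 0) -> float:
    """Fraction of (original, permutation) pairs where the original wins."""
    rng = random.Random(seed)
    correct = total = 0
    for doc in documents:
        for _ in range(n_perm):
            perm = doc[:]
            rng.shuffle(perm)
            if perm == doc:  # skip shuffles that reproduce the original
                continue
            total += 1
            correct += score(doc) > score(perm)
    return correct / total if total else 0.0


def toy_score(doc: List[str]) -> float:
    """Toy scorer: counts adjacent sentence pairs in their original order."""
    ids = [int(sentence[1:]) for sentence in doc]
    return float(sum(b == a + 1 for a, b in zip(ids, ids[1:])))


docs = [[f"s{i}" for i in range(6)]]
print(ordering_accuracy(docs, toy_score))  # prints 1.0
```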
A.2 Discussion: On evaluation of discourse coherence models
In a common task for evaluating a discourse coherence model (e.g., sentence ordering), the model is given an original text and an artificially created incoherent text; it is then required to assign the former a higher coherence score. In our preliminary experiments, we created incoherent texts by shuffling sentences, as is common practice in the sentence ordering task, and See et al. (2019) created an incoherent text by swapping adjacent sentences. These experiments demonstrated that GPT-2 can accurately perform these tasks.
However, as mentioned in Lai and Tetreault (2018), identifying a document's original sentence order is not the same as distinguishing low and high coherence. Merely identifying the correct order of sentences is not sufficient for evaluating coherence models, and we believe that more elaborate evaluation methods are needed. Lai and Tetreault (2018) provide a dataset that addresses this issue, but it does not include texts in the narrative domain.

A.3 Preliminary experiment 2: Sanity check via sentence deletion detection
In this preliminary experiment, we validated whether our method could detect event elimination, as a step prior to identifying event salience. Given a narrative comprising $n$ sentences $S_{1:n} := \{S_1, \ldots, S_n\}$, in which every sentence can be regarded as highly salient, and the target sentence $S_k \in S_{1:n}$, we evaluated whether GPT-2 can detect the sentence's deletion as a reduction in the subsequent story's likelihood: $\sigma(S_k, S_{1:n}) > 0$. If GPT-2 could not do so, our methods would be unlikely to work, because they rely on the subsequent story's likelihood being reduced when the target sentence (to be removed) contains a salient event.
Dataset For this experiment, we need a dataset that allows us to assume that every sentence in a story is highly salient (i.e., removing any sentence would result in a significantly incoherent narrative). We used ROCStories (Mostafazadeh et al., 2016) because it is designed to meet the requirement that each story captures a rich set of causal and temporal common sense relations among daily events. Each story contains five sentences. As the event removing method, we examine SD in this preliminary experiment. We used the 2016 Spring Set (45,495 stories) and the 2017 Winter Set (52,664 stories).
Task Setting We calculated accuracy as the percentage of cases in which sentence deletion was correctly detected as $\sigma(S_k, S_{1:n}) > 0$. Random prediction would result in 50% accuracy.
Result SD with GPT-2 (No fine-tuning) achieved 94% accuracy on both the 2016 Spring Set and the 2017 Winter Set. This result shows that GPT-2 can detect event deletion with SD as a reduction in the subsequent story's likelihood.

B Details of proposed approach
GPT-2, the LM we used for computing coherence scores, has a limit on input length, so we cannot always input the entire narrative text. Thus, in practice, we compute the coherence score $c(S_{1:n})$ as follows:

$$c(S_{1:n}) = \frac{1}{|S_{k+1:n-\beta}|} \sum_{i=k+1}^{n-\beta} \log P(S_i \mid S_{1+\alpha:i-1})$$

where $|S_{k+1:n-\beta}|$ denotes the number of tokens in $S_{k+1:n-\beta}$; $c(\tilde{S}_{1:n})$ is computed in the same way. $P(S_i)$ is computed as the product of the probabilities of its words, and $|S_{i:j}|$ denotes the total number of words in $(S_i, \ldots, S_j)$. $\alpha$ and $\beta$ are thresholds determined by the input length limit of the language model, $L$. We determine $\alpha$ and $\beta$ so that $|S_{k+1:n-\beta}| + |S_{1+\alpha:k}|$ is less than or equal to $L$ and is as large as possible. As mentioned at the end of Section 3.2, we add a special token that indicates the end of a text at the end of each narrative text in order to compute the salience score for the last sentence in each narrative text. The generation probability of this special token is used only when computing the salience score for the last sentence; otherwise, it is ignored.

C Estimating event salience for toy example
Including salient event | Sentence | Salience score
no | S1: Cinderella draws water from a well. |
yes | S2: A fairy godmother appears and provides Cinderella with clothes, a carriage, and a coachman. | 0.309
yes | S3: Cinderella goes to the ball. | 0.214
no | S4: Cinderella greets her stepsisters at the venue, but they do not notice. | −0.014
yes | S5: The prince falls in love with Cinderella. | 0.394
yes | S6: Cinderella marries the prince. | −0.112

Table 3: The behavior of the proposed method (SD, No fine-tuning) in the toy Cinderella story. Our method gives sentences a high salience score if the target sentence contains a salient event.

Table 3 shows the behavior of the proposed method (SD, No fine-tuning) on the toy example, Cinderella. We found that the last sentence tended to have a large variance in its salience score because only one special token in the succeeding story is used for estimating salience.

[Table 4 body not recovered. Rows: sentences S1 to S6 whose likelihood is measured; columns: likelihood difference when deleting each of S1 to S6.]

Table 4: The more detailed behavior of our proposed method (SD, No fine-tuning) in the toy Cinderella story. The value in row i, column j represents the difference in the generation probability of $S_i$ (token-wise likelihoods are averaged within a sentence) before and after $S_j$ is removed from the story. A large value indicates that the removal of $S_j$ greatly reduces the generation probability of $S_i$.
Table 4 shows the more detailed behavior of the proposed method (SD, No fine-tuning) on the Cinderella story. For example, the last row shows that deleting salient sentences (e.g., $S_3$, $S_5$) resulted in a larger decrease in the likelihood of the ending sentence ($S_6$) than deleting less salient sentences (e.g., $S_1$, $S_4$). In addition, the likelihood of $S_3$, Cinderella goes to the ball, dropped more when $S_2$, which is related to $S_3$, was removed than when $S_1$, which is unrelated to $S_3$, was removed.