Is Killed More Significant than Fled? A Contextual Model for Salient Event Detection

Identifying the key events in a document is critical to holistically understanding its important information. Although measuring the salience of events is highly contextual, most previous work has used a limited representation of events that omits essential information. In this work, we propose a highly contextual model of event salience that uses a rich representation of events, incorporates document-level information and allows for interactions between latent event encodings. Our experimental results on an event salience dataset demonstrate that our model improves over previous work by an absolute 2-4% on standard metrics, establishing a new state-of-the-art performance for the task. We also propose a new evaluation metric that addresses flaws in previous evaluation methodologies. Finally, we discuss the importance of salient event detection for the downstream task of summarization.


Introduction
Identifying the salient information in a given piece of text is a ubiquitous and important problem in natural language understanding. While important parts of a text have been identified by attending to entities (Dunietz and Gillick, 2014), elementary discourse units (Xu et al., 2020), or whole sentences (Liu and Lapata, 2019), in this work we choose to model the extraction of important events (Liu et al., 2018; Choubey et al., 2018). Events are the core parts of most sentences: they center around a predicate and include its key arguments, yet they are compact semantic units, and a salient event can efficiently carry a sentence's meaning. Extracting important events has been shown to be central to many downstream tasks, such as summarization (Marujo, 2015), storyline creation (Martin et al., 2018), and question answering (Kociský et al., 2018).
To model the importance of an event, it is critical to understand its context: who is involved, where did it happen, what other events is it related to, and more. For instance, in Figure 1, the difference in salience of the "fled" events in each document is significantly influenced by their arguments ("2,100 Colombians" versus "shopkeepers"). Previous work that identifies salient events has a limited event representation that is unable to capture these important contextual signals (Liu et al., 2018;Choubey et al., 2018). In contrast, the model which we propose in this work ( §3) directly models the context of an event in three different ways as follows.
First, instead of representing an event by only its predicate mention, our representation includes the subject, object, time, and location of the event ( §3.1). Second, we directly incorporate global features into the model that capture hierarchical relations between events, abstract event frames, position information, and more ( §3.2). Third, we use a neural network architecture that includes an inter-event interaction layer, which allows information to be passed between latent event encodings so that events may increase or decrease one another's importance ( §3.3). Our experimental results on a standard event salience dataset (Liu et al., 2018) demonstrate that these contextual signals significantly increase performance ( §4.4). We find that our model performs 2-4% better than previous work, setting a new state-of-the-art performance for the task.

Figure 1: Two sample documents from the New York Times Annotated Corpus with "killed" and "fled" events. Event mentions are in bold and arguments in braces; red and blue indicate whether an event is annotated as salient or not. Some of the events' arguments (e.g., "2,100 Colombians" and "Israeli minister of tourism") elevate the importance of their respective events.

Document 1: "If Colombia is going to be another Vietnam, as everyone keeps saying, then Ecuador is going to become the Cambodia of this war," Maximo Abad Jaramillo, the mayor here, warned. "We are not ready for this war, we don't want to be a part of it, but we are being dragged into the conflict against our will." In December alone, the local police say, {20 people}subject were killed {here}object, 15 of them in clashes among Colombians. As of Dec. 31, nearly {2,100 Colombians}subject had fled {the fighting just across the border}object and registered with the Roman Catholic Church in Lago Agrio.

Document 2: Israeli troops and tanks occupied positions in Jenin on Oct. 18, the day after {Palestinian radicals}subject killed the {Israeli minister of tourism}object {in a Jerusalem hotel}location. Yael Shaluka, who works at a bakery in the market, said she heard one of the gunmen screaming, "Kill them!" As terrified {shopkeepers}subject fled {their stalls}object, the men cut through an aisle of the market, now pursued by policemen and an army reservist.
In addition to proposing a model for salient event detection, we also provide a new evaluation metric for the task ( §4.2). Previous work has evaluated the top-k salient events a model outputs using precision@k and recall@k, where the recall term is normalized by the total number of salient events in the document. However, this metric has some undesirable properties; for example, a perfect model could receive different recall@k scores on different documents, depending on the number of salient events in each. We propose a more interpretable metric, normalized recall@k, that addresses these issues. In addition, our new metric avoids counting coreferent events more than once.
Finally, we discuss the potential impact that modeling salient events could have on the downstream task of extractive summarization ( §5). We find that the sentence-level extractive oracle frequently used to train summarization systems misses a significant portion of the sentences containing important events, a result which suggests that an event-based oracle could provide a better supervision signal. Further, because events are more fine-grained than full sentences, an event-focused model can discard unimportant information in a sentence to generate a more concise summary.
The contributions of this work are three-fold: (1) We propose a contextual model for salient event detection that achieves state-of-the-art results; (2) We provide a sensible and more interpretable metric for evaluating extracted salient events; (3) We demonstrate that extractive summarization systems could potentially gain from modeling important events.

Related Work
Important Information Identification has been a topic of interest in the NLP community since the 1980s, and over the years researchers have defined salience in multiple ways. Mann and Thompson (1988) divide text into nuclei and satellites, with the idea that a satellite is incomprehensible without its nucleus, whereas the text can still be understood after removing some satellites. Upadhyay et al. (2016) identify the events that would have triggered the author to write the article. Choubey et al. (2018) proposed that central events have a large number of coreferential event mentions and that those mentions are spread throughout the document. However, we believe that in realistic documents redundancy is commonly used for rhetorical and other reasons, so it is not necessarily the case that frequently mentioned events convey the main point of the article. We follow proposals from entity salience (Dunietz and Gillick, 2014) and event salience work (Liu et al., 2018) suggesting that salience is difficult to define explicitly but can be learned from observing human summaries: events that appear in the summary are salient.
Event representation has also evolved over time. Chambers and Jurafsky (2008) represented narrative events as pairs of a verb and the grammatical dependency relation between the verb and an entity. Do et al. (2011) included nominal predicates, using the nominal forms of verbs and lexical items under the Event frame in FrameNet (Baker et al., 1998). Further work by Balasubramanian et al. (2013), Pichotta and Mooney (2014), and others incorporated arguments such as propositional objects. However, most of the work on event salience identification has represented events as verbal/nominal event mentions. To address this, we use a more holistic event representation which includes nominal/verbal event mentions, the entities in the subject and object of the predicate, as well as the time and location. A similar representation was used by Peng et al. (2016) to build an event detection and coreference system. From a modeling perspective, most earlier work on event salience (Decker, 1985; Kay and Aylett, 1996) built rule-based systems (e.g., using the presence of the event in the main clause, its voice, etc.). More recent work has focused on capturing coreference relations between events (Choubey et al., 2018) and on automatically capturing salience-specific interactions between discourse units (Liu et al., 2018). However, as mentioned earlier, we believe that events are highly contextual, so we use a more expressive model to obtain contextualized embeddings for events. This additionally helps us capture local inter-event interactions, and, to capture document-level interactions, we design a number of global document-level features.

Figure 2: Model overview with an example document: "Kuwait's Interior Ministry says young Kuwaiti man who fled to Saudi Arabia after terrorist shooting in Kuwait that killed one American and wounded another has confessed to the attack." Token-level embeddings of all constituents are taken from the BERT-encoded document and composed into an event-level embedding; in the classification module, all events attend to (vote for) each other, and the salience score of an event is calculated by accumulating the votes from all events.

Salient Event Extraction
We choose to model salient event extraction as a binary classification task. For every verbal and nominal event mention e_1, ..., e_n in a document, the goal is to predict whether or not each event is salient. In our experimental study we assume that the event mentions are provided.
In the following sections, we describe our model's event representation ( §3.1), the global features which are incorporated into the representation ( §3.2), and the network architecture and inter-event interaction mechanisms ( §3.3).

Event Representation
The standard method for representing an event is to define it based on the span of text that represents its predicate mention in a document. However, this representation is clearly suboptimal because it omits many other important signals in the text that could help determine the importance of an event. For instance, it is difficult to determine whether a "snow" event is important by itself, but knowing that the event took place during the summer increases the rarity of the event, potentially elevating its importance in the document.
Subsequently, we define an event to be a 5-tuple where the items of the tuple correspond to the contiguous spans of text for the event mention, subject, object, time, and location, respectively. In practice, the arguments for each event correspond to the ARG0, ARG1, ARGM-LOC, and ARGM-TMP arguments output by verbal and nominal semantic role labeling systems (He et al., 2017; Khashabi et al., 2018). After the arguments for each event have been extracted, they are combined to form a vector representation for the event, denoted e_i, as follows. We first obtain the BERT embedding (Devlin et al., 2019) for every token in the input document. Then, since each item of the event tuple is a contiguous span of text, we create a fixed-size representation for each argument by encoding the corresponding tokens using a bidirectional LSTM; there is a separate LSTM encoder for each argument type. Finally, the vectors for all of the arguments are concatenated together to form the event encoding e_i.
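As a rough sketch of this composition step, consider the following, with mean pooling standing in for the per-argument BiLSTM encoders and random vectors in place of BERT token embeddings; all names and dimensions here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # illustrative stand-in for BERT's 768-dim token embeddings

def encode_span(token_embs):
    """Stand-in for a per-argument BiLSTM encoder: pool tokens to a fixed vector."""
    if len(token_embs) == 0:           # missing argument -> zero vector
        return np.zeros(DIM)
    return token_embs.mean(axis=0)

def event_encoding(doc_embs, spans):
    """spans maps each role to (start, end) token indices; absent roles stay empty."""
    roles = ["predicate", "subject", "object", "time", "location"]
    parts = [encode_span(doc_embs[slice(*spans.get(role, (0, 0)))]) for role in roles]
    return np.concatenate(parts)        # the event encoding e_i

doc_embs = rng.normal(size=(20, DIM))   # fake token embeddings for a 20-token document
spans = {"predicate": (5, 6), "subject": (2, 5), "object": (6, 9)}  # no time/location
e_i = event_encoding(doc_embs, spans)
print(e_i.shape)                        # one fixed-size vector per event: (5 * DIM,)
```

Missing arguments simply contribute a zero block, so every event yields a vector of the same size regardless of how many arguments the SRL system found.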

Event Augmentation with Global Features
Although BERT embeddings have proven to be highly beneficial for a large number of tasks, they may not encode all of the information that is useful for this task. This is especially true for high-level document features with long-range dependencies. Therefore, we augment the event representation from the previous section with features that leverage the event structure, document-level statistics, event-event relations, and event abstractions. Refer to Table 1(b) for a summary of the statistics of these features on the New York Times (NYT) Annotated Corpus (Sandhaus, 2008). We extract the following features: 1. Parent Score: This feature leverages the hierarchical relations between events in a document. The intuition is that higher-level, more abstract events are relatively more salient. We use the model from Wang et al. (2020) to identify the parent-child relationship between every event pair. An event is called a child of another event if it is a subevent of the parent (e.g., "shooting" may be a subevent/child of an "attack" event). The Parent Score of an event is defined as the number of child events it has.
2. Frame Name: The frame name provides a more abstract understanding of the event than the event trigger. All event triggers (9,716 in total) in the Event Salience corpus (Liu et al., 2018) belong to a total of 569 frames (annotated using Semafor (Das and Smith, 2011)). For instance, Figure 3 (right) shows all event triggers under the frame "Killing." As can be seen from the figure, the frequencies of the events within a frame are very different, whereas their salience in the text is usually the same. Therefore, this feature enables low-frequency events to leverage the understanding from more frequent events under the same frame.
3. Sentence Location: One of the most commonly used features for salience-related tasks (Dunietz and Gillick, 2014; Liu et al., 2018). It is the position of the first sentence containing the event.
4. Event Trigger Frequency: The number of times the event trigger appears in the document. Table 1 shows that salient events are, on average, significantly more frequent than non-salient events.

5. Argument Frequency: We leverage the event structure to design this feature. The Argument Frequency of an event is the maximum number of times any of its arguments (ARG0/1) appears in the document.
6. Named Argument: Since Named Entities tend to be more important than other entities, we add a binary feature representing whether any of the event's arguments is a Named Entity.
After these features have been computed for each event, they are concatenated to the event encoding e i to get the final representation.
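A lightweight sketch of how these per-event features might be computed is shown below. The Frame Name feature is omitted since it requires a frame-semantic parser such as Semafor, and the input schema and the `parent_of` mapping are illustrative assumptions, with the parent relation standing in for the output of the subevent detector:

```python
from collections import Counter

def global_features(events, parent_of):
    """events: list of dicts with 'trigger', 'args', 'sent_idx', 'has_named_arg'.
    parent_of: dict mapping a child event index to its parent event index."""
    trigger_freq = Counter(ev["trigger"] for ev in events)
    arg_freq = Counter(a for ev in events for a in ev["args"])
    n_children = Counter(parent_of.values())
    feats = []
    for i, ev in enumerate(events):
        feats.append({
            "parent_score": n_children[i],                    # number of child events
            "sentence_location": ev["sent_idx"],              # first containing sentence
            "trigger_frequency": trigger_freq[ev["trigger"]],
            "argument_frequency": max((arg_freq[a] for a in ev["args"]), default=0),
            "named_argument": int(ev["has_named_arg"]),
        })
    return feats

events = [
    {"trigger": "attack", "args": ["militants", "city"], "sent_idx": 0, "has_named_arg": False},
    {"trigger": "shooting", "args": ["militants"], "sent_idx": 1, "has_named_arg": True},
    {"trigger": "attack", "args": [], "sent_idx": 3, "has_named_arg": False},
]
feats = global_features(events, parent_of={1: 0})  # "shooting" is a subevent of "attack"
print(feats[0])  # parent_score 1, trigger_frequency 2, argument_frequency 2
```

Each resulting feature dictionary would then be vectorized and concatenated to the corresponding event encoding.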

Inter-Event Interactions
Figure 3: (Left) Example inter-event relations. Yellow lines represent lexical matches between event representations, and brown lines imply a transitive relationship between events. Although Events 1 and 3 have no lexical matches, we are able to infer a relationship between them because both have lexical matches with Event 2. (Right) Event trigger frequencies for the Killing frame on the New York Times Annotated Corpus.

After each event has been encoded into a vector representation e_i, the model needs to make a binary decision about whether or not the event is salient. Our baseline model assigns a label ŷ_i ∈ {0, 1} to event e_i using a sigmoid classification layer:

ŷ_i = σ(W e_i + b)
where W and b are learned parameters. The model is trained using a binary cross-entropy loss and denoted CEE-BASE (Contextual Event Extractor).
Intuitively, it may be beneficial to allow for the event representation vectors to interact with each other, thereby allowing one event to increase or decrease the salience of another event. In our model, this is done by adding additional modules in between the event encoding and classification layer. We experiment with two different methods of inter-event interactions as follows.

Inter-Event Attention Module
The idea behind this classification module is to capture inter-event votes (Gu et al., 2020). Events which receive higher votes from the others will have a larger attention score, with the intuition that supporting events will increase the salience of their corresponding main events while irrelevant and noisy events will be ignored. Specifically, given the representation of an event e_i, key and query vectors are calculated as

k_i = W_k e_i,   q_i = W_q e_i

where W_k and W_q are learned parameters. Then, the attention score a_i of an event is calculated by accumulating the attention (votes) it receives from all events,

a_i = Σ_j [softmax(q_j · [k_1, ..., k_n])]_i

and the salience prediction ŷ_i is obtained by passing the vote-weighted event representation through the sigmoid classification layer. We refer to models that use this attention mechanism as CEE-IEA.
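A minimal numpy sketch of this voting attention follows, with random weights; the exact parameterization and normalization of the module described above may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_k = 4, 6, 3                  # number of events, embedding dim, key/query dim

E = rng.normal(size=(n, d))          # event encodings e_1, ..., e_n
W_k = rng.normal(size=(d, d_k))      # "learned" parameters (random here)
W_q = rng.normal(size=(d, d_k))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=-1, keepdims=True)

K, Q = E @ W_k, E @ W_q              # one key and one query per event
alpha = softmax(Q @ K.T)             # alpha[j, i]: attention event j pays to event i
a = alpha.sum(axis=0)                # a_i accumulates the votes from all events
print(np.round(a, 3))                # the totals sum to n, one softmax row per voter
```

Because each voter distributes a unit of attention over all events, an event that many others attend to ends up with a vote total well above 1, while background events are pushed toward 0.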
Dynamic Memory Module
Since events are highly contextual, a given event forms discourse relations with other events in the document. For instance, in Figure 3 (left), we assume that Events 1 and 2 are related to each other because their arguments have some lexical overlap. Since the same is true for Events 2 and 3, we can infer that Events 1 and 3 might be related even though they share no lexical overlap themselves. To capture such transitive inter-event relations, we repurpose Dynamic Memory Networks (Xiong et al., 2016, DMNs) for our task. DMNs make T passes over the input event vectors, each time refining an episodic memory vector m_i^t for event e_i based on the previous iteration and the other events:

m_i^t = GRU(c_i^t, m_i^{t-1}),   where c_i^t = AttnGRU([e_j]), j = 1, ..., n, j ≠ i

(see Xiong et al. (2016) for details of the AttnGRU function). The multiple passes allow information to flow transitively across events. Finally, the output score is calculated from m_i^T by the sigmoid classification layer. We refer to these models as CEE-DMN.
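The multi-pass refinement can be sketched as follows. To stay self-contained, this replaces the GRU/AttnGRU updates with a simple attention-then-gated-average step, so it illustrates only the transitive information flow, not the exact architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 4, 6, 3                     # events, embedding dim, number of passes

E = rng.normal(size=(n, d))           # event encodings
w = rng.normal(size=(d,)) * 0.1       # scoring vector for attention over other events

def softmax(x):
    ex = np.exp(x - x.max())
    return ex / ex.sum()

m = E.copy()                          # m_i^0 initialized to the event encoding
for t in range(T):
    new_m = np.empty_like(m)
    for i in range(n):
        others = np.delete(m, i, axis=0)
        attn = softmax(others @ w)    # attention over the other events' memories
        c = attn @ others             # context vector c_i^t
        g = 1.0 / (1.0 + np.exp(-c @ w))  # scalar gate (stand-in for the GRU update)
        new_m[i] = g * c + (1.0 - g) * m[i]
    m = new_m                         # after T passes, m_i mixes in transitive context
print(m.shape)
```

After the first pass each memory contains information from its lexically related neighbors; subsequent passes let that information propagate further, which is what connects Events 1 and 3 through Event 2 in the example above.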

Dataset
For the experimental evaluation of our contextual event salience model (and the subsequent discussion of extractive summarization), we use the New York Times dataset (Sandhaus, 2008), a large corpus of articles published between 1996 and 2007, paired with their summaries. We use the salient event annotations for this dataset provided by Liu et al. (2018). In their work, events in the document are marked as salient based on whether the event mention's lemma is present in the abstractive summary. Due to this annotation procedure, around 18.3% of the instances had no salient events and were subsequently omitted from the evaluation. Refer to Table 1 (left) for details on the dataset preparation.

Evaluation Metric
We evaluate our event salience model on three metrics: Precision@k (P@k), Recall@k (R@k), and Normalized Recall@k (NR@k) for k = 1, 5, 10. Previous work has reported results on P@k and R@k, where:

P@k = (# of salient events in top-k predictions) / k
R@k = (# of salient events in top-k predictions) / (# of salient events in the document)

However, R@k is not a very interpretable metric because: (i) its maximum value varies across documents, and (ii) averaging R@k across documents with different numbers of salient events is biased towards documents with fewer salient events. Consequently, we propose a new metric, NR@k, which gives equal importance to all documents and has a maximum value of 1 for all k, making the metric easier to interpret:

NR@k = (# of salient events in top-k predictions) / min(k, # of salient events in the document)

Additionally, we can see from Table 1 (right) that the average trigger frequency of non-salient events is 2.18, whereas that of salient events is 6.76. So, depending on a model's coreference ability, each true positive can on average be rewarded 6.76 times, whereas each false positive is penalized only 2.18 times. For a fair assessment of a model's ability to identify the top-k events, each unique event should contribute a reward/penalty of one. Therefore, we also calculate P@k and NR@k using only the top-k unique model predictions. For direct comparison with previously reported results, we also include results with the original metrics.
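These metrics can be implemented directly. The sketch below counts unique salient events (so coreferent duplicates in the top k are not rewarded twice) and includes the deduplication step used for the unique-prediction variant; the event identifiers are illustrative strings:

```python
def hits(preds, gold, k):
    """Number of unique salient events among the top-k predictions."""
    return len(set(preds[:k]) & set(gold))

def p_at_k(preds, gold, k):
    return hits(preds, gold, k) / k

def r_at_k(preds, gold, k):
    return hits(preds, gold, k) / len(gold)

def nr_at_k(preds, gold, k):
    return hits(preds, gold, k) / min(k, len(gold))

def unique_top_k(preds, k):
    """Deduplicate ranked predictions (e.g., coreferent mentions) before truncating."""
    seen, out = set(), []
    for e in preds:
        if e not in seen:
            seen.add(e)
            out.append(e)
        if len(out) == k:
            break
    return out

preds = ["killed", "fled", "killed", "warned", "registered"]  # ranked by salience score
gold = {"killed", "fled"}
print(r_at_k(preds, gold, 1))   # 0.5: R@1 cannot reach 1 when the doc has 2 salient events
print(nr_at_k(preds, gold, 1))  # 1.0: NR@k normalizes by min(k, #salient)
```

The contrast between the last two lines is exactly the issue described above: a perfect top-1 prediction scores only 0.5 under R@1 for this document, but 1.0 under NR@1.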

Implementation Details
We use the subword tokenizer from BERT to tokenize the documents and fine-tune the bert-base-uncased version of BERT in all of our settings. Our models are implemented in PyTorch (Paszke et al., 2017). Token-level BERT embeddings of size 768 are passed through separate BiLSTM modules to obtain embeddings of the event mention (of size 512) and of all other constituents (of size 64 each), which together form the final event embedding of size 768. The Frame Name feature from §3.2 is then combined with this event embedding of size 768. Our models are trained for 30k steps on 4 GPUs (TITAN RTX). We evaluated the model after every 750 steps and saved the best checkpoint based on the validation loss. The test results are reported by evaluating the test set on this checkpoint.

Results
We compare our model to three baseline models: LOCATION (which selects the first k events in the document), FREQUENCY (which selects the k events that appear most frequently in the document), and the Kernel Centrality Estimation (KCE) model from Liu et al. (2018). KCE models relationships between events using K Gaussian kernels and adds Sentence Location and Frequency features along with three others that capture similarity with entities and other events in the document. The main results on the event salience task are summarized in Table 2.
First, among the baseline models, KCE consistently performs the best across all evaluation metrics, with an absolute 10-point improvement over LOCATION and FREQUENCY on both P@1 and NR@1.
Then, we compare the results of our base model with the two attention-based variants, without including any global features from the document. The dynamic memory network CEE-DMN provides a 0.41-point improvement in P@1 and NR@1. The inter-event attention module CEE-IEA provides an even larger improvement of 1.27 points. The stronger performance of the CEE-IEA model shows that the voting attention module helps promote the salient events and suppress the noisy and background events.
The addition of the global features in Table 2 on top of the better-performing inter-event attention model provides the largest improvement yet over the baseline models. Specifically, it improves over KCE by 3.89% P@1. The improvement is consistent even at higher values of k, with a 1.38% improvement in P@10 and 2.48% in NR@10. This result suggests that the global features are indeed critical for identifying salient events and that neither the event representation nor the inter-event interactions capture the same information that the features do.
Then, in Table 3, we present the results of our best-performing model and the baseline models using the original metric formulation (without the normalization or uniqueness). The improvement of our model over the baseline models is similarly consistent at all values of k. Altogether, the results from our experiments demonstrate that the CEE-IEA model with global features performs better at extracting salient events and sets a new state-of-the-art performance for this task.

Ablation: Feature Contribution
Due to the significant improvement in the model's performance when the global features were added, we conduct an ablation study on the features to understand what contributes most to the gains. Table 4 shows the performance of the CEE-IEA model as different features are successively added to the model. We observe that all of the features contribute positively to both the precision and recall of the models at all values of k. Among the features, the two which provide the largest P@1 improvements are those that have not been used in previous work: Parent Score and Frame Name. This result suggests that providing the model with information that distinguishes high- and low-level events is beneficial for detecting salient events.
As the value of k increases, the benefit of the Sentence Location and Trigger Frequency features increases, showing that these features help separate salient events from the rest, whereas the Parent Score and Frame Name features further help recognize the most salient events among them.

Case Study: Extractive Summarization
In this section, we discuss the potential benefits of modeling event salience for the downstream task of extractive summarization. Ideally, a good summary captures information about the key events in the original document. However, most common models for extractive summarization operate at the more coarse-grained sentence level. We discuss below why we believe that modeling event importance rather than sentence importance has the potential to benefit summarization.
An Improved Supervision Signal
First, extractive summarization training oracles often miss important signals in the data. Extractive summarization systems are trained on 0/1 labels assigned to each sentence in the document, indicating whether or not that sentence should be selected for the summary. The most common procedure for obtaining these labels is to greedily select document sentences for as long as the ROUGE (Lin, 2004) score between the selected sentences and a reference summary still increases (Nallapati et al., 2017).
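The greedy labeling procedure can be sketched as follows, with a toy unigram-overlap F1 standing in for ROUGE (the real procedure uses ROUGE proper, and all names here are illustrative):

```python
from collections import Counter

def overlap_f1(selected_tokens, ref_tokens):
    """Toy unigram-overlap F1, a stand-in for ROUGE-1."""
    s, r = Counter(selected_tokens), Counter(ref_tokens)
    match = sum((s & r).values())
    if match == 0:
        return 0.0
    prec, rec = match / sum(s.values()), match / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def greedy_oracle(sentences, reference, max_sents=3):
    """Greedily add the sentence that most improves the score; stop when none helps."""
    selected, best = [], 0.0
    ref = reference.split()
    while len(selected) < max_sents:
        candidates = [i for i in range(len(sentences)) if i not in selected]
        if not candidates:
            break
        gains = []
        for i in candidates:
            toks = [t for j in selected + [i] for t in sentences[j].split()]
            gains.append((overlap_f1(toks, ref), i))
        score, i = max(gains)
        if score <= best:       # stop when the score no longer increases
            break
        best, selected = score, selected + [i]
    return sorted(selected)

sentences = ["the troops fled the city",
             "markets opened today",
             "rebels killed two soldiers"]
labels = greedy_oracle(sentences, "troops fled as rebels killed soldiers")
print(labels)  # [0, 2]: the two event-bearing sentences are selected
```

Because the stopping criterion is a surface-overlap score, a sentence whose events appear in the summary can still be skipped if adding it does not raise the score, which is the gap the event-based oracle analysis below quantifies.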
We compared this labeling procedure to an alternative method based on events. For each event in the summary, the event-based oracle selects the sentence containing the first occurrence of that event in the document. For our analysis, the sentences selected by the ROUGE-based and event-based methods are divided into three sets: B, sentences chosen by both; R, sentences chosen by ROUGE only; and E, sentences chosen by the event-based method only. The relative sentence coverage of each method is then

E_Cov = |B| / (|B| + |R|),   R_Cov = |B| / (|B| + |E|)

Figure 4: (Left) Predicted salience scores of all events from our event-based model, shown as subscripts. In contrast to a sentence-based summarization system, an event-based system provides a much more fine-grained output by predicting the relative importance of each event. (Middle) The disjoint sentence sets E, B, and R, showing relative coverage of 79.8% by the salient event signal and 48.4% by the ROUGE-based signal. (Right) Proportion of salient events in B ∪ R; the light blue area shows that 27.6% of the events from sentences selected by ROUGE are not salient.

We observe that E_Cov = 79.8%, which means that most of the sentences in the ROUGE-based oracle are covered by the event-based summaries. However, R_Cov = 48.4%, which means that the standard supervision signal misses a significant number of events present in the reference summary (all salient events in the remaining 51.6% of sentences; see Figure 4 (middle)). The low value of R_Cov is evidence that ROUGE-based oracles are missing a valuable supervision signal.
Finer-Grained Selection
Since the most popular extractive summarization models are forced to select full sentences for the summary, the model likely selects a lot of unimportant information. For instance, if a document sentence is quite long and covers a lot of information, it is likely that some of the predicate mentions in it are not salient and need not be in the summary. Consequently, including only the semantic units that consist of important events and their arguments may be a better method of identifying text that belongs in the summary. To quantify this, we calculate the number of salient events in the sentences selected by ROUGE (the set B ∪ R) and observe that 27.6% of the events in these selected sentences are not salient (see Figure 4 (right)).
Together, these two points demonstrate that the current method of creating oracles for training summarization systems misses important events that are present in the gold summaries while including a significant number of non-salient events. This underscores that a good summarization system could benefit from our event salience model to select better events for generating summaries.

Conclusion
In this work, we proposed a contextual model for salient event identification. We demonstrated that the three different components of our model (the event representation, global features, and inter-event interactions) combine to produce state-of-the-art results on the New York Times Annotated Corpus. Further, we identified issues with previous evaluation metrics and proposed new, more interpretable evaluation methods. Finally, we discussed how event salience can be helpful for other downstream applications through a case study of extractive summarization.