Automatic Event Salience Identification

Identifying the salience (i.e. importance) of discourse units is an important task in language understanding. While events play important roles in text documents, little research exists on analyzing their saliency status. This paper empirically studies Event Salience and proposes two salience detection models based on discourse relations. The first is a feature based salience model that incorporates cohesion among discourse units. The second is a neural model that captures more complex interactions between discourse units. In our new large-scale event salience corpus, both methods significantly outperform the strong frequency baseline, while our neural model further improves the feature based one by a large margin. Our analyses demonstrate that our neural model captures interesting connections between salience and discourse unit relations (e.g., scripts and frame structures).


Introduction
Automatic extraction of prominent information from text has always been a core problem in language research. While traditional methods mostly concentrate on the word level, researchers have started to analyze higher-level discourse units in text, such as entities (Dunietz and Gillick, 2014) and events (Choubey et al., 2018).
Events are important discourse units that form the backbone of our communication. They play various roles in documents. Some are more central in discourse: connecting other entities and events, or providing key information of a story. Others are less relevant, but not easily identifiable by NLP systems. Hence it is important to be able to quantify the "importance" of events. For example, Figure 1 is a news excerpt describing a debate around a jurisdiction process: "trial" is central as the main topic of discussion, while "war" is not. Researchers are aware of the need to identify central events in applications like detecting salient relations (Zhang et al., 2015) and identifying the climax in a storyline (Vossen and Caselli, 2015). Generally, the salience of discourse units is important for language understanding tasks, such as document analysis (Barzilay and Lapata, 2008), information retrieval (Xiong et al., 2018), and semantic role labeling (Cheng and Erk, 2018). Thus, proper models for finding important events are desired.
In this work, we study the task of event salience detection: finding the events that are most relevant to the main content of a document. To build a salience detection model, one core observation is that salient discourse units form discourse relations with other units. In Figure 1, the "trial" event is connected to many other events: "charge" is pressed before "trial"; "trial" is being "delayed".
We present two salience detection systems based on this observation. The first is a feature-based learning to rank model. Beyond basic features like frequency and discourse location, we design features using cosine similarities among events and entities, to estimate the content organization (Grimes, 1975): how the lexical meanings of elements relate to each other. Similarities from within the sentence and across the whole document are used to capture interactions at both the local and global levels (§4). The model significantly outperforms a strong "Frequency" baseline in our experiments.
However, there are other discourse relations beyond lexical similarity. Figure 1 showcases some: the script relation (Schank and Abelson, 1977) 1 between "charge" and "trial", and the frame relation (Baker et al., 1998) between "attacks" and "trial" ("attacks" fills the "charges" role of "trial"). Since it is unclear which ones contribute more to salience, we design a Kernel-based Centrality Estimation (KCE) model (§5) to capture salience-specific interactions between discourse units automatically.
In KCE, discourse units are projected to embeddings, which are trained end-to-end towards the salience task to capture rich semantic information. A set of soft-count kernels is trained to weigh salience-specific latent relations between discourse units. With the capacity to model richer relations, KCE outperforms the feature-based model by a large margin (§7.1). Our analysis shows that KCE exploits several relations between discourse units, including scripts and frames (Table 5). To further understand the nature of KCE, we conduct an intrusion test (§6.2), which requires a model to identify events from another document. The test shows that salient events form tightly related groups, with relations captured by KCE.
The notion of salience is subjective and may vary from person to person. We follow the empirical approaches used in entity salience research (Dunietz and Gillick, 2014). We consider the summarization test: an event is considered salient if a summary written by a human is likely to include it, since events about the main content are more likely to appear in a summary. This approach allows us to create a large-scale corpus ( §3).
In this paper, we make three main contributions. First, we present two event salience detection systems, which capture rich relations among discourse units. Second, we observe interesting connections between salience and various discourse relations ( §7.1 and Table 5), implying potential research on these areas. Finally, we construct a large scale event salience corpus, providing a testbed for future research. Our code, dataset and models are publicly available 2 .
1 Scripts are prototypical sequences of events: a restaurant script normally contains events like "order", "eat" and "pay".
However, studies on event salience are premature. Some previous work attempts to approximate event salience with word frequency or discourse position (Vossen and Caselli, 2015; Zhang et al., 2015). Parallel to ours, Choubey et al. (2018) propose a task to find the most dominant event in news articles. They draw connections between event coreference and importance, on hundreds of closed-domain documents, using several oracle event attributes. In contrast, our proposed models are fully learned and applied to more general domains and at a larger scale. We also do not restrict ourselves to a single most important event per document.
There is a small but growing line of work on entity salience (Dunietz and Gillick, 2014;Dojchinovski et al., 2016;Xiong et al., 2018;Ponza et al., 2018). In this work, we study the case for events.
Text relations have been studied in tasks like text summarization, which mainly focused on cohesion (Halliday and Hasan, 1976). Grammatical cohesion methods make use of document level structures such as anaphora relations (Baldwin and Morton, 1998) and discourse parse trees (Marcu, 1999). Lexical cohesion based methods focus on repetitions and synonyms on the lexical level (Skorochod'ko, 1971;Morris and Hirst, 1991;Erkan and Radev, 2004). Though sharing similar intuitions, our proposed models are designed to learn richer semantic relations in the embedding space.
Compared to the traditional summarization task, we focus on events, which are at a different granularity. Our experiments also unveil interesting phenomena among events and other discourse units.

The Event Salience Corpus
This section introduces our approach to construct a large-scale event salience corpus, including methods for finding event mentions and obtaining saliency labels. The studies are based on the Annotated New York Times corpus (Sandhaus, 2008), a newswire corpus with expert-written abstracts.

Automatic Corpus Creation
Event Mention Annotation: Despite many annotation attempts on events (Pustejovsky et al., 2002; Brown et al., 2017), automatically labeling them in the general domain remains an open problem. Most of the previous work follows empirical approaches. For example, Chambers and Jurafsky (2008) consider all verbs together with their subject and object as events. Do et al. (2011) additionally include nominal predicates, using the nominal form of verbs and lexical items under the Event frame in FrameNet (Baker et al., 1998).
There are two main challenges in labeling event mentions. First, we need to decide which lexical items are event triggers. Second, we have to disambiguate the word sense to correctly identify events. For example, the word "phone" can refer to an entity (a physical phone) or an event (a phone call event). We use FrameNet to solve these problems. We first use a FrameNet-based parser, Semafor (Das and Smith, 2011), to find triggers and disambiguate them into frame classes. We then use the FrameNet ontology to select event mentions.
Our frame-based selection method follows the Vendler classes (Vendler, 1957), a four-way classification of eventuality: states, activities, accomplishments and achievements. The last three classes involve state change, and are normally considered as events. Following this, we create an "event-evoking frame" list using the following procedure:
1. We keep frames that are subframes of Event and Process in the FrameNet ontology.
2. We discard frames that are subframes of state, entity and attribute frames, such as Entity, Attributes, Locale, etc.
3. We manually inspect frames that are not subframes of the above-mentioned ones (around 200) to keep event related ones (including subframes), such as Arson, Delivery, etc.
This gives us a total of 569 frames. We parse the documents with Semafor and consider predicates that trigger a frame in the list as candidates. We finish the process by removing the light verbs and reporting events from the candidates, similar to previous research (Recasens et al., 2013).
For the articles in the New York Times Annotated Corpus that have abstracts, we extract event mentions. We then label an event mention as salient if we can find its lemma in the corresponding abstract (Mitamura et al. (2015) showed that lemma matching is a strong baseline for event coreference). For example, in Figure 1, event mentions in bold and red are found in the abstract, and are thus labeled as salient. The data split is detailed in Table 1 and §6.
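The lemma-matching rule used for salience labeling can be illustrated with a minimal sketch. It assumes that event mentions and abstract tokens have already been lemmatized; the function name `label_salient` and the toy lemmas are hypothetical and are not taken from the released code.

```python
def label_salient(event_lemmas, abstract_lemmas):
    """Return 1 for each body event whose lemma occurs in the abstract, else 0."""
    abstract = set(abstract_lemmas)
    return [1 if lemma in abstract else 0 for lemma in event_lemmas]


# Hypothetical, already-lemmatized toy input (not real corpus data).
body_events = ["trial", "charge", "delay", "war"]
abstract = ["prosecutor", "seek", "delay", "trial", "charge"]

labels = label_salient(body_events, abstract)
for lemma, label in zip(body_events, labels):
    print(lemma, "salient" if label else "non-salient")
```

Because the rule only compares lemmas, it can miss coreferential mentions that use different words and can over-match unrelated mentions that share a lemma.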

Annotation Quality
While the automatic method enables us to create a dataset at scale, it is important to understand the quality of the dataset. For this purpose, we have conducted two small manual evaluation studies. Our lemma-based salience annotation method is based on the assumption that lemma matching is a strong detector for event coreference. In order to validate this assumption, one of the authors manually examined 10 documents and identified 82 coreferential event mention pairs between the text body and the abstract. The automatic lemma rule identifies 72 such pairs: 64 of these match the human decisions, producing a precision of 88.9% (64/72) and a recall of 78.0% (64/82). The rule misses 18 coreferential pairs.
The next question is: is an event really important if it is mentioned in the abstract? Although prior work (Dunietz and Gillick, 2014) shows the assumption to be valid for entities, we study the case for events. We asked two annotators to manually annotate 10 documents (around 300 events) using a 5-point Likert scale for salience. We compute the agreement score using Cohen's Kappa (Cohen, 1960). We find the task to be challenging for humans: annotators do not agree well on the 5-point scale (Cohen's Kappa = 0.29). However, if we collapse the scale to binary decisions, the Kappa between the annotators rises to 0.67. Further, the Kappa values between each annotator and the automatic labels are 0.49 and 0.42 respectively. These agreement scores are also close to those reported in the entity salience tasks (Dunietz and Gillick, 2014).
While errors inevitably exist in the automatic annotation process, we find the error rate to be reasonable for a large-scale dataset. Further, our study indicates that it is difficult for humans to rate salience on a finer scale. We leave the investigation of continuous salience scores to future work.
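The precision and recall figures above follow directly from the three reported counts; the short check below only re-does that arithmetic (the variable names are ours).

```python
human_pairs = 82  # coreferential pairs identified by the human annotator
rule_pairs = 72   # pairs proposed by the automatic lemma rule
agreed = 64       # rule pairs that match the human decisions

precision = agreed / rule_pairs
recall = agreed / human_pairs
missed = human_pairs - agreed

print(f"precision={precision:.1%} recall={recall:.1%} missed={missed}")
# prints: precision=88.9% recall=78.0% missed=18
```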
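To make the agreement computation concrete, the sketch below implements Cohen's Kappa from its definition and shows how collapsing a 5-point scale to binary labels can raise agreement. The ratings are invented toy values, and the cut-off used in `to_binary` (ratings of 3 or above count as salient) is an assumption, since the text does not state how the scale was collapsed.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


def to_binary(ratings, threshold=3):
    # ASSUMPTION: ratings >= threshold are treated as salient.
    return [int(r >= threshold) for r in ratings]


# Hypothetical Likert ratings from two annotators (toy data).
ann1 = [5, 4, 2, 1, 3, 1, 2, 5]
ann2 = [4, 4, 1, 2, 2, 1, 3, 5]

kappa_5pt = cohens_kappa(ann1, ann2)
kappa_bin = cohens_kappa(to_binary(ann1), to_binary(ann2))
print(round(kappa_5pt, 3), round(kappa_bin, 3))
```

On this toy data the binary Kappa is higher than the 5-point Kappa, mirroring the direction of the effect reported above; the actual values in the study come from the real annotations.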

Feature-Based Event Salience Model
This section presents the feature-based model, including the features and the learning process.

Features
Our features are summarized in Table 2.
Basic Discourse Features: We first use two basic features similar to Dunietz and Gillick (2014): Frequency and Sentence Location. Frequency is the lemma count of the mention's syntactic head word (Manning et al., 2014). Sentence Location is the sentence index of the mention, since the first few sentences are normally more important. These two features are often used to estimate salience (Barzilay and Lapata, 2008; Vossen and Caselli, 2015).
Content Features: We then design several lexical similarity features, to reflect Grimes' content relatedness (Grimes, 1975). In addition to events, the relations between events and entities are also important. For example, Figure 1 shows some related entities in the legal domain, such as "prosecutors" and "court". Ideally, they should help promote the salience status for event "trial".
Lexical relations can be found both within a sentence (local) and across sentences (global) (Halliday and Hasan, 1976). We compute the local part by averaging similarity scores from other units in the same sentence. The global part is computed by averaging similarity scores from other units in the document. All similarity scores are computed using cosine similarities on pre-trained embeddings (Mikolov et al., 2013).
These lead to 3 content features: Event Voting, the average similarity to other events in the document; Entity Voting, the average similarity to entities in the document; Local Entity Voting, the average similarity to entities in the same sentence. Local event voting is not used since a sentence often contains only 1 event.
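The three voting features can be sketched as averages of cosine similarities. The 2-dimensional vectors below are toy stand-ins for the pre-trained embeddings, and the helper names `cosine` and `voting` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def voting(target, others):
    """Average cosine similarity between `target` and every vector in `others`."""
    if not others:
        return 0.0
    return sum(cosine(target, o) for o in others) / len(others)


# Toy embeddings (hypothetical values, for illustration only).
events = {"trial": [1.0, 0.0], "charge": [1.0, 1.0], "war": [0.0, 1.0]}
doc_entities = {"court": [1.0, 0.0], "prosecutors": [1.0, 1.0]}
sent_entities = {"court": [1.0, 0.0]}  # entities in the same sentence as "trial"

target = events["trial"]
event_voting = voting(target, [v for k, v in events.items() if k != "trial"])
entity_voting = voting(target, list(doc_entities.values()))
local_entity_voting = voting(target, list(sent_entities.values()))
print(event_voting, entity_voting, local_entity_voting)
```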

Model
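A Learning to Rank (LeToR) model (Liu, 2009) is used to combine the features. Let $ev_i$ denote the $i$-th event in a document $d$. Its salience score is computed as:
$$f(ev_i, d) = W_f \cdot F(ev_i, d) + b \tag{1}$$
where $F(ev_i, d)$ is the feature vector of $ev_i$ in $d$ (Table 2); $W_f$ and $b$ are the parameters to learn. The model is trained with a pairwise loss:
$$\sum_{ev^+, ev^- \in d} \max\big(0,\ 1 - f(ev^+, d) + f(ev^-, d)\big), \quad \text{w.r.t. } y(ev^+, d) = +1,\ y(ev^-, d) = -1 \tag{2}$$
where $ev^+$ and $ev^-$ represent the salient and non-salient events; $y$ is the gold standard function. Learning can be done by standard gradient methods.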
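A small sketch of the linear scoring function and a pairwise ranking loss is given below. It assumes a hinge loss with margin 1 over all (salient, non-salient) pairs in a document; the feature values and weights are invented, and the function names are hypothetical.

```python
def score(features, w, b):
    """Linear salience score: dot product of weights and features, plus bias."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b


def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Sum of hinge losses over all (salient, non-salient) event pairs.

    ASSUMPTION: margin-1 hinge loss; labels are 1 (salient) or 0 (non-salient).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(max(0.0, margin - sp + sn) for sp in pos for sn in neg)


# Toy document with three events; features are [frequency, sentence location].
feats = [[3.0, 0.0], [1.0, 2.0], [1.0, 5.0]]
labels = [1, 0, 0]
w, b = [0.2, -0.1], 0.0  # hypothetical parameter values

scores = [score(f, w, b) for f in feats]
loss = pairwise_hinge_loss(scores, labels)
print(scores, loss)
```

The loss is zero only when every salient event outscores every non-salient event by at least the margin, which is what pushes salient events towards the top of the ranking.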

Neural Event Salience Model
As discussed in §1, the salience of discourse units is reflected by rich relations beyond lexical similarities, for example, script ("charge" and "trial") and frame (a "trial" of "attacks"). The relations between these words are specific to the salience task, thus difficult to be captured by raw cosine scores that are optimized for word similarities. In this section, we present a neural model to exploit the embedding space more effectively, in order to capture relations for event salience estimation.

Kernel-based Centrality Estimation
Table 2: Event salience features.
Name                  Description
Frequency             The frequency of the event lemma in the document.
Sentence Location     The location of the first sentence that contains the event.
Event Voting          Average cosine similarity with other events in the document.
Entity Voting         Average cosine similarity with other entities in the document.
Local Entity Voting   Average cosine similarity with entities in the sentence.

Inspired by the kernel ranking model (Xiong et al., 2017), we propose Kernel-based Centrality Estimation (KCE) to find and weight semantic relations of interest, in order to better estimate salience. Formally, given a document $d$ and its set of annotated events $V = \{ev_1, \ldots, ev_i, \ldots, ev_n\}$, KCE first embeds each event into a vector space: $ev_i \xrightarrow{\text{Emb}} \vec{ev}_i$. The embedding function is initialized with pre-trained embeddings. It then extracts $K$ features for each $ev_i$:
$$\Phi_K(ev_i, V) = \{\phi_1(\vec{ev}_i, V), \ldots, \phi_k(\vec{ev}_i, V), \ldots, \phi_K(\vec{ev}_i, V)\} \tag{3}$$
$$\phi_k(\vec{ev}_i, V) = \sum_{ev_j \in V} \exp\left(-\frac{\big(\cos(\vec{ev}_i, \vec{ev}_j) - \mu_k\big)^2}{2\sigma_k^2}\right) \tag{4}$$
$\phi_k(\vec{ev}_i, V)$ is the $k$-th Gaussian kernel with mean $\mu_k$ and variance $\sigma_k^2$. It models the interactions between events in its kernel range defined by $\mu_k$ and $\sigma_k$. $\Phi_K(ev_i, V)$ enforces multi-level interactions among events: relations that contribute similarly to salience are expected to be grouped into the same kernels. Such interactions greatly improve the capacity of the model with a negligible increase in the number of parameters. Empirical evidence (Xiong et al., 2017) has shown that kernels in this form are effective for learning weights for task-specific term pairs.
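The soft-count behaviour of the kernels can be illustrated with a small sketch. It assumes the form used in kernel ranking models, where each kernel sums a Gaussian of the cosine similarity between the target event and every other unit. The kernel means, the shared width, and the toy vectors are hypothetical choices, and whether the target event itself is included in the sum is left to the caller.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kernel_features(target, others, mus, sigma):
    """One soft count per kernel: how many units have similarity near each mean."""
    sims = [cosine(target, o) for o in others]
    return [
        sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in sims)
        for mu in mus
    ]


mus = [-0.5, 0.0, 0.5, 1.0]  # hypothetical kernel means
sigma = 0.1                  # hypothetical shared kernel width

target = [1.0, 0.0]
others = [[1.0, 0.0], [0.0, 1.0]]  # cosine similarities 1.0 and 0.0 to the target
feats = kernel_features(target, others, mus, sigma)
print([round(f, 4) for f in feats])
```

Each unit contributes almost entirely to the kernel whose mean is closest to its similarity value, so the feature vector behaves like a soft histogram of similarities that remains differentiable with respect to the embeddings.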
The final salience score is computed as:
$$f(ev_i, d) = W_v \cdot \Phi_K(ev_i, V) + b \tag{5}$$
where $W_v$ is learned to weight the contribution of the relations captured by each kernel. We then use the exact same learning objective as in equation (2). The pairwise loss is first back-propagated through the network to update the kernel weights $W_v$, assigning higher weights to relevant regions. Then the kernels use the gradients to update the embeddings, in order to capture the meaningful discourse relations for salience.
Since the features and KCE capture different aspects, combining them may give superior performance. This can be done by combining the two vectors in the final linear layer:
$$f(ev_i, d) = W_f \cdot F(ev_i, d) + W_v \cdot \Phi_K(ev_i, V) + b \tag{6}$$

Integrating Entities into KCE
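KCE is also used to model the relations between events and entities. For example, in Figure 1, the entity "court" is a frame element of the event "trial"; "United States" is a frame element of the event "war". It is not clear which pair contributes more to salience. We again let KCE learn it.
Formally, let $E$ be the list of entities in the document, i.e. $E = \{en_1, \ldots, en_i, \ldots, en_n\}$, where $en_i$ is the $i$-th entity in document $d$. KCE extracts the kernel features about entity-event relations as follows:
$$\Phi_K(ev_i, E) = \{\phi_1(\vec{ev}_i, E), \ldots, \phi_k(\vec{ev}_i, E), \ldots, \phi_K(\vec{ev}_i, E)\} \tag{7}$$
$$\phi_k(\vec{ev}_i, E) = \sum_{en_j \in E} \exp\left(-\frac{\big(\cos(\vec{ev}_i, \vec{en}_j) - \mu_k\big)^2}{2\sigma_k^2}\right) \tag{8}$$
Similarly, $en_i$ is embedded by $en_i \xrightarrow{\text{Emb}} \vec{en}_i$, which is initialized with pre-trained entity embeddings.
We reach the full KCE model by combining all the vectors using a linear layer:
$$f(ev_i, d) = W_e \cdot \Phi_K(ev_i, E) + W_v \cdot \Phi_K(ev_i, V) + W_f \cdot F(ev_i, d) + b \tag{9}$$
The model is again trained by equation (2).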

Experimental Methodology
This section describes our experiment settings.

Event Salience Detection
Dataset: We conduct our experiments on the salience corpus described in §3. Among the 664,911 articles with abstracts, we sample 10% of the data as the test set and then randomly leave out another 10% of the documents for development. Overall, there are 4359 distinct event lexical items, at a similar scale to previous work (Chambers and Jurafsky, 2008; Do et al., 2011). The corpus statistics are summarized in Table 1.
Input: The inputs to the models are the documents and the extracted events. The models are required to rank the events from the most to the least salient.
Baselines: Three methods from previous research are used as baselines: Frequency, Location and PageRank. The first two are often used to simulate saliency (Barzilay and Lapata, 2008; Vossen and Caselli, 2015). The Frequency baseline ranks events based on the count of the headword lemma; the Location baseline ranks events using the order of their appearances in discourse. Ties are broken randomly. Similar to entity salience ranking with PageRank scores (Xiong et al., 2018), our PageRank baseline runs PageRank on a fully connected graph whose nodes are the events in documents. The edges are weighted by the embedding similarities between event pairs. We conduct supervised PageRank on this graph, using the same pairwise loss setup as in KCE. We report the best performance obtained by linearly combining Frequency with the scores obtained after a one-step random walk.
Evaluation Metric: Since the importance of events is on a continuous scale, the boundary between "important" and "not important" is vague. Hence we evaluate it as a ranking problem. The metrics are the precision and recall values at 1, 5 and 10. It is adequate to stop at 10 since there are fewer than 9 salient events per document on average (Table 1). We also report Area Under Curve (AUC). Statistical significance is tested by a permutation (randomization) test with p < 0.05.
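The ranking metrics can be sketched as follows. The code assumes that AUC refers to the area under the ROC curve, computed here as the fraction of (salient, non-salient) pairs ranked in the correct order; ties in the ranking keep input order rather than being broken randomly. The scores and labels are toy values.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall over the k highest-scored events (labels are 0/1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in order[:k])
    total_pos = sum(labels)
    precision = hits / k  # divides by k even if fewer than k events exist
    recall = hits / total_pos if total_pos else 0.0
    return precision, recall


def auc(scores, labels):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy model scores and gold salience labels for one document.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]

p1, r1 = precision_recall_at_k(scores, labels, 1)
p2, r2 = precision_recall_at_k(scores, labels, 2)
area = auc(scores, labels)
print(p1, r1, p2, r2, area)
# prints: 1.0 0.5 0.5 0.5 0.75
```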
Implementation Details: We pre-trained word embeddings with 128 dimensions on the whole Annotated New York Times corpus using Word2Vec (Mikolov et al., 2013). Entities are extracted using the TagMe entity linking toolkit (Ferragina and Scaiella, 2010). Words or entities that appear only once in training are replaced with special "unknown" tokens.
The parameters of the models are optimized by Adam (Kingma and Ba, 2015), with batch size 128. The vectors of entities are initialized by the pre-trained embeddings. Event embeddings are initialized by their headword embedding.

The Event Intrusion Test: A Study
KCE is designed to estimate salience by modeling relations between discourse units. To better understand its behavior, we design the following event intrusion test, following the word intrusion test used to assess topic model quality (Chang et al., 2009). Event Intrusion Test: The test will present to a model a set of events, including: the origins, all events from one document; the intruders, some events from another document. Intuitively, if events inside a document are organized around the core content, a model capturing their relations well should easily identify the intruder(s).
Specifically, we take a bag of unordered events $\{O_1, O_2, \ldots, O_p\}$ from a document $O$ as the origins. We insert into it intruders, events drawn from another document $I$: $\{I_1, I_2, \ldots, I_q\}$. We ask a model to rank the mixed event set $M = \{O_1, I_1, O_2, I_2, \ldots\}$. We expect a model to rank the intruders $I_i$ below the origins $O_i$.
Intrusion Instances: From the development set, we randomly sample 15,000 origin and intruding document pairs. To simplify the analysis, we only take documents with at least 5 salient events. The intruder events, together with the entities in the same sentences, are added to the origin document.
Metrics: AUC is used to quantify ranking quality, where events in $O$ are positive and events in $I$ are negative. To observe the ranking among the salient origins, we compute a separate AUC score between the intruders and the salient origins, denoted as SA-AUC. In other words, SA-AUC is the AUC score on the list with non-salient origins removed.
Experiments Details: We take the full KCE model to compute salience scores for events in the mixed event set $M$, which are directly used for ranking. Frequency is recounted. All other features (Table 2) are set to 0 to emphasize the relational aspects. We experiment with two settings: 1. adding only the salient intruders; 2. adding only the non-salient intruders. Under both settings, the intruders are added one by one, allowing us to observe how the scores change with the number of intruders added. For comparison, we add a Frequency baseline that directly ranks events by the Frequency feature.
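The two intrusion metrics differ only in which origin events count as positives. The sketch below computes both from model scores, assuming AUC is the fraction of (origin, intruder) pairs in which the origin is ranked higher; the scores are invented toy values and the function names are hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def intrusion_metrics(origins, intruder_scores):
    """Return (AUC, SA-AUC) for one mixed event set.

    `origins` is a list of (score, is_salient) pairs for the origin document;
    `intruder_scores` holds the scores of the inserted intruder events.
    """
    all_origins = [s for s, _ in origins]
    salient_origins = [s for s, salient in origins if salient]
    return (
        pairwise_auc(all_origins, intruder_scores),
        pairwise_auc(salient_origins, intruder_scores),
    )


# Toy scores: two salient origins, two non-salient origins, two intruders.
origins = [(0.9, True), (0.7, True), (0.2, False), (0.1, False)]
intruders = [0.5, 0.3]

auc_all, sa_auc = intrusion_metrics(origins, intruders)
print(auc_all, sa_auc)
# prints: 0.5 1.0
```

In this toy case the salient origins are cleanly separated from the intruders while the non-salient origins are not, which is the kind of gap between SA-AUC and AUC that the test is designed to expose.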

Evaluation Results
This section presents the evaluations and analyses.

Event Salience Performance
We summarize the main results in Table 3.
Baselines: Frequency is the best performing baseline. Its precision at 1 and at 5 is higher than 40%. PageRank performs worse than Frequency on all
Case Study: We inspect some pairs of events and entities in different kernels and list some examples in Table 5. The pre-trained embeddings change substantially during training. Pairs of units with different raw similarity values are now placed in the same bin. The pairs in Table 5 exhibit interesting types of relations: e.g., "arrest-charge" and "attack-kill" form script-like chains; "911 attack" forms a quasi-identity relation (Recasens et al., 2010) with "attack"; "business" and "increase" are candidates for a frame-argument structure. While these pairs have different raw cosine similarities, they are all useful in predicting salience. KCE learns to gather these relations into bins assigned with higher weights, which is not achieved by pure embedding based methods. After training, KCE has changed the embedding space and the scoring functions significantly from the original space. This partially explains why the raw voting features and PageRank are not as effective.

Intrusion Test Results
The left figure shows that KCE successfully finds the non-salient intruders. The SA-AUC is higher than 0.8. Yet the AUC scores, which include the rankings of non-salient events, are rather close to random. This shows that the salient events in the origin documents form a more cohesive group, making them more robust against the intruders; the non-salient ones are not as cohesive.
In both settings, KCE produces higher SA-AUC than Frequency at the first 30%. However, when only salient intruders are added, KCE starts to produce lower SA-AUC than Frequency after 30%, then gradually drops to 0.5 (random). This phenomenon is expected, since the asymmetry between origins and intruders allows KCE to distinguish them at the beginning. When all intruders are added, KCE performs worse because it relies heavily on the relations, which can also be formed by the salient intruders. This phenomenon is observed only with the salient intruders, which again confirms that the cohesive relations are found among salient events.
In conclusion, we observe that the salient events form tight groups connected by discourse relations while the non-salient events are not as related. The observations imply that the main scripts in documents are mostly anchored by small groups of salient events (such as the "Trial" script in Example 1). Other events may serve as "backgrounds" (Cheung et al., 2013). Similarly, Choubey et al. (2018) find that relations like event coreference and sequence are important for saliency.

Conclusion
We propose two salience detection models, based on lexical relatedness and semantic relations. The feature-based model with lexical similarities is effective, but cannot capture semantic relations like scripts and frames. The KCE model uses kernels and embeddings to capture these relations, and thus outperforms the baselines and the feature-based model significantly. All the results are tested on our newly created large-scale event salience dataset. While the automatic method inevitably introduces noise into the dataset, the scale enables us to study complex event interactions, which is infeasible via costly expert labeling.
Our case study shows that the salience model finds and utilizes a variety of discourse relations: script chain (attack and kill), frame argument relation (business and increase), quasi-identity (911 attack and attack). Such complex relations are not as prominent in the raw word embedding space. The core message is that a salience detection module automatically discovers connections between salience and relations. This goes beyond prior centering analysis work that focuses on lexical and syntactic cues, and provides a new semantic view from the script and frame perspective.
In the intrusion test, we observe that the small number of salient events form tightly connected groups. While KCE captures these relations quite effectively, it can be confused by salient intrusion events. The phenomenon indicates that the salient events are tightly connected and form the main scripts of documents.
This paper empirically reveals many interesting connections between discourse phenomena and salience. The results also suggest that core script information may reside mostly in the salient events. Limited by the data acquisition method, this paper only models discourse salience as binary decisions. However, salience values may be continuous and may even have more than one aspect. In the future, we plan to investigate these complex settings. Another direction of study is large-scale semantic relation discovery, for example, frames and scripts, with a focus on salient discourse units.