Connecting the Dots: Event Graph Schema Induction with Path Language Modeling

Event schemas can guide our understanding of, and ability to make predictions about, what might happen next. We propose a new Event Graph Schema, where two event types are connected through multiple paths involving entities that fill important roles in a coherent story. We then introduce a Path Language Model, an auto-regressive language model trained on event-event paths, and select salient and coherent paths to probabilistically construct these graph schemas. We design two evaluation metrics, instance coverage and instance coherence, to evaluate the quality of graph schema induction by checking when coherent event instances are covered by the schema graph. Intrinsic evaluations show that our approach is highly effective at inducing salient and coherent schemas. Extrinsic evaluations show that the induced schema repository provides significant improvement to downstream end-to-end Information Extraction over a state-of-the-art joint neural extraction model, when used as additional global features to unfold instance graphs. 1


Introduction
Existing approaches to automated event extraction retain the overly simplistic assumption that events are atomic occurrences. Understanding events requires knowledge in the form of a repository of abstracted event schemas (complex event templates). Scripts (Schank and Abelson, 1977) encode frequently recurring event sequences, where events are ordered by temporal relation (Chambers and Jurafsky, 2009), causal relation (Mostafazadeh et al., 2016b), or narrative order (Jans et al., 2012). Event schemas have become increasingly important for natural language understanding tasks such as story ending prediction (Mostafazadeh et al., 2016a) and reading comprehension (Kočiský et al., 2018; Ostermann et al., 2019).

1 Our code and data are publicly available for research purposes at http://blender.cs.illinois.edu/software/pathlm.
Previous schema induction methods mostly ignore uncertainty, re-occurring events and multiple hypotheses, with limited attention to capture complex relations among events, other than temporal or causal relations. Temporal relations exist between almost all events, even those that are not semantically related; while research in identifying causal relations has been hobbled by low inter-annotator agreement (Hong et al., 2016).
In this paper, we hypothesize that two events are connected when their entity arguments are coreferential or semantically related. For example, in Figure 1, (a) and (b) refer to very different event instances, but they both illustrate a typical scenario where a group of people moved from one place to another and then attacked the destination. From many such event instance pairs, we can induce multiple paths connecting a movement event to a related attack event: the person being moved became the attacker, and the weapon or vehicle being moved became the instrument of the attack. Low-level primitive components of event schemas are abundant, and can be part of multiple, sparsely occurring, higher-level graph schemas. We thus propose a new schema representation, Event Graph Schema, where two event types are connected by such paths containing entity-entity relations. Each node represents an entity type or event type, and each edge represents an entity-entity relation type or the argument role of an entity played in an event.
However, between two event types, there may also be noisy paths that should be excluded from graph schemas. We define the following criteria to select good paths in a graph schema: (1) Salience: a good path should appear frequently between two event types; (2) Coherence: multiple paths between the same pair of event types should tell a coherent story, namely they should co-occur frequently in the same discourse (e.g., the same document). Table 1 shows some examples of good paths and bad paths.

Figure 1: The framework of event graph schema induction. Given a news article, we construct an instance graph for every two event instances from information extraction (IE) results. In this example, instance graph (a) tells the story of Russia deploying troops to attack Ukraine using tanks from Russia; instance graph (b) is about Ukrainian protesters hitting police using stones that were carried to Maidan Square. We learn a path language model to select salient and coherent paths between two event types and merge them into a graph schema. The graph schema between ATTACK and TRANSPORT is an example output containing the top 20% ranked paths.
As the first attempt to extract such schemas, we propose a path language model to select paths which clearly indicate how two events are connected through their shared entity arguments or the entity-entity relations between their arguments. For example, in Figure 1 (b), Maidan Square and Ukraine connect events TRANSPORT and ATTACK through the path TRANSPORT --DESTINATION--> Maidan Square --PART-WHOLE--> Ukraine <--PLACE-- ATTACK. We train the path language model on two tasks: learning an auto-regressive language model (Ponte and Croft, 1998; Dai and Le, 2015; Peters et al., 2018; Radford et al.; Yang et al., 2019) to predict an edge or a node given the previous edges and nodes in a path, and a neighboring path classification task to predict how likely two paths are to co-occur. The path language model is trained on all paths between two event instances from the same document, based on the assumption that events from the same document (especially a news document) tell a coherent story.
We propose two intrinsic evaluation metrics, instance coverage and instance coherence, to assess when event instance graphs are covered by each graph schema, and when different schemas appear in the same document. Intrinsic evaluation on heldout documents demonstrates that our approach can produce highly salient and coherent schemas.
Such event graph schemas can also be exploited to enhance the performance of Information Extraction (IE) tasks, such as entity extraction, relation extraction, event extraction, and argument role labeling, because most existing methods ignore such inter-event dependencies. For example, from the following sentence "Following the trail of Mohammed A. Salameh, the first suspect arrested in the bombing, investigators discovered a jumble of chemicals, chemistry implements and detonating materials...", the state-of-the-art IE system (Lin et al., 2020) successfully extracts the ARRESTJAIL event but fails to extract the INVESTIGATECRIME event triggered by "discovered" and its DEFENDANT argument "Mohammed A. Salameh". Event graph schemas can inform the model that a person who is arrested has usually been investigated, so our IE system can fix this missing-event error. Therefore we also conduct extrinsic evaluations and show the effectiveness of the induced schema repository in enhancing downstream end-to-end IE tasks.
In summary, we make the following novel contributions: • A novel semantic schema induction framework for the new event schema representation, Event Graph Schema, that encodes rich event structures and event-event connections, and two new evaluation metrics to assess graph schemas for coverage and coherence.
• A Path Language Model to select salient and coherent event-event paths and construct an event graph schema repository that is probabilistic and semantically coherent.
• The first work to show how to apply event schema to enhance end-to-end IE.

Problem Formulation
Given an input document, we extract instances of entities, relations, and events. The type set of entities and events is Φ, and the type set of entity-entity relations and event argument roles is Ψ. For every two event instances, we construct an event instance graph g = (V, E, ϕ) ∈ G containing all paths connecting the two, as in Figure 1 (a) and (b). V and E are the node and edge sets, and ϕ : {V, E} → {Φ, Ψ} is a mapping function that returns the type of each node or edge. Each node v_i = ⟨w_i, ϕ(v_i)⟩ ∈ V represents an entity or an event with text mention w_i, and ϕ(v_i) ∈ Φ denotes its node type. Each set of coreferential entities or events is mapped to a single node. Each edge e_{ij} = ⟨v_i, ϕ(e_{ij}), v_j⟩ ∈ E represents an event-argument role or an entity-entity relation, where i and j denote the involved nodes and ϕ(e_{ij}) ∈ Ψ indicates the edge type. Figure 1 shows two example instance graphs. Event graph schema induction aims to generate a set of recurring graph schemas S from the instance graphs G. For every event type pair, we induce an event graph schema s = (U, H) ∈ S, where U and H are the node and edge sets: each node represents a node type φ_i ∈ Φ and each edge represents an edge type ψ_{ij} ∈ Ψ from the instance graphs G, where φ_i and φ_j denote the node types involved in ψ_{ij}. Figure 1 shows an example of an induced graph schema between TRANSPORT and ATTACK.
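To make the formulation above concrete, the following is a minimal illustrative sketch (our own simplification, not the authors' released code) of the instance graph g = (V, E, ϕ); the type names shown are examples from the ACE convention, not the full inventory:

```python
# Sketch of an instance graph g = (V, E, phi): typed nodes and typed edges.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    mention: str   # text mention w_i (one node per coreference cluster)
    ntype: str     # phi(v_i): entity or event type from Phi

@dataclass
class InstanceGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (head, edge type from Psi, tail)

    def add_edge(self, head: Node, etype: str, tail: Node):
        self.nodes.update({head, tail})
        self.edges.add((head, etype, tail))

# Toy fragment in the spirit of Figure 1 (b): protesters attack police.
attack = Node("hit", "ATTACK")
protesters = Node("protesters", "PER")
g = InstanceGraph()
g.add_edge(attack, "ATTACKER", protesters)
```

In a real pipeline, nodes and edges would come from IE output rather than being constructed by hand.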

Path Language Model based Graph Schema Induction

Overview
As shown in Table 1, a graph schema for two event types consists of salient and coherent paths between them. A salient path reveals knowledge of recurring event-event connection patterns. For example, the frequent path in Table 1 shows that the attacker is a member of the government conducting a deployment, which repeatedly appears in the story about attackers sending weapons and people to attack a target place. However, the attacker is unlikely to be affiliated with a target place, so the infrequent path in Table 1 should be excluded from the schema.
In addition, a good path is semantically coherent. For example, the coherent path in Table 1 shows that the origin of transportation is a subarea of the attacker's country, which captures the hierarchical part-whole relation between two places. However, in the bad path example, a person is affiliated with both the origin and destination of the transportation, which is a weakly coherent situation.
Furthermore, multiple paths in a good schema should be semantically consistent, namely they should co-occur frequently in the same scenario. For example, in Table 1, the destination of transportation is the attack's target, and meanwhile, is the location of the transported people. The co-occurrence of these two paths represents a repetitive pattern connecting TRANSPORT and ATTACK. However, the incoherent example in Table 1 indicates that the attack place is both the destination and the origin of the transportation, and these two paths rarely co-occur.
To induce such salient and coherent graph schemas, we start by applying Information Extraction (IE) to construct instance graphs between event instances in each document (Section 3.2). We consider a path sequence as a text sequence, and learn an auto-regressive path language model to score each path (Section 3.3). To capture the coherence between paths, we learn a neighbor path classifier to predict whether two paths co-occur (Section 3.4). The path language model is trained jointly on these two tasks (Section 3.5), which enables us to score and rank paths between event type pairs, and merge salient and coherent paths into graph schemas (Section 3.6).

Instance Graph Construction
Starting with the entities, entity-entity relations, events, and event arguments extracted from an input document by IE systems or manual annotation, we construct an event instance graph g for two event instances v and v′ that includes all instance paths between them. Each instance path is a sequence of nodes v, v_1, ..., v′ ∈ V and edges e_{0,1}, ..., e_{n-1,n} ∈ E, such as the instance path through tanks in Figure 1 (a). The node instances in each path are distinct to avoid cycles. An event-event path is a sequence of the types of these nodes and edges; for example, abstracting the instance path above yields ATTACK --INSTRUMENT--> VEH <--ARTIFACT-- TRANSPORT. We consider paths in both directions, namely that reversed paths are valid.
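The path enumeration described above can be sketched in a few lines; this is an illustrative simplification (not the released code), and the edge roles and type names in the example are assumptions chosen to mirror Figure 1 (a):

```python
# Enumerate simple paths between two nodes, abstracted to the type level.
from collections import defaultdict

def type_paths(edges, node_type, src, dst, max_hops=3):
    """edges: iterable of (head, role, tail); node_type: mention -> type.
    Edges are traversed in both directions (reversed paths are valid),
    and node instances on a path are kept distinct to avoid cycles."""
    adj = defaultdict(list)
    for h, r, t in edges:
        adj[h].append((r + "->", t))
        adj[t].append(("<-" + r, h))
    out = []
    def dfs(cur, visited, seq):
        if cur == dst:
            out.append(" ".join(seq + [node_type[cur]]))
            return
        if len(visited) > max_hops:
            return
        for role, nxt in adj[cur]:
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, seq + [node_type[cur], role])
    dfs(src, {src}, [])
    return out

# Toy version of the Figure 1 (a) path through the shared "tanks" argument.
ntype = {"attack": "ATTACK", "tanks": "VEH", "deploy": "TRANSPORT"}
edges = [("attack", "INSTRUMENT", "tanks"), ("deploy", "ARTIFACT", "tanks")]
paths = type_paths(edges, ntype, "attack", "deploy")
# ['ATTACK INSTRUMENT-> VEH <-ARTIFACT TRANSPORT']
```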

Autoregressive Path Language Model
To score and select salient and semantically coherent path sequences, we take a language modeling approach, inspired by node representation learning (Grover and Leskovec, 2016; Goikoetxea et al., 2015) that applies language models over paths. An autoregressive language model (Ponte and Croft, 1998; Dai and Le, 2015; Peters et al., 2018; Radford et al.; Yang et al., 2019) learns the probability of a text sequence as the product of the probability distributions of each word given its preceding context (a forward product) or, in the other direction, its following context (a backward product). Similarly, for a path instance p_I, we estimate the probability distribution of a node type ϕ(v_i) (or edge type ϕ(e_{j,j+1})) given the sequence of previously observed nodes and edges [ϕ(v), ϕ(e_{0,1}), ϕ(v_1), ..., ϕ(e_{i-1,i})]. Following Yang et al. (2019), we apply the Transformer (Vaswani et al., 2017) to learn the probability distribution, with a permutation operation (Yang et al., 2019) to capture bidirectional contexts. Unlike in text sequences, nodes and edges alternate within path sequences. As shown in Figure 2, to distinguish nodes from edges, we add a type embedding E_T = [1, 2, 1, ..., 2, 1] into the token representation, where 1 stands for nodes, 2 for edges, and 0 for special tokens such as [CLS]. We hypothesize that event instances from the same discourse (e.g., a news document) describe a coherent story, and so we use the paths between them as training paths.
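The alternating node/edge input encoding with type ids can be sketched as follows; this is a hedged illustration of the E_T scheme described above, not the actual model input pipeline (the special token names are assumed):

```python
def encode_path(path_tokens):
    """Build token and type-id sequences for the path LM input.
    Type ids: 0 = special tokens ([CLS]/[SEP]), 1 = node type, 2 = edge type.
    A path alternates node and edge types, starting and ending with a node."""
    tokens, type_ids = ["[CLS]"], [0]
    for i, tok in enumerate(path_tokens):
        tokens.append(tok)
        type_ids.append(1 if i % 2 == 0 else 2)  # even positions are nodes
    tokens.append("[SEP]")
    type_ids.append(0)
    return tokens, type_ids

toks, tids = encode_path(["ATTACK", "INSTRUMENT", "VEH", "ARTIFACT", "TRANSPORT"])
# tids == [0, 1, 2, 1, 2, 1, 0]
```

In the real model these ids select a learned type embedding that is added to the token embedding before the Transformer layers.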

Neighbor Path Classification
To capture the consistency between paths, we train a binary neighbor path classifier to learn the co-occurrence probability of two paths. For each path p_i ∈ P_{v,v′} between two event instances v and v′, we obtain its neighbor path set as the paths that co-occur with it between the same event instances v and v′. We sample negative neighbor paths from paths that appear between the same event types ϕ(v) and ϕ(v′) but never co-occur with p_i in the corpus.
We also swap each path pair to improve the consistency of the neighbor path classification. The neighbor path classifier (top of Figure 2) is a linear layer with the classification token x_[CLS] as input. We balance the positive and negative path pairs during training and optimize a cross-entropy loss.
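The pair construction just described (positives from co-occurrence, negatives from same-type non-co-occurring paths, balanced and swapped) can be sketched as below; this is our own illustrative reading, not the released training code:

```python
import random

def neighbor_pairs(cooccur, candidates, seed=0):
    """cooccur: set of frozensets {p_a, p_b} of paths observed together in a
    document; candidates: all paths between the same event-type pair.
    Positives co-occur; negatives share the event-type pair but never
    co-occur. Each pair is also swapped so the classifier sees both orders."""
    rng = random.Random(seed)
    pos = []
    for pair in cooccur:
        a, b = tuple(pair)
        pos += [(a, b, 1), (b, a, 1)]
    neg = []
    while len(neg) < len(pos):          # balance positives and negatives
        a, b = rng.sample(candidates, 2)
        if frozenset((a, b)) not in cooccur:
            neg += [(a, b, 0), (b, a, 0)]
    return pos + neg[:len(pos)]
```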

Joint Training
We jointly optimize the autoregressive language model loss and the neighbor path classifier loss: L = L_LM + λ·L_NP.

Graph Schema Construction
Given two event types φ and φ′, we construct a graph schema s by merging the top k percent of ranked paths. Paths in P_{φ,φ′} are ranked by a score function f(p) = α·f_LM(p) + (1 − α)·f_NP(p), where f_LM(p) captures the salience and coherence of a single path, and f_NP(p) scores a path p_i by its average probability of co-occurring with the other paths p_j ∈ P_{φ,φ′} between the given event types φ and φ′. We merge instance paths into a graph schema s by mapping nodes of the same type into a single node. We allow self-loops in the graph, such as GPE --PART-WHOLE--> GPE. Each path in the schema is associated with its probability score.
Each edge and node is assigned a salience score by aggregating the scores of the paths passing through it.
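The ranking-and-merging step can be sketched as follows. This is a minimal illustration under our reading of the score combination (the α-weighted sum is assumed from the α hyperparameter reported later); paths are represented as space-separated type sequences:

```python
def build_schema(paths, f_lm, f_np, alpha=0.3, top_percent=20):
    """Rank candidate paths by f(p) = alpha*f_LM(p) + (1-alpha)*f_NP(p),
    keep the top-k%, and merge them into a schema graph by collapsing
    nodes of the same type (self-loops such as GPE-PART-WHOLE-GPE allowed)."""
    scored = sorted(paths,
                    key=lambda p: alpha * f_lm[p] + (1 - alpha) * f_np[p],
                    reverse=True)
    keep = scored[:max(1, len(scored) * top_percent // 100)]
    nodes, edges = set(), set()
    for p in keep:
        seq = p.split()                      # alternating node/edge types
        for i in range(0, len(seq) - 2, 2):
            nodes.update({seq[i], seq[i + 2]})
            edges.add((seq[i], seq[i + 1], seq[i + 2]))
    return nodes, edges
```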

Dataset
We use the Automatic Content Extraction (ACE) 2005 dataset 2, a widely used dataset with annotated instances of 7 entity types, 6 relation types, 33 event types, and 22 argument roles. We follow our recent work on ACE IE (Lin et al., 2020) to split the data. We treat the training set as historical data to train the language model, and the test set as target data for which to induce schemas for target scenarios. The instance graphs of the target data are constructed from manual annotations. For the historical data, we construct event instance graphs from both manual annotations (Historical_ann) and system extraction results (Historical_sys) from the state-of-the-art IE model (Lin et al., 2020). We perform cross-document entity coreference resolution by applying an entity linker (Pan et al., 2017) to both the annotated and the system-generated instance graphs.

Instance Coverage
A salient schema can serve as a skeleton to recover instance graphs. Therefore, we use each graph schema s ∈ S to match back to each ground-truth instance graph g ∈ G and evaluate their intersection g ∩ s in terms of Precision and Recall.
The intersection is obtained by searching instance graphs with each graph schema as a query. Since instance graphs can be regarded as partially instantiated graph schemas, we employ substructures of the schema graph, i.e., paths of different lengths, as queries. For example, a path of length l = 3 is a triple ⟨φ_i, ψ_{ij}, φ_j⟩ ∈ s in the graph schema. We consider an instance triple ⟨v_m, e_{mn}, v_n⟩ ∈ g matched if the instance types match, i.e., ϕ(v_m) = φ_i, ϕ(e_{mn}) = ψ_{ij}, and ϕ(v_n) = φ_j. Let |·|_I denote the number of instance substructures matched and |·|_S the number of schema substructures matched; the cardinalities |g|_I and |s|_S are the total numbers of substructures in the instance graph and in the schema, respectively. By extension, each path of length l = 5 in a graph schema [φ_i, ψ_{ij}, φ_j, ψ_{jk}, φ_k] contains two consecutive triples ⟨φ_i, ψ_{ij}, φ_j⟩, ⟨φ_j, ψ_{jk}, φ_k⟩ ∈ s, and a matched instance path contains two consecutive matched instance triples. Similarly, a path of length l = 7 contains three consecutive triples. We then compute Precision = |g ∩ s|_S / |s|_S, the fraction of schema substructures matched in the instance graph, and Recall = |g ∩ s|_I / |g|_I, the fraction of instance substructures covered by the schema.
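The triple-matching version of the metric (l = 3) can be sketched as below; this is a simplified illustration that assumes instance triples have already been abstracted to the type level, not the authors' evaluation script:

```python
def coverage(instance_triples, schema_triples):
    """Precision: fraction of schema triples matched in the instance graph.
    Recall: fraction of instance triples covered by the schema.
    Triples are type-level (phi_i, psi_ij, phi_j)."""
    inter = set(instance_triples) & set(schema_triples)
    precision = len(inter) / len(schema_triples)
    recall = len(inter) / len(instance_triples)
    return precision, recall
```

Longer queries (l = 5, l = 7) would additionally require the matched triples to be consecutive along a path.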

Instance Coherence
For an instance graph between two events v and v , we hypothesize that the graph is coherent if v and v are from the same discourse (document). We carefully select 24 documents with each document talking about a unique complex event such as Iraq War or North Korea Nuclear Test. A coherent schema should have the maximal number of matched instance graphs g ∩ s from a single document, but the minimal number of matched graphs connecting two event instances from different documents. We define Instance Coherence as the proportion of event-event path instances in graphs within one document.

Coherence = ( Σ_{s∈S} Σ_{g∈G} |g ∩ s| · I_g ) / ( Σ_{s∈S} Σ_{g∈G} |g ∩ s| ),

where I_g is an indicator function taking value 1 when g is between event instances from the same document, and 0 otherwise.
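As a small sketch of this proportion (our own illustrative reading, not the evaluation script), given the matched path-instance counts per schema-instance intersection:

```python
def instance_coherence(matches):
    """matches: list of (num_matched_path_instances, same_document) pairs,
    one per intersection g ∩ s. Coherence is the proportion of matched
    path instances coming from within-document instance graphs."""
    total = sum(n for n, _ in matches)
    within = sum(n for n, same_doc in matches if same_doc)
    return within / total if total else 0.0
```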

Schema-Guided Information Extraction
As a case study for extrinsic evaluation, we evaluate the impact of our induced schemas 3 on end-to-end Information Extraction (IE). We choose our state-of-the-art IE system ONEIE (Lin et al., 2020) 4 as the baseline for two reasons: (1) it achieves state-of-the-art performance on all IE components; (2) it can easily incorporate global features during decoding, converting each input sentence into an instance graph. Given an input sentence, ONEIE generates a set of candidate IE graphs at each decoding step, as shown in Figure 3. The candidate IE graphs are ranked by the type prediction scores s(G) of each entity, relation, and event in each graph G. We consider schemas as global features and use them as an additional scoring mechanism for ONEIE 5. The schemas are induced from the training data of our IE system. If a path p_i in the schema appears n_i times in a candidate graph, we add n_i · w_i to the global score of this graph, where w_i is a learnable weight. The candidate graphs are then re-ranked by their global scores. In this way, the model can promote candidate graphs containing positive global features, even if those graphs have lower local type prediction scores.
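The re-ranking step above amounts to a simple additive global score; a minimal sketch (with fixed weights for illustration, whereas the real system learns w_i) looks like this:

```python
def global_score(local_score, path_counts, weights):
    """Re-rank a candidate IE graph: add n_i * w_i for every schema path p_i
    that appears n_i times in the candidate graph."""
    return local_score + sum(n * weights.get(p, 0.0)
                             for p, n in path_counts.items())

# A candidate graph matching a salient schema path can outrank one with a
# higher local type-prediction score. The path key below is illustrative.
w = {"ATTACK-INSTRUMENT-VEH-ARTIFACT-TRANSPORT": 0.5}
better = global_score(1.0, {"ATTACK-INSTRUMENT-VEH-ARTIFACT-TRANSPORT": 2}, w)
worse = global_score(1.5, {}, w)
```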

Settings
Baselines. As the first work to induce event graph schemas, we compare our method to various path ranking methods: (1) a Frequency Ranking Model, which ranks paths between every two event types by the number of associated instance paths in the historical and target data; (2) Unigram, Bigram, and Trigram Language Models, which assign probabilities to path sequences by estimating the probability of each node (or edge) from unigram, bigram, and trigram frequency counts, respectively. We also include a variant of PathLM that removes the neighbor path classifier (CLS_NP) as an ablation study.

Schema@K. To compare the ranking of paths with baselines, we evaluate graph schemas containing the top k% of ranked paths.

Implementation Details. We use the same hyperparameters as XLNet-base-cased (Yang et al., 2019), with dropout = 0.5, λ = 0.1, and α = 0.3. Detailed parameter settings are given in the Appendix.

Results and Analysis
We induce 124 and 197 graph schemas for Schema@10 and Schema@20, respectively. Figure 1 shows an output graph schema. 6 According to Table 3 and Table 4, PathLM achieves significant improvement on both instance coverage and instance coherence. A t-test shows that the gains achieved by PathLM over all baselines (Frequency, UnigramLM, BigramLM, TrigramLM) are statistically significant, with a p-value less than 0.01. We make the following observations: (1) PathLM achieves larger gains over the baselines on Schema@10 than on Schema@20 in Table 3, demonstrating the effectiveness of our ranking approach, especially on the top ranked paths.
(2) The improvement relative to the baselines on longer path queries (e.g., l = 7) is greater than on shorter paths (e.g., l = 3) in Table 3, showing that our approach is able to capture complex graph structures spanning long distances between related events. In the l = 3 setting, the performance of PathLM is close to the baselines. The reason is that the l = 3 setting evaluates a single overlapping triple, which is exactly the objective of TrigramLM. We conduct a t-test, and the gain is statistically significant (p-value less than 0.01).
(3) The neighbor path classification proves effective in enhancing the salience (see 'w/o CLS_NP' in Table 3) and coherence (see 'w/o CLS_NP' in Table 4) of the induced schemas, showing that salient substructures can be better captured by frequently co-occurring paths. The model outputs consistent neighbor path classification results for the swapped path pairs: 96.17% of swapped path pairs yield the same results as the original pairs. (4) The schemas induced from Historical_sys and Historical_ann have comparable performance. This shows our approach is robust to extraction noise and effective even with lower quality input.
As shown in Table 5, our event graph schemas provide significant improvement on relation extraction and event extraction, which require knowledge of complex connections among events and entities. Our approach achieves dramatic improvement on relation extraction because existing methods mainly rely on local contexts between two entities, which are typically short and ambiguous. In contrast, the paths in our graph schemas capture the global context between two events, and thus event-related information captures deeper contextual features, yielding a big boost in performance. For example, when decoding the candidate IE graph in Figure 3, the LOCATED IN relation is extracted by promoting the structures matching paths in the graph schema.

Table 5: F1 score (%) of schema-guided information extraction, including entity extraction (Entity), relation extraction (Rel), event trigger identification (Trig-I) and classification (Trig-C), event argument identification (Arg-I), and argument role classification (Arg-C).

Remaining Challenges
A major challenge in schema induction is to automatically decide the type granularity. For example, if two events happen on the same street, it is likely that they are related; if it is a country that connects to two events through place arguments, they can be independent. In this case, the fine-grained type information of shared place argument is required in schemas. However, to induce schemas about war, geopolitical entities of different granularities should be generalized as GPE.

Related Work
Atomic Event Schema Induction. Atomic event schema induction methods (Chambers, 2013; Cheung et al., 2013; Nguyen et al., 2015; Huang et al., 2016; Sha et al., 2016; Yuan et al., 2018) focus on discovering event types and argument roles of individual atomic events.

Narrative Event Schema Induction. Previous work (Chambers and Jurafsky, 2008; Jans et al., 2012; Balasubramanian et al., 2013; Mooney, 2014, 2016; Rudinger et al., 2015; Granroth-Wilding and Clark, 2016; Modi, 2016; Mostafazadeh et al., 2016a; Peng et al., 2019) focuses on inducing narrative schemas as partially ordered sets of events (represented as verbs) sharing a common argument. The event order has been further extended to include causality (Mostafazadeh et al., 2016b; Kalm et al., 2019), and temporal script graphs have been proposed in which events and arguments are abstracted to event types and participant types (Modi et al., 2017; Wanzare et al., 2017; Zhai et al., 2019). In our work, we propose a new event graph schema representation to capture more complex connections between events, and use event types instead of verbs as in previous work for greater abstraction power.
Path-based Language Model. Language models (LMs) (Ponte and Croft, 1998) have achieved great advances through contextualization in recent years (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). LMs have been applied over paths to learn node representations in a network (Goikoetxea et al., 2015; Grover and Leskovec, 2016; Dong et al., 2017). To the best of our knowledge, there has not been an effort to incorporate latent linguistic structures into language models based on typed event-event paths. This is also the first work to demonstrate how to leverage event schemas to enhance the performance of an IE system.

Graph Pattern Mining. Motif finding on heterogeneous networks (Prakash et al., 2004; Carranza et al., 2018; Rossi et al., 2019; Hu et al., 2019) discovers highly recurrent instance graph patterns, but fails to abstract schema graphs to the type level. Previous work applies graph summarization to discover frequent subgraph patterns in heterogeneous networks (Cook and Holder, 1993; Buehrer and Chellapilla, 2008; Li and Lin, 2009; Zhang et al., 2010; Koutra et al., 2014; Wu et al., 2014; Song et al., 2018; Bariatti et al., 2020), but ignores semantic coherence among multiple patterns.

Conclusions and Future Work
We propose Event Graph Schema induction as a new step towards semantic understanding of inter-event connections. We develop a path language model based method to construct graph schemas containing salient and semantically coherent event-event paths, which also effectively enhances end-to-end Information Extraction. In the future, we aim to extend graph schemas to encode hierarchical and temporal relations, as well as rich ontologies in the open domain. We will also assemble our graph schemas to represent more complex scenarios involving multiple events, so they can be applied to more downstream applications, including event graph completion and event prediction.