Joint Learning Templates and Slots for Event Schema Induction

Automatic event schema induction (AESI) aims to extract meta-events from raw text, i.e., to find what types of events (templates) exist in the text and what roles (slots) exist in each event type. In this paper, we propose a joint entity-driven model that learns templates and slots simultaneously based on constraints between templates and between slots within the same sentence. In addition, the entities' semantic information is used to model the inner connectivity of the entities. We borrow the normalized cut criterion from image segmentation to divide the entities into more accurate template clusters and slot clusters. Experiments show that our model achieves better results than previous work.


Introduction
An event schema is a high-level representation of a set of similar events. It is very useful for the traditional information extraction (IE) task (Sagayam et al., 2012). An example event schema is shown in Table 1. Given the bombing schema, we only need to find proper words to fill the slots when extracting a bombing event.
Table 1: Bombing Template
Perpetrator: person
Victim: person
Target: public
Instrument: bomb

There are two main approaches to the AESI task; both cluster the potential event arguments to find the event schema. One is the probabilistic graphical model (Chambers, 2013; Cheung, 2013). By incorporating templates and slots as latent topics, probabilistic graphical models learn the templates and slots that best explain the text. However, these models consider the entities independently and do not take the interrelationships between entities into account. The other method relies on ad-hoc clustering algorithms (Filatova et al., 2006; Sekine, 2006; Chambers and Jurafsky, 2011). (Chambers and Jurafsky, 2011) is a pipelined approach: it first uses pointwise mutual information (PMI) between clauses in the same document to learn events, and then learns syntactic patterns as slot fillers. However, the pipelined approach suffers from error propagation: errors in the template clustering lead to more errors in the slot clustering.
This paper proposes an entity-driven model which jointly learns templates and slots for event schema induction. The main contributions of this paper are as follows:
• To better model the inner connectivity between entities, we borrow the normalized cut from image segmentation as the clustering criterion.
• We use constraints between templates and between slots in one sentence to improve the AESI result.
Our ultimate goal is to assign two labels to each entity: a slot variable s and a template variable t. After that, we can summarize all the assignments to obtain event schemas.

Inner Connectivity Between Entities
We focus on two types of inner connectivity: (1) the likelihood that two entities belong to the same template; (2) the likelihood that two entities belong to the same slot.

Template Level Connectivity
It is easy to understand that entities occurring near each other are more likely to belong to the same template. (Chambers and Jurafsky, 2011) therefore uses PMI to measure the correlation of two words in the same document, but this cannot group two words from different documents. In the Bayesian model of (Chambers, 2013), p(predicate) is the key factor deciding the template, but it ignores the fact that entities occurring nearby should belong to the same template. In this paper, we combine the two measures: if two entities occur nearby, they can belong to the same template; if they have similar meanings, they can also belong to the same template. We use PMI to measure the distance similarity and word vectors (Mikolov et al., 2013) to calculate the semantic similarity.
A word vector can well represent the meaning of a word. So we concatenate the word vector of the i-th entity's head word with that of its predicate, denoted as vec hp (i). We use the cosine similarity cos hp (i, j) to measure the closeness of two such vectors.
Then we can get the template level connectivity formula as shown in Eq 1. The P M I(i, j) is calculated by the head words of entity mention i and j.
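As an illustration, the two measures could be combined as in the sketch below. The additive combination and the helper names (pmi, template_similarity) are assumptions of this sketch, not the paper's exact Eq 1.

```python
import math
import numpy as np

def pmi(count_ij, count_i, count_j, total):
    """Pointwise mutual information of two head words from co-occurrence counts."""
    p_ij = count_ij / total
    p_i = count_i / total
    p_j = count_j / total
    return math.log(p_ij / (p_i * p_j))

def cos_sim(u, v):
    """Cosine similarity of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def template_similarity(pmi_ij, head_vec_i, pred_vec_i, head_vec_j, pred_vec_j):
    """Template-level connectivity: distance similarity (PMI of the head words)
    plus semantic similarity (cosine of the concatenated head/predicate vectors)."""
    vec_i = np.concatenate([head_vec_i, pred_vec_i])  # vec_hp(i)
    vec_j = np.concatenate([head_vec_j, pred_vec_j])  # vec_hp(j)
    return pmi_ij + cos_sim(vec_i, vec_j)
```

Any monotone combination of the two measures would serve the same intuition; the addition here is only the simplest choice.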

Slot Level Connectivity
If two entities can play similar roles in an event, they are likely to fill the same slot. If two entities play similar roles, their head words may share the same hypernyms (we only consider direct hypernyms here). Their predicates may also have similar meanings, and the entities may have the same dependency path to their predicates. Therefore, we give the three factors equal weights and add them together to get the slot level similarity:

w_S(i, j) = δ(hyp_i ∩ hyp_j ≠ ∅) + cos_p(i, j) + δ(dep_i = dep_j)  (2)

Here, δ(·) has value 1 when the inner expression is true and 0 otherwise. The hypernym set hyp_i is derived from WordNet (Miller, 1995) and contains the direct hypernyms of entity i's head word; if two entities' head words have at least one common direct hypernym, they may belong to the same slot. Again, cos_p(i, j) is the cosine similarity between the word vectors of the predicates of entity i and entity j, and dep_i is the dependency path from entity i to its predicate.
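A minimal sketch of this equal-weight sum, assuming set-valued hypernyms and string-valued dependency paths (the helper names are hypothetical):

```python
import numpy as np

def delta(cond):
    """Indicator: 1 if the condition holds, else 0."""
    return 1.0 if cond else 0.0

def cos_sim(u, v):
    """Cosine similarity of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def slot_similarity(hypernyms_i, hypernyms_j,
                    pred_vec_i, pred_vec_j,
                    dep_path_i, dep_path_j):
    """Slot-level connectivity (Eq 2): equal-weight sum of a hypernym-overlap
    indicator, predicate word-vector cosine, and a dependency-path indicator."""
    hyper_term = delta(len(set(hypernyms_i) & set(hypernyms_j)) > 0)
    pred_term = cos_sim(pred_vec_i, pred_vec_j)  # cos_p(i, j)
    dep_term = delta(dep_path_i == dep_path_j)
    return hyper_term + pred_term + dep_term
```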

Template and Slot Clustering Using Normalized Cut
Normalized cut aims to maximize the intra-class similarity while minimizing the inter-class similarity, which deals well with the connectivity between entities. We represent each entity as a point in a high-dimensional space. The edge weight between two points is their template level similarity (or slot level similarity). The larger the similarity value, the more likely the two entities (points) belong to the same template (or slot), which is also our basic intuition.
For simplicity, denote the entity set as E = {e_1, ..., e_|E|} and the template set as T. We use the |E| × |T| partition matrix X_T = [X_T,1, ..., X_T,|T|] to represent the template clustering result, where X_T,k is a binary indicator vector for template k. As usual, we define the degree matrix D_T by D_T(i, i) = Σ_j W_T(i, j), where W_T is the template level similarity matrix. Obviously, D_T is diagonal; it contains the weight sum of the edges attached to each vertex. Then, following (Shi and Malik, 2000), we obtain the template clustering optimization:

max_{X_T}  (1/|T|) Σ_{k=1}^{|T|} (X_T,k' W_T X_T,k) / (X_T,k' D_T X_T,k)  (4)
For the slot clustering, we have a similar optimization:

max_{X_S}  (1/|S|) Σ_{l=1}^{|S|} (X_S,l' W_S X_S,l) / (X_S,l' D_S X_S,l)  (5)

where S represents the slot set, W_S and D_S are the slot level similarity and degree matrices, and X_S = [X_S,1, ..., X_S,|S|] is the slot clustering result, with X_S,l a binary indicator vector for slot l.
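These optimizations are commonly approximated by the spectral relaxation of normalized cut. The sketch below (plain NumPy, with a deterministic farthest-point-initialized k-means for discretization) is one standard way to do this; it is not the paper's exact solver.

```python
import numpy as np

def normalized_cut_clusters(W, k):
    """Approximate k-way normalized cut via the spectral relaxation:
    take the k smallest eigenvectors of the normalized Laplacian,
    then discretize the embedding with a simple k-means step."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L_sym)                    # ascending eigenvalues
    U = eigvecs[:, :k]                                    # k smallest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)      # row-normalized embedding
    # deterministic farthest-point initialization, then Lloyd iterations
    centers = [U[0]]
    for _ in range(1, k):
        dists = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(50):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```

With W built from the template level (or slot level) similarities, the returned labels play the role of the partition matrix X_T (or X_S).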

Joint Model With Sentence Constraints
For event schema induction, we observe an important property, which we name the "sentence constraint": the entities in one sentence often belong to one template but different slots.
The sentence constraint comprises two types of constraints, the "template constraint" and the "slot constraint".
1. Template constraint: entities in the same sentence are usually in the same template. Hence the number of templates covered by a sentence should be as small as possible.
2. Slot constraint: entities in the same sentence are usually in different slots. Hence the number of slots covered by a sentence should be as large as possible.
Based on these considerations, we add extra terms to the optimization objective. Let N_sentence be the number of sentences, and define the N_sentence × |E| sentence constraint matrix J as follows: J(i, j) = 1 if entity j occurs in sentence i, and J(i, j) = 0 otherwise.

The product G_T = J X_T then represents the relation between sentences and templates: the (i, k)-th entry of G_T counts how many entities in sentence i belong to template T_k. Using G_T, we can construct our objective. To represent the two constraints, the best objective we have found is the trace value tr(G_T G_T'). Each diagonal entry of G_T G_T' is the square sum of the entries in the corresponding row of G_T, so for a fixed number of entities per sentence, the larger the trace value, the fewer templates each sentence takes. Hence we maximize tr(G_T G_T') to meet the template constraint. For the same reason, we minimize tr(G_S G_S'), where G_S = J X_S, to meet the slot constraint.

Combining Eq 4, Eq 5, and the two trace terms gives the whole joint model, shown in Eq 9. The detailed derivation is given in the supplementary file.
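The two trace terms can be computed directly from the matrices defined above; this sketch assumes dense NumPy arrays with 0/1 entries.

```python
import numpy as np

def sentence_constraint_terms(J, X_T, X_S):
    """Compute the two sentence-constraint trace terms.
    J is the N_sentence x |E| membership matrix (J[i, j] = 1 iff entity j
    occurs in sentence i); X_T and X_S are the binary template and slot
    assignment matrices. G_T[i, k] counts the entities of sentence i
    assigned to template k."""
    G_T = J @ X_T
    G_S = J @ X_S
    template_term = np.trace(G_T @ G_T.T)  # to maximize: few templates per sentence
    slot_term = np.trace(G_S @ G_S.T)      # to minimize: many distinct slots per sentence
    return template_term, slot_term
```

For a sentence with n entities, concentrating all n in one template gives a row square-sum of n^2, while spreading them over templates gives a smaller sum, which is exactly why maximizing the first trace enforces the template constraint.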

Dataset
In this paper, we use MUC-4 (Sundheim, 1991) as our dataset, the same as previous works (Chambers and Jurafsky, 2011; Chambers, 2013). We follow two standard evaluation approaches. The first maps each learned slot to the gold slot that gives the best score, regardless of which templates they belong to; we call this the slot-only mapping evaluation. The second maps each template t to the best gold template g, and limits the slot mapping so that only the slots under t can map to slots under g; we call this the strict template mapping evaluation. The slot-only mapping can result in higher scores since it is not constrained to preserve schema structure in the mapping.

Table 2: Slot-only mapping comparison to state-of-the-art unsupervised systems; "-SC" means without sentence constraint.

We compare our results with four previous works (Chambers and Jurafsky, 2011; Cheung, 2013; Chambers, 2013; Nguyen et al., 2015), as shown in Table 2 and Table 3. Our model outperforms all of the previous methods. The improvement in recall is due to the normalized cut criterion, which better exploits the inner connectivity between entities. The sentence constraint improves the result one step further.
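A simplified sketch of the slot-only mapping step, treating each slot as a set of extracted fillers (the real evaluation then scores precision and recall over all fillers; the helper names here are hypothetical):

```python
def f1(prec, rec):
    """Harmonic mean of precision and recall."""
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def slot_f1(predicted, gold):
    """F1 of one learned slot against one gold slot, both as filler sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    return f1(tp / len(predicted), tp / len(gold))

def best_mapping(learned_slots, gold_slots):
    """Slot-only mapping: map each learned slot to the gold slot with the
    highest F1, with no constraint on which template the gold slot is under."""
    mapping = {}
    for name, pred in learned_slots.items():
        best = max(gold_slots, key=lambda g: slot_f1(pred, gold_slots[g]))
        mapping[name] = best
    return mapping
```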
Note that after adding the sentence constraint, the slot-only performance increases a little, but the strict template mapping performance increases a lot, as shown in Table 3. This phenomenon can be explained by the following fact: we counted the entities that are assigned different templates or different slots by "Our Model-SC" and "Our Model". Of all 11465 entities, 2305 are assigned different templates by the two methods, while only 108 have different slots. This illustrates that the sentence constraint affects the assignment of templates much more than that of slots. Therefore, the sentence constraint leads to a large improvement in the strict mapping performance but only a small increase in the slot-only performance.

Table 3: Strict template mapping comparison to state-of-the-art unsupervised systems; "-SC" means without sentence constraint.

Related Works
The traditional information extraction task is to fill the event schema slots. Many slot filling algorithms require full information about the event schemas and a labeled corpus. Among them are rule-based methods (Rau et al., 1992; Chinchor et al., 1993), supervised learning methods (Baker et al., 1998; Chieu et al., 2003; Bunescu and Mooney, 2004; Patwardhan and Riloff, 2009; Maslennikov and Chua, 2007), bootstrapping methods (Yangarber et al., 2000), and cross-document inference methods (Ji and Grishman, 2008). There are also many semi-supervised solutions, which begin with unlabeled but clustered event-specific documents and extract common word patterns as extractors (Riloff and Schmelzenbach, 1998; Sudo et al., 2003; Riloff et al., 2005; Patwardhan and Riloff, 2007; Filatova et al., 2006; Surdeanu et al., 2006).

Other traditional information extraction tasks learn binary relations and atomic facts. Models can learn relations like "Jenny is married to Bob" from unlabeled data (Etzioni et al., 2008; Yates et al., 2007; Fader et al., 2011), perform ontology induction (a dog is an animal) and attribute extraction (dogs have tails) (Carlson et al., 2010a; Carlson et al., 2010b; Huang and Riloff, 2010; Van Durme and Pasca, 2008), or rely on predefined patterns (Hearst, 1992).

Shinyama and Sekine (2006) proposed an approach to learn templates from an unlabeled corpus. They use unrestricted relation discovery to discover relations in the corpus and to extract their fillers. Their constraints are that they need redundant documents and that their relations are binary over repeated named entities. (Chen et al., 2011) also extract binary relations using a generative model. Kasch and Oates (2010), Chambers and Jurafsky (2008), Chambers and Jurafsky (2009), and Balasubramanian et al. (2013) capture template-like knowledge from unlabeled text by large-scale learning of scripts and narrative schemas. However, their structures are limited to frequent topics in a large corpus.
Chambers and Jurafsky (2011) follow this idea, with the goal of characterizing a specific domain with limited data using a three-stage clustering algorithm.
There are also some state-of-the-art works using probabilistic graphical models (Chambers, 2013; Cheung, 2013; Nguyen et al., 2015). They use Gibbs sampling for inference and achieve good results.

Conclusion
This paper presented a joint entity-driven model that induces event schemas automatically. The model uses word embeddings as well as PMI to measure the inner connectivity of entities, and uses normalized cut for more accurate clustering. Finally, it applies the sentence constraint to learn templates and slots simultaneously. Experiments demonstrate the effectiveness of our model.