Zero-Shot Transfer Learning for Event Extraction

Most previous supervised event extraction methods have relied on features derived from manual annotations, and thus cannot be applied to new event types without extra annotation effort. We take a fresh look at event extraction and model it as a generic grounding problem: mapping each event mention to a specific type in a target event ontology. We design a transferable architecture of structural and compositional neural networks to jointly represent and map event mentions and types into a shared semantic space. Based on this new framework, we can select, for each event mention, the event type which is semantically closest in this space as its type. By leveraging manual annotations available for a small set of existing event types, our framework can be applied to new unseen event types without additional manual annotations. When tested on 23 unseen event types, our zero-shot framework, without manual annotations, achieved performance comparable to a supervised model trained from 3,000 sentences annotated with 500 event mentions.


Introduction
The goal of event extraction is to extract event triggers and arguments from unstructured data. An example is shown in Figure 1. Major obstacles to progress on event extraction have been the poor portability of traditional supervised methods and the limited coverage of available event annotations. Handling a new event type means starting from scratch, without being able to re-use annotations for old event types. The main reason is that these approaches model event extraction as a classification problem, assigning a type to a test event mention by measuring the similarity between the rich features encoded for it and those encoded for annotated event mentions. In these models, an event type (e.g., Transport-Person) or an argument role (e.g., Destination) is simply treated as an atomic symbol (i.e., a surface lexical form), so each new type requires new annotations, and it is not feasible to repeat this high-cost annotation process for each of 3,000+ event types.
In fact, many rich event ontologies have recently been developed, including FrameNet (Baker et al., 1998), VerbNet (Kipper et al., 2008), PropBank (Palmer et al., 2005) and OntoNotes (Pradhan et al., 2007), where each event type is associated with a set of predefined argument roles. We observe that both event mentions and event types can be represented with structures: an event mention structure is constructed from the trigger and its candidate arguments, while an event type structure consists of the event type and its predefined roles. Consider two example sentences:
E1. The Government of China has ruled Tibet since 1951 after dispatching troops to the Himalayan region in 1950.
E2. Iranian state television stated that the conflict between the Iranian police and the drug smugglers took place near the town of Mirjaveh.
E1, as shown in Figure 1, includes a Transport-Person event mention triggered by dispatching, and E2 includes an Attack event mention triggered by conflict. For each event mention, we apply Abstract Meaning Representation (AMR) (Banarescu et al., 2013) to identify candidate arguments and construct the event mention structure. Meanwhile, the two event types can also be represented with structures from ERE (Entity Relation Event) (Song et al., 2015), as shown in Figure 2. We can see that, besides the lexical semantics relating a trigger to its type, their structures also tend to be similar: a Transport-Person event typically involves a Person rather than an Artifact as its patient, while an Attack event involves a Person or Location as its Attacker. This observation echoes the theory that "the semantics of an event structure can be generalized and mapped to event mention structures in a systematic and predictable way" (Pustejovsky, 1991). Inspired by this theory, we take a fresh look at the event extraction task and model it as a "grounding" problem, mapping each mention to its semantically closest event type in the ontology. One possible implementation of this idea is Zero-Shot Learning (ZSL), which has been successfully exploited in visual object classification (Frome et al., 2013; Norouzi et al., 2013; Socher et al., 2013a). The main idea of applying ZSL to vision tasks is to represent both images and type labels in a multi-dimensional vector space, and then to learn a regression model that maps from the image semantic space to the type-label semantic space based on annotated images for seen labels. This regression model can then be used to predict the unseen label of any given image.
In this paper, we apply ZSL to event extraction.
Given an event ontology, where each event type is defined with a rich structure (e.g., argument roles), we call event types with annotated event mentions seen types, and those without annotations unseen types. Our goal is to effectively transfer knowledge of events from seen types to unseen types, so that we can extract event mentions of any type defined in the ontology. We design a transferable neural architecture that jointly learns and maps the structural representations of both event mentions and types into a shared semantic space by minimizing the distance between each event mention and its corresponding type. For event mentions of unseen types, their structures are projected into the same semantic space using the same framework and assigned the types with the top-ranked similarity values.
There are two appealing advantages of this new view, which also constitute our contributions:
• The mapping/ranking function is "universal" and independent of event types, so we can transfer resources from existing types to new types without any additional annotation effort.
• Many existing event ontologies cover a wide range of event types, which allows us to extend the scope of event extraction from several dozen types to thousands of types.

Overview
Event extraction aims to extract both triggers and arguments. Figure 3 illustrates the overall architecture of our approach for trigger typing; argument typing follows the same pipeline. Given a sentence s, we start by identifying candidate triggers and arguments based on AMR parsing (Wang et al., 2015b). An example is shown in Figure 1. For each trigger t, e.g., dispatch-01, we build a structure S_t using AMR, as shown in Figure 3. Each structure is composed of a set of tuples, e.g., ⟨dispatch-01, :ARG0, China⟩, and we use a CNN to generate the event mention structure representation V_St. Given a target event ontology, for each type y, e.g., Transport-Person, we construct a type structure S_y by incorporating its predefined roles, and use a tensor to denote the implicit relation between any type and its arguments. We compose the semantics of the type and each argument role with the tensor for each tuple, e.g., ⟨Transport-Person, Destination⟩, and generate the event type structure representation V_Sy using the same CNN. By minimizing the semantic distance between dispatch-01 and Transport-Person using V_St and V_Sy, we jointly map the representations of event mentions and event types into a shared semantic space, where each mention is closest to its annotated type.
After training, the compositional functions and CNNs can be further used to project any new event mention (e.g., donate-01) into the semantic space and find its closest event type (e.g., Donation).

Candidate Trigger and Argument Identification
Similar to Huang et al. (2016), we identify candidate triggers and arguments based on AMR parsing (Wang et al., 2015b), and apply the same word sense disambiguation (WSD) tool (Zhong and Ng, 2010) to disambiguate word senses and link each sense to OntoNotes, as shown in Figure 1. Given a sentence, we consider all noun and verb concepts that can be mapped to OntoNotes senses by WSD as candidate event triggers. In addition, concepts that match verbal or nominal lexical units in FrameNet are also considered candidate triggers. For each candidate trigger, its candidate arguments are specified by a subset of AMR relations, as shown in Table 1.

Structure Construction and Composition
As Figure 3 shows, for each candidate trigger t, we construct its event mention structure S t based on its candidate arguments and AMR parsing.
For each type y in the target event ontology, we construct a structure S_y by incorporating its predefined roles and taking the type as the root. Each S_t or S_y is composed of a collection of tuples. In an event mention structure, a tuple consists of two AMR concepts and an AMR relation; in an event type structure, a tuple consists of the type name and an argument role name. We propose two approaches to incorporate the semantics of relations into the two words of each tuple.
Event Mention Structure: For each tuple u = ⟨w_1, λ, w_2⟩ in an event mention structure, we use a matrix to represent each AMR relation, and compose the semantics of the AMR relation λ into the two concepts w_1 and w_2 as:

V_u = [V'_{w_1}; V'_{w_2}] = f([V_{w_1}; V_{w_2}] · M_λ)

where V_{w_1}, V_{w_2} ∈ R^d are the vector representations of words w_1 and w_2, d is the dimension of each word vector, and [ ; ] denotes the concatenation of two vectors. M_λ ∈ R^{2d×2d} is the matrix representation of AMR relation λ, and f is a non-linear activation. V_u is the composed representation of tuple u, which consists of the two updated vector representations V'_{w_1}, V'_{w_2} for w_1 and w_2 after incorporating the semantics of λ.
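This relation-matrix composition can be sketched in NumPy. The toy dimension, the tanh nonlinearity, and the random embeddings below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy word-vector dimension

def compose_tuple(V_w1, V_w2, M_rel):
    """Compose an AMR relation into the two concept vectors of a tuple:
    concatenate, multiply by the relation matrix, apply a nonlinearity,
    then split the result back into the two updated concept vectors."""
    concat = np.concatenate([V_w1, V_w2])   # shape (2d,)
    updated = np.tanh(concat @ M_rel)       # shape (2d,)
    return updated[:d], updated[d:]         # V'_w1, V'_w2

# hypothetical toy embeddings for the tuple <dispatch-01, :ARG0, China>
V_dispatch = rng.standard_normal(d)
V_china = rng.standard_normal(d)
M_arg0 = rng.standard_normal((2 * d, 2 * d)) * 0.1  # matrix for relation :ARG0

V_w1_new, V_w2_new = compose_tuple(V_dispatch, V_china, M_arg0)
print(V_w1_new.shape, V_w2_new.shape)  # (4,) (4,)
```

In training, one such matrix is learned per AMR relation type, shared across all tuples that use that relation.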
Event Type Structure: For each tuple u' = ⟨y, r⟩ in an event type structure, where y denotes the event type and r denotes an argument role, we follow Socher et al. (2013b) and assume an implicit, "universal" relation between any pair of type and argument role, represented by a single, expressive tensor:

V_{u'} = [V'_y; V'_r] = f([V_y; V_r]^T U^{[1:2d]} [V_y; V_r])

where V_y and V_r are the vector representations of y and r, and U^{[1:2d]} ∈ R^{2d×2d×2d} is a third-order tensor. V_{u'} is the composed representation of tuple u', which consists of the two updated vector representations V'_y, V'_r for y and r after incorporating the semantics of their implicit relation U^{[1:2d]}.
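A toy sketch of this bilinear tensor composition, in the style of the Socher et al. (2013b) neural tensor formulation; the dimensions, tanh nonlinearity, and random parameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy embedding dimension

def compose_type_tuple(V_y, V_r, U):
    """Bilinear tensor composition: the k-th tensor slice U[k] produces the
    k-th output unit as [V_y; V_r]^T U[k] [V_y; V_r], then a nonlinearity
    is applied and the result is split into updated type/role vectors."""
    v = np.concatenate([V_y, V_r])                    # shape (2d,)
    out = np.tanh(np.einsum('i,kij,j->k', v, U, v))   # shape (2d,)
    return out[:d], out[d:]                           # V'_y, V'_r

V_type = rng.standard_normal(d)   # e.g. Transport-Person
V_role = rng.standard_normal(d)   # e.g. Destination
U = rng.standard_normal((2 * d, 2 * d, 2 * d)) * 0.1  # third-order tensor

V_y_new, V_r_new = compose_type_tuple(V_type, V_role, U)
```

Unlike the per-relation matrices of the mention side, this single tensor is shared across all type-role pairs, which is what lets unseen types reuse it.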

Joint Event Mention and Type Label Embedding
CNNs are effective at capturing sentence-level information in various natural language processing tasks. In this work, we use a CNN to generate structure-level representations. For each event mention structure S_t = (u_1, u_2, ..., u_h) and each event type structure S_y = (u_1, u_2, ..., u_p), which contain h and p tuples respectively, we apply a weight-sharing CNN to each input structure to jointly learn event mention and type structural representations, which are later used to learn the ranking function for zero-shot event extraction.
Input layer: a sequence of tuples, ordered from top to bottom in the structure. Each tuple is represented by a d × 2 matrix, so each mention structure and each type structure is represented as a feature map of dimensionality d × 2h* and d × 2p* respectively, where h* and p* are the maximal numbers of tuples in event mention and type structures. We zero-pad on the right to make the volume of all input structures consistent.
Convolution layer: Take S_t with h* tuples u_1, u_2, ..., u_{h*} as an example. The input matrix of S_t is a feature map of dimensionality d × 2h*. Let c_i be the concatenated embeddings of n continuous columns of the feature map, where n is the filter width and 1 ≤ i ≤ 2h* − n + 1. A convolution operation applies a filter W ∈ R^{nd} to each sliding window c_i:

c'_i = f(W · c_i + b)

where c'_i is the new feature representation and b ∈ R^d is a bias vector. We set the filter width to 2 and the stride to 2, so the convolution operates on each tuple and its two input columns.
Max-pooling: All tuple representations c'_i are then max-pooled to produce the representation of the input structure.
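The padding, width-2/stride-2 convolution, and max-pooling steps can be sketched together in NumPy. The matrix-valued filter, tanh activation, and toy dimensions are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h_max = 4, 3   # toy vector size; maximal number of tuples h*

def structure_cnn(tuples, W, b, h_max):
    """Zero-pad the tuple sequence to h_max tuples (2*h_max columns),
    convolve each width-2 window with a shared filter at stride 2 (one
    window per tuple), then max-pool over tuples to get one vector."""
    cols = [c for u in tuples for c in u] \
         + [np.zeros(d)] * (2 * (h_max - len(tuples)))   # right zero-padding
    feats = []
    for i in range(0, 2 * h_max, 2):                     # stride 2
        window = np.concatenate([cols[i], cols[i + 1]])  # width-2 window, (2d,)
        feats.append(np.tanh(W @ window + b))            # per-tuple feature, (d,)
    return np.max(np.stack(feats), axis=0)               # max-pool -> (d,)

# toy structure with two composed tuples (each tuple = two d-dim columns)
tuples = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(2)]
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)
v_struct = structure_cnn(tuples, W, b, h_max)
print(v_struct.shape)  # (4,)
```

Because the same W and b encode both mention structures and type structures, the two ends up in a comparable representation space.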
Learning: For each event mention t, we call the correct type positive and all other types in the target event ontology negative. To train the composition functions and the CNN, we first consider the following hinge ranking loss:

L^1(t, y) = Σ_{j ∈ Y, j ≠ y} max{0, m − C_{t,y} + C_{t,j}}

where y is the positive event type for t, Y is the type set of the event ontology, j ranges over the negative event types for t in Y, and m is a margin. C_{t,y} denotes the cosine similarity between [V_t; V_St], the concatenation of the representations of t and S_t, and the corresponding representation of type y.
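A toy sketch of this hinge ranking loss over cosine similarities; the embeddings, margin value, and type names are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hinge_ranking_loss(v_mention, type_vecs, pos, m=0.5):
    """Sum of margin violations: the positive type's similarity to the
    mention should beat every negative type's similarity by at least m."""
    c_pos = cosine(v_mention, type_vecs[pos])
    return sum(max(0.0, m - c_pos + cosine(v_mention, v))
               for name, v in type_vecs.items() if name != pos)

rng = np.random.default_rng(3)
types = {name: rng.standard_normal(8)
         for name in ['Transport-Person', 'Attack', 'Meet']}
# a mention representation close to its annotated type
mention = types['Transport-Person'] + 0.1 * rng.standard_normal(8)
loss = hinge_ranking_loss(mention, types, pos='Transport-Person')
```

Minimizing this loss pushes each mention toward its annotated type and away from all other types in the shared space.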
The hinge loss is commonly used in zero-shot visual object classification. However, in our experiments it tends to overfit the seen types. While clever data augmentation can help alleviate overfitting, we adopt two strategies: (1) we add "negative" event mentions to the training process, where a "negative" event mention is one that has no positive event type among the seen types, i.e., it belongs to Other; (2) we design a new loss function L^1_d that, in addition to ranking the annotated type y above all other types in the seen type set Y' (a subset of the ontology's type set Y), penalizes a mention t belonging to Other whenever y', the type that ranks highest among all event types for t, still scores above the margin.
By minimizing L^1_d, we learn an optimized model that composes structure representations and maps both event mentions and types into a shared semantic space, where the positive type ranks highest for each mention.

Joint Event Argument and Role Embedding
For each mention, we map each candidate argument to a specific role based on the semantic similarity of the argument path. Take E1 as an example: China is matched to Agent based on the semantic similarity between dispatch-01 → :ARG0 → China and Transport-Person → Agent.
Given a trigger t and a candidate argument a, we first extract a path S_a = (u_1, u_2, ..., u_p) that connects t and a and consists of p tuples. Each predefined role r is also represented as a structure S_r = ⟨y, r⟩ that incorporates the trigger type. We apply the same framework, feeding the sequences of tuples in S_a and S_r into a weight-sharing CNN to rank all possible roles for a.
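The argument-path extraction can be illustrated with a toy AMR graph; the graph fragment, node names, and relation labels below are hypothetical stand-ins for real parser output:

```python
from collections import deque

def amr_path(graph, trigger, arg):
    """BFS over a toy AMR graph (node -> [(relation, child)]) to extract the
    tuple path S_a = (u_1, ..., u_p) connecting a trigger to an argument."""
    queue = deque([(trigger, [])])
    visited = {trigger}
    while queue:
        node, path = queue.popleft()
        if node == arg:
            return path
        for rel, child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append((child, path + [(node, rel, child)]))
    return None  # no path found

# hypothetical fragment of the AMR graph for sentence E1
graph = {
    'dispatch-01': [(':ARG0', 'China'), (':ARG1', 'troops')],
    'troops': [(':destination', 'region')],
}
print(amr_path(graph, 'dispatch-01', 'region'))
# [('dispatch-01', ':ARG1', 'troops'), ('troops', ':destination', 'region')]
```

Each tuple on the returned path is then composed and encoded exactly like the tuples of a mention structure.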
The training loss is analogous to L^1_d, where R_y and R_Y are the sets of argument roles predefined for trigger type y and for all seen types Y respectively, r is the annotated role, and r' is the role that ranks highest for a when a or y is annotated as Other.
In our experiments, we sample various sizes of "negative" training data for trigger and argument labeling respectively. Section 3.2 describes how the negative training instances are generated. We adopt a pipelined framework, training the models for trigger labeling and argument labeling separately.

Zero-Shot Classification
At test time, given a new event mention t', we compute its mention structure representation for S_t' and all event type structure representations for S_Y = {S_y1, S_y2, ..., S_yn} using the parameters trained on seen types. We then rank all event types by their similarity scores with mention t'. The top-ranked prediction for t' from the event type set, denoted y(t', 1), is given by:

y(t', 1) = argmax_{y ∈ Y} C_{t',y}

More generally, y(t', k) denotes the k-th most probable event type predicted for t'. We investigate event extraction performance based on the top-k predicted event types.
After determining the type y' for mention t', for each candidate argument we adopt the same ranking function to find the most appropriate role from the role set defined for y'.
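The test-time ranking can be sketched as follows; the toy vectors stand in for the CNN-composed joint representations, and the type names are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_types(v_mention, type_vecs, k=3):
    """Rank every ontology type by cosine similarity to the mention's
    representation and return the top-k type names."""
    ranked = sorted(type_vecs,
                    key=lambda y: cosine(v_mention, type_vecs[y]),
                    reverse=True)
    return ranked[:k]

rng = np.random.default_rng(4)
ontology = {name: rng.standard_normal(8)
            for name in ['Transport-Person', 'Attack', 'Meet', 'Donation']}
# a mention of an unseen type, represented near that type in the space
v_new = ontology['Donation'] + 0.01 * rng.standard_normal(8)
top = predict_types(v_new, ontology, k=2)
print(top)
```

Because the ranking function never looks at the type label itself, it applies unchanged to types that had no annotated mentions during training.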

Hyper-Parameters
We use an August 11, 2014 English Wikipedia dump to learn trigger sense and argument embeddings with the Continuous Skip-gram model (Mikolov et al., 2013). Table 2 shows the hyper-parameters used to train the models.

ACE Event Classification
We first use the ACE event schema as our target event ontology and assume the boundaries of triggers and arguments are given. Of the 33 ACE event types, we select the top-N most popular event types from the ACE05 data as seen types, using 90% of their event annotations for training and 10% for development, with N set to 1, 3, 5, and 10 respectively. We test zero-shot classification performance on the annotations for the remaining 23 unseen types. Table 3 shows the types selected for training in each experiment setting.
The negative event mentions and arguments that belong to Other are sampled from the output of the system developed by Huang et al. (2016) on the ACE05 training sentences, which groups all candidate triggers and arguments into clusters based on semantic representations and assigns a type/role name to each cluster. We sample the negative event mentions from the clusters (e.g., Build, Threaten) that cannot be mapped to ACE event types, and the negative arguments from the arguments associated with these negative event mentions. Table 4 shows the statistics of the training, development and testing data sets.
To show the effectiveness of structural similarity in our approach, we design a baseline, WSD-Embedding, which directly grounds event mentions and arguments to their candidate types and roles using our pre-trained word sense embeddings. Table 5 reports Hit@K performance on trigger and argument classification, and shows that structural similarity is much more effective than lexical similarity for both tasks. Moreover, as the number of seen types in training increases, the transfer model's performance improves.
We further evaluate the performance of our transfer approach on similar and distinct unseen types. The 33 subtypes defined in ACE fall under 8 coarse-grained main types, such as Life and Justice. Each subtype belongs to one main type, and subtypes of the same main type tend to have similar structures; for example, Trial-Hearing and Charge-Indict share the same set of argument roles. For training our transfer model, we select 4 subtypes of Justice: Arrest-Jail, Convict, Charge-Indict, and Execute. For testing, we select another 3 subtypes of Justice: Sentence, Appeal, and Release-Parole. We also select one subtype from each of the other seven main types for comparison. Table 6 shows that, when testing on a new unseen type, the more similar it is to the seen types, the better the performance.

ACE Event Identification & Classification
Considering that the ACE05 corpus includes the richest annotations for event extraction to date, to assess our transferable neural architecture on a large number of unseen types when trained on limited annotations of seen types, we construct a new event ontology that combines the 33 ACE event types and argument roles with 1,161 frames from FrameNet, excluding the most generic frames such as Entity and Locale. Some ACE event types align easily to frames, e.g., Die aligns with Death. Other frames are more accurately treated as inheritors of ACE types, such as Suicide-Attack, which inherits from Attack. We manually mapped the selected frames to ACE types. We compare our approach against the following supervised methods:
• LSTM: a long short-term memory neural network (Hochreiter and Schmidhuber, 1997) based on distributed semantic features, similar to Feng et al. (2016).
• Joint: A structured perceptron model based on symbolic semantic features (Li et al., 2013).
For our approach, we follow experiment setting D in Section 3.2, but target the 1,194 event types in our new event ontology. For evaluation, we sample 150 sentences from the remaining ACE05 data, containing 129 annotated event mentions of the 23 testing types. For both the LSTM and Joint approaches, we use all ACE05 annotated data for the 33 ACE event types for training, except for the held-out 150 evaluation sentences.
We identify candidate triggers and arguments with the approach in Section 2.2 and map each candidate trigger and argument to the target event ontology. We evaluate on the event mentions that are classified into the 23 testing ACE types. Table 7 shows the performance.
To further demonstrate the zero-shot learning ability of our framework and its significance for saving human annotation effort, we compare against the supervised LSTM approach, which achieved state-of-the-art performance on ACE event extraction (Feng et al., 2016). The training data of the LSTM contains 3,464 sentences with 905 annotated event mentions for the 23 testing event types. We divide these event annotations into 10 folds and successively add another 10% to the LSTM's training data. Figure 4 shows the learning curve. Without any annotated mentions of the 23 testing event types in its training set, our transfer learning approach achieves performance comparable to that of the LSTM trained on 3,000 sentences with 500 annotated event mentions. Analyzing the triggers that are annotated with ACE types but misclassified into incorrect types or frames, we observe that most errors occur among types defined under the same scenario. For example, in the sentence "Abby was a true water birth (3kg - normal) and with Fiona I was dragged out of the pool after the head crowned", birth should be a Being-Born event, but our approach misclassified it as Giving-Birth, because both Being-Born and Giving-Birth are defined for Birth-Scenario and have very similar predefined roles. For argument classification, our approach relies heavily on the semantics of the argument path and the argument concepts, but many argument roles, such as Entity and Organization, are not informative enough to be matched with argument concepts.

Event Extraction on New Types
Since Section 3.3 uses 1,194 event types as the target ontology, we further evaluate the performance of our approach on non-ACE types. From the test results of Section 3.3, we randomly sample 200 event mentions assigned non-ACE types and ask a linguistic expert to manually assess them. For each mention, the annotator sees its trigger, arguments, the source sentence, the frame and roles assigned by our approach, and the definition and examples of the frame from FrameNet. The annotator marks true or false by judging whether the type and argument roles are correct. Table 8 shows the performance.
Our approach can discover many new events that are not annotated in ACE. For example, in the sentence "15 dead as suicide bomber blasts student bus, Israel hits back in Gaza", blasts is correctly identified as an Explosion event. However, many triggers are mapped to the correct scenario but assigned incorrect types. For example, in the sentence "But Anwar's lawyers said they were filing a fresh request for bail pending a further appeal.", filing is identified as a trigger and mapped to the correct Bail-related scenario but misclassified as a Bail-Decision event. We find that, to determine the type of an event mention, besides the consistency between event mention and type structures, the trigger sense should also be consistent with the definition of the event type.

Impact of AMR
In our work, we use AMR parsing output to construct event structures. To assess the impact of the AMR parser (Wang et al., 2015a) on event extraction, we choose a subset of the ERE corpus that has perfect AMR annotations. We select the top-6 most popular event types (Arrest-Jail, Execute, Die, Meet, Sentence, Charge-Indict), with 548 manual annotations, as seen types. We sample 500 negative event mentions from distinct clusters generated by the system of Huang et al. (2016) on the ERE training sentences. We combine the annotated events of the seen types with the negative event mentions, using 90% for training and 10% for development. For evaluation, we select 200 sentences from the remaining ERE subset, containing 128 Attack event mentions and 40 Convict event mentions. Table 9 shows event extraction performance with perfect AMR and system AMR respectively.
Using the same data sets, we further evaluate the performance of our approach using different

Related Work
Most previous event extraction methods have been based on supervised learning with symbolic features (Ji and Grishman, 2008; Miwa et al., 2009; Liao and Grishman, 2010; Liu et al., 2010; Hong et al., 2011; McClosky et al., 2011; Riedel and McCallum, 2011; Chen and Ng, 2012; Li et al., 2013; Liu et al., 2016) or distributional features (Chen et al., 2015; Nguyen and Grishman, 2015; Feng et al., 2016; Nguyen et al., 2016) over large amounts of training data, regarding event types and argument roles as symbols. Such methods can achieve high quality for the given types, but cannot be applied to new types without annotation. In contrast, we provide a new way to view event extraction, modeling it as a "grounding" task that takes advantage of the rich semantics of event types. Some other IE paradigms, such as Open IE (Etzioni et al., 2005; Banko et al., 2007, 2008; Etzioni et al., 2011; Ritter et al., 2012), Preemptive IE (Shinyama and Sekine, 2006), On-demand IE (Sekine, 2006), Liberal IE (Huang et al., 2016, 2017), and semantic-frame-based event discovery (Kim et al., 2013), can discover many events without a predefined event schema; their event types and argument roles are inferred from clusters of similar events. These paradigms rely heavily on information redundancy, so they fail when the input consists of only a few sentences. Our work can discover events from a corpus of any size and is also complementary to these paradigms, since it can ground each event cluster to a rich predefined event ontology.
Zero-shot learning has been widely applied in visual object classification (Frome et al., 2013; Norouzi et al., 2013; Socher et al., 2013a), fine-grained name tagging (Ma et al., 2016; Qu et al., 2016) and relation extraction (Verga et al., 2016). Different from these tasks, the number of seen types in event extraction is limited: the most popular event schemas, such as ACE, define only 33 event types, while most visual object training sets contain more than 1,000 types. The methods proposed for zero-shot visual object classification therefore cannot be directly applied to event extraction because of overfitting, so we design a new loss function that creates "negative" training instances to avoid it.

Conclusions and Future Work
In this work, we take a fresh look at the event extraction task and model it as a grounding problem. We propose a transferable neural architecture that leverages existing human-constructed event schemas and manual annotations for a small set of seen types, and transfers knowledge from the existing types to the extraction of unseen types, improving the scalability of event extraction while saving human effort. Without any annotation, our approach achieves performance comparable to state-of-the-art supervised models trained on a large amount of labeled data. In the future, we will extend our framework by incorporating event definitions and argument descriptions to further improve event extraction performance.

Figure 1: Event Mention Example: dispatching is the trigger of a Transport-Person event with four arguments.

Figure 2: Examples of Event Mention and Type Structures from ERE.

Figure 4: Comparison between Our Approach with LSTM on 23 Testing Event Types.

Table 3: Seen Types in Each Experiment Setting.

Table 4: Statistics for Positive/Negative Instances in Training, Dev, and Test Sets for Each Experiment.

Table 6: Performance on Various Types Using Justice Subtypes for Training.

Table 7: Performance of Trigger and Argument Extraction on ACE Types (%).