A Two-stage Approach for Extending Event Detection to New Types via Neural Networks

We study the event detection problem in the new type extension setting. In particular, our task involves identifying the event instances of a target type that is only specified by a small set of seed instances in text. We want to exploit the large amount of training data available for the other event types to improve the performance of this task. We compare the convolutional neural network model and the feature-based method in this type extension setting to investigate their effectiveness. In addition, we propose a two-stage training algorithm for neural networks that effectively transfers knowledge from the other event types to the target type. The experimental results show that the proposed algorithm outperforms strong baselines for this task.


Introduction
Event detection (ED) is an important task of information extraction that seeks to locate instances of events of some types in text. Each event mention is associated with a phrase, the event trigger¹, which evokes that event. Our task, more precisely stated, involves identifying event triggers of some types of interest. For instance, in the sentence "A cameraman was shot in Texas today", an ED system should be able to recognize the word "shot" as a trigger for the event "Attack". ED is a crucial component in the overall task of event extraction, which also involves event argument discovery.
There have been two major approaches to ED in the literature. The first approach extensively leverages linguistic analysis and knowledge resources to capture the discrete structures for ED, focusing on the combination of various properties such as lexicon, syntax, and gazetteers. This feature-based approach has dominated ED research in the last decade (Ji and Grishman, 2008; Gupta and Ji, 2009; Liao and Grishman, 2011; McClosky et al., 2011; Li et al., 2013; Venugopal et al., 2014). The second approach, on the other hand, was proposed very recently and uses convolutional neural networks (CNN) to exploit the continuous representations of words. These continuous representations have been shown to effectively capture the underlying structures of a sentence, thereby significantly improving the performance for ED (Nguyen and Grishman, 2015; Chen et al., 2015).
¹ Most often a single verb or nominalization.
The previous research has mainly focused on building an ED system in a supervised setting. The performance of such systems strongly depends on a sufficient amount of labeled instances for each event type in the training data. Unfortunately, this setting does not reflect the real world situation very well. In practice, we often have a large amount of training data for some old event types but are interested in extracting instances of a new event type. The new event type is only specified by a small set of seed instances provided by clients (the event type extension setting). How can we effectively leverage the training data of old event types to facilitate the extraction of the new event type?
Inspired by the work on transfer learning and domain adaptation (Blitzer et al., 2006; Jiang and Zhai, 2007; Daume III, 2007; Jiang, 2009), in this paper, we systematically evaluate the representative methods (i.e., the feature-based model and the CNN model) for ED to gain an insight into which kind of method performs better in the new type extension setting. In addition, we propose a two-stage algorithm to train a CNN model that effectively learns and transfers the knowledge from the old event types for the extraction of the target type.
The experimental results show that this two-stage algorithm significantly outperforms the traditional methods in the type extension setting for ED and demonstrates the benefit of CNN in transfer learning. To our knowledge, this is the first work on the type extension setting as well as on transferring knowledge with neural networks for ED in natural language processing.

Task Definition
The event type extension setting in this work is as follows. We are given a document set D annotated for a large set D_A of trigger words (positive instances) of some event types (the auxiliary types, denoted by A). However, we are interested in extracting trigger words of a new event type T (the target type, T ∉ A) that is only specified by a small annotated set D_T of positive instances (the seeds) in D. Note that while D_A involves all the positive instances of the auxiliary types, D_T might only be partial and not necessarily include all the trigger words of type T in D.
Also, we call D_N the set of the negative instances generated from D under this setting (to be discussed in more detail later). In general, D_N might contain unannotated trigger words of T (false negatives), making this task more challenging. Eventually, our goal is to learn an event detector for T, leveraging the training data D_T, D_A and D_N for both the target and auxiliary types. Note that our work is related to Jiang (2009), who studies the relation type extension problem.

Models for Event Detection
In this section, we first present the representative approaches for ED. The two-stage algorithm will be discussed in the next section.
We treat the event detection problem for the target type T as a binary classification problem. For every token in a given sentence, we want to predict whether the current token is an event trigger of type T or not. The current token along with its context in the sentence constitutes an event trigger candidate, or an example in binary classification terms.

The Feature-based Model
In the feature-based model (denoted by FET), the event trigger candidates are first transformed into rich feature vectors to encapsulate linguistically useful properties for ED. These vectors are then fed into a statistical classifier such as maximum entropy (MaxEnt) and classified as the type T or not. In this work, we employ the feature set for ED from Li et al. (2013), which is the state-of-the-art FET.

The Convolutional Neural Networks
In a CNN for ED, we limit the context of the trigger candidates to a fixed window size by trimming longer sentences and padding shorter sentences with a special token when necessary. Let 2w + 1 be the fixed window size, and x = [x_{-w}, x_{-w+1}, ..., x_0, ..., x_{w-1}, x_w] be some trigger candidate where the current token is positioned in the middle of the window (token x_0). Before entering the CNN, each token x_i is transformed into a real-valued vector x_i by concatenating the continuous look-up vectors from the following tables:
1. Word Embedding Table E (Turian et al., 2010; Mikolov et al., 2013a; Mikolov et al., 2013b).
2. Position Embedding Table: to embed the relative distance i of x_i to the current token x_0.
3. Entity Type Embedding Table: to capture the entity type information for each token. Following Nguyen and Grishman (2015), we assign the entity type labels to each token using the heads of the entity mentions in x with the BIO schema.

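As a result, the original event trigger candidate x is transformed into a matrix x = [x_{-w}, x_{-w+1}, ..., x_0, ..., x_{w-1}, x_w] of the token vectors x_i. This matrix will serve as the input for the CNN.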
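The look-up step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary, the entity labels, the tiny table sizes, the padding token, the fallback of unknown words to the padding vector, and the helper name `build_input_matrix` are all hypothetical, and the tables are randomly initialized here rather than pre-trained.

```python
import numpy as np

PAD = "<pad>"  # hypothetical padding token

def build_input_matrix(tokens, entity_tags, center, w, word_ids, tag_ids,
                       word_table, pos_table, ent_table):
    """Return a (2w+1, dim) matrix for the candidate centered at `center`."""
    rows = []
    for i in range(-w, w + 1):
        j = center + i
        inside = 0 <= j < len(tokens)
        word = tokens[j] if inside else PAD        # pad beyond the sentence
        tag = entity_tags[j] if inside else "O"
        rows.append(np.concatenate([
            word_table[word_ids.get(word, word_ids[PAD])],  # word embedding
            pos_table[i + w],                                # relative distance i
            ent_table[tag_ids[tag]],                         # BIO entity type
        ]))
    return np.stack(rows)

rng = np.random.default_rng(0)
tokens = ["A", "cameraman", "was", "shot", "in", "Texas", "today"]
entity_tags = ["O", "B-PER", "O", "O", "O", "B-GPE", "O"]
word_ids = {t: k for k, t in enumerate([PAD] + tokens)}
tag_ids = {"O": 0, "B-PER": 1, "B-GPE": 2}
w = 2
word_table = rng.normal(size=(len(word_ids), 6))
pos_table = rng.normal(size=(2 * w + 1, 3))
ent_table = rng.normal(size=(len(tag_ids), 3))

# Candidate centered on "shot" (index 3).
x = build_input_matrix(tokens, entity_tags, 3, w, word_ids, tag_ids,
                       word_table, pos_table, ent_table)
print(x.shape)  # (5, 12)
```

Each row holds one token of the window, so the middle row corresponds to the current token x_0.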
For the CNN, the matrix x is first passed through a convolution layer and then a max pooling layer to compute the global representation vector R_C for the trigger candidate x (Nguyen and Grishman, 2015). In addition, we obtain the local representation vector R_L by concatenating the embedding vectors of the words in a window of size 2d + 1 around x_0, motivated by the models in Chen et al. (2015): R_L = [E[x_{-d}], ..., E[x_0], ..., E[x_d]]. Finally, the concatenation of the global and local vectors R_C and R_L is used as the input for a feed-forward neural network with a softmax layer at the end to perform trigger identification for T. Note that our CNN model is similar to that of Nguyen and Grishman (2015) and applies multiple window sizes for the feature maps in the convolution layer.
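A forward pass through this architecture might look like the sketch below. It is only an illustration under assumptions that the text does not state: the tanh non-linearity, the absence of bias terms, a single linear layer before the softmax, the toy dimensions, and the function names are all hypothetical, and the word embedding is assumed to occupy the first columns of each token vector.

```python
import numpy as np

def global_representation(x, filters):
    """Convolution over token windows followed by max pooling over positions.

    x: (n_tokens, dim) input matrix.
    filters: dict mapping window size k to an array of shape (n_maps, k * dim).
    """
    pooled = []
    for k, weight in sorted(filters.items()):
        # Each row is the concatenation of k consecutive token vectors.
        windows = np.stack([x[s:s + k].ravel() for s in range(len(x) - k + 1)])
        feature_maps = np.tanh(windows @ weight.T)   # (n_positions, n_maps)
        pooled.append(feature_maps.max(axis=0))      # max over positions
    return np.concatenate(pooled)

def local_representation(word_vectors, d):
    """Concatenate the word embeddings in the 2d+1 window around the center."""
    center = len(word_vectors) // 2
    return word_vectors[center - d:center + d + 1].ravel()

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
w, d, dim, word_dim, n_maps = 4, 1, 12, 6, 3
x = rng.normal(size=(2 * w + 1, dim))
word_vectors = x[:, :word_dim]   # assumed layout: word embedding comes first
filters = {k: rng.normal(size=(n_maps, k * dim)) for k in (2, 3)}

r_c = global_representation(x, filters)
r_l = local_representation(word_vectors, d)
features = np.concatenate([r_c, r_l])
out_weight = rng.normal(size=(2, features.size))
probs = softmax(out_weight @ features)   # [P(not T), P(T)]
print(r_c.shape, r_l.shape)
```

The global vector has one entry per feature map and window size, while the local vector grows with d and the word embedding dimension.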

The Baseline Systems
For each of the two models presented above (i.e., FET and CNN), we have two baseline mechanisms to train an event detector for T (Jiang, 2009). In the first baseline (denoted by TARGET), we use the small instance set D_T of the target type T together with the negative instances in D_N to train a binary classifier for T. In the second baseline (denoted by UNION), we combine the positive instances in both D_T and D_A as well as the negative instances in D_N to train a binary classifier for T.
Eventually, we have four baseline systems corresponding to the two choices of models (i.e., FET, CNN) and the two choices of training mechanisms (i.e., TARGET, UNION). We denote these four baselines by FET-TARGET, FET-UNION, CNN-TARGET, and CNN-UNION.
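The difference between the two mechanisms comes down to which instances are labeled positive. The sketch below shows this with toy data; the function name `make_training_set` and the example tokens are hypothetical.

```python
def make_training_set(mechanism, d_t, d_a, d_n):
    """Return (example, label) pairs for a binary detector of the target type."""
    if mechanism == "TARGET":
        positives = list(d_t)                 # seeds only
    elif mechanism == "UNION":
        positives = list(d_t) + list(d_a)     # seeds plus auxiliary triggers
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [(ex, 1) for ex in positives] + [(ex, 0) for ex in d_n]

d_t = ["shot", "fired"]                 # hypothetical seeds of the target type
d_a = ["arrested", "married", "met"]    # hypothetical auxiliary triggers
d_n = ["today", "Texas", "was", "in"]   # hypothetical negative tokens

target_set = make_training_set("TARGET", d_t, d_a, d_n)
union_set = make_training_set("UNION", d_t, d_a, d_n)
print(sum(label for _, label in target_set),
      sum(label for _, label in union_set))  # 2 5
```

Both mechanisms share the same negatives, so any model (FET or CNN) can be trained on either set without further changes.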

Hypothesis About the Baselines
The underlying assumption of transfer learning for type extension is the existence of general features that are effective for prediction across different types (Jiang, 2009). The performance of a model for a given target type thus depends on two factors: (i) how well the model identifies and quantifies the general features, and (ii) how effectively the model transfers the knowledge about the general features and adapts it to the target type.
Hypothesis: the UNION training mechanism is more effective than TARGET when the number of seed instances of the target type is small. The reason is that UNION includes the training data D_A of the auxiliary types, which provides more evidence to better estimate the importance of the general features (factor (i)).

The Two-stage Algorithm
Although UNION can help to learn the general features, its major limitation lies in the lack of a directing mechanism to make the model specific to the target type (factor (ii)). Essentially, UNION treats the positive instances of the target and auxiliary types similarly, making it more of a general-purpose event detector than a specific detector for the target type. Therefore, we propose to consider the positive instances of the target type (D_T) and the auxiliary types (D_A) in two separate stages.
In the first stage, the large amount of training data D_A of the auxiliary types is used by a CNN to learn the general feature extractors across event types. In the second stage, the seed instances of the target type in D_T are used to adapt the model to the target type. In order to transfer the knowledge from the auxiliary types to the target type between these two stages, we propose to utilize a CNN that facilitates the transferring process via weight initialization. The two-stage algorithm (CNN-2-STAGE) is presented below. Note that, similar to Stage I of the algorithm and previous work on neural networks (Nguyen and Grishman, 2015; Chen et al., 2015), the weight matrices and embedding tables are also initialized randomly in the training mechanisms UNION and TARGET. The only exception is the word embedding table, which is pre-trained on a large corpus for UNION, TARGET as well as Stage I.
All the weight matrices and embedding tables are optimized during training (for UNION, TARGET as well as CNN-2-STAGE) to achieve the optimal state. This is especially important in Stage II of CNN-2-STAGE as it helps to adapt the general feature extractors in Stage I to the target type T.
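The following sketch illustrates how weight initialization can carry a model from Stage I to Stage II, based only on the prose description above. To stay short and runnable it substitutes a logistic-regression classifier for the CNN, and it assumes that Stage I trains a binary detector separating the auxiliary triggers D_A from the negatives D_N. The function names, learning rate, epoch count, and synthetic data are hypothetical.

```python
import numpy as np

def train(features, labels, weights, epochs=200, lr=0.5):
    """Gradient descent on the logistic loss, starting from `weights`."""
    weights = weights.copy()
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-features @ weights))
        weights -= lr * features.T @ (probs - labels) / len(labels)
    return weights

def two_stage(x_aux, x_target, x_neg, rng):
    dim = x_neg.shape[1]
    # Stage I: random initialization, auxiliary positives vs. negatives.
    init = rng.normal(scale=0.01, size=dim)
    stage1 = train(np.vstack([x_aux, x_neg]),
                   np.r_[np.ones(len(x_aux)), np.zeros(len(x_neg))], init)
    # Stage II: start from the Stage I weights, target seeds vs. negatives.
    stage2 = train(np.vstack([x_target, x_neg]),
                   np.r_[np.ones(len(x_target)), np.zeros(len(x_neg))], stage1)
    return stage1, stage2

rng = np.random.default_rng(0)
x_aux = rng.normal(loc=1.0, size=(200, 4))    # many auxiliary triggers (D_A)
x_target = rng.normal(loc=1.0, size=(5, 4))   # a few seeds (D_T)
x_neg = rng.normal(loc=-1.0, size=(200, 4))   # negatives (D_N)
stage1, stage2 = two_stage(x_aux, x_target, x_neg, rng)
print(stage1.shape, stage2.shape)
```

The only link between the stages is the initial value passed to the second call of `train`; every weight remains trainable in Stage II, mirroring the statement that all parameters are optimized during training.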

Training
Following Nguyen and Grishman (2015), we train the NN models using stochastic gradient descent with shuffled mini-batches, the AdaDelta update rule, back-propagation and dropout. Finally, we rescale the weights whose l_2-norms exceed a predefined threshold.
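The rescaling step can be sketched as below. The text does not say at which granularity the norm is measured, so applying it per row of a weight matrix is an assumption, as is the function name `rescale_rows`.

```python
import numpy as np

def rescale_rows(weight, max_norm):
    """Scale down each row of `weight` whose l2-norm exceeds `max_norm`."""
    norms = np.linalg.norm(weight, axis=1, keepdims=True)
    # Rows within the limit get a factor of 1 and stay unchanged.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return weight * scale

weight = np.array([[3.0, 4.0],    # norm 5, above the threshold
                   [1.0, 2.0]])   # norm about 2.24, below the threshold
clipped = rescale_rows(weight, max_norm=3.0)
print(np.linalg.norm(clipped, axis=1))
```

In a training loop this would typically run after each parameter update, so the constraint holds throughout training.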

Parameters and Resources
For all the experiments below, we utilize the pre-trained word2vec word embeddings (300 dimensions) from Mikolov et al. (2013a) to initialize the word embedding table. The parameters for the CNN and for training the network are inherited from the previous studies, i.e., the fixed window size w = 15, the set of window sizes for feature maps = {2, 3, 4, 5}, 150 feature maps for each window size, 50 dimensions for all the embedding tables (except the word embedding table), the dropout rate = 0.5, the mini-batch size = 50, the threshold for the l_2-norms = 3 and the window for local context d = 5 (Nguyen and Grishman, 2015; Chen et al., 2015).
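For reference, these settings can be collected in one place, together with the vector sizes they imply. The class and property names are hypothetical, and two readings are assumed: that 15 is the value of w (so a candidate spans 2w + 1 tokens), and that the local vector concatenates word embeddings only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    w: int = 15                            # context window parameter
    filter_windows: tuple = (2, 3, 4, 5)   # window sizes for feature maps
    maps_per_window: int = 150
    word_dim: int = 300                    # word2vec embeddings
    other_dim: int = 50                    # position and entity type tables
    dropout: float = 0.5
    batch_size: int = 50
    max_l2_norm: float = 3.0
    d: int = 5                             # local context window

    @property
    def tokens_per_candidate(self):
        return 2 * self.w + 1

    @property
    def token_dim(self):
        # word + position + entity type embeddings
        return self.word_dim + 2 * self.other_dim

    @property
    def global_dim(self):
        return len(self.filter_windows) * self.maps_per_window

    @property
    def local_dim(self):
        return (2 * self.d + 1) * self.word_dim

hp = Hyperparameters()
print(hp.tokens_per_candidate, hp.token_dim, hp.global_dim, hp.local_dim)
# 31 400 600 3300
```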

Dataset and Settings
Following the previous work (Li et al., 2013; Chen et al., 2015; Nguyen and Grishman, 2015), we consider the ED task of the 2005 Automatic Content Extraction (ACE) evaluation, which annotates 8 event types and 33 event subtypes. As the numbers of event mentions (triggers) for each subtype in ACE are small, in this work, we focus on the extraction of the event types: "Life", "Movement", "Transaction", "Business", "Conflict", "Contact", "Personnel", and "Justice". We remove the event triggers of types "Transaction" and "Business" due to their small numbers of occurrences, resulting in a dataset with six remaining event types (denoted from 1 to 6).
In the experiments, we use the same data split as in Li et al. (2013) with 40 newswire documents as a test set, 30 other documents as a development set and the 529 remaining documents as a training set. Note that the training documents correspond to our original dataset D above. Let P_i be the positive instance set of the type i in D (i = 1 to 6).
We take each event type i as the target type T and treat the other 5 types as the auxiliary types, constituting 6 sets of experiments. In each set of experiments for a target type i (T), we randomly select S positive instances of T for the seed set D_T (S = |D_T|) and treat the remaining target instances P_i \ D_T as negative. Note that this essentially introduces false negatives into the training data and makes the task more challenging.
In order to deal with false negatives, we remove all the sentences that do not contain any events in the original dataset D. In this way, we remove a large number of true negatives along with a fraction of the false negatives, leading to the reduced dataset D'. We do the experiments on D' with:
We note that Jiang (2009) uses a different setting in training where she removes all the remaining target instances P_i \ D_T directly. In our opinion, this is unrealistic as it assumes knowledge of the labels of the instances in P_i \ D_T, while we are only provided with the labels of the seed set D_T in practice.
Finally, similar to Jiang (2009), we remove the positive instances of the auxiliary event types from the test set to concentrate on the classification accuracy for the target type. We also remove all the positive instances of the target type in the development set to make it more realistic.
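The construction of the training sets for one target type can be sketched as follows. The data layout (sentences as lists of token and event type pairs), the function name `prepare_training_data`, and the toy sentences are hypothetical; the sketch only mirrors the steps described above, namely dropping event-free sentences, sampling S seeds, and treating the remaining target triggers as negatives.

```python
import random

def prepare_training_data(sentences, target_type, num_seeds, seed=0):
    """sentences: list of sentences, each a list of (token, event_type or None)."""
    # Drop sentences without any annotated event (the reduced dataset).
    kept = [s for s in sentences if any(t is not None for _, t in s)]
    instances = [(si, ti, etype) for si, s in enumerate(kept)
                 for ti, (_, etype) in enumerate(s)]
    target = [(si, ti) for si, ti, etype in instances if etype == target_type]
    rng = random.Random(seed)
    d_t = set(rng.sample(target, min(num_seeds, len(target))))
    d_a = {(si, ti) for si, ti, etype in instances
           if etype is not None and etype != target_type}
    # Everything else is negative, including the unselected target
    # triggers, which become false negatives.
    d_n = {(si, ti) for si, ti, _ in instances} - d_t - d_a
    return kept, d_t, d_a, d_n

sentences = [
    [("He", None), ("was", None), ("shot", "Conflict")],
    [("They", None), ("fought", "Conflict"), ("and", None), ("married", "Life")],
    [("It", None), ("rained", None)],
]
kept, d_t, d_a, d_n = prepare_training_data(sentences, "Conflict", num_seeds=1)
print(len(kept), len(d_t), len(d_a), len(d_n))  # 2 1 1 5
```

With one seed, the other Conflict trigger ends up among the negatives, which is exactly the false-negative effect discussed in this section.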

Evaluation
This section compares the four baseline models in Section 4.1 with the proposed two-stage model CNN-2-STAGE. For completeness, we also evaluate the transfer learning model in Jiang (2009), adapted to the event type extension task (called JIANG). For JIANG, we apply the automatic feature separation method, as the general syntactic patterns and type constraints for relations in Jiang (2009) are not applicable to our ED task. For each described model, we perform the six sets of experiments in Section 6.2, where the number of seed instances |D_T| is varied from 0 to 150. We then report the average F-scores of the six experiment sets for each value of S. Figure 1 shows the curves.
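The reported number for each value of S is an average over the six target types. A minimal sketch of that computation is given below, assuming the F-score is the balanced F1 measure; the function names and the counts are made up for illustration.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1) from raw counts; 0.0 when undefined."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    if predicted == 0 or gold == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / gold
    return 2 * precision * recall / (precision + recall)

def average_f_score(counts_per_target):
    """Unweighted mean of the per-target F-scores."""
    scores = [f_score(*counts) for counts in counts_per_target]
    return sum(scores) / len(scores)

# Made-up (TP, FP, FN) counts for three target types.
counts = [(8, 2, 2), (5, 5, 0), (0, 3, 4)]
print(round(average_f_score(counts), 4))  # 0.4889
```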
Assuming the same kind of model (i.e., either FET or CNN), we see that UNION is better than TARGET when |D_T| is small, confirming our hypothesis in Section 4.2. This demonstrates the benefit of UNION and the training data D_A of the auxiliary types when there are not enough training instances for T. However, when we are provided with more seed instances for the target type (i.e., |D_T| becomes larger), TARGET turns out to be significantly better than UNION.

Table 2: Examples for the trigger words with the latent semantic. The trigger words are underlined.
    . . . and two Israeli soldiers were wounded, one critically.
    Witnesses said the soldiers responded by firing tear gas and rubber bullets, which led to ten demonstrators being injured.
    John Hinckley attempted to assassinate Ronald Reagan.
Justice:
    Since May, Russia has jailed over 20 suspected terrorists without a trial.
    A judicial source said today, Friday, that five Croatians were arrested last Tuesday during an operation . . .

We also observe that CNN outperforms FET in the TARGET mechanism. This is consistent with the previous studies for ED (Nguyen and Grishman, 2015). However, in the UNION mechanism, CNN is less effective than FET, suggesting that UNION is not a good mechanism to transfer knowledge in CNN.
We do not see much performance improvement of JIANG over FET-UNION. This can be explained by the lack of explicit linguistic guidance (i.e., the syntactic patterns and type constraints) for the general features in the event type extension task, which is crucial to the success of the model in Jiang (2009).
Finally and most importantly, we see that the two-stage model CNN-2-STAGE outperforms all the compared models regardless of |D_T|. The improvement is significant when |D_T| is greater than 50. These results suggest the effectiveness of the two-stage training algorithm in transferring knowledge from the auxiliary types to the target type for CNN.

Analysis
In order to further understand the systems on the separate event types, Table 1 presents the performance of the compared systems for the six experiment sets in Section 6.2 (corresponding to the 6 different choices of the event target type T in the dataset) when S is set to 100.
One of the most important observations from the table is that CNN-2-STAGE is significantly better than JIANG, CNN-TARGET and CNN-UNION on five target types (i.e., Y = {Movement, Personnel, Conflict, Life, Justice})³ and is only worse than CNN-TARGET on the Contact type. This raises a question about the distinction between Contact and the other event types in Y that affects the transferring effectiveness of CNN-2-STAGE. Also, what is the common feature of the event types in Y that helps CNN-2-STAGE successfully transfer knowledge between them?
The key insight of our system output analysis is the shared latent semantics among a large portion of trigger words of the four event types in Y \ {Movement}. In particular, all four event types in Y \ {Movement} include trigger words that induce some level of conflict between their subjects and objects. These conflicts are often manifested by some physical and irritating actions between the two engaged parties. Some examples of the trigger words with the latent semantics for the event types in Y \ {Movement} are given in Table 2. This latent semantics is first captured by word embeddings and the CNN in Stage I of CNN-2-STAGE, and then transferred to the target type in Stage II. The feature-based transfer learning systems like JIANG, on the other hand, cannot encode such latent semantics effectively as they rely on discrete features with the symbolic representation of words.
³ Although it is less pronounced for Justice.
Table 3: Event types, subtypes and the most frequent trigger words.
In the ACE 2005 corpus, the event type Movement has only one subtype, Transport, which mainly focuses on the transportation of weapons, vehicles or people. The context of the trigger words of the subtype Transport often involves military or struggle-related objects such as soldiers, Iraq, forces etc. These context words are similar to those of the trigger words of the types Conflict and Life. As a result, the CNN-2-STAGE algorithm can learn these general features from the trigger words of Conflict and Life, and then transfer them to improve the extraction of Movement.
Finally, a Contact event occurs when two or more entities engage in discussion, either directly or remotely. The purpose of such discussions is often information or opinion exchange rather than a means to express discussions or conflicts with irritating actions (as the event types in Y do). This divergence between Contact and Y leads to the poor quality of the general features learnt by the transfer learning methods (i.e., JIANG and CNN-2-STAGE), eventually degrading their performance. Some examples of the Contact event type are given below:
1. People can communicate with international friends without the hefty phone bills.
2. I'm chewing gum and talking on the phone while writing this note.
3. Mr. Erekat is due to travel to Washington to meet with US Secretary of State Madeleine Albright and other US officials . . .
In order to further demonstrate the difference between Contact and the other event types, Table 3 enumerates the event subtypes and the most frequent trigger words for each event. The event subtypes in Table 3 can be considered as the concepts or topics covered by the corresponding event types in the ACE 2005 corpus. As we can see from the

Related Work
The application of neural networks to event extraction (EE) is very recent. In particular, Zhou et al. (2014) and Boros et al. (2014) use neural networks to learn word embeddings from a corpus of specific domains and then directly utilize these embeddings as features in statistical classifiers. Chen et al. (2015) apply dynamic multi-pooling CNNs for EE in a pipelined framework, while Nguyen et al. (2016) propose joint event extraction using recurrent neural networks.
Finally, domain adaptation and transfer learning have been studied extensively for various NLP tasks, including part-of-speech tagging (Blitzer et al., 2006), name tagging (Daume III, 2007), parsing (McClosky et al., 2010), and relation extraction (Plank and Moschitti, 2013; Nguyen and Grishman, 2014; Nguyen et al., 2015a), to name a few. For event extraction, Miwa et al. (2013) study instance weighting and stacking models while Riedel and McCallum (2011b) examine joint models with domain adaptation. However, none of them studies the new type extension setting for ED using neural networks as we do.