Extensively Matching for Few-shot Learning Event Detection

Current event detection models under supervised learning settings fail to transfer to new event types. Few-shot learning has not been explored in event detection even though it allows a model to generalize well to new event types. In this work, we formulate event detection as a few-shot learning problem so that it can be extended to new event types. We propose two novel loss factors that match examples in the support set to provide more training signals to the model. Moreover, these training signals can be applied to many metric-based few-shot learning models. Our extensive experiments on the ACE-2005 dataset (under a few-shot learning setting) show that the proposed method can improve the performance of few-shot learning.


Introduction
Event Detection (ED) is an important task in Information Extraction (IE) in Natural Language Processing (NLP). ED is the task of detecting event triggers in a given text (e.g., a sentence) and classifying them into one of the event types of interest. The following sentence is an example of ED: In 1997, the company hired John D. Idol to take over as chief executive.
In this example, an ideal event detection system should detect the word hired as an event trigger and classify it into the class Personnel:Start-Position, assuming that Personnel:Start-Position is in the set of classes of interest.
Current work in ED typically employs traditional supervised learning based on feature engineering (Li et al., 2014; Chen et al., 2017) or neural networks (Nguyen et al., 2016a; Chen et al., 2018; Lu and Nguyen, 2018). The main problem with supervised learning models is that they cannot perform well on unseen classes (e.g., training a model to classify daily events and then running it to classify laboratory operations). As a result, supervised ED models cannot be extended to unseen event types. A trivial solution is to annotate more data for the unseen event types and then retrain the model on the newly annotated data. However, this is usually impractical because of the extremely high cost of annotation (Liu et al., 2019).
A human can learn a new concept with limited supervision, e.g., one can detect and classify events given only 3-5 examples (Grishman et al., 2005). This motivates the setting we aim for in event detection: few-shot learning (FSL). In FSL, a trained model rapidly learns a new concept from a few examples while retaining strong generalization from observed examples (Vinyals et al., 2016). Hence, if we need to extend event detection to a new domain, only a few examples are needed to deploy the system in that domain without retraining the model. By formulating ED as FSL, we can significantly reduce the annotation and training costs while maintaining highly accurate results.
In a few-shot learning iteration, the model is given a support set and a query instance. The support set consists of examples from a small set of classes. The model needs to predict the label of the query instance among the classes appearing in the support set. Typical methods employ a neural network to embed the samples into a low-dimensional vector space (Vinyals et al., 2016; Snell et al., 2017); classification is then done by matching those vectors based on vector distances (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018). One potential problem of prior FSL methods is that the model relies solely on training signals between the query instance and the support set (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018). Thus, the matching information among the samples within the support set has not been exploited. We believe this is an inefficient use of training data because datasets in ED are very small (Grishman et al., 2005). Therefore, in this study, we propose to train an ED model using matching information (1) between the query instance and the support set and (2) among the samples in the support set themselves. This is implemented by adding two auxiliary factors to the loss function to constrain the learning process.
We apply the proposed training signals to different FSL models on the benchmark event detection dataset (Grishman et al., 2005). The experiments show that the training signals can improve the performance of the examined FSL models. To summarize, our contributions in this work are as follows: • We formulate event detection as a few-shot learning problem to extend ED to new event types and provide a baseline for this new research direction. To the best of our knowledge, this branch of research has not been explored before.
• We propose two novel training signals for FSL. These signals remarkably improve the performance of existing FSL models. As they do not require any additional information (e.g., dependency trees or part-of-speech tags), they can be applied to any metric-based FSL model.

Related work
Early studies in event detection mainly addressed feature engineering for statistical models (Ahn, 2006; Ji and Grishman, 2008; Hong et al., 2011; Li et al., 2014, 2015), including semantic and syntactic features. Recently, due to advances in deep learning, many neural network architectures have been presented for ED, e.g., convolutional neural networks (CNN) (Chen et al., 2015; Nguyen and Grishman, 2015, 2016; Nguyen et al., 2016b), recurrent neural networks (RNN) (Liu et al., 2017; Chen et al., 2018; Nguyen et al., 2016a; Nguyen and Nguyen, 2018), and graph convolutional neural networks (GCN) (Nguyen and Grishman, 2018; Pouran Ben Veyseh et al., 2019). These methods formulate ED as a supervised learning problem, which usually fails to predict the labels of new event types. By transitioning from symbolic event types to descriptive event types in the form of bags of keywords (Bronstein et al., 2015; Peng et al., 2016; Lai and Nguyen, 2019), the adaptability of event detection can still be framed as a supervised learning problem. However, these studies have not examined FSL as we do in this work. One can also address this problem with zero-shot learning using data generated from abstract meaning representation (Huang et al., 2018) or with a two-stage pipeline (trigger identification and few-shot event classification) based on a dynamic memory network (Deng et al., 2020). A recent study has employed few-shot learning for event classification (Lai et al., 2020). Our work is similar in terms of formulation; however, we consider the broader setting of event detection, where the NULL event type is also included.
Few-shot learning was studied early in the literature (Thrun, 1996). Before the era of deep neural networks, FSL approaches focused on building generative models that can transfer priors across classes. However, these methods are hard to apply in real applications because they require a subject-dedicated design, such as for handwritten characters (Lake et al., 2013; Wong and Yuille, 2015). As a result, they cannot capture the nature of the distribution (Salimans et al., 2016). Later studies, based on deep neural networks, proposed metric learning to model the distribution of distances among classes (Koch et al., 2015), with many incremental improvements in distance functions such as cosine similarity (Vinyals et al., 2016), Euclidean distance (Snell et al., 2017), and learnable distance functions (Sung et al., 2018). Metric-based FSL has advantages in two dimensions. First, it is based on the well-studied theory of distance functions. Second, its simple architecture and training process encourage its application in practice. Recently, meta-learning with parameter-update strategies has also been proposed to enable models to learn quickly in a few training iterations (Santoro et al., 2016; Finn et al., 2017).

Methodology
Our goal in this work is to formulate ED as an FSL problem, which has not been done in prior work. To this end, this section is divided into three parts. In Section 3.1, we present the overall framework that formulates Event Detection as a Few-Shot Learning problem. Then, in Section 3.2, we present popular FSL models from prior work and common sentence encoders that have been widely used in ED. Finally, in Section 3.3, we present two novel regularization techniques to further improve the FSL model for ED.

Event Detection as Few-shot Learning
In few-shot learning, models learn to predict the label of a query instance x given a support set S (a set of well-classified instances) and a set of classes C that appear in the support set S. Prior studies in FSL employ the N-way K-shot setting, in which there are N clusters representing N classes, and each cluster contains K data points (i.e., examples).
However, this setting is designed for problems that do not involve the "NULL" class (e.g., image classification and event classification). In event detection, a system needs to predict whether a query instance is an event (a positive event type) or not (the negative event type, i.e., the "NULL" type) before the instance is further classified into one of the classes of interest. To this end, we propose to extend the N-way K-shot setting to an N+1-way K-shot setting. In this setting, the support set contains N clusters representing the N positive event types and 1 cluster representing the NULL event type. The support set is denoted as follows:

S = {(s_i^j, a_i^j, t_i) | i = 1, ..., N; j = 1, ..., K} ∪ {(s_null^j, a_null^j, t_null) | j = 1, ..., K}

where:
• {t_1, t_2, ..., t_N} is the set of positive labels, which indicate an event, and t_null is a special label for non-events
• (s_i^j, a_i^j, t_i) indicates that the a_i^j-th word of the sentence s_i^j is the trigger word of an event mention with event type t_i
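The N+1-way K-shot episode construction described above can be sketched as follows; the function and variable names are illustrative, not from the paper:

```python
import random

def sample_episode(data_by_type, positive_types, n_way, k_shot):
    """Sample one N+1-way K-shot support set.

    data_by_type maps an event type (or "NULL") to a list of
    (sentence, anchor_index) instances.
    """
    # Pick N positive event types for this episode.
    types = random.sample(positive_types, n_way)
    support = {t: random.sample(data_by_type[t], k_shot) for t in types}
    # The extra (N+1)-th cluster holds K non-event (NULL) instances.
    support["NULL"] = random.sample(data_by_type["NULL"], k_shot)
    return support
```

A query instance is then drawn from the same episode's types and classified against this support set.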

Framework
Following prior studies in FSL (Gao et al., 2019), we employ the metric-based FSL framework with three components: an instance encoder, a prototype encoder, and a classification module.

Instance Encoder
Given a sentence of L words {w_1, w_2, ..., w_L} and the event mention w_a, which is the a-th word of the sentence, we first map the discrete words into a continuous high-dimensional vector space for the neural network, using both pre-trained word embeddings and position embeddings, as follows: • To capture the syntax and semantics of the word itself, we map each word in the sentence to a single vector using pre-trained word embeddings, following previous studies in ED (Nguyen and Grishman, 2015). After this step, we derive a sequence of vectors {e_1, e_2, ..., e_L} where e_i ∈ R^u.
• To provide a sense of the relative position of a word with respect to the anchor word, we further use position embeddings. The relative distance i − a of the i-th word to the anchor (a-th) word is mapped to a single vector p_i ∈ R^v. We randomly initialize this embedding table and update it during training.
• Following previous work (Nguyen and Grishman, 2015), the final embedding m_i of a word w_i is derived by concatenating its word embedding and position embedding. Once we obtain the embeddings of the whole sentence E(s) = {m_1, m_2, ..., m_L}, we employ a neural network, denoted f, to encode an instance (s, a) of the anchor w_a in the context of sentence s into a single vector v = f(E(s), a). In this work, we consider the following neural network architectures for this encoding purpose: • Convolutional Neural Network (CNN) (Kim, 2014) encodes the sentence by convolution operations on k consecutive vectors representing k-grams. Following (Nguyen and Grishman, 2015), we use multiple kernel sizes k ∈ {2, 3, 4, 5} to cover the context, with 150 filters for each kernel size. To squeeze the information of the sentence, we apply max pooling on top of the convolution layer to obtain a pooled vector p. We also introduce local embeddings e_[a−w, a+w] with window size w = 2. We concatenate the pooled vector and the local embeddings and feed them through multiple dense layers to obtain the final representation.
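The CNN instance encoder above can be sketched in PyTorch as follows. The kernel sizes {2, 3, 4, 5}, 150 filters, and window w = 2 follow the text; the class name, dense-layer sizes, and border padding for the local embeddings are our own assumptions:

```python
import torch
import torch.nn as nn

class CNNInstanceEncoder(nn.Module):
    """Sketch of the CNN instance encoder f(E(s), a)."""

    def __init__(self, vocab_size=1000, word_dim=300, pos_dim=50,
                 max_len=64, n_filters=150, window=2, out_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Position embedding indexed by relative distance i - a,
        # shifted so that indices are non-negative.
        self.pos_emb = nn.Embedding(2 * max_len + 1, pos_dim)
        self.max_len = max_len
        in_dim = word_dim + pos_dim
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_filters, k) for k in (2, 3, 4, 5))
        self.window = window
        # pooled vector (4 kernel sizes x 150 filters) + local window
        dense_in = 4 * n_filters + (2 * window + 1) * in_dim
        self.dense = nn.Sequential(
            nn.Linear(dense_in, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, tokens, anchor):
        # tokens: [B, L] word ids; anchor: [B] anchor positions
        B, L = tokens.shape
        rel = torch.arange(L).unsqueeze(0) - anchor.unsqueeze(1)
        rel = (rel + self.max_len).clamp(0, 2 * self.max_len)
        m = torch.cat([self.word_emb(tokens), self.pos_emb(rel)], dim=-1)
        x = m.transpose(1, 2)                     # [B, in_dim, L]
        pooled = torch.cat(
            [c(x).max(dim=2).values for c in self.convs], dim=1)
        # Local embeddings e_[a-w, a+w], zero-padded at the borders.
        padded = nn.functional.pad(m, (0, 0, self.window, self.window))
        idx = anchor.unsqueeze(1) + torch.arange(2 * self.window + 1)
        local = padded[torch.arange(B).unsqueeze(1), idx].flatten(1)
        return self.dense(torch.cat([pooled, local], dim=1))
```

A forward pass takes a batch of token-id sequences plus anchor positions and returns one vector per instance.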

Prototype Encoder
This module computes a representative vector, called the prototype, for each class t ∈ T in the support set S from the vectors of its instances. We employ two variants of prototype computation.
The first version, proposed in the original Prototypical Network (Snell et al., 2017), considers all representation vectors to be equally important. To calculate the prototype of a class t_i, it aggregates the representation vectors v_i^j = f(E(s_i^j), a_i^j) of the instances of class t_i and averages them:

c_i = (1/K) Σ_{j=1..K} v_i^j    (1)

On the other hand, it has been argued that the importance of the supporting vectors is conditional on the query x = (q, p). Thus, the second version computes the prototype as a weighted sum of the supporting vectors, where the weights are obtained by an attention mechanism over the representation vector of the query:

c_i = Σ_{j=1..K} α_j v_i^j, with α_j = softmax_j(sum(tanh(v_i^j ⊙ f(E(q), p))))    (2)

where ⊙ denotes the element-wise product.
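The two prototype variants can be sketched as follows. The attentive version follows the instance-level attention of Gao et al. (2019), which we assume here; function names are illustrative:

```python
import torch

def average_prototype(support):
    """Eq. (1): support is a [K, d] tensor of one class's instance
    vectors; the prototype is their mean."""
    return support.mean(dim=0)

def attentive_prototype(support, query):
    """Eq. (2): weight each supporting vector by attention scores
    derived from its element-wise product with the query vector."""
    scores = torch.tanh(support * query).sum(dim=1)   # [K]
    alpha = torch.softmax(scores, dim=0)              # [K]
    return (alpha.unsqueeze(1) * support).sum(dim=0)  # [d]
```

The averaging variant corresponds to the Proto model, the attentive variant to Proto+Att.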

Classification Module
This module computes the distribution over all the event types T for a query instance x = (q, p) using a distance/similarity function d : R^d × R^d → R:

P(y = t_i | x, S) = exp(−d(f(E(q), p), c_i)) / Σ_j exp(−d(f(E(q), p), c_j))    (3)

where d is a distance/similarity function (for a similarity function, the sign of d in Equation (3) is flipped), and c_i and c_j are the prototype vectors obtained with either Equation (1) or Equation (2) from the support set S.
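A minimal sketch of this classification step, using the squared Euclidean distance of the Proto variant; the function name is illustrative:

```python
import torch

def classify(query_vec, prototypes):
    """Softmax over negative squared Euclidean distances between the
    query vector [d] and the prototype matrix [N+1, d]."""
    d = ((prototypes - query_vec) ** 2).sum(dim=1)  # [N+1]
    return torch.softmax(-d, dim=0)                 # [N+1] distribution
```

The predicted event type is the one whose prototype is closest to the query vector.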
In this paper, we examine three kinds of distance/similarity functions combined with the prototype modules to form four models, as follows: • Cosine similarity with averaging prototype, as in the Matching network (Vinyals et al., 2016).
• Euclidean distance with averaging prototype, as in the Prototypical network (Proto) (Snell et al., 2017), and Euclidean distance with attentive prototype (Proto+Att) (Gao et al., 2019).
• Learnable distance function with averaging prototype, as in the Relation network (Sung et al., 2018).

Training Objectives
In the literature, a metric-based FSL model is typically trained by minimizing the negative log-likelihood:

L_query(x, S) = − log P(y = t | x, S)    (4)

where x, t, and S are the query instance, the ground-truth label, and the support set, respectively. This loss function exploits the signal of matching information between the query instance and the supporting instances. It works efficiently in computer vision because the number of samples in computer vision datasets is typically huge. However, in NLP tasks the datasets are commonly much smaller (e.g., ACE 2005 contains 4,000 positive examples), so this loss function alone is not enough to deliver a good system. Therefore, providing more training signals is crucial for problems involving small datasets. Fortunately, the support set is a well-classified set of instances with K examples per class over a total of N+1 classes. In this paper, we propose two ways to exploit this resourceful set:
• Intra-cluster matching: We argue that the representation vectors in the same class should be close to each other. Therefore, we minimize the distances between instances of the same class:

L_intra(S) = Σ_i Σ_{j<k} d(v_i^j, v_i^k)    (5)
• Inter-cluster information: We also argue that the clusters should be distributed far away from each other, and hence their prototypes should be distant from one another. Therefore, we maximize the distances between pairs of prototypes:

L_inter(S) = − Σ_{i<j} d(c_i, c_j)    (6)
In this work, we train our model using a combination of the loss functions in Equations 4, 5, and 6. We control the contribution of the additional losses with two hyperparameters β and γ:

L(x, S) = L_query(x, S) + β L̃_intra(S) + γ L̃_inter(S)

where L̃_intra and L̃_inter are the losses scaled with respect to L_query, and β and γ are the trade-off parameters.
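The two auxiliary losses and their combination can be sketched as follows, using squared Euclidean distance. The distance-to-prototype form of the intra-cluster term and the margin-based hinge of the inter-cluster term are our own assumptions; the text only fixes the direction of optimization:

```python
import torch

def intra_cluster_loss(support):
    """support: [N, K, d]; pull instances of a class toward their
    class mean (a proxy for minimizing within-class distances)."""
    proto = support.mean(dim=1, keepdim=True)        # [N, 1, d]
    return ((support - proto) ** 2).sum(-1).mean()

def inter_cluster_loss(prototypes, margin=10.0):
    """prototypes: [N, d]; push prototypes apart, hinged at a margin
    so the loss does not diverge (margin value is an assumption)."""
    d = torch.cdist(prototypes, prototypes) ** 2     # [N, N]
    off = ~torch.eye(prototypes.size(0), dtype=torch.bool)
    return torch.clamp(margin - d[off], min=0).mean()

def total_loss(l_query, l_intra, l_inter, beta=0.1, gamma=0.1):
    """Combined objective with trade-off parameters beta and gamma."""
    return l_query + beta * l_intra + gamma * l_inter
```

Both auxiliary terms are computed on the support set only, so they add training signal without needing extra annotated data.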

Data
We use the ACE-2005 dataset to evaluate all of the models in this study. ACE-2005 is a benchmark dataset for event detection with 33 positive event subtypes, grouped into 8 event types: Business, Contact, Conflict, Justice, Life, Movement, Personnel, and Transaction. Although the dataset is split into training, development, and testing sets, we cannot use these splits directly because, in FSL, the sets of event types in the training and testing sets must be disjoint. Therefore, we further split the data to satisfy three conditions for FSL: • The set of event types in the training set T_train is disjoint from those in the development and testing sets. • In order to run FSL in the 10-way 10-shot setting, each set of event subtypes should contain at least 10 subtypes.
• The training set should contain as many samples as possible.
Based on these criteria, we use all samples belonging to the 4 event types Business, Contact, Conflict, and Justice as the training set, while the rest (Life, Movement, Personnel, and Transaction) are used for the development and testing sets. We split the samples at a 50:50 ratio within every subtype to keep the development and testing sets balanced. Finally, since some event subtypes have fewer than 15 examples, we eliminate them from the training, development, and testing sets.
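The dev/test split described above can be sketched as follows; the function name, the seed, and the shuffling step are our own assumptions:

```python
import random

def split_eval_sets(samples_by_subtype, min_count=15, seed=0):
    """Drop subtypes with fewer than min_count examples, then split
    each remaining subtype 50:50 into dev and test."""
    rng = random.Random(seed)
    dev, test = {}, {}
    for subtype, samples in samples_by_subtype.items():
        if len(samples) < min_count:
            continue  # too few examples for this subtype
        samples = samples[:]
        rng.shuffle(samples)
        half = len(samples) // 2
        dev[subtype], test[subtype] = samples[:half], samples[half:]
    return dev, test
```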

Hyper-parameters
We evaluate using the 5+1-way 5-shot and 10+1-way 10-shot FSL settings. Although it has been observed that a higher number of classes at training time yields better testing performance (Snell et al., 2017), we avoid feeding all event types in every training iteration; instead, we sample 20 positive classes (out of the 21 in the training set) in each training iteration.
We initialize the word embedding vectors with 300-dimensional GloVe embeddings trained on 6 billion tokens. We use 50-dimensional position embeddings, initialized randomly. These embedding vectors are updated during training.
We train Proto, Proto+Att, and Matching with the Stochastic Gradient Descent (SGD) optimizer, while Relation is trained with the AdaDelta optimizer because SGD hardly converges with the Relation network. The learning rate is initialized to 0.03 and decays every 500 iterations. We train our models for 2,500 iterations and evaluate every 200 iterations.

Result
In this section, we conduct our experiments in three steps: (1) find the best FSL models among the Proto, Proto+Att, Matching, and Relation models; (2) evaluate the proposed additional training factors; and (3) analyze the effectiveness of each training factor in an ablation study. Table 1 shows the F-scores of the four models using three kinds of sentence encoders on the ACE-2005 dataset under the 5+1-way 5-shot and 10+1-way 10-shot FSL settings without our proposed losses. As can be seen from Table 1, the performance of the models in the 5+1-way 5-shot setting is always better than in the 10+1-way 10-shot setting, because the number of classes to be classified in the 10+1-way setting is almost twice that of the 5+1-way setting. Second, we can see that the Prototypical-based models (Proto and Proto+Att) outperform the Matching network and the Relation network in both FSL settings. Among the Prototypical network models, Proto+Att is slightly better than Proto, with a 0.8% performance gap in the 10+1-way 10-shot setting.
Most importantly, Table 2 presents the F-scores of Proto and Proto+Att with the proposed loss functions (i.e., L_intra, L_inter). As we can see from the table, the proposed loss functions significantly improve the performance of the Proto and Proto+Att models across different encoders (i.e., CNN, LSTM, and GCN), clearly demonstrating the benefits of the intra- and inter-cluster similarity constraints in this work.

Ablation Study
In this study, we introduce two penalization factors, presented in Equations 5 and 6.
Besides the FSL formulation for event detection, a major contribution of this work is the two loss functions L_intra and L_inter, which improve the representation vectors of the models. To evaluate the contribution of these terms, Table 3 shows the performance of the FSL models with different combinations of the loss functions on the development set. In particular, we focus on the prototypical-based FSL models in the 5+1-way 5-shot setting in this analysis (although similar performance trends are also observed for the other models and settings). The "Original" column corresponds to the models where neither L_inter nor L_intra is applied. The other columns report the performance of the models when the L_inter, L_intra, and L_inter + L_intra combinations of the loss terms are introduced.
It is clear from the table that both loss terms are important for the FSL models for ED, as eliminating either of them significantly hurts the performance, except for the Proto+Att model with the GCN encoder. The best performance is achieved when both loss terms are applied together.

Conclusion
In this paper, we address the problem of extending event detection to unseen event types through few-shot learning. We investigate four metric-based few-shot learning models with different encoder types (CNN, LSTM, and GCN). Moreover, we propose two novel loss functions that provide more training signals to the model by exploiting the matching information in the support set. Our extensive experiments show that our method increases the efficiency of using training data, resulting in better classification performance. Our ablation study shows that both intra-cluster matching and inter-cluster matching contribute to the improvement.