Semi-supervised New Event Type Induction and Event Detection

Most previous event extraction studies assume that a set of target event types and corresponding event annotations are given, which could be very expensive. In this paper, we work on a new task of semi-supervised event type induction, aiming to automatically discover a set of unseen types from a given corpus by leveraging annotations available for a few seen types. We design a Semi-Supervised Vector Quantized Variational Autoencoder framework to automatically learn a discrete latent type representation for each seen and unseen type and optimize them using seen type event annotations. A variational autoencoder is further introduced to enforce the reconstruction of each event mention conditioned on its latent type distribution. Experiments show that our approach can not only achieve state-of-the-art performance on supervised event detection but also discover high-quality new event types. Our code is publicly available for research purposes at https://github.com/wilburOne/SSVQVAE.


Introduction
Event extraction is the task of automatically identifying and typing event trigger words (Event Detection) and extracting participants for each trigger (Argument Extraction) from natural language text. Traditional event extraction studies (Ji and Grishman, 2008; McClosky et al., 2011; Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016; Liu et al., 2018; Nguyen and Nguyen, 2019; Lin et al., 2020; Li et al., 2020) usually assume there exists a set of predefined event types and argument roles, so that supervised machine learning models, e.g., deep neural networks, can be employed to extract events for each type based on human annotations. However, in practice, it is usually very expensive and time-consuming to manually craft an event schema, which defines the types and complex templates of the expected events. Moreover, the coverage of manually crafted schemas is often very low, making them fail to generalize to new scenarios.
Recent studies have shown that it is possible to automatically induce an event schema from raw text. Some researchers explore probabilistic generative methods (Chambers, 2013; Nguyen et al., 2015; Yuan et al., 2018) or ad-hoc clustering-based algorithms to discover a set of event types and argument roles. Several studies (Lai and Nguyen, 2019) also explore zero-shot and few-shot learning approaches to leverage available resources and extend event extraction to new types. Generally, event schema induction can be divided into two steps: event type induction, which aims to discover a set of new event types for the given scenario, and argument role induction, which discovers a set of argument roles for each type. In this work, we focus on tackling the first problem only.
We propose the task of semi-supervised event type induction, shown in Figure 1, which aims to leverage available event annotations for a few types, called seen types, to automatically discover a set of new unseen types, as well as their corresponding event mentions. As a solution, we design a new Semi-supervised Vector Quantized Variational Autoencoder framework (SS-VQ-VAE for short), which first assigns a discrete latent type representation to each seen and unseen type, and then optimizes these representations while projecting each candidate trigger into a particular seen or unseen type. The candidate triggers are discovered with a heuristic approach.
Experiments under the setting of both supervised event detection and new event type induction demonstrate that our approach can not only detect event mentions for seen types with high precision, but also discover high-quality new unseen types.

Approach
[Figure 2: Overview of the SS-VQ-VAE framework on the example sentence "Ayman was arrested and was sentenced to life in prison.": BERT encoding produces a contextual trigger representation, a linear classifier scores it against the seen and unseen types, and seen type annotations supervise the training.]
As Figure 2 shows, given an input sentence, we first automatically discover all candidate triggers and encode each trigger into a contextual vector with a pre-trained BERT (Devlin et al., 2019) encoder. Then, we predict the type of each candidate trigger by looking up a dictionary of discrete latent representations of all seen and unseen types. Meanwhile, to prevent the type prediction from overfitting to the seen types, we apply a variational autoencoder (VAE) as a regularizer, which first projects each trigger into a latent variational embedding and then reconstructs the trigger conditioned on its type distribution.

Event Trigger Identification
Following prior work, we identify all candidate triggers based on word sense induction. Specifically, for each word, we disambiguate its senses and link each sense to OntoNotes (Hovy et al., 2006) using a word sense disambiguation system, IMS (Zhong and Ng, 2010). We consider all noun and verb concepts that can be mapped to OntoNotes senses as candidate triggers. In addition, the concepts that can be matched with verb or nominal lexical units in FrameNet (Baker et al., 1998) are also considered as candidate triggers.
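Since IMS and the OntoNotes sense inventory are external systems, the following is only a rough sketch of this filtering step, with NLTK's WordNet standing in for the OntoNotes and FrameNet sense lookups used in the paper:

```python
# Rough stand-in for the candidate-trigger heuristic: the paper uses the IMS
# word sense disambiguation system with OntoNotes senses plus FrameNet lexical
# units; here NLTK's WordNet substitutes for both sense inventories.
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet").
import nltk
from nltk.corpus import wordnet as wn

def candidate_triggers(sentence: str):
    """Return (index, token) pairs for nouns/verbs that carry a WordNet sense."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # Penn Treebank tags
    candidates = []
    for i, (tok, tag) in enumerate(tagged):
        pos = wn.NOUN if tag.startswith("NN") else wn.VERB if tag.startswith("VB") else None
        if pos and wn.synsets(tok, pos=pos):  # token has at least one sense
            candidates.append((i, tok))
    return candidates

print(candidate_triggers("Ayman was arrested and was sentenced to life in prison."))
# e.g., [(2, 'arrested'), (5, 'sentenced'), (7, 'life'), (9, 'prison')]
```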

Trigger Representation Learning
Given a sentence s = [w_1, ..., w_n], where we assume w_i is identified as a candidate trigger, we use a pre-trained BERT encoder to encode the whole sentence and obtain a contextual representation for w_i. If w_i is split into multiple subwords, we use the average of all subword vectors as the final trigger representation.
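A minimal sketch of this step with the Hugging Face transformers library (the helper below assumes a fast tokenizer so that word_ids() is available; torch.no_grad() is for illustration only, since BERT is fine-tuned during training):

```python
# Minimal sketch: contextual trigger representation = mean of the BERT subword
# vectors belonging to the trigger word.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModel.from_pretrained("bert-large-cased")

def trigger_representation(words: list[str], trigger_idx: int) -> torch.Tensor:
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():  # illustration only; the paper fine-tunes BERT
        hidden = model(**enc).last_hidden_state[0]   # (num_subwords, d)
    # word_ids() maps each subword position back to its source word index.
    positions = [i for i, wid in enumerate(enc.word_ids()) if wid == trigger_idx]
    return hidden[positions].mean(dim=0)             # (d,) = (1024,) for bert-large

words = "Ayman was arrested and was sentenced to life in prison .".split()
v_t = trigger_representation(words, trigger_idx=2)   # vector for "arrested"
```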

Event Type Prediction with Vector Quantization
To predict a type for a candidate trigger, an intuitive approach is to learn a classifier using the event annotations of seen types. However, since we also aim to discover a set of unseen types, for which no annotations exist, a classifier over the unseen types cannot be optimized this way.
To solve this problem, we employ a Vector Quantization (Gersho and Gray, 2012) strategy. We first define a discrete latent event type embedding space E ∈ R^{k×d}, where k is the number of candidate event types and d is the dimensionality of each type embedding e_i. Each e_i can be viewed as the centroid of the triggers belonging to the corresponding event type. For each seen type, we initialize its embedding with the contextual vector of a trigger randomly selected from the corresponding annotations. For each unseen type, we initialize the embedding with the contextual vector of a trigger randomly picked from all unannotated event mentions. Assuming there are m seen types, we arbitrarily assign E_{[1:m]} as their type representations.
Given a candidate trigger t and its contextual vector v_t, we first apply a linear encoder f_c(v_t) ∈ R^d to extract type-specific features. Then, we compute a type distribution y_t based on f_c(v_t) by looking up all the discrete latent event type embeddings with an inner-product operation:

$$y_t = \mathrm{Softmax}\left(f_c(v_t) \cdot E^\top\right) \quad (1)$$

The feature encoder f_c(·) is optimized using all event annotations for seen types (the cross-entropy term in Equation 2) and event mentions for unseen types (the second term in Equation 2):

$$\mathcal{L}_c = -\sum_{t \in D_s} \tilde{y}_t \cdot \log y^s_t \;+\; \sum_{t \in D_u} \max\left(0,\; \max(y^s_t) - \max(y^u_t) + 1\right) \quad (2)$$

where ỹ_t is the ground-truth label, D_s and D_u denote the sets of annotated event mentions for seen types and new event mentions for unseen types, and y^s_t and y^u_t are the type prediction scores for the seen and unseen types, respectively. The intuition behind the second term is that, for each new event mention, we do not know the correct type, but we do know that the type must be one of the unseen types, so we maximize the margin between the probability of the most likely unseen type and the highest probability among the (incorrect) seen types.
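The sketch below illustrates the type scoring and the classification loss; the hinge margin of 1.0 and the layer sizes are illustrative assumptions, not values from the paper:

```python
# Sketch of type scoring over the discrete latent type table E (k x d) and the
# classification loss: cross-entropy on seen-type annotations plus a hinge term
# on unannotated mentions. margin=1.0 is an assumption, not from the paper.
import torch
import torch.nn.functional as F

class TypePredictor(torch.nn.Module):
    def __init__(self, hidden_dim=1024, type_dim=512, k=34, m=10):
        super().__init__()
        self.f_c = torch.nn.Linear(hidden_dim, type_dim)       # feature encoder f_c
        self.E = torch.nn.Parameter(torch.randn(k, type_dim))  # latent type table E
        self.m = m                                             # rows [0:m] = seen types

    def forward(self, v_t):
        # Inner-product lookup against every type embedding, then softmax (Eq. 1).
        scores = self.f_c(v_t) @ self.E.t()                    # (batch, k)
        return F.softmax(scores, dim=-1)

def classification_loss(y_seen, gold, y_unseen, m, margin=1.0):
    # Seen-type mentions: cross-entropy against the gold labels.
    ce = F.nll_loss(torch.log(y_seen + 1e-12), gold)
    # Unseen mentions: the best unseen-type score should beat the best
    # seen-type score by the (assumed) margin -- the hinge term of Eq. 2.
    best_unseen = y_unseen[:, m:].max(dim=-1).values
    best_seen = y_unseen[:, :m].max(dim=-1).values
    hinge = torch.clamp(margin - best_unseen + best_seen, min=0).mean()
    return ce + hinge
```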
To optimize the type embeddings E, we follow the VQ objective (van den Oord et al., 2017) and use the ℓ2 error to move the type vector e_i towards the type-specific feature f_c(v_t) (the first term in Equation 3), where the embedding e_i for trigger t is selected according to y_t. To make sure f_c(·) commits to an embedding, we add a commitment loss (the second term in Equation 3):

$$\mathcal{L}_e = \left\| \mathrm{sg}\left[f_c(v_t)\right] - e_i \right\|^2_2 + \left\| f_c(v_t) - \mathrm{sg}\left[e_i\right] \right\|^2_2 \quad (3)$$

where sg stands for the stop-gradient operator, which makes its operand a non-updated constant: the output of sg equals its input in the forward pass, and its gradient is zero in the backward pass during training.
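In PyTorch, sg corresponds to .detach(); a minimal sketch of Equation 3:

```python
# Minimal sketch of the VQ objective (Eq. 3): sg[.] is implemented by .detach().
# The selected type embedding e_i is the row of E picked by the prediction y_t.
import torch

def vq_loss(f_t: torch.Tensor, E: torch.Tensor, y_t: torch.Tensor) -> torch.Tensor:
    """f_t: (batch, d) encoded features; E: (k, d) type table; y_t: (batch, k)."""
    e_i = E[y_t.argmax(dim=-1)]                       # (batch, d), chosen by y_t
    codebook = ((f_t.detach() - e_i) ** 2).sum(-1)    # moves e_i toward f_c(v_t)
    commit = ((f_t - e_i.detach()) ** 2).sum(-1)      # commits f_c(.) to e_i
    return (codebook + commit).mean()
```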

Variational Autoencoder as Regularizer
To prevent the type prediction from overfitting to the seen types, we employ a semi-supervised variational autoencoder as a regularizer. The intuition is that each event mention can be generated conditioned on a latent variational embedding z and its corresponding type distribution y, which is predicted by the approach described in Section 2.3.
We first describe the semi-supervised variational inference process. It consists of an inference network q(z|t), a posterior over the latent variable z given the trigger t, and a generative network p(t|z, y), which reconstructs the candidate trigger t from the latent variable z and the type information y. For each candidate trigger t with a human-annotated label y, the likelihood p(t, y) can be approximated by a variational lower bound

$$\log p(t, y) \geq \mathbb{E}_{q(z|t)}\left[\log p(t|y, z)\right] - \mathrm{KL}\left(q(z|t)\,\|\,p(z)\right) = -\mathcal{L}(t, y)$$

where log p(t|y, z) is the expected reconstruction of t conditioned on z and y, and p(z) is the Gaussian prior. For each unlabeled candidate trigger t, the likelihood p(t) is bounded by another variational lower bound

$$\log p(t) \geq \sum_{y} q(y|t)\left(-\mathcal{L}(t, y)\right) - \sum_{y} q(y|t) \log q(y|t) = -\mathcal{L}(t)$$

where q(y|t) is obtained from Equation 1.
For the model implementation, given a candidate trigger t and its contextual embedding v_t, we first pass it through an encoder f_e(v_t) to extract features. As we assume the latent variational embedding z_t follows a Gaussian distribution z_t ∼ N(µ_t, σ_t), we apply two linear functions to obtain a mean vector µ_t = f_µ(f_e(v_t)) and a variance vector σ_t = f_σ(f_e(v_t)). For decoding, we employ another linear function to reconstruct v_t from the concatenation of z_t and y_t: v̂_t = f_r([z_t : y_t]). We optimize the following objective for the semi-supervised VAE:

$$\mathcal{L}_{vae} = \sum_{t \in D_s} \mathcal{L}(t, \tilde{y}_t) + \sum_{t \in D_u} \mathcal{L}(t)$$

The overall loss function for optimizing the whole SS-VQ-VAE framework is

$$\mathcal{L} = \alpha \cdot \mathcal{L}_c + \beta \cdot \mathcal{L}_e + \gamma \cdot \mathcal{L}_{vae}$$

where α, β and γ are hyper-parameters to balance the three objectives.
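A sketch of the regularizer under illustrative assumptions: the layer sizes are arbitrary, mean squared error stands in for the reconstruction likelihood, and the entropy term for unlabeled mentions is omitted for brevity:

```python
# Sketch of the VAE regularizer: encode v_t to a Gaussian latent z_t, then
# reconstruct v_t from [z_t : y_t]. Layer sizes and the MSE reconstruction
# term are illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

class VAERegularizer(torch.nn.Module):
    def __init__(self, hidden_dim=1024, latent_dim=128, k=34):
        super().__init__()
        self.f_e = torch.nn.Linear(hidden_dim, hidden_dim)      # feature encoder f_e
        self.f_mu = torch.nn.Linear(hidden_dim, latent_dim)     # mean head f_mu
        self.f_sigma = torch.nn.Linear(hidden_dim, latent_dim)  # log-variance head f_sigma
        self.f_r = torch.nn.Linear(latent_dim + k, hidden_dim)  # reconstruction head f_r

    def forward(self, v_t, y_t):
        h = torch.relu(self.f_e(v_t))
        mu, logvar = self.f_mu(h), self.f_sigma(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        v_hat = self.f_r(torch.cat([z, y_t], dim=-1))             # reconstruct from [z : y]
        recon = F.mse_loss(v_hat, v_t, reduction="none").sum(-1)  # -log p(t|z,y), up to const.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL(q(z|t) || N(0, I))
        return (recon + kl).mean()
```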

Dataset
We perform experiments on the Automatic Content Extraction (ACE) 2005 dataset and evaluate our approach under two settings: (1) supervised event detection, where the target types include the 33 ACE predefined types plus Other, so k is set to 34. Given all candidate triggers, the goal is to correctly identify all ACE event mentions and classify them into the corresponding types. We follow the same data split as prior work (Li et al., 2013; Nguyen et al., 2016), in which 529/30/40 newswire documents are used for the training/dev/test sets.
(2) new event type induction, where we follow a previous study and use the top 10 most popular event types from the ACE05 data as seen types and the remaining 23 types as unseen. Given all ACE annotated event mentions, the goal of this task is to test whether the approach can automatically discover the remaining 23 unseen ACE types and categorize each candidate trigger into a particular seen or unseen type. In this experiment, k is set to 500.
In terms of implementation details, we use the pre-trained bert-large-cased model for fine-tuning.

Supervised Event Detection

Table 1 compares our approach with several baselines. We conduct an ablation study to test the impact of the VQ and VAE components: SS-VQ-VAE w/o VQ-VAE is optimized with only the classification loss (Equation 2), while SS-VQ-VAE w/o VAE is optimized with the classification loss (Equation 2) and the VQ objective (Equation 3). As we can see, BERT-based approaches generally outperform the methods using CNN, RNN or GRU encoders, and our approach achieves the state of the art among all methods. In particular, the recall of our approach is much higher than that of the other methods, which demonstrates the effectiveness of the trigger identification step: it narrows the learning space of the model. The ablation studies also confirm the effectiveness of the VQ and VAE components.

New Event Type Induction
For new event type induction, we compare our approach with an intuitive baseline, BERT-C-Kmeans, which takes the BERT-based trigger representations and groups all candidate triggers into clusters with Constrained K-means (Wagstaff et al., 2001), a semi-supervised clustering algorithm that forces all trigger candidates annotated with the same seen type into the same cluster. Table 2 reports the performance with several clustering metrics (Chen and Ji, 2010), which measure the agreement between the ground-truth class assignment and the system's unseen type prediction.
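The sketch below realizes the must-link constraint by collapsing each seen-type group into a single weighted point before running standard k-means; this enforces the constraint exactly for the k-means objective, though it is not necessarily the algorithm of Wagstaff et al. (2001):

```python
# Hedged sketch of the BERT-C-Kmeans baseline: must-link constraints ("all
# triggers of the same seen type share a cluster") are enforced by replacing
# each seen-type group with its centroid, weighted by group size.
import numpy as np
from sklearn.cluster import KMeans

def constrained_kmeans(X, seen_labels, k):
    """X: (n, d) trigger vectors; seen_labels: (n,) int, -1 for unannotated."""
    points, weights, owners = [], [], []
    for lab in sorted(set(seen_labels) - {-1}):         # one point per seen type
        idx = np.where(seen_labels == lab)[0]
        points.append(X[idx].mean(0)); weights.append(len(idx)); owners.append(idx)
    for i in np.where(seen_labels == -1)[0]:            # unannotated triggers
        points.append(X[i]); weights.append(1); owners.append(np.array([i]))
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    assign = km.fit_predict(np.stack(points), sample_weight=np.array(weights))
    out = np.empty(len(X), dtype=int)
    for a, idx in zip(assign, owners):                  # map back to triggers
        out[idx] = a
    return out
```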
Normalized Mutual Information is a normalization of the Mutual Information (MI) score that scales it to lie between 0 and 1:

$$\mathrm{NMI}(Y, C) = \frac{2 \cdot I(Y; C)}{H(Y) + H(C)}$$

where Y denotes the ground-truth class labels, C denotes the cluster labels, H(·) denotes the entropy function, and I(Y; C) is the mutual information between Y and C.

Fowlkes Mallows (Fowlkes and Mallows, 1983) evaluates the similarity between the clusters obtained from our approach and the ground-truth labels of the data:

$$\mathrm{FMI} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$

where TP (True Positive) is the number of data point pairs that are in the same cluster in both Y and C, FP (False Positive) is the number of pairs that are in the same cluster in Y but not in C, and FN (False Negative) is the number of pairs that are not in the same cluster in Y but are in the same cluster in C.

Completeness: a clustering result satisfies completeness if all members of a given class are assigned to the same cluster:

$$c = 1 - \frac{H(C|Y)}{H(C)}$$

where H(C|Y) is the conditional entropy of the clustering output given the class labels.

Homogeneity: a clustering result satisfies homogeneity if each cluster contains only data points that are members of a single class:

$$h = 1 - \frac{H(Y|C)}{H(Y)}$$

V-Measure (Rosenberg and Hirschberg, 2007) is the weighted harmonic mean of the homogeneity and completeness scores:

$$V_\beta = \frac{(1 + \beta) \cdot h \cdot c}{\beta \cdot h + c}$$

where h denotes the homogeneity score and c the completeness score.
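All five metrics are available in scikit-learn; a small sketch with illustrative label arrays:

```python
# The five clustering metrics, computed with scikit-learn on illustrative labels.
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth ACE types (illustrative)
y_pred = [0, 0, 1, 2, 2, 2]   # induced cluster assignments (illustrative)

print(metrics.normalized_mutual_info_score(y_true, y_pred))
print(metrics.fowlkes_mallows_score(y_true, y_pred))
print(metrics.completeness_score(y_true, y_pred))
print(metrics.homogeneity_score(y_true, y_pred))
print(metrics.v_measure_score(y_true, y_pred))  # beta = 1 by default
```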
As a qualitative analysis, we further pick 6 unseen ACE types and randomly select at most 100 event mentions for each type. We visualize their type distributions y using t-SNE. As Figure 3 shows, most of the event mentions annotated with the same ACE type tend to be predicted as the same new unseen type.
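A minimal sketch of this visualization step, assuming y is an (n, k) array of predicted type distributions and gold is an (n,) array of ACE type ids:

```python
# Minimal t-SNE visualization of predicted type distributions, colored by gold
# ACE type. Input arrays are assumptions for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_type_distributions(y: np.ndarray, gold: np.ndarray):
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(y)
    plt.scatter(coords[:, 0], coords[:, 1], c=gold, cmap="tab10", s=8)
    plt.title("t-SNE of predicted type distributions")
    plt.savefig("tsne_types.png", dpi=200)
```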

Related Work
Traditional event extraction studies (Ji and Grishman, 2008; McClosky et al., 2011; Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016; Liu et al., 2018; Nguyen and Nguyen, 2019; Lin et al., 2020; Li et al., 2020) assume that all the target event types and annotations are given. They can extract high-quality event mentions for the given types, but cannot extract mentions for any new types. Recent studies (Chan et al., 2019; Ferguson et al., 2018) leverage annotations for a few seen event types, or several keywords provided for the new types, to extract mentions for new types. However, all these studies assume that all the target types are given, which is very costly when moving to a new scenario.
Recent studies have also explored probabilistic generative methods (Chambers, 2013; Nguyen et al., 2015; Yuan et al., 2018) or ad-hoc clustering-based algorithms to automatically discover a set of event types as well as argument roles. Most of these studies are completely unsupervised and mainly rely on statistical patterns or semantic matching, while our work leverages the knowledge learned from available annotations to discover new event types.

Conclusion and Future Work
We have designed a semi-supervised vector quantized variational autoencoder approach that automatically learns a discrete representation for each seen and unseen type and predicts a type for each candidate trigger. Experiments show that our approach achieves the state of the art on supervised event detection and discovers a set of high-quality unseen types. In the future, we will extend this approach to argument role induction to discover complete event schemas.

Acknowledgement
This research is based upon work supported in part by U.S. DARPA KAIROS Program No. FA8750-19-2-1004, U.S. DARPA AIDA Program No. FA8750-18-2-0014 and Air Force No. FA8650-17-C-7715. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.