Improving Event Detection via Open-domain Trigger Knowledge

Event Detection (ED) is a fundamental task in automatically structuring texts. Due to the small scale of training data, previous methods perform poorly on unseen/sparsely labeled trigger words and are prone to overfitting densely labeled trigger words. To address the issue, we propose a novel Enrichment Knowledge Distillation (EKD) model that leverages external open-domain trigger knowledge to reduce the in-built biases to frequent trigger words in annotations. Experiments on the benchmark ACE2005 show that our model outperforms nine strong baselines and is especially effective for unseen/sparsely labeled trigger words. The source code is released at https://github.com/shuaiwa16/ekd.git.


Introduction
Event Detection (ED) aims at detecting trigger words in sentences and classifying them into predefined event types, which benefits numerous applications, such as summarization and reading comprehension (Huang et al., 2019). For instance, in S1 of Figure 1, ED aims to identify the word fire as the event trigger and classify its event type as Attack. Mainstream research (Chen et al., 2015; Liu et al., 2018b; Liao and Grishman, 2010b; Liu et al., 2018a) focuses on the second step, event type disambiguation, via lexical and contextual features. However, it is also crucial to identify trigger words correctly as the preliminary step.
Trigger word identification is a non-trivial task that suffers from the long tail issue. Take the benchmark ACE2005 as an example: trigger words with frequency less than 5 account for 78.2% of the total. The long tail issue makes supervised methods (Li et al., 2013; Yang et al., 2019) prone to overfitting and causes them to perform poorly on unseen/sparsely labeled triggers (Lu et al., 2019). Automatically generating more training instances seems to be a solution: expanding instances by bootstrapping (Ferguson et al., 2018; Zhang et al., 2019; Cao et al., 2019) or expanding data with distantly supervised methods (Wang et al., 2019a). However, the performance of these methods on unseen/sparsely labeled trigger words is still unsatisfactory, as shown in Table 1. We argue that these methods either lead to homogeneity of the generated corpus or are subject to the low coverage of the knowledge base. More importantly, the expanded data itself is unevenly distributed, and we cannot expect to alleviate the long tail problem with data that carries the same built-in bias.

[Figure 1: example sentences with Densely Labeled Triggers and Unseen/Sparsely Labeled Triggers. S1: Now we 're hearing the boom of Iraqi guns as they fire [Attack] towards our positions .]

* Corresponding author.
In this paper, we empower the model with external knowledge, called Open-Domain Trigger Knowledge, to provide extra semantic support on unseen/sparsely labeled trigger words and improve trigger identification. Open-Domain Trigger [...] Figure 1, open-domain trigger knowledge argues that exploded is the trigger word, while under the labeling rules of ACE2005, intifada is the trigger word. Specifically, we propose an Enrichment Knowledge Distillation (EKD) model to efficiently distill open-domain trigger knowledge from both labeled and abundant unlabeled corpora. We first apply a light-weight pipeline to equip unlabeled sentences with trigger knowledge from WordNet. The method is not limited to specific domains, and thus can guarantee the coverage of trigger words. Then, given the knowledge-enhanced data as well as ED annotations, we train a teacher model for better performance; meanwhile, a student model is trained to mimic the teacher's outputs using data without knowledge enhancement, which conforms to the distribution during inference. We further promote the generalization of the model by adding noise to the inputs of the student model. We evaluate our model on the ACE2005 ED benchmark. Our method surpasses nine strong baselines, and is especially effective for unseen/sparsely labeled trigger words. Experiments also show that the proposed EKD architecture is very flexible, and can be conveniently adapted to distill other knowledge, such as entity, syntactic and argument knowledge.

Table 1: F score on unseen/sparsely and densely labeled triggers. DMBERT (Chen et al., 2015) refers to a supervised-only model with dynamic multi-pooling to capture contextual features; BOOTSTRAP (He and Sun, 2017) expands training data via bootstrapping; DGBERT expands training data with Freebase.
Our contributions can be summarized as:
• To the best of our knowledge, we are the first to leverage the wealth of the open-domain trigger knowledge to improve ED.
• We propose a novel teacher-student model (EKD) that can learn from both labeled and unlabeled data, so as to improve ED performance by reducing the in-built biases in annotations.
• Experiments on benchmark ACE2005 show that our method surpasses nine strong baselines which are also enhanced with knowledge. Detailed studies show that our method can be conveniently adapted to distill other knowledge, such as entities.
Related Work

Event Detection
Traditional feature-based methods exploit both lexical and global features to detect events (Li et al., 2013). As neural networks became popular in NLP (Cao et al., 2018), data-driven methods adopted neural models such as DMCNN, DLRNN and PLMEE (Duan et al., 2017; Nguyen and Grishman, 2018; Yang et al., 2019) for end-to-end event detection.
Recently, weakly-supervised methods (Judea and Strube, 2016; Huang et al., 2017; Zeng et al., 2018) have been proposed to generate more labeled data. Gabbard et al. (2018) identify informative snippets of text to expand annotated data via curated training. Liao and Grishman (2010a) and Ferguson et al. (2018) rely on sophisticated pre-defined rules to bootstrap from parallel news streams. Wang et al. (2019a) limit the data range of adversarial learning to trigger words appearing in labeled data. Due to the long tail issue of labeled data and the homogeneity of the generated data, previous methods perform badly on unseen/sparsely labeled data and tend to overfit densely labeled data. With open-domain trigger knowledge, our model is able to perceive unseen/sparsely labeled trigger words from abundant unlabeled data, and thus successfully improves the recall of trigger words.

Knowledge Distillation
Knowledge Distillation, initially proposed by Hinton et al. (2015), has been widely adopted in NLP to distill external knowledge into the model (Laine and Aila, 2016; Saito et al., 2017; Ruder and Plank, 2018). The main idea is to have a student model learn from a robust pre-trained teacher model. Gong et al. (2018) reinforce the connection between the teacher and student models by singular value decomposition and Laplacian regularized least squares. Tarvainen and Valpola (2017) and Huang et al. (2018) stabilize the teacher model with a lazy-update mechanism so that the student model is less susceptible to external disturbances. Another line of work uses an adversarial imitation approach to enhance the learning procedure. Unlike previous methods that rely on golden annotations, our method is able to learn from pseudo labels and effectively extract knowledge from both labeled and unlabeled corpora.

Methodology
In this section, we introduce the proposed Enrichment Knowledge Distillation (EKD) model, which leverages open-domain trigger knowledge to improve ED. In general, we have a teacher model and a student model. The teacher is fully aware of open-domain trigger knowledge, while the student is not equipped with it. We make the student model imitate the teacher's predictions in order to distill the open-domain trigger knowledge into our model. Figure 2 illustrates the architecture of the proposed EKD model. During training, we first pre-train the teacher model on labeled data, and then force the student model, under the knowledge-absent situation, to generate pseudo labels as good as those of the teacher model on both labeled and unlabeled data. Increasing the cognitive gap between the teacher and student models forces the student model to learn harder. We first introduce how to collect the open-domain trigger knowledge in Knowledge Collection. We then illustrate how to exploit the labeled data to pre-train the teacher model in Feature Extraction and Event Prediction. Finally, we elaborate on how to force the student model to learn from the teacher model in Knowledge Distillation.

Given the labeled corpus L and the unlabeled corpus U, our goal is to jointly optimize two objectives: 1) maximize the prediction probability P(Y_i | S_i) on the labeled corpus L, and 2) minimize the discrepancy between the prediction probability of the teacher model P(Ŷ_k | S+_k) and that of the student model P(Ŷ_k | S−_k) on both L and U, where N_T stands for the total number of sentences in both labeled and unlabeled data. S+ and S− stand for the enhanced and weakened variants of the raw sentence S; we explain them in detail in Section 3.5. Y = {y_1, y_2, ..., y_n} stands for the golden event type labels, where each y ∈ Y belongs to the 33 event types pre-defined in ACE or to a "NEGATIVE" event type (Chen et al., 2015; Nguyen et al., 2016). Ŷ is the pseudo label proposed by the pre-trained teacher model.

Knowledge Collection
Open-domain trigger knowledge elaborates whether a word triggers an event from the perspective of word sense. Whether the trigger is densely labeled or unseen/sparsely labeled, open-domain trigger knowledge will identify them without distinction. For instance in S3 in Figure 1, although hacked is a rare word and has not been labeled, judging from word sense, open-domain trigger knowledge successfully identifies hacked as a trigger word.
We adopt a light-weight pipeline method, called Trigger From WordNet (TFW), to collect open-domain trigger knowledge (Araki and Mitamura, 2018).
TFW uses WordNet as the intermediary. It has two steps: 1) disambiguate each word into a WordNet sense, and 2) determine whether that sense triggers an event. For the first step, we adopt IMS (Zhong and Ng, 2010) to disambiguate words into word senses in WordNet (Miller et al., 1990). We obtain the input features with the POS tagger and dependency parser in Stanford CoreNLP (Manning et al., 2014). For the second step, we adopt the simple dictionary-lookup approach proposed in (Araki and Mitamura, 2018) to determine whether a sense triggers an event. TFW is not limited to particular domains, so the set of candidate triggers it can provide is not restricted in advance. With the support of the lexical database, TFW is highly efficient and can be applied to large-scale knowledge collection. Finally, we obtain a total of 733,848 annotated sentences from the New York Times (Sandhaus, 2008) corpus in the first half of 2007. The total number of triggers is 2.65 million, with an average of 3.6 triggers per sentence.

[Example sentence from a figure, S5: Troops were trying to break up stone-throwing protests, but not use live fire.]
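The two TFW steps can be illustrated with a minimal Python sketch. It is not the released pipeline: the word sense disambiguation step (IMS in the paper) is replaced by a toy lookup table, the sense identifiers are illustrative placeholders rather than real WordNet keys, and the names `toy_disambiguate`, `TOY_EVENT_SENSES` and `collect_triggers` are hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical toy resources. A real pipeline would call a WSD system such as IMS
# for step 1 and a sense-level event dictionary for step 2.
TOY_SENSE_OF = {
    ("fire", "VERB"): "fire.v.placeholder",
    ("hacked", "VERB"): "hack.v.placeholder",
    ("guns", "NOUN"): "gun.n.placeholder",
}
TOY_EVENT_SENSES: Set[str] = {"fire.v.placeholder", "hack.v.placeholder"}


def toy_disambiguate(token: str, pos: str) -> Optional[str]:
    """Step 1 (stub): map a token and its POS tag to a sense identifier."""
    return TOY_SENSE_OF.get((token.lower(), pos))


def collect_triggers(
    tagged_sentence: List[Tuple[str, str]],
    disambiguate: Callable[[str, str], Optional[str]] = toy_disambiguate,
    event_senses: Set[str] = TOY_EVENT_SENSES,
) -> List[int]:
    """Return the positions of tokens whose sense is listed as event-triggering."""
    triggers = []
    for index, (token, pos) in enumerate(tagged_sentence):
        sense = disambiguate(token, pos)                  # step 1: word -> sense
        if sense is not None and sense in event_senses:   # step 2: dictionary lookup
            triggers.append(index)
    return triggers


sentence = [("they", "PRON"), ("fire", "VERB"), ("guns", "NOUN")]
trigger_positions = collect_triggers(sentence)
```

Because step 2 is a plain set lookup, the cost per token is dominated by the disambiguation step, which is consistent with the efficiency argument above.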

Feature Extraction
We adopt BERT to obtain the hidden representation for both labeled and unlabeled sentences. BERT is a pre-trained language representation model that has achieved SOTA performance on a wide range of tasks, such as question answering and language inference. The powerful capability of BERT has also been demonstrated in the ED scenario (Wang et al., 2019a). Formally, given the raw sentence S and the knowledge-attending sentence S+, we feed each of them into BERT and adopt the sequence output of the last layer as the hidden representation for each word in S and S+.

Event Prediction
After obtaining the hidden representation of sentence S, we adopt a fully-connected layer to determine the event type Y for each word in sentence S. We use S(i) and Y(i) to denote the i-th training sentence and its event type in the labeled corpus L. We first transform the hidden representation H obtained from Section 3.3 into a result vector O, where O_ijc represents the score that the j-th word in S_i belongs to the c-th event class. We then normalize O by the softmax function to obtain the conditional probability.
Given the labeled corpus L, the optimization objective (Equation 4) is to maximize the probability of the golden labels under this conditional probability.
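The prediction step and the supervised objective can be sketched in plain Python as follows. This is a minimal illustration under the assumption that the objective is the usual negative log-likelihood of the golden labels; the function names are hypothetical, and a real implementation would operate on batched tensors.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize per-class scores of one word into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(sentence_scores: List[List[float]], gold_labels: List[int]) -> float:
    """Negative log-likelihood of the golden labels, summed over the words of a sentence.

    sentence_scores[j][c] plays the role of O_ijc for a fixed sentence i.
    """
    loss = 0.0
    for word_scores, gold in zip(sentence_scores, gold_labels):
        probs = softmax(word_scores)
        loss -= math.log(probs[gold])
    return loss


# Two words, 34 classes (33 event types plus NEGATIVE), uninformative scores.
uniform_scores = [[0.0] * 34, [0.0] * 34]
uniform_loss = supervised_loss(uniform_scores, [0, 5])
```

With uninformative scores every class receives probability 1/34, so the loss per word equals log 34; training lowers it by raising the score of the golden class.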

Knowledge Distillation
In this section, we distill open-domain trigger knowledge into our model. The main idea is to force the student model, with only raw texts as the input, to generate pseudo labels as good as those of the teacher model on both labeled and unlabeled data. Formally, given the golden event type Y, the objective compares p(Y | S+, θ) and p(Y | S−, θ), the predictions from the teacher and student model respectively.
We share the parameters of the teacher and student model. The input of the teacher model, S+, carries the open-domain trigger knowledge, while the input of the student model, S−, does not. We give the detailed construction process of S+ and S− below.
Knowledge-attending Sentences (S+) We embed the open-domain trigger knowledge into the sentence by a Marking Mechanism. Specifically, we introduce two symbols, named B-TRI and E-TRI, to mark the beginning and ending boundaries of triggers identified by open-domain trigger knowledge. Formally, given the raw sentence S = {w_1, w_2, ..., w_i, ..., w_n} and trigger w_i identified by open-domain trigger knowledge, the knowledge-attending sentence is S+ = {w_1, w_2, ..., B-TRI, w_i, E-TRI, ..., w_n}. The marking mechanism works well for our feature extractor BERT (Soares et al., 2019). It is very flexible in embedding knowledge, and can be conveniently adapted to other types of knowledge without heavy engineering work.
Note that the newly added symbols lack pre-trained embeddings in BERT. Random initialization undermines the semantic meaning of the introduced symbols, where B-TRI indicates the beginning of a trigger and E-TRI indicates the ending. We address the issue by fine-tuning BERT on the annotated sentences in Section 3.2. Specifically, we adopt the Masked LM task (Devlin et al., 2018) to exploit surrounding words to learn the semantic representation of the introduced symbols (B-TRI and E-TRI), based on the Harris distributional hypothesis (Harris, 1954). The mask word rate is set to 0.15, and the accuracy on masked words reaches 92.3% after fine-tuning.
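The construction of S+ can be sketched as a small Python function. The function name `mark_triggers` is hypothetical, and the sketch assumes single-token triggers given as positions in the raw sentence.

```python
from typing import Iterable, List


def mark_triggers(
    tokens: List[str],
    trigger_indices: Iterable[int],
    begin: str = "B-TRI",
    end: str = "E-TRI",
) -> List[str]:
    """Build the knowledge-attending sentence S+ by wrapping each trigger in markers."""
    triggers = set(trigger_indices)
    marked: List[str] = []
    for index, token in enumerate(tokens):
        if index in triggers:
            marked.extend([begin, token, end])  # B-TRI w_i E-TRI
        else:
            marked.append(token)
    return marked


raw = ["they", "fire", "towards", "our", "positions"]
knowledge_attending = mark_triggers(raw, [1])
```

In a BERT-based implementation the two marker strings would also have to be registered as new vocabulary entries, which is why their embeddings need the fine-tuning step described above.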
Knowledge-absent Sentences (S−) To make the student model learn harder from the teacher model, we further disturb the input of the student model by randomly masking out triggers identified by open-domain trigger knowledge. In this way, the student model has to judge the event type of a trigger word solely based on the surrounding context. Formally, given the raw sentence S = {w_1, w_2, ..., w_i, ..., w_n} and trigger w_i identified by open-domain trigger knowledge, the knowledge-absent sentence is S− = {w_1, w_2, ..., [MASK], ..., w_n}. The masked words are not sampled from the whole sentence but only from the triggers determined by open-domain trigger knowledge, which prevents the model from being optimized only for the non-trigger negative class.
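A minimal Python sketch of the S− construction follows. The masking probability `mask_prob` is an assumption introduced for illustration (the text does not state a rate), and the function name `mask_triggers` is hypothetical.

```python
import random
from typing import Iterable, List, Optional


def mask_triggers(
    tokens: List[str],
    trigger_indices: Iterable[int],
    mask_prob: float = 1.0,           # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Build the knowledge-absent sentence S- by masking only candidate triggers."""
    rng = rng or random.Random()
    triggers = set(trigger_indices)
    masked = []
    for index, token in enumerate(tokens):
        # Non-trigger words are never masked; triggers are masked with mask_prob.
        if index in triggers and rng.random() < mask_prob:
            masked.append(mask_token)
        else:
            masked.append(token)
    return masked


raw = ["they", "fire", "towards", "our", "positions"]
knowledge_absent = mask_triggers(raw, [1], mask_prob=1.0, rng=random.Random(0))
```

Restricting the mask to trigger positions keeps the length and word alignment of S− identical to the raw sentence, so predictions can be compared position by position.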

KL-divergence Loss
We move the added symbols to the end of the sentence to ensure strict alignment of words in S+ and S−, and then we minimize the discrepancy between the conditional probabilities p(Y | S−, θ) and p(Y | S+, θ) with a KL-divergence loss. Given the collection of labeled and unlabeled corpora T = {S_k}, k = 1, ..., N_L + N_U, the KL-divergence loss (Equation 6) is accumulated over every sentence in T. KL divergence is asymmetric in the two distributions. We treat predictions from knowledge-absent inputs as the approximating distributions and predictions from knowledge-attending inputs as the distributions being approximated. If we reverse the direction of approximation, the experimental results decline significantly. The reason may be that the low-confidence predictions should approximate the high-confidence predictions, not the other way round.
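The asymmetry can be made concrete with a short Python sketch. It assumes the standard convention KL(P || Q), with the teacher prediction as P (the distribution being approximated) and the student prediction as Q (the approximating distribution); the function names and the example numbers are illustrative only.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same event classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def distillation_loss(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Sum of per-word KL(teacher || student) over one aligned sentence pair."""
    return sum(kl_divergence(t, s) for t, s in zip(teacher, student))


teacher_word = [0.9, 0.1]   # confident prediction from the knowledge-attending input
student_word = [0.5, 0.5]   # uncertain prediction from the knowledge-absent input
forward = kl_divergence(teacher_word, student_word)
reverse = kl_divergence(student_word, teacher_word)
```

The two directions give different values for the same pair of predictions, which is why the choice of direction can change the training signal.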

Joint Training
The final optimization objective integrates the supervised loss from the labeled dataset and the KL-divergence loss from the unlabeled dataset, defined in Equations 4 and 6, with a coefficient λ balancing the two terms.
We stop gradients through the teacher model when calculating J_T to ensure that learning flows from teacher to student. Since the unlabeled data is much larger than the labeled data, joint training leads the model to quickly overfit the limited labeled data while still underfitting the unlabeled data. To handle the issue, we adopt the Training Signal Annealing (TSA) technique proposed in (Xie et al., 2019) to linearly release the 'training signals' of the labeled examples as training progresses.
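A hedged Python sketch of linear TSA and of the joint objective is given below. It follows the linear schedule described by Xie et al. (2019), in which labeled examples whose golden-class probability already exceeds a growing threshold are left out of the supervised loss. Averaging over the kept examples and placing λ on the distillation term are assumptions of this sketch, and all function names are hypothetical.

```python
import math
from typing import List


def tsa_threshold(step: int, total_steps: int, num_classes: int) -> float:
    """Linear schedule: the threshold grows from 1/K at step 0 to 1 at the last step."""
    alpha = step / total_steps
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes


def annealed_supervised_loss(
    gold_probs: List[float], step: int, total_steps: int, num_classes: int
) -> float:
    """Negative log-likelihood over labeled examples that are not yet 'too easy'."""
    threshold = tsa_threshold(step, total_steps, num_classes)
    kept = [p for p in gold_probs if p <= threshold]  # drop over-confident examples
    if not kept:
        return 0.0
    return -sum(math.log(p) for p in kept) / len(kept)


def joint_loss(supervised: float, distillation: float, lam: float = 1.0) -> float:
    """Assumed form of the joint objective: supervised term plus lam times KL term."""
    return supervised + lam * distillation


early_loss = annealed_supervised_loss([0.9, 0.02], step=0, total_steps=100, num_classes=34)
```

At step 0 only the example with golden-class probability 0.02 contributes, because 0.9 is already above the initial threshold of 1/34; by the final step every labeled example is used.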

Experiment Setup
Datasets For the labeled corpus, we adopt dataset ACE2005 to evaluate the overall performance. ACE2005 contains 13,672 labeled sentences distributed in 599 articles. Besides the pre-defined 33 event types, we incorporate an extra "Negative" event type for non-trigger words. Following (Chen et al., 2015), we split ACE2005 into 529/30/40 for train/dev/test respectively.
Evaluation We report the Precision, Recall and micro-averaged F1 scores in the form of percentage over all 33 events. A trigger is considered correct if both its type and offsets match the annotation.
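The matching criterion can be sketched in Python as follows. Each trigger is represented as a tuple of sentence id, start offset, end offset and event type; this tuple layout and the function name `micro_prf` are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Trigger = Tuple[int, int, int, str]  # (sentence id, start offset, end offset, event type)


def micro_prf(gold: Iterable[Trigger], predicted: Iterable[Trigger]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all event types.

    A prediction counts as correct only if both its offsets and its type match.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold_triggers = [(0, 12, 16, "Attack"), (1, 5, 9, "Transport")]
pred_triggers = [(0, 12, 16, "Attack"), (1, 5, 9, "Start-Position")]
scores = micro_prf(gold_triggers, pred_triggers)
```

In this toy case the second prediction has the right offsets but the wrong type, so it counts as both a false positive and a missed trigger.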
Hyperparameters For feature extraction, we adopt BERT as our backbone, which has 24 layers with 16 attention heads each and a hidden embedding dimension of 1024. The batch size of labeled data is 32, and we set the ratio of labeled to unlabeled data to 1:6. For most of our experiments, we set the learning rate to 3e-5, the maximum sequence length to 128 and the λ in joint training to 1. We use Adam as the gradient descent optimizer. Our model trains on one V100 for half a day. The best result appears around 12,500 epochs. Balancing performance and training efficiency, we use 40,236 unlabeled examples for knowledge distillation unless otherwise stated. All reported results are averages over ten runs.
Baselines As our method incorporates open-domain trigger knowledge, for fair comparison, we compare it with two data-driven methods and seven state-of-the-art knowledge-enhanced methods:
• DMCNN proposes a dynamic multi-pooling layer above a CNN model to improve event detection (Chen et al., 2015).
• DLRNN exploits document information via recurrent neural networks (Duan et al., 2017).
• ANN-S2 exploits argument information to improve ED via supervised attention mechanisms.
• GMLATT adopts a gated cross-lingual attention to exploit the complementary information conveyed by multilingual data (Liu et al., 2018a).
• GCN-ED exploits structured dependency tree information via graph convolutional networks and entity mention-guided pooling (Nguyen and Grishman, 2018).
• Lu's DISTILL proposes a Δ-learning approach to distill generalization knowledge to handle overfitting (Lu et al., 2019).
• TS-DISTILL exploits the entity ground-truth and uses an adversarial imitation based knowledge distillation approach for ED.
• AD-DMBERT adopts an adversarial imitation model to expand the training data (Wang et al., 2019b).
• DRMM employs an alternative dual attention mechanism to effectively integrate image information into ED (Tong et al., 2020).
The last two baselines both use BERT as the feature extractor.

Overall Performance
Table 2 presents the overall performance of the proposed approach on ACE2005. As shown in Table 2, EKD (ours) outperforms various state-of-the-art models, showing the superiority of open-domain trigger knowledge and the effectiveness of the proposed teacher-student model. The BERT-based models AD-DMBERT, DRMM and EKD (ours) significantly outperform the CNN-based or LSTM-based models, which is due to their ability to capture contextual information as well as the large-scale pre-training of BERT. Compared to the other two BERT-based models, our method improves the F score by 3.5% and 2.3% respectively, which shows the superiority of our method even when the encoder is already powerful.
Compared to the data-driven methods DMCNN and DLRNN, the knowledge-enhanced methods Lu's DISTILL, TS-DISTILL and EKD (ours) improve the recall by a large margin. Due to the small scale of ACE2005, it is quite tricky to disambiguate triggers solely based on the surrounding context. Enhanced by external knowledge, these methods have stand-by commonsense to depend on, which prevents overfitting to densely labeled trigger words and thus allows them to discover more trigger words. Among them, our model achieves the best performance, which may be caused by two reasons: 1) The superiority of open-domain trigger knowledge. Compared to the general linguistic knowledge used in Lu's DISTILL and the entity type knowledge used in TS-DISTILL, open-domain trigger knowledge is more task-related: it directly provides trigger candidates for trigger identification, and thus is more informative. 2) The superiority of the proposed teacher-student model. Our method is able to learn open-domain trigger knowledge from unlimited unlabeled data, while Lu's DISTILL and TS-DISTILL can only learn from labeled data.
It is worth noting that our model simultaneously improves precision. Unseen/sparsely labeled trigger words are usually rare words, which are typically monosemous and exhibit a single clearly defined meaning. These words are easier for the model to distinguish, thereby improving the overall precision.
To evaluate whether EKD has distilled knowledge into the model, we report the performance of EKD on the test set with and without knowledge. As illustrated in Table 3, whether or not the input data carries the open-domain knowledge, the performance makes no big difference (78.4% vs 78.6%), which shows that EKD (ours) has already distilled the knowledge into the model. During testing, our model needs no more engineering work for knowledge collection.

Domain Adaption Scenario
We use ACE2005 to simulate a domain adaption scenario. ACE2005 is a multi-domain dataset with six domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). Following the common practice (Plank and Moschitti, 2013; Nguyen and Grishman, 2014), we adopt the union of bn and nw as the source domain, and bc, cts and wl as three target domains. The event types and vocabulary distribution are quite different between the source and target domains (Plank and Moschitti, 2013). For evaluation, we split the source domain data into train/test at a ratio of 4:1 and report the average over ten runs as the final result. For baselines, MaxEnt and Joint (Li et al., 2013) are two feature-enriched methods, exploiting both lexical and global features to enhance the domain adaption ability. Nguyen's CNN (Nguyen and Grishman, 2015) integrates the feature and neural approaches and proposes a joint CNN for domain adaption. We also compare with the supervised SOTA PLMEE (Yang et al., 2019), which exploits the pre-trained language model BERT for event extraction. As illustrated in Table 4, our method achieves the best adaption performance on both the bc and wl target domains and achieves comparable performance on the cts target domain. The superior domain adaption performance may come from the open-domain trigger knowledge. The open-domain trigger knowledge is not subject to specific domains; it detects all the event-oriented trigger words and covers the event types from both the source and the target domains. Armed with open-domain trigger knowledge, our model reinforces associations between source and target data, and thus has superior performance in domain adaption.

Various Labeling Frequencies
In this section, we answer the question of whether our model can address the long tail problem. According to their frequency in the training set, we divide trigger words into three categories: Unseen, Sparsely-Labeled and Densely-Labeled. The frequency of Sparsely-Labeled is less than 5 and the frequency of Densely-Labeled is more than 30. The baselines are 1) the supervised-only method DMBERT (Chen et al., 2015), 2) the distant-supervised method DGBERT and 3) the semi-supervised method BOOTSTRAP (He and Sun, 2017). We replace the encoders in the three baselines with the more powerful BERT to make the baselines stronger.
As illustrated in Table 5, all three baselines show a significant performance degradation in the unseen/sparsely labeled scenarios due to the limited training data. Our method surpasses the baselines in all three settings. In particular, our method gains more improvement on the unseen (+6.1%) and sparsely-labeled (+2.8%) settings. Open-domain trigger knowledge allows us to discover unseen/sparsely labeled triggers from the large-scale unlabeled corpus, which increases the frequency at which the model sees such triggers.
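The frequency split used in this analysis can be sketched in Python. The thresholds follow the text (fewer than 5 occurrences for Sparsely-Labeled, more than 30 for Densely-Labeled); how trigger words with 5 to 30 occurrences are handled is not stated, so the "other" bucket and the function name `bucket_triggers` are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List


def bucket_triggers(
    train_triggers: Iterable[str],
    test_triggers: Iterable[str],
    sparse_below: int = 5,
    dense_above: int = 30,
) -> Dict[str, List[str]]:
    """Group test trigger words by how often they are labeled in the training set."""
    frequency = Counter(train_triggers)
    buckets: Dict[str, List[str]] = {"unseen": [], "sparse": [], "dense": [], "other": []}
    for word in test_triggers:
        count = frequency[word]          # Counter returns 0 for unseen words
        if count == 0:
            buckets["unseen"].append(word)
        elif count < sparse_below:
            buckets["sparse"].append(word)
        elif count > dense_above:
            buckets["dense"].append(word)
        else:
            buckets["other"].append(word)  # assumed handling of the 5..30 range
    return buckets


train = ["war"] * 40 + ["fire"] * 3 + ["attack"] * 10
buckets = bucket_triggers(train, ["war", "fire", "trek", "attack"])
```

F scores can then be computed separately on each bucket to obtain a per-frequency breakdown of the kind reported in Table 5.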

Knowledge-Agnostic
Then, to evaluate whether EKD (ours) can distill other knowledge types, we conduct experiments on the three most commonly used kinds of knowledge in the ED scenario: 1) Entity knowledge. Entity type is an important feature for trigger disambiguation in ED (Zhang et al., 2007). We compare with TS-DISTILL, which distills ground-truth entity type knowledge via an adversarial teacher-student model. 2) Syntactic knowledge. Syntactic knowledge is implied in the dependency parse tree. The closer a word is to the trigger in the tree, the more important it is for the trigger (McClosky et al., 2011). Our baseline (Nguyen and Grishman, 2018) is the best syntactic-knowledge-enhanced model, which exploits structured dependency tree information via graph convolutional networks. 3) Argument knowledge. Event arguments play an important role in ED. Our baseline ANN-S2 designs a supervised attention mechanism to leverage the event argument knowledge.
For the adaption of our model, we obtain entity annotations with Stanford CoreNLP, syntactic annotations with NLP-Cube (Boro et al., 2018) and argument annotations with CAMR (Wang et al., 2015). The marking contents are: 1) For entity knowledge, we tag three basic entity types: People, Location and Organization. 2) For syntactic knowledge, we take the first-order neighbors of the trigger word on the dependency parse tree. We consider neighbors in both directions. 3) For argument knowledge, we focus on the words that play the ARG0-4 roles of the trigger in the AMR parse, following (Huang et al., 2017). As we do not know the trigger words on unlabeled data, we use pseudo labels generated by pre-trained BERT instead. We encode the entity, syntactic and argument knowledge into sentences with the same Marking Mechanism in Section 3.2. To prevent information leakage, we only use that knowledge in the training procedure.
As illustrated in Table 7, our three adaption models, EKD-Ent, EKD-Syn and EKD-Arg, consistently outperform the baselines on the F score, showing that the effectiveness of EKD is independent of the specific knowledge type. EKD increases the cognitive gap between the teacher model and the student model to maximize knowledge utilization, and the idea works for all the knowledge types we tested. If we compare the performance from the perspective of knowledge type, the results show that open-domain trigger knowledge (EKD) is better than argument knowledge (EKD-Arg), and both are superior to entity knowledge (EKD-Ent) and syntactic knowledge (EKD-Syn). The reason might be that the more task-related the knowledge is, the more informative it is. Since open-domain trigger knowledge and event argument knowledge consider the important words directly from the event side, they are more valuable than entity and syntactic knowledge in ED.

Case Study
We answer the question of how and when the open-domain trigger knowledge enhances the understanding of event triggers. Table 6 gives examples of how open-domain trigger knowledge affects the predictions of ED. In S1, since trek is a rare word that never shows up in the training procedure, the supervised-only method fails to recognize it. Open-domain trigger knowledge provides the prior that trek should be an event trigger. Coupled with the pre-trained information that trek is similar to densely labeled trigger words such as move, our model successfully recalls it. In S3, be is a very ambiguous word, and in most cases be is not used as a trigger word in the labeled data. The supervised-only method is prone to overfitting the labeled data and fails to recognize it. Open-domain trigger knowledge has word sense disambiguation ability: it knows that be here belongs to the word sense 'occupy a certain position' instead of the common word sense 'have the quality of being', and thus can successfully identify be as the trigger for the event Start-Position.

Conclusion
We leverage the wealth of the open-domain trigger knowledge to address the long-tail issue in ACE2005. Specifically, we adopt a WordNet-based pipeline for efficient knowledge collection, and then we propose a teacher-student model, EKD, to distill open-domain trigger knowledge from both labeled and abundant unlabeled data. EKD forces the student model to learn open-domain trigger knowledge from the teacher model by mimicking the predicted results of the teacher model. Experiments show that our method surpasses seven strong knowledge-enhanced baselines, and is especially effective for identifying unseen/sparsely labeled triggers.

Table 6: Error analysis: How and when does the open-domain trigger knowledge improve ED? GT refers to the ground truth labels. On the unlabeled data, we use a majority vote of three humans as the ground truth.
Sentence | GT | Prediction (S) | Prediction (S+)
S1: Mr. Caste leaves at 5 A.M. for a train trek to manhatten and does not return utill 6 P.M. | Transport | O | Transport
S2: Militants in the region escalate their attacks in the weeks leading up to the inauguration of Nigeria's president. | [...] | [...] | [...]
S3: Mr.Mason, who will be president of CBS radio, said that it would play to radio's strengths in delivering local news. | Start-Position | O | Start-Position