Distilling Discrimination and Generalization Knowledge for Event Detection via Delta-Representation Learning

Event detection systems rely on discrimination knowledge to distinguish ambiguous trigger words and generalization knowledge to detect unseen/sparse trigger words. Current neural event detection approaches focus on trigger-centric representations, which work well on distilling discrimination knowledge, but poorly on learning generalization knowledge. To address this problem, this paper proposes a Delta-learning approach to distill discrimination and generalization knowledge by effectively decoupling, incrementally learning and adaptively fusing event representation. Experiments show that our method significantly outperforms previous approaches on unseen/sparse trigger words, and achieves state-of-the-art performance on both ACE2005 and KBP2017 datasets.


Introduction
Event detection (ED) aims to identify triggers of specific event types. For instance, an ED system will identify fired as an Attack event trigger in the sentence "An American tank fired on the Palestine Hotel." Event detection plays an important role in Automatic Content Extraction (Ahn, 2006), Information Retrieval (Allan, 2012), and Text Understanding (Chambers and Jurafsky, 2008).
Due to the ambiguity and the diversity of natural language expressions (Li et al., 2013; Nguyen and Grishman, 2015), an effective approach should be able to distill both discrimination and generalization knowledge for event detection. Discrimination knowledge aims to distinguish ambiguous triggers in different contexts. As shown in Figure 1, to identify fired in S4 as an EndPosition trigger rather than an Attack trigger, an ED system needs to distill from S1 and S2 the discrimination knowledge that (fired, Attack) usually co-occurs with {tank, death, enemy, ...} and (fired, EndPosition) usually co-occurs with {work, fault, job, ...}. Unlike discrimination knowledge, generalization knowledge aims to detect unseen or sparsely labeled triggers, and thus needs to be transferred between different trigger words. For example, to identify the unseen word hacked in S5 as an Attack trigger, an ED system needs to distill the generalized Attack pattern "[Trigger] to death" from S3.
Currently, most neural network ED methods (Chen et al., 2015; Nguyen and Grishman, 2015, 2016; Duan et al., 2017; Yang and Mitchell, 2017) work well on distilling discrimination knowledge, but poorly on distilling generalization knowledge. Table 1 shows the performance of several models on both sparsely (OOV/OOL) and densely (Other) labeled trigger words. These models work well on densely labeled trigger words, i.e., they have good discrimination ability. But they perform poorly on unseen/sparsely labeled trigger words, i.e., they have poor generalization ability. This is because these approaches are mostly trigger-centric, and thus hard to generalize to sparse/unseen words. Furthermore, the lack of large-scale training data also limits the generalization ability of learned models.

Table 1: F1 scores of previous approaches on different types of triggers (ACE2005), where OOV words are the out-of-vocabulary words in the training corpus, and OOL words are the out-of-label words, i.e., instances whose (word, event type) pair never occurs in the training corpus although the word itself is not OOV. DMCNN (Chen et al., 2015) refers to the dynamic multi-pooling based CNN; Bi-LSTM (Duan et al., 2017) refers to the bidirectional LSTM based RNN. ELMo refers to the fixed task-independent word representations proposed by Peters et al. (2018).

Table 1 also shows the performance of using a general pre-trained word representation, ELMo (Peters et al., 2018). We can see that this task-independent, lexical-centric representation achieves nearly the same performance as task-specific representations.
In this paper, we propose a ∆-representation learning approach, which can incrementally distill both discrimination and generalization knowledge for event detection. ∆-representation learning aims to decouple, learn, and fuse alterable ∆-parts of the event representation, instead of learning a single comprehensive representation. Specifically, we decouple an event representation r_ed into three parts, r_ed = r_w ⊕ r_d ⊕ r_g (Section 2), where r_w is the pre-trained word representation of the trigger word; r_d is the lexical-specific event representation, which captures discrimination knowledge for distinguishing ambiguous triggers; r_g is the lexical-free event representation, which captures generalization knowledge for detecting unseen/sparse triggers; and ⊕ is the fusion function that fuses the different parts. Here r_d and r_g are the ∆-parts of our representation, i.e., they are independently learned starting from r_w and are intended to capture incremental knowledge for event detection. To incrementally learn the ∆-parts r_d and r_g, we propose a ∆-learning framework (Section 3): a lexical-enhanced ∆-learning algorithm is designed to learn the discrimination knowledge r_d, the part that is both event-related and lexical-relevant, and a lexical-adversarial ∆-learning algorithm is designed to learn the generalization knowledge r_g, the part that is event-related but lexical-irrelevant. Finally, a lexical gate fusion mechanism (Section 2.3) is proposed to adaptively fuse these learned representations. Figure 2 shows the architecture of our method.

Figure 2: The framework of our ∆-learning approach. Dashed lines indicate the learning process; solid lines indicate the event detection process.
We conduct experiments on two standard event detection datasets: ACE2005 and the TAC KBP 2017 Event Nugget Detection Evaluation (KBP2017). Experimental results show that the proposed method significantly improves the performance on sparsely labeled triggers, and retains a high performance on densely labeled triggers.
The main contributions of this paper are:
1. We propose a new representation learning framework, ∆-learning, which can incrementally distill both discrimination and generalization knowledge during representation learning. Since the ambiguity and diversity problems of natural language expressions are common in NLP, our framework can potentially benefit many other NLP tasks.
2. We design a new event detection approach. By effectively decoupling, independently learning, and adaptively fusing event representations, our approach works well on both sparsely and densely labeled triggers and achieves state-of-the-art performance on both the ACE2005 and KBP2017 datasets.

Decoupling Lexical-Specific and Lexical-Free Representations for Event Detection
To distill both discrimination and generalization knowledge, this section decouples the event representation into three parts: r_ed = r_w ⊕ r_d ⊕ r_g, where r_w is the word representation of the trigger word, such as word embeddings or ELMo (note that r_w is fixed during the entire training process); r_d is a lexical-specific event representation which captures discrimination knowledge; and r_g is a lexical-free representation which captures generalization knowledge. By decoupling event representations, r_d and r_g can be independently learned using our ∆-learning algorithms in Section 3. Finally, a gate mechanism is proposed to adaptively fuse the above representations for event detection.
Formally, an event detection instance is a pair of a trigger candidate and its context, i.e., x = (t, c), where t is a trigger candidate and c = {c_-m, ..., c_-1, c_1, ..., c_m} is its context. For example, (fired, "That officer was from his job.") is an instance for the candidate fired.
Following previous work (Nguyen and Grishman, 2015; Liu et al., 2018a), given an instance x, we embed each token t_i as t_i = [p_w; p_p; p_e], where p_w is its word embedding, p_p is its position embedding, and p_e is its entity tag embedding. Therefore t_0 is the representation of the trigger candidate. In this paper, the lexical-specific model Θ_d and the lexical-free model Θ_g use independent embeddings.
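The per-token embedding above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the lookup tables are random and the dimension sizes are toy assumptions.

```python
import random

random.seed(0)

DIM_W, DIM_P, DIM_E = 4, 2, 2  # toy sizes for word/position/entity embeddings

# Toy lookup tables (in the real model these are pre-trained or learned).
word_emb = {w: [random.uniform(-1, 1) for _ in range(DIM_W)]
            for w in ["that", "officer", "was", "fired", "from", "his", "job"]}
pos_emb = {p: [random.uniform(-1, 1) for _ in range(DIM_P)] for p in range(-3, 4)}
ent_emb = {tag: [random.uniform(-1, 1) for _ in range(DIM_E)] for tag in ["O", "PER"]}

def embed_token(word, rel_pos, ent_tag):
    """t_i = [p_w ; p_p ; p_e]: concatenate word, position, and entity embeddings."""
    return word_emb[word] + pos_emb[rel_pos] + ent_emb[ent_tag]

# Trigger candidate "fired" at relative position 0, tagged as a non-entity token.
t0 = embed_token("fired", 0, "O")
assert len(t0) == DIM_W + DIM_P + DIM_E
```

Because t_0 is just the trigger candidate's own concatenated embedding, the position embedding lets the downstream encoders distinguish the trigger from its context tokens.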
Lexical-Specific Representation

To capture discriminative clues for trigger candidates, we design a lexical-centered context selection attention, which we refer to as ATT-RNN and describe as follows.

Lexical-Centered Context Selection. To select discriminative context words, the attentive context selection mechanism models the association between the trigger candidate and its context words. For instance, we want our attention mechanism to capture the association between "work" and fired in S1, and between "tank" and fired in S2.

Concretely, we first feed [t_-m, ..., t_0, ..., t_m] into a bidirectional GRU to obtain the context-aware token encodings [h_-m, ..., h_0, ..., h_m]. Then our attention mechanism models each (trigger, context word) pair's relevance with a Multi-Layer Perceptron (MLP), and uses a softmax function to normalize the relevance scores into attention weights:

e_i = MLP([h_0; h_i]),  α_i = exp(e_i) / Σ_{j∈C} exp(e_j),

where C is the set of context positions. Given the attention weights, the lexical-specific context representation is summarized as c_0 = Σ_{i∈C} α_i · h_i, and the final lexical-specific representation of instance x is the concatenation of its token representation h_0 and the lexical-specific context representation c_0, i.e., r_d = [h_0; c_0]. The lexical-specific representation can effectively disambiguate trigger words by capturing (trigger, context word) associations. However, this representation is lexical-specific, and thus hard to generalize to sparse/unseen words.

Lexical-Free Representation
In contrast to the lexical-specific representation, the lexical-free event representation r_g aims to capture generalization knowledge for ED, which can be transferred between different trigger words. For example, we want to capture trigger-word-irrelevant knowledge such as "[Trigger] to death" being a strong trigger pattern for the Attack event, which can be used to detect many different trigger words, such as fired, hacked, and beat. In this way, even an unseen trigger candidate t can be easily identified by leveraging such knowledge.
Obviously, the lexical-free event representation r_g should be lexical-irrelevant but event-specific. To this end, we represent all tokens in x as t_i, then employ a lexical-independent context selection module for r_g. We simply use DMCNN (Chen et al., 2015) as our lexical-independent context selection module, but design a new adversarial ∆-learning algorithm in Section 3.2 which can eliminate lexical-relevant information from r_g.

Lexical-Independent Context Selection. To select lexical-independent but event-relevant context words, we employ the same CNN architecture as Chen et al. (2015). For instance, we want to capture "to death" and "criminal" being relevant to the Attack event in S5.

Given the token sequence [t_-m, ..., t_0, ..., t_m], an h-width convolutional layer captures the local context feature l_i from t_{i:i+h-1}: l_i = tanh(w · t_{i:i+h-1} + b), where w is the convolutional filter and b is the bias term. To summarize important signals from different pieces of a sentence, a dynamic pooling layer (Chen et al., 2015) is used to max-pool the feature maps on each side of the trigger candidate, producing the left and right context features l_left and l_right. Finally, we concatenate the left context feature l_left and the right context feature l_right as our lexical-free representation r_g = [l_left; l_right].
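Dynamic multi-pooling can be sketched as follows, assuming the convolutional feature maps are already computed. Whether the trigger position itself belongs to the left or the right part is an assumption of this sketch.

```python
def dynamic_pool(features, trigger_idx):
    """Split the convolutional feature maps at the trigger position and
    max-pool each part per dimension (DMCNN-style dynamic multi-pooling)."""
    left = features[: trigger_idx + 1]   # assumption: trigger goes to the left part
    right = features[trigger_idx + 1:]
    dim = len(features[0])
    l_left = [max(f[d] for f in left) for d in range(dim)]
    l_right = [max(f[d] for f in right) for d in range(dim)]
    return l_left + l_right              # r_g = [l_left ; l_right]

# Four positions, two feature dimensions; trigger at index 1.
feats = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4], [0.8, 0.1]]
r_g = dynamic_pool(feats, trigger_idx=1)
assert r_g == [0.5, 0.9, 0.8, 0.4]
```

Splitting at the trigger (rather than pooling the whole sentence) preserves which side of the trigger a strong pattern such as "to death" appears on.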

Lexical Gate Mechanism for Representation Fusion
The above two representations are complementary: r_d captures discrimination knowledge, and r_g captures generalization knowledge. However, simple concatenation is not effective for event detection: for frequently labeled trigger words in the training data, the lexical-specific representation is more useful, while for sparsely labeled or unseen trigger words, the lexical-free representation is more helpful. Based on this observation, our system needs to rely more on r_d to detect the frequent candidate fired, but more on r_g to detect the OOV candidate hacked. That is, we need to adaptively fuse the different representations for different words, rather than simply concatenate them.
To adaptively fuse the lexical-specific representation r_d, the lexical-free representation r_g, and the word representation r_w, we design a lexical gate mechanism: r_ed = r_w ⊕ r_d ⊕ r_g, where ⊕ is the fusion gate and r_ed is the final event representation. Concretely, we first map these representations into a universal space:

u_d = f_Spec→U(r_d),  u_g = f_Free→U(r_g),  u_w = f_Lexi→U(r_w),

where f_Spec→U(·), f_Free→U(·), and f_Lexi→U(·) are linear layers with a nonlinear function; then we fuse them via the gate mechanism:

g_i = f_U→G([u_d; u_g; u_w]),  g̃_i = exp(g_i) / Σ_{j∈{d,g,w}} exp(g_j),  i ∈ {d, g, w},

where g_i correspondingly indicates the confidence of the evidence provided by r_d, r_g, and r_w; g_i and g̃_i have the same dimensions as u_i; and f_U→G(·) is a linear layer with a nonlinear function. Finally, we combine all representations:

r_ed = Σ_{i∈{d,g,w}} g̃_i ⊙ u_i,

where ⊙ is element-wise multiplication. After fusion, r_ed is fed to the event detection classifier, which computes a classification probability for each event type y_t (including NIL for "not a trigger"):

P(y_t | x) = softmax(w_t · r_ed + b_t),

where w_t is the weight vector and b_t is the bias term. In this way, we identify trigger words of all pre-defined event types.

[Figure 3: (a) the lexical-enhanced ∆-learning framework and (b) the lexical-adversarial ∆-learning framework; each couples the event detection classifier with a binary lexical classifier.]
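The gate fusion can be sketched in pure Python as below. This is a toy under stated assumptions: each source has its own gate layer, tanh is the nonlinearity, and the gates are normalized dimension-wise across the three sources; dimensions and weights are illustrative.

```python
import math
import random

random.seed(2)
U = 3  # toy dimension of the universal space

def linear_tanh(x, W, b):
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def fuse(u_d, u_g, u_w, W_g, b_g):
    """Dimension-wise softmax gates over the three sources, then
    r_ed = sum_i g~_i (element-wise *) u_i."""
    concat = u_d + u_g + u_w
    # One gate vector per source (separate parameters per source: an assumption).
    gates = [linear_tanh(concat, W, b) for W, b in zip(W_g, b_g)]
    r_ed = []
    for d in range(U):
        zs = [math.exp(g[d]) for g in gates]       # normalize across sources
        g_tilde = [z / sum(zs) for z in zs]
        r_ed.append(sum(gt * u[d] for gt, u in zip(g_tilde, (u_d, u_g, u_w))))
    return r_ed

rand_vec = lambda n: [random.uniform(-1, 1) for _ in range(n)]
u_d, u_g, u_w = rand_vec(U), rand_vec(U), rand_vec(U)
W_g = [[rand_vec(3 * U) for _ in range(U)] for _ in range(3)]
b_g = [rand_vec(U) for _ in range(3)]
r_ed = fuse(u_d, u_g, u_w, W_g, b_g)
assert len(r_ed) == U
# Each fused value is a convex combination, hence within the source bounds.
for d in range(U):
    vals = [u_d[d], u_g[d], u_w[d]]
    assert min(vals) - 1e-9 <= r_ed[d] <= max(vals) + 1e-9
```

The convex-combination property is what distinguishes gating from concatenation: per dimension, the model interpolates between the three sources rather than stacking them.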

Distilling Discrimination and Generalization Knowledge via ∆-Learning

This section describes our ∆-learning framework, which learns the lexical-specific representation r_d and the lexical-free representation r_g independently.

To distill discrimination knowledge into r_d, we design a lexical-enhanced ∆-learning algorithm. To distill generalization knowledge into r_g, we design a lexical-adversarial ∆-learning algorithm. Finally, we fine-tune the full event detection model in Figure 2.

Distilling Discrimination Knowledge via Lexical-Enhanced ∆-Learning
This section describes our lexical-enhanced ∆-learning algorithm for the lexical-specific representation r_d. To ensure that r_d is both event-relevant and lexical-specific, we use two types of supervision signals: first, we want the learned representation r_d to predict its event type y with the help of the word representation r_w; second, we want the learned r_d to also predict its trigger word t. For example, we want the learned r_d of the instance (fired, "A soldier to death") to predict both its event type Attack and its trigger word fired.
To achieve the above goal, we remove the lexical-free part in Figure 2 and show the lexical-enhanced ∆-learning framework in Figure 3 (a). The input of our lexical-enhanced learning framework is a triple (t, c, w), where t is the trigger, c is its context, and w is a sampled word. The output is two-fold: the event classifier outputs the event type of (t, c), and the auxiliary lexical classifier outputs 1 if t = w and 0 otherwise. In this way, the event classifier propagates the event type supervision signal to our lexical-specific representation learning component, and the auxiliary binary lexical classifier ensures that the learned representation r_d is lexical-specific.
Specifically, for each ED instance x = (t, c), we generate a positive lexical-enhanced training instance (t, c, t) with label (y, 1), and n negative instances (t, c, w) with label (y, 0), where w is a word randomly sampled from the context c.

For each ED instance x = (t, c) in the training dataset D, the event classifier loss is:

L_event(Θ_d) = −Σ_{(t,c)∈D} log P(y | t, c; Θ_d),

and the binary lexical classifier loss is:

L_lexi(Θ_d) = −Σ_{(t,c,w)} [ z log P(z = 1 | t, c, w; Θ_d) + (1 − z) log P(z = 0 | t, c, w; Θ_d) ],

where z ∈ {0, 1} indicates whether w = t. Therefore, the loss function of lexical-enhanced ∆-learning is:

L_EnLexi(Θ_d) = L_event(Θ_d) + L_lexi(Θ_d).

By adding the auxiliary lexical classification task, this learning algorithm ensures that the learned representation is both event-related and lexical-relevant.
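The instance generation and the joint loss can be sketched as follows. This is a toy, assuming hand-picked probabilities in place of real model outputs; the helper names are illustrative, not the paper's code.

```python
import math
import random

random.seed(3)

def make_lexical_instances(trigger, context, y, n_neg=2):
    """One positive (t, c, t) with label (y, 1) plus n negatives (t, c, w)
    with label (y, 0), where each w is sampled from the context."""
    instances = [((trigger, context, trigger), (y, 1))]
    for w in random.sample(context, min(n_neg, len(context))):
        instances.append(((trigger, context, w), (y, 0)))
    return instances

def joint_loss(p_event, p_lexi_pairs):
    """L = L_event + L_lexi: event NLL plus the binary cross-entropy of the
    auxiliary lexical classifier (toy probabilities, not a real model)."""
    l_event = -math.log(p_event)
    l_lexi = -sum(z * math.log(p) + (1 - z) * math.log(1 - p)
                  for p, z in p_lexi_pairs)
    return l_event + l_lexi

insts = make_lexical_instances("fired", ["officer", "was", "from", "his", "job"],
                               y="EndPosition")
assert insts[0][1] == ("EndPosition", 1)              # the positive instance
assert all(label == ("EndPosition", 0) for _, label in insts[1:])
loss = joint_loss(0.9, [(0.8, 1), (0.3, 0), (0.2, 0)])
assert loss > 0
```

Sampling negatives from the same context keeps the auxiliary task hard: the classifier must tell the trigger apart from words that co-occur with it.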

Distilling Generalization Knowledge via Lexical-Adversarial ∆-Learning
In contrast to the lexical-specific representation r_d, the lexical-free representation r_g needs to eliminate lexical-specific information, so that it can be transferred between different words. To achieve this goal, we adopt adversarial techniques and design a lexical-adversarial ∆-learning algorithm. Specifically, we remove the lexical-specific part in Figure 2 and show the lexical-adversarial ∆-learning framework in Figure 3 (b). We can see that the input and the output of our adversarial ∆-learning framework are still (t, c, w) and (y, 1/0). The event classifier is used to propagate the event type supervision signal to our lexical-free representation learning component, so that r_g will capture event-related information. The difference between Figure 3 (a) and 3 (b) is that they use different auxiliary tasks: Figure 3 (a) uses a lexical-enhanced auxiliary task, and Figure 3 (b) uses a lexical-adversarial auxiliary task.
To eliminate lexical-specific information, we design a two-player min-max game (Goodfellow et al., 2014) for the lexical-adversarial auxiliary task. Given (t, c, w), the binary lexical classifier Θ_DeLexi attempts to predict whether r_g is specific to w, while the lexical-free model Θ_g tries to produce an r_g that confuses Θ_DeLexi. The min-max objective for lexical-adversarial ∆-learning is:

L_minmax = min_{Θ_g} max_{Θ_DeLexi} Σ_{(t,c,w)} [ z log P(z = 1 | t, c, w; Θ_g, Θ_DeLexi) + (1 − z) log P(z = 0 | t, c, w; Θ_g, Θ_DeLexi) ].

In this way, we can remove the lexical-specific information from r_g.

The above adversarial loss leads to two different optimization directions for Θ_g and Θ_DeLexi, which can be implemented by a gradient reversal layer (Ganin et al., 2016) during backpropagation. That is, L_minmax is jointly optimized with the main ED task objective L_event, while gradients from the adversarial loss are reversed and scaled by the factor λ_adv when they reach r_g. By this means, we can unify the optimization directions of these components. Therefore, the loss function of our lexical-adversarial ∆-learning is:

L_AdvLexi(Θ_g) = L_event(Θ_g) + L_minmax,

where the gradient of L_minmax is reversed (scaled by λ_adv) before it reaches Θ_g. Following Liu et al. (2019), we divide lexical-adversarial ∆-learning into two stages:
1. In the pretraining stage, we first update Θ_g using the main ED task objective, then freeze Θ_g and update Θ_DeLexi using the min-max objective above.
2. In the adversarial learning stage, we update all parameters using the adversarial loss.
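The effect of a gradient reversal layer can be shown on a scalar toy with manual chain-rule backpropagation. This is a didactic sketch, not the paper's implementation: the model is f = theta_g * x, s = theta_c * f, and the GRL is the identity forward while multiplying the incoming gradient by -lambda_adv on the way back to theta_g.

```python
# Toy scalar model: feature f = theta_g * x; lexical score s = theta_c * f.
# The gradient reversal layer (GRL) is the identity in the forward pass; in
# the backward pass it multiplies the incoming gradient by -lambda_adv before
# it reaches the feature extractor theta_g.
LAMBDA_ADV = 1e-3

def backward(theta_g, theta_c, x, grad_s):
    f = theta_g * x
    # Ordinary gradient for the lexical classifier: d s / d theta_c = f.
    grad_c = grad_s * f
    # d s / d f = theta_c; the GRL flips and scales it before theta_g.
    grad_f = grad_s * theta_c
    grad_g = (-LAMBDA_ADV * grad_f) * x
    return grad_c, grad_g

grad_c, grad_g = backward(theta_g=2.0, theta_c=3.0, x=1.5, grad_s=1.0)
assert grad_c == 1.0 * (2.0 * 1.5)        # classifier descends the lexical loss
assert grad_g == (-LAMBDA_ADV * 3.0) * 1.5  # extractor ascends it (reversed)
```

So a single backward pass serves both players: the classifier parameters follow the true gradient, while the feature extractor receives the negated, down-scaled gradient and is pushed to destroy the lexical signal.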
In practice, we find that the model is sensitive to the balance of the min-max game controlled by the factor λ_adv. A large λ_adv easily makes the binary lexical classifier too weak (its binary classification accuracy tends towards 50%). In this paper, λ_adv is set to 10^−3, and the accuracy of our binary lexical classifier Θ_DeLexi always stays above 75% during the adversarial learning stage.
By adding the auxiliary lexical-adversarial task, this learning algorithm ensures that the learned representation is event-related but lexical-irrelevant.

Full Model Fine-Tuning
Given the pre-trained lexical-specific representation model Θ_d and the pre-trained lexical-free representation model Θ_g, we finally fine-tune the full model Θ in Figure 2 by optimizing the event classification loss:

L(Θ) = −Σ_{(t,c)∈D} log P(y | t, c; Θ) + λ_reg ||Θ||²,

where λ_reg is the weight coefficient of the regularization term and Θ denotes all parameters. L(Θ) can be optimized using mini-batch stochastic gradient descent algorithms such as Adadelta (Zeiler, 2012).
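The fine-tuning objective (negative log-likelihood plus an L2 regularizer) can be sketched as follows, assuming the classifier's probabilities for the gold labels are given; the coefficient value is illustrative.

```python
import math

def finetune_loss(gold_probs, params, lambda_reg=1e-5):
    """L(Theta) = -sum log P(y|x; Theta) + lambda_reg * ||Theta||^2,
    where gold_probs are the classifier's probabilities for the gold labels."""
    nll = -sum(math.log(p) for p in gold_probs)
    l2 = lambda_reg * sum(p * p for p in params)
    return nll + l2

# Perfect predictions and zero weights give zero loss; otherwise the
# regularizer adds a positive penalty on top of the NLL.
assert finetune_loss([1.0], [0.0]) == 0.0
loss = finetune_loss([0.9, 0.8], params=[0.5, -1.0, 2.0])
assert loss > -math.log(0.9) - math.log(0.8)
```

Because Θ_d and Θ_g are already pre-trained by the two ∆-learning stages, this step only has to learn how to combine them, which is why a plain regularized NLL suffices here.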

Experimental Settings
Dataset. We conduct experiments on two standard English event detection datasets: ACE2005 and KBP2017. ACE2005 (LDC2006T06) contains 599 documents annotated with 33 event types. Following previous studies (Liao and Grishman, 2010; Li et al., 2013; Chen et al., 2015; Liu et al., 2018a), we use the same 529/30/40 train/dev/test document split in our experiments. We use ACE2005 as the primary dataset, the same as previous studies (Nguyen and Grishman, 2018). For KBP2017, following Lin et al. (2018a), we randomly sample 20 documents from the 2016 evaluation dataset as the development set.
We evaluate the different event detection systems using precision, recall, and F1-score. For ACE2005, we compute these criteria in the same way as previous work (Li et al., 2013; Chen et al., 2015). For KBP2017, because TAC KBP2017 allows each team to submit 3 different runs, to make our results comparable with the evaluation results, we select the 3 best runs of each system on the development set and report the best test performance among them using the official evaluation toolkit, which is referred to as Best3 in previous work (Lin et al., 2018a).
Baselines. We compare our approach with three types of baselines:

Feature-based approaches rely on rich hand-designed features, including: MaxEnt (Li et al., 2013), which employs hand-designed features and a Maximum Entropy classifier; and Combined PSL (Liu et al., 2016b), the best reported feature-based system, which combines global and latent features using a Probabilistic Soft Logic framework.
Representation learning based approaches employ neural networks to automatically extract features for event detection, including: DMCNN (Chen et al., 2015), which uses a CNN as the sentence feature extractor and concatenates sentence and lexical features for the event detection classifier; NC-CNN (Nguyen and Grishman, 2016), which extends the traditional CNN by modeling skip-grams to exploit non-consecutive k-grams; and Bi-RNN (Nguyen et al., 2016), which embeds each token with additional dependency features for a bidirectional RNN feature extractor, and jointly extracts triggers with their arguments.
External resource based approaches aim to enhance event detection with external resources, including: SA-ANN-Arg, which injects event argument information via a supervised attention mechanism; GCN-ED (Nguyen and Grishman, 2018), which exploits syntactic information to capture more accurate context using Graph Convolutional Networks (GCN); GMLATT (Liu et al., 2018a), which exploits multi-lingual information for more accurate context modeling; and HBTNGMA (Chen et al., 2018), which fuses both sentence-level and document-level information and collectively detects different events in a sentence.
For our approach and all baselines, we adopt pre-trained Skip-gram word embeddings and the openly released ELMo models. We also report the performance of ELMo as a baseline to demonstrate the performance of universal pre-trained representations. All hyper-parameters are tuned on the development set.

Overall Performance

Table 2 shows the overall ACE2005 results of all baselines and our approach. For our approach, we show the results of the following settings: our approach using word embeddings as its word representation r_w (∆w2v); our approach using ELMo as r_w (∆ELMo); and our approach simply concatenating [r_d, r_g, r_w] as the instance representation (∆concat*). From Table 2, we can see that:

1. By distilling both discrimination and generalization knowledge, our method achieves state-of-the-art performance. Compared with the best feature-based system, ∆w2v and ∆ELMo gain 2.8 and 4.6 F1-score improvements, respectively. Compared to the representation learning based baselines, both ∆w2v and ∆ELMo outperform all of them. Notably, ∆ELMo outperforms all the baselines that use external resources.
2. By incrementally distilling generalization knowledge, our method achieves both high recall and high precision. Our method obtains a high recall (71.9), outperforming most methods by a large margin, while retaining a high precision (76.3). We believe this is because the generalization knowledge is incrementally distilled using ∆-learning, so there is no need to make a precision-recall tradeoff during training.
3. The lexical gate provides an effective mechanism for adaptively fusing discrimination and generalization knowledge. Compared with the naive fusion baseline ∆concat*, ∆w2v and ∆ELMo gain 0.9 and 1.2 F1 improvements, respectively. This means that an adaptive fusion mechanism can benefit from both discrimination and generalization knowledge, rather than making a tradeoff between them.
4. Although universal pre-trained representations achieve good performance, task-specific representations are still crucial. Compared with the strong universal representation baseline ELMo, our task-specific event detection representations all achieve significant performance improvements. This also verifies that ∆-learning is an effective way to incrementally learn task-specific representations.

Table 3 further compares our method with the Top 3 systems in the TAC 2017 Event Detection Track (Mitamura et al., 2017). Because these teams had no access to gold entity information during evaluation, we exclude entity embeddings in our KBP2017 experiments for a fair comparison. We can see that the proposed method significantly outperforms the best ED systems in TAC 2017, even though those systems are ensemble models leveraging various external resources.

Detailed Analysis
To analyze the effect of our method in detail, Table 4 shows the performance of our method on different types of trigger words, including:

OOV (out-of-vocabulary) and OOL (out-of-label), which are the same as in Table 1.

Sparse means the event trigger rate of the given word, P(e|w) = #(e,w) / #(w), is less than 10% in the training corpus, i.e., fewer than 10% of the occurrences of word w are labeled with the event type e (NIL included).

Dense means all other instances except OOV, OOL, and Sparse instances.
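The trigger rate P(e|w) used for the Sparse criterion is a simple corpus statistic and can be computed as follows (the toy corpus below is illustrative, not ACE2005 data).

```python
from collections import Counter

def trigger_rate(instances, word, event):
    """P(e|w) = #(e, w) / #(w): the fraction of occurrences of `word` that
    are labeled with `event` in the training corpus."""
    word_count = sum(1 for w, _ in instances if w == word)
    joint = Counter((w, e) for w, e in instances)
    return joint[(word, event)] / word_count

# Toy corpus: 1 of 12 occurrences of "fired" is labeled EndPosition.
corpus = [("fired", "Attack")] * 11 + [("fired", "EndPosition")]
rate = trigger_rate(corpus, "fired", "EndPosition")
assert abs(rate - 1 / 12) < 1e-12
assert rate < 0.10  # so (fired, EndPosition) counts as a Sparse instance
```

Note that a word can be Dense for one event type and Sparse for another, which is exactly the regime where the lexical-free representation is expected to help.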

Related Work
Event Detection. In recent years, neural approaches have achieved significant progress in event detection. Most neural approaches focus on learning effective instance representations (Chen et al., 2015; Nguyen and Grishman, 2015, 2016; Feng et al., 2016; Ghaeini et al., 2016; Lin et al., 2018b). The main drawback of these methods is that they mostly learn only a single, lexical-specific representation, which works well for distilling discrimination knowledge but poorly for generalization knowledge.

[Table 4 note: different from DMCNN (Chen et al., 2015) in Table 1 and Table 2, DMCNN* excludes the lexical feature but includes the entity feature.]

Some approaches enhance representation learning using external resources. One strategy is to employ extra knowledge for better representation learning, such as documents (Duan et al., 2017; Liu et al., 2018b), syntactic information (Nguyen and Grishman, 2018; Sha et al., 2018; Orr et al., 2018; Liu et al., 2018c), event arguments, knowledge bases (Yang and Mitchell, 2017; Lu and Nguyen, 2018), and multi-lingual information (Liu et al., 2018a). The other strategy generates additional training instances from extra knowledge bases (Liu et al., 2016a) or news paragraph clusters (Ferguson et al., 2018). Our method does not use any external resources, and could thus be a good complement to these methods.
Representation Learning via Auxiliary Learning. In recent years, many auxiliary learning techniques have been proposed for better representation learning. Self-supervised learning learns representations by designing auxiliary tasks rather than using manually labeled data; examples include colorization in vision tasks (Doersch and Zisserman, 2017) and language modeling in text tasks (Rei, 2017). Adversarial learning attempts to fool models through malicious input (Kurakin et al., 2016); it has been broadly used in many scenarios, e.g., domain adaptation (Zeng et al., 2018), knowledge distillation (Qin et al., 2017), and attribute cleaning (Elazar and Goldberg, 2018). Some adversarial techniques have been used for event detection: Hong et al. (2018) overcome spurious features during training via self-regularization, and Liu et al. (2019) distill extra knowledge from external NLP resources using a teacher-student network. This paper employs an adversarial ∆-learning algorithm to eliminate lexical information in event representations, so that both discrimination and generalization knowledge can be incrementally distilled.

Conclusions
This paper proposes a new representation learning framework, ∆-learning, which can distill both discrimination and generalization knowledge for event detection. Specifically, two effective ∆-learning algorithms are proposed to distill discrimination and generalization knowledge independently, and a lexical gate mechanism is designed to fuse the different kinds of knowledge adaptively. Experimental results demonstrate the effectiveness of our method. Representation learning is a fundamental technique for NLP tasks, especially for resolving the ambiguity and diversity problems of natural language expressions. For future work, we plan to investigate new auxiliary ∆-learning algorithms within our ∆-learning framework.