How Does Context Matter? On the Robustness of Event Detection with Context-Selective Mask Generalization

Event detection (ED) aims to identify and classify event triggers in texts and is a crucial subtask of event extraction (EE). Despite many advances in ED, existing studies typically center on improving the overall performance of an ED model and rarely consider its robustness. This paper aims to fill this research gap by stressing the importance of robustness modeling in ED. We first pinpoint three stark cases demonstrating the brittleness of existing ED models. After analyzing the underlying reason, we propose a new training mechanism for ED, called context-selective mask generalization, which can effectively mine context-specific patterns for learning and thereby robustify an ED model. Experimental results confirm the effectiveness of our model in defending against adversarial attacks, exploring unseen predicates, and tackling ambiguous cases. Moreover, a deeper analysis suggests that our approach learns a predictive bias complementary to that of most ED models, which use the full context for feature learning.


Introduction
Event detection (ED), a crucial subtask of event extraction (EE), aims to identify and categorize event triggers in texts. For example, given the sentence S1: "During a war, invaders destroyed the whole town", ED requires a system to detect the event trigger destroyed, along with its event type ATTACK (according to the ACE event ontology). Building a robust ED system has been shown to benefit a wide range of applications, including document summarization (Filatova and Hatzivassiloglou, 2004), knowledge base population (Ji and Grishman, 2011; Mitamura et al., 2017), question answering (Berant et al., 2014), and others.
In recent years, great advances have been made in ED (Ji and Grishman, 2008; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Feng et al., 2016; Liu et al., 2018a,b, 2019b). However, the vast majority of existing studies focus on improving the overall performance of an ED model (usually on a fixed test set) and rarely consider its robustness (and generalization capability). For example, most existing methods do not answer questions such as when and why an ED system would fail, or how to handle new, previously unseen data, even though these considerations are especially crucial for designing real-world ED systems. This paper focuses on the robustness aspect of ED models. We first emphasize the necessity of this research by pinpointing three stark cases demonstrating the vulnerability of existing ED models. These cases are: 1) Adversarial attacks, which add small perturbations to the original sentences (Papernot et al., 2016; Alzantot et al., 2018). As shown in Figure 1, a well-trained event detector correctly recognizes the event trigger destroyed at first, but when we replace destroyed with a rare trigger annihilated, the same event detector fails to identify the trigger, even though the meaning of the sentence does not change. A quantitative evaluation suggests that the performance of a state-of-the-art (SoTA) ED model (Chen et al., 2015) drops significantly from 69.1% to 19.2% under adversarial attack. 2) Unseen predicates, which measure whether an ED model can tackle new, previously unseen data. Existing ED models demonstrate rather poor generalization: they achieve only 14.2% in F1 on previously unseen triggers, versus 74.9% in F1 on already seen triggers.
3) Event type disambiguation, which refers to assigning the correct event type to ambiguous triggers, considering that over 70% of triggers can express different types of events (Liu et al., 2018b). Our pilot experiments suggest that a SoTA ED model obtains only 50.4% in F1 on high-ambiguity cases, compared to 70.6% in F1 on low-ambiguity cases.
The above phenomena reflect the fact that current ED models are poor at modeling contexts. The underlying reason may be closely related to reasoning shortcuts (Jiang and Bansal, 2019): owing to the limited (and biased) training data, an ED model may have learned only lexical patterns, i.e., word-to-trigger mappings (such as destroyed → Attack), because of their prevalence in the data. By adopting such reasoning shortcuts, an ED model may explain the training data well but fail in the more context-dependent scenarios noted above, as it never captures the underlying regularities of how event triggers appear in texts.
In light of the above analysis, we propose a new training paradigm, termed context-selective mask generalization, aiming to prevent reasoning shortcuts and robustify an ED model. Our method is intuitive and straightforward: to prevent lexical bias, we explicitly delexicalize triggers for training/testing by replacing them with placeholders. This forces our model to make predictions using contexts alone. For instance, training example S1 is transformed into "During a war, invaders [MASK] the whole town", and our model is forced to predict the event label of the masked word. As the lexical information of the trigger is completely masked, our model has to mine the more essential contextual clues for reasoning. This prevents our model from simply memorizing word-to-trigger shortcuts and pushes it to learn the underlying regularities of how events are described in texts.
The proposed learning paradigm consists of two complementary training objectives: context-selective discriminative learning and contextualized similarity learning. The former is an intra-sentence objective, motivated by the observation that contextual words are usually of different importance; for example, in S1, "war" and "invaders" may be more important than "town" for predicting the Attack event. We devise a method combining selective attention (Lin et al., 2016) with model uncertainty (Gal and Ghahramani, 2016) to weigh contexts and select the salient parts for learning. The latter is an inter-sentence objective, based on the assumption that event triggers of the same type tend to occur in similar contexts, derived from the well-known distributional hypothesis of words (Harris, 1954). We take pairs of mask-containing sentences as input and encourage their contextual representations to be similar if the masked triggers express the same type of events.
To verify the effectiveness of our approach, we conduct extensive experiments on benchmark event datasets and show the definite advantages of our approach over previous methods with respect to: 1) defending against adversarial attacks, 2) tackling unseen predicates, and 3) handling ambiguous cases. Moreover, a deeper analysis suggests that our approach learns a predictive bias complementary to that of existing ED models, which use the full context for reasoning.
Contributions. 1) We stress the importance of robustness modeling in ED, a problem less studied in the existing literature. We pinpoint three stark cases demonstrating the brittleness of existing ED methods, with quantitative evaluations, and analyze the underlying reason. 2) We propose a new training paradigm, called context-selective mask generalization, which can effectively mine context-specific patterns for ED, shedding light on building ED systems with decent robustness. 3) We report extensive experiments demonstrating the advantages of our model in defending against adversarial attacks, handling unseen predicates, and tackling ambiguous cases. We also give a deeper analysis exploring the predictive bias of our method.

Event Detection
ED is a crucial subtask of EE that aims to find event triggers in texts. Earlier approaches to ED are feature-based. To name a few, Ahn (2006) exploited lexical, syntactic, and external-knowledge-based features for the task; Ji and Grishman (2008) combined global and local decision features; Liao and Grishman (2010) and Hong et al. (2011) investigated cross-event/cross-entity inference; and Li et al. (2013) proposed a joint framework. Modern approaches are neural-network-based. For example, Chen et al. (2015) leveraged Convolutional Neural Networks (CNNs); Nguyen et al. (2016) used Recurrent Neural Networks (RNNs); Feng et al. (2016) combined CNNs with RNNs; and Liu et al. (2018b) explored Graph Convolutional Networks (GCNs). More recent works have designed advanced architectures for the task (Liu et al., 2018a; Lu et al., 2019; Liu et al., 2019a).
Despite many advances in ED, to date little work has studied the robustness (and generalization capability) of an ED model. The work of Lu et al. (2019) is related to ours; it improved the generalization of an ED model by decoupling lexical-specific and lexical-free representations via adversarial training. Compared to their work, the introduction of placeholders in our work naturally decouples lexical-specific and lexical-free representations, which avoids the unstable adversarial learning process. Moreover, our work evaluates three aspects of robustness, rather than only unseen predicates. Our work also relates to the study of Huang et al. (2018), which aims to recognize events of never-seen event types, i.e., zero-shot EE. Their work lies in a dimension orthogonal to ours regarding the generalization of ED models.

Robustness Probing in Natural Language Processing Applications

Enhancing the robustness of a model is a challenging and long-standing goal of the AI research community. In computer vision, Szegedy et al. (2014) first pointed out that a crafted input with small perturbations could easily fool a neural model, referring to such an input as an adversarial example. Subsequent studies in NLP have probed the robustness of models such as the Transformer (Vaswani et al., 2017) in sentiment analysis, entailment, and machine translation under adversarial attacks. But to the best of our knowledge, no work has systematically studied the robustness of ED.
3 Approach

Figure 2 gives an overview of our approach, taking S1 as an example. Let a sentence of N words be S = [w_1, w_2, ..., w_N]. Following previous works (Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Lu et al., 2019), we formulate the ED task as a token-level classification problem. That is, each word in S is considered a candidate trigger, and our goal is to assign the correct event label to it (a special type NIL indicates a non-trigger word).
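As a toy illustration of this token-level formulation (the sentence and label below are illustrative, not drawn from the actual ACE annotations), the candidate-trigger labeling can be sketched as:

```python
def as_token_classification(tokens, gold_triggers):
    """Frame ED as per-token labeling: every token in the sentence is a
    candidate trigger; tokens without an event annotation get the NIL type."""
    return [gold_triggers.get(i, "NIL") for i in range(len(tokens))]

tokens = "During a war , invaders destroyed the whole town".split()
gold = {5: "Attack"}  # index of "destroyed"
labels = as_token_classification(tokens, gold)
```

Every token thus yields one classification instance, and non-triggers dominate, which is why negative sampling is used later in training.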

Trigger Delexicalization
Following recent advances in ED (Yang et al., 2019), we adopt the BERT architecture (Devlin et al., 2019) to learn the input representations. We first add special tokens at both ends of S to construct an extended sequence "[CLS] S [SEP]". Since we do not allow our model to leverage lexical clues, we explicitly delexicalize the candidate trigger by replacing it with a placeholder [MASK]. Consider S1 and S2 in Figure 1: if we take destroyed or annihilated as the candidate trigger, the mask-containing sequence is "[CLS] During a war, invaders [MASK] the whole town [SEP]". Next, we use BERT for sequence encoding and take its final hidden layer as the input representations, denoted H_S ∈ R^{(N+2)×d}. We use h_{w_i} ∈ R^d to denote the representation of a specific token w_i.
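The delexicalization step can be sketched as follows; the whitespace tokenization is a simplification of BERT's WordPiece tokenizer:

```python
def delexicalize(tokens, trigger_idx):
    """Build the encoder input for one candidate trigger: wrap the sentence
    in [CLS]/[SEP] and replace the candidate with a [MASK] placeholder, so
    the model must rely on context alone."""
    body = ["[MASK]" if i == trigger_idx else w for i, w in enumerate(tokens)]
    return ["[CLS]"] + body + ["[SEP]"]

tokens = "During a war , invaders destroyed the whole town".split()
masked = delexicalize(tokens, tokens.index("destroyed"))
```

Note that the same masked sequence is produced whether the candidate is destroyed or a rare substitute such as annihilated, which is exactly what makes the model insensitive to lexical perturbations.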
Figure 2: An overview of our approach, taking "destroyed" as the candidate trigger. Our approach includes two complementary training objectives: an intra-sentence context-selective discriminative learning objective (left) and an inter-sentence contextualized similarity learning objective (right).

Context-Selective Discriminative Learning
Context-selective discriminative learning aims to predict the event label for the masked candidate trigger by selectively attending to contexts. We first compute an (unsupervised) attention vector α_u = softmax(h_[MS]^T W_a H_S), using h_[MS], the representation of the masked candidate trigger, as the query vector (Bahdanau et al., 2014); W_a ∈ R^{d×d} is an attention matrix. Then we conduct a weighted summation over H_S using α_u as the weight vector, obtaining a feature vector for the masked candidate trigger, denoted F_[MS]. Finally, F_[MS] is used for event label prediction by computing an output vector containing the probabilities of different event labels: o_[MS] = softmax(W_m F_[MS] + b_m), where W_m and b_m are model parameters. The predicted event label corresponds to the index with the highest value in o_[MS]. Considering that unsupervised attention may not always learn a good pattern (Wiegreffe and Pinter, 2019), we devise a "trial-and-error" approach to guide the learning. Specifically, at training time, we also generate a random context mask and normalize it into a weight vector α_r. Our intuition is that if α_r leads to a better result than α_u, it might be a better selective pattern for our model to learn. Since there are cases where the predicted event labels are the same under α_r and α_u, we introduce model uncertainty (Gal and Ghahramani, 2016) to evaluate whether the result is improved. Specifically, we compute the model uncertainty by making predictions many times with dropout layers kept active, and empirically take the prediction variance as the model uncertainty. When we observe a reduced model uncertainty under α_r, we consider that α_r improves the result, and we then encourage α_u to approach α_r, under the guidance of a mean squared error (MSE) loss.
The overall loss function of context-selective discriminative learning is therefore: L_D = -Σ_t log o_[MS]^{(t)}[y^{(t)}] + δ_{α_u,α_r} · MSE(α_u, α_r), (2) where t ranges over each token in the training set; y^{(t)} is t's ground-truth event label; x[j] denotes the j-th element of x; and δ_{α_u,α_r} takes a value of 1 if α_r improves the result (regarding model uncertainty), and 0 otherwise.
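The "trial-and-error" check can be sketched with Monte-Carlo dropout: run the stochastic forward pass several times and use the prediction variance as uncertainty. The two predictors below are toy stand-ins for the model under the two attention vectors, not the actual network:

```python
import random
import statistics

def mc_uncertainty(stochastic_predict, n_runs=100):
    """Model uncertainty as the variance of repeated predictions made with
    dropout kept active (Gal and Ghahramani, 2016)."""
    scores = [stochastic_predict() for _ in range(n_runs)]
    return statistics.pvariance(scores)

random.seed(0)
def predict_under_alpha_u():  # noisier scoring -> larger variance
    return 0.6 + random.uniform(-0.2, 0.2)
def predict_under_alpha_r():  # sharper scoring -> smaller variance
    return 0.8 + random.uniform(-0.05, 0.05)

u_u = mc_uncertainty(predict_under_alpha_u)
u_r = mc_uncertainty(predict_under_alpha_r)
alpha_r_improves = u_r < u_u  # if True, pull alpha_u toward alpha_r via MSE
```

In the real model the flag `alpha_r_improves` plays the role of δ_{α_u,α_r} in Eq. (2), switching the MSE term on or off.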

Contextualized Similarity Learning
The philosophy of contextualized similarity learning is that "events of the same type tend to have similar contexts", derived from the distributional hypothesis of words (Harris, 1954). We enforce this assumption by taking pairs of mask-containing sentences as input, with an objective that encourages their representations to be similar if they express the same type of events.
Let the learned feature vectors of two (masked) candidate event triggers t_1 and t_2 be F_{t_1→[MS]} and F_{t_2→[MS]}, and their event labels be y_1 and y_2. We define the similarity of F_{t_1→[MS]} and F_{t_2→[MS]} as their cosine similarity: sim(t_1, t_2) = cos(F_{t_1→[MS]}, F_{t_2→[MS]}). Based on this similarity measurement, we devise the following loss to encourage triggers of the same type to have a larger similarity score: L_S = Σ_{(t_1,t_2)} (sim(t_1, t_2) - δ_{y_1,y_2})^2, where δ_{y_1,y_2} is the Kronecker function, which takes 1 if y_1 and y_2 are the same and 0 otherwise. We do not consider cases where both y_1 and y_2 are NIL.
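A toy instance of this objective might look as follows; the cosine similarity and the squared-error target are our assumptions, since the original equations are not fully specified here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pair_loss(f1, f2, y1, y2):
    """Pull same-type masked contexts toward similarity 1 and different-type
    pairs toward 0; NIL-NIL pairs are skipped entirely."""
    if y1 == "NIL" and y2 == "NIL":
        return 0.0
    target = 1.0 if y1 == y2 else 0.0
    return (cosine(f1, f2) - target) ** 2
```

Because the compared vectors come from masked sentences, the loss aligns contexts (not trigger words) of the same event type.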

Attentive Feature Fusion
Using only context-specific features for prediction may lead to sub-optimal performance. Attentive feature fusion is devised to balance the context-specific features and the full-context features, making the reasoning more comprehensive.
Learning Full-Context Features. The full-context feature of a candidate trigger is learned in a similar way as in context-selective discriminative learning, except that the candidate trigger is not masked and the context-selective attention is not performed. Note that with a BERT-based full-context feature learner, we can share the BERT encoder between full-context and context-specific feature learning, so we do not need to double the model parameters. The impact of using other architectures for full-context feature learning is studied in § 6.1.
The Attentive Sentinel. The attentive sentinel learns a trade-off between the context-specific feature F_[MS] and the full-context feature F_[FCT] of a candidate trigger. Specifically, we first compute an attention weight via g = σ(W_g [F_[MS]; F_[FCT]] + b_g), where W_g and b_g are model parameters and σ is the sigmoid function. Then, using this weight, we compute a weighted summation of F_[MS] and F_[FCT] to obtain the final feature of the candidate trigger: F_com = g · F_[MS] + (1 - g) · F_[FCT]. This attention mechanism enables us to learn a dynamic combination of the two features for the final prediction.
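One way to realize the sentinel is a standard sigmoid gate over the concatenated features (a sketch with toy scalar weights; the real W_g is a learned matrix):

```python
import math

def sentinel_fuse(f_ms, f_fct, w_g, b_g):
    """Attentive sentinel: g = sigmoid(w_g . [F_[MS]; F_[FCT]] + b_g),
    then F_com = g * F_[MS] + (1 - g) * F_[FCT]."""
    concat = f_ms + f_fct  # feature concatenation
    logit = sum(w * x for w, x in zip(w_g, concat)) + b_g
    g = 1.0 / (1.0 + math.exp(-logit))
    f_com = [g * a + (1.0 - g) * b for a, b in zip(f_ms, f_fct)]
    return f_com, g

# With zero weights the gate is 0.5, i.e. an even blend of the two features.
f_com, g = sentinel_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

As the gate is learned per candidate trigger, the model can lean on lexical evidence for frequent triggers and on context for rare or masked ones.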

Training and Optimization
Finally, our full approach takes F_com as input and conducts event label classification via o_Final = softmax(W_f F_com + b_f), where o_Final contains the probabilities of different event labels, the predicted event label corresponds to the element with the maximal value, and W_f and b_f are model parameters. A cross-entropy loss is adopted to train our full model: L_F = -Σ_t log o_Final^{(t)}[y^{(t)}], where the symbols have similar meanings as in Eq. (2). We adopt a learning paradigm of pre-training followed by fine-tuning: we first pre-train our model using L_D and L_S, and we then fine-tune it using L_F. In the latter stage, L_D is also included alongside L_F to keep the context-specific feature discriminative enough for prediction. We adopt Adam (Kingma and Ba, 2015) to update the model parameters.
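The two-stage schedule can be sketched as below; the `loss_D`/`loss_S`/`loss_F`/`step` interface is a hypothetical stand-in for the actual model, not the paper's code:

```python
def train(model, data, pretrain_epochs=1, finetune_epochs=1):
    """Pre-train on the masked objectives L_D + L_S, then fine-tune on L_F
    while keeping L_D in the mix to preserve a discriminative
    context-specific feature."""
    history = []
    for _ in range(pretrain_epochs):
        for batch in data:
            loss = model.loss_D(batch) + model.loss_S(batch)
            model.step(loss)
            history.append(("pretrain", loss))
    for _ in range(finetune_epochs):
        for batch in data:
            loss = model.loss_F(batch) + model.loss_D(batch)
            model.step(loss)
            history.append(("finetune", loss))
    return history
```

Keeping L_D active during fine-tuning is what prevents the fused model from collapsing back onto purely lexical shortcuts.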

Experimental Setups
Datasets and Evaluations. We take ACE 2005 and KBP 2017 as the benchmark datasets.
For ACE 2005, we split the corpus into training/development/test sets as recommended in previous works (Li et al., 2013; Chen et al., 2015). For KBP 2017, we adopt the official evaluation settings for training and testing. For evaluation, we adopt Precision (P), Recall (R), and F1-score (F1) as metrics, the same as previous works, for a meaningful comparison. We use the two-tailed Wilcoxon test for significance testing, with a significance level of p = 0.05.
Implementation Details. Our model is implemented with BERT_Large, which has 24 layers, 1024 hidden units, and 16 attention heads, and is pre-trained on large text corpora. We tune hyper-parameters via grid search on the development set. The learning rate is set to 1e-5 (chosen from {1e-5, 2e-5, 1e-4}); the batch size is set to 10 (chosen from {2, 5, 10}). A negative sampling rate of 0.7 is adopted to tackle the imbalance of positive and negative examples (Chen et al., 2015). As one event trigger in KBP 2017 may express multiple event types simultaneously, we adapt the cross-entropy loss to a binary cross-entropy loss for multi-label prediction, with a threshold of 0.3.
Baselines. The following models are used as baselines: 1) DNNED, which adopts a feed-forward neural network for the task and completely ignores context information; 2) DMCNN (Chen et al., 2015); and 3)

The ablation results show that context-selective discriminative learning and contextualized similarity learning each improve performance, but the latter is more important: without it, the model suffers a drop of 2.3% on ACE 2005 and 2.4% on KBP 2017. Another interesting finding is obtained by comparing DNNED with M_MASK, which adopt only lexical and only context information for the task, respectively. We conclude that lexical information is much more important than context information in the standard evaluation. However, learning only such reasoning shortcuts may lead to poor robustness, as shown in the following.

Robustness Probing
We conduct robustness probing with respect to defending against adversarial attacks, exploring unseen predicates, and tackling ambiguity cases. To maintain tractability, in the following experiments we test the model achieving the best performance on the development set, instead of adopting the 5-run average as in the previous evaluation. Moreover, to simplify the analysis, our experiments are mostly conducted on ACE 2005.

Defending Against Adversarial Attacks
For adversarial attacks, we adopt the list-based method (Alzantot et al., 2018) to generate adversarial examples. Specifically, for a word, we first find its semantically similar words based on GloVe embeddings (Pennington et al., 2014); we then replace the original word with each candidate and evaluate the new sentence with a GPT language model (Radford et al., 2019). We take the new sentence with the largest score as the adversarial example. As shown in Table 3, previous methods suffer a severe drop (>47.1%/49.8%) in F1 when facing adversarial attacks. By comparison, our full approach achieves the best performance: 47.9% and 43.3% on ADT and ADC, respectively. M_MASK ranks second and demonstrates the smallest performance gap under adversarial attack; ADT does not even affect its performance, as M_MASK does not rely on the lexical information of the trigger for prediction.
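The list-based substitution can be sketched as below; `neighbors` and `lm_score` are toy stand-ins for the GloVe nearest-neighbor lists and the GPT language-model scorer:

```python
def attack_trigger(tokens, idx, neighbors, lm_score):
    """Replace the word at `idx` with each semantically similar candidate
    and keep the substitution the language model scores highest."""
    best, best_score = None, float("-inf")
    for cand in neighbors.get(tokens[idx], []):
        new_sent = tokens[:idx] + [cand] + tokens[idx + 1:]
        score = lm_score(new_sent)
        if score > best_score:
            best, best_score = new_sent, score
    return best

tokens = "invaders destroyed the whole town".split()
neighbors = {"destroyed": ["annihilated", "ruined"]}
adv = attack_trigger(tokens, 1, neighbors,
                     lambda s: 1.0 if "annihilated" in s else 0.5)
```

The language-model filter keeps the perturbed sentence fluent, so the attack changes surface lexical form while preserving meaning.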

Exploring Unseen Predicates
The original test set may not be a good testbed for exploring unseen predicates, as it is highly biased (unseen cases account for only 8.1%). We therefore adopt a new setting: we first randomly divide the whole ACE corpus into C1 and C2 with a ratio of 1:2, and C1 is used for training/development. Then, for each sentence in C2, we put it into a SEEN or an UNSEEN set based on whether it contains a trigger that appears in C1 (sentences without event triggers are randomly put into either set). We end up with a SEEN set of size 2,896 and an UNSEEN set of size 1,409. Table 4 shows the results of different models. Previous methods behave poorly on the UNSEEN set and demonstrate a large performance gap (>50.4%) between SEEN and UNSEEN. By contrast, our full approach achieves the best performance on both SEEN (78.2%) and UNSEEN (47.6%), with a relatively small gap (30.6%). Moreover, M_MASK ranks second on the UNSEEN set, outperforming all other baselines including M_BERT.
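The SEEN/UNSEEN routing described above can be sketched as follows (a simplification with toy data, not the actual ACE split):

```python
import random

def split_seen_unseen(c2_sentences, c1_triggers, rng=None):
    """Route a C2 sentence to SEEN if any of its triggers occurred in C1 and
    to UNSEEN otherwise; trigger-free sentences go to either side at random."""
    rng = rng or random.Random(0)
    seen, unseen = [], []
    for tokens, triggers in c2_sentences:
        if not triggers:
            (seen if rng.random() < 0.5 else unseen).append(tokens)
        elif any(t in c1_triggers for t in triggers):
            seen.append(tokens)
        else:
            unseen.append(tokens)
    return seen, unseen

c1_triggers = {"destroyed"}
c2 = [(["invaders", "destroyed", "town"], ["destroyed"]),
      (["invaders", "annihilated", "town"], ["annihilated"])]
seen, unseen = split_seen_unseen(c2, c1_triggers)
```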

Tackling Ambiguity Cases
To tackle ambiguity cases, we first define the ambiguity of a word as the entropy of its word-type distribution. We then sort all sentences by their averaged word ambiguity. For example, a high-ambiguity sentence is "There was no shots fired", where "shots" can trigger Attack, Die, Execute, and NIL, and "fired" can trigger Attack, End-Position, and NIL. We select the 500 sentences with the highest ambiguity to construct an HA set and the 500 sentences with the lowest ambiguity to construct an LA set (each sentence must contain at least one event trigger).
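The ambiguity measure can be sketched as follows; the toy count tables are illustrative, not ACE statistics:

```python
import math

def type_entropy(type_counts):
    """Ambiguity of a word: Shannon entropy (in bits) of its event-type
    distribution, with NIL treated as one of the types."""
    total = sum(type_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in type_counts.values() if c > 0)

def sentence_ambiguity(tokens, word_type_counts):
    """Average per-word ambiguity of a sentence; unknown words count as
    unambiguous NIL-only words."""
    return sum(type_entropy(word_type_counts.get(w, {"NIL": 1}))
               for w in tokens) / len(tokens)

counts = {"fired": {"Attack": 1, "End-Position": 1}, "town": {"NIL": 1}}
```

A word that always expresses one type has entropy 0, while an even split over k types gives log2(k) bits, so ranking sentences by average entropy surfaces the genuinely ambiguous cases.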
From the results shown in Table 5, previous ED systems (except M_BERT) have a relatively large performance gap between low-ambiguity and high-ambiguity cases. By contrast, our full approach achieves the best performance with a small gap. Interestingly, M_BERT demonstrates rather good performance on ambiguity cases, which may benefit from its ability to model contexts via pre-training on large corpora. We also note that M_MASK shows comparable performance on low- and high-ambiguity cases.
6 Further Discussion

Predictive Bias Probing
We first explore the integration of M_MASK with existing ED models that learn full-context features. From the results in Table 7, M_MASK has a complementary effect on existing ED systems and boosts their performance. The gain on DNNED is the most salient, as DNNED uses only trigger information and no context information for reasoning, which is the opposite of M_MASK. Additionally, we compare the performance of M_BERT, M_MASK, and M_FULL on different event types in Figure 3. M_BERT performs better on types with relatively few expressions, such as Marry and Convict, but worse on types with diverse expressions, such as Start-ORG, Phone-Write, and Transfer-Ownership. M_MASK is just the opposite. M_FULL takes advantage of the feature fusion of M_BERT and M_MASK, yielding the best performance.

Case Study
We conduct a case study to explore the outputs of M_BERT and our model M_MASK; representative and interesting cases are shown in Table 6. In a) and b), M_BERT makes wrong predictions, which may be due to the prevalence of the patterns release → Transfer-Money (100%) and launched → Transport (78.5%) in the training set. M_BERT also misses c) and d), as the detection of reaching and admits depends completely on context. By contrast, M_MASK correctly identifies all of them. More interesting cases are shown in the second part of Table 6. M_MASK makes wrong predictions in e) and f). This makes sense, as M_MASK is not aware of the trigger's lexical information; even humans may wrongly predict an Arrest-Jail event given "convict has ever been [MASK] in [...]". Examples g), h), and i) are worth further discussion. In our opinion, M_MASK's assignments of an Arrest-Jail event to pull in g) and a Contact-Meet event to address in h) are quite reasonable, but these cases are not labeled in the gold annotations and may have been missed by the ACE annotators. This also illustrates the challenging nature of the ED task.

Conclusion and Future Work
This paper focuses on the robustness of ED. We highlight three stark cases showing the brittleness of existing ED models, and then propose a new approach called context-selective mask generalization, shedding light on robustifying ED models. In the future, we would like to extend our method to other tasks where exploiting context information is crucial, such as named entity recognition and relation extraction.