Towards Open Domain Event Trigger Identification using Adversarial Domain Adaptation

We tackle the task of building supervised event trigger identification models which can generalize better across domains. Our work leverages the adversarial domain adaptation (ADA) framework to introduce domain-invariance. ADA uses adversarial training to construct representations that are predictive for trigger identification, but not predictive of the example’s domain. It requires no labeled data from the target domain, making it completely unsupervised. Experiments with two domains (English literature and news) show that ADA leads to an average F1 score improvement of 3.9 on out-of-domain data. Our best performing model (BERT-A) reaches 44-49 F1 across both domains, using no labeled target data. Preliminary experiments reveal that finetuning on 1% labeled data, followed by self-training leads to substantial improvement, reaching 51.5 and 67.2 F1 on literature and news respectively.


Introduction
Events are a key semantic phenomenon in natural language understanding. They embody a basic function of language: the ability to report happenings. Events are a basic building block for narratives across multiple domains such as news articles, stories and scientific abstracts, and are important for many downstream tasks such as question answering (Saurí et al., 2005) and summarization (Daniel et al., 2003). Despite their utility, event extraction remains an onerous task. A major reason for this is that the notion of what counts as an "event" depends heavily on the domain and task at hand. For example, should a system which extracts events from doctor notes only focus on medical events (e.g., symptoms, treatments), or also annotate lifestyle events (e.g., dietary changes, exercise habits) which may have bearing on the patient's illness? To circumvent this, prior work has mainly focused on annotating specific categories of events (Grishman and Sundheim, 1996; Doddington et al., 2004; Kim et al., 2008) or narratives from specific domains (Pustejovsky et al., 2003; Sims et al., 2019). This has an important implication for supervised event extractors: they do not generalize to data from a different domain or containing different event types (Keith et al., 2017). Conversely, event extractors that incorporate syntactic rule-based modules (Saurí et al., 2005; Chambers et al., 2014) tend to overgenerate, labeling most verbs and nouns as events. Achieving a balance between these extremes will help in building generalizable event extractors, a crucial problem since annotated training data may be expensive to obtain for every new domain.
Prior work has explored unsupervised (Huang et al., 2016; Yuan et al., 2018), distantly supervised (Keith et al., 2017; Chen et al., 2017; Araki and Mitamura, 2018; Zeng et al., 2018) and semi-supervised approaches (Liao and Grishman, 2010; Huang and Riloff, 2012; Ferguson et al., 2018), which largely focus on automatically generating in-domain training data. In our work, we try to leverage annotated training data from other domains. Motivated by the hypothesis that events, despite being domain/task-specific, often occur in similar contextual patterns, we try to inject lexical domain-invariance into supervised models, improving generalization, while not overpredicting events.
Concretely, we focus on event trigger identification, which aims to identify triggers (words) that instantiate an event. For example, in "John was born in Sussex", born is a trigger, invoking a BIRTH event. To introduce domain-invariance, we adopt the adversarial domain adaptation (ADA) framework (Ganin and Lempitsky, 2015) which constructs representations that are predictive for trigger identification, but not predictive of the example's domain, using adversarial training. This framework requires no labeled target domain data, making it completely unsupervised. Our experiments with two domains (English literature and news) show that ADA makes supervised models more robust on out-of-domain data, with an average F1 score improvement of 3.9, at no loss of in-domain performance. Our best performing model (BERT-A) reaches 44-49 F1 across both domains using no labeled data from the target domain. Further, preliminary experiments demonstrate that finetuning on 1% labeled data, followed by self-training leads to substantial improvement, reaching 51.5 and 67.2 F1 on literature and news respectively.

Approaching Open Domain Event Trigger Identification
Throughout this work, we treat event trigger identification as a token-level classification task: for each token in a sequence, we predict whether it is an event trigger. To ensure that our trigger identification model can transfer across domains, we leverage the adversarial domain adaptation (ADA) framework (Ganin and Lempitsky, 2015), which has been used in several NLP tasks (Ganin et al., 2016; Li et al., 2017; Shah et al., 2018). Figure 1 gives an overview of the ADA framework for event trigger identification. It consists of three components: i) a representation learner (R), ii) an event classifier (E) and iii) a domain predictor (D). The representation learner generates token-level representations, while the event classifier and domain predictor use these representations to identify event triggers and to predict the domain to which the sequence belongs, respectively. The key idea is to train the representation learner to generate representations which are predictive for trigger identification but not predictive for domain prediction, making it more domain-invariant. A notable benefit here is that the only data we need from the target domain is unlabeled data.
For domain prediction, we use an auxiliary dataset $D^a = \{(x^a_1, d^a_1), \ldots, (x^a_n, d^a_n)\}$, where $x^a_i$ is the token sequence and $d^a_i$ is the domain label, built using token sequences from the labeled source domain dataset $D_s$ and unlabeled target domain sentences. The representation learner R maps a token sequence of length $k$ to token-level representations $h_{i1}, \ldots, h_{ik}$.
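The construction of the domain-labeled dataset described above can be illustrated with a minimal sketch. The function name `build_domain_dataset` and the integer encoding of the domain labels are assumptions made for illustration only; they are not taken from the original implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical integer encoding of the domain label d_i^a.
SOURCE, TARGET = 0, 1


def build_domain_dataset(
    source_seqs: Sequence[Sequence[str]],
    target_seqs: Sequence[Sequence[str]],
) -> List[Tuple[List[str], int]]:
    """Pair every token sequence with a domain label.

    Event labels are not needed here, so the target side can be
    completely unlabeled text.
    """
    data = [(list(tokens), SOURCE) for tokens in source_seqs]
    data += [(list(tokens), TARGET) for tokens in target_seqs]
    return data


# Toy example: one source-domain sentence and one target-domain sentence.
source = [["John", "was", "born", "in", "Sussex"]]
target = [["Stocks", "fell", "sharply"]]
d_a = build_domain_dataset(source, target)
```

Only the token sequences from the source data are reused here; their event annotations play no role in domain prediction.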

Adversarial Domain Adaptation
The domain predictor D creates a pooled representation $p_i = \mathrm{Pool}(h_{i1}, \ldots, h_{ik})$ and maps it to the domain label $d^a_i$. Given this setup, we apply an alternating optimization procedure. In the first step, we train the domain predictor using $D^a$, to optimize the following loss:

$$\arg\min_{D} \sum_{i=1}^{n} L(D(p_i), d^a_i)$$

In the second step, we train the representation learner and event classifier using $D_s$ to optimize the following loss:

$$\arg\min_{R,E} \sum_{i} \Big[ \sum_{j=1}^{k} L(E(h_{ij}), e_{ij}) - \lambda \, L(D(p_i), d^a_i) \Big]$$

Here $e_{ij}$ denotes the event label of the $j$-th token in sequence $i$, $L$ refers to the cross-entropy loss and $\lambda$ is a hyperparameter. In practice, the optimization in the above equation is performed using a gradient reversal layer (GRL) (Ganin and Lempitsky, 2015). A GRL works as follows. During the forward pass, it acts as the identity, but during the backward pass it scales the gradients flowing through by $-\lambda$. We apply a GRL $g_\lambda$ before mapping the pooled representation to a domain label using D. This changes the optimization to:

$$\arg\min_{R,E,D} \sum_{i} \Big[ \sum_{j=1}^{k} L(E(h_{ij}), e_{ij}) + L(D(g_\lambda(p_i)), d^a_i) \Big]$$

In our setup, the event classifier and domain predictor are MLP classifiers. For the representation learner, we experiment with several architectures.
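To make the behaviour of the gradient reversal layer concrete, the following is a minimal sketch written without any autograd library. The class name `GradientReversal` and its explicit `forward`/`backward` methods are assumptions for illustration; in an automatic differentiation framework the same behaviour would typically be expressed as a custom differentiable function.

```python
from typing import List


class GradientReversal:
    """Identity on the forward pass; scales gradients by -lambda on the backward pass."""

    def __init__(self, lam: float) -> None:
        self.lam = lam

    def forward(self, x: List[float]) -> List[float]:
        # The pooled representation passes through unchanged.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # Gradients coming from the domain loss are reversed and scaled,
        # so the representation learner is pushed to *increase* domain loss.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
pooled = [1.0, -2.0, 3.0]                  # toy pooled representation p_i
out = grl.forward(pooled)                  # same values as `pooled`
grad_from_domain_loss = [0.2, 0.4, -1.0]   # toy gradient of the domain loss
grad_to_representation = grl.backward(grad_from_domain_loss)
```

Because the domain predictor sits after the layer, it still receives ordinary gradients and learns to predict the domain, while everything before the layer receives the reversed signal.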

Representation Learner Models
We experiment with the following models:
LSTM: A unidirectional LSTM over tokens represented using word embeddings.

Results and Analysis
Tables 2 and 3 present the results of our experiments. Table 2 shows the results when transferring from LitBank to TimeBank, while Table 3 presents transfer results in the other direction. From Table 2 (transfer from LitBank to TimeBank), we see that ADA improves out-of-domain performance for all models, by 6.08 F1 on average. BERT-A performs best, reaching an F1 score of 49.6, using no labeled news data. Transfer experiments from TimeBank to LitBank (Table 3) showcase similar trends, with only BiLSTM not showing improvement with ADA. For the other models, ADA results in an average out-of-domain F1 score improvement of 1.77. BERT-A performs best, reaching an F1 score of 44.1. We also note that models transferred from LitBank to TimeBank have high precision, while models transferred in the other direction have high recall. We believe this difference stems from the disparity in event density across corpora (Table 1). Since event density in LitBank is much lower, models transferred from LitBank tend to be slightly conservative (high precision), while models transferred from TimeBank are less so (high recall). When transferring from LitBank to TimeBank, LSTM generalizes better than BiLSTM, which may be because BiLSTM has twice as many parameters, making it more prone to overfitting. ADA gives a higher F1 boost with BiLSTM, indicating that it may be acting as a regularizer. Another interesting result is the poor performance of POS when transferring from LitBank to TimeBank. This might stem from the Stanford CoreNLP tagger (trained on news data) producing inaccurate tags for LitBank. Hence using automatically generated POS tags while training on LitBank does not produce ...
Overall, ADA makes supervised models more robust on out-of-domain data, with an average F1 score improvement of 3.9, at no loss of in-domain performance.
What cases does ADA improve on? To gain more insight into the improvements observed on using ADA, we perform a manual analysis of out-of-domain examples that BERT labels incorrectly, but BERT-A gets right. We carry out this analysis on 50 examples each from TimeBank and LitBank. We observe that an overwhelming number of cases from TimeBank use vocabulary in contexts unique to news (43/50 or 86%). This includes examples of financial events, political events and reporting events that are rarer in literature, indicating that ADA manages to reduce event extraction models' reliance on lexical features. We make similar observations for LitBank, though the proportion of improvement cases with literature-specific vocabulary is more modest (22/50 or 44%). These cases include examples with archaic vocabulary, words that have a different meaning in literary contexts, and human/animal actions, which are not common in news. Table 4 presents a detailed breakdown of these cases.

Incorporating Minimal Labeled Data
Finetuning on labeled data: We run finetuning experiments to study improvement in model performance on incorporating small amounts of labeled target domain data. For both domains, we finetune BERT-A, slowly increasing the percentage of labeled data used from 1% to 5%. We compare BERT-A with two other models. The first model is naive BERT with no domain adaptation (BERT-NoDA). The second model is a BERT model trained via supervised domain adaptation (BERT-FEDA), which we use as an indicator of ceiling performance. The supervised domain adaptation method we use is the neural modification of frustratingly easy domain adaptation developed in Kim et al. (2016). Frustratingly easy domain adaptation (Daumé III, 2007) uses a feature augmentation strategy to improve performance when annotated data from both source and target domains is available. This algorithm simply duplicates input features 3 times, creating a source-specific, target-specific and general version of each feature. For source data, only the source-specific and general features are active, while only the target-specific and general features are active for target data. The neural modification works by duplicating the feature extractor module, which is the BiLSTM in our case.
Figures 2 and 3 present the results of these experiments. Performance of all models steadily improves with more data, but BERT-A starts with a much higher F1 score than BERT-NoDA, demonstrating that ADA boosts performance when little annotated training data is available. The performance increase of BERT-NoDA is surprisingly rapid, especially on LitBank. However, it is worth noting that 5% of the LitBank training set is ∼10,000 tokens, which is a substantial amount to annotate. Therefore, BERT-A beats BERT-NoDA on sample efficiency.

Table 5: Model performance on both domains in the self-training paradigm

Dataset    P     R     F1
TimeBank   68.9  65.5  67.2
LitBank    40.3  71.5  51.5
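The feature augmentation strategy of frustratingly easy domain adaptation described above can be sketched as follows. This illustrates the original feature-level formulation (Daumé III, 2007) rather than the neural modification used for BERT-FEDA; the function name `augment`, the ordering of the three copies and the string domain tags are assumptions made for this example.

```python
from typing import List


def augment(features: List[float], domain: str) -> List[float]:
    """Return [general | source-specific | target-specific] copies of the features.

    The copy belonging to the other domain is zeroed out, so it is inactive.
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    zeros = [0.0] * len(features)
    general = list(features)
    source_part = list(features) if domain == "source" else zeros
    target_part = list(features) if domain == "target" else zeros
    return general + source_part + target_part


x = [0.5, 1.0]
src = augment(x, "source")  # general and source-specific copies active
tgt = augment(x, "target")  # general and target-specific copies active
```

A single classifier trained on the augmented vectors can then learn weights shared across domains through the general copy, and domain-specific corrections through the other two.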
We can also see that BERT-A does not do much worse than BERT-FEDA, which performs supervised adaptation.
Using BERT-A to provide weak supervision: We run further experiments to determine whether finetuned BERT-A can be leveraged for self-training (Yarowsky, 1995; Riloff and Wiebe, 2003). Self-training creates a teacher model from labeled data, which is then used to label a large amount of unlabeled data. Both labeled and unlabeled datasets are jointly used to train a student model. Algorithm 1 gives a quick overview of our self-training procedure. We use 1% of the training data as $D_l$, with the remaining 99% used as $D_u$. BERT-A acts as $T$, while $S$ is a vanilla BERT model. Table 5 shows the results of self-training on both domains. Self-training improves model performance by nearly 7 F1 points on average. The increase on TimeBank is much higher, which may be due to the high-precision, low-recall tendency of the teacher model.
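The self-training loop described above can be sketched in a few lines. This is a simplified illustration, not a reproduction of Algorithm 1: the names `self_train` and `train_lookup` are hypothetical, examples are single tokens rather than sentences, and a per-token majority-vote lookup stands in for both the teacher (BERT-A) and the student (vanilla BERT).

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]          # (token, label); 1 = event trigger
Model = Callable[[str], int]       # maps a token to a predicted label
Trainer = Callable[[Sequence[Example]], Model]


def train_lookup(examples: Sequence[Example]) -> Model:
    """Toy trainer: memorize the majority label per token, default to 0."""
    votes: Dict[str, Counter] = {}
    for token, label in examples:
        votes.setdefault(token, Counter())[label] += 1
    table = {tok: c.most_common(1)[0][0] for tok, c in votes.items()}
    return lambda token: table.get(token, 0)


def self_train(
    labeled: Sequence[Example],
    unlabeled: Sequence[str],
    train_teacher: Trainer,
    train_student: Trainer,
) -> Model:
    # 1. Build the teacher T from the small labeled set D_l.
    teacher = train_teacher(labeled)
    # 2. Use T to assign pseudo-labels to the unlabeled set D_u.
    pseudo: List[Example] = [(x, teacher(x)) for x in unlabeled]
    # 3. Train the student S jointly on labeled and pseudo-labeled data.
    return train_student(list(labeled) + pseudo)


d_l = [("born", 1), ("in", 0)]
d_u = ["born", "Sussex", "in"]
student = self_train(d_l, d_u, train_lookup, train_lookup)
```

Passing separate trainers for teacher and student mirrors the setup in which the two models differ, even though the toy example uses the same trainer for both.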

Conclusion
In this work, we tackled the task of building generalizable supervised event trigger identification models using adversarial domain adaptation (ADA). We showed that ADA made supervised models more robust on out-of-domain data, with an average F1 score improvement of 3.9. Our best performing model (BERT-A) was able to reach 44-49 F1 across both domains using no labeled target domain data. Preliminary experiments showed that finetuning BERT-A on 1% labeled data, followed by self-training, led to substantial improvement, reaching 51.5 and 67.2 F1 on literature and news respectively. While these results are encouraging, we are yet to match supervised in-domain model performance. Future directions to explore include incorporating noise-robust training procedures (Goldberger and Ben-Reuven, 2017) and example weighting (Dehghani et al., 2018) during self-training, and exploring lexical alignment methods from literature on learning cross-lingual embeddings.