MAVEN: A Massive General Domain Event Detection Dataset

Event detection (ED), which identifies event trigger words and classifies event types according to contexts, is the first and most fundamental step for extracting event knowledge from plain text. Most existing datasets exhibit the following issues that limit further development of ED: (1) The small scale of existing datasets is not sufficient for training and stably benchmarking increasingly sophisticated modern neural methods. (2) The limited event types of existing datasets mean that trained models cannot be easily adapted to general-domain scenarios. To alleviate these problems, we present a MAssive eVENt detection dataset (MAVEN), which contains 4,480 Wikipedia documents, 117,200 event mention instances, and 207 event types. MAVEN alleviates the lack-of-data problem and covers much more general event types. Besides the dataset, we reproduce recent state-of-the-art ED models and conduct a thorough evaluation of these models on MAVEN. The experimental results and empirical analyses show that existing ED methods cannot achieve results as promising as those on the small datasets, which suggests that ED in the real world remains a challenging task and requires further research efforts. The dataset and baseline code will be released in the future to promote this field.


Introduction
Event detection (ED) is an important task of information extraction, which aims to identify event triggers (the words or phrases evoking events in text) and classify event types. For instance, in the sentence "Bill Gates founded Microsoft in 1975", an ED model should recognize that the word "founded" is the trigger of a Found event. ED is the first stage to extract event knowledge from text (Ahn, 2006) and is also fundamental to various NLP applications (Yang et al., 2003; Basile et al., 2014; Cheng and Erk, 2018).
Different from the advanced models that have been continuously proposed, the benchmark datasets for ED are upgraded slowly. As event annotation is complex and expensive, the existing datasets are mostly small-scale. As shown in Figure 1, the most widely-used ACE 2005 English dataset (Walker et al., 2006) only contains 599 documents and 5,349 annotated instances. Due to the inherent data imbalance problem, 20 of its 33 event types have fewer than 100 annotated instances. As recent neural methods are typically data-hungry, these small-scale datasets are not sufficient for training and stably benchmarking modern models. Moreover, the covered event types in existing datasets are limited. The ACE 2005 English dataset only contains 8 event types and 33 specific subtypes. The Rich ERE ontology (Song et al., 2015) used by the TAC KBP challenges (Ellis et al., 2015, 2016) covers 9 event types and 38 subtypes. The coverage of these datasets is low for general domain events, so models trained on them cannot be easily transferred and applied to general applications.
Recent research (Huang et al., 2016; Chen et al., 2017) has shown that the existing datasets, suffering from the lack-of-data and low-coverage problems, are increasingly incompetent for benchmarking emerging supervised methods, i.e., the evaluation results on these datasets can hardly reflect the effectiveness of novel methods. To tackle these issues, some works adopt distantly supervised methods (Mintz et al., 2009) to automatically annotate data with existing event instances in knowledge bases (Chen et al., 2017; Zeng et al., 2018; Araki and Mitamura, 2018) or use bootstrapping methods to generate new instances (Ferguson et al., 2018; Wang et al., 2019a). However, the generated data are inevitably noisy and homogeneous due to the limited number of event instances in existing knowledge bases.
In this paper, we present MAVEN, a human-annotated large-scale general domain event detection dataset constructed from English Wikipedia and FrameNet (Baker et al., 1998), which significantly alleviates the lack-of-data and low-coverage problems: (1) MAVEN contains 109,567 different events, 117,200 event mentions, which is twenty times more than the most widely-used ACE 2005, and 4,480 annotated documents in total. To the best of our knowledge, this is the largest human-annotated event detection dataset to date. (2) MAVEN contains 207 event types in total, which cover a much broader range of general domain events. These event types are manually selected and derived from the frames defined in the linguistic resource FrameNet (Baker et al., 1998). As previous works (Aguilar et al., 2014; Huang et al., 2018) show, the FrameNet frames have good coverage of general event semantics. In our event schema, we maintain this good coverage of event semantics while simplifying the event types to enable crowd-sourced annotation, by manually selecting the event-related frames and merging some fine-grained frames.
We reproduce some recent state-of-the-art ED models and conduct a thorough evaluation of these models on our MAVEN dataset. From the experimental results and empirical analyses, we observe a significant performance drop of these models compared with their results on existing ED benchmarks. This indicates that detecting general-domain events is still challenging and that the existing datasets can hardly support further explorations. We hope that MAVEN could encourage the community to make further exploration and breakthroughs towards better ED methods.

Event Detection Formalization
In our dataset, we mostly follow the settings and terminologies defined in the ACE 2005 program (Doddington et al., 2004). We specify the vital terminologies as follows: An event is a specific occurrence involving participants (Consortium, 2005). In MAVEN, we mainly focus on extracting basic events that can be described in one or a few sentences rather than high-level grand events. Each event is labeled with a certain event type. An event mention is a sentence whose semantic meaning describes an event. As the same event may be mentioned many times in different sentences of a document, the number of event mentions is larger than the number of events. An event trigger is the key word or phrase in an event mention that most clearly expresses the event.
The ED task is to identify event triggers and classify event types for given sentences. Accordingly, ED is conventionally divided into two subtasks: trigger identification and trigger classification (Ahn, 2006). Trigger identification is to identify the annotated triggers from all possible words and phrases. Trigger classification is to classify the corresponding event types for the identified triggers. Both subtasks are evaluated with precision, recall, and F-1 scores. Recent neural methods typically formulate ED as a token-level multi-class classification task (Chen et al., 2015; Nguyen et al., 2016) or a sequence labeling task (Chen et al., 2018; Zeng et al., 2018), and only report the classification results for comparison (an additional type N/A is classified at the same time, indicating that a candidate is not a trigger). In MAVEN, we inherit all the above-mentioned settings in both constructing the dataset and evaluating the typical models.
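As a minimal illustration of this token-level formulation, each candidate token is mapped to an event type or the extra N/A class. The lexicon-based stand-in classifier below is purely hypothetical; a real ED model would predict these labels with a neural network:

```python
# Toy sketch of ED as token-level multi-class classification: every
# candidate token receives an event type label or "N/A" (not a trigger).
# The lexicon "classifier" is a hypothetical stand-in for a neural model.
EVENT_LEXICON = {"founded": "Found"}  # illustrative, not MAVEN's schema

def classify_tokens(tokens):
    """Return (token, predicted_type) pairs for one sentence."""
    return [(tok, EVENT_LEXICON.get(tok.lower(), "N/A")) for tok in tokens]

predictions = classify_tokens(["Bill", "Gates", "founded", "Microsoft", "in", "1975"])
```

Under this view, trigger identification and classification collapse into one labeling decision per token, which is the setting MAVEN's benchmark adopts.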

Data Collection
In this section, we introduce the detailed data collection process of MAVEN, which includes the following stages: (1) event schema construction, (2) document selection, (3) automatic labeling and candidate selection, and (4) human annotation.

Event Schema Construction
The event schemata used by existing ED datasets like ACE (Doddington et al., 2004), Light ERE (Aguilar et al., 2014) and Rich ERE (Song et al., 2015) only include limited event types (e.g., Life, Movement, Contact). Hence, we need to construct a new event schema with good coverage of general-domain events for our dataset.
Inspired by Aguilar et al. (2014), we mostly use the frames defined in FrameNet (Baker et al., 1998) as our event types for good coverage. FrameNet follows frame semantic theory (Fillmore et al., 1976; Charles et al., 1982) and defines over 1,200 semantic frames along with corresponding frame elements, frame relations, and lexical units. From the ED perspective, some of the frames and lexical units can be used as event types and example triggers respectively. The frame elements can also be used as event arguments in the future.
Considering FrameNet is primarily a linguistic resource constructed by linguistic experts, it prioritizes lexicographic and linguistic completeness over ease of annotation (Aguilar et al., 2014). To enable crowd-sourced annotation with a massive number of annotators, we simplify the original frame schema to construct our event schema. We first collect 598 event-related frames from FrameNet by selecting the frames inherited from the Event frame. Then we manually filter out some abstract frames (e.g., Activity ongoing, Process resume), merge some similar frames (e.g., Choosing and Adopt selection), and assemble overly fine-grained frames into more generalized frames (e.g., Visitor arrival, Visit host arrival, and Drop in on into Arriving). Finally, we get a new event schema with 207 event types.

Document Selection
To support the annotation, we need to select informative documents as our basic corpus. Each document should contain a number of events, and the events should have connections (e.g., a storyline) to facilitate further research such as event coreference resolution and event relation extraction. To this end, we choose English Wikipedia as our data source, considering it is informative and widely used (Rajpurkar et al., 2016; Yang et al., 2018). Meanwhile, Wikipedia contains rich entities, which will benefit further event argument annotation in the next step.
To effectively select articles containing enough events, we follow a simple intuition: articles describing grand "topic events" may contain many more basic events than articles giving specific entity definitions. Recently, EventWiki (Ge et al., 2018), a knowledge base of major events, was proposed, in which each major event is described with a Wikipedia article. We thus use the articles indexed by EventWiki as a base and select some articles to annotate their basic events concerned with our event schema. Finally, we select 4,480 documents in total, covering 90 of the 95 major event types defined in EventWiki. Table 1 shows the top 5 EventWiki types of our selected documents.
To ensure the quality of articles, we follow previous settings (Yao et al., 2019) and use the introductory sections of the Wikipedia articles for annotation. Moreover, we filter out the articles with fewer than 5 sentences or fewer than 10 event-related frames labeled by a semantic labeling tool (Swayamdipta et al., 2017).
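The filtering rule above can be sketched as follows (a simplified illustration; the inputs stand in for real sentence splitting and the frame-semantic parser's output):

```python
# Keep an introductory section only if it has at least 5 sentences and
# at least 10 event-related frame labels from the semantic labeling
# tool; otherwise discard the article.
MIN_SENTENCES = 5
MIN_EVENT_FRAMES = 10

def keep_article(sentences, event_frames):
    """sentences: sentence strings of the introductory section;
    event_frames: event-related frames predicted for the section."""
    return len(sentences) >= MIN_SENTENCES and len(event_frames) >= MIN_EVENT_FRAMES
```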

Automatic Labeling and Candidate Selection
As we have a large scale of data to be annotated with 207 event types and our annotators are typically not linguistic experts, it is necessary to adopt some heuristic methods to limit the number of trigger candidates and the number of event type candidates for each trigger candidate.
We first do POS tagging with the NLTK toolkit (Bird, 2006), and select the content words (nouns, verbs, adjectives, and adverbs) as the trigger candidates to be annotated. As event triggers can also be phrases rather than single words, the phrases in documents that can be matched with the phrases provided in FrameNet are also selected as trigger candidates. For each trigger candidate, we only recommend the 15 event types with the highest cosine similarities between the trigger word embedding and the average of the word embeddings of each event type's corresponding lexical units in FrameNet. The word embeddings used here are pre-trained word vectors trained with GloVe (Pennington et al., 2014).
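The two heuristics can be sketched as follows (a simplified illustration: toy 3-dimensional vectors stand in for the GloVe embeddings, and the POS tags and type names are illustrative):

```python
import numpy as np

# Heuristic 1: keep only content words as trigger candidates.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def trigger_candidates(tagged_tokens):
    """tagged_tokens: (word, POS) pairs; phrase matching against
    FrameNet lexical units is omitted in this sketch."""
    return [w for w, pos in tagged_tokens if pos in CONTENT_POS]

# Heuristic 2: recommend the k event types whose averaged
# lexical-unit embedding is most cosine-similar to the trigger word.
def recommend_types(word_vec, type_unit_vecs, k=15):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {t: cosine(word_vec, np.mean(vecs, axis=0))
              for t, vecs in type_unit_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In annotation, only the top-ranked types are shown to annotators, which keeps the 207-way decision tractable for non-experts.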
We also automatically label some of the trigger candidates with a frame semantic parser (Swayamdipta et al., 2017) and use the predicted frames as the default event types. The annotators can change them to more appropriate event types or just keep them as the final decision to save time and effort.

Human Annotation
The final step is human annotation. The annotators are required to label the trigger candidates with appropriate event types (if multiple event types are suitable for an event, we ask the annotators to label the most fine-grained one) and merge the event mentions (triggers) expressing the same event.
As event annotation is complicated, to ensure the accuracy and consistency of our annotation, we follow the ACE 2005 annotation process (Consortium, 2005) and organize a two-stage iterative annotation. In the first stage, 121 crowd-source annotators annotate the documents given the default results and candidate sets described in the last section. Each document is annotated twice by two independent annotators. In the second stage, 17 experienced annotators and experts review the results of the first stage and make the final decisions given the results of the two first-stage annotators.

Data Analysis
In this section, we analyze our MAVEN to show the features of the dataset.

Data Size
We show the main statistics of MAVEN compared with some existing widely-used ED datasets in Table 2, including the most widely-used ACE 2005 dataset (Walker et al., 2006) and a series of Rich ERE annotation datasets used by the TAC KBP competitions, which are DEFT Rich ERE English Training Annotation V2 (LDC2015E29), DEFT Rich ERE English Training Annotation R2 V2 (LDC2015E68), DEFT Rich ERE Chinese and English Parallel Annotation V2 (LDC2015E78), TAC KBP Event Nugget data 2014-2016 (LDC2017E02) (Ellis et al., 2014, 2015, 2016) and TAC KBP 2017 (LDC2017E55) (Getman et al., 2017). Even the combination of all the Rich ERE datasets is still much smaller than our MAVEN. MAVEN is larger than all existing ED datasets, especially in the number of events. Hopefully, the large-scale dataset can facilitate research on general domain ED.

Data Distribution
Figure 2 shows the instance-number distribution of the top 50 event types in MAVEN. It is still a long-tail distribution due to the inherent data imbalance problem. However, as MAVEN is large-scale, 32% and 87% of the event types have more than 500 and 100 instances respectively. Compared with existing datasets like ACE 2005 (where only 13 of 33 event types have more than 100 instances), MAVEN significantly alleviates the data sparsity and data imbalance problems, which will benefit the development of strong ED models and various event-related downstream applications.

Benchmark Setting
Before the experiments, we introduce the benchmark setting here. Our data are randomly split into training, development, and test sets; the statistics of the three sets are shown in Table 3. After the data split, 52 and 167 of the 207 event types have more than 500 and 100 training instances respectively, which ensures the models can be well-trained. Conventionally, existing ED datasets only provide the standard annotation of positive instances (i.e., the annotated event triggers), and researchers sample the negative instances (i.e., non-trigger words or phrases) from the documents by themselves, which may lead to potentially unfair comparisons between different methods. In MAVEN, we provide official negative instances to ensure a fair comparison. As described in the candidate selection part of Section 3, the negative instances are the content words labeled by the NLTK POS tagger or the phrases that can be matched with the lexical units in FrameNet. In other words, we only filter out function words, which will not influence the application of models developed on MAVEN.

Experiments
In this section, we conduct experiments to show the challenges of MAVEN and analyze the results to discuss future directions for ED.

Experimental Setting
We first introduce the experimental settings, including the representative models we use and the evaluation metrics.
Models Recently, various neural models have been developed for ED and have achieved superior performance compared with traditional feature-based models. Hence, we reproduce three representative state-of-the-art neural models and report their performance on both MAVEN and the widely-used ACE 2005 to assess the challenge of MAVEN. The models include: • DMCNN (Chen et al., 2015) is representative of convolutional neural network (CNN) models. It leverages a CNN to automatically learn sequence representations and proposes a dynamic multi-pooling mechanism to aggregate the CNN outputs into trigger-specific representations. Specifically, dynamic multi-pooling separately max-pools the feature vectors before and after the position of the trigger candidate, and concatenates the two features as the final sentence representation.
• MOGANED (Yan et al., 2019) proposes a multi-order graph attention network to effectively model the multi-order syntactic relations in dependency trees and improve ED.
• DMBERT is a vanilla BERT-based model proposed in Wang et al. (2019a). It takes advantage of the effective pre-trained language representation model BERT (Devlin et al., 2019) and also uses a dynamic multi-pooling method to aggregate the features. We use the BERT-Base model and the released pre-trained checkpoints. DMBERT is implemented with HuggingFace's Transformers library (Wolf et al., 2019). In our reproduction, we insert special tokens around the trigger candidate and use a much larger batch size; hence, the results are higher than those of the original implementation (Wang et al., 2019a).
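Two of the mechanisms above can be sketched in a few lines of numpy (a simplified illustration, not the original implementations; the marker token names are assumptions, and DMCNN applies the pooling to CNN feature maps rather than raw feature vectors):

```python
import numpy as np

# Dynamic multi-pooling (DMCNN, DMBERT): max-pool the per-token
# features separately up to the trigger position and after it, then
# concatenate, so the representation is trigger-specific.
def dynamic_multi_pooling(features, trigger_pos):
    """features: (seq_len, dim) array; trigger_pos: trigger index."""
    left = features[: trigger_pos + 1].max(axis=0)
    right_seg = features[trigger_pos + 1:]
    if right_seg.size == 0:  # trigger at the end of the sequence
        right_seg = features[trigger_pos:]
    return np.concatenate([left, right_seg.max(axis=0)])  # shape (2 * dim,)

# Marking the trigger candidate with special tokens before feeding the
# sequence to BERT, as in our DMBERT reproduction (the marker token
# names here are placeholders, not the actual tokens used).
def mark_trigger(tokens, start, end, left="[unused0]", right="[unused1]"):
    return tokens[:start] + [left] + tokens[start:end] + [right] + tokens[end:]
```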
Evaluation Following the widely-used setting introduced in Section 2, we report precision, recall, and F-1 scores as our evaluation metrics, and all the models directly do trigger classification without an additional identification stage. On MAVEN, all the models use the standard data split and negative instances described in Section 4.3. We run each model multiple times and report the average and standard deviation of each metric. On ACE 2005, we use 40 newswire articles for testing, 30 random documents for development, and the remaining 529 documents for training, following previous works (Chen et al., 2015; Wang et al., 2019b), and sample all the unlabeled content words as negative instances.
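The metrics can be computed as in the following sketch (micro-averaged over trigger instances; the tuple format used to identify a trigger is an illustrative choice):

```python
# Precision, recall, and F-1 for trigger classification: a predicted
# trigger counts as correct only if both its span and its event type
# match a gold annotation.
def micro_prf(gold, pred):
    """gold, pred: sets of (sentence_id, trigger_span, event_type)."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```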

Experimental Results
We report the overall experimental results in Table 4, from which we have the following observations: • Although the models perform well on the small-scale ACE 2005 dataset, their performance is significantly lower and unsatisfactory on MAVEN. This indicates that MAVEN is challenging and that general domain ED still needs more research efforts.
• The deviations of DMBERT are larger than those of the other models, which is consistent with recent findings (Dodge et al., 2020) that the fine-tuning processes of pre-trained language models (PLMs) are often unstable. It is necessary to run multiple times and report averages or medians when evaluating PLM-based models.

Related Work
As stated in Section 2, we follow the ED task definition specified in the ACE challenges, especially the ACE 2005 dataset (Doddington et al., 2004), which requires ED models to generally detect the event triggers and classify them into specific event types. The ACE event schema and annotation standard were simplified into Light ERE and further extended to Rich ERE (Song et al., 2015) to cover more, but still a limited number of, event types. Rich ERE was used to create various datasets for the TAC KBP 2014-2017 challenges (Ellis et al., 2014, 2015, 2016; Getman et al., 2017). Nowadays, the majority of ED and event extraction models (Ji and Grishman, 2008; Li et al., 2013; Chen et al., 2015; Feng et al., 2016; Liu et al., 2017; Zhao et al., 2018; Yan et al., 2019) are developed on these datasets.
Our MAVEN follows this effective framework, extending it to numerous general domain event types and a large scale of data.
There are also various datasets that define the ED task in different ways. The early MUC series datasets (Grishman and Sundheim, 1996) define event extraction as a slot-filling task. The TDT corpus (Allan, 2012) and some other datasets (Araki and Mitamura, 2018; Sims et al., 2019; Liu et al., 2019) follow the open-domain paradigm, which does not require models to classify specific event types; this gives better coverage but limits the downstream application of the extracted events. Some datasets are developed for event extraction on special domains, like the biomedical domain (Pyysalo et al., 2007; Kim et al., 2008; Thompson et al., 2009; Buyko et al., 2010; Nédellec et al., 2013), literature (Sims et al., 2019), Twitter (Ritter et al., 2012; Guo et al., 2013) and breaking news (Pustejovsky et al., 2003). These datasets are also typically small-scale due to the inherent complexity of event annotation, but their different settings are complementary to our framework.

Conclusion and Future work
In this paper, we present a massive general domain event detection dataset (MAVEN), which significantly alleviates the data sparsity and low coverage problems of existing datasets. We conduct a thorough evaluation of state-of-the-art models on MAVEN and observe a significant performance drop, which indicates that MAVEN is challenging and may facilitate further research on ED.
We will further construct a hierarchical event schema and conduct more experiments to verify the effectiveness of MAVEN. The dataset and code will be released when we finish the data cleansing and experiments in more settings. If you need the dataset in this version during this period, please e-mail the authors. In the future, we will also explore extending MAVEN to more event-related tasks such as event argument extraction and event sequencing.

Table 2 :
Statistics of MAVEN compared with existing widely-used ED datasets. #Event Type shows the number of the most fine-grained types (i.e., the "subtypes" of ACE and ERE). For the multilingual datasets, we report the statistics of the English subset (typically the largest subset) for direct comparison. We merge all the Rich ERE datasets and remove the duplicate documents to get the "All" statistics. The numbers of tokens are acquired with the NLTK tokenizer.

Table 3 :
Statistics of MAVEN data split.

Table 4 :
The overall performance of different models on ACE 2005 and MAVEN. The DMCNN and MOGANED results on ACE 2005 are taken from the original papers.