A Comparison of Event Representations in DEFT

This paper discusses and compares event representations across a variety of types of event annotation: Rich Entities, Relations, and Events (Rich ERE), Light Entities, Relations, and Events (Light ERE), Event Nugget (EN), Event Argument Extraction (EAE), Richer Event Descriptions (RED), and Event-Event Relations (EER). We compare the event representations themselves as well as the data annotated according to each one. We also describe an event annotation experiment in which the same set of sample data was annotated under all of these representations, so that actual annotation can be compared across approaches as directly as possible. We walk through a brief example to illustrate the various annotation approaches and to show the intersections among the annotated data sets.


Introduction
This paper will discuss and compare event representations across the various types of event annotation that are part of the Deep Exploration and Filtering of Text (DEFT) program: Rich Entities, Relations, and Events (Rich ERE), Light Entities, Relations, and Events (Light ERE), Event Nugget (EN), Event Argument Extraction (EAE), Richer Event Descriptions (RED), and Event-Event Relations (EER). The DEFT program seeks to improve state-of-the-art capabilities in automated deep natural language processing, with a particular focus on technologies dealing with inference, causal relationships, and anomaly detection across several languages (DARPA, 2012). The processing of events and event-event relations underlies this focus. The annotation of events and event-event relations is a crucial part of supporting work on these technologies, and a variety of approaches are currently underway.
This paper presents a comparison of event representations, the data annotated, and possible future plans for event annotation. An event annotation experiment is also discussed, in which the same set of sample data was annotated in each of the above representations, so that actual annotation can be compared across all of these approaches as directly as possible. The paper walks through a brief example to illustrate the various annotation approaches and to show the intersections among the annotated data sets.

Entities, Relations, and Events (ERE)
ERE was developed as an annotation task that would support multiple research directions and evaluations in the DEFT program, and that would provide a useful foundation for more specialized annotation tasks like inference and anomaly. The ERE tasks include the annotation of entities, relations, and events and their attributes, according to a specific taxonomy, as was also done in Automatic Content Extraction (ACE) (LDC, 2005; Walker et al., 2006). Light ERE was designed as a lighter-weight version of ACE and a simple approach to entity, relation, and event annotation, with the goal of making annotation easier and more consistent. Compared to ACE, Light ERE captures a reduced inventory of entity and relation types, with fewer attributes; for example, only specific entities and actual relations are taggable, and entity subtypes are not labeled. The event ontology of Light ERE is similar to that of ACE, with slight modification and reduction, and there is strict coreference of events within documents (Aguilar et al., 2014). As in ACE, the annotation of each event mention includes the identification of a trigger and the labeling of the event type, subtype, and participating event argument entities and time expressions. As a simplification of ACE, only attested actual events are annotated (no irrealis events or arguments).
Rich ERE annotation expands on both the inventories and taggability of Light ERE. Rich ERE Entity annotation adds nonspecific entities and nominal head marking, and also adds a distinction between Location and Facility entity types. Rich ERE Relation annotation doubles the Light ERE ontology to twenty relation subtypes, and also adds future, hypothetical, and conditional relations. A new category of argument fillers was added for Rich ERE, to allow arguments that are not taggable as entities to be used as fillers for specific relation and event subtypes. For each event mention, Rich ERE labels the event type and subtype, its realis attribute, any of its arguments or participants that are present, and a required "trigger" string in the text. Rich ERE Event annotation includes increased taggability in several areas, compared to Light ERE or ACE event annotation: a slightly expanded event ontology; the addition of generic and other (irrealis) event mentions; the addition of event argument fillers that are otherwise not captured as entities, such as weapon, vehicle, and money; the addition of argumentless triggers for event mentions; additional attributes for contact and transaction events; double tagging of event mentions for multiple types/subtypes; and multiple tagging of event mentions for certain types of coordination.
In Rich ERE, the concept of Event Hopper was also introduced as a more inclusive, less strict notion of event coreference than that used in Light ERE or ACE. Event hoppers contain mentions of events that are intuitively coreferential to the annotator even if they do not meet the earlier strict event identity requirement. They therefore group events according to a more inclusive coreference specification, which allows a wider range of event mentions to be coreferential. Event mentions can be placed into the same event hopper even if they differ in temporal or trigger granularity, if their arguments are non-coreferential or conflicting, or if their realis mood differs, as long as they refer to the same event with the same type and subtype. For example, two mentions may carry different realis labels (Other vs. Actual); if both refer to the same trip to Paris in the document context, they still belong to the same event hopper.
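The hopper criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `referent` field is a hypothetical stand-in for the annotator's judgment about which real-world event a mention refers to, and the triggers and type names in the sample data are invented for the example rather than taken from the annotated corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMention:
    trigger: str
    event_type: str
    subtype: str
    realis: str    # "Actual", "Generic", or "Other"
    referent: str  # hypothetical ID standing in for the annotator's identity judgment


def group_into_hoppers(mentions):
    """Group mentions by (referent, type, subtype); realis is deliberately ignored."""
    hoppers = defaultdict(list)
    for m in mentions:
        hoppers[(m.referent, m.event_type, m.subtype)].append(m)
    return list(hoppers.values())


# Invented sample mentions: two refer to the same trip, with different realis labels.
mentions = [
    EventMention("trip", "Movement", "TransportPerson", "Other", "paris_trip"),
    EventMention("went", "Movement", "TransportPerson", "Actual", "paris_trip"),
    EventMention("attack", "Conflict", "Attack", "Actual", "attack_1"),
]
hoppers = group_into_hoppers(mentions)
for hopper in hoppers:
    print([(m.trigger, m.realis) for m in hopper])
```

Because realis is not part of the grouping key, the Other and Actual mentions of the trip end up in one hopper, which is the behavior the relaxed coreference specification is meant to allow.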
The development of Rich ERE is intended to lay the groundwork for upcoming expansion into the realm of event-event relations, as well as cross-document and even cross-lingual event representation.

Event Argument Linking (EAL)
Event Argument Extraction (EAE) was developed as a track within NIST's 2014 TAC Knowledge Base Population (KBP) (http://www.nist.gov/tac/) evaluation (Freedman et al., 2014). Following the paradigm of Slot Filling and Cold Start, two other KBP evaluation tracks, EAE in 2014 was evaluated by assessment over a pool of participant submissions.
A KBP EAE submission consisted of a table of event arguments. For each argument, systems returned the event type/subtype, the argument's role in the event, a canonical mention of the argument entity (e.g., the most informative name string found for the entity), textual justification of the extraction, a realis label (Actual, Generic, Other), and a confidence score. In the 2014 evaluation, the final pool of responses for scoring included a human-produced "manual run" developed by annotators along with up to 5 runs per participating team.
During assessment, annotators judged the correctness of the provided event type, the argument's assigned role in the event, the (possibly normalized) mention string of the argument entity, and the mention string for the argument entity from the larger provenance connecting it to the event. For correct responses, assessors also provided a realis label (Actual, Generic, Other), a mention type label (Name, Nominal, Other), and created equivalence class clusters. Assessment was performed on responses taken from approximately 250 source documents in 2014, 50 of which were dually assessed.
For dually assessed documents, agreement and kappa were calculated for two conditions: (a) over all appropriate tags, and (b) collapsing CORRECT and INEXACT (this reflects what is done for the official score). The agreement/kappa numbers only reflect cases where both annotators made an assessment. In some cases (e.g., when assessors disagreed on the presence of an event type), one assessor would perform an assessment and the other would not.
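As a minimal sketch of the two conditions, the following Python computes observed agreement and kappa for a pair of assessors, first over exact labels and then with CORRECT and INEXACT collapsed. It assumes Cohen's kappa, which the text does not specify, and the label WRONG and the sample judgments are hypothetical placeholders.

```python
from collections import Counter


def agreement_and_kappa(labels_a, labels_b):
    """Observed agreement and Cohen's kappa for two assessors' paired labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each assessor's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both assessors used one identical label throughout
    return observed, (observed - expected) / (1 - expected)


def collapse(label):
    """Condition (b): treat INEXACT as CORRECT."""
    return "CORRECT" if label == "INEXACT" else label


# Hypothetical judgments; only items assessed by both assessors are included.
assessor_1 = ["CORRECT", "INEXACT", "WRONG", "CORRECT"]
assessor_2 = ["CORRECT", "CORRECT", "WRONG", "INEXACT"]

exact = agreement_and_kappa(assessor_1, assessor_2)
collapsed = agreement_and_kappa(
    [collapse(x) for x in assessor_1], [collapse(x) for x in assessor_2]
)
print("exact agreement/kappa:", exact)
print("collapsed agreement/kappa:", collapsed)
```

On this toy input the two assessors disagree only on CORRECT versus INEXACT, so collapsing the two labels raises both agreement and kappa, which is the effect condition (b) is designed to isolate.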
Numbers before the '/' are over all labels (requiring exact match); numbers after the '/' collapse CORRECT and INEXACT. In official scoring, an answer is added to the GS if Realis matches and all of the following are CORRECT/INEXACT: AET, AER, BF, CAS. CAS Type is used for analyzing output, but is not a part of the scoring process.
The EAE task (and its assessed results) differs in three important ways from Rich ERE annotation. (1) The EAE task is assessed at the level of 'entity' and 'document-level-event' and not at the level of the event mention. As such, a system need only find an argument one time, even when an entity's participation in some event is made explicit in several mentions throughout a document. (2) The EAE task treats as correct a broader set of inferences about arguments, for example inferring the date/location of an event through general reasoning over the document or inferring participation in an event through group membership. (3) The EAE task does not assess a sub-sentence linguistic trigger/justification of the event. Instead, the event type (ET) assessment in the EAE task judges whether or not a sentence-length unit justifies the event given a reasonable reader's interpretation.
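The official scoring rule stated above can be written as a small predicate. The sketch below is an illustration, not the evaluation's actual scorer: the dictionary layout, the function name, and the label WRONG are assumptions, while the field names AET, AER, BF, and CAS are used exactly as they appear in the text.

```python
ACCEPTABLE = {"CORRECT", "INEXACT"}
SCORED_FIELDS = ("AET", "AER", "BF", "CAS")  # CAS Type is not scored, so it is omitted


def enters_gold_standard(assessment, system_realis):
    """True if every scored field is CORRECT/INEXACT and the realis labels match.

    `assessment` is a hypothetical dict holding the assessor's judgment for each
    scored field plus the assessor's own realis label under the key "realis".
    """
    fields_ok = all(assessment.get(f) in ACCEPTABLE for f in SCORED_FIELDS)
    return fields_ok and assessment.get("realis") == system_realis


good = {"AET": "CORRECT", "AER": "INEXACT", "BF": "CORRECT",
        "CAS": "CORRECT", "realis": "Actual"}
bad_role = dict(good, AER="WRONG")  # "WRONG" is a placeholder for any other label

print(enters_gold_standard(good, "Actual"))      # realis matches, all fields acceptable
print(enters_gold_standard(good, "Generic"))     # realis mismatch
print(enters_gold_standard(bad_role, "Actual"))  # one scored field is not acceptable
```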
In 2015, EAE was extended to require systems to group together arguments participating in the same event. In the updated version of the task, named Event Argument Linking (EAL) and conducted as part of the 2015 TAC KBP evaluations, event arguments were extracted following the same guidelines as EAE, but then had to be grouped together into Event Hoppers, as defined by the Rich ERE task. Performance was measured by comparing system-developed argument clusters against those created by annotators following assessment, which included all responses provided by systems for each document.
To support the 2015 linking task, correct, non-generic assessments from the system-submitted output were grouped into argument sets of hopper-sized granularity. This annotation was performed on 50 sample documents as development data for the 2015 EAL evaluation and on 81 documents in the 2015 EAL test set. The development data overlaps with data for which Rich ERE annotation was performed. The overlapping data set could be used to explore how the differences in annotation procedure lead to differences in decisions about event granularity.

Event Nugget (EN)
An Event Nugget is a tuple of an event trigger, classification of event type and subtype, and realis attribute. It is similar to an event mention in ERE, but arguments are not labelled. EN annotation in 2014 focused on event nuggets (expanded triggers) only, and followed the same taxonomy of 33 event types and subtypes as Light ERE. However, instead of tagging minimal extent as the trigger, EN allowed multi-word event nuggets (Mitamura et al., 2015). Multi-word event nuggets can be either continuous or discontinuous, and are based on the goal of marking the maximal extent of a semantically meaningful unit to express the event in a sentence. EN also added a realis attribute for each event mention. The realis attribute labels each event as Actual, Generic, or Other. TAC KBP 2014 conducted a pilot evaluation on Event Nugget Detection (END), in which systems were required to detect event nugget tuples, consisting of an event trigger, the type and subtype classification, and the realis attribute.
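To make the nugget tuple concrete, here is a minimal Python sketch of one possible in-memory representation that supports both continuous and discontinuous multi-word nuggets. The class and helper names, the use of character offsets, and the example sentence are all assumptions made for illustration; they are not the EN release format.

```python
from dataclasses import dataclass
from typing import Tuple

REALIS_VALUES = ("Actual", "Generic", "Other")


@dataclass(frozen=True)
class EventNugget:
    spans: Tuple[Tuple[int, int], ...]  # (start, end) character offsets, end exclusive
    event_type: str
    subtype: str
    realis: str

    def __post_init__(self):
        if not self.spans:
            raise ValueError("a nugget needs at least one span")
        if self.realis not in REALIS_VALUES:
            raise ValueError(f"unknown realis label: {self.realis}")

    @property
    def is_discontinuous(self):
        return len(self.spans) > 1

    def text(self, sentence):
        """Join the covered pieces of the sentence, in span order."""
        return " ".join(sentence[start:end] for start, end in self.spans)


def span_of(sentence, phrase):
    """Offsets of the first occurrence of `phrase` (hypothetical helper)."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


# Invented sentence with a discontinuous nugget: "set ... on fire".
sentence = "They set the building on fire."
nugget = EventNugget(
    spans=(span_of(sentence, "set"), span_of(sentence, "on fire")),
    event_type="Conflict",
    subtype="Attack",
    realis="Actual",
)
print(nugget.text(sentence), nugget.is_discontinuous)
```

Storing a tuple of spans rather than a single span is what lets one structure cover the 2014 maximal-extent nuggets as well as single-span triggers.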
In 2015, TAC KBP ran an open evaluation on EN that was expanded to three evaluation tasks: Event Nugget Detection, Event Nugget Detection and Coreference, and Event Nugget Coreference. Full Event Nugget Coreference is identified when two or more Event Nuggets refer to the same event. EN annotation in 2015 followed the Rich ERE event taxonomy, which added 5 event types and subtypes to make a total of 38 event types and subtypes, and also followed the Rich ERE guidelines on trigger extents, which adopted the minimal extent rule and disallowed discontinuous event triggers (Song et al., 2016). Annotation of Event Nugget Coreference adopted the concept of Event Hopper as in Rich ERE.

Event-Event Relations (EER)
EER annotation focuses on relations between events in the ERE/ACE taxonomy, both within document and cross-document (Hong et al., 2016). Our general goal is to construct event-centric knowledge networks, where each node is an event and the edges effectively capture the relations between any two events. EER includes five main types of event relations (Inheritance, Expansion, Contingency, Comparison, and Temporality) along with 21 sense-based subtypes (or relation senses), as shown in Table 1.
Events involved in a relation play certain roles. For example, an Attack event and an Injure event in a Contingency_Causality relation play the Cause and Result roles, respectively. Figure 1 shows more information about types and roles.
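An event-centric network of this kind, with role-labeled edges, might be represented as follows. This is a sketch under assumptions: the class, method names, and event identifiers are hypothetical, and only the Contingency_Causality relation with its Cause and Result roles is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class EventNetwork:
    events: dict = field(default_factory=dict)     # event_id -> event type string
    relations: list = field(default_factory=list)  # (relation name, {role: event_id})

    def add_event(self, event_id, event_type):
        self.events[event_id] = event_type

    def relate(self, relation, roles):
        """Add an edge; `roles` maps each role name to a known event ID."""
        missing = [e for e in roles.values() if e not in self.events]
        if missing:
            raise KeyError(f"unknown events: {missing}")
        self.relations.append((relation, dict(roles)))

    def relations_of(self, event_id):
        """All (relation, role) pairs in which the given event participates."""
        return [
            (relation, role)
            for relation, roles in self.relations
            for role, participant in roles.items()
            if participant == event_id
        ]


net = EventNetwork()
net.add_event("e1", "Conflict.Attack")
net.add_event("e2", "Life.Injure")
net.relate("Contingency_Causality", {"Cause": "e1", "Result": "e2"})

print(net.relations_of("e1"))
print(net.relations_of("e2"))
```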

Richer Event Descriptions (RED)
RED annotation (Ikuta et al., 2014) marks all events in a document, as well as certain relations between those events. RED combines coreference (Pradhan et al., 2007; Lee et al., 2012) and THYME Temporal Relations annotation to provide a thorough representation of entities, events and their relations. The RED schema goes beyond prior annotations of coreference or temporal relations by also annotating subevent structure, cause-effect relations and reporting relations. Guidelines for RED annotation can be found at https://github.com/timjogorman/RicherEventDescription/blob/master/guidelines.md.

Annotation Features and Data Annotated
The representation of events and the scope of annotation vary across the different annotation approaches. Table 2 compares the definition of events in each annotation. Table 3 compares how event coreference and event relations are annotated. Table 4 shows the annotated data volume for each annotation schema, completed as of 2015. Annotated data has been distributed to DEFT performers, and will be made available to the wider community as part of the Linguistic Data Consortium (LDC) catalog.

Comparison: The Event Annotation Experiment
As a result of this diversity in event annotation within DEFT, the Event Working Group, which focuses on events within DEFT, has initiated an experiment in which participating teams perform event annotation on the same set of data using all of the different annotation schemas. The goal of the experiment is to make the differences and similarities between the approaches more apparent and to facilitate comparison. The 50 documents that had been dually assessed for EAE in 2014 were chosen as the data set for the experiment, as both system output and human assessment already exist for the data. Additionally, five newswire documents from the EAE pilot data pool were identified as particularly challenging for event annotation, and so these five documents were also included in the experiment set.
The annotation for this experiment has been completed and released to the DEFT community (LDC, 2015) and will subsequently be published in LDC's catalog, making it available to the broader research community. EN annotation in this experiment adopted the 2014 annotation schema, instead of the 2015 schema. Additionally, EAE annotation in the experiment did not group arguments of the same event.
To illustrate the differences between the annotation tasks, below are the annotations in each representation for the following two sentences:

RED:
There are two events in the sentence according to RED. The first predicate, quit, is tagged as an Actual Event that occurs BEFORE Document Time. The second, replace, is tagged as an Actual Event that is expected to occur AFTER Document Time. RED also labels a causal and temporal relation between the two events, "BEFORE/PRECONDITIONS", showing that the quitting event leads to, but does not directly cause, the replacement, and a temporal CONTAINS relation linking quit to Wednesday.
Event 1: quit - BEFORE DOCTIME, Actual Modality
Event 2: replace - AFTER DOCTIME, Actual Modality
Relation 1: quit BEFORE/PRECONDITIONS replace
Relation 2: Wednesday CONTAINS quit
Although RED does not annotate the arguments of events, it is intended to be combined with semantic role annotations such as PropBank (Bonial et al., 2014) or AMR (Banarescu et al., 2013), which would provide the argument information. For this example, the quit and replace events would also be given the predicate argument structures below:
quit.01
Arg0: Media Tycoon Barry Diller
Arg1: as chief of Vivendi Universal Entertainment
ArgM-TMP: on Wednesday
replace.01
Arg2: Parent company chairman Jean-Rene Fourtou
Arg1: Diller
ArgM-MOD: will
ArgM-PRD: as chief executive of US unit

EER:
The following events are connected by Condition and Temporality relations:
Event 1 (Personnel.EndPosition): quit
Event 2 (Personnel.StartPosition): replace

A preliminary analysis of the Rich ERE and Event Argument annotations shows that EAE includes more annotated events, since both non-inferred and inferred events are targeted by EAE, while Rich ERE currently does not annotate inferred events. Initial analysis of Rich ERE and Event Nugget annotation shows that there is considerable overlap in event mentions annotated in both Rich ERE and EN 2014. The main differences are as expected due to the differences in the annotation tasks, namely (1) annotated extents (EN 2014 tags maximal extents of event nuggets to capture the complete semantic unit of the nugget, whereas Rich ERE captures the minimal extent of the event trigger), and (2) double tagging (Rich ERE allows annotation of the same event trigger more than once when the trigger instantiates more than one event type/subtype or in case of conjunction, while EN in 2014 annotated each nugget for a single event type/subtype). To better align with the EAE evaluation as well as the existing Rich ERE annotated data, EN 2015 adopted the Rich ERE guidelines so these differences between Rich ERE and EN 2014 were eliminated.
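Because EN 2014 marks maximal extents while Rich ERE marks minimal trigger extents, mentions from the two annotations will rarely have identical offsets even when they describe the same event. One simple way to compare them is to align mentions by span overlap. The sketch below is a hedged illustration of that idea, not the procedure used in the analysis reported here; the function names and the offsets are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share any character."""
    return a[0] < b[1] and b[0] < a[1]


def align(triggers, nuggets):
    """Pair each Rich ERE trigger span with every EN nugget span that overlaps it."""
    return [(t, n) for t in triggers for n in nuggets if overlaps(t, n)]


def unaligned(spans, pairs, side):
    """Spans that appear in no aligned pair; side 0 is triggers, side 1 is nuggets."""
    matched = {pair[side] for pair in pairs}
    return [s for s in spans if s not in matched]


# Hypothetical character offsets for one document.
ere_triggers = [(5, 8), (40, 46)]  # minimal extents
en_nuggets = [(5, 29), (60, 66)]   # maximal extents

pairs = align(ere_triggers, en_nuggets)
print("aligned:", pairs)
print("Rich ERE only:", unaligned(ere_triggers, pairs, 0))
print("EN only:", unaligned(en_nuggets, pairs, 1))
```

Overlap-based alignment tolerates the extent difference, and a double-tagged Rich ERE trigger simply aligns more than once to the same nugget, so the two main differences noted above do not prevent matching.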

Future Work
In future and ongoing work, several of these approaches are converging in certain respects for the purpose of conducting event evaluations.
In 2014 and 2015, different annotation datasets were produced to support the Event Argument and Event Nugget evaluations. To avoid this, and to bring the two evaluations together to address the goals of extraction and clustering of events for populating a knowledge base, both Event Argument and Event Nugget will use Event annotation in Rich ERE as evaluation data in 2016. Event Argument will be switching to a gold-standard-based evaluation for measuring performance in extracting event arguments and within-document grouping of arguments when they participate in the same event. Event Nugget will evaluate event nugget detection and coreference, as in 2015. Both Event Argument and Event Nugget evaluations will be expanded to multiple languages (English, Chinese and Spanish). For both, the set of valid event types will be reduced to 18. Additionally, a query- and assessment-based evaluation will be conducted as a task in the Event Argument evaluation to test capabilities in cross-document clustering of events. The size of the evaluation source corpus for Event Argument will increase dramatically (from 500 documents in 2015 to 90,000 documents in 2016).
Finally, the EN team is working on a possible pilot evaluation for 'event sequencing' for 2016.

Conclusion
The variety of event representations and annotations in the DEFT program seeks to tackle the issues of event annotation from multiple directions, and each type of representation brings an aspect of the larger event picture into focus. The comparisons across these representations that will be possible using the annotation from the Event Annotation Experiment will continue to provide useful information as event annotation evolves to meet the goals of the DEFT program and the research community.