SemEval-2018 Task 5: Counting Events and Participants in the Long Tail

This paper discusses SemEval-2018 Task 5: a referential quantification task of counting events and participants in local, long-tail news documents with high ambiguity. The complexity of this task challenges systems to establish the meaning, reference and identity across documents. The task consists of three subtasks and spans across three domains. We detail the design of this referential quantification task, describe the participating systems, and present additional analysis to gain deeper insight into their performance.


Introduction
We present a "referential quantification" task that requires systems to establish the meaning, reference and identity of events and participants in news articles. By "referential quantification", we mean questions concerning the number of incidents of an event type (e.g. How many killing incidents happened in 2016 in Columbus, MS?) or participants in roles (e.g. How many people were killed in 2016 in Columbus, MS?), as opposed to factoid questions for specific properties of individual events and entities (e.g. When was 2pac murdered?). The questions are given with certain constraints on the location, time, participants, and event types, which requires understanding of the meaning of words mentioning these properties (e.g. Word Sense Disambiguation), but also adequately establishing the identity (e.g. reference and coreference) across mentions. The task thus represents both an intrinsic and application-based evaluation, as systems are forced to resolve ambiguity of meaning and reference, as well as variation in reference in order to answer the questions. Figure 1 shows an overview of our quantification task. We provide the participants with a set of questions and their corresponding news documents. Systems are asked to distill event- and participant-based knowledge from the documents to answer the question. Systems submit both a numeric answer (3 events in Figure 1), and the corresponding events with their mentions found in the provided texts (e.g., the leftmost incident in Figure 1 is referred to by the coreferring mentions "killed" and "assault" found in two separate documents). Systems are evaluated on both the numeric answers as well as on the sets of coreferring mentions. Mentions are represented by tokens and offsets provided by the organizers.
The incidents and their corresponding news articles are obtained from structured databases, which greatly reduces the need for annotation and mainly requires validation instead. Given this data and using a metric-driven strategy, we created a task that further maximizes ambiguity and variation of the data in relation to the questions. This ambiguity and variation includes a substantial amount of low-frequency, local events and entities, reflecting a large variety of long-tail phenomena. As such, the task is not only highly ambiguous but also cannot be tackled by relying on the most frequent and popular (head) interpretations.
We see the following contributions of our task:
1. To the best of our knowledge, we propose the first task that is deliberately designed to address large ambiguity of meaning and reference over a high number of infrequent instances.
2. We introduce a methodology for creating large event-based tasks while avoiding a lot of annotation, since we base the task on structured data. The remaining annotation concerns targeted mentions given the structured data rather than full documents with open-ended interpretations.
3. We made all of our code to create the task available, which may stimulate others to create more tasks and datasets that tackle long-tail phenomena for other aspects of language processing, either within or outside of the SemEval competition.
4. This task provides insights into the strengths and weaknesses of semantic processing systems with respect to various long-tail phenomena. We expect that systems need to innovate by adjusting (deep) learning techniques to capture the referential complexity and knowledge sparseness, or by explicitly modeling aspects of events and entities to establish identity and reference.

Motivation & Target Communities
Expressions can have many different meanings and possibly an infinite number of references. At the same time, variation in language is also large, as we can make reference to the same things in many ways. This makes the tasks of Word Sense Disambiguation, Entity Linking, and Event and Nominal Coreference extremely hard. It also makes it very difficult to create a task that represents the problem at its full scale. Any sample of text will reduce the problem to a small set of meanings and references, but also to meanings that are popular at that time excluding many unpopular ones from the distributional long tail. Given this Zipfian distribution, a task that is challenging with respect to ambiguity, reference, and variation, and that is representative for the long tail as well, needs to fit certain constraints.
Our task directly relates to the following communities in semantic processing: 1. disambiguation and reference; 2. reading comprehension and question answering.

Disambiguation & Reference
Semantic NLP tasks are often limited in terms of the range of concepts and meanings that are covered. This is a necessary consequence of the annotation effort that is needed to create such tasks. Likewise, in earlier work, we observed that most well-known datasets for semantic tasks have an extremely low ambiguity and variation. Even in datasets that tried to increase the ambiguity and temporal diversity for the disambiguation and reference tasks, we still measured a notable bias with respect to ambiguity, variance, dominance, and time. Overall, tasks and their datasets show a strong semantic overfitting to the head of the distribution (the most popular part of the world) and are not representative of the diversity of the long tail.
Our task differs from existing ones in that: 1. we deliberately created a task with a high number of event instances per event type, many of which have similar properties, leading to high confusability; 2. we present an application-based task that requires systems to perform a combination of intrinsic tasks, such as reference, disambiguation, and spatial-temporal reasoning, that are usually tested separately in existing tasks.

Reading Comprehension & Question Answering
In several recent tasks, systems are asked to answer entity-based questions, typically by pointing to the correct segment or coreference chain in text, or by composing an answer by abstracting over multiple paragraphs/text pieces. These tasks are based on Wikipedia (SQuAD (Rajpurkar et al., 2016), WikiQA (Yang et al., 2015), QASent (Wang et al., 2007), WIKIREADING (Hewlett et al., 2016)) or on annotated individual documents (MARCO (Nguyen et al., 2016), CNN and DailyMail datasets (Hermann et al., 2015)). Weston et al. (2015) outlined 20 skill sets, such as causality, resolving time and location, and reasoning over world knowledge, that are needed to build an intelligent QA system. These have been partially captured by the datasets MCTest (Richardson et al., 2013) and QuizBowl (Iyyer et al., 2014), as well as the SemEval task on Answer Selection in Community Question Answering (Nakov et al., 2015, 2016). However, all these datasets avoid representing real-world referential ambiguity to its full extent by mainly asking questions that require knowledge about popular Wikipedia entities and/or text understanding of a single document. Unlike existing work, our task deliberately addresses the referential ambiguity of the world beyond Wikipedia, by asking questions about long-tail events described in multiple documents. By doing so, we require deep processing of text and establishing identity and reference across single documents.

Task Requirements
Our quantification task consists of questions like How many killing incidents happened in 2016 in Columbus, MS? on a dataset that maximizes confusability of meaning, reference and identity. To guide the creation of such a task, we defined five requirements that apply to the data for a single event type, e.g. killing.
Each event type should contain:
R1 Multiple event instances per event type, e.g. the killing of Joe Doe and the killing of Joe Roe.
R2 Multiple event mentions per event instance within the same document.
R3 Multiple documents with varying creation times that describe the same event.
R4 Event confusability by combining one or multiple confusion factors:
a) ambiguity of event mentions, e.g. John Doe fires a gun, and John Doe fires a worker.
b) variance of event mentions, e.g. John Doe kills Joe Roe, and John Doe murders Joe Roe.
c) time, e.g. killing A that happened in January 2013, and killing B in October 2016.
d) participants, e.g. killing A committed by John Doe, and killing B committed by Joe Roe.
e) location, e.g. killing A that happened in Columbus, MS, and killing B in Houston, TX.
R5 Representation of non-dominant events and entities, i.e. instances that receive little media coverage. Hence, the entities are not restricted to celebrities, and the events are not widely discussed ones such as general elections.

Data & Resources
In this Section, we present our data sources and an example document. We also discuss considerations of licensing and availability.

Structured data
The majority of the source texts in this task are sampled from structured databases that contain supportive news sources about gun violence incidents. While these texts already contain enough confusability with respect to the aspects defined in Section 3, we add confusion through leveraging structured data from two other domains: fire incidents and business.
As a direct consequence of using these databases and our exploitation strategy, we are able to satisfy all requirements we set in Section 3. These databases contain many event instances per event type (R1), multiple event mentions in the same document per event instance (R2), cover a wide spread of publishing times per event instance (R3), represent non-dominant events and entities (R5), and contain rich annotation of event properties that allows us to create high confusability (R4, see Section 5.3 for our methodology).
For a large portion of the information in the structured databases, we manually validated that this information could be found in the supportive news sources, and excluded the documents for which this was not the case. For the remaining documents, we performed automatic tests to filter out low-quality entries.

Gun Violence
The gun violence data is collected from the standard reports provided by the Gun Violence Archive (GVA) website. Each incident contains information about:
1. its location
2. its time
3. how many people were killed
4. how many people were injured
5. its participants. Participant information includes: (a) the role, i.e. victim or suspect (b) the name (c) the age
6. the news articles describing this incident.
Table 1 provides a more detailed overview of the information available in the GVA. To prevent systems from cheating (by using the structured data directly), the set of incidents and news articles is extended with news articles from the Signal-1M Dataset (Corney et al., 2016) and from the Web, that also stem from the gun violence domain, but are not found in the GVA.

Other domains
For the fire incidents domain, we make use of the FireRescue1 reports, which describe the following information about 417 incidents:
1. their location as a surface form
2. their reporting time
3. one free text summary describing the incidents
4. no information about participants.
Based on this information, we manually annotated the incident time and mapped the location to its representation in Wikipedia.
We further carefully selected a small number of news articles from the business domain from the Signal-1M Dataset. Since these documents were not semantically annotated with respect to event information, we manually annotated this data with the same kind of information as the other databases: incident location, time, and information on the affected participants.

Example document
For each document, we provide its title, content (tokenized), and creation time, e.g.:
Title: $70K reward in deadly shooting near N. Philadelphia school
Content: A $70,000 reward is being offered for information in a quadruple shooting near a Roman Catholic school ...
DCT: 2017-4-5

Licensing & Availability
The news documents in our task are published on a very diverse set of (commercial) websites. Due to this diversity, there is no easy mechanism to check their licenses individually. Instead, we overcome potential licensing issues by distributing the data under the Fair Use policy. During the SemEval-2018 period, but also afterwards, systems can easily test their submissions via our competition on CodaLab.

Task Design
For every incident in the task, we have fine-grained structured data with respect to its event type, location, time, and participants, and unstructured data in the form of the news sources that report on it. In this Section, we explain how we exploited this data in order to create the task. We present our three subtasks and the question template, after which we outline the question creation. Finally, we explain how we divided the data into trial and test sets and provide some statistics about the data. For detailed information about the task, e.g. about the question and answer representation, we refer to the CodaLab website of the task.

Subtasks
The task contains two event-based subtasks and one entity-based subtask.
Subtask 1 (S1): Find the single event that answers the question e.g. Which killing incident happened in Wilmington, CA in June 2014? The main challenge is not to determine how many incidents satisfy the question, but to identify the documents that describe the single answer incident.
Subtask 2 (S2): Find all events (if any) that answer the question. This subtask differs from S1 in that the system now also has to determine the number of answer incidents, which makes this subtask harder. To make it more realistic, we also include questions with zero as an answer.
Subtask 3 (S3): Find all participant-role relations that answer the question e.g. How many people were killed in Wilmington, CA with the last name Smith? The goal is to determine the number of entities that satisfy the question. The system not only needs to identify the relevant incidents, but also to reason over the participant roles.

Question Template
Questions in each subtask consist of an event type and two event properties.
Event type We consider four event types in this task described through their representation in WordNet (Fellbaum, 1998) and FrameNet (Baker et al., 1998). Each question is constrained by exactly one event type.
Table 2: Description of the event types. The meanings column lists meanings that best describe the event type. It contains both FrameNet 1.7 frames (prefixed by fn17) and WordNet 3.0 synsets (prefixed by wn30).
Event properties For each event property in our task (time, location, participants), we distinguish between three levels of granularity (see Table 1). In addition, we make a distinction between the surface form and the meaning of an event property value. For example, the surface form Wilmington can denote several meanings: the Wilmington cities in the states of California, North Carolina, and Delaware. When composing questions, for time and location we take the semantic (meaning) level, while for participants we use the surface form of their names. This is because the vast majority of the participants in our task are long tail instances which have no semantic representation in a structured knowledge base.

Question Creation
Our question creation strategy consists of three consecutive phases: question composition, generation of answer and confusion sets, and question scoring. These steps are common for both the event-based subtasks (S1 and S2) and the entity-based subtask S3.
1. Question composition We compose questions based on the template described in Section 5.2. This entails: (a) choice of a subtask; (b) choice of an event type, e.g. killing; (c) choice of two event properties (e.g. time and location) with their corresponding granularities (e.g. month and city) and concrete values (e.g. June 2014 and Wilmington, CA). This step generates a vast amount of potential questions (hundreds of thousands) in a data-driven way, i.e. we select the event type and properties per question purely based on the combinations we find in our data. Example questions are:
Which killing event happened in June 2014 in Wilmington, CA? (subtask S1)
How many killing events happened in June 2014 in Wilmington, CA? (subtask S2)
How many people were killed in June 2014 in Wilmington, CA? (subtask S3)
2. Answer and confusion sets generation For each generated question, we define a set of answer and confusion incidents with their corresponding documents. Answer incidents are the ones which entirely fit the question parameters, e.g. all killing incidents that occur in June 2014 and in the city of Wilmington, CA. Confusion incidents fit some, but not all, values of the question parameters, i.e. they differ with respect to an event type or property (e.g. all fire incidents in June 2014 in Wilmington, CA; or all killings in June 2014, but not in Wilmington, CA; or all killings in Wilmington, CA, but not in June 2014).
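The split into answer and confusion incidents can be illustrated with a minimal sketch. The Incident record, its field names, and the split_incidents helper are hypothetical simplifications introduced here for illustration and are not the task's actual data schema or code. The sketch also assumes that an incident counts as confusion when it matches at least one, but not all, of the three question parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical, simplified incident record (not the task's real schema)."""
    incident_id: str
    event_type: str  # e.g. "killing" or "fire"
    month: str       # time at month granularity, e.g. "2014-06"
    city: str        # location at city granularity, e.g. "Wilmington, CA"


def split_incidents(incidents, event_type, month, city):
    """Return (answer, confusion) incident lists for one question.

    Answer incidents match all three question parameters; confusion
    incidents match at least one parameter but not all of them.
    """
    answer, confusion = [], []
    for inc in incidents:
        matches = [
            inc.event_type == event_type,
            inc.month == month,
            inc.city == city,
        ]
        if all(matches):
            answer.append(inc)
        elif any(matches):
            confusion.append(inc)
    return answer, confusion


# Toy data mirroring the examples in the text.
incidents = [
    Incident("i1", "killing", "2014-06", "Wilmington, CA"),
    Incident("i2", "fire", "2014-06", "Wilmington, CA"),
    Incident("i3", "killing", "2014-06", "Houston, TX"),
    Incident("i4", "killing", "2013-01", "Wilmington, CA"),
    Incident("i5", "fire", "2013-01", "Houston, TX"),
]
answer, confusion = split_incidents(
    incidents, "killing", "2014-06", "Wilmington, CA"
)
print([inc.incident_id for inc in answer])
print([inc.incident_id for inc in confusion])
```

In this toy data, i1 is the only answer incident, i2 to i4 each differ from the question in exactly one parameter, and i5 shares nothing with the question and therefore lands in neither set.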

Question scoring
The generated questions with their corresponding answers and confusion are next scored with respect to several metrics that measure their complexity. The per-question scores allow us to detect and remove the "easy" ones, and keep those that:
1. have a high number of answer incidents (only applicable to S2 and S3)
2. have a high number of confusion incidents
3. have a high average number of answer and confusion documents, i.e. news sources describing the answer and the confusion incidents correspondingly
4. have a high temporal spread with respect to the publishing dates reporting on each incident from the answer and confusion incidents
5. have a high ambiguity with respect to the surface forms of an event property value in a granularity level (e.g. we would favor Wilmington, since it is a city in at least three US states in our task data).
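The five criteria above can be turned into per-question numbers in many ways. The following is only a sketch under stated assumptions: the text does not give the exact formulas, so the averaging, the day-based temporal spread, and the function and key names (question_scores, avg_spread_days, and so on) are all illustrative choices made here.

```python
from datetime import date
from statistics import mean


def question_scores(answer, confusion, meanings_by_surface_form, surface_form):
    """Compute illustrative complexity scores for one question.

    answer, confusion: lists of incidents, where each incident is given
    as the list of publishing dates of the documents that report on it.
    meanings_by_surface_form: maps a surface form to its known meanings.
    """
    incidents = answer + confusion
    docs_per_incident = [len(dates) for dates in incidents]
    # Temporal spread per incident: days between first and last report.
    spreads = [(max(dates) - min(dates)).days for dates in incidents if dates]
    return {
        "n_answer": len(answer),
        "n_confusion": len(confusion),
        "avg_docs": mean(docs_per_incident) if docs_per_incident else 0,
        "avg_spread_days": mean(spreads) if spreads else 0,
        "ambiguity": len(meanings_by_surface_form.get(surface_form, ())),
    }


# Toy question: one answer incident and two confusion incidents.
answer = [[date(2014, 6, 1), date(2014, 6, 4)]]
confusion = [
    [date(2014, 6, 10)],
    [date(2014, 6, 2), date(2014, 6, 3), date(2014, 6, 8)],
]
meanings = {"Wilmington": {"Wilmington, CA", "Wilmington, NC", "Wilmington, DE"}}
scores = question_scores(answer, confusion, meanings, "Wilmington")
print(scores)
```

A filtering step could then keep only the questions whose scores exceed chosen thresholds; the thresholds themselves are not specified in the text.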

Data Partitioning
We divided the overall task data into two partitions: trial and test data. In practice, we separated these two data partitions by reserving one year of news documents (2017) from our task for the trial data, while using all the other data as test data.
The trial data stems from the gun violence domain, whereas the test data also contains data from the fire incidents and business domain. A subset of the trial and test data has been annotated for event coreference. We made an effort to make the trial data representative for the test data with respect to the main aspects of our task: its referential complexity, high confusability, and long-tail instances. Despite the fact that the trial data contains fewer questions than the test data, Table 3 shows that it is similar to the test data with respect to the core properties, meaning that the trial data can be used as training data.

Evaluation
This Section describes the evaluation criteria in this task and the baselines we compare against.

Criteria
Evaluation is performed on three levels: incident-level, document-level, and mention-level.
The incident-level evaluation compares the numeric answer provided by the system to the gold answer for each of the questions. The comparison is done twofold: by exact matching and by Root Mean Square Error (RMSE) for difference scoring. The scores per subtask are then averaged over all questions to compute a single incident-level evaluation score.
The document-level evaluation compares the set of answer documents between the system and the gold standard, resulting in a value for the customary metrics of Precision, Recall, and F1 per question. The scores per subtask are then averaged over all questions to compute a single document-level evaluation score.
The mention-level evaluation is a cross-document event coreference evaluation. Mention-level evaluation is only done for questions with the event types killing or injuring. We apply the customary metrics to score the event coreference: BCUB (Bagga and Baldwin, 1998), BLANC (Recasens and Hovy, 2011), entity-based CEAF (CEAF E) and mention-based CEAF (CEAF M) (Luo, 2005), and MUC (Vilain et al., 1995). The final F1-score is the average of the F1-scores of the individual metrics. The set of mentions to annotate should conform to the schema defined in the task annotation guidelines.
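As a worked illustration of the incident-level and document-level criteria, the sketch below computes exact-match accuracy, RMSE, and per-question Precision, Recall, and F1. It is not the official scorer: the function names and input layout are assumptions, and the treatment of empty document sets (scored as 0) is a choice made here rather than something stated in the text.

```python
import math


def incident_scores(gold, system):
    """Exact-match accuracy and RMSE over the questions a system answered.

    gold, system: dicts mapping a question id to a numeric answer.
    """
    answered = [q for q in gold if q in system]
    if not answered:
        raise ValueError("the system answered none of the gold questions")
    correct = sum(gold[q] == system[q] for q in answered)
    squared_error = sum((gold[q] - system[q]) ** 2 for q in answered)
    accuracy = correct / len(answered)
    rmse = math.sqrt(squared_error / len(answered))
    return accuracy, rmse


def document_prf(gold_docs, system_docs):
    """Precision, Recall, and F1 between two sets of document ids."""
    gold_docs, system_docs = set(gold_docs), set(system_docs)
    overlap = len(gold_docs & system_docs)
    precision = overlap / len(system_docs) if system_docs else 0.0
    recall = overlap / len(gold_docs) if gold_docs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Three toy questions: one exact match, two misses (off by 2 and by 4).
gold = {"q1": 3, "q2": 0, "q3": 5}
system = {"q1": 3, "q2": 2, "q3": 1}
accuracy, rmse = incident_scores(gold, system)

# One toy question at document level: 2 of 4 returned documents are correct.
precision, recall, f1 = document_prf({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"})
print(accuracy, rmse, precision, recall, f1)
```

The RMSE example shows why the two incident-level scores can rank systems differently: the answer that is off by 4 contributes four times as much squared error as the one that is off by 2, while both count equally as misses for exact matching.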

Baselines
To stimulate participation in general and to stimulate approaches beyond surface form or majority class strategies, we implemented one baseline to infer incidents per subtask and one baseline for mention annotation.
Incident inference baseline This baseline uses surface forms based on the question components to find the answer documents. We only consider documents that contain the label of the event type or at least one of its WordNet synonyms. The labels of locations and participants are queried directly in the document (e.g. if the location requested is the US state of Texas, then we only consider documents that contain the surface form Texas, and similarly for participants such as John). The temporal constraint is handled differently: we only consider documents whose publishing date falls within the time requested in the question.
For subtask 1, this baseline assumes that all documents that fit the created constraints are referring to the same incident. If there is no such document, then the baseline does not answer the question (because S1 always has at least one supporting document). For subtask 2, we assume that none of the documents are coreferential. Hence, if 10 documents match the constraints, we infer that there are also 10 corresponding incidents. No baseline was implemented for subtask 3.
Mention annotation baseline We annotate mentions of events of type killing and injuring, when these surface forms or their synonyms in WordNet are found as tokens in a document. We assume that all mentions of the same event type within a document are coreferential, whereas all mentions found in different documents are not.
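The incident inference baseline described above can be sketched roughly as follows. This is a simplified reconstruction and not the organizers' implementation: the document layout, the function names, and the hand-written label list standing in for WordNet synonyms are assumptions, and plain substring matching is used for brevity even though it over-matches (for instance, "kill" also matches "skill").

```python
from datetime import date


def matching_documents(docs, event_labels, location, year, month=None):
    """Return ids of documents that satisfy all surface-form constraints.

    docs: list of dicts with 'id', 'text', and 'dct' (a datetime.date).
    event_labels: the event type label plus synonyms (hand-written here).
    """
    hits = []
    for doc in docs:
        text = doc["text"].lower()
        has_event = any(label.lower() in text for label in event_labels)
        has_location = location.lower() in text
        # The time constraint is checked on the publishing date, not the text.
        in_time = doc["dct"].year == year and (
            month is None or doc["dct"].month == month
        )
        if has_event and has_location and in_time:
            hits.append(doc["id"])
    return hits


def baseline_s1(doc_ids):
    """S1: all matching documents describe one incident; None if no match."""
    return (1, [list(doc_ids)]) if doc_ids else None


def baseline_s2(doc_ids):
    """S2: no two documents corefer, so each one is its own incident."""
    return (len(doc_ids), [[doc_id] for doc_id in doc_ids])


docs = [
    {"id": "d1", "text": "Man killed in Texas shooting", "dct": date(2016, 3, 1)},
    {"id": "d2", "text": "Texas police: suspect shot and killed", "dct": date(2016, 7, 9)},
    {"id": "d3", "text": "Fire destroys barn in Texas", "dct": date(2016, 5, 2)},
    {"id": "d4", "text": "Two killed in Ohio", "dct": date(2016, 5, 2)},
    {"id": "d5", "text": "Texas man killed", "dct": date(2015, 5, 2)},
]
hits = matching_documents(docs, ["kill", "murder", "slay"], "Texas", 2016)
print(baseline_s1(hits))
print(baseline_s2(hits))
```

The two functions make the opposite clustering assumptions explicit: for the same two matching documents, the S1 variant reports one incident and the S2 variant reports two.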

Participants
In this Section, we describe the systems that took part in SemEval-2018 task 5. We refer to the individual system papers for further information.
NewsReader (Vossen, 2018) consists of three steps: 1. the event mentions in the input documents are represented as Event-Centric Knowledge Graphs (ECKGs). 2. the ECKGs of all documents are compared to each other to decide which documents refer to the same incident, resulting in an incident-document index. 3. the constraints of each question (its event type, time, participant names, and location) are matched with the stored ECKGs, resulting in a number of incidents and source documents for each question.
NAI-SEA (Liu and Li, 2018) consists of three components: 1. extraction of basic information on time, location, and participants with regular expressions, named entity recognition, and term matching; 2. event classification with an SVM classifier; 3. document similarity by applying a classifier to detect similar documents. In terms of resources, NAI-SEA combines the training data with data on American cities, counties, and states.
Team FEUP (Abreu and Oliveira, 2018) developed an experimental system to extract entities from news articles for the sake of question answering. For this main task, the team proposed a supervised learning approach to enable the recognition of two different types of entities: Locations (e.g. Birmingham) and Participants (e.g. John List). They have also studied the use of distance-based algorithms (using Levenshtein distance and Q-grams) for the detection of documents' closeness based on entities extracted.
Team ID-DE (Mirza et al., 2018) created KOI (Knowledge of Incidents), a system that builds a knowledge graph of incidents, given news articles as input. The required steps include: 1. Document preprocessing using various semantic NLP tasks such as Word Sense Disambiguation, Named-Entity Recognition, Temporal expression recognition, and Semantic Role Labeling. 2. Incident extraction and document clustering based on the output of step 1. 3. Ontology construction to capture the knowledge model from incidents and documents which makes it possible to run SPARQL queries on the ontology to answer the questions.

[Table 4: incident-level results for subtask 2; columns: R, Team, s2 inc acc norm, s2 inc acc (% of Qs answered), s2 inc rmse.]
Table 5: For subtask 3, we report the normalized incident-level accuracy (s3 inc acc norm), the accuracy on the answered questions only (s3 inc acc), and the RMSE value (s3 inc rmse). Systems are ordered by their rank (R).
Before we report the system results, we introduce a few clarifications regarding the result tables:
1. For the incident- and document-level evaluation, we report both the performance with respect to the subset of questions answered and a normalized score, which indicates the performance on all questions of a subtask. If a submission provides answers for all questions, the normalized score will be the same as the non-normalized score.
2. Contrary to the other metrics, a lower RMSE value indicates better system performance. In addition, the RMSE scores have not been normalized since it is not reasonable to set a default value for non-answered questions.
3. The mention-level evaluation was the same across all three subtasks. For this reason, results are only reported once (see Section 8.3).
4. Teams with a member who co-organized SemEval-2018 Task 5 are marked explicitly with an asterisk in the results.
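The relation between the two reported scores in the first clarification can be made concrete with a small sketch. It assumes that an unanswered question contributes a score of 0 to the normalized value, which is one natural reading of the description rather than a documented formula; the function name mean_scores is likewise illustrative.

```python
def mean_scores(scores, all_questions):
    """Return (mean over answered questions, mean over all questions).

    scores: maps a question id to its score (accuracy as 0 or 1, or F1)
    for answered questions only. Unanswered questions count as 0 in the
    normalized mean (an assumption made for this sketch).
    """
    answered = [q for q in all_questions if q in scores]
    total = sum(scores[q] for q in answered)
    on_answered = total / len(answered) if answered else 0.0
    normalized = total / len(all_questions) if all_questions else 0.0
    return on_answered, normalized


all_questions = ["q1", "q2", "q3", "q4"]

# A system that answers only half of the questions.
partial = mean_scores({"q1": 1.0, "q2": 0.5}, all_questions)

# A system that answers every question: both scores coincide.
full = mean_scores({"q1": 1.0, "q2": 0.5, "q3": 0.0, "q4": 0.5}, all_questions)
print(partial, full)
```

Under this reading, the normalized score equals the score on answered questions multiplied by the fraction of questions answered. RMSE has no comparable default for a missing answer, in line with the second clarification.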

Incident-level evaluation
The incident-level evaluation assesses whether the system provided the right numeric answer to a question. The results of this evaluation are given in Tables 4 and 5 for subtasks 2 and 3, respectively (see footnote 13). On both subtasks, the order of the participating systems is identical, with team FEUP having the highest score. These tables also show the RMSE values, which measure the proximity between the system and the gold answer, punishing cases where the absolute difference between them is large. While for subtask 2 the system with the lowest error corresponds to the system with the highest accuracy, this is different for subtask 3: NAI-SEA, ranked third in terms of accuracy, has the lowest RMSE. This means that although their answers were not exactly correct, they were on average much closer to the correct answer than those of the other systems. This is more notable in subtask 3 since here the range of answers is larger than in subtask 2 (the maximum answer in subtask 3 is 171).
We performed additional analysis to compare the performance of systems per event type and per numeric answer class. Table 6 shows that the system FEUP is not only superior in terms of incident-level accuracy overall, but that this also holds for most of the event types, especially those corresponding to the gun violence domain. Figure 2 shows the accuracy distribution of each system per answer class. Notably, for most systems the accuracy is highest for the questions with answer 0 or 1, and gradually declines for higher answers, forming a Zipfian-like distribution. The exception here is the team ID-DE, whose accuracy is almost uniformly spread across the various answer classes.

Document-level evaluation
The intent behind document-level evaluation is to assess the ability of systems to distinguish between answer and non-answer documents. Tables 9, 10, and 11 present the F1-scores for subtasks 1, 2, and 3, respectively. Curiously, the system ranking is very different and almost opposite compared to the incident-level rankings, with the system NAI-SEA being the one with the highest F1-score. This can be explained by the multifaceted nature of this task, in which different systems may optimize for different goals.
Footnote 13: Incident-level evaluation was not performed for subtask 1, because per definition, its answer is always 1.
Next, we investigated the F1-scores of systems per event property pair. As shown in Table 7, the best-performing system consistently has the highest performance over all pairs of event properties.
Table 6: For subtask 2 (S2) and subtask 3 (S3), we report the incident-level accuracy and the number of questions (#Qs) per event type. The best result per event type for a subtask is marked in bold. 'ˆ' indicates that the accuracy is normalized for the number of answered questions, in cases where a system answered a subset of all questions.
Table 7: Document-level F1-score and number of questions (#Qs) for each subtask (S1, S2, and S3) and event property pair as given in the task questions. The best result per property pair for a subtask is marked in bold. 'ˆ' indicates that the F1-score is normalized for the number of answered questions, in cases where a system answered a subset of all questions.
(Bagga and Baldwin, 1998), BLANC (Recasens and Hovy, 2011), entity-based CEAF (CEAF E) and mention-based CEAF (CEAF M) (Luo, 2005), and MUC (Vilain et al., 1995). The individual scores are averaged in a single number (AVG), which is used to rank (R) the systems.

Conclusions
In this paper we have introduced SemEval-2018 Task 5, a referential quantification task of counting events and participants in local news articles with high ambiguity. The complexity of this task challenges systems to establish the meaning, reference, and identity across documents. SemEval-2018 Task 5 consists of two subtasks of counting events, and one subtask of counting event participants in their corresponding roles. We evaluated system performance with a set of metrics, on three levels: incident-, document-, and mention-level. We described the approaches and presented the results of four participating systems, as well as two baseline algorithms. All four teams submitted a result for all three subtasks, and two teams participated in the mention-level evaluation. We observed that the ranking of systems differs dramatically per evaluation level. Given the multifaceted nature of this task, it is not surprising that different systems optimized for different goals. Although the systems are able to retrieve many of the answer documents, the highest accuracy of counting events or participants is 30%. This suggests that further research is necessary in order to develop complete and robust models that can natively deal with the challenge of counting referential units within sparse and ambiguous textual data.
Out-of-competition participation is enabled by the CodaLab platform, where this task was hosted.