Distinguishing Past, On-going, and Future Events: The EventStatus Corpus

Determining whether a major societal event has already happened, is still on-going, or may occur in the future is crucial for event prediction, timeline generation, and news summarization. We introduce a new task and a new corpus, EventStatus, which contains 4500 English and Spanish articles about civil unrest events labeled as PAST, ON-GOING, or FUTURE. We show that the temporal status of these events is difficult to classify because local tense and aspect cues are often lacking, time expressions are insufficient, and the linguistic contexts have rich semantic compositionality. We explore two approaches for event status classification: (1) a feature-based SVM classifier augmented with a novel induced lexicon of future-oriented verbs, such as "threatened" and "planned", and (2) a convolutional neural net. Both types of classifiers improve event status recognition over a state-of-the-art TempEval model, and our analysis offers linguistic insights into the semantic compositionality challenges for this new task.


Introduction
When a major societal event is mentioned in the news (e.g., civil unrest, terrorism, natural disaster), it is important to understand whether the event has already happened (PAST), is currently happening (ON-GOING), or may happen in the future (FUTURE). We introduce a new task and corpus for studying the temporal/aspectual properties of major events. The EventStatus corpus consists of 4500 English and Spanish news articles about civil unrest events, such as protests, demonstrations, marches, and strikes, in which each event is annotated as PAST, ON-GOING, or FUTURE (sublabeled as PLANNED, ALERT or POSSIBLE). This task bridges event extraction research and temporal research in the tradition of TIMEBANK (Pustejovsky et al., 2003) and TempEval (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013). Previous corpora have begun this association: TIMEBANK, for example, includes temporal relations linking events with Document Creation Times (DCT). But the EventStatus task and corpus offer several new research directions.
First, major societal events are often discussed before they happen, or while they are still happening, because they have the potential to impact a large number of people. News outlets frequently report on impending natural disasters (e.g., hurricanes), anticipated disease outbreaks (e.g., Zika virus), threats of terrorism, and plans or warnings of potential civil unrest (e.g., strikes and protests). Traditional event extraction research, in contrast, has focused primarily on recognizing events that have already happened. Furthermore, the linguistic contexts of on-going and future events involve complex compositionality, and features like explicit time expressions are less useful. Our results demonstrate that a state-of-the-art TempEval system has difficulty identifying on-going and future events, mislabeling examples like this one:

(1) The metro workers' strike in Bucharest has entered the fifth day.

Second, we intentionally created the EventStatus corpus to concentrate on one particular event frame (class of events): civil unrest. In contrast, previous temporally annotated corpora cover a wide variety of events. Focusing on one frame (semantic depth instead of breadth) makes this corpus analogous to domain-specific event extraction data sets, and therefore appropriate for evaluating rich tasks like event extraction and temporal question answering, which require more knowledge about event frames and schemata than might be represented in large broad corpora like TIMEBANK (UzZaman et al., 2012; Llorens et al., 2015).
Third, the EventStatus corpus focuses on specific instances of high-level events, in contrast to the low-level and often non-specific or generic events that dominate other temporal datasets [1]. Mentions of specific events are much more likely to be realized in non-finite form (as nouns or infinitives, such as "the strike" or "to protest") than randomly selected event keywords. In breadth-based corpora like the EventCorefBank (ECB) corpus (Bejan and Harabagiu, 2008), 34% of the events have non-finite realization; in TIMEBANK, 45% of the events have non-finite realization. By contrast, in a frame-based corpus like ACE2005 (ACE, 2005), 59% of the events have non-finite forms. In the EventStatus corpus, 80% of the events have non-finite forms. Whether this is due to differences in labeling or to intrinsic properties of these events, the result is that they are much harder to label because tense and aspect are less available than for events realized as finite verbs.
Fourth, the EventStatus data set is multilingual: we collected data from both English and Spanish texts, allowing us to compare events representing the same event frame across two languages that are known to differ in their typological properties for describing events (Talmy, 1985).
Using the new EventStatus corpus, we investigate two approaches for recognizing the temporal status of events. We create an SVM classifier that incorporates features drawn from prior TempEval work (Bethard, 2013; Chambers et al., 2014; Llorens et al., 2010) as well as a new automatically induced lexicon of 411 English and 348 Spanish "future-oriented" matrix verbs: verbs like "threaten" and "fear" whose complement clause or nominal direct object argument is likely to describe a future event. We show that the SVM outperforms a state-of-the-art TempEval system and that the induced lexicon further improves performance for both English and Spanish. We also introduce a Convolutional Neural Network (CNN) to detect the temporal status of events. Our analysis shows that it successfully models semantic compositionality for some challenging temporal contexts. The CNN model again improves performance in both English and Spanish, providing strong initial results for this new task and corpus.

[1] For example, in TIMEBANK almost half the annotated events (3720 of 7935) are hypothetical or generic, i.e., PERCEPTION, REPORTING, ASPECTUAL, I_ACTION, STATE or I_STATE rather than the specific OCCURRENCE.

The EventStatus Corpus
For major societal events, it can be very important to know whether the event has ended or if it is still in progress (e.g., are people still rioting in the streets?). And sometimes events are anticipated before they actually happen, such as labor strikes, marches and parades, social demonstrations, political events (e.g., debates and elections), and acts of war. The EventStatus corpus represents the temporal status of an event as one of five categories:

Past: An event that has started and has ended. There should be no reason to believe that it may still be in progress.

On-going: An event that has started and is still in progress or likely to resume in the immediate future. There should be no reason to believe that it has ended.

Future Planned: An event that has not yet started, but a person or group has planned for or explicitly committed to an instance of the event in the future. There should be near certainty it will happen.

Future Alert: An event that has not yet started, but a person or group has been threatening, warning, or advocating for a future instance of the event.

Future Possible: An event that has not yet started, but the context suggests that its occurrence is a live possibility (e.g., it is anticipated, feared, hinted at, or is mentioned conditionally).
The three subtypes of future events are important in marking not just temporal status but also what we might call predictive status. Events very likely to occur are distinguished from events whose occurrence depends on other contingencies (Future Planned vs. Alert/Possible). Warnings or mentions of a potential event by a likely actor are further distinguished from events whose occurrence is more open-ended (Future Alert vs. Possible). The status of future events is due not just to lexical semantics or local context but also to other qualifiers in the sentence (e.g., "may"), the larger discourse context, and world knowledge. The annotation guidelines are formulated with that in mind. The categories for future events are not incompatible with one another but are meant to be informationally ordered (e.g., "future alert" implies "future possible"). Annotators are instructed to go for the strongest implication supported by the overall context. Table 1 presents examples of each category in news reports about civil unrest events, with the event keywords in italics.

Table 1: Example sentences for each event status category (event keywords in italics).

Past [EN]: Today's demonstration ended without violence. / An estimated 2,000 people protested against the government in Peru.

On-going [EN]: Negotiations continue with no end in sight for the 2 week old strike. / Yesterday's rallies have caused police to fear more today.

Future Planned [EN]: 77 percent of German steelworkers voted to strike to raise their wages. / Peace groups have already started organizing mass protests in Sydney.

Future Alert [EN]: Farmers have threatened to hold demonstrations on Monday. / Nurses are warning they intend to walkout if conditions don't improve.

Future Possible [EN]: Residents fear riots if the policeman who killed the boy is acquitted. / The military is preparing for possible protests at the G8 summit. [SP]: Policía Militar analiza la posibilidad de decretar una huelga nacional. ("The Military Police is considering the possibility of declaring a national strike.")
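The "strongest implication" guideline amounts to an ordering over the future subtypes. A minimal sketch (the label names and numeric encoding are ours, not part of the corpus release) of resolving the subtypes an annotator finds supported by context:

```python
# Hypothetical encoding of the informational ordering among future subtypes:
# a planned event is the strongest claim, a bare possibility the weakest.
STRENGTH = {"FUTURE_POSSIBLE": 0, "FUTURE_ALERT": 1, "FUTURE_PLANNED": 2}

def strongest_future_label(supported):
    """Return the strongest future subtype among those supported by context."""
    return max(supported, key=STRENGTH.__getitem__)
```

For instance, a threat is both a live possibility and an explicit warning, so a context supporting both FUTURE_POSSIBLE and FUTURE_ALERT resolves to FUTURE_ALERT.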

EventStatus Annotations
The EventStatus dataset consists of English and Spanish news articles. We manually identified 6 English and 13 Spanish words and phrases associated with civil unrest events, and added their morphological variants. We then randomly selected 2954 and 1491 news stories from the English Gigaword 5th Ed. (Parker et al., 2011) and Spanish Gigaword 3rd Ed. (Mendonca et al., 2011) corpora, respectively, that contain at least one civil unrest phrase. Events of a specific type are very sparsely distributed in a large corpus like Gigaword, so we used keyword matching just as a first pass to identify candidate event mentions. Because many keyword instances don't refer to a specific event, primarily due to lexical ambiguity and generic descriptions (e.g., "Protests are often facilitated by ..."), we used a two-stage annotation process. First, we extracted sentences containing at least one key phrase, and had three human annotators judge whether the sentence describes a specific civil unrest event. Next, for each sentence that mentions a specific event, the annotators assigned an event status to every civil unrest key phrase in that sentence. In both annotation phases, we asked the annotators to consider the context of the entire article.
In the first annotation phase, the average pairwise inter-annotator agreement (Cohen's κ) among the annotators was κ = 0.84 on the English data and 0.70 on the Spanish data. We then assigned the majority label among the three annotators to each sentence. In the English data, of the 5085 sentences with at least one key phrase, 2492 (49%) were judged to be about a specific civil unrest event. In the Spanish data, 3249 sentences contained at least one key phrase and 2466 (76%) described a specific event.
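The agreement figures above are pairwise Cohen's κ, which corrects raw agreement for the agreement expected by chance. A self-contained sketch of the computation for two annotators:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences a and b."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n  # p_o: raw agreement
    labels = set(a) | set(b)
    # p_e: chance agreement from each annotator's marginal label distribution
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:  # degenerate case: both annotators use a single label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Perfect agreement yields κ = 1.0, and agreement no better than chance yields κ = 0.0, so values like 0.84 and 0.70 indicate substantial agreement.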
In the second phase, the annotators assigned one of the five temporal status categories listed in Section 2 to each event keyword in a relevant sentence. In addition, we provided a Not Event label. Occasionally, a single instance of a keyword can refer to multiple events (e.g., "Both last week's and today's protests..."), so we permitted multiple labels to be assigned to an event phrase. However, this happened for only 28 cases in English and 21 cases in Spanish.
The average pairwise inter-annotator agreement among the three human annotators for the temporal status labels was κ = 0.78 for English and κ = 0.80 for Spanish. We used the majority label among the three annotators as the gold status. In total, 2907 English and 2807 Spanish event phrases occur in the relevant sentences and were annotated. However, there were 83 English cases (≈2.9%) and 70 Spanish cases (≈2.5%) where the labels of the three annotators were all different, so we discarded these cases. Table 2 shows the final distribution of labels in the EventStatus corpus. The EventStatus corpus is available through the LDC.

Linguistic Properties of Event Mentions
Next, we investigated the linguistic properties of the event status categories, lumping together the three future subcategories. Table 3 shows the distribution of syntactic forms of the event mentions in two commonly used event datasets, ACE2005 (ACE, 2005) and EventCorefBank (Bejan and Harabagiu, 2008), and our new EventStatus corpus. In the introduction, we mentioned the high frequency of non-finite event expressions; Table 3 provides the evidence: non-finite forms (nouns and infinitives) constitute 59% of the events in ACE2005, 34% in EventCorefBank, and a very high 80% in the EventStatus dataset. The distribution is even more skewed for future events, which are 95% (English) and 96% (Spanish) realized by non-finite surface forms.

Future Oriented Verbs
We observed that many future event mentions are preceded by a set of lexical (non-aux) verbs that we call future oriented verbs, such as "threatened" in (4) and "fear" in (5). These verbs project the events in the lower clause into the future.
We harvested matrix verbs whose complement unambiguously describes a future event using two heuristics. One heuristic looks for examples with a tense conflict between the matrix verb and its complement: a matrix verb in the past tense (like "planned" below) whose complement event is an infinitive verb or deverbal noun modified by a future time expression (like "tomorrow" or "next week"), and hence in the future (e.g., "strike" below):

(6) The union planned to strike next week.

Future events are often marked by conditional clauses, so the second heuristic considers an event to be future if it is post-modified by a conditional clause (beginning with "if" or "unless"):

(7) The union threatened to strike if their appeal was rejected.

Finally, to increase precision, we only harvested a verb as future-oriented if it functioned as a matrix verb both in sentences with an embedded future time expression and in sentences with a conditional clause.
Future Oriented Verb Categories: We ran the algorithm on the English and Spanish Gigaword corpora (Parker et al., 2011; Mendonca et al., 2011), obtaining 411 English verbs and 348 Spanish verbs. To better understand the structure of the learned lexicon, we mapped each English verb to FrameNet (Baker et al., 1998); 86% (355) of the English verbs occurred in FrameNet, in 306 unique frames. We clustered these into 102 frames and grouped the Spanish verbs following the English FrameNet, identifying 67 categories. (Some learned verbs, such as "poise", "slate", "compel" and "hesitate", had a clear future orientation but didn't exist in FrameNet.)

In the next sections we propose two classifiers, an SVM classifier using standard TempEval features plus our new future-oriented lexicon, and a Convolutional Neural Net, as a pilot exploration of what features and architectures work well for the EventStatus task. For these studies we combine the Future Planned, Future Alert and Future Possible categories into a single Future event status, because we first wanted to establish how well classifiers can detect the primary temporal distinctions between Past vs. On-going vs. Future. The future subcategories are, of course, relatively small, and we expect that the most effective approach will be to design a classifier that sits on top of the primary classifier to further subcategorize the Future instances. We leave the task of subcategorizing future events for later work.

SVM Event Status Model
Our first classifier is a linear SVM classifier [10]. We trained three binary classifiers (one per class) using one-vs.-rest, and label an event mention with the class that assigned the highest score to the mention. We used features inspired by prior TempEval work and by the previous analysis, including words, tense and aspect features, time expressions, and the new future-oriented verb lexicon. We also experimented with other features used by TempEval systems (including bigrams, POS tags, and two-hop dependency features), but they did not improve performance [11].

Bag-Of-Words Features: For bag-of-words unigram features we used a window size of 7 (7 words to the left and 7 to the right) for the English data and 6 for the Spanish data; these sizes were optimized on the tuning sets.
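The windowed bag-of-words extraction is straightforward; a minimal sketch (the feature-name convention is ours):

```python
def window_bow(tokens, event_idx, window=7):
    """Binary unigram features from `window` tokens on each side of the event."""
    left = tokens[max(0, event_idx - window):event_idx]
    right = tokens[event_idx + 1:event_idx + 1 + window]
    return {"bow=" + w.lower(): 1 for w in left + right}
```

The event keyword itself is excluded; only its context enters the feature vector, since the keyword (e.g., "strike") carries no temporal information on its own.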
Tense, Aspect and Time Expressions: Because these features are known to be the most important for relating events to document creation time (Bethard, 2013; Llorens et al., 2010), we used TIPSem (Llorens et al., 2010) to generate the tense and aspect of events and to find time expressions in both languages. TIPSem infers the tense and aspect of nominal and infinitival event mentions using heuristics, without relying on syntactic dependencies. For the English data set, we also generated syntactic dependencies using Stanford CoreNLP (de Marneffe et al., 2006) and applied several rules to create additional tense and aspect features based on the governing words of event mentions [12]. Time indication features are created by comparing the document creation time to time expressions linked to an event mention by TIPSem. If TIPSem detects no linked time expression for an event mention, we take the nearest time expression in the same sentence.
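A minimal sketch of the time-indication feature, assuming the time expression linked to the mention has already been normalized to a calendar date (TIPSem's actual normalization and linking are richer than this):

```python
from datetime import date

def time_indication(dct, linked_time):
    """Compare an (assumed already-normalized) time expression to the DCT."""
    if linked_time is None:
        return "time=none"           # no linked or nearby time expression
    if linked_time < dct:
        return "time=before_dct"
    if linked_time > dct:
        return "time=after_dct"
    return "time=same_as_dct"
```

The resulting categorical feature gives the classifier a direct, if sparse, signal for the Past/Future distinction whenever a time expression is present.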
Governing Words: Governing words have been useful in prior work. Our version of the feature pairs the governing word of an event mention with the dependency relation between them. We used Stanford CoreNLP (de Marneffe et al., 2006) to generate dependencies for the English data. For the Spanish data, we used Stanford CoreNLP to generate part-of-speech tags and then applied MaltParser (Nivre et al., 2004) to generate dependencies.

[10] Trained using LIBSVM (Chang and Lin, 2011) with linear kernels (polynomial kernels yielded worse performance).
[11] Previous TempEval work reported that those additional features were useful when computing temporal relations between two events, but not when relating an event to the Document Creation Time, for which tense, aspect, and time expression features were the most useful (Llorens et al., 2010; Bethard, 2013).
[12] We did not imitate this procedure for Spanish because the quality of our generated Spanish dependencies is poor.

Convolutional Neural Network Model
Convolutional neural networks (CNNs) have been shown to be effective in modeling natural language semantics (Collobert et al., 2011). We were especially keen to find out whether the convolution operations of CNNs can model the semantic compositionality needed to detect temporal-aspectual status. For our experiments, we trained a simple CNN with one convolution layer followed by one max pooling layer (Kim, 2014; Collobert et al., 2011). The convolution layer has 300 hidden units. In each unit, the same affine transformation is applied to every 5 consecutive words (a filter instance) in the input sequence of words; a different affine transformation is applied in each hidden unit. After each affine transformation, a Rectified Linear Unit (ReLU) (Nair and Hinton, 2010) non-linearity is applied. For each hidden unit, the max pooling layer selects the maximum value from the pool of real values generated by each filter instance.
After the max pooling layer, a softmax classifier predicts probabilities for each of the three classes, Past, On-going and Future. To alleviate overfitting of the CNN model, we applied dropout (Hinton et al., 2012) to the convolution layer and the following pooling layer with a keep probability of 0.5.
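The architecture is small enough to sketch end to end. The pure-Python forward pass below (toy dimensions, untrained weights, no dropout) mirrors the description: each hidden unit applies its own affine transformation to every 5-word window, followed by a ReLU and max pooling, and a softmax produces the class distribution:

```python
import math

def conv_max_pool(embeddings, filters, biases, width=5):
    """One convolution layer + max pooling: each hidden unit applies its own
    affine transformation to every `width`-word window (concatenated word
    vectors), then a ReLU, then keeps the maximum value over all windows."""
    assert len(embeddings) >= width  # need at least one full window
    pooled = []
    for W, b in zip(filters, biases):          # one (W, b) per hidden unit
        outs = []
        for i in range(len(embeddings) - width + 1):
            window = [x for vec in embeddings[i:i + width] for x in vec]
            outs.append(max(0.0, sum(w * x for w, x in zip(W, window)) + b))
        pooled.append(max(outs))               # max pooling over windows
    return pooled

def softmax(scores):
    """Numerically stable softmax over class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In the real model there are 300 hidden units over word embeddings, and the pooled vector feeds a learned softmax layer over Past, On-going and Future; here the filter weights are arbitrary illustrative values.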

Evaluations
For all subsequent evaluations, we use gold event mentions. We randomly sampled around 20% of the annotated documents as the parameter tuning set and used the rest as the test set. Rather than training once on a distinct training set, all our experimental results are based on 10-fold cross validation on the test set (1191 Spanish documents, 2364 English documents; see Table 5 for the distribution of event mentions).

Comparing with a TempEval System
We begin with a baseline: applying a TempEval system to classify each event. Most of our features are already drawn from TempEval work, but our goal was to see whether an off-the-shelf system could be directly applied to our task. We chose TIPSem (Llorens et al., 2010), a CRF system trained on TIMEBANK that uses linguistic features, has achieved top performance in TempEval competitions for both English and Spanish (Verhagen et al., 2010), and can compute the relation of each event with the Document Creation Time. We applied TIPSem to our test set, mapping the DCT relations to our three event status classes.
Row 1 of Tables 6 and 7 shows the TIPSem results. The columns show results for each category separately, as well as macro-average and micro-average results across the three categories. Each cell shows the Recall/Precision/F-score numbers. Since TIPSem linked relatively few event mentions to the DCT, we next leveraged the transitivity of temporal relations (UzZaman et al., 2012; Llorens et al., 2015), linking an event to the DCT if the temporal relation between another event in the same sentence and the DCT is transferable. For instance, if event A is AFTER its DCT, and event B is AFTER event A, then event B is also AFTER the DCT. Row 2 shows the results of TIPSem with temporal transitivity.
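The transitivity inference can be sketched as a simple fixed-point propagation over same-sentence event-event links (the two-relation representation is ours; TIPSem's relation inventory is larger):

```python
def propagate_dct_relations(dct_rels, pair_rels):
    """dct_rels: {event: "BEFORE" | "AFTER"} with respect to the DCT.
    pair_rels: (x, rel, y) triples meaning "event x is REL event y".
    If y is AFTER the DCT and x is AFTER y, then x is AFTER the DCT
    (and symmetrically for BEFORE); iterate until no new link is inferred."""
    inferred = dict(dct_rels)
    changed = True
    while changed:
        changed = False
        for x, rel, y in pair_rels:
            if x not in inferred and inferred.get(y) == rel:
                inferred[x] = rel
                changed = True
    return inferred
```

Chains of links propagate: if B is AFTER A and C is AFTER B, anchoring A to the DCT anchors both B and C.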
Even augmented by transitivity, TIPSem fails to detect many On-going (OG) and Future (FU) events; most of the mislabeled OG and FU events were nominal, as the confusion matrices show.

SVM Results: Next, we compare TIPSem's results with our SVM classifier. One issue is that TIPSem identifies only 72% and 78% of the gold event mentions, for English and Spanish respectively. For a fair comparison, we applied the SVM to only the event mentions that TIPSem recognized. Row 3 shows these results for the SVM classifier using its full feature set. The SVM outperforms TIPSem on all three categories, for both languages, with the largest improvements on Future events.

Next, we ran ablation experiments with the SVM to evaluate the impact of different subsets of its features. For these experiments, we applied the SVM to all gold event mentions, so Rows 1-3 of Tables 6 and 7 report on fewer event mentions than Rows 4-8. Row 4 shows results using only bag-of-words features. Row 5 shows results when additionally including the tense, aspect, and time features provided by TIPSem (Llorens et al., 2010). Unsurprisingly, in both languages these features improve over bag-of-words features alone.
Row 6 further adds governing word features. These improve English performance, especially for On-going events. For Spanish, governing word features slightly decrease performance, likely due to the poor quality of the Spanish dependencies.
Row 7 adds the future-oriented lexicon features. For both English and Spanish, the future-oriented lexicon increased overall performance, and (as expected) especially for Future events.
CNN Results: Row 8 shows the results of the CNN models. For English and Spanish, we trained the CNN models over the same windows used to compute the bag-of-words features for the SVMs (7 words for English, 6 words for Spanish). For English, the CNN model further increased recall and precision across all three classes. The CNN improved Spanish performance on both Past and On-going events, but the SVM outperformed the CNN on Future events when the future-oriented lexicon features were included.

Analysis
To better understand whether the CNN model's strong performance was related to handling compositionality, we examined some English examples that were correctly recognized by the CNN model but mislabeled by the SVM classifier with bag-of-words features. The examples below (event mentions in italics) suggest that the CNN may be capturing the compositional impact of local cues like "possibility" or "since":

(10) Raising the possibility of a strike on New Year's Eve, the president of New York City's largest union is calling for a 30 percent raise over three years. (FU)

(11) The lockout was announced in the wake of a go-slow and partial strike by the union since July 12 after management turned down its demand. (OG)

We also conducted an error analysis by randomly sampling and then analyzing 50 of the 473 errors made by the CNN model. Many cases (26/50) are ambiguous from the sentence alone, requiring discourse information. The first example below is caused by the well-known "double access" ambiguity of the complement of a communication verb (Smith, 1978; Abusch, 1997; Giorgi, 2010).
(12) Chavez also said he discussed the strike with UN Secretary General Kofi Annan and told him the strike organizers were "terrorists." (OG)

(13) Students and teachers protest over education budget (PA)

In 9/50 cases, the contexts that imply temporal status are complex and fall outside our ±7 word range, e.g.:

(14) Protesters on Saturday also occupied two gymnastics halls near Gorleben which are to be used as accommodation for police. They were later forcibly dispersed by policemen. (PA)

The remaining 15/50 cases contain enough local cues to be solvable by humans, but both the CNN and SVM models nonetheless failed:

(15) Eastern leaders have grown weary of the protest movement led mostly by Aymara. (OG)

Related Work
Our work overlaps with two communities of tasks and corpora: the task of classifying the temporal order between event mentions and the Document Creation Time (DCT) in TempEval (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013), and the task of extracting events, associated with corpora such as ACE2005 (ACE, 2005) and the EventCorefBank (ECB) (Bejan and Harabagiu, 2008). By studying the events in a particular frame (civil unrest), but focusing on their temporal status, our work has the potential to draw these communities together. Most event extraction work (Freitag, 1998; Appelt et al., 1993; Ciravegna, 2001; Chieu and Ng, 2002; Riloff and Jones, 1999; Roth and Yih, 2001; Zelenko et al., 2003; Bunescu and Mooney, 2007) has focused on extracting event slots or frames for past events and assigning dates. The TempEval task of linking events to the DCT has not focused on events that tend to have non-finite realizations, nor has it focused on subtypes of future events. Our work, including the corpus and the future-oriented verb lexicon, has the potential to benefit related tasks like generating event timelines from news articles (Allan et al., 2000; Yan et al., 2011) or social media sources (Li and Cardie, 2014; Ritter et al., 2012), and exploring the psychological implications of future-oriented language (Nie et al., 2015; Schwartz et al., 2015).

Conclusions
We have proposed a new task of recognizing the past, on-going, or future temporal status of major events, introducing a new resource for studying events in two languages. Besides its importance for studying time and aspectuality, the EventStatus dataset offers a rich resource for any future investigation of information extraction from major societal events. The strong performance of the convolutional net system suggests the power of latent representations to model temporal compositionality, and points to extensions of our work using deeper and more powerful networks.
Finally, our investigation of the role of context and semantic composition in conveying temporal information also has implications for our understanding of temporality and aspectuality and their linguistic expression. Many of the errors made by our CNN system are complex ambiguities, like the double access readings, that cannot be solved without information from the wider discourse context. Our work can thus also be seen as a call for the further use of rich discourse information in the computational study of temporal processing.