Event Time Extraction and Propagation via Graph Attention Networks

Grounding events into a precise timeline is important for natural language understanding but has received limited attention in recent work. This problem is challenging due to the inherent ambiguity of language and the requirement for information propagation over inter-related events. This paper first formulates this problem based on a 4-tuple temporal representation used in entity slot filling, which allows us to represent fuzzy time spans more conveniently. We then propose a graph attention network-based approach to propagate temporal information over document-level event graphs constructed by shared entity arguments and temporal relations. To better evaluate our approach, we present a challenging new benchmark on the ACE2005 corpus, where more than 78% of events do not have time spans mentioned explicitly in their local contexts. The proposed approach yields an absolute gain of 7.0% in match rate over contextualized embedding approaches, and 16.3% higher match rate compared to sentence-level manual event time argument annotation.


Introduction
Understanding and reasoning about time is a crucial component of comprehensive understanding of evolving situations, events, and trends, and of forecasting event abstractions over the long term. Event time extraction is also useful for many downstream Natural Language Processing (NLP) applications such as event timeline generation (Huang and Huang, 2013; Wang et al., 2015; Ge et al., 2015; Steen and Markert, 2019), temporal event tracking and prediction (Minard et al., 2015), and temporal question answering (Llorens et al., 2015; Meng et al., 2017). *Work done prior to joining Amazon. 1 The resource for this paper is available at https://github.com/wenhycs/NAACL2021-Event-Time-Extraction-and-Propagation-via-Graph-Attention-Networks.
In order to ground events into a timeline we need to determine the start time and end time of each event as precisely as possible (Reimers et al., 2016). However, the start and end time of an event is often not explicitly expressed in a document. For example, among 5,271 annotated event mentions in the Automatic Content Extraction (ACE2005) corpus,2 only 1,100 have explicit time argument annotations. To solve the temporal event grounding (TEG) problem, previous efforts focus on its subtasks such as temporal event ordering (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Do et al., 2012; Meng et al., 2017; Meng and Rumshisky, 2018; Ning et al., 2017, 2018; Han et al., 2019) and duration prediction (Pan et al., 2006, 2011; Vempala et al., 2018; Gusev et al., 2011; Vashishtha et al., 2019; Zhou et al., 2019). In this paper we aim to solve TEG directly using the following novel approaches.
To capture fuzzy time spans expressed in text, we adopt a 4-tuple temporal representation proposed in the TAC-KBP temporal slot filling task (Ji et al., 2011, 2013) to predict an event's earliest possible start date, latest possible start date, earliest possible end date and latest possible end date, given the entire document. We choose to work at the day level and leave smaller time scales for future work since, for example, only 0.6% of the time expressions in the newswire documents in ACE contain smaller granularities (e.g., hours or minutes).
Fortunately, the uncertain time boundaries of an event can often be inferred from its related events in the global context of a document. For example, in Table 1, there are no explicit time expressions or clear linguistic clues in the local context to infer the time of the appeal event. But the earliest possible date of the refuse event is explicitly expressed as 2003-04-18. Since the appeal event must happen before the refuse event, we can infer the earliest start and the latest end date of appeal as 2003-04-18. However, there are usually many other irrelevant events in the same document, which requires us to develop an effective approach to select related events and perform temporal information propagation. We first use event-event relations to construct a document-level event graph for each input document, as illustrated in Figure 1. We leverage two types of event-event relations: (1) if two events share the same entity as their arguments, then they are implicitly connected; (2) automatic event-event temporal relation extraction methods provide important clues about which element in the 4-tuple of an event can be propagated to which 4-tuple element of another event. We propose a novel time-aware graph propagation framework based on graph attention networks (GAT; Velickovic et al., 2018) to propagate temporal information across events in the constructed event graphs.
Experimental results on a benchmark, newly created on top of ACE2005 annotations, show that our proposed cross-event time propagation framework significantly outperforms state-of-the-art event time extraction methods using contextualized embedding features. Our contributions can be summarized as follows.
• This is the first work taking advantage of the flexibility of the 4-tuple representation to formulate absolute event timeline construction.
• We propose a GAT based approach for timeline construction which effectively propagates temporal information over document-level event graphs without solving large constrained optimization problems (e.g., Integer Linear Programming (ILP)) as previous work did. We propose two effective methods to construct the event graphs, based on shared arguments and temporal relations, which allow time information to be propagated across the entire document.
• We build a new benchmark with over 6,000 human annotated non-infinite time elements, which implements the 4-tuple representation for the first time as a timeline dataset, and is intended to be used for future research on absolute timeline construction.

4-tuple Event Time Representation
Grounding events into a timeline necessitates extracting the start and end time of each event. However, the start and end time of most events is not explicitly expressed in a document. To capture such uncertainty, we adopt the 4-tuple representation introduced in the TAC-KBP2011 temporal slot filling task (Ji et al., 2011, 2013). We define the 4-tuple event time as four time elements for an event, e → ⟨τ−start, τ+start, τ−end, τ+end⟩,3 which indicate the earliest possible start date, latest possible start date, earliest possible end date, and latest possible end date, respectively. These four dates follow hard constraints:

τ−start ≤ τ+start, τ−end ≤ τ+end, τ−start ≤ τ−end, τ+start ≤ τ+end. (1)
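To make the representation concrete, here is a minimal sketch in Python. The class and helper names are ours, not from the paper's implementation, and `None` stands in for the +/-infinite labels used in the annotation:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FourTuple:
    """Sketch of the 4-tuple event time; None means an unbounded (-inf/+inf) element."""
    earliest_start: Optional[date]
    latest_start: Optional[date]
    earliest_end: Optional[date]
    latest_end: Optional[date]

    def is_consistent(self) -> bool:
        """Check the hard constraints of Eq. (1).
        An unbounded element (None) is leniently treated as satisfying any comparison."""
        def le(a, b):
            return a is None or b is None or a <= b
        return (le(self.earliest_start, self.latest_start)
                and le(self.earliest_end, self.latest_end)
                and le(self.earliest_start, self.earliest_end)
                and le(self.latest_start, self.latest_end))
```

For the appeal event in Table 1, the inferred tuple would be `FourTuple(date(2003, 4, 18), None, None, date(2003, 4, 18))`, which satisfies Eq. (1).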

The above temporal representation was originally designed for entity slot filling, but we regard it as an expressive way to describe events as well: (1) it allows for flexible representation of fuzzy time spans, so that events whose accurate dates cannot be determined can still be grounded into a timeline; and (2) it allows for a unified treatment of various types of temporal information, which makes it convenient to propagate time over multiple events.

Annotation
We choose the Automatic Content Extraction (ACE) 2005 dataset because it includes rich annotations of event types, entity/time/value argument roles, time expressions, and their normalization results. In our annotation interface, each document is highlighted with event triggers and time expressions. The annotators are required to read the whole document and provide as precise information as possible for each element of the 4-tuple of each event. If there is no possible information for a specific time element, the annotators are asked to provide +/-infinite labels. Overall, we have annotated 182 documents from this dataset, mostly from the broadcast news and newswire genres. Detailed data statistics and data splits are shown in Table 2. We annotated all the documents in two independent passes. Two experts led the final adjudication based on the independent annotations and discussions with the annotators, since a single annotation pass is likely to miss important clues, especially when an event and its associated time expression appear in different paragraphs.

Overview
The input is a document D = [w_1, ..., w_n] containing event triggers E = [e_1, ..., e_m] and time expressions T = [t_1, ..., t_l]; we use gold-standard annotations for event triggers and time expressions. Our goal is to connect the event triggers E and time expressions T scattered in a document, and to estimate their association scores so as to select the most probable values for the 4-tuple elements. At a high level, our approach is composed of: (1) a text encoder to capture semantic and narrative information in local context, (2) a document-level event graph to incorporate global knowledge, (3) a graph-based time propagation model to propagate time along event-event relations, and (4) an extraction algorithm to generate the 4-tuple output. Among these four components, (1) and (4) constitute the minimal requirements of an extractor, which serve as our baseline model and are described in Section 3.2. We detail how we use event arguments and temporal ordering to construct the document-level event graph, component (2), in Section 3.3. We present our graph-based time propagation model in Section 3.4, and wrap up with the training objective and other details in Section 3.5.
We list the notations in Table 3; each will be explained when first encountered.

Baseline Extraction Model
Our baseline extraction model is an event-time pair classifier built on a pre-trained language model encoder (Devlin et al., 2019; Liu et al., 2019; Beltagy et al., 2020). The pre-trained language model gives us a contextualized representation for every token in a given text, from which we directly derive the local representations for event triggers and time expressions, denoted h_{e_i} for event trigger e_i and h_{t_j} for time expression t_j. For events or time expressions spanning multiple tokens, we take the average of the token representations. Thus, all h_{e_i} and h_{t_j} have the same dimension.
We pair each event and time expression in the document, i.e., {(e_i, t_j) | e_i ∈ E, t_j ∈ T}, to form the training examples. After obtaining event and time representations, we concatenate them and feed them into a 2-layer feed-forward neural classifier, which estimates the probability of filling t_j into e_i's 4-tuple time elements:

p_{i,j,k} = σ( W_2 ReLU( W_1 [h_{e_i}; h_{t_j}] + b_1 ) + b_2 )_k,  (2)

where σ(·) is the sigmoid function, and W_{1,2} and b_{1,2} are learnable parameters. In short, we use τ_{i,k} to represent the k-th element of τ_i (k ∈ {1, 2, 3, 4}), and p_{i,j,k} represents the probability that t_j fills the k-th element of the 4-tuple τ_i. The baseline model thus consists of four binary classifiers, one for each element of the 4-tuple.
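The pair classifier can be sketched in NumPy as follows. This is an illustrative sketch with randomly initialized parameters and an assumed ReLU hidden layer; the actual model is trained end-to-end together with the encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_classifier(h_e, h_t, W1, b1, W2, b2):
    """2-layer feed-forward scorer for one (event, time) pair.
    Returns four probabilities p_{i,j,k}, one per 4-tuple element."""
    x = np.concatenate([h_e, h_t])          # [h_{e_i}; h_{t_j}]
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer (assumed nonlinearity)
    logits = W2 @ hidden + b2               # one logit per 4-tuple element
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid

# Toy dimensions; the paper uses 384-dimensional hidden layers.
d, hid = 8, 16
W1, b1 = rng.normal(size=(hid, 2 * d)), np.zeros(hid)
W2, b2 = rng.normal(size=(4, hid)), np.zeros(4)
probs = pair_classifier(rng.normal(size=d), rng.normal(size=d), W1, b1, W2, b2)
```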
When determining the 4-tuple for each event e_i, we estimate the probabilities for t_1 through t_l and, for each element, take the time expression with the highest probability to fill that element. A practical issue is that the same time is often expressed at different granularity levels, such as 2020-01-01 and 2020-W1, following the most common TIMEX format (Ferro et al., 2005). To uniformly represent all time expressions and allow a certain degree of uncertainty, we introduce the following 2-tuple normalized form for time expressions, t_j → ⟨t_j^-, t_j^+⟩, which indicates the time range of t_j by two dates, where t^- represents the earliest possible date and t^+ represents the latest possible date.
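A minimal normalization sketch (hypothetical helper; it covers only the day, month, and week formats from the example above, and follows the example's week convention of counting weeks from January 1 rather than ISO-8601 week numbering):

```python
from datetime import date, timedelta
import calendar

def normalize_timex(value: str):
    """Map a normalized TIMEX value to a 2-tuple (t^-, t^+) of dates.
    Only the day/month/week formats used in the running example are handled;
    a full implementation would cover the complete TIMEX grammar."""
    if "W" in value:                            # e.g. "2020-W1"
        year, week = value.split("-W")
        # Week n = days 7(n-1)+1 .. 7n of the year, matching the paper's example.
        start = date(int(year), 1, 1) + timedelta(days=7 * (int(week) - 1))
        return start, start + timedelta(days=6)
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 3:                         # e.g. "2020-01-01": a single day
        d = date(*parts)
        return d, d
    year, month = parts                         # e.g. "2020-01": a whole month
    last = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last)
```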
We also make the simplification that earliest possible values can only fill earliest possible dates, i.e., t_j^- is only a candidate for τ−start and τ−end, and t_j^+ only for τ+start and τ+end. This constraint can be relaxed in future work. Here is an example of how we determine the binary labels for event-time pairs: if the 4-tuple time for an event is ⟨2020-01-01, 2020-01-03, 2020-01-01, 2020-01-07⟩ and the 2-tuple for the time expression 2020-W1 is ⟨2020-01-01, 2020-01-07⟩, then the classification labels of this event-time pair will be ⟨True, False, True, True⟩.
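The labeling rule can be sketched as follows (hypothetical helper; the 4-tuple and 2-tuple are ordered as above, and under the simplification t^- is compared against the two earliest elements and t^+ against the two latest):

```python
from datetime import date

def pair_labels(four_tuple, two_tuple):
    """Binary labels for one event-time pair: element k is True iff the
    matching bound of the time expression equals the k-th 4-tuple element."""
    t_minus, t_plus = two_tuple
    # Order: earliest start, latest start, earliest end, latest end.
    targets = [t_minus, t_plus, t_minus, t_plus]
    return [tau == t for tau, t in zip(four_tuple, targets)]
```

On the example above this reproduces the labels ⟨True, False, True, True⟩.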

Event Graph Construction
Before we conduct the global time propagation, we first construct document-level event graphs. In this paper, we focus on two types of event-event relations: (1) shared entity arguments, and (2) temporal relations.
Event Argument Graph. Event argument roles provide local information about events and two events can be connected via their shared arguments.
We denote the event-argument graph as G_arg = {(e_i, v_j, r_{i,j})}, where e_i represents an event, v_j represents an entity or a time expression, and r_{i,j} represents the bi-directed edge between e_i and v_j, namely the argument role. For example, in Figure 1, there are two edges between the "sent" event (e_1) and the entity "Royal Marines" (v_1), namely (e_1, v_1, AGENT) and (v_1, e_1, AGENT). In addition, we add a self-loop for each node in this graph. The graph can be constructed with Information Extraction (IE) techniques; we use the gold-standard event annotation from the ACE2005 dataset in our experiments.
Event Temporal Graph. Event-event temporal relations provide explicit directions for propagating time information. If we know that an attack event happened before an injury event, the lower-bound end date of the attack can possibly be the start date of the injury. We denote the event temporal graph as G_temp = {(e_i, e_j, γ_{i,j})}, where e_i and e_j denote events and γ_{i,j} denotes the temporal order between e_i and e_j. As in G_arg, we also add a self-loop for each node in G_temp and edges in both directions. For example, for a BEFORE relation from e_1 to e_2, we add two edges, (e_1, e_2, BEFORE) and (e_2, e_1, AFTER). We only consider BEFORE and AFTER relations when constructing the event temporal graph. To propagate time information, we also use local time arguments as in the event argument graph.
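The two graph construction steps can be sketched as follows (hypothetical helpers; the `SELF` label we attach to self-loops is our own placeholder, not a label from the paper):

```python
def build_argument_graph(arguments):
    """G_arg from (event, entity, role) argument triples:
    bi-directed edges plus a self-loop per node, as described above."""
    nodes, edges = set(), []
    for e, v, role in arguments:
        nodes.update([e, v])
        edges.append((e, v, role))
        edges.append((v, e, role))
    edges.extend((n, n, "SELF") for n in nodes)
    return edges

INVERSE = {"BEFORE": "AFTER", "AFTER": "BEFORE"}

def build_temporal_graph(relations):
    """G_temp from (e_i, e_j, relation) triples: only BEFORE/AFTER are kept,
    with an inverse edge for each relation and a self-loop per node."""
    nodes, edges = set(), []
    for e1, e2, rel in relations:
        if rel not in INVERSE:
            continue
        nodes.update([e1, e2])
        edges.append((e1, e2, rel))
        edges.append((e2, e1, INVERSE[rel]))
    edges.extend((n, n, "SELF") for n in nodes)
    return edges
```

For the Figure 1 example, `build_argument_graph([("sent", "Royal Marines", "AGENT")])` yields both directed AGENT edges plus the two self-loops.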
We apply a state-of-the-art event temporal relation extraction model to extract temporal relations for event pairs that appear in the same sentence or in two consecutive sentences, and we only keep relations whose confidence score is over 90%.

Event Graph-based Time Propagation
After obtaining the document-level graphs G arg and G temp , we design a novel time-aware graph neural network to perform document-level 4-tuple propagation.
Graph neural networks (Dai et al., 2016; Kipf and Welling, 2017; Hamilton et al., 2017; Schlichtkrull et al., 2018; Velickovic et al., 2018) have been shown to be effective for relational reasoning (Zhang et al., 2018; Marcheggiani et al., 2018). We adopt graph attention networks (GAT; Velickovic et al., 2018) to propagate time through event-argument or event-event relations. GAT aggregates and updates information for each node from its neighbors through an attention mechanism. Compared to the original GAT, we further include a relational embedding for edge labels when computing attention, to capture the various types of relations between each event and its neighboring events.
The graphs G_arg and G_temp, together with the GAT model, are placed in the intermediate layer of our baseline extraction model (Section 3.2), i.e., between the pre-trained language model encoder and the 2-layer feed-forward neural classifier (Eq. (2)). For clarity, we denote all events and entities as nodes V = {v_1, ..., v_n} and use r_{i,j} to denote their relation types. More specifically, we stack several layers of GAT on top of the contextualized node representations h_{v_i}, and we follow Vaswani et al. (2017) in using multi-head attention for each layer. We use the simplified notation h_{v_i} to describe one of the attention heads h^k_{v_i}:
h'_{v_i} = ELU( Σ_{j ∈ N(i)} α_{i,j} W h_{v_j} ),

where ELU is the exponential linear unit (Clevert et al., 2016), a_{i,j} is the attention coefficient of nodes v_i and v_j, α_{i,j} is the attention weight after softmax over the neighborhood, and h_{v_i} and h'_{v_i} are the hidden states of node v_i before and after one GAT layer, respectively. We use N(i) to denote the neighborhood of v_i. The attention coefficients are calculated through

a_{i,j} = σ( a^T [ W h_{v_i}; W h_{v_j}; φ_{r_{i,j}} ] ),   α_{i,j} = softmax_{j ∈ N(i)}( a_{i,j} ),

where σ is the LeakyReLU activation function, and φ_{r_{i,j}} is the learnable relational embedding for relation type r_{i,j} that we add on top of the original GAT. After performing attention for each head, we concatenate the m attention heads to compute the representation of v_i for the next layer. We stack n_l GAT layers to obtain the final representations for events and time expressions. These representations are fed into the 2-layer feed-forward neural classifier in Eq. (2) to generate the corresponding probabilities.
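A single attention head of the relational GAT layer can be sketched in NumPy as follows. This is a hypothetical, randomly parameterized sketch for illustration; the real model uses trained weights, multiple heads, and batched tensor operations:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def gat_layer(H, edges, W, a, rel_emb):
    """One single-head relational GAT layer.
    H: (n, d) node states; edges: list of (i, j, r) with j in N(i);
    W: (d_out, d) projection; a: attention vector of size 2*d_out + d_r;
    rel_emb: dict mapping relation type r to its embedding phi_r of size d_r."""
    n = H.shape[0]
    Wh = H @ W.T
    out = np.zeros_like(Wh)
    for i in range(n):
        nbrs = [(j, r) for (src, j, r) in edges if src == i]
        # a_ij = LeakyReLU(a^T [W h_i ; W h_j ; phi_{r_ij}])
        scores = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j], rel_emb[r]]))
                           for j, r in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                     # softmax over the neighborhood
        out[i] = elu(sum(w * Wh[j] for w, (j, _) in zip(alpha, nbrs)))
    return out

rng = np.random.default_rng(1)
H = rng.normal(size=(2, 4))
W = rng.normal(size=(3, 4))
a = rng.normal(size=2 * 3 + 2)
rel = {"SELF": rng.normal(size=2), "AGENT": rng.normal(size=2)}
edges = [(0, 0, "SELF"), (0, 1, "AGENT"), (1, 1, "SELF"), (1, 0, "AGENT")]
out = gat_layer(H, edges, W, a, rel)
```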

Training Objective
Since we model the 4-tuple extraction task with four binary classifiers, we adopt the log loss as our training objective:

L = − Σ_{i,j,k} [ y_{i,j,k} log p_{i,j,k} + (1 − y_{i,j,k}) log(1 − p_{i,j,k}) ],

where y_{i,j,k} ∈ {0, 1} is the gold label indicating whether t_j fills the k-th element of τ_i. Since the 4-tuple elements are extracted from time expressions, the model cannot generate +/-inf (infinite) outputs by itself. To address this issue, we adopt an additional hyperparameter, the inf threshold, and convert predicted time values whose scores fall below this threshold into +/-inf values. That is, we also regard the probability p_{i,j,k} as a confidence score: a low score indicates the model cannot determine the result for a 4-tuple element, so it is natural to set that element to inf. When this happens for τ−start or τ−end, we set the value to -inf; when it happens for τ+start or τ+end, we set the value to +inf. This threshold, and the search for it, is applied to both the baseline extraction and the GAT-based extraction systems. The extraction model may still generate 4-tuples that violate the constraints in Eq. (1); we leave enforcing these constraints for future work.
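The inf-threshold back-off can be sketched as follows (hypothetical helper; `scores` and `dates` hold the best candidate probability and date for each of the four elements):

```python
def apply_inf_threshold(scores, dates, threshold):
    """Fill a 4-tuple from per-element best candidates, backing off to +/-inf
    when the best confidence p_{i,j,k} falls below the threshold."""
    # Earliest elements (k = 0, 2) back off to -inf; latest (k = 1, 3) to +inf.
    INF = ("-inf", "+inf", "-inf", "+inf")
    return tuple(d if p >= threshold else INF[k]
                 for k, (p, d) in enumerate(zip(scores, dates)))
```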

Data and Experiment Setting
We conduct our experiments on previously introduced annotated data. Statistics of the dataset and splits are shown in Table 2.
Experiment Setup. We compare our proposed graph-based time propagation model with the following baselines:
• Local gold-standard time argument: The gold-standard time argument annotation provides an upper bound on the performance that a local time extraction system can achieve on our document-level 4-tuple time extraction task. We map gold-standard time argument roles to our 4-tuple representation scheme and report its performance for comparison. Specifically, if the argument role indicates the start time of an event (e.g., TIME-AFTER, TIME-AT-BEGINNING), we map the date to τ−start and τ+start; if the argument role indicates the end time of an event (e.g., TIME-BEFORE), we map the date to τ−end and τ+end; if the argument role is TIME-WITHIN, we map the date to all four elements. All other elements are left as infinite.
• Baseline extraction model: We compare our model with the baseline extraction model using contextualized embeddings introduced in Section 3.2. We use two contextualized embedding methods, RoBERTa (Liu et al., 2019) and Longformer (Beltagy et al., 2020), which provide sentence-level4 and document-level contextualized embeddings, respectively.
For our proposed graph-based time propagation model, we use contextualized embeddings from Longformer and consider two types of event graphs: (1) graphs constructed from event arguments, and (2) graphs constructed from temporal relations and time arguments. We optimize our model with Adam (Kingma and Ba, 2015) for up to 500 epochs with a learning rate of 1e-4. We use dropout with a rate of 0.5 for each layer. The hidden size of the two-layer feed-forward neural networks and of the GAT heads is 384 for all models. The size of the relation embeddings is 50. We use 4 attention heads for GAT, and the number of layers n_l is 2 for all GAT models. We use a fixed pretrained model5 to obtain the contextualized representation for each sentence or document. We run our experiments with 10 different random seeds and report the averaged scores. We evaluate our model at each epoch and search for the best threshold for infinite dates on the development set, using all predicted scores from the development set as candidate thresholds. We choose the model with the best accuracy on the development set and report its performance on the test set using the best threshold found on the development set.
Evaluation Metrics. We evaluate model performance with two metrics, exact match rate and approximate match rate, proposed in the TAC-KBP2011 temporal slot filling evaluation (Ji et al., 2011). For exact match rate, credit is only assigned when the extracted date for a 4-tuple element exactly matches the ground-truth date. The approximate match rate Q(·) compares the predicted 4-tuple τ̂ with the ground truth τ element by element:

Q(τ̂, τ) = (1/4) Σ_{k=1}^{4} 1 / (1 + |τ̂_k − τ_k|),  (9)

where the difference |τ̂_k − τ_k| is measured in days. In this way, partial credit is assigned based on how close the extracted date is to the ground truth. For example, if a gold-standard date is 2001-01-01 and the corresponding extracted date is 2001-01-02, the credit will be 1/(1 + |2001-01-01 − 2001-01-02|) = 1/2. If a gold-standard date is inf and the corresponding extracted date is 2001-01-02, the credit will be 1/(1 + |inf − 2001-01-02|) = 0.
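The approximate match credit of Eq. (9) can be sketched as follows (hypothetical helpers; `None` stands in for +/-inf, which per the definition above earns credit only when both prediction and gold are infinite):

```python
from datetime import date

def approx_credit(pred, gold):
    """Per-element credit 1 / (1 + |pred - gold| in days); inf mismatches score 0."""
    if pred is None or gold is None:
        return 1.0 if pred == gold else 0.0
    return 1.0 / (1.0 + abs((pred - gold).days))

def approx_match(pred4, gold4):
    """Approximate match Q: average of the four per-element credits."""
    return sum(approx_credit(p, g) for p, g in zip(pred4, gold4)) / 4.0
```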

Results
Our experiment results are shown in Table 4. From the results of directly converting sentence-level time arguments to the 4-tuple representation, we find that local time information is not sufficient for our document-level 4-tuple event time extraction task. The document creation time baseline also performs poorly, because a large portion of document-level 4-tuple event time information does not coincide with the document creation time that is widely used in previous absolute timeline construction. Comparing the basic extraction framework with sentence-level versus document-level contextualized embeddings, we find that incorporating document-level information from embeddings already improves system performance. Similarly, we see a performance improvement from rule-based time propagation, which again indicates the importance of document-level information for this task.
Our GAT based time propagation methods significantly outperform these baselines, both when using temporal relations and when using arguments to construct the event graphs. Specifically, we find that using relation embeddings significantly improves temporal relation based propagation, by 2.01% on exact match rate and 2.03% on approximate match rate. This is because temporal labels between events, such as BEFORE and AFTER, are more informative than argument roles for time-related tasks. Although our argument-based propagation model does not explicitly resolve conflicts, the violation rate of the 4-tuple constraints is only about 4% in its output.
Our time propagation framework has also been integrated into the state-of-the-art multimedia multilingual knowledge extraction system GAIA (Li et al., 2020a,b) for the NIST SM-KBP 2020 evaluation, where it achieved top performance in the intrinsic temporal evaluation.


Qualitative Analysis

Table 5 shows several cases comparing the various methods. In the first example, our argument based time propagation successfully propagates "Wednesday", which is attached to the event "arrive", to the "talk" event through the shared argument "Blair". In the second example, "Negotiation" and "meeting" share the arguments "Washington" and "Pyongyang", so the time information for "Negotiation" can be propagated to "meeting". In both of these cases, the basic extraction framework extracts wrong dates. The third example shows the effectiveness of temporal relation based propagation. We use the extracted temporal relation that "rumble" happens before "secured" to propagate time information. The basic extraction model is unaware of the temporal relation between these two events and thus makes mistakes.

Remaining Challenges
Some temporal boundaries may require synthesizing multiple temporal clues in the document. For example, in Table 1, the latest end date of the "sentence" event (2012-04-14) needs to be inferred by aggregating two temporal clues in the document, namely its duration of nine years and its start date of 2003-04-14.
Temporal information for many events, especially major events, may be incomplete in a single document. Taking the Iraq war as an example, one document may mention its start date and another its end date. To tackle this challenge, we need to extend document-level extraction to the corpus level and then aggregate temporal information for coreferential events across multiple documents.
It is also challenging for the current 4-tuple representation to capture recurring events such as paying monthly bills; currently we treat each recurrence as a separate event and fill in its slots independently. In addition, this work does not capture finer-grained information such as hours and minutes, but it is straightforward to extend the 4-tuple representation to these time scales in future work.
Our current annotations are produced by linguistic experts and are thus expensive to acquire. It is worth exploring crowd-sourcing methods in the future to make annotation more scalable and less costly.

Related Work
Event Temporal Anchoring. Event temporal anchoring was first introduced by Setzer (2002), who used temporal links (TLINKs) to specify the relations among events and times. However, the TimeBank corpus and the TimeBank-Dense corpus, which use the TimeML scheme (Pustejovsky et al., 2003a,b; Cassidy et al., 2014), are either too vague and sparse, or dense only within a limited scope. Recently, Reimers et al. (2016) annotated the start and end time of each event in TimeBank. We make several extensions: adding event types, capturing uncertainty with the 4-tuple representation instead of TLINKs so that indirect time can also be considered, and extending event-event relations to the document level.
Notably, Reimers et al. (2018) propose a decision tree with a neural network based classifier to find start and end times on the data of Reimers et al. (2016). Leeuwenberg and Moens (2018) use event times to construct a relative timeline.
Temporal Reasoning. Some early efforts incorporate event-event relations to perform temporal reasoning (Tatu and Srikanth, 2008) and propagate time information based on hard constraints learned from annotated data. Our work is largely inspired by Talukdar et al. (2012) on graph-based label propagation for acquiring temporal constraints for event temporal ordering. We extend this idea by constructing rich event graphs and proposing a novel GAT based method to assign propagation weights.
The idea of constructing event graphs based on shared arguments is also motivated by Centering Theory (Grosz et al., 1995), which has been applied to many NLP tasks such as modeling local coherence (Barzilay and Lapata, 2008) and event schema induction (Chambers and Jurafsky, 2009).

Conclusions and Future Work
In this paper, we have created a new benchmark for document-level event time extraction based on the 4-tuple representation, which is rich enough to handle uncertainty. We propose a graph-based time propagation model that uses event-event relations to construct document-level event graphs. Our experiments and analyses show the effectiveness of our model. In the future, we will focus on improving pretrained models' fundamental representation of time to capture more fine-grained temporal information, and on cross-document temporal aggregation.