Temporal Reasoning on Implicit Events from Distant Supervision

We propose TRACIE, a novel temporal reasoning dataset that evaluates the degree to which systems understand implicit events—events that are not mentioned explicitly in natural language text but can be inferred from it. This introduces a new challenge in temporal reasoning research, where prior work has focused on explicitly mentioned events. Human readers can infer implicit events via commonsense reasoning, resulting in a more comprehensive understanding of the situation and, consequently, better reasoning about time. We find, however, that state-of-the-art models struggle when predicting temporal relationships between implicit and explicit events. To address this, we propose a neuro-symbolic temporal reasoning model, SymTime, which exploits distant supervision signals from large-scale text and uses temporal rules to combine start times and durations to infer end times. SymTime outperforms strong baseline systems on TRACIE by 5%, and by 11% in a zero prior knowledge training setting. Our approach also generalizes to other temporal reasoning tasks, as evidenced by a gain of 1%-9% on MATRES, an explicit event benchmark.


Introduction
Understanding temporal relations between events in narrative text is a crucial part of text understanding. When reading a story, a human can construct a latent timeline about events' start and end times, similar to the one shown in Fig. 1 about an automobile accident. This timeline not only contains the placements of explicitly mentioned events (e.g., ride a bicycle), but also accounts for implicit events (e.g., Farrah was distracted so she looked away). Such a latent timeline explains the dynamics between events; for example, the possible chain of events between ride and recovered in this context contains get hit and injured. The ability to construct such a timeline is essential for understanding the causal dynamics of a situation. Without it, NLP systems cannot truly understand situations and reliably solve tasks such as temporal question-answering, causal inference, and scheduling assistance. To better evaluate this ability, we introduce a new dataset called TRACIE (TempoRAl Closure InfErence) that focuses on temporal relations on implicit events in short stories. Our dataset contains high-quality annotations of both start and end time queries that test a system's understanding of the full temporal closure (i.e., both start and end time) of events. Because the task requires considerable commonsense knowledge, we follow prior work in minimizing the size of the training set, making TRACIE mainly an evaluation set. The final TRACIE dataset contains a total of 5.4k human-curated instances, provided in a (multi-premise) textual entailment (TE) format, as illustrated at the bottom of Fig. 1. A pre-trained language model such as T5-Large (Raffel et al., 2020) fine-tuned on our new dataset achieves a modest binary prediction accuracy of 67.9%.[1] Consistent with other studies on temporal reasoning, these results reveal serious limitations in existing pre-trained language models.
Figure 1: A story, its latent timeline, and example TRACIE instances from it. For simplicity, events are shortened to single verbs and the timeline is exaggerated.
Context: Farrah was driving home from school. A person was riding a bicycle in front of her. Farrah looked away for a second. She didn't notice that he stopped. She tried to brake but it was too late. The person recovered soon.
* Most of the work was done when the third author was employed at the Allen Institute for AI and the first author was an intern there.
To build models better capable of understanding time with minimal direct training data, we propose a novel distant supervision technique that improves generalization by extracting temporal patterns in large-scale free text as part of an additional pre-training step. In contrast to other attempts at extracting temporal data through patterns at a sentence level (Gusev et al., 2011), we extract over large windows of text such as paragraphs. This allows for capturing global information related to multiple events and extracting signals that do not appear in small-window local contexts. The resulting model, PTNTIME (Pattern-Time), achieves a 76.6% accuracy on TRACIE, a 9% gain over using standard T5-Large. We also show the applicability of PTNTIME on a standard temporal reasoning benchmark involving only explicit events, MATRES (Ning et al., 2018b), with a 9-point gain in a low-resource setting.
We achieve further improvements by coupling PTNTIME with a duration model from prior work to create a neural-symbolic reasoning model called SYMTIME. The key idea in SYMTIME is to decompose the computation of temporal relations into predictions of relative distances between start times and predictions of durations. For example, in Fig. 1, we can decide that distracted likely ends before try starts because the duration of distracted is likely to be shorter than the distance between the two start times. This allows for better prediction of end times, which rarely appear in natural text and have previously been shown to be difficult to annotate (Ning et al., 2018b). Such a symbolic computation involves a logical combination of the individual models in a way that formalizes part of the Allen interval algebra (Allen, 1983). This model, which supports a wider range of temporal computation and can be used with and without task-specific supervision, achieves a final accuracy of 78.9% on TRACIE's binary classification metric. We also show that SYMTIME is more robust to different distributions of the training data, demonstrating the benefits of using a temporal model with a transparent reasoning process.
In summary, we make the following 3 contributions: (1) a temporal relation dataset TRACIE focusing on implicit events (§3); (2) a distant supervision process for temporal understanding of implicit events (§4); and (3) a reasoning model that makes end-time comparisons using predictions of start-time distances and durations (§5). Finally, we demonstrate the effectiveness of our models on TRACIE, as well as the applicability of our approach to an existing temporal benchmark (§6).
[1] The same model achieves 77.4% on MATRES (Ning et al., 2018b) with a similar amount of training instances. All TRACIE numbers reported in this section are from Table 2.

Related Work
Temporal reasoning has received much attention in the NLP community, and to date, there are many datasets that focus on temporal ordering (Pustejovsky et al., 2003; Bethard et al., 2007; Cassidy et al., 2014; Reimers et al., 2016; O'Gorman et al., 2016; Ning et al., 2018b, 2020b), and other temporal knowledge (Pan et al., 2006; Zhou et al., 2019). We focus here on modeling implicit events, which has received relatively little attention. Multiple systems have been proposed as part of research into temporal ordering (Do et al., 2012; Moens and Leeuwenberg, 2017; Leeuwenberg and Moens, 2018; Meng and Rumshisky, 2018; Ning et al., 2018c; Han et al., 2019), duration prediction (Vashishtha et al., 2019) and other tasks. Our decision to use a textual entailment style follows recent work on natural language inference (Williams et al., 2017; Nie et al., 2020; Bhagavatula et al., 2020), which tends to not focus on time (for recent work on temporal NLI, see Vashishtha et al. (2020)). Many have used distant supervision for temporal reasoning (Gusev et al., 2011; Ning et al., 2018a). Comparatively, our work captures longer-range dependencies in narrative text (for related ideas, see Ammanabrolu et al. (2021)).

This work is broadly related to works on causal dynamics (Pearl, 2009). The combined focus on temporal and causal aspects is also related to procedural text modeling (Tandon et al., 2018, 2020).
Context Story (Premise): Tom needed to get braces. He was afraid of them. The dentist assured him everything would be fine. Tom had them on for a while. Once removed he felt it was worth it.
Hypothesis: Tom avoids foods he can't eat with braces starts before the braces are removed.
Inference Label: entailment
Context Story (Premise): We were all watching Spongebob as a family. It is a kid's show but all really enjoyed it. This one episode was especially funny for the adults. It has humor in it that is funny for kids and adults. It is something we can all watch...
Hypothesis: The adults laughed at the jokes ends before we watch Spongebob as a family
Inference Label: contradiction
Context Story (Premise): I was throwing the baseball with my son. He threw one past me that landed in the lake. I reached in to get the ball. I lost my balance and fell in. I got the ball and a bath all in one shot!
Hypothesis: The ball was in the boys hand starts after he reached for the ball
Inference Label: contradiction
Figure 2: Example TRACIE instances. The comparator l ∈ {starts, ends} and relation r ∈ {before, after} in each hypothesis are highlighted, in addition to the corresponding explicit event from the story.

The TRACIE Dataset
In this section, we introduce the TRACIE dataset.

Task Overview and Dataset Construction
The goal of TRACIE is to test a system's ability to compare start and end times of non-extractive implicit event phrases instead of extractive triggers from the context. Such tests in TRACIE take the form of multi-premise textual entailment (TE) (Lai et al., 2017). Each TRACIE instance contains 1) a context story (or premise) consisting of a sequence of explicit narrative events; 2) an implicit event in the form of a natural language phrase that is unmentioned but has some role in the story; 3) a comparator of either {starts, ends}; 4) an explicit event, also in the form of a phrase; and 5) a temporal relation of either {before, after} that marks the relationship, in the dimension defined by the comparator, between the implicit-event and the explicit-event. With these components, we are able to generate TE-style instances, using the context story as the premise and temporal queries about pair-wise relations between implicit and explicit events as hypotheses. For example, in the first positive instance shown in Fig. 1, "distracted" is the implicit-event, "starts" is the comparator, "try" is the explicit-event and "before" is the temporal-relation. They form a positive hypothesis "distracted starts before try." We flip the temporal-relation (i.e., "before" to "after" and vice versa) to create negative (contradiction) instances, as shown in the second example instance in Fig. 1.
Figure 3: TRACIE's label definition and its relation to Allen's interval algebra, with a graph illustration between an implicit event and an explicit event.
Since the start times of explicit-events are more obvious to human annotators, we use them as reference points and compare the implicit-event's start or end time with them (depending on the comparator), according to the label definitions shown in Fig. 3. In rare cases where two time points are the same (e.g., hit and get hit start at the same time in Fig. 1), we use the causal relation to decide the order, so that hit starts before get hit. Such instances are created through a multi-stage annotation process, detailed in order below. All steps are implemented with the CrowdAQ platform (Ning et al., 2020a) with qualification exams.
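As an illustration of the instance structure described above, the following is a minimal sketch of how a hypothesis and its flipped (contradiction) counterpart could be assembled. The class name `TracieInstance`, the helper names, and the plain space-joined hypothesis template are assumptions made for this example, not the released data format; whether every instance yields both a positive and a negative pair is also an assumption.

```python
from dataclasses import dataclass

# Flipping the temporal relation turns an entailed hypothesis into a contradiction.
FLIP = {"before": "after", "after": "before"}


@dataclass(frozen=True)
class TracieInstance:  # hypothetical container, not the official schema
    story: str           # context story (premise)
    implicit_event: str  # phrase not mentioned in the story
    comparator: str      # "starts" or "ends"
    explicit_event: str  # phrase taken from the story
    relation: str        # gold relation: "before" or "after"


def hypothesis(inst, relation=None):
    """Render '<implicit> <comparator> <relation> <explicit>'."""
    rel = inst.relation if relation is None else relation
    return f"{inst.implicit_event} {inst.comparator} {rel} {inst.explicit_event}"


def to_entailment_pairs(inst):
    """Return (premise, hypothesis, label) for the gold and the flipped relation."""
    positive = (inst.story, hypothesis(inst), "entailment")
    negative = (inst.story, hypothesis(inst, FLIP[inst.relation]), "contradiction")
    return [positive, negative]


example = TracieInstance(
    story="Farrah was driving home from school. A person was riding a bicycle "
          "in front of her. Farrah looked away for a second.",
    implicit_event="distracted",
    comparator="starts",
    explicit_event="try",
    relation="before",
)
pairs = to_entailment_pairs(example)
```

Here `pairs[0]` carries the hypothesis "distracted starts before try" with the label entailment, and `pairs[1]` carries the flipped hypothesis with the label contradiction.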

Implicit Event Generation
We randomly sample short stories from the ROCStories dataset (Mostafazadeh et al., 2016). For each story, one annotator writes 5 implicit event phrases that are not explicitly mentioned by the given story, but are inferable and relevant. The annotator additionally rewrites two explicit events closest to the implicit event's start and end time, respectively. With these two events, we can build two TRACIE instances (minus the temporal-relation) per implicit event, which accounts for 10 instances in total per story.
Automatic Instance Generation
We use AllenNLP (Gardner et al., 2018) to extract all verbs and relevant arguments with its semantic role labeling (SRL) model. With all the verbs and their arguments, we construct a pool of explicit events in the form of short phrases. For each implicit event, we randomly select two {explicit-event, comparator} pairs from the pool and build 10 additional instances (without temporal-relation).
Label Collection
For each of the 20 instances per story, we annotate the temporal-relation with four different annotators. Annotators follow the label definition in §3.1 to produce four temporal-relations for each instance. We use the majority agreement as the final label and filter out instances without agreement. Two authors additionally verified the instances with ambiguous verbs (e.g., "have") and corrected 5% of the end-time instances.
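The aggregation step can be sketched as follows. The threshold is an assumption: the text only says that the majority label is kept and that instances without agreement are filtered out, so this sketch treats "majority" as at least three of the four annotators and drops 2-2 ties. The function name `majority_label` is hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agree=3):
    """Return the label chosen by at least `min_agree` annotators, else None.

    `min_agree=3` is an assumed threshold for four annotators.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


kept = majority_label(["before", "before", "after", "before"])
dropped = majority_label(["before", "after", "before", "after"])
```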

Splits and Analysis
We split the data under the independent and identically distributed (i.i.d.) assumption based on stories, with a 20/80 train/test ratio. We use a small training set, following Zhou et al. (2019), as we believe temporal relations involve much commonsense knowledge. As we later show in §6.3, it is infeasible to collect a large enough human-annotated training set to capture all the knowledge needed to tackle this problem completely, and a system must acquire knowledge from external resources. As a result, we use a small training set just to define the task, and at the same time, use an extensive testing set for more robust evaluation.
The authors conduct a human upper-bound analysis on 100 randomly sampled instances, following the procedure of prior work. There is a 94% agreement and a 98% resolved accuracy, suggesting that TRACIE has a high annotation quality.

Pattern-Based Pre-Training
As argued in §3.2, we believe that it is more efficient to build a model that learns the prior knowledge needed for the task with distant signals and only subsequently learns the task definition through a small training set. This section describes how we collect the distant signals related to events' start-time comparisons and pre-train a novel temporally-aware transformer model called PTNTIME. While PTNTIME will be used for fine-tuning directly on TRACIE, it will also form the basis of a more general temporal reasoning model called SYMTIME that we describe in §5.

Distant Supervision Collection
We describe the sources of distant supervision signals with the goal of understanding the relative order between two events' start times as well as the relative distance between them.
Figure 4 (example paragraph and the instances extracted from it).
Paragraph: I went to the park on January 1st. I was very hungry after some hiking. Luckily, I purchased a lot of food before I went to the park. I enjoyed the trip and wrote an online review about the trip on the 10th.
Extracted instances: [I purchased food, I went to the park]: before; [I went to the park, I wrote a review]: before, weeks
Within-Sentence Extraction
We collect start time comparisons between pairs of events heuristically from free text using "before/after" keywords, following much prior work in temporal modeling and extraction (Do et al., 2012). We use AllenNLP's SRL model to process each input sentence and find verbs with a temporal argument that starts with either "before" or "after" and contains at least one other verb. If there are multiple verbs in the temporal argument, we take the one with the largest number of tokens as arguments. We pair the two extracted verbs with the relation indicated by the first word of the temporal argument, either "before" or "after". As the example in Fig. 4 shows, the extractor identifies that purchase food is before go to park, as indicated by the "before" keyword mentioned in the text. We acquire 2.8 million instances from the May 2020 Wikipedia dump using this process.
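The within-sentence heuristic can be sketched on pre-parsed input. The frame structure below (a list of dictionaries with a `verb` and a role-to-text `args` mapping) is a simplified, hypothetical stand-in for SRL output, not AllenNLP's actual return format, and matching inner verbs by surface form is an assumption made to keep the example short.

```python
def num_arg_tokens(frame):
    """Total number of tokens across all arguments of a frame."""
    return sum(len(text.split()) for text in frame["args"].values())


def extract_before_after(frames):
    """frames: list of {"verb": str, "args": {role: text}} for ONE sentence.

    Returns (main_verb, relation, other_verb) triples.
    """
    pairs = []
    for main in frames:
        tokens = main["args"].get("ARGM-TMP", "").lower().split()
        # Keep only temporal arguments that start with "before" or "after".
        if not tokens or tokens[0] not in ("before", "after"):
            continue
        # Other verbs of the sentence that occur inside the temporal argument.
        inner = [f for f in frames
                 if f is not main and f["verb"].lower() in tokens[1:]]
        if not inner:
            continue
        # With several candidates, prefer the verb with the most argument tokens.
        other = max(inner, key=num_arg_tokens)
        pairs.append((main["verb"], tokens[0], other["verb"]))
    return pairs


# Hand-written frames for: "Luckily, I purchased a lot of food before I went to the park."
frames = [
    {"verb": "purchased",
     "args": {"ARG0": "I", "ARG1": "a lot of food",
              "ARGM-TMP": "before I went to the park"}},
    {"verb": "went", "args": {"ARG0": "I", "ARG4": "to the park"}},
]
extracted = extract_before_after(frames)
```

On the Fig. 4 sentence this yields a single triple stating that purchased is before went.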

Cross-Sentence Extraction
The data collected from the within-sentence patterns does not reveal the relative distance between two start times. In addition, because writers often leave trivial inferences unstated for efficiency, certain event pairs rarely co-occur within a small textual window, making one event often implicit to the other one in these pairs. To better collect such signals, we employ a cross-sentence extraction that finds direct temporal expressions of hours and dates. Because these temporal expressions (e.g., 2021-01-01) are globally comparable, the compared events can be anywhere in a document. Therefore, this process collects more supervision signals about time-point comparisons and their relative distance on event pairs with trivial causal relations. We apply the SRL model and find all temporal arguments and their associated verbs. We find the exact temporal values by filling unmentioned elements of a temporal expression with the nearest previous mention (e.g., we add "January" to the expression "the 10th" in Fig. 4). These extractions have high precision, as the SRL model does well on identifying temporal arguments.
We then construct supervision instances under the assumption that the extracted temporal expressions describe the start times of the associated verbs (e.g., went started on January 1st in Fig. 4). Each instance comprises an event pair, a temporal relation, and an estimate of the temporal difference between the two start times. Each event is a phrase constructed by taking all relevant arguments of the predicate verb in the SRL parses. We represent the difference between the two start times as one of seven coarse temporal units: {≤minutes, hours, days, weeks, months, years, ≥decades}. For example, we get go to park is weeks before write review, as shown in Fig. 4. In addition to the event pairs, we randomly sample sentences within the paragraph to use as the context that better defines the events. We collect 700k instances from this cross-sentence extraction process from Wikipedia.
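A minimal sketch of turning two resolved start times into a supervision instance is shown below. The bucket boundaries are assumptions: the text names the seven units but does not state how a time difference is mapped onto them, so this sketch picks the largest unit that fits into the gap. The function names and the year 2021 in the example are likewise illustrative.

```python
from datetime import datetime

DAY = 86400  # seconds

# (label, minimum gap in seconds), largest first. Boundaries are assumed.
THRESHOLDS = [
    ("≥decades", 10 * 365 * DAY),
    ("years", 365 * DAY),
    ("months", 30 * DAY),
    ("weeks", 7 * DAY),
    ("days", DAY),
    ("hours", 3600),
]


def coarse_unit(seconds):
    """Map an absolute time gap to one of the seven coarse units."""
    for label, minimum in THRESHOLDS:
        if seconds >= minimum:
            return label
    return "≤minutes"


def start_time_instance(event_a, time_a, event_b, time_b):
    """Build (event_a, event_b, relation, unit) from two start times.

    Equal times fall into "after" here; a real pipeline might skip them.
    """
    delta = (time_b - time_a).total_seconds()
    relation = "before" if delta > 0 else "after"
    return event_a, event_b, relation, coarse_unit(abs(delta))


example = start_time_instance(
    "go to park", datetime(2021, 1, 1),
    "write review", datetime(2021, 1, 10),
)
```

With these assumed boundaries, the nine-day gap of the Fig. 4 example falls into the weeks bucket, matching the instance described in the text.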
Language Model (LM) Pre-Training Data
We couple the specialized temporal pre-training data described above with additional paragraphs that are used to perform conventional language model pre-training using the original denoising task proposed in Raffel et al. (2020). This is done to maintain part of the original language model's semantics and to avoid overfitting. We use the Gutenberg Dataset (Lahiri, 2014) as the source and collect 1 million paragraphs for this purpose.
In the sequence format of the distant supervision instances, [EventA] represents the tokens that describe the first event; [EventB] represents the ones that describe the second event; and [Paragraph] represents the tokens of the context, which is non-empty only for cross-sentence extractions. [Relation] is either before or after, and [Label] is either positive or negative. When the label is positive, the relation is the gold relation extracted from the text; when it is negative, the relation is the inverse of the extracted relation. We randomly make 50% of the instances negative. [Distance] is one of the 7 coarse temporal units represented with a set of blank tokens [extra_id_N]. We leave it blank for the within-sentence extractions so that the objective function does not include it in loss computations. The LM pre-training data follows the original format in Raffel et al. (2020).
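The negative-sampling rule above can be sketched as follows. The exact input and output strings are hypothetical: the placeholders [EventA], [EventB], [Paragraph], [Relation], [Label] and [Distance] come from the text, but the template that joins them (including the words "event:", "story:" and "answer:") is an assumption modeled on the duration format in §5.2, and the distance is written as a plain word rather than as [extra_id_N] tokens.

```python
import random

FLIP = {"before": "after", "after": "before"}


def make_instance(event_a, event_b, gold_relation,
                  paragraph="", distance=None, rng=random):
    """Return (source, target) strings for one distant-supervision pair.

    With probability 0.5 the relation is inverted and the label is "negative".
    `distance` stays empty for within-sentence extractions.
    """
    negative = rng.random() < 0.5
    relation = FLIP[gold_relation] if negative else gold_relation
    label = "negative" if negative else "positive"
    # Assumed template, not the paper's literal format.
    source = f"event: {event_a} starts {relation} {event_b} story: {paragraph}"
    target = f"answer: {label} {distance or ''}".strip()
    return source, target


rng = random.Random(0)  # seeded only to make the sketch reproducible
samples = [
    make_instance("I purchased food", "I went to the park", "before", rng=rng)
    for _ in range(20)
]
```

In every sample the label is positive exactly when the rendered relation equals the gold relation, which is the property the pre-training objective relies on.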

Pattern-Based Temporal Model (PTNTIME)
We use a pre-trained sequence-to-sequence model as our base model and additionally pre-train this model using the data collected in §4.1 (for modeling details, see §6.1). We call the resulting model PTNTIME. As a result of this additional pre-training step, PTNTIME serves as a new set of temporally-aware model weights that can be used in place of existing pre-trained models and fine-tuned on TRACIE. As we describe next, we also use PTNTIME to build a modular temporal reasoning model called SYMTIME that attempts to go beyond a standard language modeling approach and improve start and end point prediction.

Symbolic Temporal Reasoning Model (SYMTIME)
To address the challenge of predicting event end times for which it is difficult to obtain high-quality direct or distant supervision, we introduce a new reasoning model called SYMTIME in this section. This model makes end-time comparisons by symbolically combining start time distance and duration from separate predictions based on some of the components introduced in the previous section. Different from Leeuwenberg and Moens (2018) and Vashishtha et al. (2019), our model does not rely on explicit annotations on timepoints, but only relative comparisons between them.

Formulation
As described in §3.1, hypotheses in TRACIE make pair-wise comparisons between two events $e_1$ and $e_2$ using a comparator $l$ from {starts, ends} and a query-relation $r$ from {before, after} based on a provided story context. We associate each $e_j$ with a latent start time $\mathit{start}_j$ and an end time $\mathit{end}_j$, as well as, for convenience, a duration $\mathit{duration}_j = \mathit{end}_j - \mathit{start}_j$. Under this formulation, a symbolic approach to solving TRACIE involves computing the relation functions $r_l$ shown in Figure 5. For example, given exact numeric values $\mathit{end}_1$ and $\mathit{start}_2$, as one would assume in a classical interval-based approach to temporal reasoning (Allen, 1983),[5] determining if the first event ends before the second involves simply computing whether $\mathit{end}_1$ is less than $\mathit{start}_2$. Given that the exact values of start and end times are latent, we instead perform the same comparisons with intervals (distances between start times, and durations), as they are more context-invariant. For example, we do not need the exact date to know that lunch starts before dinner in the same day, because there is a typical distribution of the relative distance between the two start times. Based on this idea, we build a neural-symbolic model that learns approximations of the simple functions in Fig. 5 in a differentiable way. Specifically, we use individual neural modules that make predictions about event intervals via distance and duration functions $\mathrm{dist}(e_i, e_j)$ and $\mathrm{dur}(e_j)$, respectively.
\[
r_{\text{ends}}(e_1, e_2) =
\begin{cases}
\text{before} & \text{if } \mathit{end}_1 < \mathit{start}_2 \\
\text{after} & \text{otherwise}
\end{cases}
\qquad
r_{\text{starts}}(e_1, e_2) =
\begin{cases}
\text{before} & \text{if } \mathit{start}_1 < \mathit{start}_2 \\
\text{after} & \text{otherwise}
\end{cases}
\]
Figure 5: Decomposition of the relation functions that solve TRACIE instances (equal timepoints ignored).
To understand this decomposition, we define the distance and duration functions computed by these two modules as $\mathrm{dist}(e_i, e_j) = \mathit{start}_i - \mathit{start}_j$ and $\mathrm{dur}(e_j) = \mathit{duration}_j$. By exploiting the rule that an end point $\mathit{end}_j$ can be computed as $\mathit{end}_j = \mathit{start}_j + \mathit{duration}_j$, we can, for example, decompose the relation $r_{\text{ends}}(e_1, e_2) = \text{before}$ (i.e., $e_1$ ends before $e_2$) in terms of our two modules as follows via simple algebraic manipulation:
\begin{align*}
r_{\text{ends}}(e_1, e_2) = \text{before}
&\Leftrightarrow \mathit{end}_1 < \mathit{start}_2 \\
&\Leftrightarrow \mathit{start}_1 + \mathit{duration}_1 < \mathit{start}_2 \\
&\Leftrightarrow \mathit{start}_1 - \mathit{start}_2 + \mathit{duration}_1 < 0 \\
&\Leftrightarrow \mathrm{dist}(e_1, e_2) + \mathrm{dur}(e_1) < 0
\end{align*}
Hence, we have reduced the computation of the relation ends before to a symbolic computation over two numeric intervals. Conversely, we have $r_{\text{ends}}(e_1, e_2) = \text{after} \Leftrightarrow \mathrm{dist}(e_1, e_2) + \mathrm{dur}(e_1) > 0$. For the starts comparator, we have $r_{\text{starts}}(e_1, e_2) = \text{before} \Leftrightarrow \mathrm{dist}(e_1, e_2) < 0$, and vice versa for the after relation.
[5] In the Allen algebra, the values $\mathit{end}_x$ and $\mathit{start}_y$ correspond to the right and left end points $x^+$, $y^-$ in the intervals $(x^-, x^+)$, $(y^-, y^+)$. Likewise, our $\mathit{duration}_x$ corresponds to the value $(x^+ - x^-)$.
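The decision rule derived above is small enough to state as code. The sketch below assumes that the distance and duration are already available as plain numbers in the same unit; the function name `relation` and the numeric values in the example are illustrative only. Equal timepoints map to "after", following the "after otherwise" branch of Figure 5.

```python
def relation(comparator, dist, dur_e1=0.0):
    """Symbolic relation between e1 and e2.

    dist   = start_1 - start_2   (negative when e1 starts first)
    dur_e1 = end_1 - start_1     (only used for the "ends" comparator)
    """
    if comparator == "starts":
        value = dist            # start_1 - start_2
    elif comparator == "ends":
        value = dist + dur_e1   # end_1 - start_2
    else:
        raise ValueError(f"unknown comparator: {comparator}")
    return "before" if value < 0 else "after"


# e1 starts 5 units before e2 and lasts 2 units, so it ends before e2 starts.
short_event = relation("ends", dist=-5, dur_e1=2)
# Same start distance, but e1 lasts 8 units and so overlaps the start of e2.
long_event = relation("ends", dist=-5, dur_e1=8)
```

The first call returns "before" and the second "after", which mirrors the distracted/try example: the answer depends on whether the duration is shorter than the gap between the two start times.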

Figure 6 (panel titles): Query on A's Duration; Query on A and B's Distance.
In what follows, we describe how we approximate the values of the two functions via individual neural modules (see illustration in Fig. 6).

Duration Estimation
To obtain a model to estimate dur(·), we pre-train a sequence-to-sequence model with duration data from prior work, which is similarly collected from pattern-based extraction. The data contains over 1 million events with their corresponding duration values. We map each instance to an input sequence event: [Event] story: [Story] and a corresponding output sequence answer: [Value], where [Event] represents the tokens of an event with the trigger verb marked by a special token to its left, [Story] represents down-sampled tokens from the context, and [Value] is one of the 7 unit labels as described in §4.1 (i.e., {≤minutes, hours, days, weeks, months, years, ≥decades}).

Computation and Learning
We use the output from PTNTIME to approximate the function dist(·). Following the sequence formulation of PTNTIME in §4, we replace [EventA] with the textual description of $e_1$, [EventB] with the textual description of $e_2$, and [Paragraph] with the context (premise), and fix [Relation] to be before. By taking the values of the vocabulary indices corresponding to "positive" and "negative" from the logits of [Label] and applying a softmax operation, we get $P_{\text{before}}$ and $P_{\text{after}}$. These are the probabilities of $e_1$ starting before and after $e_2$, respectively, and are used to define the vector $p = [P_{\text{before}}, P_{\text{after}}]$. Similarly, we apply softmax to the logits of [Distance] over the 7 words representing the temporal units to obtain 7 values that approximate the probabilities of the distance between two events' start times being closest to each temporal unit. We place the 7 values in the temporal units' increasing order in vector $d$. To represent $|\mathit{start}_1 - \mathit{start}_2|$ with a single value, we take the dot product of these probabilities with an incremental constant vector $c = [0, 1, 2, 3, 4, 5, 6]$. To get the direction, we apply the tanh function to the difference between the probabilities in $p$.[7] As a result, $\mathrm{dist}(e_1, e_2)$ is approximated by the magnitude $d \cdot c$ multiplied by this direction term, so that it is negative when $e_1$ is predicted to start before $e_2$, consistent with $\mathrm{dist}(e_1, e_2) = \mathit{start}_1 - \mathit{start}_2$.
We use the pre-trained model in §5.2 to approximate the function dur(·). Because the model is pre-trained with markers to the left of trigger verbs, we run a part-of-speech tagger on input phrases and add a marker to the left of the first verb. We apply softmax to the logit values of [Value] over the 7 temporal unit words and get, as above, 7 values representing the probabilities of the input event's duration being closest to each unit. We form $v$ by placing these values in the temporal units' increasing order. With the same constant vector, we have $\mathrm{dur}(e_1) = v \cdot c$.
For hypotheses with comparator starts, we use PTNTIME and its sequence-to-sequence objective to learn (i.e., we take the input hypothesis and context as is and use [Label] directly as the prediction).
For hypotheses where the comparator is ends, we use the inference process in §5.1 and the computation process described above to construct logits = [pred, −pred], where pred = dist(e_1, e_2) + dur(e_1), as detailed in Fig. 6. We find the gold temporal-relation in each training instance and compute a two-class cross-entropy loss with logits. The PTNTIME that predicts starts hypotheses shares weights with the one used in computing logits. The final model SYMTIME can also be used to predict TRACIE instances without any task-specific supervision, as the two functions are initialized with distant supervision.
[7] To ensure that tanh returns a value close to 1 or -1, we multiply the distance by a big number denoted as INTmax.
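The computation in this section can be sketched without a neural network by starting from raw logits. Several details below are assumptions rather than the authors' implementation: the sign convention of the direction term (chosen so that dist is negative when e1 starts first), applying the large constant to the probability difference inside tanh, the value of `INT_MAX`, and the class order [after, before] for logits = [pred, −pred].

```python
import math

C = [0, 1, 2, 3, 4, 5, 6]  # incremental constant vector c over the 7 units
INT_MAX = 1e4              # assumed stand-in for the "big number" INTmax


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dist(label_logits, distance_logits):
    """label_logits: [positive, negative] for the fixed relation "before"."""
    p_before, p_after = softmax(label_logits)
    d = softmax(distance_logits)
    # Direction is close to -1 when e1 is predicted to start before e2.
    direction = math.tanh(INT_MAX * (p_after - p_before))
    return direction * dot(d, C)


def dur(value_logits):
    """Expected duration index under the softmax over the 7 unit words."""
    return dot(softmax(value_logits), C)


def ends_loss(pred, gold_relation):
    """Two-class cross-entropy over logits = [pred, -pred]."""
    probs = softmax([pred, -pred])
    index = 0 if gold_relation == "after" else 1  # assumed class order
    return -math.log(probs[index])


# e1 likely starts before e2, about "weeks" apart, and lasts "<=minutes".
label_logits = [4.0, 0.0]
distance_logits = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0]
duration_logits = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pred = dist(label_logits, distance_logits) + dur(duration_logits)
predicted = "after" if pred > 0 else "before"
```

In this example the short duration cannot bridge the start-time gap, so `pred` is negative and the end-time prediction is "before"; the loss is correspondingly small for a gold label of before and large for after.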

Baselines and Systems
We use T5-Large implemented by Wolf et al. (2019) as our base sequence-to-sequence model for both PTNTIME and the duration model in §5.2, as it provides for faster iterations. We use early stopping, a batch size of 32, and other default parameters. PTNTIME converges after 45k steps (∼1.4M instances) and the duration model converges after 80k steps (∼2.6M instances). We use these pre-trained weights in SYMTIME as well as SYMTIME-ZEROSHOT, which uses no TRACIE supervision.
We compare our proposed models with a host of baselines based on the same pre-trained language model, including BaseLM: T5-Large, and BaseLM-MATRES: T5-Large fine-tuned on 20k MATRES training data. We also compare with other architectures/models, including BiLSTM as used in Williams et al. (2017), Roberta-Large (Liu et al., 2019) and T5-3B. All models and baselines follow a standard TE setup and default parameters. We report a 3-run average, and each model is run until convergence.

Metrics and Settings
We measure system performance on TRACIE separately for start-time hypotheses and end-time hypotheses. We also employ a story-wide exact match metric, which is the percentage of stories for which all related hypotheses are answered correctly.
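The three metrics can be computed in one pass over the predictions. This is a minimal sketch; the record layout (story id, comparator, gold label, predicted label) and the function name `tracie_metrics` are assumptions for illustration, and the sketch assumes that both comparators occur at least once.

```python
from collections import defaultdict


def tracie_metrics(records):
    """records: iterable of (story_id, comparator, gold, predicted)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    story_ok = {}
    for story_id, comparator, gold, predicted in records:
        hit = gold == predicted
        total[comparator] += 1
        correct[comparator] += hit
        # A story counts only if every one of its hypotheses is correct.
        story_ok[story_id] = story_ok.get(story_id, True) and hit
    return {
        "start_acc": correct["starts"] / total["starts"],
        "end_acc": correct["ends"] / total["ends"],
        "story_em": sum(story_ok.values()) / len(story_ok),
    }


records = [
    ("s1", "starts", "entailment", "entailment"),
    ("s1", "ends", "contradiction", "contradiction"),
    ("s2", "starts", "entailment", "contradiction"),
    ("s2", "ends", "entailment", "entailment"),
]
metrics = tracie_metrics(records)
```

Story s2 has one wrong hypothesis, so only one of the two stories counts toward the exact-match score even though three of the four hypotheses are correct.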
In addition to TRACIE's standard i.i.d. split, we propose a pruned version of the training set with balanced prior distributions. For example, in the i.i.d. training set, 70% of the examples with the comparator ends and relation after are positive. We randomly remove instances from the majority classes to produce a uniform-prior training set such that a model can no longer rely on such prior distributions. We believe this setting better evaluates a system's true understanding of the task.

Main Results
Table 1 shows system performance on TRACIE's i.i.d. setting. We observe that PTNTIME improves on all metrics over the base language model, with 6% on start-time comparisons and 8% on story-wide exact match. It also outperforms BaseLM-MATRES, suggesting that distant supervision is more efficient than extensive human annotation. With a symbolic end-time inference, SYMTIME further improves on all metrics, with 7%, 4%, and 9% gains over the base language model on start time, end time and story-wide exact match, respectively. SYMTIME can further improve the performance on start-time hypotheses over PTNTIME even though they use the same model to predict start-time queries. This is because PTNTIME is not designed to understand end time from pre-training, and fine-tuning on such data hurts its representation in general. This illustrates the benefits of models using explicit and sensible reasoning processes.
[...] (2020) is not strictly comparable with the rest.
In the uniform-prior setting, a system cannot exploit prior knowledge about the label distribution when making predictions. Given this, we see that all baselines produce a much lower performance; e.g., the BiLSTM, a model that lacks much of the prerequisite knowledge for reasoning, performs near random chance. Compared to the baseline models, PTNTIME only drops 2.7%, suggesting that it is more invariant to evaluation settings and better understands temporal common sense. SYMTIME has the smallest drop among all models (1.7%) because of its explicit reasoning process on end-time hypotheses. SYMTIME-ZEROSHOT does not use any TRACIE training examples, so it has the same performance in the uniform-prior setting, where it outperforms all supervised baselines including T5-3B.

Extrinsic Evaluation
To show that our model is not limited to the TRACIE dataset and is general in temporal relation reasoning, we also evaluate on MATRES (Ning et al., 2018b), a temporal relation dataset focused on comparing explicit events' start times. We train and evaluate only on the instances with a label of either "before" or "after", which account for about 80% of all instances. We compare the performance of SYMTIME with BaseLM. We report four results. In OT-NS, we also report a SOTA system from Wang et al. (2020) under the same two-label setting. Table 3 shows the performance of our model and the baselines. We see that our model is consistently better than BaseLM, and at the same time, comparable to Wang et al. (2020). Our model benefits more from input contexts, and only drops 4% in the OT-MS setting with minimal supervision (from 89.6 to 86.1), compared to the 10% drop from T5-Large. This shows the effectiveness of our distant signals in §4.1, which are also designed to encourage contextual understanding.

Ablation Studies and Analysis
To better understand the improvements from our models, we conduct several ablation studies. Table 4 shows the results on TRACIE where the story is not provided as part of the inputs to systems (a no-story setting). While such a setting bears some resemblance to the partial-input baselines often employed in TE (Poliak et al., 2018), in our setting, it is often possible to predict temporal relations in the absence of stories because of strong commonsense priors. Indeed, we estimate that 65% of the instances can be correctly predicted from the hypotheses alone, based on expert analysis in §3.2. This suggests an 82.5% human upper-bound [11] in this no-story setting. Hence, such a setting partly evaluates a model's ability to incorporate commonsense priors when making decisions.
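The 82.5% figure can be reproduced with a one-line calculation, under the assumption that the 65% of instances predictable from the hypothesis alone are all answered correctly, while the remaining 35% are decided by random guessing between two labels (50% accuracy):

```latex
\[
0.65 \times 1.0 + 0.35 \times 0.5 = 0.65 + 0.175 = 0.825
\]
```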
[11] We assume that the remaining 35% non-predictable instances are decided by random guessing.
We see that BaseLM is close to random chance, whereas PTNTIME and SYMTIME improve by 20% and 22%, respectively. This suggests that our models better understand temporal common sense through the distant supervision on both start times and duration. On the other hand, we observe much smaller drops in our models' performance in this no-story setting. This suggests that our models do not improve as much on the 35% of instances that require multi-hop timeline constructions over more than two events, motivating future work.
Table 5 compares the two pre-training sources described in §4.1 by individually pre-training two models with only within-sentence or cross-sentence extracted data. We see that the cross-sentence extraction brings the most performance gain on TRACIE's start-time binary metric under the uniform-prior training setting. This suggests that the global extraction rule is able to introduce new knowledge that is not seen in localized language model pre-training. Combining the within-sentence data further improves the performance.
Through analysis of the interval predictions made by SYMTIME, we notice a tendency for the model to predict "after" for end-time instances, possibly due to overestimated durations: a byproduct of natural biases in text. Given the weak signal used to learn such intervals and these potential biases, this is not altogether surprising. We leave the task of learning more robust and faithful interval representations for future work.

Conclusion
We introduce a challenging dataset, TRACIE, to evaluate systems' temporal understanding of implicit events. We propose a distant supervision process that improves language models' understanding of start times of both explicit and implicit events. We further combine this process with a distantly supervised model that estimates events' durations to compare event end times, under the explicit rule that end times are start times plus durations. We show that our model improves performance on TRACIE and MATRES, suggesting the effectiveness of high-precision pre-training and symbolic temporal reasoning. Despite these advances, TRACIE continues to be a challenging task for future work on general temporal reasoning.