Temporal Reasoning in Natural Language Inference

We introduce five new natural language inference (NLI) datasets focused on temporal reasoning. We recast four existing datasets annotated for event duration—how long an event lasts—and event ordering—how events are temporally arranged—into more than one million NLI examples. We use these datasets to investigate how well neural models trained on a popular NLI corpus capture these forms of temporal reasoning.


Introduction
The ability to reason about how events unfold in time is core to how humans structure their knowledge about the world (Casati and Varzi, 1996; Zacks and Tversky, 2001; Radvansky and Zacks, 2014), and modeling such temporal reasoning has been central to many classical AI approaches (McCarthy and Hayes, 1987; Kahn and Gorry, 1977; McDermott, 1982; Allen, 1984; Kowalski and Sergot, 1989; Pani and Bhattacharjee, 2001).

Table 1: Example context-hypothesis pairs from two of our recast datasets. Each context is followed by its corresponding hypothesis. In the original table, hypotheses in green indicate that the context entails the hypothesis; those in red indicate that it does not entail the hypothesis.

Order
Context: We waited until 2:25 PM and then left.
Hypothesis: The waiting started before the leaving started.
Context: Reggie said he will pay us soon.
Hypothesis: The paying ended before the saying started.

Duration
Context: The greeter said there was about 15 mins waiting.
Hypothesis: The saying did take or will take shorter than an hour.
Context: Randy, this is the issue I left you the voice mail on.
Hypothesis: The leaving did take or will take longer than a day.

Given that temporal reasoning is integral to natural language understanding (NLU) and that Natural Language Inference (NLI) is a common framework for evaluating how well models capture semantic phenomena integral to NLU (Cooper et al., 1996; Dagan et al., 2006; White et al., 2017; Poliak et al., 2018), it is important to evaluate how well different classes of NLI models trained on common generic NLI datasets capture temporal reasoning.
We present five new NLI datasets recast from four existing temporal reasoning datasets. Our new NLI datasets focus on two key aspects of temporal reasoning: (a) temporal ordering and (b) event duration. We present strong baseline models for our temporal-reasoning-focused NLI datasets and also investigate the performance of common neural NLI models on these datasets. Our experiments demonstrate that common neural NLI models trained on a popular dataset do not sufficiently capture temporal reasoning and require additional supervised training on datasets specific to temporal reasoning.

Motivation
A text often does not contain explicit mentions of how long events last or whether some events are contained within others. Consider (1).
(1) We waited until 2:25 pm and then left.
Although (1) does not explicitly mention how long the waiting lasted, one can reasonably guess that it lasted somewhere between minutes and hours, and definitely not months or years. Zhou et al. (2020) note that common sense inference is required to come to such conclusions about an event's duration and that text might even contain reporting biases when highlighting rarities (Schubert, 2002; Van Durme, 2011; Zhang et al., 2017; Tandon et al., 2018), potentially making such durations hard to learn using common language modeling-based methods. Popular NLI datasets contain hypotheses that are elicited from humans (Bowman et al., 2015; Williams et al., 2018). Although the context sentences for these datasets come from multiple genres, the constructed hypotheses do not necessarily capture semantic phenomena that are essential for any robust NLU inference system. Recent work has addressed the lack of such inference capabilities by focusing on semantic phenomena such as paraphrastic inference and anaphora resolution (White et al., 2017), veridicality (Poliak et al., 2018; Ross and Pavlick, 2019), and various other implicatures and presuppositions (Jeretic et al., 2020).
Even though temporal reasoning is crucial for event understanding, no datasets focused on temporal reasoning exist in the NLI format. To fill this lacuna, we recast four existing datasets to create NLI pairs that explicitly require reasoning about event duration and chronological ordering. Table 1 shows examples from two of our recast datasets.

Dataset Creation
We construct five new NLI datasets recast from four existing datasets that focus on two key aspects of temporal reasoning: (a) temporal ordering and (b) event duration. Across these datasets, we have more than a million NLI examples, and we retain the training, development, and test splits from the original datasets (for datasets in which such splits exist). Table 2 reports the total number of NLI pairs in each of our recast datasets.

Temporal Ordering
To generate hypotheses for our temporal ordering datasets, we create 8 templates that refer to the start-points and end-points of the events in an event pair. The templates are shown in Table 3. We recast 4 datasets: (i) TE3; (ii) TB-D; (iii) RED; and (iv) UDS-T. UDS-T directly annotates the relation between the start and end points of the events in an event pair, making hypothesis generation with our templates straightforward. In contrast, TE3, TB-D, and RED annotate event pairs for categorical temporal relations based on those proposed by Allen (1983). Using each category's definition, we map that category to a template predicate, that is, a function from hypothesis templates to {entailed, not-entailed}. These mappings are summarized in Table 3.
TB-D uses a reduced set of relations: before (Bt), after (At), is included (II), includes (I), simultaneous (S), and vague (the last of which we ignore). RED likewise uses a reduced set: before (Br), begins-on (BO), ends-on (EO), contains (C), and simultaneous (S). This reduction results in the categories being ambiguous with respect to certain hypothesis templates. For instance, for Template 3 (X ended before Y started), knowing that X is before (Bt, Br) Y in the TB-D and RED sets does not give enough information about the ending point of X, because these relations are not defined to have a strict ending boundary, in contrast to before (B) in TE3. We thus exclude hypothesis templates for ambiguous TB-D or RED relations.
For RED, we collapse relations with the same prefix into a single relation; e.g., before/causes and before/precondition are both collapsed into Br. We ignore relations with the overlap prefix, as they do not have a clear boundary for the start or end points of events.
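For a dataset such as UDS-T, where the start and end points of both events are annotated directly, labeling a template reduces to comparing two points. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes, hypothetically, that the eight templates are the combinations of one event's start or end point preceding the other event's start or end point (only "X ended before Y started" and "X started before Y started" are attested in the text), and that each event span is a numeric (start, end) pair on a shared timeline. The function and argument names are illustrative.

```python
from itertools import product

# Assumption: templates range over {started, ended} for each event,
# in both event orders, giving 2 * 2 * 2 = 8 hypotheses per event pair.
POINTS = ("started", "ended")


def ordering_pairs(x, y, x_span, y_span):
    """Return (hypothesis, label) pairs for two events.

    x, y: gerund phrases naming the events, e.g. "waiting".
    x_span, y_span: hypothetical (start, end) positions on one timeline.
    """
    time = {
        ("x", "started"): x_span[0], ("x", "ended"): x_span[1],
        ("y", "started"): y_span[0], ("y", "ended"): y_span[1],
    }
    name = {"x": x, "y": y}
    pairs = []
    for first, second in (("x", "y"), ("y", "x")):
        for p1, p2 in product(POINTS, POINTS):
            hypothesis = f"The {name[first]} {p1} before the {name[second]} {p2}."
            holds = time[(first, p1)] < time[(second, p2)]
            pairs.append((hypothesis, "entailed" if holds else "not-entailed"))
    return pairs


for hypothesis, label in ordering_pairs("waiting", "leaving", (0, 40), (50, 60)):
    print(label, "|", hypothesis)
```

For the categorical datasets (TE3, TB-D, RED), the same loop would apply once a relation has been mapped to whichever point comparisons its definition licenses, skipping templates the relation leaves ambiguous.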

Temporal Duration
To generate hypotheses for our temporal duration dataset, we create 18 hypothesis templates that refer to a range of likely durations for an event, based on two metatemplates: (i) X did last or will last longer than LOWER-BOUND and (ii) X did last or will last shorter than UPPER-BOUND, where LOWER-BOUND and UPPER-BOUND range over a second, a minute, an hour, a day, a week, a month, a year, a decade, and a century. We recast a single dataset, UDS-T, which contains annotations for the duration of an event drawn from the following 11 labels: instantaneous, seconds, minutes, hours, days, weeks, months, years, decades, centuries, and forever. For each event, we create two or four NLI pairs (depending upon the true label) to capture the duration information.
The entailed hypotheses of the NLI pairs are based on a range of duration values derived from the gold duration label for the given event. The lower limit of the range is one rank less than the gold label (e.g., for minutes, the LOWER-BOUND is a second) and the upper limit is one rank greater than the gold label (e.g., for minutes, the UPPER-BOUND is an hour). Two entailed hypotheses are then generated from these two limits, one corresponding to the lower limit (longer than a second) and the other corresponding to the upper limit (shorter than an hour). The corresponding not-entailed hypotheses are then generated by inverting the entailed hypotheses, e.g., for minutes: shorter than a second and longer than an hour. In cases where the gold duration label is instantaneous or forever, only one entailed and one not-entailed pair is created.
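The procedure above can be illustrated with a short Python sketch. It is a reading of the description rather than the released recasting code, and the function name and data layout are hypothetical. One point the text leaves open is what happens for seconds and centuries, whose neighbouring ranks (instantaneous and forever) have no bound phrase among the templates; this sketch assumes that such a limit is simply skipped.

```python
# Duration labels in rank order, as listed for UDS-T.
LABELS = ["instantaneous", "seconds", "minutes", "hours", "days", "weeks",
          "months", "years", "decades", "centuries", "forever"]

# Bound phrase for each label that has one in the two metatemplates.
BOUND = {"seconds": "a second", "minutes": "a minute", "hours": "an hour",
         "days": "a day", "weeks": "a week", "months": "a month",
         "years": "a year", "decades": "a decade", "centuries": "a century"}


def duration_pairs(event, gold):
    """Return (hypothesis, label) pairs for one event and its gold duration."""
    stem = f"The {event} did last or will last"
    rank = LABELS.index(gold)
    pairs = []
    if rank > 0:
        # Lower limit: one rank below the gold label (skipped if no phrase exists).
        lower = BOUND.get(LABELS[rank - 1])
        if lower:
            pairs.append((f"{stem} longer than {lower}.", "entailed"))
            pairs.append((f"{stem} shorter than {lower}.", "not-entailed"))
    if rank < len(LABELS) - 1:
        # Upper limit: one rank above the gold label (skipped if no phrase exists).
        upper = BOUND.get(LABELS[rank + 1])
        if upper:
            pairs.append((f"{stem} shorter than {upper}.", "entailed"))
            pairs.append((f"{stem} longer than {upper}.", "not-entailed"))
    return pairs


for hypothesis, label in duration_pairs("waiting", "minutes"):
    print(label, "|", hypothesis)
```

With the gold label minutes this yields the four pairs described in the text; with instantaneous or forever it yields two.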

Development and Test Splits
For the development and test sets in UDS-T, there are three gold labels for each event-pair, so for the entailed hypotheses in these cases, we take the lower limit of the duration range to be one rank less than the lowest of the three gold labels and the upper limit to be one rank higher than the highest of the three gold labels. For instance, if the three gold labels in the development set for an event are hours, weeks, and months, then the lower limit is minutes and the upper limit is years. The entailed and not-entailed hypotheses can then be generated using the same method described for the train set earlier.
TE3 does not have a development set, so we randomly sample documents from the train data and set them aside as a development set. We use the same number of documents as in the test set. Similarly, RED does not contain development and test splits, so we randomly sample 20% of the documents from train, evenly splitting them to create a development and a test set.
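As a small illustration of the range computation for multiply-annotated events, the Python sketch below takes the gold labels for one event and returns the lower and upper limits. The function name is hypothetical, and returning None when a limit would fall off either end of the scale is an assumption; the text does not say how that case is handled.

```python
# Duration labels in rank order, as listed for UDS-T.
LABELS = ["instantaneous", "seconds", "minutes", "hours", "days", "weeks",
          "months", "years", "decades", "centuries", "forever"]


def entailed_range(gold_labels):
    """Return (lower_limit, upper_limit) labels for a set of gold labels.

    The lower limit is one rank below the lowest gold label and the upper
    limit is one rank above the highest; None marks a limit that does not
    exist on the scale (an assumption of this sketch).
    """
    ranks = [LABELS.index(label) for label in gold_labels]
    low, high = min(ranks), max(ranks)
    lower = LABELS[low - 1] if low > 0 else None
    upper = LABELS[high + 1] if high < len(LABELS) - 1 else None
    return lower, upper


print(entailed_range(["hours", "weeks", "months"]))  # ('minutes', 'years')
```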
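The document-level split described for RED can be sketched as follows. This is only an illustration: the seed, the function name, and the rounding behaviour are assumptions, and the document identifiers are placeholders rather than real RED file names.

```python
import random


def split_documents(doc_ids, held_out=0.2, seed=0):
    """Hold out a fraction of documents and split them evenly into dev and test."""
    docs = sorted(doc_ids)          # sort first so the shuffle is reproducible
    rng = random.Random(seed)       # hypothetical seed
    rng.shuffle(docs)
    n_held = round(len(docs) * held_out)
    n_dev = n_held // 2
    dev = docs[:n_dev]
    test = docs[n_dev:n_held]
    train = docs[n_held:]
    return train, dev, test


train, dev, test = split_documents([f"doc{i}" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```

Sampling whole documents rather than individual pairs keeps every context sentence from a given document inside a single split.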

Grammatical Hypothesis Generation
We define rules to help generate hypotheses that are grammatical. We define our rules based on the part-of-speech (POS) tag of the events (predicates) in the context. UDS-T contains gold POS tags and gold dependency trees for all contexts. For any predicate that is tagged as a VERB in the context, we use its gerund form in the hypothesis. For example, 'we waited until ...' becomes 'the waiting started ...'. Predicates with other POS tags in UDS-T occur in a copular construction, so we add the prefix being before the predicate to make it grammatical; for example, 'we're happy ...' becomes 'the being happy started ...'. We also attach three types of direct modifiers of the predicate in the context (adjectives, determiners, and negations) to make the reference of the predicate specific to the context in the hypothesis. For example, 'we're not happy ...' becomes 'the not being happy started ...'. For cases where the lemma of the event appears multiple times in the context, we attach the direct object modifier of the event to make the reference unambiguous in the context. For example, to refer to the first predicate in the context 'we cleaned the apartment .... and they cleaned the washroom ...', we use the hypothesis 'the cleaning the apartment started ...'. We use the gold dependency trees of each context to obtain these modifiers of the predicate. We do not consider predicates with AUX and DET POS tags for our recasting.
For TE3, TB-Dense, and RED, the gold dependency trees are not available, so we focus only on verb-verb event relations to ensure better grammaticality of the hypothesis. To get the POS and lemma for sentences in TE3, TB-Dense and RED, we process and tokenize each sentence using Stanza (Qi et al., 2020). To get the inflection on each verb, we use LemmInflect.
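The rules above can be summarized in a short Python sketch. It is illustrative only: `naive_gerund` is a crude stand-in for a real inflection library such as LemmInflect (it does not handle consonant doubling, for example), the function names and arguments are hypothetical, and only the negation and direct-object modifiers are modeled, not adjectives or determiners.

```python
def naive_gerund(lemma):
    """Very rough gerund formation; a stand-in for a proper inflection tool."""
    if lemma.endswith("ie"):
        return lemma[:-2] + "ying"            # die -> dying
    if lemma.endswith("e") and not lemma.endswith("ee") and len(lemma) > 2:
        return lemma[:-1] + "ing"             # leave -> leaving
    return lemma + "ing"                      # wait -> waiting


def event_mention(lemma, pos, negated=False, direct_object=None):
    """Build the phrase used to refer to an event in a hypothesis.

    Returns None for AUX and DET predicates, which are not recast.
    """
    if pos in {"AUX", "DET"}:
        return None
    # Verbs become gerunds; other predicates are treated as copular.
    head = naive_gerund(lemma) if pos == "VERB" else f"being {lemma}"
    parts = (["not"] if negated else []) + [head]
    if direct_object:
        parts.append(direct_object)           # disambiguates repeated lemmas
    return " ".join(parts)


print("The", event_mention("wait", "VERB"), "started ...")
print("The", event_mention("happy", "ADJ", negated=True), "started ...")
print("The", event_mention("clean", "VERB", direct_object="the apartment"), "started ...")
```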

Dataset Validation
To assess whether the recast NLI pairs are correct, we conduct a validation experiment by randomly sampling 100 NLI pairs from the train split of each dataset. For each NLI pair, we ask the annotators to answer the question: How likely is it that the second sentence is true if the first sentence is true? We provide 5 options to choose from: extremely likely, very likely, even chance, very unlikely, and extremely unlikely.
We recruited 48 annotators from Amazon Mechanical Turk to validate the sampled NLI pairs for each of our 5 recast datasets. We selected only those annotators who passed an American native-speaker test with 90% or above accuracy. Each item in our validation task listed 10 NLI pairs. If our recasting produces valid NLI pairs, we should see that entailed pairs receive higher likelihood judgments than not-entailed pairs, even when adjusting for the dataset the pair comes from, the annotator, the pair, and the list of pairs the annotator saw the pair in. To test this, we fit an ordinal mixed effects model to the likelihood responses given by annotators, with a fixed effect for the source of the NLI pair as well as random intercepts for annotator, pair, and list. We compare this model to a model that additionally includes a fixed effect for the entailment label associated with the pair by our recasting. We find a reliable positive effect of the label being entailed (χ²(1) = 227.1, p < 0.001), indicating that our recasting method produces valid NLI pairs.
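Assuming the model comparison is a likelihood-ratio test (the χ² statistic with one degree of freedom suggests this, but the text does not name the test), the reported p-value bound can be checked with the standard library alone. The sketch below does not fit the ordinal mixed effects models; the log-likelihood arguments are placeholders for values that a fitted reduced and full model would provide.

```python
import math


def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """Twice the gain in log-likelihood from adding the extra fixed effect."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    For one degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# The statistic reported in the text is far beyond the 0.001 threshold.
print(chi2_sf_df1(227.1) < 0.001)  # True
```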

Experimental Setup
We use our recast datasets to explore how well different common classes of NLI models capture temporal reasoning. Specifically, we use three types of models: (i) neural bag of words (NBOW; Iyyer et al., 2015), (ii) InferSent (Conneau et al., 2017), and (iii) RoBERTa (Liu et al., 2019). Our NBOW model represents contexts and hypotheses as an average of GloVe embeddings (Pennington et al., 2014). The concatenation of these representations is fed to an MLP with one hidden layer. The InferSent model encodes contexts and hypotheses independently with a BiLSTM, and sentence representations are extracted using max-pooling. The concatenation of these sentence representations, their difference, and their element-wise product (Mou et al., 2016) is then fed to an MLP. For RoBERTa, we use a classification head on top of the pooled output of roberta-large to predict the labels.
In our experiments, we train and test these models on each recast temporal dataset. For each model, we include a hypothesis-only baseline to evaluate how much the datasets test NLI as opposed to just the likely duration and order of events in general. Additionally, we train each model on Multi-genre NLI (MNLI; Williams et al., 2018) and test the model on our datasets to see if the model learns temporal reasoning from a generic NLI dataset that does not necessarily focus on temporal reasoning.
Table 4 shows the accuracy of different models on our recast temporal datasets. We report the majority baseline (MAJ) of always predicting the label that appeared the most in training. We see that the models trained on MNLI perform poorly on our recast datasets, even worse than the MAJ baseline in many cases. This indicates that the models trained on MNLI do not learn representations well enough to infer temporal reasoning in our datasets.
Table 4: Accuracies on the test set of our recast datasets as predicted by different settings of our models.

Results & Discussion
The hypothesis-only models reveal an interesting limitation of NBOW and InferSent. Both NBOW and InferSent hypothesis-only models are as good as, or even better than, the normal models across all datasets. RoBERTa, however, improves when given the context across all datasets, with TimeBank-Dense as the exception. This suggests that RoBERTa embeddings are better able to capture the semantics of the context than NBOW and InferSent. In fact, NBOW and InferSent may just predict the label based on information about lexical entities in the hypothesis.
Context in duration. All three hypothesis-only models achieve high accuracy on the NLI dataset based on UDS-Duration. Even RoBERTa seems to fail to capture anything extra from the context. To analyze this anomaly, we create a hypothesis-template-based majority baseline inferred from the UDS-Duration train data and find that it achieves an 80.2% accuracy on the test set. This indicates that the data is skewed for each template, which might be caused by the skewed minutes duration label in UDS-T (roughly 28% of the UDS-T train set contains minutes as the true duration label). This template-based majority prediction is noteworthy, as the models pretrained on MNLI fail to infer the correct labels even when the labels are skewed per template. The neural models see a 10% gain in accuracy over the template-sensitive majority, indicating that the models are learning the range of durations for different entities. Another possible reason that the context does not help much for duration is that events often have a modal distribution for a duration label, similar to the explanation for the recast NER data in Poliak et al. (2018).
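A hypothesis-template-based majority baseline of the kind described above can be sketched in a few lines of Python. The data layout is an assumption: each example is reduced to a (template_id, label) pair, where the template identifier is any hashable key for the hypothesis template (here a plain string), and the toy examples are invented for illustration and are not drawn from UDS-Duration.

```python
from collections import Counter, defaultdict


def fit_template_majority(train):
    """Map each hypothesis template to its most frequent training label."""
    counts = defaultdict(Counter)
    for template_id, label in train:
        counts[template_id][label] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}


def accuracy(majority, examples):
    """Fraction of examples whose label matches the per-template majority."""
    examples = list(examples)
    correct = sum(majority.get(t) == label for t, label in examples)
    return correct / len(examples)


# Toy data, invented for illustration only.
train = ([("longer than a second", "entailed")] * 3
         + [("longer than a second", "not-entailed")]
         + [("longer than a year", "not-entailed")] * 2)
model = fit_template_majority(train)
print(model)
```

Because this baseline never looks at the context or the event, its accuracy measures how much of the label can be predicted from the template alone.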

Conclusion
To better capture temporal reasoning inference capabilities, we create more than a million NLI pairs recast from existing corpora in the literature that focus on two aspects of temporal reasoning: temporal duration and temporal order. We test existing models trained on MNLI on our datasets and find that a generic NLI model is not able to capture temporal reasoning. We show that training on our datasets can improve the performance of models in capturing temporal reasoning, and that some aspects of temporal reasoning, specifically how long an event lasts, might be learned from lexical entities alone. We hope that our recast datasets push the research community to further explore how learning temporal reasoning could benefit other tasks.

A Model Implementation Details
For all of the experiments using GloVe embeddings, we use 300-dimensional embeddings. The MLP for the NBOW model has one hidden layer of 100 dimensions. The output from the hidden layer is fed to a logistic regression softmax classifier. In InferSent, the encoders have one layer in each direction and we use GloVe embeddings to initially represent the tokens. Sentence representations of length 2048 are extracted by max-pooling. The MLP has one hidden layer of 512 dimensions. We optimize the model using SGD. We set the initial learning rate to 0.1 and the decay rate to 0.99, and we train over 20 epochs.
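To make the InferSent-style input to the MLP concrete, the following pure-Python sketch shows max-pooling over token states and the combination of the two sentence vectors. It uses plain lists instead of a tensor library and toy two-dimensional vectors; the function names are hypothetical. The text says only "difference", so the sketch uses the signed difference, which is an assumption (the absolute difference is another common choice).

```python
def max_pool(states):
    """Element-wise maximum over a sequence of equal-length token vectors."""
    return [max(column) for column in zip(*states)]


def combine(u, v):
    """Concatenate u, v, their difference, and their element-wise product."""
    diff = [a - b for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


# Toy BiLSTM outputs: two tokens per sentence, two dimensions each.
context = max_pool([[1.0, -2.0], [0.5, 3.0]])      # [1.0, 3.0]
hypothesis = max_pool([[0.0, 1.0], [2.0, -1.0]])   # [2.0, 1.0]
print(combine(context, hypothesis))
```

With sentence representations of length 2048, the combined vector under these assumptions would have 4 × 2048 = 8192 entries.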
For RoBERTa, we use the transformers (Wolf et al., 2019) library from HuggingFace and use their RobertaForSequenceClassification class to implement our model. We use a mini-batch size of 16 trained over 2 GPUs with an Adam optimizer using 122 warmup steps, an initial learning rate of 2e-5, and a 0.1 weight decay. For the UDS-T recast datasets, we run the RoBERTa models for 2 epochs. For TE3, TB-D, and RED, we run the model for 10 epochs.
The MNLI dataset has three labels: neutral, contradiction, and entailment. For the MNLI baseline models, we train the models to predict these three labels, but when we evaluate these models on our recast datasets, we follow common practice (Belinkov et al., 2019) by converting neutral and contradiction to the not-entailed label at test time.
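The label conversion at test time amounts to a simple mapping, sketched below in Python. The function names are illustrative, and the example predictions are invented.

```python
def to_binary(mnli_label):
    """Collapse a three-way MNLI prediction into the two-way recast label set."""
    if mnli_label == "entailment":
        return "entailed"
    if mnli_label in {"neutral", "contradiction"}:
        return "not-entailed"
    raise ValueError(f"unknown MNLI label: {mnli_label!r}")


def binary_accuracy(predictions, gold):
    """Accuracy of three-way predictions against two-way gold labels."""
    matches = sum(to_binary(p) == g for p, g in zip(predictions, gold))
    return matches / len(gold)


print(binary_accuracy(["entailment", "neutral", "contradiction"],
                      ["entailed", "not-entailed", "entailed"]))
```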