Temporal and Aspectual Entailment

Inferences regarding “Jane’s arrival in London” from predications such as “Jane is going to London” or “Jane has gone to London” depend on tense and aspect of the predications. Tense determines the temporal location of the predication in the past, present or future of the time of utterance. The aspectual auxiliaries on the other hand specify the internal constituency of the event, i.e. whether the event of “going to London” is completed and whether its consequences hold at that time or not. While tense and aspect are among the most important factors for determining natural language inference, there has been very little work to show whether modern embedding models capture these semantic concepts. In this paper we propose a novel entailment dataset and analyse the ability of contextualised word representations to perform inference on predications across aspectual types and tenses. We show that they encode a substantial amount of information relating to tense and aspect, but fail to consistently model inferences that require reasoning with these semantic properties.


Introduction
Tense and aspect are two of the main contributors to the semantics of a proposition, describing the temporal location of a predication and its internal constituency, thereby considerably influencing the entailment relations it licenses. For example, while arrive in LOC |= be in LOC is generally considered a valid entailment rule, the case is complicated when different tenses and aspectual auxiliaries of a given verb are considered, as sentences (1) and (2) illustrate.
(1) Jane has arrived in London.
|= Jane is in London now.
(2) Jane will arrive in London.
⊭ Jane is in London now.
Understanding the difference between an event that has happened and whose consequences hold at the present moment, and an event that is currently happening or will happen in the future, is crucial for answering questions such as Where is Jane? or Is Jane in London now? Inferring the consequences of events is important for understanding the relation between entities in the world. For example, if we read that Lady Catherine has bought Longbourn estate, the inference that the acquisition is completed, and that the resulting consequence is that Lady Catherine now owns Longbourn estate, is paramount for keeping knowledge bases up-to-date.
In this paper we propose a novel entailment dataset that requires models to correctly determine the internal and external temporal structure of predications when performing natural language inference. To the best of our knowledge, this is the first dataset that is primarily focused on assessing natural language inference between temporally and aspectually modified predications.

Tense, Aspect and Entailment
Tense is a grammatical category which is encoded in the morphology of the verb in English (e.g. past loved vs. non-past loves). It establishes a point of reference that allows the temporal organisation of events in a discourse. In English, tense interacts with aspectual auxiliaries such as the verbs be or have that influence the internal constituency of a predication, and determine whether an event is completed or ongoing. Tense and aspect therefore control the internal and external temporal structure of an event and govern the inferences that a predication licenses (Reichenbach, 1947; Dahl, 1985; Steedman, 1997). There is evidence that such morphology is represented in distributional embeddings (Mitchell and Steedman, 2015; Vylomova et al., 2016). In this paper we are concerned with perfect and progressive aspect, but do not focus on any other types of aspect such as the Aktionsart of a predication (Vendler, 1957), which we leave to future work.

The Interaction between Temporality and Entailment
Perfect aspect (typically) describes events as a completed whole, and licenses inferences regarding the consequences of that event. The use of different tenses and aspects for past events influences their relevance to the present moment and thereby their entailment behaviour. For example, the consequences of an event in the present perfect hold at the time of utterance, whereas events in the simple past or the past perfect do not (Comrie, 1985; Moens and Steedman, 1988; Depraetere, 1998; Katz, 2003). This is shown in sentences (3) and (4), where only sentence (3) licenses the inference of Elizabeth being in Meryton now.
|= Elizabeth is in Meryton now.
⊭ Elizabeth is in Meryton now.
This property can be explained through a Reichenbachian view of the present perfect, where the point of reference coincides with the point of speech, thereby indicating its current relevance (Reichenbach, 1947). On the other hand, events in the past simple or the past perfect license inferences for consequent states in the past, as sentence (5) shows.
(6) Mary is going to Netherfield now.
⊭ Mary has arrived / is in Netherfield.
Progressive aspect describes ongoing events and therefore does not license inferences regarding their consequences, as sentence (6) shows. It furthermore gives rise to the imperfective paradox (Dowty, 1979), which only seems to license inferences for non-culminated processes (Moens and Steedman, 1988), as sentences (7) and (8) show.
(7) Catherine was walking in the woods.
|= Catherine walked in the woods.
⊭ Jane reached / was in London.
The modal future introduces an event whose realisation is uncertain; therefore, any inferences about its outcome are only licensed if common-sense knowledge suggests that this is almost always the course of events, as sentence (9) shows.
(9) Charles will meet with Jane.
|= Charles will see Jane.
The correct treatment of tense and aspect in a predication is crucial for inferring the consequences it licenses, which is important for answering questions about a given paragraph, or creating and updating knowledge bases.

Models
We analyse five distributional embedding models and two pre-trained biLSTM sentence encoders for their ability to perform inference on temporal predications. Our choice of models is motivated by the observation that modelling entailment between temporal predications requires a bespoke representation of the inflected verb in the context of the given aspectual auxiliary and its arguments.
word2vec. We evaluate the ability of word2vec representations to perform inference with temporal predications. Contextualisation can be achieved by averaging two word vectors, which has been shown to be a strong baseline for a range of problems (Iyyer et al., 2015; Wieting et al., 2016). Notably, adding or averaging word vectors approximates the intersection of their feature spaces (Tian et al., 2017).
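The averaging step can be illustrated with a minimal sketch. The four-dimensional vectors below are made-up stand-ins for pre-trained embeddings, and the helper name `contextualise` is hypothetical; the sketch only shows the composition operation, not the models used in the experiments.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for pre-trained word embeddings.
# The values are invented for illustration only.
toy_vectors = {
    "has": np.array([0.2, 0.1, 0.0, 0.7]),
    "arrived": np.array([0.6, 0.3, 0.4, 0.1]),
}


def contextualise(words, vectors):
    """Compose a phrase representation by averaging its word vectors."""
    return np.mean([vectors[w] for w in words], axis=0)


# Representation of the auxiliary-verb combination "has arrived".
phrase = contextualise(["has", "arrived"], toy_vectors)
```

Averaging keeps the phrase vector in the same space as the word vectors, so it can be compared directly with other words or phrases.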
APTs. Anchored Packed Trees are a recently proposed vector space model that takes distributional composition to be a process of lexeme contextualisation. APTs are based on a higher-order dependency-typed structure that gives rise to a weighted, directed and labelled graph. Contextualisation is achieved through distributional composition, which requires aligning two lexemes according to their syntactic relation, and then merging the aligned representations. APTs are the only count-based (i.e. non-neural) model in our evaluation.
fastText. The fastText model represents each word as a sum of bag-of-character n-grams, thereby making better use of subword information and therefore, potentially, providing a better mechanism for encoding morphosyntactic relations. Contextualisation is achieved through averaging the respective word vectors in a phrase.
ELMo. ELMo is based on a deep bidirectional LSTM language model that creates multiple layers of representations for every token. Contextualised representations are obtained from the internal states of the LSTMs, where Peters et al. (2018) showed that lower levels of the architecture capture syntactic characteristics, and higher levels capture semantic characteristics of words.
BERT. BERT uses multi-headed bi-directional self-attention and is based on the Transformer architecture (Vaswani et al., 2017). Devlin et al. (2018) observed that sequential language model architectures are limited by the unidirectionality of the models. Therefore they proposed a novel training objective that jointly conditions on left and right context in all layers. They showed that their training regime results in substantial gains over serial language model-based architectures on numerous NLP tasks.
Word2vec, APTs and fastText follow the one representation per word paradigm (Kober et al., 2017), where every lexeme is represented by one vector, and contextualisation is typically achieved through distributional composition. ELMo, BERT and the pre-trained biLSTMs, on the other hand, create context-sensitive representations on the token level. This results in different representations for the same word, depending on its current context.

Experiments
We created two experiments to assess the extent of morphosyntactic information relating to tense and aspect that is encoded in the respective embedding spaces. Subsequently we propose a novel entailment dataset and evaluate the capability of the embedding models and the pre-trained biLSTMs to perform inference on temporal predications. All our resources are available from https://github.com/tttthomasssss/iwcs2019.

Auxiliary-Verb Agreement
The first experiment evaluates whether the models are able to capture the agreement between an inflected verb and its corresponding aspectual auxiliary. For example, the models should be able to determine that will visit represents a correct combination whereas will visiting does not. We consider capturing the morphosyntactic interplay between an inflected verb and its aspectual auxiliary a pre-requisite for adequately modelling the semantics of tense and aspect.
We cast the problem as a classification task with the goal of distinguishing correct auxiliary-verb pairs from incorrect ones with a diagnostic classifier. This methodology is similar to the approach of Linzen et al. (2016) who assessed the ability of LSTMs to learn number agreement in English subject-verb phrases. For the dataset, we extracted verbs from the One Billion Word Benchmark (OBWB) (Chelba et al., 2013) where each inflected verb form occurred at least 50 times. We then paired the inflected verb forms with their corresponding auxiliaries to form positive pairs, and subsequently paired each of the different inflected verb forms with all incorrect auxiliaries to build the negative pairs. We filtered the negative pairs for plausible combinations such as is eaten by removing valid passive constructions and any invalid combination that occurred at least 5 times in the OBWB corpus. The final dataset consists of almost 36k auxiliary-verb combinations with a positive : negative class distribution of 38 : 62.
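The pairing procedure described above can be sketched as follows. This is a toy illustration under stated assumptions: the auxiliary inventory, the form labels, the single example verb and the set of plausible mismatches are all hypothetical, and the corpus frequency filter is replaced by a hand-written exclusion set.

```python
from itertools import product

# Hypothetical toy inventory: the inflected form that each auxiliary selects.
AUX_TO_FORM = {
    "will": "base",
    "is": "present_participle",
    "has": "past_participle",
}

# Hypothetical toy verb table; the real data would come from a corpus.
VERB_FORMS = {
    "visit": {
        "base": "visit",
        "present_participle": "visiting",
        "past_participle": "visited",
    },
}

# Mismatched pairs that are nevertheless plausible (e.g. valid passives)
# and should therefore not be used as negative examples.
PLAUSIBLE_MISMATCHES = {("is", "visited")}


def build_pairs(aux_to_form, verb_forms, plausible_mismatches):
    """Return (positives, negatives) lists of (auxiliary, verb form) pairs."""
    positives, negatives = [], []
    for forms in verb_forms.values():
        for aux, (form_name, form) in product(aux_to_form, forms.items()):
            pair = (aux, form)
            if aux_to_form[aux] == form_name:
                positives.append(pair)
            elif pair not in plausible_mismatches:
                negatives.append(pair)
    return positives, negatives


positives, negatives = build_pairs(AUX_TO_FORM, VERB_FORMS, PLAUSIBLE_MISMATCHES)
```

Here will visit ends up among the positive pairs and will visiting among the negatives, while is visited is dropped because it is a valid passive.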

Translation Operation
In the second experiment we assess whether it is possible to learn a translation operation between different tenses in the embedding space. We consider learning a translation operation in two ways: firstly, a simple vector offset on the basis of the averaged difference between inflected verbs with their auxiliaries and their respective lemmas. Secondly, we train a feedforward neural network to project the infinitive representation of a verb to one of its inflected forms. The goal for both approaches is then to generate an unseen inflected verb form from a given unseen lemma.
The averaged offset translation is shown in Equation 1, where the offset o_t is calculated on the basis of a set of seed verbs S of size n, and where x_t and x are the vector representations of the inflected form (or the contextualised form if the tense requires an auxiliary) and of the lemma form of a verb, respectively. At prediction time, we try to create x_t by adding the offset o_t to an unseen lemma x (where x ∉ S). Equation 2 shows the setup where we use a neural network to learn a translation matrix from infinitive forms to inflected forms, where f is a tense-specific neural network with a single hidden layer that takes an unseen lemma representation x as input and generates an inflected form x_t, and where Θ_t represents the learnable parameters of the network.
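The equations themselves are not reproduced in this copy of the text. One possible reading of the definitions above, offered as a sketch rather than as the authors' exact notation, is given below. The hat on the generated form and the explicit tanh hidden layer with parameters W_1, b_1, W_2, b_2 are assumptions introduced for illustration; x is taken to be a lemma outside the seed set S.

```latex
% Equation 1 (reconstruction): averaged offset over the n seed verbs in S,
% and its application to a lemma x outside the seed set.
o_t = \frac{1}{n} \sum_{x \in S} \left( x_t - x \right),
\qquad
\hat{x}_t = x + o_t

% Equation 2 (reconstruction): tense-specific network with one hidden layer.
\hat{x}_t = f(x; \Theta_t) = W_2 \tanh\left( W_1 x + b_1 \right) + b_2,
\qquad
\Theta_t = \{ W_1, b_1, W_2, b_2 \}
```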
We subsequently evaluate whether the correctly inflected verb is in the nearest neighbour list of the generated verb. The inflected verb generation setup is inspired by Bolukbasi et al. (2016) and Shoemark et al. (2017), who used a similar method in their respective works. For the dataset, we extracted verbs from the OBWB corpus where each inflected verb form occurred at least 50 times, resulting in ≈2.8k verbs per tense.

Entailment with Temporal Predications
Lastly, we propose TEA, the Temporal Entailment Assessment dataset. TEA contains pairs of short sentences with the same argument structure that differ in tense and aspect of the main verb, and follows a binary label annotation scheme (entailment vs. non-entailment). Example sentences from TEA are shown in Table 1. The absence and infeasibility of creating a lexical resource for consequent state inference patterns creates the necessity for NLP systems to learn these rules from data. With TEA, we cast the problem of determining when a new consequent state is licensed by an event as a natural language inference task, thereby providing a first evaluation set for modern NLP models.
Data Collection. We sampled candidate pairs from the before-after category of VerbOcean (Chklovski and Pantel, 2004), the WordNet verb entailment graph (Fellbaum, 1998), the entailment datasets of Weisman et al. (2012) and Vulić et al. (2017), and the relation inference dataset of Levy and Dagan (2016). Subsequently, we manually filtered the list, and discarded candidate verb pairs without any temporal relation to each other. For each pair we chose nouns as arguments to form full sentences. The arguments further served the purpose of reducing ambiguity and avoiding habitual readings.
TEA covers entailments between an all-by-all combination of the present simple, present progressive, present perfect, past simple, past progressive, past perfect and the modal future, covering perfect and progressive aspect.The dataset contains 11138 sentence pairs with a class distribution of 22 : 78 (entailment : non-entailment).More detailed dataset statistics are presented in Appendix A.
Data Annotation. We interpreted entailment as common-sense inference (Dagan et al., 2006), and considered a positive entailment relation between two temporal predications if a human annotator would decide that sentence 2 is most likely true given sentence 1. We decided against a crowdsourced annotation of TEA as our aim was to maximise the consistency of fine-grained entailment decisions. Therefore, TEA was labelled by two annotators, where the first round of annotation resulted in just under 20% disagreement across the whole dataset. The relatively high level of disagreement suggests that even for annotators who (more or less) know what they are looking for, assessing whether an entailment holds between two temporal predications is a very challenging task.
Disagreements in TEA were resolved on a case-by-case basis and all sentence pairs with an initial disagreement have been resolved and included in the dataset. We found that with temporality involved, suddenly everything appeared to become uncertain. Hence we approached the disagreement resolution by first discussing which of several possible readings is the strongest, and whether that reading is sufficiently more likely than any other possible reading. Subsequently we discussed whether the strong reading is above the almost always true threshold.
Often, disagreements resulted from different assumptions regarding the ordering of the events' nuclei. For example, even if we accept that buys entails chooses, will buy does not necessarily entail will choose. The reason is that this pair is ambiguous between two readings, a "has-just-chosen-and-now-will-buy" reading on one hand, and a "will-choose-and-then-will-buy" reading on the other, which seem to be equally likely in the absence of any further context.
Even when ordering was clear, however, disagreements could arise over beliefs of when an utterance becomes licensed. Saying will graduate, for example, can be considered reasonable at any time, or only once graduation is sufficiently imminent and likely. In the latter case, is studying can be considered sufficiently likely to be an entailment, while in the former case the entailment is less clear. Overall, world knowledge and intuition played into disagreements heavily, causing cases to fall just above or below the common-sense inference threshold depending on the annotator.
We identified a possible annotation artefact in TEA due to our decision to annotate the dataset sequentially rather than randomly. While this greatly reduced the cognitive load, we were confronted with possible contradictions between different tenses of entailed predicates (for example, a single event cannot happen in the past and the future). This initially led to more conservative annotations, since some pairs when viewed independently can sound very plausible. We tried to factor out this source of bias when resolving the disagreements, and are confident that the annotations in TEA are robust.
An interesting avenue for future work would be adding temporal adverbials to further reduce ambiguity for annotators, and to analyse whether models can handle them correctly. The addition of temporal adverbials might alleviate the temporal ordering ambiguity, as for example reading will buy in 5 years might help us conclude the ordering with will choose, since choosing is probably near buying.

Results and Analysis
For our experiments we used the publicly available versions of each embedding model. For the evaluation on TEA, we trained two biLSTMs on SNLI and DNC in addition to the embedding models, achieving 83% and 88% accuracy on the SNLI and DNC development sets, respectively. Appendix B lists further details for all models.

Auxiliary-Verb Agreement
For assessing whether the auxiliary-verb agreement can be detected with a diagnostic classifier, we built a binary classification task, using stratified J-K-fold cross-validation (Moss et al., 2018) and reporting averaged accuracy. We used the scikit-learn (Pedregosa et al., 2011) logistic regression classifier with default hyperparameter settings.
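A minimal sketch of such a diagnostic classifier setup is shown below. Several things here are assumptions rather than details from the experiments: J-K-fold cross-validation is approximated by scikit-learn's `RepeatedStratifiedKFold`, which we take to correspond to J independent repetitions of stratified K-fold; the values J = 3 and K = 5 are arbitrary; and the feature matrix is synthetic rather than built from embeddings of auxiliary-verb pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for contextualised auxiliary-verb representations:
# 200 examples with 16 features, and a label that depends on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# J repetitions of stratified K-fold (J=3 and K=5 are illustrative choices).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# Logistic regression with default hyperparameters as the diagnostic classifier.
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
std_accuracy = scores.std()
```

The mean and standard deviation over all J × K folds correspond to the kind of averaged accuracy with standard deviation that is reported for this experiment.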
The results in Table 2 show that the representations of APTs and BERT are specific enough for a linear classifier to distinguish plausible from implausible combinations. The reason for the strong performance of APTs stems from its sparsity: plausible auxiliary-verb combinations result in representations with numerous non-zero entries, whereas implausible combinations rarely contain more than a handful of non-zero elements. While word2vec and fastText seem to capture the morphosyntactic relation between an auxiliary and an inflected verb to some extent, their performance is substantially worse than APTs and BERT. Somewhat surprisingly, the results for ELMo are worse than the majority class baseline for all auxiliaries. One possible reason for the comparatively weak performance of word2vec,

Translation Operation
For obtaining an averaged vector offset, we randomly sampled a seed set of verb types from our dataset to learn an offset vector, and subsequently aimed to predict the inflected form for all remaining verb types in the dataset. We sampled 10 different seed sets of size 10 for our experiments.
For learning a translation operation with a neural network we used a simple feedforward architecture with a single hidden layer and a tanh activation function, using Adam with a learning rate of 0.01 to optimise the mean squared error between the generated inflected verb and the true inflected verb. Due to the neural network requiring more training data than the averaged vector offset approach, we evaluated the model using 10-fold cross-validation. For APTs we projected the explicit co-occurrence space down to 100 dimensions using SVD before feeding the representations to the neural network.
Performance for both approaches is reported in terms of Mean Reciprocal Rank (MRR), averaged over the 10 randomly sampled seed sets and the 10 cross-validation folds, for the averaged offset vector and neural network approaches, respectively. For calculating MRR, the query space for retrieving an inflected verb, given its lemma and the computed translation operation, is based on all contextualised auxiliary-verb combinations, and all inflected forms of all verbs.
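The MRR computation can be sketched as follows. Ranking candidates by cosine similarity is an assumption made for this illustration, and the function names and toy vectors are hypothetical.

```python
import numpy as np


def reciprocal_rank(query, candidates, gold_index):
    """Return 1/rank of the gold candidate when ranking by cosine similarity."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    # Rank is one plus the number of candidates scoring strictly higher.
    rank = 1 + int(np.sum(sims > sims[gold_index]))
    return 1.0 / rank


def mean_reciprocal_rank(queries, candidates, gold_indices):
    """Average the reciprocal ranks over all generated verb forms."""
    ranks = [reciprocal_rank(q, candidates, g) for q, g in zip(queries, gold_indices)]
    return float(np.mean(ranks))


# Toy query space with three candidate forms (rows) in three dimensions.
candidates = np.eye(3)
# Two generated vectors; the gold form for both is candidate 0.
queries = np.array([[1.0, 0.1, 0.0], [0.5, 0.2, 1.0]])
mrr = mean_reciprocal_rank(queries, candidates, [0, 0])
```

In the toy example the first query retrieves its gold form at rank 1 and the second at rank 2, so the MRR is (1 + 0.5) / 2 = 0.75.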
Creating translation operations in embedding space is primarily a word-type level task and thus potentially puts BERT and ELMo at a disadvantage as they produce representations on the token level. This is reflected in Figure 1, where both ELMo and BERT perform poorly in comparison to word2vec and fastText. APTs also exhibit weak performance on this task, with the sparsity of their high-dimensional representations being disadvantageous this time. Interestingly, performance generally dropped (except for word2vec) when moving from the simple vector offset approach to a neural network based translation operation, providing evidence that the morphosyntax of tense and aspect is well represented as a linear offset in the embedding space.
One of the main reasons for the poor performance of ELMo and BERT was that the obtained offset vectors and learnt translation matrices varied substantially across runs. Figure 2 shows the average cosine similarities (left) and average Euclidean distances (middle) between the computed offset vectors for each subtask across all 10 runs. Figure 2 furthermore shows the average Frobenius distances (right) between the learnt neural network translation matrices across all 10 folds. Figure 2 mirrors the general performance trend in Figure 1, with vector offsets obtained from word2vec and fastText having high average cosine similarity and low average Euclidean distance. Furthermore, the lower average Frobenius distance for word2vec is reflected in its improved performance in comparison to fastText, whose translation matrices exhibit a larger average Frobenius distance. For ELMo in particular, the offset vectors and translation matrices differ considerably across experimental runs. The large average Frobenius distances for ELMo and BERT also suggest that the neural network struggled to find a good minimum during learning.
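The stability measures discussed above can be sketched as follows. Averaging over all unordered pairs of runs is an assumption about how the per-run quantities are aggregated, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of offset vectors."""
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(vectors, 2)
    ]
    return float(np.mean(sims))


def mean_pairwise_euclidean(vectors):
    """Average Euclidean distance over all unordered pairs of offset vectors."""
    dists = [float(np.linalg.norm(a - b)) for a, b in combinations(vectors, 2)]
    return float(np.mean(dists))


def mean_pairwise_frobenius(matrices):
    """Average Frobenius distance over all unordered pairs of weight matrices."""
    dists = [
        float(np.linalg.norm(a - b, ord="fro")) for a, b in combinations(matrices, 2)
    ]
    return float(np.mean(dists))


# Toy example: three offset vectors and two weight matrices.
offsets = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
weights = [np.eye(2), np.zeros((2, 2))]

cosine_stability = mean_pairwise_cosine(offsets)
euclidean_spread = mean_pairwise_euclidean(offsets)
frobenius_spread = mean_pairwise_frobenius(weights)
```

High average cosine similarity together with low average distances indicates that different seed sets or folds yield similar translation operations.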

Entailment with Temporal Predications
The results in this section so far have shown that morphosyntactic information relating to tense and aspect is encoded in the different embedding spaces. In the following we use TEA to analyse whether these models are able to use that information for natural language inference. As our goal is to assess to what extent tense and aspect are captured by the models, we refrain from fine-tuning them on TEA.
For evaluation we measure precision and recall over varying thresholds and report performance in terms of average precision. TEA can also serve as an additional evaluation set for sentence encoder models trained on large-scale natural language inference datasets such as SNLI or DNC, which themselves include very little temporal information in their respective test sets. We therefore additionally cast TEA as a binary classification task, and report accuracy and macro-averaged F1-score for the two pre-trained biLSTM models.
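Average precision summarises the precision-recall trade-off over all thresholds in a single number. The sketch below uses the common definition as the mean of the precision values at the rank of each positive example; whether this matches the exact variant used in the evaluation is an assumption, and the scores and labels are toy values.

```python
import numpy as np


def average_precision(scores, labels):
    """Mean of precision@k taken at the rank of every positive example.

    Assumes binary labels (1 = entailment) with at least one positive.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float(np.sum(precision_at_k * ranked) / np.sum(ranked))


# Toy similarity scores and gold labels for four sentence pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)
```

For the toy data the positives sit at ranks 1 and 3, giving precisions of 1 and 2/3 and an average precision of 5/6.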
Table 3 shows the average precision scores for the models and the accuracy and F1-scores for the two pre-trained biLSTMs in comparison to a majority class baseline and a baseline predicting the majority class per tense pair. We used cosine as similarity measure for the embedding models and the softmax prediction scores for the biLSTMs. For APTs, we also tried the asymmetric inclusion score BInc (Szpektor and Dagan, 2008), but found cosine to work better. We furthermore experimented with distributional inference (Kober et al., 2016), and found a small positive impact on recall but a slightly larger negative dip in precision, which overall led to slightly lower average precision scores. The results show that none of the models is able to outperform the majority class / tense baseline. This highlights that despite the use of short and simple sentences in the dataset, the latent nature of tense and aspect makes TEA a very challenging problem.
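A baseline that predicts the majority class per tense pair can be sketched as follows. The function name, the tuple layout and the toy examples are hypothetical, and how the baseline was fitted in the actual evaluation is not specified here.

```python
from collections import Counter, defaultdict


def fit_tense_pair_baseline(examples):
    """Map each (premise tense, hypothesis tense) pair to its most frequent label.

    `examples` is an iterable of (premise_tense, hypothesis_tense, label) tuples.
    """
    counts = defaultdict(Counter)
    for premise_tense, hypothesis_tense, label in examples:
        counts[(premise_tense, hypothesis_tense)][label] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


# Invented toy annotations, for illustration only.
toy_examples = [
    ("present perfect", "present simple", "entailment"),
    ("present perfect", "present simple", "entailment"),
    ("present perfect", "present simple", "non-entailment"),
    ("modal future", "present simple", "non-entailment"),
]
baseline = fit_tense_pair_baseline(toy_examples)
```

Such a baseline ignores the verbs entirely and relies only on the tense and aspect of the two predications.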
In order to analyse the causes for the low performance across models, we calculated the false positive and false negative rates for different similarity threshold ranges for each of the models. Figure 3 shows that even for high thresholds, the neural embedding models frequently predict entailment when there isn't one, thereby producing a high rate of false positives (highlighted at the top of Figure 3). Conversely, a sparse model such as APTs fails to predict entailment when there actually is one, resulting in a high rate of false negatives (highlighted at the bottom of Figure 3). Our results show that natural language inference on temporal predications is a challenging problem, especially for distributional semantic approaches. One reason is that these models are primarily governed by contextual similarity, which is a bad proxy for inference in the case of a dataset such as TEA. For example, if Jane has arrived in London, then she was going to London at some earlier point, but it is not the case that she currently is going to London. Furthermore, when she has arrived in London, she is visiting London at the moment, and will leave again at some point in the future.
The predications in the short narrative above are very diverse in terms of tense and aspect; however, the main verbs (or even the predications as a whole) typically have high distributional similarity, which inevitably leads to numerous false entailment decisions as reflected in Figure 3.
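The error analysis by threshold can be sketched as follows. Treating a pair as predicted entailment when its similarity score is at or above the threshold is an assumption, as are the function name and the toy values.

```python
import numpy as np


def error_rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a given threshold.

    A pair is predicted to be an entailment when its score >= threshold.
    Assumes both classes occur at least once in `labels`.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    predicted = scores >= threshold
    false_positive_rate = float(np.sum(predicted & ~labels) / np.sum(~labels))
    false_negative_rate = float(np.sum(~predicted & labels) / np.sum(labels))
    return false_positive_rate, false_negative_rate


# Toy cosine scores and gold labels (1 = entailment) for four pairs.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
fpr, fnr = error_rates(toy_scores, toy_labels, threshold=0.5)
```

Sweeping the threshold over a range of values gives one false positive rate and one false negative rate per threshold, which can then be compared across models.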
In the following we briefly analyse the impact of distributional similarity and investigate to what extent the similarity scores between two predications change when tense and aspect influence the entailment. Table 4 shows that the cosine similarity between temporally and aspectually modified predications is typically higher than for their respective lemmas. This further indicates that many false positives of the neural network based models in our results are due to high distributional similarity scores between predications. For APTs the cosine scores, even when normalised, are generally very low due to their sparsity and high dimensionality, highlighting their bias towards false negatives. However, Table 4 also shows that in most cases the distributional similarity between an entailed pair is higher than for a non-entailed pair (boldfaced in Table 4). This indicates that the embedding models do appear to capture some of the semantics of tense and aspect in their respective contextualised representations. However, their high distributional similarity overwhelms any finer distinction that the models might have extracted.
While our analysis indicates that the embedding models are able to extract knowledge about tense and aspect, the signal is not strong enough to reliably perform inference. A potential avenue for future work would therefore be the development of models that are able to better represent tense and aspect, while not being primarily governed by distributional similarity.
Most previous work on inference between verbs was concerned with extracting inference rules from raw text (Lin and Pantel, 2001; Szpektor et al., 2004, 2007; Hashimoto et al., 2009; Melamud et al., 2013). As a next step, Berant et al. (2010) and Hosseini et al. (2018) leverage these rules to build entailment graphs for modelling natural language inference. However, in both cases the entailment graphs are built on the basis of verb lemmas and do not take tense and aspect into account. One example of using tense for inference is Pavlick and Callison-Burch (2016), who leverage implicative verbs to determine that managed to solve X |= X is solved. Our proposed dataset TEA fills a gap in the natural language inference evaluation repertoire by focusing on temporal and aspectual entailment. Recent years saw the release of a number of large-scale datasets, such as SNLI (Bowman et al., 2015), MNLI (Williams et al., 2017) or DNC (Poliak et al., 2018), but none of these datasets focuses on, or includes a substantial proportion of, inference examples between temporal predications.
TEA is related to work on causality (Mirza et al., 2014; Mirza and Tonelli, 2014); however, our dataset has been created from scratch rather than derived from TimeBank (Pustejovsky et al., 2003), as for example explicit buys |= owns relations are rarely encountered in the same paragraph or connected by explicit causal links. Therefore, TEA captures many consequent state inferences that are missing from previous datasets. The most closely related task to TEA is the relation inference dataset of Levy and Dagan (2016), which, however, contains only very few examples where temporality is a governing factor.

Future Work
In future work we plan to leverage tense- and aspect-based information for constructing temporal entailment graphs (Lewis and Steedman, 2014), where nodes represent tensed predicates (e.g. has visited), and edges represent entailment relations. Temporal entailment graphs, together with knowledge about the completedness or current relevance of an event, can be applied to procedural reasoning, such as tracking the state of entities through text, similar to recent work of Bosselut et al. (2017) and Henaff et al. (2017). We furthermore plan to focus on other types of aspect such as Aktionsart.

Conclusion
In this paper we highlighted that tense and aspect are two of the most important factors for performing natural language inference. We introduced a novel entailment dataset, TEA, that contains pairs of short sentences and focuses on entailment relations between temporally and aspectually modified verbs. We showed that distributional embedding models capture a considerable amount of the morphosyntactic information relating to tense and aspect in their embedding spaces. However, neither the embedding models, nor two pre-trained biLSTMs, were able to outperform a simple rule-based baseline on TEA, primarily due to their reliance on contextual similarity for inference. In this sense, tense and aspect semantically resemble logical operators like negation rather than distributional components. The challenge will be to combine logical operator semantics with distributional representations of content words.

Figure 1: Translation operation results based on averaged MRR.

Figure 2: Average cosine similarities and Euclidean distances of averaged offset vectors and Frobenius distances of the learnt neural network weight matrices.
John is visiting London. |= John has arrived in London.
John will visit London. ⊭ John has arrived in London.
John is visiting London. ⊭ John has left London.
John is visiting London. |= John will leave London.
George has acquired the house. |= George owns the house.
George is acquiring the house. ⊭ George owns the house.
Table 1: Examples from TEA.

Table 2: Auxiliary-verb agreement results. Results are averaged accuracies with standard deviations in brackets.

Table 3: TEA results. All model results are significantly worse at the p < 0.01 level w.r.t. the majority class / tense pair baseline, using a randomised bootstrap test.

Table 4: Similarity scores between the example predicates. DNC and SNLI refer to the two biLSTMs pre-trained on DNC and SNLI, respectively.