Fine-Grained Temporal Relation Extraction

We present a novel semantic framework for modeling temporal relations and event durations that maps pairs of events to real-valued scales for the purpose of constructing document-level event timelines. We use this framework to construct the largest temporal relations dataset to date, covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to train models for jointly predicting fine-grained temporal relations and event durations. We report strong results on our data and show the efficacy of a transfer-learning approach for predicting standard, categorical TimeML relations.


Introduction
Natural languages provide a myriad of formal and lexical devices for conveying the temporal structure of complex events, e.g. tense, aspect, auxiliaries, adverbials, coordinators, subordinators, etc. Yet, these devices are generally insufficient for determining the fine-grained temporal structure of such events. Consider the narrative in (1).
(1) He was running away, when the neighbor rushed out to confront him. His parents were called but couldn't arrive for two hours because they were still at work.
Most native English speakers would have little difficulty drawing a timeline for these events, likely producing something like that in Figure 1. But how do we know that the breaking, the running away, the confrontation, and the calling were short, while the parents being at work was not? And why should the first four be in sequence, with the last containing the others?
The answers to these questions likely involve a complex interplay between linguistic information, on the one hand, and common sense knowledge about events and their relationships, on the other (Minsky, 1975; Schank and Abelson, 1975; Lamport, 1978; Allen and Hayes, 1985; Hobbs et al., 1987). But it remains an open question how best to capture this interaction. A promising line of attack lies in the task of temporal relation extraction. Prior work in this domain has approached this task as a classification problem, labeling pairs of event-referring expressions (e.g. broke or be at work in (1)) and time-referring expressions (e.g. 3pm or two hours) with categorical temporal relations (Pustejovsky et al., 2003; Styler IV et al., 2014; Minard et al., 2016). The downside of this approach is that we must rely on time-referring expressions to express duration information. But as example (1) highlights, nearly all temporal duration information can be left implicit, meaning that such approaches capture duration only when it is explicitly encoded in the text.
In this paper, we develop a novel framework for temporal relation representation that puts event duration front and center. Like standard approaches using the TimeML standard, we draw inspiration from Allen's (1983) seminal work on interval representations of time. But instead of annotating text for categorical temporal relations, we map event pairs directly to real-valued relative timeline representations, in addition to mapping events to their likely durations. This change not only supports the goal of giving a more central role to event duration, it also allows us to better reason about the temporal structure of complex events as described by entire documents.
We begin with a discussion of the literature on temporal relation extraction (§2) and then discuss our own framework and data collection methodology (§3). The resulting Universal Decompositional Semantics Time (UDS-T) dataset is the largest temporal relation dataset to date (available at decomp.io), covering all of the Universal Dependencies (Nivre et al., 2015) English Web Treebank (Bies et al., 2012). We use this dataset to train a variety of neural models (§4) to jointly predict fine-grained (real-valued) temporal relations and event durations (§5), showing not only that our models obtain strong results on our dataset, but also that the representations they learn can be straightforwardly transferred to the standard categorical relation datasets (§6).

Background
We review prior work on temporal relations frameworks and associated corpora as well as systems for temporal relation extraction.
Corpora Most large datasets capturing temporal relations between events use the TimeML standard (Pustejovsky et al., 2003; Styler IV et al., 2014; Minard et al., 2016). TimeBank is one of the earliest large corpora built using this standard, capturing event pairs that annotators felt were salient (Pustejovsky et al., 2003). The TempEval competitions improved on the number of temporal relations by covering relations between all the events and times in a sentence, but only one of the TempEval tasks covered inter-sentential event relations (Verhagen et al., 2007, 2010; UzZaman et al., 2013; and see Chambers et al., 2014).
Efforts have been made to address the issue of sparsity in event graphs with corpora such as TimeBank-Dense, where annotators label all local edges irrespective of ambiguity. TimeBank-Dense does not capture the complete graph over event and time relations, instead attempting to achieve completeness by capturing all relations within a sentence and the neighboring sentence. We take inspiration from this work for our own annotation protocol.
The Richer Event Description (RED) corpus uses a multi-stage annotation pipeline in which various event-event phenomena, including temporal relations and sub-event relations, are annotated together in the same dataset. Similarly, Hong et al. (2016) build a cross-document event corpus which covers fine-grained event-event relations and roles with a larger number of event types and sub-types. Another framework called GAF (Fokkens et al., 2013) captures event identification through both textual and non-textual sources to track events across news articles.
Most of the corpora mentioned above required skilled workers to build the annotations as they follow specific ontologies. We take an alternative approach of capturing temporal relations by designing a protocol that asks simple questions about events which can be answered by any native speaker of English, finding surprisingly high agreement among annotators (see §3).
Models A variety of approaches have been taken to identifying the temporal relations between pairs of events. Early approaches use hand-tagged features modeled with multinomial logistic regression and support vector machines (Mani et al., 2006; Bethard, 2013; Lin et al., 2015). Other approaches use a combination of rule-based and learning-based approaches (D'Souza and Ng, 2013) and sieve-based architectures (Mirza and Tonelli, 2016). Ning et al. (2018) jointly model causal and temporal relations using Constrained Conditional Models and formulate the problem as an Integer Linear Programming problem.
In recent years, neural network-based approaches have used both recurrent (Tourille et al., 2017; Cheng and Miyao, 2017; Leeuwenberg and Moens, 2018) and convolutional architectures (Dligach et al., 2017). Leeuwenberg and Moens (2018) use such models to predict relative timelines constructed from a set of temporal relations. Our annotations allow us to directly predict relative timelines between a pair of events, which we then use to create document timelines anchored to some specific event.
Pairwise classification can result in inconsistent temporal graphs, and efforts have been made to avert this issue by employing temporal reasoning (Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Denis and Muller, 2011; Do et al., 2012; Laokulrat et al., 2016; Ning et al., 2017; Leeuwenberg and Moens, 2017). Others have modeled event durations from text (Pan et al., 2007; Gusev et al., 2011; Williams and Katz, 2012), but that work does not tie durations directly to temporal relations. On the other hand, Filatova and Hovy (2001) assign a time-stamp to every clause in text, but the durations of events are not taken into consideration.
Attention-based models have proven effective in the neural machine translation literature (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), but to our knowledge, they have not been explored for identifying temporal relations. We follow up on this work in our models, using a variation of dot-product attention (Luong et al., 2015; Vaswani et al., 2017) to predict the event timelines and durations, as described in §4. To cater to temporal reasoning, we treat the document timeline as a hidden representation and build it from the actual pairwise annotations as described in §7.

Data Collection
We collect the Universal Decompositional Semantics Time (UDS-T) dataset, which is annotated on top of the Universal Dependencies (Nivre et al., 2015) English Web Treebank (Bies et al., 2012). The main advantages of UD-EWT over other similar corpora are: (i) unlike most other datasets, it covers text from a variety of genres; (ii) it is built upon gold standard Universal Dependency parses; and (iii) it is compatible with various other semantic annotations which use the same predicate extraction standard (White et al., 2016; Zhang et al., 2017). Table 1 compares UDS-T against other temporal relations datasets.
Protocol design Annotators are given two contiguous sentences from a document with two highlighted event-referring expressions (predicates). If the predicate contains a copula, the whole predicate starting from the copula is highlighted. Otherwise, only the root of the predicate is highlighted. They are then asked (i) to provide relative timelines on a 0-100 scale for the pair of events referred to by the highlighted predicates; and (ii) to give the likely duration of the event referred to by each highlighted predicate from the following list: instantaneous, seconds, minutes, hours, days, weeks, months, years, decades, centuries, forever. In addition, annotators were asked to give a confidence rating for their relation annotation and for each of their two duration annotations on the same five-point scale: not at all confident (0), not very confident (1), somewhat confident (2), very confident (3), totally confident (4). An example of the annotation instrument is shown in Figure 2. Henceforth, we refer to the situation referred to by the predicate that comes first in linear order (feed in Figure 2) as $e_1$ and the situation referred to by the predicate that comes second in linear order (sick in Figure 2) as $e_2$.
We concatenate two adjacent sentences to form a combined sentence, which allows us to capture inter-sentential temporal relations. Considering all possible pairs of events in the combined sentence results in an explosion in the number of event-event comparisons. Therefore, to reduce the total number of comparisons, we find the pivot-predicate of the antecedent of the combined sentence as follows: find the root predicate of the antecedent and, if it governs a CCOMP, CSUBJ, or XCOMP, follow that dependency to the next predicate until a predicate is found that doesn't govern a CCOMP, CSUBJ, or XCOMP. We then take all pairs of the antecedent predicates and pair every predicate of the consequent only with the pivot-predicate. This results in $\binom{N}{2} + M$ predicate pairs instead of $\binom{N+M}{2}$ per sentence, where N and M are the number of predicates in the antecedent and consequent respectively. This heuristic allows us to find a predicate that loosely denotes the topic being talked about in the sentence. Figure 3 shows an example of finding the pivot predicate.
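The pivot heuristic and the resulting pair count can be illustrated with a minimal sketch. The dictionary-based dependency representation, the function names `find_pivot` and `candidate_pairs`, the toy predicates, and the order in which the three relations are checked are all assumptions made for illustration; they are not the UD-EWT format or our extraction code.

```python
from itertools import combinations

# Hypothetical toy representation: each predicate maps a dependency relation
# to the predicate it governs through that relation (not the UD-EWT format).
CLAUSAL = ("ccomp", "csubj", "xcomp")


def find_pivot(root, governs):
    """Follow clausal dependencies from the root predicate until none remain."""
    current, seen = root, {root}
    while True:
        deps = governs.get(current, {})
        nxt = next((deps[rel] for rel in CLAUSAL if rel in deps), None)
        if nxt is None or nxt in seen:  # no clausal dependent left (or a cycle)
            return current
        seen.add(nxt)
        current = nxt


def candidate_pairs(antecedent, consequent, pivot):
    """All antecedent pairs, plus the pivot paired with each consequent predicate."""
    pairs = list(combinations(antecedent, 2))
    pairs += [(pivot, pred) for pred in consequent]
    return pairs


governs = {"said": {"ccomp": "wanted"}, "wanted": {"xcomp": "leave"}}
antecedent = ["said", "wanted", "leave"]   # N = 3
consequent = ["packed", "drove"]           # M = 2
pivot = find_pivot("said", governs)
pairs = candidate_pairs(antecedent, consequent, pivot)
```

With N = 3 and M = 2, this toy input yields 3 + 2 = 5 pairs, where exhaustive pairing over all five predicates would yield 10.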
Annotators We recruited 765 annotators from Amazon Mechanical Turk to annotate predicate pairs in groups of ten. Each predicate pair contained in the UD-EWT training set was annotated by a single annotator, and each predicate pair in the UD-EWT development and test sets was annotated by three annotators.
Normalization We normalize the slider responses for each event pair by subtracting the minimum slider value from all values, then dividing all such shifted values by the maximum value (after shifting). This ensures that the earliest beginning point for every event pair lies at 0 and that the right-most end-point lies at 1, while preserving the ratio between the durations implied by the sliders.
Summary statistics Figure 4 shows the distribution of duration responses in the training and development sets. There is a relatively high density of events lasting minutes, with a relatively even distribution across durations of years or less and few events lasting decades or more. The raw slider positions themselves are somewhat difficult to directly interpret, and so it is not particularly informative to show their distribution directly. To improve interpretability, we rotate the slider position space to construct four new dimensions: (i) PRIORITY, which is positive when $e_1$ starts and/or ends earlier than $e_2$ and most negative when $e_2$ starts and/or ends earlier than $e_1$; (ii) CONTAINMENT, which is most negative when $e_2$ contains more of $e_1$ and most positive when $e_1$ contains more of $e_2$; (iii) EQUALITY, which is largest when both $e_1$ and $e_2$ have the same temporal extents and smallest when they are most unequal; and (iv) SHIFT, which moves the events forward or backward in time. We construct these dimensions by solving for R in an equation over the slider matrix, which contains the slider positions for our N datapoints in the following order: beg($e_1$), end($e_1$), beg($e_2$), end($e_2$). Figure 5 shows the embedding of the event pairs on the first three of these dimensions of R. The triangular pattern near the top and bottom of the plot arises because strict priority (i.e. extreme positivity or negativity on the y-axis) precludes any temporal overlap between the two events, and as we move toward the center of the plot, different priority relations mix with different overlap relations: e.g. the upper-middle left corresponds to event pairs where most of $e_1$ comes toward the beginning of $e_2$, while the upper-middle right of the plot corresponds to event pairs where most of $e_2$ comes toward the end of $e_1$.
We see that there is a strong bias for $e_1$ to start and/or end earlier than $e_2$, evidenced by the higher density of points near the upper center of Figure 5 than near the lower center, and a slight bias for $e_1$ to contain more of $e_2$, evidenced by slightly higher density of points near the right center of Figure 5 than near the left center.
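The slider normalization described under Normalization (subtract the minimum, then divide by the shifted maximum) can be sketched as follows. The function name and the handling of the degenerate case where all four sliders coincide are assumptions; the text does not say how that case is treated.

```python
def normalize_sliders(sliders):
    """Shift so the minimum is 0, then divide by the shifted maximum."""
    lowest = min(sliders)
    shifted = [value - lowest for value in sliders]
    highest = max(shifted)
    if highest == 0:  # all sliders equal; returning zeros is an assumption
        return [0.0 for _ in shifted]
    return [value / highest for value in shifted]


# Order: beg(e1), end(e1), beg(e2), end(e2) on the 0-100 annotation scale.
raw = [20, 40, 30, 70]
normalized = normalize_sliders(raw)

# The ratio between the two implied durations is unchanged by normalization.
raw_ratio = (raw[1] - raw[0]) / (raw[3] - raw[2])
norm_ratio = (normalized[1] - normalized[0]) / (normalized[3] - normalized[2])
```

Here the earliest beginning point maps to 0, the latest end-point maps to 1, and both `raw_ratio` and `norm_ratio` equal 0.5.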
Inter-annotator agreement We measure inter-annotator agreement for the temporal relation sliders by calculating the rank (Spearman) correlation between the normalized slider positions for each pair of annotators that annotated a particular group of ten predicate pairs in the development set. Rank correlation is a useful measure in this case because it tells us how much different annotators agree on the relative position of each slider. The average rank correlation between annotators was 0.665 (95% CI=[0.661, 0.669]).
We measure interannotator agreement for the durations by calculating the absolute difference in duration rank between the duration responses for each pair of annotators that annotated a particular group of ten predicate pairs in the development set. On average, annotators disagree by 2.24 scale points (95% CI=[2.21, 2.25]), though there is heavy positive skew (γ 1 = 1.16, 95% CI=[1.15, 1.18]) -evidenced by the fact that the modal rank difference is 1 (25.3% of the response pairs), with rank difference 0 as the next most likely (24.6%) and rank difference 2 as a distant third (15.4%).
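Both agreement measures are simple to compute for a single pair of annotators. The sketch below is a plain-Python illustration with hypothetical toy responses; assigning tied values the mean of their ranks is an assumption, since the text does not specify tie handling.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_rank_difference(ranks_a, ranks_b):
    """Mean absolute difference between two annotators' duration ranks."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b)) / len(ranks_a)


# Hypothetical normalized slider positions from two annotators.
rho = spearman([0.0, 0.4, 0.2, 1.0], [0.0, 0.5, 0.6, 1.0])
# Hypothetical duration ranks (0 = instantaneous, ..., 10 = forever).
diff = mean_rank_difference([2, 3, 5], [2, 4, 7])
```

In this toy case the two annotators swap the order of the middle two sliders, giving a rank correlation of 0.8, and their duration ranks differ by one scale point on average.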
Annotation coherence Annotators were asked to approximate the relative duration of the two events that they were annotating using the distance between the sliders. This means that an annotation is coherent insofar as the ratio of distances between the slider responses for each event matches the ratio of the categorical duration responses. We rejected annotations wherein there was gross mismatch between the categorical responses and the slider responses (i.e. one event is annotated as having a longer duration but is given a shorter slider response), but because this does not guarantee that the exact ratios are preserved, we assess that here using a canonical correlation analysis (CCA; Hotelling 1936) between the categorical duration responses and the slider responses. Figure 6 shows the CCA scores. We find that the first canonical correlation, which captures the ratios between unequal events, is 0.765; and the second, which captures the ratios between roughly equal events, is 0.427. This preservation of the ratios is quite impressive in light of the fact that our slider scales are bounded; though we hoped for at least a non-linear relationship between the categorical durations and the slider distances, we did not expect such a strong linear relationship.

Model
For a given event pair in a sentence, we aim to jointly predict each event's duration alongside the relative event timelines. We then use these relative timelines to construct timelines for entire documents with a separate model.

Relative timelines
The relative timeline model consists of three components: an event model, a duration model, and a relation model. These components use multiple layers of dot product attention (Luong et al., 2015) on top of an embedding $H \in \mathbb{R}^{N \times D}$ for a sentence $s = [w_1, \ldots, w_N]$, tuned on the three M-dimensional contextual embeddings produced by ELMo (Peters et al., 2018) for that sentence, concatenated together. Here D is the dimension of the tuned embeddings, and the tuning parameters are $W^{\mathrm{TUNE}} \in \mathbb{R}^{3M \times D}$ and $b^{\mathrm{TUNE}} \in \mathbb{R}^{D}$.
Event model We define the model's representation for the event referred to by predicate k as $g_{\mathrm{pred}_k} \in \mathbb{R}^{D}$, where D is the embedding size. We build this representation using a variant of dot-product attention, based on the predicate root.
Here $h_{\mathrm{ROOT}(\mathrm{pred}_k)}$ is the hidden representation of the $k$th predicate's root, and $H_{\mathrm{SPAN}(\mathrm{pred}_k)}$ is obtained by stacking the hidden representations of the entire predicate.
The idea here is that the predicate root itself may be indicative of where within the predicate the relevant temporal information lies. For example, the predicate been sick for now in Figure 2 has sick as its root, and thus we would take the hidden representation for sick as $h_{\mathrm{ROOT}(\mathrm{pred}_k)}$. Similarly, $H_{\mathrm{SPAN}(\mathrm{pred}_k)}$ would be obtained by taking the hidden-state representations of been sick for now and stacking them together. Then, if the model learns that tense information is important, it may weight been using the attention mechanism.
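The root-driven attention over the predicate span can be illustrated with a small sketch of generic dot-product attention. It makes several simplifying assumptions that should not be read as the exact model: the root hidden state is used directly as the query (without any learned transformation), the final event representation is taken to be the root state concatenated with the attended span, and the two-dimensional hidden states are made-up toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, rows):
    """Dot-product attention: weights over rows and the weighted sum of rows."""
    weights = softmax([dot(row, query) for row in rows])
    context = [sum(w * row[d] for w, row in zip(weights, rows))
               for d in range(len(query))]
    return weights, context


# Toy 2-dimensional hidden states for the span "been sick for now".
span = {"been": [1.0, 0.0], "sick": [0.5, 0.5], "for": [0.0, 0.1], "now": [0.2, 0.9]}
root = span["sick"]  # simplification: the root state itself serves as the query
weights, context = attend(root, list(span.values()))
event_repr = root + context  # assumed: concatenate root state and attended span
```

The attention weights sum to one over the four span tokens, so inspecting `weights` shows which token in the span the root attends to most strongly.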
Duration model The temporal duration representation $g_{\mathrm{dur}_k}$ for the event referred to by the $k$th predicate is defined similarly to the event representation, but instead of stacking the predicate's span, we stack the hidden representations of the entire sentence H.
The attention parameters here are $A^{\mathrm{SENT}}_{\mathrm{DUR}} \in \mathbb{R}^{D \times \mathrm{size}(g_{\mathrm{pred}_k})}$ and $b^{\mathrm{SENT}}_{\mathrm{DUR}} \in \mathbb{R}^{D}$. We consider two models of the categorical durations: a softmax model and a binomial model. The main difference is that the binomial model enforces that the probabilities $p_{\mathrm{dur}_k}$ over the 11 duration values be concave in the duration rank, whereas the softmax model has no such constraint. We employ a cross-entropy loss for both models.
In the softmax model, we pass the duration representation $g_{\mathrm{dur}_k}$ for event k through a multilayer perceptron (MLP) with a single hidden layer and ReLU activations, to yield probabilities $p_{\mathrm{dur}_k}$ over the 11 durations.
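To see why a binomial parameterization constrains the shape of the duration distribution, consider the following sketch. It assumes that the 11 duration ranks are modeled as the outcomes 0 to 10 of a Binomial(10, π) distribution with a single success probability π; how π is produced from the duration representation is not shown, and the function names are hypothetical.

```python
import math

DURATIONS = ["instantaneous", "seconds", "minutes", "hours", "days", "weeks",
             "months", "years", "decades", "centuries", "forever"]


def binomial_duration_probs(pi):
    """Binomial(n=10, pi) probabilities over the 11 duration ranks 0..10."""
    n = len(DURATIONS) - 1
    return [math.comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(n + 1)]


def cross_entropy(probs, gold_rank):
    """Negative log-probability assigned to the annotated duration rank."""
    return -math.log(probs[gold_rank])


probs = binomial_duration_probs(0.3)
mode = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, gold_rank=DURATIONS.index("hours"))
```

Whatever the value of π, the resulting probabilities rise to a single peak and then fall, so the model cannot place high probability on, say, both seconds and years while placing low probability on the ranks in between. An unconstrained softmax over 11 classes can.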
Relation model To represent the temporal relation between the event referred to by the $i$th predicate and the event referred to by the $j$th predicate, we again use a similar attention mechanism.
The attention parameters here are $A^{\mathrm{SENT}}_{\mathrm{REL}} \in \mathbb{R}^{D \times 2\,\mathrm{size}(g_{\mathrm{pred}_k})}$ and $b^{\mathrm{SENT}}_{\mathrm{REL}} \in \mathbb{R}^{D}$. The main idea behind our temporal model is to map events and states directly to a timeline, which we represent via a reference interval [0, 1]. For situation k, we aim to predict the beginning point $b_k$ and end-point $e_k \geq b_k$ of k.
We predict these values by passing $g_{\mathrm{rel}_{ij}}$ through an MLP with one hidden layer of ReLU activations and four real-valued outputs $[\beta_i, \delta_i, \beta_j, \delta_j]$, representing the estimated relative beginning points ($\beta_i$, $\beta_j$) and durations ($\delta_i$, $\delta_j$) for events i and j. We then calculate the predicted slider values $\hat{s}_{ij}$ from these outputs. The predicted values $\hat{s}_{ij}$ are then normalized in the same fashion as the true slider values prior to being entered into the loss. We constrain this normalized $\hat{s}_{ij}$ using four L1 losses.
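One way to turn the four raw outputs into slider values that respect the end-after-beginning constraint, and to score them with four L1 terms, is sketched below. Both the squashing scheme (a sigmoid of the beginning output, and a sigmoid of the beginning plus the absolute duration output) and the particular four differences penalized are assumptions made for illustration; the corresponding equations are not reproduced in this text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_sliders(beta_i, delta_i, beta_j, delta_j):
    """Map raw outputs to [beg_i, end_i, beg_j, end_j] with end >= beg."""
    return [sigmoid(beta_i), sigmoid(beta_i + abs(delta_i)),
            sigmoid(beta_j), sigmoid(beta_j + abs(delta_j))]


def normalize(sliders):
    """Same normalization as for annotations: minimum to 0, maximum to 1."""
    lowest = min(sliders)
    span = max(sliders) - lowest
    return [(value - lowest) / span if span else 0.0 for value in sliders]


def relation_loss(true, pred):
    """Sum of four L1 terms over differences between the two events' endpoints."""
    (bi, ei, bj, ej), (pbi, pei, pbj, pej) = true, pred
    return (abs((bi - bj) - (pbi - pbj)) + abs((ei - bj) - (pei - pbj))
            + abs((ej - bi) - (pej - pbi)) + abs((ei - ej) - (pei - pej)))


pred = normalize(to_sliders(-1.0, 0.5, 0.0, 2.0))
loss = relation_loss([0.0, 0.4, 0.2, 1.0], pred)
```

Because the sigmoid is monotone and the absolute duration is non-negative, each predicted end-point is guaranteed to be no earlier than its beginning point. Penalizing differences between endpoints rather than the endpoints themselves makes the loss depend only on the relative placement of the two events.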
The final loss function is then $L = L_{\mathrm{dur}} + 2 L_{\mathrm{rel}}$, i.e. the relation loss is scaled by a weighting hyperparameter that is set to a fixed value of 2 (see §5).

Duration-relation connections
We also experiment with four architectures wherein the duration and relation models are connected to each other in the Dur → Rel or Dur ← Rel directions.
In the first Dur → Rel architecture, we modify $g_{\mathrm{rel}_{ij}}$ by additionally concatenating the $i$th and $j$th predicate's duration probabilities from the binomial distribution model.
In the second Dur → Rel architecture, we do not use the relation representation model at all, just using the $i$th and $j$th predicate's duration probabilities from the binomial distribution model.
In the first Dur ← Rel architecture, we modify $g_{\mathrm{dur}_k}$ by concatenating the $\hat{b}_k$ and $\hat{e}_k$ from the relation model.
In the second Dur ← Rel architecture, we do not use the duration representation model at all, and instead use the predicted relative duration $\hat{e}_k - \hat{b}_k$ obtained from the relation model, passing it through the binomial distribution model.
Document timelines From the timeline model, we learn the hidden document timelines for the UDS-T development set using: (i) actual pairwise slider annotations; and (ii) slider values predicted by the best performing model on the UDS-T development set. We assume a hidden timeline $T \in \mathbb{R}_{+}^{n_d \times 2}$, where $n_d$ is the total number of predicates in that document and the two dimensions represent the beginning point and the duration of the predicates. We then construct the predicted relative timelines $\hat{s}_{ij}$ for each pair of events i and j from the corresponding rows of T. We learn T for each document under the relation loss $L_{\mathrm{rel}}(s_{ij}, \hat{s}_{ij})$. We further constrain T to predict the categorical durations using the binomial distribution model on the durations $t_{k2}$ implied by T, assuming $\pi_k = \sigma(c \log(t_{k2}))$.
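The relationship between a hidden document timeline and the pairwise quantities it has to explain can be sketched as follows. The sketch assumes that the relative sliders for a pair are obtained by taking each event's beginning point and beginning plus duration and then applying the same normalization used for the annotations; this construction, the function names, and the toy timeline are assumptions, and the optimization of T itself is not shown.

```python
import math


def pair_sliders(timeline, i, j):
    """Relative sliders [beg_i, end_i, beg_j, end_j] from rows (start, duration)."""
    (start_i, dur_i), (start_j, dur_j) = timeline[i], timeline[j]
    raw = [start_i, start_i + dur_i, start_j, start_j + dur_j]
    lowest = min(raw)
    span = max(raw) - lowest
    return [(value - lowest) / span if span else 0.0 for value in raw]


def duration_pi(duration, c):
    """Binomial parameter pi_k = sigmoid(c * log(t_k2)) for a positive duration."""
    return 1.0 / (1.0 + math.exp(-c * math.log(duration)))


# Hypothetical timeline with three events, each given as (beginning, duration).
timeline = [(0.0, 2.0), (1.0, 0.5), (4.0, 4.0)]
sliders = pair_sliders(timeline, 0, 2)
pi = duration_pi(timeline[0][1], c=1.0)
```

Since the sigmoid of c log(t) equals t^c / (1 + t^c), longer durations on the hidden timeline map monotonically to larger values of π (for positive c), and hence to higher expected duration ranks.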

Experiments
We implement the neural model and attention in PyTorch 1.0. We use the concatenated ELMo layers as word embeddings, which are then tuned to a lower dimension of 256. For all experiments, we use stochastic gradient descent to train the ELMo-tuned embeddings, attention, and MLP parameters. The loss-weighting hyperparameter (see §4) is set to 2.0. Both the relation and duration MLPs have a single hidden layer with 128 nodes. We weight both $L_{\mathrm{dur}}$ and $L_{\mathrm{rel}}$ by the ridit-scored confidence ratings of event durations and event relations respectively.
To predict TimeML relations in TempEval3 (Task C, relation only) (UzZaman et al., 2013) and TimeBank-Dense, we use a transfer learning approach. We first use the best-performing model on the UDS-T development set to obtain the relation representation for each pair of annotated predicates in TempEval3 and TimeBank-Dense. We then use this vector as input features to an SVM classifier with a Gaussian kernel (sklearn 0.20.0; Pedregosa et al. 2011) to predict the temporal relation on these datasets. We run a hyperparameter grid-search over 4-fold CV with C: (0.1, 1, 10), and gamma: (0.001, 0.01, 0.1, 1). The best-performing setting on cross-validation (C=10 and gamma=0.001) is then evaluated on the test sets of TempEval3 and TimeBank-Dense.
Table 2: Results on test data based on different model representations; ρ denotes the Spearman-correlation coefficient; rank-diff is the duration rank difference. The model highlighted in blue performs best overall on dev-data. The numbers highlighted in bold are the best-performing numbers in the respective columns.
Since we require spans of predicates for our model, we pre-process TempEval3 and TimeBank-Dense by removing all XML tags from the sentences and then passing them through Stanford CoreNLP 3.9.2 to get the corresponding CoNLL-U format. Roots and spans of predicates are then extracted using PredPatt. For our purposes, the identity and simultaneous relations in TempEval-3 are equivalent when comparing event-event relations. Hence, they are collapsed into one single relation.
Following recent work using continuous labels in event factuality prediction (Lee et al., 2015; Stanovsky et al., 2017) and genericity prediction (Govindarajan et al., 2019), we report three metrics for the duration prediction: Spearman correlation (ρ), mean rank difference (rank diff), and proportion rank difference explained (R1). We report three metrics for the relation prediction: the Spearman correlation between the normalized values of actual beginning and end points and the predicted ones (absolute ρ), the Spearman correlation between the actual and predicted values in $L_{\mathrm{rel}}$ (relative ρ), and the proportion of MAE explained (R1).
In both cases, the R1 metric corresponds closely to the related $R^2$ metric, which measures the amount of variance in the data explained by the model, but is defined in terms of mean absolute error (MAE), which assumes an L1 space:
$$\mathrm{R1} = 1 - \frac{\mathrm{MAE}_{\mathrm{model}}}{\mathrm{MAE}_{\mathrm{baseline}}}$$
where $\mathrm{MAE}_{\mathrm{baseline}}$ is the error obtained by always guessing the median. For both ρ and R1, we report the value scaled by 100 for readability. As Govindarajan et al. (2019) note, these metrics are useful, since ρ tells us how similar the predictions are to the true values, ignoring scale, and R1 tells us how close the predictions are to the true values, after accounting for variability in the data.
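The R1 metric can be computed in a few lines. The sketch below assumes that the median baseline is taken over the same values being evaluated, which the text does not state explicitly, and uses made-up duration ranks.

```python
from statistics import median


def mean_absolute_error(true, pred):
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def r1(true, pred):
    """Proportion of MAE explained relative to always guessing the median."""
    baseline = [median(true)] * len(true)
    return 1.0 - mean_absolute_error(true, pred) / mean_absolute_error(true, baseline)


# Hypothetical gold and predicted duration ranks.
true = [1, 2, 4, 7, 9]
pred = [2, 2, 5, 6, 9]
score = 100 * r1(true, pred)  # scaled by 100, as in the reported results
```

A model that is no better than guessing the median scores 0, a perfect model scores 100, and a model that does worse than the median baseline receives a negative score.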
One difficulty that arises in computing metrics for the relation annotations on our test set is that we obtained three annotations each, and taking, e.g., the mean for each slider value in these annotations can result in a qualitatively different temporal relation, with different duration and relation characteristics, than any of the three annotations themselves. So instead of aggregating either the duration or relation annotations, we compute our metrics on all three annotations separately and then aggregate over them. Note that this will result in higher errors than we might see if we aggregate, but we believe it is the fairest way to report. Table 2 shows the results of different model architectures on the UDS-T test set, and Table 3 shows the results of our transfer-learning approach on TempEval-3 and TimeBank-Dense.
Table 3: Results of our transfer learning experiment on event-event relations in TimeBank-Dense (TD) and TempEval-3 (TE3) compared against other systems.

UDS-T results
The overarching pattern we see is that most of our models are able to predict the relative position of the beginning and ending of events very well (high relation ρ) and the relative duration of events somewhat well (relatively low duration ρ), but they have a lot more trouble predicting relation exactly and relatively less trouble predicting duration exactly.
Duration model The binomial distribution model outperforms the softmax model for duration prediction by a large margin, though it has effectively no effect on the accuracy of the relation model, with the binomial and softmax models performing comparably. This suggests two things. First, the fact that the duration and relation models share the weights associated with the predicate representation does not affect the models this representation feeds into, i.e. having a bad duration representation does not entail having a bad relation representation, even if they are built upon the same foundation. Second, it seems that enforcing concavity in duration rank on the duration probabilities helps the model better predict durations. Indeed, as an elaboration on the first point, it may not be that the duration representations for the softmax model are worse than for the binomial models; it may just be that the extra constraints from the binomial model are helping.
Connections Connecting the duration and relation model doesn't improve performance in general. In fact, when the durations are directly predicted from the temporal relation model (i.e. without using the duration representation model), the model's performance drops by a large margin, with the Spearman correlation down by roughly 15 percentage points. This indicates that constraining the relations model to predict the durations is not enough and that the duration representation is needed to predict durations well.
On the other hand, predicting temporal relations directly from the duration probability distribution (i.e. without using the relation representation model) results in a score similar to that of the top-performing model. This indicates that the duration representation is able to capture most of the relation characteristics of the sentence. Using both duration representation and relation representation separately (model highlighted in blue) results in the best performance overall on the UDS-T development set. This is interesting in light of the fact that, as noted in §3, there is a strong linear relationship between the categorical durations and the durations implied by the relation annotations.
TempEval-3 and TimeBank-Dense We report F1-micro and F1-macro scores on TempEval-3 and TimeBank-Dense in Table 3 and compare our results with some of the other systems. The TD F1-micro scores for these systems are reported by Cheng and Miyao (2017). 2 Our system beats the TD F1-micro scores of all other systems reported in Table 3. The top performing system on TE3 (Mirza and Tonelli, 2016) reports an F1 score of 0.619 over all relations. This indicates that our model is able to achieve competitive performance on other standard temporal classification problems.

Document timelines
We apply the document timeline model described in §4 to both the annotations on the development set and the best-performing model's predictions to obtain timelines for all documents in the development set. Figure 8 shows an example, comparing the two resulting document timelines. For these two timelines, we compare the induced beginning points and durations, obtaining a mean Spearman correlation of 0.28 for beginning points and -0.097 for durations. This suggests that the model agrees to some extent with the annotations about the beginning points of events in most documents but is struggling to find the correct duration spans. One possible reason for poor prediction of durations could be the lack of a direct source of duration information. The model currently tries to identify the duration based only on the slider values, which leads to poor performance in the Dur ← Rel model.
Figure 8: Learned timeline for the following document based on actual (black) and predicted (red) annotations: "A+. I would rate Fran pcs an A + because the price was lower than everyone else , i got my computer back the next day , and the professionalism he showed was great . He took the time to explain things to me about my computer , i would recommend you go to him. David"
2 We do not report the temporal awareness scores (F1) of other systems on TE3 as they report their metrics on all relations, including timex-timex, and event-timex relations. Hence, it is not a fair comparison against our model. For TD, only those systems are reported which report F1-micro scores.

Model Analysis and Timelines
We investigate three aspects of the best-performing model on the development set (highlighted in blue in Table 2): what our duration and relation representations attend to, how well we reconstruct the relation space defined in §3, and how well document timelines constructed from the model's predictions match those constructed from the annotations themselves.
Attention The advantage of using an attention mechanism is that we can often interpret what linguistic information the model is using by analyzing the attention weights. We extract these attention weights for both the duration representation and the relation representation from our best model on the development set. We then compute the mean attention weight for these two attention models for each word type across the corpus. We also compute the mean rank of the attention weight for each word token within a sentence, with rank 1 assigned to the word with highest attention weight. Table 4 shows the top 15 words in the UDS-T development set according to mean attention weight, excluding words with frequency of less than 50 in EWT.
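The two attention summaries just described, mean weight per word type and within-sentence rank per token, can be sketched as follows. The input format, the lowercasing of tokens, and counting frequency within the supplied sentences (rather than in EWT as a whole) are assumptions made for illustration, and the sentences are toy data.

```python
from collections import defaultdict


def mean_attention_by_word(sentences, min_count=50):
    """sentences: list of (tokens, weights) pairs; returns word -> mean weight."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, weights in sentences:
        for token, weight in zip(tokens, weights):
            totals[token.lower()] += weight
            counts[token.lower()] += 1
    return {word: totals[word] / counts[word]
            for word in totals if counts[word] >= min_count}


def attention_ranks(weights):
    """Rank 1 goes to the token with the highest attention weight."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    ranks = [0] * len(weights)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical attention weights over two short token sequences.
data = [(["two", "hours"], [0.2, 0.8]), (["for", "hours"], [0.4, 0.6])]
means = mean_attention_by_word(data, min_count=2)
ranks = attention_ranks([0.2, 0.5, 0.3])
```

The frequency threshold keeps rare words, whose mean weight rests on very few observations, from dominating the top of the list.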
Duration Words that denote some time period (e.g. month(s), minutes, hour, years, days, week) are among the top words in the duration model, with seven of the top 15 words directly denoting one of the duration classes. This is exactly what one might expect this model to rely heavily on, since time expressions are likely highly informative for making predictions about duration. It may also suggest that we do not need to directly encode relations between event-referring and time-referring expressions in our framework, as annotation standards like TimeML do, since our models may discover these relations.
The remainder of the top words in the duration model are plurals or mass nouns. This may suggest that the plurality of a predicate's arguments is an indicator of the likely duration of the event referred to by that predicate. To investigate this possibility, we compute a multinomial regression predicting the attention weights $\alpha_s$ for each sentence $s$ from the $K$ morphological features of each word in that sentence, $F_s \in \{0, 1\}^{\mathrm{length}(s) \times K}$, which are extracted from the UD-EWT features column and binarized. To do this, we optimize coefficients $c$ in $\arg\min_c \sum_s D(\alpha_s \,\|\, \mathrm{softmax}(F_s c))$, where $D$ is the KL divergence. We find that the five most strongly weighted positive features in $c$ are all features of nouns (NUMBER=plur, CASE=acc, PRONTYPE=prs, NUMBER=sing, GENDER=masc), suggesting that a good portion of duration information can be gleaned from the arguments of a predicate. We believe this may be because nominal information can be useful in determining whether the clause is about particular events or generic events (Govindarajan et al., 2019). This is corroborated by the fact that the five most strongly weighted negative features in $c$ tend to be features of function words or predicates: PRONTYPE=Rel, DEGREE=sup, NUMTYPE=mult, VOICE=pass, NUMTYPE=ord.

Relation A majority of the top words in the relation model are either coordinators, such as or and and, or bearers of tense information, i.e. lexical verbs and auxiliaries. The first makes sense in light of the fact that, in context, coordinators can carry information about temporal sequencing (Bar-Lev and Palacas, 1980; Carston, 1993; Wilson and Sperber, 1998). The second makes sense in that information about the tense of the predicates being compared likely helps the model determine the relative ordering of the events they refer to.
To further investigate the role of morphological information, we compute a multinomial regression in the same way as for the duration model, using the same morphological featurization. We find that the five most strongly weighted positive features in c are all features of verbs or auxiliaries (PERSON=1, PERSON=3, TENSE=pres, TENSE=past, MOOD=ind), suggesting that a majority of the information relevant to relation can be gleaned from the tense-bearing units in a clause. This is corroborated by the fact that the five most strongly weighted negative features in c tend to be features of nouns or non-coordinator function words: CASE=acc, DEGREE=cmp, GENDER=neut, PRONTYPE=Rel, NUMTYPE=ord.
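The regression used in both analyses above, which fits coefficients c so that a softmax over each sentence's feature scores matches that sentence's attention weights under KL divergence, can be sketched as below. This is an illustrative implementation using plain batch gradient descent. The optimizer, learning rate, step count, and function names are assumptions and are not specified in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same positions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def predict(feature_rows, coefficients):
    """softmax(F_s c): one probability per word in the sentence."""
    scores = [sum(f * w for f, w in zip(row, coefficients)) for row in feature_rows]
    return softmax(scores)


def fit_coefficients(features, attention, lr=0.5, steps=500):
    """Minimize sum_s KL(alpha_s || softmax(F_s c)) by gradient descent.

    features: per sentence, a list of binary feature rows (one row per word).
    attention: per sentence, attention weights summing to 1.
    """
    num_features = len(features[0][0])
    c = [0.0] * num_features
    for _ in range(steps):
        grad = [0.0] * num_features
        for rows, alpha in zip(features, attention):
            probs = predict(rows, c)
            # Gradient of KL w.r.t. c is F_s^T (softmax(F_s c) - alpha_s).
            for row, p_i, a_i in zip(rows, probs, alpha):
                for k in range(num_features):
                    grad[k] += row[k] * (p_i - a_i)
        c = [w - lr * g / len(features) for w, g in zip(c, grad)]
    return c
```

Features with large positive coefficients are those that draw attention mass toward the words carrying them, which is how the positive and negative feature lists above would be read off the fitted c.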
Relation space We rotate the predicted slider positions in the relation space defined in §3 and compare them with the rotated space of actual slider positions. We see a Spearman correlation of 0.19 for PRIORITY, 0.23 for CONTAINMENT, and 0.17 for EQUALITY. This suggests that our model is best able to capture CONTAINMENT relations and slightly worse at capturing PRIORITY and EQUALITY relations, though all the numbers are quite low compared to the absolute ρ and relative ρ metrics reported in Table 2. This may indicate that our models do somewhat poorly on predicting more fine-grained aspects of an event relation, and in the future it may be useful to jointly train against the more interpretable PRIORITY, CONTAINMENT, and EQUALITY measures instead of or in conjunction with the slider values.

Conclusion
We present a new semantic framework that allows us to collect annotations of fine-grained temporal relations and event durations. Based on this framework, we construct the largest temporal relations dataset to date, built on top of the Universal Dependencies English Web Treebank. Our neural model architecture learns the fine-grained relations with a Spearman correlation of 0.7804, suggesting that these fine-grained relations can be learned fairly well. We also show how our model can be used for standard temporal relation classification tasks via a transfer-learning approach. We present an analysis of different components of the model and show that the attention model focuses on interesting linguistic features to predict the durations and temporal relations. Finally, we present a simple model to generate document timelines from the learned fine-grained relations. These timelines can be used in other tasks to keep track of events temporally.