Improving Temporal Relation Extraction with a Globally Acquired Statistical Resource

Extracting temporal relations (before, after, overlapping, etc.) is a key aspect of understanding events described in natural language. We argue that this task would gain from the availability of a resource that provides prior knowledge in the form of the temporal order that events usually follow. This paper develops such a resource – a probabilistic knowledge base acquired in the news domain – by extracting temporal relations between events from the New York Times (NYT) articles over a 20-year span (1987–2007). We show that existing temporal extraction systems can be improved via this resource. As a byproduct, we also show that interesting statistics can be retrieved from this resource, which can potentially benefit other time-aware tasks. The proposed system and resource are both publicly available.


Introduction
Time is an important dimension of knowledge representation. In natural language, temporal information is often expressed as relations between events. Reasoning over these relations can help figure out when things happened, estimate how long things take, and summarize the timeline of a series of events. Several recent SemEval workshops showcase the importance of this topic (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Llorens et al., 2015; Minard et al., 2015; Bethard et al., 2015, 2016, 2017).
One of the challenges in temporal relation extraction is that it requires high-level prior knowledge of the temporal order that events usually follow. In Example 1, we have deleted events from several snippets from CNN, so that we cannot use our prior knowledge of those events. We are also told that e1 and e2 have the same tense, and e3 and e4 have the same tense, so we cannot resort to their tenses to tell which one happens earlier.
As a result, it is very difficult even for humans to figure out the temporal relations (referred to as "TempRels" hereafter) between those events. This is because rich temporal information is encoded in the events' names, and this often plays an indispensable role in making our decisions. In the first paragraph of Example 1, it is difficult to understand what really happened without the actual event verbs, let alone the TempRels between them. In the second paragraph, things are even more interesting: if we had e3:dislike and e4:stop, then we would easily know that "I dislike" occurs after "they stop the column". However, if we had e3:ask and e4:help, then the relation between e3 and e4 is reversed and e3 is before e4. We need the event names to determine the TempRels; however, we do not have them in Example 1. In Example 2, where we show the complete sentences, the task becomes much easier for humans due to our prior knowledge, namely, that explosions usually lead to casualties and that people usually ask before they get help. Motivated by these examples (which are in fact very common), we believe that such prior knowledge is important in determining TempRels between events.
Example 1: Difficulty in understanding TempRels when event content is missing. Note that e1 and e2 have the same tense, and e3 and e4 have the same tense. More than 10 people have (e1:died), police said. A car (e2:exploded) on Friday in the middle of a group of men playing volleyball. The first thing I (e3:ask) is that they (e4:help) writing this column.
However, most existing systems only make use of rather local features of these events, which cannot represent the prior knowledge humans have about these events and their "typical" order. As a result, existing systems almost always attempt to solve the situations shown in Example 1, even when they are actually presented with input as in Example 2. The first contribution of this work is thus the construction of such a resource in the form of a probabilistic knowledge base, constructed from a large New York Times (NYT) corpus. We hereafter name our resource TEMporal relation PRObabilistic knowledge Base (TEMPROB), which can potentially benefit many time-aware tasks. A few example entries of TEMPROB are shown in Table 1. Second, we show that existing TempRel extraction systems can be improved using TEMPROB, either in a local method or in a global method (explained later), by a significant margin in performance on the benchmark TimeBank-Dense dataset.

Table 1: TEMPROB is a unique source of information of the temporal order that events usually follow. The probabilities below do not add up to 100% because less frequent relations are omitted. The word sense numbers are not shown here for convenience.

Example Pairs         Before (%)   After (%)
accept    determine   42           26
ask       help        86           9
attend    schedule    1            82
accept    propose     10           77
die       explode     14           83
...
Example 2: The original sentences in Example 1. More than 10 people have (e1:died), police said. A car (e2:exploded) on Friday in the middle of a group of men playing volleyball. The first thing I (e3:ask) is that they (e4:help) writing this column.
The rest of the paper is organized as follows. Section 2 provides a literature review of TempRel extraction in NLP. Section 3 describes in detail the construction of TEMPROB. In Sec. 4, we show that TEMPROB can be used in existing TempRel extraction systems and leads to significant improvement. Finally, we conclude in Sec. 5.

Related Work
The TempRels between events can be represented by an edge-labeled graph, where the nodes are events, and the edges are labeled with TempRels (Chambers and Jurafsky, 2008; Do et al., 2012; Ning et al., 2017). Given all the nodes, we work on the TempRel extraction task, which is to assign labels to the edges in a temporal graph (a "vague" or "none" label is often included to account for the non-existence of an edge).
Early work includes Mani et al. (2006), Chambers et al. (2007), Bethard et al. (2007), and Verhagen and Pustejovsky (2008), where the problem was formulated as learning a classification model for determining the label of every edge locally without referring to other edges (i.e., local methods). The temporal graphs predicted by these methods may violate the transitive properties that a temporal graph should possess. For example, given three nodes, e1, e2, and e3, a local method can possibly classify (e1,e2)=before, (e2,e3)=before, and (e1,e3)=after, which is obviously wrong since before is a transitive relation and (e1,e2)=before and (e2,e3)=before dictate that (e1,e3)=before. Recent state-of-the-art methods (Mirza and Tonelli, 2016) circumvented this issue by growing the predicted temporal graph in a multi-step manner, where transitive graph closure is performed on the graph every time a new edge is labeled. This conceptually solves the structured prediction problem greedily. Another family of methods resorted to Integer Linear Programming (ILP) (Roth and Yih, 2004) to perform exact inference for this problem (i.e., global methods), where the entire graph is solved simultaneously and the transitive properties are enforced naturally via ILP constraints (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012). More recent work took this idea further by incorporating structural constraints into the learning phase as well (Ning et al., 2017).
The TempRel extraction task has a strong dependency on prior knowledge, as shown in our earlier examples. However, very limited attention has been paid to generating such a resource and making use of it; to our knowledge, the TEMPROB proposed in this work is completely new. We find that the time-sensitive relations proposed in Jiang et al. (2016) are the closest work in the literature (although they are still very different). Jiang et al. (2016) worked on the knowledge graph completion task. Based on YAGO2 (Hoffart et al., 2013) and Freebase (Bollacker et al., 2008), it manually selects a small number of relations that are time-sensitive (10 relations from YAGO2 and 87 relations from Freebase, respectively). Exemplar relations are wasBornIn→diedIn and graduateFrom→workAt, where → means temporally before.
Our work differs significantly from the time-sensitive relations in Jiang et al. (2016) in the following aspects. First, scale difference: Jiang et al. (2016) can only extract a small number of relations (<100), but we work on general semantic frames (tens of thousands) and the relations between any two of them, which we think has broader applications. Second, granularity difference: the smallest granularity in Jiang et al. (2016) is one year, i.e., only when two events happened in different years can their temporal order be determined, but we can handle implicit temporal orders without having to refer to the physical time points of events (i.e., the granularity can be arbitrarily small). Third, domain difference: while Jiang et al. (2016) extracts time-sensitive relations from structured knowledge bases (where events are explicitly anchored to a time point), we extract relations from unstructured natural language text (where the physical time points may not even exist in text). Our task is more general and it allows us to extract many more relations, as reflected by the first difference above.
Another related work is VerbOcean (Chklovski and Pantel, 2004), which extracts temporal relations between pairs of verbs using manually designed lexico-syntactic patterns (there are in total 12 such patterns), in contrast to the automatic extraction method proposed in this work. In addition, the only temporal relation considered in VerbOcean is before, while we also consider relations such as after, includes, included, equal, and vague. As expected, the total numbers of verbs and before relations in VerbOcean are about 3K and 4K, respectively, both of which are much smaller than in TEMPROB, which contains 51K verb frames (i.e., disambiguated verbs), 9.2M (verb1, verb2, relation) entries, and up to 80M temporal relations altogether.
All these differences necessitate the construction of a new resource for TempRel extraction, which we explain below.

TEMPROB: A Probabilistic Resource for TempRels
In the TempRel extraction task, people have usually assumed that events are already given. However, to construct the desired resource, we need to extract events (Sec. 3.1) and extract TempRels (Sec. 3.2) from a large, unannotated corpus (Sec. 3.3). We also show some interesting statistics discovered in TEMPROB that may benefit other tasks (Sec. 3.4). Below, we describe each of these elements.

Event Extraction
Extracting events and the relations between them (e.g., coreference, causality, entailment, and temporal) has long been an active area in the NLP community. Generally speaking, an event is considered to be an action associated with the participants involved in this action. In this work, following Spiliopoulou et al. (2017), we consider semantic-frame based events, which can be directly detected via off-the-shelf semantic role labeling (SRL) tools. This aligns well with previous work on event detection (Hovy et al., 2013). Depending on the events of interest, the SRL results are often a superset of events and need to be filtered afterwards (Spiliopoulou et al., 2017). For example, in ERE and Event Nugget Detection (Mitamura et al., 2015), events are limited to a set of predefined types (such as "Business", "Conflict", and "Justice"); in the context of TempRels, existing datasets have focused more on predicate verbs than on nominals (Pustejovsky et al., 2003; Graff, 2002; UzZaman et al., 2013). Therefore, we only look at verb semantic frames in this work due to the difficulty of getting TempRel annotation for nominal events, and we will use "verb (semantic frames)" interchangeably with "events" hereafter in this paper.

TempRel Extraction
Given the events extracted in a given article (i.e., given the nodes in a graph), we next explain how the TempRels are extracted (that is, the edge labels in the graph).

Features
We adopt the commonly used feature set in TempRel extraction (Do et al., 2012; Ning et al., 2017) and here we simply list the features for reproducibility. For each pair of nodes, the following features are extracted. (i) The part-of-speech (POS) tags of each individual verb and of its neighboring three words. (ii) The distance between them in terms of the number of tokens. (iii) The modal verbs between the event mentions (i.e., will, would, can, could, may and might). (iv) The temporal connectives between the event mentions (e.g., before, after and since). (v) Whether the two verbs have a common synonym from their synsets in WordNet (Fellbaum, 1998). (vi) Whether the input event mentions have a common derivational form derived from WordNet. (vii) The head word of the preposition phrase that covers each verb, respectively.
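As an illustration only, features (ii)-(iv) can be sketched with plain string operations. The function name `surface_features`, the feature keys, and the input format (a token list plus two token indices) are assumptions made for this sketch; the POS, WordNet, and preposition-phrase features need external tools and are omitted.

```python
# Hypothetical sketch of features (ii)-(iv); not the feature extractor used in the paper.
MODALS = {"will", "would", "can", "could", "may", "might"}
CONNECTIVES = {"before", "after", "since"}  # illustrative subset named in the text


def surface_features(tokens, i, j):
    """Return token distance, modals, and connectives for event tokens at i < j."""
    between = {t.lower() for t in tokens[i + 1:j]}
    return {
        "token_distance": j - i,
        "modals_between": sorted(between & MODALS),
        "connectives_between": sorted(between & CONNECTIVES),
    }


tokens = "They asked before anyone could help".split()
feats = surface_features(tokens, 1, 5)  # events: "asked" (index 1), "help" (index 5)
```

In this toy sentence the two event mentions are four tokens apart, with one connective and one modal between them.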

Learning
With the features defined above, we need to train a system that can annotate the TempRels in each document. The TimeBank-Dense dataset (TB-Dense) is known to have the best quality in terms of its high density of TempRels and is a benchmark dataset for the TempRel extraction task. It contains 36 documents from TimeBank (Pustejovsky et al., 2003) which were re-annotated using a dense event ordering framework. We follow its label set (denoted by R) of before, after, includes, included, equal, and vague in this study.
Due to the slight event annotation difference in TBDense, we collect our training data as follows. We first extract all the verb semantic frames from the raw text of TBDense. Then we only keep those semantic frames that are matched to an event in TBDense (about 85% semantic frames are kept in this stage). By doing so, we can simply use the TempRel annotations provided in TBDense. Hereafter the TBDense dataset used in this paper refers to this version unless otherwise specified.
We group the TempRels by the sentence distance of the two events of each relation. Then we use the averaged perceptron algorithm (Freund and Schapire, 1998) implemented in the Illinois LBJava package (Rizzolo and Roth, 2010) to learn from the training data described above. Since only relations that have sentence distance 0 or 1 are annotated in TBDense, we have two classifiers, one for same-sentence relations and one for neighboring-sentence relations.
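For readers unfamiliar with the averaged perceptron, a minimal multi-class version is sketched below in pure Python. This is not the LBJava implementation; the class name, the sparse feature-dictionary format, and the naive per-step averaging are assumptions chosen for brevity.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Minimal multi-class averaged perceptron over sparse feature dicts (a sketch)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)      # current weights, keyed by (label, feature)
        self.total = defaultdict(float)  # sum of the weights over all training steps
        self.steps = 0

    def score(self, feats, label, weights):
        return sum(weights.get((label, f), 0.0) * v for f, v in feats.items())

    def predict(self, feats, weights=None):
        weights = self.w if weights is None else weights
        return max(self.labels, key=lambda y: self.score(feats, y, weights))

    def fit(self, data, epochs=5):
        """Train on (feats, gold) pairs and return the averaged weight dict."""
        for _ in range(epochs):
            for feats, gold in data:
                guess = self.predict(feats)
                if guess != gold:  # standard perceptron update on a mistake
                    for f, v in feats.items():
                        self.w[(gold, f)] += v
                        self.w[(guess, f)] -= v
                for k, v in self.w.items():  # accumulate after every example
                    self.total[k] += v
                self.steps += 1
        return {k: v / self.steps for k, v in self.total.items()}


data = [({"conn=before": 1.0}, "before"), ({"conn=after": 1.0}, "after")]
model = AveragedPerceptron(["before", "after", "vague"])
avg_w = model.fit(data)
preds = [model.predict(feats, avg_w) for feats, _ in data]
```

Following the text, one such classifier would be trained for same-sentence pairs and another for neighboring-sentence pairs.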
In all subsequent analysis, we combined Train and Dev and we performed 3-fold cross validation on the 27 documents (in total about 10K relations) to tune the parameters in any classifier.

Inference
When generating TEMPROB, we need to process a large number of articles, so we adopt the greedy inference strategy described earlier due to its computational efficiency (Mirza and Tonelli, 2016). Specifically, we apply the same-sentence relation classifier before the neighboring-sentence relation classifier; whenever a new relation is added in an article, a transitive graph closure is performed immediately. As a result, an edge that has already been labeled during the closure phase is not labeled again, so conflicts are avoided.
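The greedy procedure can be sketched as follows. To keep the example short, it is restricted to the labels before, after, and vague and only applies the transitivity of before and after; the full label set needs a larger composition table. The function names and the `classify` callback are hypothetical, and the caller is assumed to list same-sentence pairs ahead of neighboring-sentence pairs.

```python
from itertools import product

REVERSE = {"before": "after", "after": "before", "vague": "vague"}


def close(graph, nodes):
    """Add every edge implied by transitivity of before/after until nothing changes."""
    changed = True
    while changed:
        changed = False
        for a, b, c in product(nodes, repeat=3):
            if len({a, b, c}) < 3:
                continue
            r = graph.get((a, b))
            if r in ("before", "after") and graph.get((b, c)) == r and (a, c) not in graph:
                graph[(a, c)] = r
                graph[(c, a)] = REVERSE[r]
                changed = True


def greedy_inference(nodes, candidate_pairs, classify):
    """Label pairs in the given order; pairs already fixed by closure are skipped."""
    graph = {}
    for a, b in candidate_pairs:
        if (a, b) in graph:
            continue
        r = classify(a, b)
        graph[(a, b)] = r
        graph[(b, a)] = REVERSE[r]
        close(graph, nodes)
    return graph


# Toy local classifier that would mislabel (e1, e3) if it were ever asked.
local = {("e1", "e2"): "before", ("e2", "e3"): "before", ("e1", "e3"): "after"}
graph = greedy_inference(["e1", "e2", "e3"], list(local), lambda a, b: local[(a, b)])
```

Here the closure fixes (e1, e3) as before once the first two edges are labeled, so the inconsistent local prediction is never used.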

Corpus
As mentioned earlier, the source corpus on which we construct TEMPROB is comprised of NYT articles from 20 years (1987–2007). It contains more than 1 million documents, and we extract events and corresponding features from each document using the Illinois Curator package (Clarke et al., 2012) on Amazon Web Services (AWS) Cloud. In total, we discovered 51K unique verb semantic frames and 80M relations among them in the NYT corpus (15K of the verb frames had more than 20 relations extracted and 9K had more than 100 relations).

Interesting Statistics
We first describe the notation that we are going to use. We denote the set of all verb semantic frames by V. Let D_i, i = 1, ..., N, be the i-th document in our corpus, where N is the total number of documents. Let G_i = (V_i, E_i) be the temporal graph inferred from D_i using the approach described above, where V_i ⊆ V is the set of verbs/events extracted in D_i and E_i is the edge set of G_i, which is composed of TempRel triplets; specifically, a TempRel triplet (v_m, v_n, r_mn) ∈ E_i represents that in document D_i, the TempRel between v_m and v_n is r_mn. Due to the symmetry in TempRels, we only keep the triplets with m < n in E_i. Assuming that the verbs in V_i are ordered by their appearance order in text, m < n means that in the i-th document, v_m appears earlier in text than v_n does.
Because "one event is temporally before another" is easily confused with "one event appears before another in text", we will refer to temporally before as T-Before and physically before as P-Before. Using this language, for example, E_i only keeps the triplets in which v_m is P-Before v_n in D_i.

Extreme cases
We first show extreme cases in which some events are almost always labeled as T-Before or T-After in the corpus. Specifically, for each pair of verbs v_i, v_j ∈ V, we define the following ratios:

$$\eta_b = \frac{C(v_i, v_j, \textit{before})}{C(v_i, v_j, \textit{before}) + C(v_i, v_j, \textit{after})}, \qquad \eta_a = 1 - \eta_b, \tag{1}$$

where

$$C(v_i, v_j, r) = \sum_{k=1}^{N} \sum_{(v_m, v_n, r_{mn}) \in E_k} I\{v_m = v_i \wedge v_n = v_j \wedge r_{mn} = r\}, \tag{2}$$

and I{·} is the indicator function. The add-one smoothing technique from language modeling is used to avoid divide-by-zero errors. In Table 2, we show some event pairs with either η_b > 0.9 (upper part) or η_a > 0.9 (lower part). We think the examples from Table 2 are intuitively appealing: chop happens before taste, clean happens after contaminate, etc. More interestingly, in the lower part of the table, we show pairs in which the physical order is different from the temporal order: for example, when achieve is P-Before desire, it is still labeled as T-After in most cases (104 out of 111 times), which is intuitively correct. In practice, e.g., in the TBDense dataset, roughly 30%-40% of the P-Before pairs are T-After. Therefore, it is important to be able to capture their temporal order rather than simply taking their physical order if one wants to understand the temporal implication of verbs.
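The ratios can be computed directly from the extracted triplets. The sketch below assumes that η_b is the fraction of T-Before counts among the T-Before and T-After counts of a P-Before-ordered pair, and that add-one smoothing means adding one to each of the two counts; the exact smoothing used to build TEMPROB is not specified here, and the function names and toy data are hypothetical.

```python
from collections import Counter


def count_triplets(graphs):
    """C(v_i, v_j, r): times v_i is P-Before v_j with label r, summed over documents."""
    counts = Counter()
    for edges in graphs:  # one list of (v_m, v_n, r_mn) triplets per document
        counts.update(edges)
    return counts


def eta(counts, vi, vj):
    """Return (eta_b, eta_a) with add-one smoothing (assumed form)."""
    b = counts[(vi, vj, "before")] + 1
    a = counts[(vi, vj, "after")] + 1
    eta_b = b / (a + b)
    return eta_b, 1.0 - eta_b


# Toy corpus of two documents (not real TEMPROB counts).
graphs = [
    [("ask", "help", "before"), ("ask", "help", "before")],
    [("ask", "help", "before"), ("ask", "help", "after")],
]
counts = count_triplets(graphs)
eta_b, eta_a = eta(counts, "ask", "help")
unseen_b, unseen_a = eta(counts, "chop", "taste")
```

With this form of smoothing, a pair that never occurs in the corpus gets η_b = η_a = 0.5, i.e., no preference for either order.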

Distribution of Following Events
For each verb v, we define the marginal count of v being P-Before arbitrary verbs with TempRel r ∈ R as C(v, r) = Σ_{v_i ∈ V} C(v, v_i, r). Then for every other verb v', we define

$$P(v \text{ T-Before } v' \mid v \text{ T-Before}) = \frac{C(v, v', \textit{before})}{C(v, \textit{before})}, \tag{3}$$

which is the probability of v T-Before v', conditioned on v T-Before anything. Similarly, we define

$$P(v \text{ T-After } v' \mid v \text{ T-After}) = \frac{C(v, v', \textit{after})}{C(v, \textit{after})}. \tag{4}$$

For a specific verb, e.g., v = investigate, each verb v' ∈ V is sorted by the two conditional probabilities above. Then the most probable verbs that temporally precede or follow v are shown in Fig. 1, where the y-axes are the corresponding conditional probabilities. We can see reasonable event sequences like {involve, kill, suspect, steal}→investigate→{report, prosecute, pay, punish}, which indicates the possibility of using TEMPROB for event sequence prediction or story cloze tasks. There are also suspicious pairs like know in the T-Before list of investigate (Fig. 1a), report in the T-Before list of bomb (Fig. 1b), and play in the T-After list of mourn (Fig. 1c). Since the arguments of these verb frames are not considered here, whether these few seemingly counter-intuitive pairs come from system error or from a special context needs further investigation.
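A ranking of this kind can be sketched by normalizing pair counts by the marginal count. The function name, the count-dictionary format keyed by (v, v', r), and the toy numbers are assumptions for illustration and are not taken from TEMPROB.

```python
from collections import Counter


def following_distribution(counts, v, relation="before"):
    """Rank verbs v' by C(v, v', relation) / C(v, relation), highest first."""
    pair_counts = {
        vj: c for (vi, vj, r), c in counts.items() if vi == v and r == relation
    }
    marginal = sum(pair_counts.values())  # C(v, relation)
    if marginal == 0:
        return []
    ranked = ((vj, c / marginal) for vj, c in pair_counts.items())
    return sorted(ranked, key=lambda item: -item[1])


# Toy counts for illustration only.
counts = Counter({
    ("investigate", "report", "before"): 3,
    ("investigate", "prosecute", "before"): 1,
    ("investigate", "kill", "after"): 2,
})
dist = following_distribution(counts, "investigate", "before")
```

Calling the same function with `relation="after"` gives the corresponding ranking for the T-After case.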

Experiments
Above, we have explained the construction of TEMPROB and shown some interesting examples from it, which were meant to illustrate its correctness. In this section, we first quantify the correctness of the prior obtained in TEMPROB, and then show that TEMPROB can be used to improve existing TempRel extraction systems.

Figure 1: Top events that most frequently precede or follow "investigate", "bomb", "mourn", or "sentence" in time, sorted by their conditional probabilities. Word senses have been disambiguated and the "bomb" and "sentence" here are their verb meanings. There are some possible errors (e.g., report is T-Before bomb) and some unclear pairs (e.g., know is T-Before investigate and play is T-After mourn), but overall the event sequences discovered here are reasonable. More examples can be found in the appendix.

Quality Analysis of TEMPROB
In Table 2, we showed examples with either η_b or η_a > 0.9 and argued that they seem correct. Here we quantify the "correctness" of η_b and η_a based on TBDense. Specifically, we collected all the gold T-Before and T-After pairs. Let τ ∈ [0.5, 1) be a constant threshold. Imagine a naive predictor such that for each pair of events v_i and v_j, if η_b > τ, it predicts that v_i is T-Before v_j; if η_a > τ, it predicts that v_i is T-After v_j; otherwise, it predicts that v_i is T-Vague to v_j. We expect that a higher η_b (or η_a) represents a higher confidence for an instance to be labeled T-Before (or T-After). Table 3 shows the performance of this predictor, which meets our expectation and thus justifies the validity of TEMPROB. As we gradually increase the value of τ in Table 3, the precision increases at roughly the same pace as τ, which indicates that the values of η_b and η_a from TEMPROB indeed represent the confidence level. The decrease in recall is also expected because more examples are labeled as T-Vague when τ is larger.
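The thresholding predictor is simple enough to sketch in a few lines. The precision and recall definitions below (precision over non-vague predictions, recall over all gold T-Before/T-After pairs) are an assumption about how such a predictor would typically be scored, and the η_b values are toy numbers.

```python
def predict(eta_b, tau):
    """Thresholding predictor: before, after, or vague, given eta_b and eta_a = 1 - eta_b."""
    if eta_b > tau:
        return "before"
    if 1.0 - eta_b > tau:
        return "after"
    return "vague"


def precision_recall(examples, tau):
    """examples: list of (eta_b, gold) pairs with gold in {"before", "after"}."""
    preds = [(predict(e, tau), gold) for e, gold in examples]
    decided = [(p, gold) for p, gold in preds if p != "vague"]
    correct = sum(p == gold for p, gold in decided)
    precision = correct / len(decided) if decided else 0.0
    recall = correct / len(preds) if preds else 0.0
    return precision, recall


# Toy eta_b values paired with gold labels (illustrative only).
examples = [(0.9, "before"), (0.6, "before"), (0.55, "after"), (0.2, "after")]
low = precision_recall(examples, tau=0.5)
high = precision_recall(examples, tau=0.7)
```

On this toy data, raising τ from 0.5 to 0.7 trades recall for precision, which is the qualitative trend described for Table 3.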
To further justify the quality, we also used another dataset that is not in the TempRel domain: the EventCausality dataset (Do et al., 2011). For each causally related pair e1 and e2, if EventCausality annotates that e1 causes e2, we changed it to T-Before; if EventCausality annotates that e1 is caused by e2, we changed it to T-After. Therefore, based on the assumption that the cause event is T-Before the result event, we converted the EventCausality dataset to a TempRel dataset, which could thus also be used to evaluate the quality of TEMPROB. We adopted the same predictor used in Table 3 with τ = 0.5, and in Table 4 we compared it with two baselines: (i) always predicting T-Before and (ii) always predicting T-After. First, the accuracy (66.2%) in Table 4 is rather consistent with its counterpart in Table 3, confirming the stability of statistics from TEMPROB. Second, by directly using the prior statistics η_b and η_a from TEMPROB, we can improve the precision of both labels by a significant margin relative to the two baselines (17.0% for "T-Before" and 15.9% for "T-After"). Overall, the accuracy was improved by 11.5%.

Table 4 (columns: System, T-Before, T-After, Acc.): Further justification of η_b and η_a from TEMPROB on the EventCausality dataset. The thresholding predictor from Table 3 with τ = 0.5 is used here. Compared to always predicting the majority label (i.e., T-Before in this case), τ = 0.5 significantly improved the performance for both labels, with the overall accuracy improved by 11.5%.

Improving TempRel Extraction
The original purpose of TEMPROB was to improve TempRel extraction. We show this from two perspectives: how effective the prior distributions obtained from TEMPROB are (i) as features in local methods and (ii) as regularization terms in global methods. The results below were evaluated on the test split of TB-Dense.

Improving Local Methods
We first test how well the prior distributions from TEMPROB can be used as features in improving local methods for TempRel extraction. In Table 5, we used the original feature set proposed in Sec. 3.2.1 as the baseline, and added the prior distribution obtained from TEMPROB on top of it. Specifically, we added η_b (see Eq. (1)) and {f_r}_{r∈R}, respectively, where {f_r}_{r∈R} is the prior distribution over all labels, i.e.,

$$f_r(v_i, v_j) = \frac{C(v_i, v_j, r)}{\sum_{r' \in R} C(v_i, v_j, r')}, \quad r \in R. \tag{5}$$

Recall that the function C is defined in Eq. (2). All comparisons were decomposed into same-sentence relations (Dist=0) and neighboring-sentence relations (Dist=1) for a better understanding of the behavior. All classifiers were trained using the averaged perceptron algorithm (Freund and Schapire, 1998) and tuned by 3-fold cross validation. From Table 5, we can see that simply adding η_b to the feature set improves the original system's F_1 by 1.8% (Dist=0) and 3.0% (Dist=1). If we further add as features the full set of prior distributions {f_r}_{r∈R}, the improvement comes to 2.7% and 6.5%, respectively. The feature is more helpful for Dist=1; we think that this is because distant pairs usually have less lexical dependency and thus need more of the prior information provided by our new feature. With Dist=0 and Dist=1 combined (numbers not shown in the table), the 3rd line improved the "original" by 4.7% in F_1 and by 5.1% in the temporal awareness F-score (another metric used in the TempEval3 workshop).

Note that the results here are not directly comparable to those in Table 3, because in Table 3 only T-Before and T-After examples are considered, whereas here all labels are taken into account and the problem is more practical and harder.
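Using the prior as features amounts to appending a normalized count vector to each pair's feature dictionary. In the sketch below, the feature names, the toy counts, and the uniform fallback for unseen verb pairs are assumptions made for illustration, not a description of the authors' implementation.

```python
RELATIONS = ["before", "after", "includes", "included", "equal", "vague"]


def prior_features(counts, vi, vj):
    """Return {f_r} for the pair (vi, vj) as a feature dict, normalized over RELATIONS."""
    raw = {r: counts.get((vi, vj, r), 0) for r in RELATIONS}
    total = sum(raw.values())
    if total == 0:  # unseen pair: fall back to a uniform prior (an assumption)
        return {f"prior_{r}": 1.0 / len(RELATIONS) for r in RELATIONS}
    return {f"prior_{r}": c / total for r, c in raw.items()}


# Toy counts for illustration only.
counts = {("ask", "help", "before"): 6, ("ask", "help", "after"): 3,
          ("ask", "help", "vague"): 1}
feats = {"token_distance": 4.0}              # stand-in for the original feature set
feats.update(prior_features(counts, "ask", "help"))
```

The augmented dictionary can then be passed to whichever local classifier is being trained.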

Improving Global Methods
As mentioned earlier in Sec. 2, many systems adopt a global inference method via integer linear programming (ILP) (Roth and Yih, 2004) to enforce transitivity constraints over an entire temporal graph (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012; Ning et al., 2017). In addition to the usage shown in Sec. 4.2.1, the prior distributions from TEMPROB can also be used to regularize the conventional ILP formulation. Specifically, in each document, let I_r(ij) ∈ {0, 1} be the indicator function of relation r for event i and event j; let x_r(ij) ∈ [0, 1] be the corresponding softmax score obtained from the local classifiers (depending on the sentence distance between i and j). Then the ILP objective for global inference is formulated as follows.

$$\hat{I} = \arg\max_{I} \sum_{ij \in E} \sum_{r \in R} \big( x_r(ij) + \lambda\, \underline{f_r(ij)} \big)\, I_r(ij) \tag{6}$$

$$\text{s.t.} \quad \sum_{r \in R} I_r(ij) = 1, \qquad I_r(ij) = I_{\bar{r}}(ji), \qquad I_{r_1}(ij) + I_{r_2}(jk) - \sum_{m=1}^{M} I_{r_3^m}(ik) \le 1,$$

for all distinct events i, j, and k, where E = {ij | sentence dist(i, j) ≤ 1}, λ adjusts the regularization term and was heuristically set to 0.5 in this work, r̄ is the reverse relation of r, and M is the number of possible relations for r_3 when r_1 and r_2 are true. Note that our difference from the ILP in Ning et al. (2017) is the underlined regularization term f_r(ij) (which itself is defined in Eq. (5)) obtained from TEMPROB. We present our results on the test split of TB-Dense in Table 6, which is an ablation study showing step-by-step improvements in two metrics. In addition to the straightforward precision, recall, and F_1 metrics, we also compared the F_1 of the temporal awareness metric used in TempEval3 (UzZaman et al., 2013). The awareness metric performs graph reduction and closure before evaluation so as to better capture how useful a temporal graph is. Details of this metric can be found in UzZaman and Allen (2011). Broken down by label, incorporating TEMPROB improves the recall of before and after and improves the precision of all labels, with a slight drop in the recall of vague.
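To make the role of the regularization term in Eq. (6) concrete, the sketch below maximizes the sum of x_r(ij) + λ f_r(ij) over consistent label assignments for a three-event graph. It replaces the ILP solver with brute-force enumeration, restricts the label set to before, after, and vague, enforces only the transitivity of before and after, and assumes every event pair is in E; all names and scores are hypothetical toy values.

```python
from itertools import permutations, product

LABELS = ("before", "after", "vague")
REVERSE = {"before": "after", "after": "before", "vague": "vague"}


def label_of(assign, i, j):
    """Read the label of (i, j), using the symmetry constraint for reversed pairs."""
    return assign[(i, j)] if (i, j) in assign else REVERSE[assign[(j, i)]]


def is_transitive(assign, events):
    for i, j, k in permutations(events, 3):
        r = label_of(assign, i, j)
        if r != "vague" and label_of(assign, j, k) == r and label_of(assign, i, k) != r:
            return False
    return True


def global_inference(events, x, f, lam=0.5):
    """Exhaustive stand-in for ILP: best transitive assignment under x + lam * f."""
    pairs = sorted(x)
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=len(pairs)):
        assign = dict(zip(pairs, labels))
        if not is_transitive(assign, events):
            continue
        score = sum(x[p][r] + lam * f[p][r] for p, r in assign.items())
        if score > best_score:
            best, best_score = assign, score
    return best


# Toy local scores: the local argmax for (0, 2) is "after", which breaks transitivity.
x = {(0, 1): {"before": 0.7, "after": 0.2, "vague": 0.1},
     (1, 2): {"before": 0.6, "after": 0.3, "vague": 0.1},
     (0, 2): {"before": 0.35, "after": 0.45, "vague": 0.2}}
# Toy prior scores standing in for f_r(ij).
f = {(0, 1): {"before": 0.5, "after": 0.3, "vague": 0.2},
     (1, 2): {"before": 0.5, "after": 0.3, "vague": 0.2},
     (0, 2): {"before": 0.8, "after": 0.1, "vague": 0.1}}
best = global_inference([0, 1, 2], x, f, lam=0.5)
```

Enumeration is exponential in the number of pairs, so it is only usable for tiny graphs; an ILP solver is what makes the same objective tractable at document scale.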
In Table 6, the baseline used the original feature set proposed in Sec. 3.2.1 and applied global ILP inference with transitivity constraints. Technically, it solves Eq. (6) with λ = 0 (i.e., unregularized) on top of the original system in Table 5. Apart from some implementation details, this baseline is also the same as many existing global methods such as Chambers and Jurafsky (2008) and Do et al. (2012). System 2, "+Feature: {f_r}_{r∈R}", adds prior distributions as features when training the local classifiers. Technically, the scores x_r(ij) in Eq. (6) used by the baseline were changed. We know from Table 5 that adding {f_r}_{r∈R} made the local decisions better. Here the performance of System 2 shows that this was also the case for the global decisions made via ILP: both precision and recall improved, and F_1 and awareness were both improved by a large margin, with 5.1% in F_1 and 6.6% in awareness F_1. On top of this, System 3 sets λ = 0.5 in Eq. (6) to add regularization to the conventional ILP formulation. The sum of these regularization terms represents a confidence score of how coherent the predicted temporal graph is with TEMPROB, which we also want to maximize. Even though a considerable amount of information from TEMPROB had already been encoded as features (as shown by the large improvements by System 2), these regularization terms were still able to further improve the precision, recall and awareness scores. To sum up, the total improvement over the baseline system brought by TEMPROB is 5.9% in F_1 and 7.1% in awareness F_1, both notable margins. Table 7 further decomposes this improvement by TempRel label. To compare with state-of-the-art systems, which all used gold event properties (i.e., Tense, Aspect, Modality, and Polarity), we retrained System 3 in Table 6 with these gold properties and show the results in Table 8. We reproduced the results of CAEVO and Ning et al. (2017) 10 and evaluated them on the partial TBDense test split 11. Under both metrics, the proposed system achieved the best performance. An interesting fact is that even without these gold properties, our System 3 in Table 6 was already better than CAEVO (on Line 1) and Ning et al. (2017) (on Line 2) in both metrics. This is appealing because in practice, those gold properties may not exist, but our proposed system can still achieve state-of-the-art performance without them.
For readers who are interested in the complete TBDense dataset, we also performed a naive augmentation as follows. Recall that System 3 only makes predictions on a subset of the complete TB-Dense dataset. We kept this subset of predictions and filled in the missing predictions with those of Ning et al. (2017). The performance of this naively augmented system is compared with CAEVO and Ning et al. (2017) on the complete TBDense dataset. We can see that by replacing part of its predictions with those from our proposed system, Ning et al. (2017) obtained better precision, recall, F_1, and awareness F_1, which is the new state of the art among all reported performances on this dataset. Note that the awareness F_1 scores on Lines 4-5 are consistent with the values reported in Ning et al. (2017). To our knowledge, the results in Table 8 are the first in the literature to report performance in both metrics, and it is promising to see that the proposed method outperformed state-of-the-art methods in both metrics.

Conclusion
Temporal relation (TempRel) extraction is an important and challenging task in NLP, partly due to its strong dependence on prior knowledge. Motivated by practical examples, this paper argues that a resource of the temporal order that events usually follow is helpful. To construct such a resource, we automatically processed a large corpus from NYT with more than 1 million documents using an existing TempRel extraction system and obtained the TEMporal relation PRObabilistic knowledge Base (TEMPROB). TEMPROB is a good showcase of the capability of such prior knowledge, and it has shown its power in improving existing TempRel extraction systems on a benchmark dataset, TBDense. The resource and the system reported in this paper are both publicly available, and we hope that they can foster more investigations into time-related tasks.

10 http://cogcomp.org/page/publication_view/822
11 There are 731 relations in the partial TBDense test split (201 before, 138 after, 39 includes, 31 included, 14 equal, and 308 vague).