Joint Reasoning for Temporal and Causal Relations

Understanding temporal and causal relations between events is a fundamental natural language understanding task. Because a cause must occur earlier than its effect, temporal and causal relations are closely related and one relation often dictates the value of the other. However, limited attention has been paid to studying these two relations jointly. This paper presents a joint inference framework for them using constrained conditional models (CCMs). Specifically, we formulate the joint problem as an integer linear programming (ILP) problem, enforcing constraints that are inherent in the nature of time and causality. We show that the joint inference framework results in statistically significant improvement in the extraction of both temporal and causal relations from text.


Introduction
Understanding events is an important component of natural language understanding. An essential step in this process is identifying relations between events, which are needed in order to support applications such as story completion, summarization, and timeline construction.
Among the many relation types that could exist between events, this paper focuses on the joint extraction of temporal and causal relations. It is well known that temporal and causal relations interact with each other and, in many cases, the decision about one relation is made primarily based on evidence from the other. In Example 1, identifying the temporal relation between e1:died and e2:exploded is in fact a very hard case: there are no explicit temporal markers (e.g., "before", "after", or "since"); the events are in separate sentences, so their syntactic connection is weak; and although the occurrence time of e2:exploded is given in the text (i.e., Friday), it is not given for e1:died. However, given the causal relation, e2:exploded caused e1:died, it is clear that e2:exploded happened before e1:died. The temporal relation is dictated by the causal relation.
Footnote 1: The dataset and code used in this paper are available at http://cogcomp.org/page/publication_view/835
On the other hand, causal relation extraction can also benefit from knowing temporal relations. In Example 2, it is unclear whether the government stifled people because people raged, or people raged because the government stifled people: both situations are logically reasonable. However, if we account for the temporal relation (that is, e4:stifle happened before e3:raged), it is clear that e4:stifle is the cause and e3:raged is the effect. In this case, the causal relation is dictated by the temporal relation.
The first contribution of this work is proposing a joint framework for Temporal and Causal Reasoning (TCR), inspired by these examples. Assuming the availability of a temporal extraction system and a causal extraction system, the proposed joint framework combines these two using a constrained conditional model (CCM) (Chang et al., 2012) framework, with an integer linear programming (ILP) objective (Roth and Yih, 2004) that enforces declarative constraints during the inference phase. Specifically, these constraints include: (1) A cause must temporally precede its effect. (2) Symmetry constraints, i.e., when a pair of events, (A, B), has a temporal relation r (e.g., before), then (B, A) must have the reverse relation of r (e.g., after). (3) Transitivity constraints, i.e., the relation between (A, C) must be temporally consistent with the relation derived from (A, B) and (B, C). These constraints originate from the one-dimensional nature of time and the physical nature of causality and build connections between temporal and causal relations, making CCM a natural choice for this problem. As far as we know, very limited work has been done on the joint extraction of both relations. Formulating the joint problem in the CCM framework is novel and thus the first contribution of this work.
A key obstacle in jointly studying temporal and causal relations lies in the absence of jointly annotated data. The second contribution of this work is the development of such a jointly annotated dataset, which we built by augmenting the EventCausality dataset (Do et al., 2011) with dense temporal annotations. This dataset allows us to show statistically significant improvements on both relations via the proposed joint framework. This paper also presents an empirical result of improving the temporal extraction component. Specifically, we incorporate explicit time expressions present in the text and high-precision knowledge-based rules into the ILP objective. These sources of information have been successfully adopted by existing methods (Mirza and Tonelli, 2016), but were never used within a global ILP-based inference method. Results on TimeBank-Dense, a benchmark dataset with temporal relations only, show that these modifications can also be helpful within ILP-based methods.

Related Work
Temporal and causal relations can both be represented by directed acyclic graphs, where the nodes are events and the edges are labeled with either before, after, etc. (in temporal graphs), or causes and caused by (in causal graphs). Existing work on temporal relation extraction was initiated by (Mani et al., 2006; Chambers et al., 2007; Bethard et al., 2007; Verhagen and Pustejovsky, 2008), which formulated the problem as that of learning a classification model for determining the label of each edge locally (i.e., local methods). A disadvantage of these early methods is that the resulting graph may violate the symmetry and transitivity constraints. There are conceptually two ways to enforce such graph constraints (i.e., global reasoning). The first, used by CAEVO, grows the temporal graph in a multi-sieve manner, where predictions are added sieve-by-sieve. A graph closure operation has to be performed after each sieve to enforce constraints, which amounts to solving the global inference problem greedily. The second way is to perform exact inference via ILP, where the symmetry and transitivity requirements can be enforced as ILP constraints (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012; Ning et al., 2017).

Ex 3: Global considerations are needed when making local decisions.
The FAA on Friday (e5:announced) it will close 149 regional airport control towers because of forced spending cuts. Before Friday's (e6:announcement), it (e7:said) it would consider keeping a tower open if the airport convinces the agency it is in the "national interest" to do so.
We adopt the ILP approach in the temporal component of this work for two reasons. First, as we show later, it is straightforward to build a joint framework with both temporal and causal relations as an extension of it. Second, the relation between a pair of events is often determined by the relations among other events. In Ex 3, if a system is unaware of (e5, e6)=simultaneously when trying to make a decision for (e5, e7), it is likely to predict that e5 is before e7; but, in fact, (e5, e7)=after given the existence of e6. Using global considerations is thus beneficial in this context not only for generating globally consistent temporal graphs, but also for making more reliable pairwise decisions.
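The effect described for Ex 3 can be illustrated with a minimal sketch. It assumes a single composition rule, namely that a simultaneous relation passes the other relation through unchanged; the function name `compose_with_simultaneous` and the single-letter labels are illustrative choices and not part of the system described in this paper.

```python
# Reduced label set: b(efore), a(fter), i(ncludes), ii (is included),
# s(imultaneously), v(ague).
LABELS = {"b", "a", "i", "ii", "s", "v"}


def compose_with_simultaneous(r_ab, r_bc):
    """Return the label of (A, C) when one of the two given relations is 's'.

    Illustrative assumption: if A and B are simultaneous, A inherits B's
    relation to C (and vice versa). All other compositions are out of
    scope for this sketch, so None is returned for them.
    """
    if r_ab not in LABELS or r_bc not in LABELS:
        raise ValueError("unknown label")
    if r_ab == "s":
        return r_bc
    if r_bc == "s":
        return r_ab
    return None


# Ex 3: (e5, e6) = simultaneously, and e7:said precedes e6:announcement,
# i.e. (e6, e7) = after. The only consistent label for (e5, e7) is after.
e5_e7 = compose_with_simultaneous("s", "a")
print(e5_e7)
```

A local classifier that never sees e6 has no access to this chain of evidence, which is why the pairwise decision for (e5, e7) benefits from global inference.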
Prior work on causal relations in natural language text was relatively sparse. Much causal extraction work in other domains assumes the existence of ground truth timestamps (e.g., (Sun et al., 2007; Güler et al., 2016)), but gold timestamps rarely exist in natural language text. In NLP, people have focused on causal relation identification using lexical features or discourse relations. For example, based on a set of explicit causal discourse markers (e.g., "because", "due to", and "as a result"), Hidey and McKeown (2016) built parallel Wikipedia articles and constructed an open set of implicit markers called AltLex. A classifier was then applied to identify causality. Dunietz et al. (2017) used the concept of construction grammar to tag causally related clauses or phrases. Do et al. (2011) considered global statistics over a large corpus, the cause-effect association (CEA) scores, and combined them with discourse relations using ILP to identify causal relations. These works focused only on the causality task and did not address the temporal aspect.
However, as illustrated by Examples 1-2, temporal and causal relations are closely related, as assumed by many existing works (Bethard and Martin, 2008;Rink et al., 2010). Here we argue that being able to capture both aspects in a joint framework provides a more complete understanding of events in natural language documents. Researchers have started paying attention to this direction recently. For example, Mostafazadeh et al. (2016b) proposed an annotation framework, CaTeRs, which captured both temporal and causal aspects of event relations in common sense stories. CATENA (Mirza and Tonelli, 2016) extended the multi-sieve framework of CAEVO to extracting both temporal and causal relations and exploited their interaction through post-editing temporal relations based on causal predictions. In this paper, we push this idea forward and tackle the problem in a joint and more principled way, as shown next.

Temporal and Causal Reasoning
In this section, we explain the proposed joint inference framework, Temporal and Causal Reasoning (TCR). To start with, we focus on introducing the temporal component, and clarify how to design the transitivity constraints and how to enforce other readily available prior knowledge to improve its performance. With this temporal component already explained, we further incorporate causal relations and complete the TCR joint inference framework. Finally, we transform the joint problem into an ILP so that it can be solved using off-the-shelf packages.

Temporal Component
Let $R_T$ be the label set of temporal relations, and let $E$ and $T$ be the set of all events and the set of all time expressions (a.k.a. timexes) in a document. For notational convenience, we use $EE$ to represent the set of all event-event pairs; $ET$ and $TT$ are defined analogously. Given a pair in $EE$ or $ET$, assume for now that we have corresponding classifiers producing confidence scores for every temporal relation in $R_T$. Let them be $s_{ee}(\cdot)$ and $s_{et}(\cdot)$, respectively. Then the inference formulation for all the temporal relations within this document is:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} \sum_{i \in EE} s_{ee}\{i \mapsto Y_i\} + \sum_{j \in ET} s_{et}\{j \mapsto Y_j\} \qquad (1)$$

where $Y_k \in R_T$ is the temporal label of pair $k \in MM$ (with $M = E \cup T$ being the set of all temporal nodes), "$k \mapsto Y_k$" represents the case where the label of pair $k$ is predicted to be $Y_k$, $Y$ is the vectorization of all these $Y_k$ in one document, and $\mathcal{Y}$ is the constrained space that $Y$ lives in.

We do not include the scores for $TT$ because the temporal relationship between timexes can be trivially determined using the normalized dates of these timexes, as was done in (Do et al., 2012; Mirza and Tonelli, 2016). We impose these relations via equality constraints denoted as $\mathcal{Y}_0$. In addition, we add symmetry and transitivity constraints dictated by the nature of time (denoted by $\mathcal{Y}_1$), and other prior knowledge derived from linguistic rules (denoted by $\mathcal{Y}_2$), which will be explained subsequently. Finally, we set $\mathcal{Y} = \mathcal{Y}_0 \cap \mathcal{Y}_1 \cap \mathcal{Y}_2$ in Eq. (1).

Transitivity Constraints. Let the dimension of $Y$ be $n$. Then a standard way to construct the symmetry and transitivity constraints is shown in (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012; Ning et al., 2017):

$$\mathcal{Y}_1 = \{ Y \in R_T^n \mid \forall m_1, m_2, m_3 \in M,\ Y_{(m_1,m_2)} = \bar{Y}_{(m_2,m_1)},\ Y_{(m_1,m_3)} \in \mathrm{Trans}(Y_{(m_1,m_2)}, Y_{(m_2,m_3)}) \}$$

where the bar sign is used to represent the reverse relation hereafter, and $\mathrm{Trans}(r_1, r_2)$ is a set comprised of all the temporal relations from $R_T$ that do not conflict with $r_1$ and $r_2$. The construction of $\mathrm{Trans}(r_1, r_2)$ necessitates a clearer definition of $R_T$, the importance of which is often overlooked by existing methods. Existing approaches all followed the interval representation of events (Allen, 1984), which yields 13 temporal relations (denoted by $\tilde{R}_T$ here) as shown in Fig. 1. Most systems use a reduced set, for example, {before, after, includes, is included, simultaneously, vague}. For notational convenience, we denote them $R_T = \{b, a, i, ii, s, v\}$. Using a reduced set is more convenient in data annotation and leads to better performance in practice.

Figure 1: Two reduction schemes for the temporal label set. "x" means that the label is ignored. Brackets represent time intervals along the time axis. Scheme 2 is adopted consistently in this work.
However, there has been limited discussion in the literature on how to interpret the reduced relation types. For example, is the "before" in $R_T$ exactly the same as the "before" in the original set ($\tilde{R}_T$) (as shown on the left-hand side of Fig. 1), or is it a combination of multiple relations in $\tilde{R}_T$ (the right-hand side of Fig. 1)? We compare two reduction schemes in Fig. 1, where scheme 1 ignores low-frequency labels directly and scheme 2 absorbs low-frequency ones into their temporally closest labels. The two schemes barely differ when a system only looks at a single pair of mentions at a time (this might explain the lack of discussion of this issue in the literature), but they lead to different $\mathrm{Trans}(r_1, r_2)$ sets, and this difference can be magnified when the problem is solved jointly and when the label distribution changes across domains. To completely cover the 13 relations, we adopt scheme 2 in this work.
The resulting transitivity relations are shown in Table 1. The top part of Table 1 is a compact representation of three generic rules; for instance, Line 1 means that the labels themselves are transitive. Note that during human annotation, if an annotator looks at a pair of events and decides that multiple well-defined relations can exist, he/she labels it vague; also, when aggregating the labels from multiple annotators, a label will be changed to vague if the annotators disagree with each other. In either case, vague is chosen to be the label when a single well-defined relation cannot be uniquely determined by the contextual information. This explains why a vague relation ($v$) is always added in Table 1 if more than one label in $\mathrm{Trans}(r_1, r_2)$ is possible. As for Lines 6, 9-11 in Table 1 (where vague appears in Column $r_2$), Column $\mathrm{Trans}(r_1, r_2)$ was designed in such a way that $r_2$ cannot be uniquely determined through $r_1$ and $\mathrm{Trans}(r_1, r_2)$. For instance, $r_1$ is after on Line 9; if we further put before into $\mathrm{Trans}(r_1, r_2)$, then $r_2$ would be uniquely determined to be before, conflicting with $r_2$ being vague, so before should not be in $\mathrm{Trans}(r_1, r_2)$.

Enforcing Linguistic Rules. Besides the transitivity constraints represented by $\mathcal{Y}_1$ above, we also propose to enforce prior knowledge to further constrain the search space for $Y$. Specifically, linguistic insight has resulted in rules for predicting the temporal relations with special syntactic or semantic patterns, as was done in CAEVO (a state-of-the-art method). Since these rule predictions often have high precision, it is worthwhile to incorporate them in global reasoning methods as well.
In the CCM framework, these rules can be represented as hard constraints in the search space for $Y$. Specifically,

$$\mathcal{Y}_2 = \{ Y \in R_T^n \mid Y_j = \mathrm{rule}(j),\ \forall j \in \mathcal{J}^{(\mathrm{rule})} \} \qquad (2)$$

where $\mathcal{J}^{(\mathrm{rule})} \subseteq MM$ is the set of pairs that can be determined by linguistic rules, and $\mathrm{rule}(j) \in R_T$ is the corresponding decision for pair $j$ according to these rules. In this work, we used the same set of rules designed by CAEVO for fair comparison.

Full Model with Causal Relations
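Now that we have presented the joint inference framework for temporal relations in Eq. (1), it is easier to explain our complete TCR framework on top of it. Let $W$ be the vectorization of all causal relations and add the scores from the scoring function for causality, $s_c(\cdot)$, to Eq. (1). Specifically, the full inference formulation is now:

$$\hat{Y}, \hat{W} = \arg\max_{Y \in \mathcal{Y},\, W \in \mathcal{W}_Y} \sum_{i \in EE} s_{ee}\{i \mapsto Y_i\} + \sum_{j \in ET} s_{et}\{j \mapsto Y_j\} + \sum_{k \in EE} s_c\{k \mapsto W_k\} \qquad (3)$$

$$\mathcal{W}_Y = \{ W \in R_C^m \mid \forall i, j \in E,\ \text{if } W_{(i,j)} = c,\ \text{then } W_{(j,i)} = \bar{c} \text{ and } Y_{(i,j)} = b \} \qquad (4)$$

where $m$ is the dimension of $W$ (i.e., the total number of causal pairs), $R_C = \{c, \bar{c}, \mathrm{null}\}$ is the label set for causal relations (i.e., "causes", "caused by", and "no relation"), and $W_{(i,j)}$ is the causal label for pair $(i, j)$. The constraint represented by $\mathcal{W}_Y$ means that if a pair of events $i$ and $j$ is labeled "causes", then the causal relation between $j$ and $i$ must be "caused by", and the temporal relation between $i$ and $j$ must be "before".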
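To make the interaction concrete, the following minimal sketch brute-forces the joint objective for a single ordered event pair. The scores are hypothetical numbers chosen to mimic Example 1 (a temporal classifier that weakly and wrongly prefers before, and a causal classifier that is confident about "caused by"); the function name `joint_inference` and the label string `c_bar` are illustrative, and the real system solves the problem over all pairs with an ILP solver rather than by enumeration.

```python
from itertools import product

TEMPORAL = ["b", "a", "i", "ii", "s", "v"]
CAUSAL = ["c", "c_bar", "null"]  # causes, caused by, no relation


def joint_inference(s_temporal, s_causal):
    """Maximize temporal + causal score for one ordered pair (i, j).

    Only the causal-temporal link is enforced: "causes" requires "before"
    and, through the symmetry of causal labels, "caused by" requires "after".
    """
    best, best_score = None, float("-inf")
    for y, w in product(TEMPORAL, CAUSAL):
        if w == "c" and y != "b":
            continue
        if w == "c_bar" and y != "a":
            continue
        score = s_temporal[y] + s_causal[w]
        if score > best_score:
            best, best_score = (y, w), score
    return best


# Hypothetical scores for the pair (e1:died, e2:exploded).
s_temporal = {"b": 0.40, "a": 0.35, "i": 0.05, "ii": 0.05, "s": 0.05, "v": 0.10}
s_causal = {"c": 0.05, "c_bar": 0.80, "null": 0.15}

local_temporal = max(s_temporal, key=s_temporal.get)  # decision without causal evidence
joint = joint_inference(s_temporal, s_causal)
print(local_temporal, joint)
```

With these toy numbers the local temporal decision is before, while the joint decision switches to after because the strong "caused by" score can only be collected together with an after label.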

Scoring Functions
In the above, we have built the joint framework on top of the scoring functions $s_{ee}(\cdot)$, $s_{et}(\cdot)$ and $s_c(\cdot)$. To get $s_{ee}(\cdot)$ and $s_{et}(\cdot)$, we trained classifiers using the averaged perceptron algorithm (Freund and Schapire, 1998) and the same set of features used in (Do et al., 2012; Ning et al., 2017), and then used the soft-max scores in those scoring functions. For example, that means

$$s_{ee}\{i \mapsto r\} = \frac{\exp(w_r^\top \phi(i))}{\sum_{r' \in R_T} \exp(w_{r'}^\top \phi(i))}$$

where $\{w_r\}$ is the learned weight vector for relation $r \in R_T$ and $\phi(i)$ is the feature vector for pair $i \in EE$.

Given a pair of ordered events, we need $s_c(\cdot)$ to estimate the scores of them being "causes" or "caused by". Since this scoring function has the same nature as $s_{ee}(\cdot)$, we can reuse the features from $s_{ee}(\cdot)$ and learn an averaged perceptron for $s_c(\cdot)$. In addition to these existing features, we also use prior statistics retrieved using our temporal system from a large corpus, so as to know probabilistically which event happens before another event. For example, in Example 1, we have a pair of events, e1:died and e2:exploded. The prior knowledge we retrieved from that large corpus is that die happens before explode with probability 15% and happens after explode with probability 85%. We think this prior distribution is correlated with causal directionality, so it was also added as features when training $s_c(\cdot)$.
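The soft-max scoring step can be sketched as follows. The weights and features are made-up numbers for illustration only; the function name `softmax_scores` is hypothetical, and the sketch covers only the normalization of linear scores, not the averaged perceptron training.

```python
import math


def softmax_scores(weights, features):
    """Turn per-relation linear scores w_r . phi(i) into soft-max scores.

    weights:  dict mapping each relation label to its weight vector
    features: feature vector phi(i) of one pair
    """
    raw = {r: sum(w * x for w, x in zip(w_r, features)) for r, w_r in weights.items()}
    top = max(raw.values())  # subtract the maximum for numerical stability
    exp = {r: math.exp(v - top) for r, v in raw.items()}
    total = sum(exp.values())
    return {r: e / total for r, e in exp.items()}


# Hypothetical two-dimensional weights for three of the labels.
toy_weights = {"b": [1.0, 0.0], "a": [0.0, 1.0], "v": [0.0, 0.0]}
toy_features = [2.0, 0.5]

scores = softmax_scores(toy_weights, toy_features)
print(max(scores, key=scores.get))
```

Because the scores of one pair sum to one, scores from different classifiers are on a comparable scale when they are added in the joint objective.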
Note that the scoring functions here are an implementation choice. The TCR joint framework is fully extensible to other scoring functions.

Convert the Joint Inference into an ILP
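Conveniently, the joint inference formulation in Eq. (3) can be rewritten into an ILP and solved using off-the-shelf optimization packages, e.g., (Gurobi Optimization, Inc., 2012). First, we define indicator variables $y_i^r = \mathbb{I}\{Y_i = r\}$, where $\mathbb{I}\{\cdot\}$ is the indicator function, for all $i \in MM$ and $r \in R_T$, and let $p_i^r$ be the score for $Y_i = r$, i.e., $p_i^r = s_{ee}\{i \mapsto r\}$ if $i \in EE$ and $p_i^r = s_{et}\{i \mapsto r\}$ if $i \in ET$. Similarly, let $w_j^r = \mathbb{I}\{W_j = r\}$ be the indicator variables for $W_j$ and $q_j^r$ be the score for $W_j = r \in R_C$. Therefore, without constraints $\mathcal{Y}$ and $\mathcal{W}_Y$ for now, Eq. (3) can be written as:

$$\hat{Y}, \hat{W} = \arg\max \sum_{i \in EE \cup ET} \sum_{r \in R_T} p_i^r y_i^r + \sum_{j \in EE} \sum_{r \in R_C} q_j^r w_j^r$$
$$\text{s.t.} \quad y_i^r, w_j^r \in \{0, 1\}, \quad \sum_{r \in R_T} y_i^r = \sum_{r \in R_C} w_j^r = 1.$$

The prior knowledge represented as $\mathcal{Y}$ and $\mathcal{W}_Y$ can be conveniently converted into constraints for this optimization problem. Specifically, $\mathcal{Y}_1$ has two components, symmetry and transitivity:

$$y_{i,j}^{r} = y_{j,i}^{\bar{r}}, \qquad y_{i,j}^{r_1} + y_{j,k}^{r_2} - \sum_{r_3 \in \mathrm{Trans}(r_1, r_2)} y_{i,k}^{r_3} \le 1,$$

for all $r, r_1, r_2 \in R_T$ and all nodes $i, j, k$, where $\bar{r}$ is the reverse relation of $r$ (i.e., $\bar{b} = a$, $\bar{i} = ii$, $\bar{s} = s$, and $\bar{v} = v$), and $\mathrm{Trans}(r_1, r_2)$ is defined in Table 1. As for the transitivity constraints, if both $y_{i,j}^{r_1}$ and $y_{j,k}^{r_2}$ are 1, then the constraint requires at least one of $y_{i,k}^{r_3}$, $r_3 \in \mathrm{Trans}(r_1, r_2)$, to be 1, which means the relation between $i$ and $k$ has to be chosen from $\mathrm{Trans}(r_1, r_2)$, which is exactly what $\mathcal{Y}_1$ is intended to do.

The rules in $\mathcal{Y}_2$ are written as

$$y_j^r = \mathbb{I}\{\mathrm{rule}(j) = r\}, \quad \forall j \in \mathcal{J}^{(\mathrm{rule})},$$

where $\mathrm{rule}(j)$ and $\mathcal{J}^{(\mathrm{rule})}$ have been defined in Eq. (2). Converting the $TT$ constraints, i.e., $\mathcal{Y}_0$, into ILP constraints is as straightforward as for $\mathcal{Y}_2$, so we omit it due to limited space. Last, converting the constraints $\mathcal{W}_Y$ defined in Eq. (4) can be done as follows:

$$w_{i,j}^{c} = w_{j,i}^{\bar{c}} \le y_{i,j}^{b}, \quad \forall i, j \in E.$$

The equality part, $w_{i,j}^{c} = w_{j,i}^{\bar{c}}$, represents the symmetry constraint of causal relations; the inequality part, $w_{i,j}^{c} \le y_{i,j}^{b}$, represents that if event $i$ causes event $j$, then $i$ must be before $j$.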
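As a sanity check of the linearized constraints, the sketch below evaluates them on 0/1 indicator assignments. It assumes a single transitivity entry, Trans(b, b) = {b} (before followed by before yields before); the helper names `one_hot`, `transitivity_ok` and `causal_ok` are illustrative and do not correspond to any solver API.

```python
LABELS = ("b", "a", "i", "ii", "s", "v")


def one_hot(label):
    """Indicator variables y^r of a single pair: exactly one entry is 1."""
    return {r: int(r == label) for r in LABELS}


def transitivity_ok(y_ij, y_jk, y_ik, r1, r2, trans):
    """Linear form: y_ij[r1] + y_jk[r2] - sum_{r3 in trans} y_ik[r3] <= 1."""
    return y_ij[r1] + y_jk[r2] - sum(y_ik[r3] for r3 in trans) <= 1


def causal_ok(w_c_ij, y_b_ij):
    """Linear form of 'a cause precedes its effect': w^c_ij <= y^b_ij."""
    return w_c_ij <= y_b_ij


# Assumed entry for this sketch: before composed with before gives before.
TRANS_B_B = {"b"}

# Both premises hold and (i, k) = before: the inequality is satisfied.
ok_consistent = transitivity_ok(one_hot("b"), one_hot("b"), one_hot("b"), "b", "b", TRANS_B_B)
# Both premises hold but (i, k) = after: the inequality is violated.
ok_conflict = transitivity_ok(one_hot("b"), one_hot("b"), one_hot("a"), "b", "b", TRANS_B_B)
# Only one premise holds: the inequality is inactive, whatever (i, k) is.
ok_inactive = transitivity_ok(one_hot("b"), one_hot("a"), one_hot("a"), "b", "b", TRANS_B_B)
print(ok_consistent, ok_conflict, ok_inactive)
```

The same pattern applies to the causal link: `causal_ok(1, 0)` is false because a "causes" label is switched on while "before" is switched off, whereas any assignment with the causal indicator at 0 passes.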

Experiments
In this section, we first show on TimeBank-Dense (TB-Dense) that the proposed framework improves temporal relation identification. We then explain how our new dataset with both temporal and causal relations was collected, and show that the proposed method improves both relations on it.

Temporal Performance on TB-Dense
Multiple datasets with temporal annotations are available thanks to the TempEval (TE) workshops (Verhagen et al., 2007, 2010; UzZaman et al., 2013). The dataset we used here to demonstrate our improved temporal component was TB-Dense, which was annotated on top of 36 documents out of the classic TimeBank dataset (Pustejovsky et al., 2003). The main purpose of TB-Dense was to alleviate the known issue of sparse annotations in the evaluation dataset provided with TE3 (UzZaman et al., 2013), as pointed out in much previous work (Chambers, 2013; Ning et al., 2017). Annotators of TB-Dense were forced to look at each pair of events or timexes within the same sentence or contiguous sentences, so that far fewer links were missed. Since causal link annotation is not available on TB-Dense, we only show our improvement in terms of temporal performance on TB-Dense. The standard train/dev/test split of TB-Dense was used and parameters were tuned to optimize the $F_1$ performance on dev. Gold events and time expressions were also used as in existing systems.

Table 2: Ablation study of the proposed system in terms of the standard temporal awareness metric. The baseline system is to make inference locally for each event pair without looking at the decisions from others. The "+" signs on lines 2-5 refer to adding a new source of information on top of its preceding system, with which the inference has to be global and done via ILP. Each system is significantly different from its preceding one with p<0.05 (McNemar's test).
The contribution of each proposed information source is analyzed in the ablation study shown in Table 2, where we can see that the $F_1$ score improved step by step as new sources of information were added. Recall that $\mathcal{Y}_1$ represents transitivity constraints, $ET$ represents taking event-timex pairs into consideration, and $\mathcal{Y}_2$ represents rules from CAEVO. System 1 is the baseline we are comparing to, which is a local method predicting temporal relations one at a time. System 2 only applied $\mathcal{Y}_1$ via ILP on top of all $EE$ pairs by removing the 2nd term in Eq. (1); for fair comparison with System 1, we added the same $ET$ predictions from System 1. System 3 incorporated $ET$ into the ILP and mainly contributed to an increase in precision (from 42.9 to 44.3); we think that there could be more gain if more time expressions existed in the test set. With the help of additional high-precision rules ($\mathcal{Y}_2$), the temporal performance can be further improved, as shown by System 4. Finally, using the causal extraction obtained via (Do et al., 2011) in the joint framework, the proposed method achieved the best precision, recall, and $F_1$ scores in our ablation study (Systems 1-5). According to McNemar's test (Everitt, 1992; Dietterich, 1998), each of Systems 2-5 was significantly different from its preceding system with p<0.05.
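For readers unfamiliar with McNemar's test, a minimal sketch of one common variant, the exact two-sided binomial form, is given below. The paper does not state which variant was used, so this choice is an assumption, and the discordant counts in the usage lines are invented for illustration rather than taken from Table 2.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items that system A labels correctly and system B labels wrongly
    c: items that system B labels correctly and system A labels wrongly
    Items on which both systems agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as uneven as (b, c) under a fair coin.
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts between two systems.
p_uneven = mcnemar_exact(5, 20)
p_even = mcnemar_exact(10, 10)
print(p_uneven < 0.05, p_even < 0.05)
```

The test looks only at the pairs on which the two systems disagree, which is why a small change in $F_1$ can still be significant when the disagreements point consistently in one direction.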
The second part of Table 2 compares several state-of-the-art systems on the same test set. ClearTK (Bethard, 2013) was the top-performing system in TE3 in temporal relation extraction. Since it was designed for TE3 (not TB-Dense), it expectedly achieved only a moderate recall on the test set of TB-Dense. CAEVO and Ning et al. (2017) were more recent methods and achieved better scores on TB-Dense. Compared with these state-of-the-art methods, the proposed joint system (System 5) achieved the best $F_1$ score with a major improvement in recall. We think the low precision compared to System 8 is due to the lack of structured learning, and the low precision compared to System 7 is propagated from the baseline (System 1), which was tuned to maximize its $F_1$ score. However, the effectiveness of the proposed information sources is already justified in Systems 1-5.

Data Preparation
TB-Dense only has temporal relation annotations, so in the evaluations above, we only evaluated our temporal performance. One existing dataset with both temporal and causal annotations available is the Causal-TimeBank dataset (Causal-TB) (Mirza and Tonelli, 2014). However, Causal-TB is sparse in temporal annotations and is even sparser in causal annotations: in Table 3, we can see that with four times more documents, Causal-TB still has fewer temporal relations (denoted as T-Links therein) than TB-Dense; as for causal relations (C-Links), it has fewer than two causal relations in each document on average. Note that the T-Link sparsity of Causal-TB originates from TimeBank, which is known to have missing links (Ning et al., 2017). The C-Link sparsity was a design choice of Causal-TB in which C-Links were annotated based on only explicit causal markers (e.g., "A happened because of B").
Another dataset with parallel annotations is CaTeRs (Mostafazadeh et al., 2016b), which was primarily designed for the Story Cloze Test (Mostafazadeh et al., 2016a) based on common sense stories. It differs from the newswire domain that we are working on. Therefore, we decided to augment the EventCausality dataset provided in Do et al. (2011) with a modified version of the dense temporal annotation scheme proposed in TB-Dense and use this new dataset to showcase the proposed joint approach.

Table 3: Statistics of our new dataset with both temporal and causal relations annotated, compared with existing datasets. T-Link: Temporal relation. C-Link: Causal relation. The new dataset is much denser than Causal-TB in both T-Links and C-Links.
The EventCausality dataset provides relatively dense causal annotations on 25 newswire articles collected from CNN in 2010. As shown in Table 3, it has more than 20 C-Links annotated per document on average (10 times denser than Causal-TB). However, one issue is that its notion of events is slightly different from that in the temporal relation extraction regime. To construct parallel annotations of both temporal and causal relations, we preprocessed all the articles in EventCausality using ClearTK to extract events and then manually removed some obvious errors in them. To annotate temporal relations among these events, we adopted the annotation scheme from TB-Dense, given its success in mitigating the issue of missing annotations, with the following modifications. First, we used a crowdsourcing platform, CrowdFlower, to collect temporal relation annotations. For each decision of temporal relation, we asked 5 workers to annotate and chose the majority label as our final annotation. Second, we discovered that comparisons involving ending points of events tend to be ambiguous and suffer from low inter-annotator agreement (IAA), so we asked the annotators to label relations based on the starting points of each event. This simplification does not change the nature of temporal relation extraction but leads to better annotation quality. For more details about this data collection scheme, please refer to (Ning et al., 2018b).
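The majority-label aggregation over five workers can be sketched as below. The text above does not say how ties (e.g., a 2-2-1 split) were resolved, so falling back to vague in that case is an assumption made only for this illustration, and the function name `majority_label` is hypothetical.

```python
from collections import Counter


def majority_label(votes, tie_label="v"):
    """Return the most frequent label among the workers' votes.

    Assumption of this sketch: when the top two labels are tied, the pair
    is labeled vague ("v"), since no single relation is clearly preferred.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Five hypothetical worker votes for one event pair.
clear_case = majority_label(["b", "b", "a", "b", "v"])
tied_case = majority_label(["b", "b", "a", "a", "v"])
print(clear_case, tied_case)
```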

Results
Results on our new dataset jointly annotated with both temporal and causal relations are shown in Table 4. We split the new dataset into 20 documents for training and 5 documents for testing. In the training phase, the training parameters were tuned via 5-fold cross validation on the training set. Table 4 demonstrates the improvement of the joint framework over individual components. The "temporal only" baseline is the improved temporal extraction system for which the joint inference with causal links has NOT been applied. The "causal only" baseline uses $s_c(\cdot)$ alone for the prediction of each pair. That is, for a pair $i$, if $s_c\{i \mapsto \text{causes}\} > s_c\{i \mapsto \text{caused by}\}$, we assign "causes" to pair $i$; otherwise, we assign "caused by" to pair $i$. Note that the "causal accuracy" column in Table 4 was evaluated only on gold causal pairs.
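The "causal only" decision rule and the accuracy computed on gold causal pairs can be written down in a few lines. The scores and gold labels below are invented for illustration, and the function names `causal_only` and `causal_accuracy` are hypothetical.

```python
def causal_only(s_c):
    """Pick the causal direction with the larger score for one gold causal pair."""
    return "causes" if s_c["causes"] > s_c["caused by"] else "caused by"


def causal_accuracy(score_list, gold_labels):
    """Fraction of gold causal pairs whose direction is predicted correctly."""
    predictions = [causal_only(s) for s in score_list]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical scores for three gold causal pairs.
toy_scores = [
    {"causes": 0.7, "caused by": 0.3},
    {"causes": 0.2, "caused by": 0.8},
    {"causes": 0.6, "caused by": 0.4},
]
toy_gold = ["causes", "caused by", "caused by"]

accuracy = causal_accuracy(toy_scores, toy_gold)
print(round(accuracy, 3))
```

Since only gold causal pairs are scored, the "no relation" label never appears in this evaluation and the task reduces to choosing a direction.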
In the proposed joint system, the temporal and causal scores were added up for all event pairs. The temporal performance got strictly better in precision, recall, and $F_1$, and the causal performance also improved by a large margin from 70.5% to 77.3%, indicating that temporal signals and causal signals are helpful to each other. According to McNemar's test, both improvements are significant with p<0.05.
The second part of Table 4 shows how well each component could perform if gold relations were used. Technically, these gold temporal/causal relations were enforced by adding extra constraints to the ILP in Eq. (3) (imagine these gold relations as a special rule, and convert them into constraints as we did in Eq. (2)). When using gold temporal relations, causal accuracy went up to 91.9%. That is, 91.9% of the C-Links satisfied the assumption that the cause is temporally before the effect. First, this number is much higher than the 77.3% on line 3, so there is still room for improvement. Second, it means that in this dataset, 8.1% of the C-Links had a cause that is temporally after its effect. We will discuss this seemingly counter-intuitive phenomenon in the Discussion section. When gold causal relations were used (line 5), the temporal performance was slightly better than line 3 in terms of both precision and recall. The small difference means that the temporal performance on line 3 was already very close to its best. Compared with the first line, we can see that gold causal relations led to approximately 2% improvement in precision and recall in temporal performance, which is a reasonable margin given the fact that C-Links are often much sparser than T-Links in practice.
Note that the temporal performance in Table 4 is consistently better than that in Table 2 because of the higher IAA in the new dataset. However, the improvement brought by joint reasoning with causal relations is the same, which further confirms the capability of the proposed approach.

Discussion
We have consistently observed that on the TB-Dense dataset, if automatically tuned to optimize its $F_1$ score, a system is very likely to have low precision and high recall (e.g., Table 2). We notice that our system often predicts non-vague relations when the TB-Dense gold label is vague, resulting in lower precision. However, on our new dataset, the same algorithm can achieve a more balanced precision and recall. This is an interesting phenomenon, possibly due to the difference in annotation schemes, which needs further investigation.
The temporal improvements in both Table 2 and Table 4 are relatively small (although statistically significant). This is actually not surprising because C-Links are much fewer than T-Links in newswire text, which focuses more on the temporal development of stories. As a result, many T-Links are not accompanied by C-Links and the improvements are diluted. But for those event pairs having both T-Links and C-Links, the proposed joint framework is an important scheme to synthesize both signals and improve both. The comparison between Line 5 and Line 3 in Table 4 showcases this effectiveness. We think that a deeper reason for the improvement achieved via a joint framework is that causality often encodes human prior knowledge as global information (e.g., "death" is caused by "explosion" rather than causes "explosion", regardless of the local context), while temporality often focuses more on the local context. From this standpoint, temporal information and causal information are complementary and helpful to each other.
When doing error analysis for the fourth line of Table 4, we noticed some examples that break the commonly accepted temporal precedence assumption. It turns out that they are not annotation mistakes: In Example 4, e8:finished is obviously before e9:closed, but e9 is a cause of e8 since if the market did not close, the shares would not finish. In the other sentence of Example 4, she prepares before hosting her show, but e11:host is the cause of e10:prepares since if not for hosting, no preparation would be needed. In both cases, the cause is temporally after the effect because people are inclined to make projections for the future and change their behaviors before the future comes. The proposed system is currently unable to handle these examples and we believe that a better definition of what can be considered as events is needed, as part of further investigating how causality is expressed in natural language.
Finally, the constraints connecting causal relations to temporal relations are designed in this paper as "if A is the cause of B, then A must be before B". People have suggested other possibilities that involve the includes and simultaneously relations. While these other relations are simply different interpretations of temporal precedence (and can be easily incorporated in our framework), we find that they rarely happen in our dataset.

Conclusion
We presented a novel joint framework, Temporal and Causal Reasoning (TCR), which applies CCMs and ILP to the extraction of temporal and causal relations between events. To show the benefit of TCR, we have developed a new dataset that jointly annotates temporal and causal relations, and then showed that TCR can improve both the temporal and the causal components. We hope that this notable improvement can foster more interest in jointly studying multiple aspects of events (e.g., event sequencing, coreference, parent-child relations) towards the goal of understanding events in natural language.