A Structured Learning Approach to Temporal Relation Extraction

Identifying temporal relations between events is an essential step towards natural language understanding. However, the temporal relation between two events in a story depends on, and is often dictated by, relations among other events. Consequently, effectively identifying temporal relations between events is a challenging problem even for human annotators. This paper suggests that it is important to take these dependencies into account while learning to identify these relations and proposes a structured learning approach to address this challenge. As a byproduct, this provides a new perspective on handling missing relations, a known issue that hurts existing methods. As we show, the proposed approach results in significant improvements on the two commonly used data sets for this problem.

The fundamental tasks in temporal processing, as identified in the TE workshops, are 1) time expression (the so-called "timex") extraction and normalization and 2) temporal relation (also known as TLINKs (Pustejovsky et al., 2003a)) extraction. While the first task has now been well handled by the state-of-the-art systems (Heidel-Time (Strötgen and Gertz, 2010), SUTime (Chang and Manning, 2012), IllinoisTime (Zhao et al., 2012), NavyTime (Chambers, 2013), UWTime (Lee et al., 2014), etc.) with end-to-end F 1 scores being around 80%, the second task has long been a challenging one; even the top systems only achieved F 1 scores of around 35% in the TE workshops.
The goal of the temporal relation task is to generate a directed temporal graph whose nodes represent temporal entities (i.e., events or timexes) and edges represent the TLINKs between them. The task is challenging because it often requires global considerations -considering the entire graph, the TLINK annotation is quadratic in the number of nodes and thus very expensive, and an overwhelming fraction of the temporal relations are missing in human annotation. In this paper, we propose a structured learning approach to temporal relation extraction, where local models are updated based on feedback from global inferences. The structured approach also gives rise to a semisupervised method, making it possible to take advantage of the readily available unlabeled data. As a byproduct, this approach further provides a new, effective perspective on handling those missing relations.
In the common formulations, temporal relations are categorized into three types: the E-E TLINKs (those between a pair of events), the T-T TLINKs (those between a pair of timexes), and the E-T TLINKs (those between an event and a timex). While the proposed approach can be generally applied to all three types, this paper focuses on the majority type, i.e., the E-E TLINKs. For example, consider the following snippet taken from the training set provided in the TE3 workshop. We want to construct a temporal graph as in Fig. 1 for the events in boldface in Ex1.
Ex1 . . . tons of earth cascaded down a hillside, ripping two houses from their foundations. No one was hurt, but firefighters ordered the evacuation of nearby homes and said they'll monitor the shifting ground.. . . As discussed in existing work (Verhagen, 2004;Bramsen et al., 2006;Mani et al., 2006;Chambers and Jurafsky, 2008), the structure of a temporal graph is constrained by some rather simple rules: 1. Symmetry. For example, if A is before B, then B must be after A.

2.
Transitivity. For example, if A is before B and B is before C, then A must be before C.
This particular structure of a temporal graph (especially the transitivity structure) makes its nodes highly interrelated, as can be seen from Fig. 1. It is thus very challenging to identify the TLINKs between them, even for human annotators: The inter-annotator agreement on TLINKs is usually about 50%-60% (Mani et al., 2006). Fig. 2 shows the actual human annotations provided by TE3. Among all the ten possible pairs of nodes, only three TLINKs were annotated. Even if we only look at main events in consecutive sentences and at events in the same sentence, there are still quite a few missing TLINKs, e.g., the one between hurt and cascaded and the one between monitor and ordered.
Early attempts by Mani et al. (2006);Chambers et al. (2007); ; Verhagen and Pustejovsky (2008) studied local methods -learning models that make pairwise decisions between each pair of events. State-of-the-art local methods, including ClearTK (Bethard, 2013) (Laokulrat et al., 2013), and NavyTime (Chambers, 2013), use better designed rules or more features such as syntactic tree paths and achieve better results. However, the decisions made by these (local) models are often globally inconsistent (i.e., the symmetry and/or transitivity constraints are not satisfied for the entire temporal graph). Integer linear programming (ILP) methods (Roth and Yih, 2004) were used in this domain to enforce global consistency by several authors including Bramsen et al. (2006); Chambers and Jurafsky (2008); , which formulated TLINK extraction as an ILP and showed that it improves over local methods for densely connected graphs. Since these methods perform inference ("I") on top of pre-trained local classifiers ("L"), they are often referred to as L+I (Punyakanok et al., 2005).
In a state-of-the-art method, CAEVO , many hand-crafted rules and machine learned classifiers (called sieves therein) form a pipeline. The global consistency is enforced by inferring all possible relations before passing the graph to the next sieve. This best-first architecture is conceptually similar to L+I but the inference is greedy, similar to Mani et al. (2007); Verhagen and Pustejovsky (2008). Although L+I methods impose global constraints in the inference phase, this paper argues that global considerations are necessary in the learning phase as well (i.e., structured learning). In parallel to the work presented here, Leeuwenberg and Moens (2017) also proposed a structured learning approach to extracting the temporal relations. Their work focuses on a domain-specific dataset from Clinical TempEval (Bethard et al., 2016), so their work does not need to address some of the difficulties of the general problem that our work addresses. More importantly, they compared structured learning to local baselines, while we find that the comparison between structured learning and L+I is more interesting and important for understanding the effect of global considerations in the learning phase. In difference from existing methods, we also discuss how to effectively use unlabeled data and how to handle the overwhelming fraction of missing relations in a principled way. Our solution targets on these issues and, as we show, achieves significant improvements on two commonly used evaluation sets.
The rest of this paper is organized as follows. Section 2 clarifies the temporal relation types and the evaluation metric of a temporal graph used in this paper, Section 3 explains the structured learning approach in detail, and Section 4 discusses the practical issue of missing relations. We provide experiments and discussion in Section 5 and conclusion in Section 6.

Temporal Relation Types
Existing corpora for temporal processing often follows the interval representation of events proposed in Allen (1984), and makes use of 13 relation types in total. In many systems, vague or none is also included as another relation type when a TLINK is not clear or missing. However, current systems usually use a reduced set of relation types, mainly due to the following reasons.
1. The non-uniform distribution of all the relation types makes it difficult to separate lowfrequency ones from the others (see Table 1 in Mani et al. (2006)). For example, relations such as immediately before or immediately after barely exist in a corpus compared to before and after.
2. Due to the ambiguity in natural language, determining relations like before and immediately before can be a difficult task itself .
In this work, we follow the reduced set of temporal relation types used in CAEVO : before, after, includes, is included, equal, and vague.

Quality of A Temporal Graph
The most recent evaluation metric in TE3, i.e., the temporal awareness (UzZaman and Allen, 2011), is adopted in this work. Specifically, let G sys and G true be two temporal graphs from the system prediction and the ground truth, respectively. The precision and recall of temporal awareness are defined as follows.
where G + is the closure of graph G, G − is the reduction of G, "∩" is the intersection between TLINKs in two graphs, and |G| is the number of TLINKs in G. The temporal awareness metric better captures how "useful" a temporal graph is. For example, if system 1 produces ripping is before hurt and hurt is before monitor, and system 2 adds ripping is before monitor on top of system 1. Since system 2 is simply a transitive closure of system 1, they would have the same evaluation scores. Note that vague relations are usually considered as non-existing TLINKs and are not counted during evaluation.

A Structured Training Approach
As shown in Fig. 1, the learning problem in temporal relation extraction is global in nature. Even the top local method in TE3, UTTime (Laokulrat et al., 2013), only achieved F 1 =56.5 when presented with a pair of temporal entities (Task Crelation only (UzZaman et al., 2013)). Since the success of an L+I method strongly relies on the quality of the local classifiers, a poor local classifier is obviously a roadblock for L+I methods. Following the insights from Punyakanok et al. (2005), we propose to use a structured learning approach (also called "Inference Based Training" (IBT)). Unlike the current L+I approach, where local classifiers are trained independently beforehand without knowledge of the predictions on neighboring pairs, we train local classifiers with feedback that accounts for other relations, by performing global inference in each round of the learning process. In order to introduce the structured learning algorithm, we first explain its most important component, the global inference step.

Inference
In a document with n pairs of events, let φ i ∈ X ⊆ R d be the extracted d-dimensional feature and y i ∈ Y be the temporal relation for the i-th pair of events, i = 1, 2, . . . , n, where Y = {r j } 6 j=1 is the label set for the six temporal relations we use. Moreover, let x = {φ 1 , . . . , φ n } ∈ X n and y = {y 1 , . . . , y n } ∈ Y n be more compact representations of all the features and labels in this document. Given the weight vector w r of a linear classifier trained for relation r ∈ Y (i.e., using the one-vs-all scheme), the global inference step is to solve the following constrained optimization problem:ŷ where C(Y n ) ⊆ Y n constrains the temporal graph to be symmetrically and transitively consistent, and f (x, y) is the scoring function: is simply the sum of these probabilities over all the event pairs in a document, which we think of as the confidence of assigning y = {y 1 , ..., y n } to this document and therefore, it needs to be maximized in Eq. (1).
Note that when C(Y n ) = Y n , Eq. (1) can be solved for eachŷ i independently, which is what the so-called local methods do, but the resultinĝ y may not satisfy global consistency in this way. When C(Y n ) = Y n , Eq. (1) cannot be decoupled for eachŷ i and is usually formulated as an ILP problem (Roth and Yih, 2004;Chambers and Jurafsky, 2008;. Specifically, let I r (ij) ∈ {0, 1} be the indicator function of relation r for event i and event j and f r (ij) ∈ [0, 1] be the corresponding soft-max score. Then the ILP objective for global inference is formulated as fol- for all distinct events i, j, and k, where E = {ij | sentence dist(i, j)≤ 1},r is the reverse of r, and N is the number of possible relations for r 3 when r 1 and r 2 are true.
Our formulation in Eq.
(2) is different from previous work (Chambers and Jurafsky, 2008; in two aspects: 1) We restrict our event pairs ij to a smaller set E = {ij | sentence dist(i, j)≤ 1} where pairs that are more than one sentence away are deleted for computational efficiency and (usually) for better performance. In fact, to make better use of global constraints, we should have allowed more event pairs in Eq. (2). However, f r (ij) is usually more reliable when i and j are closer in text. Many participating systems in TE3 (UzZaman et al., 2013) have used this pre-filtering strategy to balance the trade-off between confidence in f r (ij) and global constraints. We observe that the strategy fits very well to the existing datasets: As shown in Fig. 3, annotated TLINKs barely exist if two events are two sentences away. 2) Previously, transitivity constraints were formulated as I r 1 (ij) + I r 2 (jk) − I r 3 (ik) ≤ 1, which is a special case when N = 1 and can be understood as "r 1 and r 2 determine a single r 3 ". However, it was overlooked that, although some r 1 and r 2 cannot uniquely determine r 3 , they can still constrain the set of labels r 3 can take. For example, as shown in Fig. 4, when r 1 =before and r 2 =is included, r 3 is not determined but we know that r 3 ∈ {before, is included} 1 . This information can be easily exploited by allowing N > 1. can either be before C1 or is included in C2. We propose to incorporate this via the transitivity constraints for Eq. (2).
With these two differences, the optimization problem (2) can still be efficiently solved using off-the-shelf ILP packages such as GUROBI (Gurobi Optimization, Inc., 2012).

Learning
With the inference solver defined above, we propose to use the structured perceptron (Collins, 2002) as a representative for the inference based training (IBT) algorithm to learn those weight vectors w r . Specifically, let L = {x k , y k } K k=1 be the labeled training set of K instances (usually documents). The structured perceptron training algorithm for this problem is shown in Algorithm 1. The Illinois-SL package (Chang et al., 2010) was used in our experiments for its structured perceptron component. In terms of the features used in this work, we adopt the same set of features designed for E-E TLINKs in Sec. 3.1 of .
In Algorithm 1, Line 6 is the inference step as in Eq. (1) or (2), which is augmented with a closure operation onŷ in the following line. In the case in which there is only one pair of events in each instance (thus no structure to take advantage of), Algorithm 1 reduces to the conventional perceptron algorithm and Line 6 simply chooses the top scoring label. With a structured instance instead, Line 6 becomes slower to solve, but it can provide valuable information so that the perceptron learner is able to look further at other labels rather than an isolated pair. For example in Ex1 and Fig. 1, the fact that (ripping,ordered)=before is established through two other relations: 1) ripping is an adverbial participle and thus included in cascaded and 2) cascaded is before ordered. If (ripping,ordered)=before is presented to a local learning algorithm without knowing its predictions on (ripping,cascaded) and (cascaded,ordered), then the model either cannot support it or overfits it. In IBT, however, if the classifier was correct in deciding (ripping,cascaded) and (cascaded,ordered), then (ripping,ordered) would be correct automatically and would not contribute to updating the classifier.

Semi-supervised Structured Learning
The scarcity of training data and the difficulty in annotation have long been a bottleneck for temporal processing systems. Given the inherent global constraints in temporal graphs, we propose to perform semi-supervised structured learning using the constraint-driven learning (CoDL) algorithm (Chang et al., 2007, as shown in Algorithm 2, where the function "Learn" in Lines 2 and 9 represents any standard learning algorithm Algorithm 1: Structured perceptron algorithm for temporal relations Input: Training set L = {x k , y k } K k=1 , learning rate λ 1 Perform graph closure on each y k 2 Initialize w r = 0, ∀r ∈ Y 3 while convergence criteria not satisfied do 4 Shuffle the examples in L 5 foreach (x, y) ∈ L do 6ŷ = arg max y∈C f (x, y) 7 Perform graph closure onŷ 8 ifŷ = y then 9 w r = w r + λ( i:y i =r φ i − i:ŷ i =r φ i ), ∀r ∈ Y 10 return {w r } r∈Y (e.g., perceptron, SVM, or even structured perceptron; here we used the averaged perceptron (Freund and Schapire, 1998)) and subscript "r" means selecting the learned weight vector for relation r ∈ Y. CoDL improves the model learned from a small amount of labeled data by repeatedly generating feedback through labeling unlabeled examples, which is in fact a semi-supervised version of IBT. Experiments show that this scheme is indeed helpful in this problem.
Algorithm 2: Constraint-driven learning algorithm Input: Labeled set L, unlabeled set U, weighting coefficient γ 1 Perform closure on each graph in L 2 Initialize w r = Learn(L) r , ∀ r ∈ Y 3 while convergence criteria not satisfied do

Missing Annotations
Since even human annotators find it difficult to annotate temporal graphs, many of the TLINKs are left unspecified by annotators (compare Fig. 2 to Fig. 1). While some of these missing TLINKs can be inferred from existing ones, the vast majority still remain unknown as shown in Table 1. De-spite the existence of denser annotation schemes (e.g., ), the TLINK annotation task is quadratic in the number of nodes, and it is practically infeasible to annotate complete graphs. Therefore, the problem of identifying these unknown relations in training and test is a major issue that dramatically hurts existing methods. We could simply use these unknown pairs (or some filtered version of them) to design rules or train classifiers to identify whether a TLINK is vague or not. However, we propose to exclude both the unknown pairs and the vague classifier from the training process -by changing the structured loss function to ignore the inference feedback on vague TLINKs (see Line 9 in Algorithm 1 and Line 9 in Algorithm 2). The reasons are discussed below.
First, it is believed that a lot of the unknown pairs are not really vague but rather pairs that the annotators failed to look at . For example, (cascaded, monitor) should be annotated as before but is missing in Fig. 2. It is hard to exclude this noise in the data during training. Second, compared to the overwhelmingly large number of unknown TLINKs (89.5% as shown in Table 1), the scarcity of non-vague TLINKs makes it hard to learn a good vague classifier. Third, vague is fundamentally different from the other relation types. For example, if a before TLINK can be established given a sentence, then it always holds as before regardless of other events around it, but if a TLINK is vague given a sentence, it may still change to other types afterwards if a connection can later be established through other nodes from the context. This distinction emphasizes that vague is a consequence of lack of background/contextual information, rather than a concrete relation type to be trained on. Fourth, without the vague classifier, the predicted temporal graph tends to become more densely connected, thus the global transitivity constraints can be more effective in correcting local mistakes (Chambers and Jurafsky, 2008).
However, excluding the local classifier for vague TLINKs would undesirably assign nonvague TLINKs to every pair of events. To handle this, we take a closer look at the vague TLINKs. We note that a vague TLINK could arise in two situations if the annotators did not fail to look at it. One is that an annotator looks at this pair of events and decides that multiple relations can exist, and the other one is that two annotators disagree on the relation (similar arguments were also made in ). In both situations, the annotators first try to assign all possible relations to a TLINK, and then change the relation to vague if more than one can be assigned. This human annotation process for vague is different from many existing methods, which either identify the existence of a TLINK first (using rules or machinelearned classifiers) and then classify, or directly include vague as a classification label along with other non-vague relations.
In this work, however, we propose to mimic this mental process by a post-filtering method 2 . Specifically, we take each TLINK produced by ILP and determine whether it is vague using its relative entropy (the Kullback-Leibler divergence) to the uniform distribution. Let {r m } M m=1 be the set of relations that the i-th pair of events can take, we filter the i-th TLINK given by ILP by: where f rm (φ i ) is the soft-max score of r m , obtained by the local classifier for r m . We then compare δ i to a fixed threshold τ to determine the vagueness of this TLINK; we accept its originally predicted label if δ i > τ , or change it to vague otherwise. Using relative entropy here is intuitively appealing and empirically useful as shown in the experiments section; better metrics are of course yet to be designed.

Datasets
The TempEval3 (TE3) workshop (UzZaman et al., 2013) provided the TimeBank (TB) (Pustejovsky et al., 2003b), AQUAINT (AQ) (Graff, 2002), Silver (TE3-SV), and Platinum (TE3-PT) datasets, where TB and AQ are usually for training, and TE3-PT is usually for testing. The TE3-SV dataset is a much larger, machine-annotated and automatically-merged dataset based on multiple systems, with the intention to see if these "silver" standard data can help when included in training (although almost all participating systems saw performance drop with TE3-SV included in training).
Two popular augmentations on TB are the Verb-Clause temporal relation dataset (VC) and Time-bankDense dataset (TD). The VC dataset has specially annotated event pairs that follow the socalled Verb-Clause structure , which is usually beneficial to be included in training (UzZaman et al., 2013). The TD dataset contains 36 documents from TB which were reannotated using the dense event ordering framework proposed in . The experiments included in this paper will involve the TE3 datasets as well as these augmentations. Therefore, some statistics on them are shown in Table 2 for the readers' information. Table 2: Facts about the datasets used in this paper. The TD dataset is split into train, dev, and test in the same way as in . Note that the column of TLINKs only counts the non-vague TLINKs, from which we can see that the TD dataset has a much higher ratio of #TLINKs to #Events. The TLINK annotations in TE3-SV is not used in this paper and its number is thus not shown.

Baseline Methods
In addition to the state-of-the-art systems, another two baseline methods were also implemented for a better understanding of the proposed ones. The first is the regularized averaged perceptron (AP) (Freund and Schapire, 1998) implemented in the LBJava package (Rizzolo and Roth, 2010) and is a local method. On top of the first baseline, we performed global inference in Eq.
(2), referred to as the L+I baseline (AP+ILP). Both of them used the same feature set (i.e., as designed in ) as in the proposed structured perceptron (SP) and CoDL for fair comparisons. To clarify, SP and CoDL are training algorithms and their immediate outputs are the weight vectors {w r } r∈Y for local classifiers. An ILP inference was performed on top of them to yield the final output, and we refer to it as "S+I" (i.e., structured learn-ing+inference) methods.

TE3 Task C -Relation Only
To show the benefit of using structured learning, we first tested one scenario where the gold pairs of events that have a non-vague TLINK were known priori. This setup was a standard task presented in TE3, so that the difficulty of detecting vague TLINKs was ruled out. This setup also helps circumvent the issue that TE3 penalizes systems which assign extra labels that do not exist in the annotated graph, while these extra labels may be actually correct because the annotation itself might be incomplete. UTTime (Laokulrat et al., 2013) was the top system in this task in TE3. Since UTTime is not available to us, and its performance was reported in TE3 in terms of both E-E and E-T TLINKs together, we locally trained an E-T classifier based on  and included its prediction only for fair comparison.
UTTime is a local method and was trained on TB+AQ and tested on TE3-PT. We used the same datasets for our local baseline and its performance is shown in Table 3 under the name "AP-1". Note that the reported numbers below are the temporal awareness scores obtained from the official evaluation script provided in TE3. We can see that UT-Time is about 3% better than AP-1 in the absolute value of F 1 , which is expected since UTTime included more advanced features derived from syntactic parse trees. By adding the VC and TD datasets into the training set, we retrained our local baseline and achieved comparable performance to Table 4: Temporal awareness scores given gold events but with no gold pairs, which show that the proposed S+I methods outperformed state-of-the-art systems in various settings. The fourth column indicates the annotation sources used, with additional unlabeled dataset in the parentheses. The "Filters" column shows if the pre-filtering method (Sec. 3.1) or the proposed post-filtering method (Sec. 4) were used. The last column is the relative improvement in F1 score compared to baseline systems on line 1, 7, and 11, respectively. Systems that are significantly better than the "*"-ed systems are underlined (per McNemar's test with p < 0.0005).
No UTTime ("AP-2" in Table 3). On top of AP-2, a global inference step enforcing symmetry and transitivity constraints ("AP+ILP") can further improve the F 1 score by 9.3%, which is consistent with previous observations (Chambers and Jurafsky, 2008;. SP+ILP further improved the performance in precision, recall, and F 1 significantly (per the McNemar's test (Everitt, 1992;Dietterich, 1998) with p <0.0005), reaching an F 1 score of 67.2%. This meets our expectation that structured learning can be better when the local problem is difficult (Punyakanok et al., 2005).

TE3 Task C
In the first scenario, we knew in advance which TLINKs existed or not, so the "pre-filtering" (i.e., ignoring distant pairs as mentioned in Sec. 3.1 and "post-filtering" methods were not used when generating the results in Table 3. We then tested a more practical scenario, where we only knew the events, but did not know which ones are related. This setup was Task C in TE3 and the top system was ClearTK (Bethard, 2013). Again, for fair comparison, we simply added the E-T TLINKs predicted by ClearTK. Moreover, 10% of the training data was held out for development. Corresponding results on the TE3-PT testset are shown in Table 4. From lines 2-4, all systems see significant drops in performance if compared with the same entries in Table 3. It confirms our assertion that how to handle vague TLINKs is a major issue for this temporal relation extraction problem. The improvement of SP+ILP (line 4) over AP (line 2) was small and AP+ILP (line 3) was even worse than AP, which necessitates the use of a better approach towards vague TLINKs. By applying the postfiltering method proposed in Sec. 4, we were able to achieve better performances using SP+ILP (line 5), which shows the effectiveness of this strategy. Finally, by setting U in Algorithm 2 to be the TE3-SV dataset, CoDL+ILP (line 6) achieved the best F 1 score with a relative improvement over ClearTK being 14.8%. Note that when using TE3-SV in this paper, we did not use its annotations on TLINKs because of its well-known large noise (UzZaman et al., 2013).
In UzZaman et al. (2013), we notice that the best performance of ClearTK was achieved when trained on TB+VC (line 7 is higher than its reported values in TE3 because of later changes in ClearTK), so we retrained the proposed systems on the same training set and results are shown on lines 8-9. In this case, the improvement of S+I over Local was small, which may be due to the lack of training data. Note that line 8 was still significantly different to line 7 per the McNemar's test, although there was only 0.2% absolute difference in F 1 , which can be explained from their large differences in precision and recall.

Comparison with CAEVO
The proposed structured learning approach was further compared to a recent system, a CAscading EVent Ordering architecture (CAEVO) proposed in  (lines 10-13). We used the same training set and test set as CAEVO in the S+I systems. Again, we added the E-T TLINKs predicted by CAEVO to both S+I systems. In , CAEVO was reported on the straightforward evaluation metric including the vague TLINKs, but the temporal awareness scores were used here, which explains the difference between line 11 in Table 4 and what was reported in .
ClearTK was reported to be outperformed by CAEVO on TD-Test , but we observe that ClearTK on line 10 was much worse even than itself on line 7 (trained on TB+VC) and on line 1 (trained on TB+AQ+VC+TD) due to the annotation scheme difference between TD and TB/AQ/VC. ClearTK was designed mainly for TE3, aiming for high precision, which is reflected by its high precision on line 10, but it does not have enough flexibility to cope with two very different annotation schemes. Therefore, we have chosen CAEVO as the baseline system to evaluate the significance of the proposed ones. On the TD-Test dataset, all systems other than ClearTK had better F 1 scores compared to their performances on TE3-PT. This notable difference (i.e., 48.53 vs 40.3) indicates the better quality of the dense annotation scheme that was used to create TD . SP+ILP outperformed CAEVO and if additional unlabeled dataset TE3-SV was used, CoDL+ILP achieved the best score with a relative improvement in F 1 score being 6.3%.
We notice that the proposed systems often have higher recall than precision, and that this is less an issue on a densely annotated testset (TD-Test), so their low precision on TE3-PT possibly came from the missing annotations on TE3-PT. It is still under investigation how to control precision and recall in real applications.

Conclusion
We develop a structured learning approach to identifying temporal relations in natural language text and show that it captures the global nature of this problem better than state-of-the-art systems do. A new perspective towards vague relations is also proved to gain from fully taking advantage of the structured approach. In addition, the global nature of this problem gives rise to a better way of making use of the readily available unlabeled data, which further improves the proposed method. The improved performance on both TE3-PT and TD-Test, two differently annotated datasets, clearly shows the advantage of the proposed method over existing methods. We plan to build on the notable improvements shown here and expand this study to deal with additional temporal reasoning problems in natural language text.