Exploiting Partially Annotated Data in Temporal Relation Extraction

Annotating temporal relations (TempRel) between events described in natural language is known to be labor intensive, partly because the total number of TempRels is quadratic in the number of events. As a result, only a small number of documents are typically annotated, limiting the coverage of various lexical/semantic phenomena. In order to improve existing approaches, one possibility is to make use of the readily available, partially annotated data (P as in partial) that cover more documents. However, missing annotations in P are known to hurt, rather than help, existing systems. This work is a case study in exploring various usages of P for TempRel extraction. Results show that despite missing annotations, P is still a useful supervision signal for this task within a constrained bootstrapping learning framework. The system described in this paper is publicly available.


Introduction
Understanding the temporal information in natural language text is an important NLP task (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Minard et al., 2015; Bethard et al., 2016, 2017). A crucial component is temporal relation (TempRel; e.g., before or after) extraction (Mani et al., 2006; Bethard et al., 2007; Do et al., 2012; Mirza and Tonelli, 2016; Ning et al., 2017, 2018a). The TempRels in a document or a sentence can be conveniently modeled as a graph, where the nodes are events and the edges are labeled by TempRels. Given all the events in an instance, TempRel annotation is the process of manually labeling all the edges, a highly labor-intensive task for two reasons. One is that many edges require extensive reasoning over multiple sentences, so labeling them is time-consuming. The other, perhaps more important, reason is that the number of edges is quadratic in the number of nodes. If labeling an edge takes 30 seconds (already an optimistic estimate), a typical document with 50 nodes would take more than 10 hours to annotate. Even if existing annotation schemes make a compromise by only annotating edges whose nodes are from the same sentence or adjacent sentences, it still takes more than 2 hours to fully annotate a typical document. Consequently, the only fully annotated dataset, TB-Dense (Cassidy et al., 2014), contains only 36 documents, which is rather small compared with datasets for other NLP tasks.
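The 10-hour estimate above can be checked with a few lines of arithmetic. The sketch below assumes one edge per unordered pair of events, i.e., n(n-1)/2 edges for n events; the function name `annotation_cost` is hypothetical and used only for illustration.

```python
def annotation_cost(num_events, seconds_per_edge=30):
    """Return (number of edges, annotation time in hours) for a fully connected graph.

    Assumes one edge per unordered event pair, i.e., n * (n - 1) / 2 edges.
    """
    num_edges = num_events * (num_events - 1) // 2
    hours = num_edges * seconds_per_edge / 3600
    return num_edges, hours


edges, hours = annotation_cost(50)
print(f"{edges} edges, {hours:.1f} hours")
```

With 50 events and 30 seconds per edge, this gives 1225 edges and slightly more than 10 hours, which matches the estimate in the text.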
A small number of documents may indicate that the annotated data provide a limited coverage of various lexical and semantic phenomena, since a document is usually "homogeneous" within itself. In contrast to the scarcity of fully annotated datasets (denoted by F as in full), there are also some partially annotated datasets (denoted by P as in partial); for example, TimeBank (Pustejovsky et al., 2003) and AQUAINT (Graff, 2002) cover in total more than 250 documents. Since annotators are not required to label all the edges in these datasets, it is less labor intensive to collect P than to collect F. However, existing TempRel extraction methods only work on one type of dataset (i.e., either F or P), without taking advantage of both. No one, as far as we know, has explored ways to combine both types of datasets in learning and whether doing so is helpful.
This work is a case study in exploring various usages of P in the TempRel extraction task. We empirically show that P is indeed useful within a (constrained) bootstrapping type of learning approach. This case study is interesting from two perspectives. The first is incidental supervision (Roth, 2017). In practice, supervision signals may not always be perfect: they may be noisy, only partial, based on different annotation schemes, or even on different (but relevant) tasks; incidental supervision is a general paradigm that aims at making use of the abundant, naturally occurring data as supervision signals. As for the TempRel extraction task, the existence of many partially annotated datasets P is a good fit for this paradigm, and the result here can be informative for future investigations involving other incidental supervision signals. The second is TempRel data collection. The fact that P is shown to provide useful supervision signals poses some further questions: What is the optimal data collection scheme for TempRel extraction: fully annotated, partially annotated, or a mixture of both? For partially annotated data, what is the optimal ratio of annotated edges to unannotated edges? The proposed method in this work can be readily extended to study these questions in the future, as we further discuss in Sec. 5.

Existing Datasets and Methods
TimeBank (Pustejovsky et al., 2003) is a classic TempRel dataset, where the annotators were given a whole article and allowed to label TempRels between any pairs of events. Annotators in this setup usually focus only on salient relations but overlook some others. It has been reported that many event pairs in TimeBank should have been annotated with a specific TempRel but the annotators failed to look at them (Chambers, 2013; Ning et al., 2017). Consequently, we categorize TimeBank as a partially annotated dataset (P). The same argument applies to other datasets that adopted this setup, such as AQUAINT (Graff, 2002), CaTeRs (Mostafazadeh et al., 2016) and RED (O'Gorman et al., 2016). Most existing systems make use of P, including, but not limited to, (Mani et al., 2006; Bramsen et al., 2006; Chambers et al., 2007; Bethard et al., 2007; Verhagen and Pustejovsky, 2008; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012); this also applies to the TempEval workshop systems, e.g., (Laokulrat et al., 2013; Bethard, 2013; Chambers, 2013).
To address the missing annotation issue, Cassidy et al. (2014) proposed a dense annotation scheme, TB-Dense. Edges are presented one by one and the annotator has to choose a label for each of them (note that there is a vague label in case the TempRel is not clear or does not exist). As a result, edges in TB-Dense are considered as fully annotated in this paper. The first system on TB-Dense was proposed in . Two recent TempRel extraction systems (Mirza and Tonelli, 2016; Ning et al., 2017) also reported their performances on TB-Dense (F) and on TempEval-3 (P) separately. However, there are no existing systems that jointly train on both. Given that the annotation guidelines of F and P are obviously different, it may not be optimal to simply treat P and F uniformly and train on their union. This situation necessitates further investigation, as we do here.
Before introducing our joint learning approach, we have a few remarks about our choice of F and P datasets. First, we note that TB-Dense is actually not fully annotated in the strict sense because only edges within a sliding, two-sentence window are presented. That is, distant event pairs are intentionally ignored by the designers of TB-Dense. However, since such distant pairs are consistently ruled out in both the training and inference phases in this paper, it does not change the nature of the problem being investigated here. At this point, TB-Dense is the only fully annotated dataset that can be adopted in this study, despite the aforementioned limitation.
Second, the partial annotations in datasets like TimeBank were not selected uniformly at random from all possible edges. As described earlier, only salient and non-vague TempRels (which may often be those easy ones) are labeled in these datasets. Using TimeBank as P might potentially create some bias and we will need to keep this in mind when analyzing the results in Sec. 4. Recent advances in TempRel data annotation (Ning et al., 2018c) can be used in the future to collect both F and P more easily.

Joint Learning on F and P
In this work, we study two learning paradigms that make use of both F and P. In the first, we simply treat those edges that are annotated in P as edges in F so that the learning process can be performed on top of the union of F and P. This is the most straightforward approach to using F and P jointly and it is interesting to see if it already helps.
In the second, we use bootstrapping: we use F as a starting point and learn a TempRel extraction system on it (denoted by $S_F$), and then fill in the missing annotations in P based on $S_F$ (thus obtaining a "fully" annotated $\tilde{P}$); finally, we treat $\tilde{P}$ as F and learn from both. Algorithm 1 is a meta-algorithm of the above.
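The bootstrapping procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the toy `learn` function (a most-frequent-label lookup) stands in for the actual learner, `local_inference` is a hypothetical local filler, the stopping rule (stop when the filled-in labels no longer change, or after `max_iters` rounds) is an assumption, and the data are made up.

```python
from collections import Counter, defaultdict


def learn(docs):
    """Toy stand-in for 'Learn': map each feature to its most frequent label."""
    counts = defaultdict(Counter)
    for doc in docs:
        for feature, label in doc:
            counts[feature][label] += 1
    return {f: c.most_common(1)[0][0] for f, c in counts.items()}


def local_inference(model, doc):
    """Toy stand-in for 'Inference': keep annotated edges, fill missing ones (None)."""
    return [(f, lab if lab is not None else model.get(f, "vague")) for f, lab in doc]


def bootstrap(F, P, learn, inference, max_iters=10):
    """Learn on F, fill in the missing labels of P, retrain on both, and repeat."""
    model = learn(F)  # the initial system trained on F only
    filled, previous = [], None
    for _ in range(max_iters):
        filled = [inference(model, doc) for doc in P]  # "fully" annotated version of P
        if filled == previous:  # assumed stopping rule: labels are stable
            break
        model = learn(F + filled)  # retrain on F plus the filled-in P
        previous = filled
    return model, filled


# Each document is a list of (feature, label) edges; None marks a missing annotation.
F = [[("a", "before"), ("b", "after")]]
P = [[("a", None), ("c", "before")], [("c", None), ("b", None)], [("c", "before")]]
model, filled = bootstrap(F, P, learn, local_inference)
print(model["c"], filled[1])
```

In this toy run, the feature "c" never appears in F, so its missing edge is first filled with vague; after retraining on the filled-in P, the annotated "c" edges dominate and the missing edge is relabeled as before in the next round.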
Algorithm 1: Joint learning from F and P by bootstrapping. Input: F, P, Learn, Inference.
In Algorithm 1, we consistently use the sparse averaged perceptron algorithm as the "Learn" function. As for "Inference" (Line 6), we further investigate two different ways: (i) Look at every unannotated edge in p ∈ P and use $S_{F+P}$ to label it; this local method ignores the existing annotated edges in P and is thus the standard bootstrapping. (ii) Perform global inference on P with annotated edges as constraints, which is a constrained bootstrapping, motivated by the fact that temporal graphs are structured and annotated edges influence the missing edges. For example, in Fig. 1, the current annotations for (1, 2) and (2, 3) are before and vague, respectively. We assume that the annotation (2, 3) = vague indicates that the relation cannot be determined even if the entire graph is considered. Then with (1, 2) = before and (2, 3) = vague, we can see that (1, 3) cannot be uniquely determined, but it is restricted to be selected from {before, vague} rather than the entire label set. We believe that global inference makes better use of the information provided by P; in fact, as we show in Sec. 4, it does perform better than local inference.
Figure 1: Nodes 1-3 are three time points and (i, j) denotes the edge from node i to node j, where (i, j) ∈ {before, after, equal, vague}. Assume the current annotation is (1, 2) = before and (2, 3) = vague, and (1, 3) is missing. Then (1, 3) cannot be after because it leads to (2, 3) = after, conflicting with the current annotation; similarly, (1, 3) cannot be equal, either.
A standard way to perform global inference is to formulate it as an Integer Linear Programming (ILP) problem (Roth and Yih, 2004) and enforce transitivity rules as constraints. Let $R$ be the TempRel label set, $I_r(ij) \in \{0, 1\}$ be the indicator function of $(i, j) = r$, and $f_r(ij) \in [0, 1]$ be the corresponding soft-max score obtained via $S_{F+P}$. Then the ILP objective is formulated as
$$\hat{I} = \arg\max_{I} \sum_{i<j} \sum_{r \in R} f_r(ij)\, I_r(ij) \quad (1)$$
$$\text{s.t.} \quad \sum_{r \in R} I_r(ij) = 1 \ \text{(uniqueness)}, \qquad I_{r_1}(ij) + I_{r_2}(jk) - \sum_{m} I_{r_3^m}(ik) \le 1 \ \text{(transitivity)},$$
for all distinct nodes $i$, $j$, and $k$, where $\{r_3^m\}$ is selected based on the general transitivity proposed in (Ning et al., 2017). With Eq. (1), different implementations of Line 6 in Algorithm 1 can be described concisely as follows: (i) Local inference is performed by ignoring the transitivity constraints. (ii) Global inference can be performed by adding annotated edges in P as additional constraints. Note that Algorithm 1 is only for the learning step of TempRel extraction; as for the inference step of this task, we consistently adopt the standard method of solving Eq. (1), as was done by (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012; Ning et al., 2017).
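To make the difference between local and global inference concrete, the sketch below solves a tiny instance of the constrained problem by brute-force enumeration rather than with an ILP solver. Several things here are assumptions made for illustration: the `compose` table is a simplified transitivity table over {before, after, equal, vague} chosen to agree with the Fig. 1 example (it is not the general transitivity table of Ning et al., 2017), the scores are made up, and the function names are hypothetical.

```python
from itertools import permutations, product

LABELS = ("before", "after", "equal", "vague")
INVERSE = {"before": "after", "after": "before", "equal": "equal", "vague": "vague"}


def compose(r1, r2):
    """Labels allowed on (i, k) given (i, j) = r1 and (j, k) = r2 (simplified, assumed)."""
    if r1 == "equal":
        return {r2}
    if r2 == "equal":
        return {r1}
    if r1 == "vague" and r2 == "vague":
        return set(LABELS)
    if r1 == "vague" or r2 == "vague":
        specific = r2 if r1 == "vague" else r1
        return {specific, "vague"}
    return {r1} if r1 == r2 else set(LABELS)


def label_of(assign, i, j):
    """Label of the directed edge (i, j); edges are stored once with i < j."""
    return assign[(i, j)] if i < j else INVERSE[assign[(j, i)]]


def is_consistent(assign, n):
    """Check the transitivity constraints on every ordered triple of distinct nodes."""
    return all(
        label_of(assign, i, k) in compose(label_of(assign, i, j), label_of(assign, j, k))
        for i, j, k in permutations(range(n), 3)
    )


def infer(scores, n, annotated, use_transitivity):
    """Maximize the total score with annotated edges fixed (exhaustive search)."""
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=len(edges)):
        assign = dict(zip(edges, labels))
        if any(assign[e] != r for e, r in annotated.items()):
            continue
        if use_transitivity and not is_consistent(assign, n):
            continue
        total = sum(scores[e][assign[e]] for e in edges)
        if total > best_score:
            best, best_score = assign, total
    return best


# Made-up scores for the three edges of Fig. 1 (nodes renumbered 0-2).
scores = {
    (0, 1): {"before": 0.7, "after": 0.1, "equal": 0.1, "vague": 0.1},
    (1, 2): {"before": 0.2, "after": 0.2, "equal": 0.1, "vague": 0.5},
    (0, 2): {"before": 0.3, "after": 0.5, "equal": 0.05, "vague": 0.15},
}
annotated = {(0, 1): "before", (1, 2): "vague"}
local = infer(scores, 3, annotated, use_transitivity=False)
constrained = infer(scores, 3, annotated, use_transitivity=True)
print(local[(0, 2)], constrained[(0, 2)])
```

Without the transitivity constraints, the missing edge simply takes its highest-scoring label (after); with them, the annotated edges restrict it to {before, vague}, so before is selected. Enumeration is exponential in the number of edges and is used here only to keep the sketch self-contained; a realistic implementation would hand Eq. (1) to an ILP solver.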

Experiments
In this work, we consistently used TB-Dense as the fully annotated dataset (F) and TBAQ as the partially annotated dataset (P). The corpus statistics of these two datasets are provided in Table 1. Note that TBAQ is the union of TimeBank and AQUAINT and it originally contained 256 documents, but 36 of them completely overlapped with TB-Dense, so we excluded these when constructing P. In addition, the number of edges shown in Table 1 only counts the event-event relations (i.e., it does not include the event-time relations therein), which are the focus of this work.
We also adopted the original split of TB-Dense (22 documents for training, 5 documents for development, and 9 documents for test). Learning parameters were tuned to maximize their corresponding F-metric on the development set. Using the selected parameters, systems were retrained with the development set incorporated and evaluated against the test split of TB-Dense (about 1.4K relations: 0.6K vague, 0.4K before, 0.3K after, and 0.1K for the rest). Results are shown in Table 2, where all systems were compared in terms of their performances on "same sentence" edges (both nodes are from the same sentence), "nearby sentence" edges, all edges, and the temporal awareness metric used by the TempEval3 workshop.
The first part of Table 2 (Systems 1-5) refers to the baseline method proposed at the beginning of Sec. 3, i.e., simply treating P as F and training on their union. $P_{Full}$ is a variant of P obtained by filling its missing edges with vague. Since it labels too many vague TempRels, System 2 suffered from low recall. In contrast, P does not contain any vague training examples, so System 3 would only predict specific TempRels, leading to low precision. Given the obvious difference between F and $P_{Full}$, System 4 expectedly performed worse than System 1. However, System 5 was also worse than System 1, which is surprising because the annotated edges in P are correct and should have helped. This unexpected observation suggests that simply adding the annotated edges from P into F is not a proper approach to learning from both.
The second part (Systems 6-7) serves as an ablation study showing the effect of bootstrapping only. $P_{Empty}$ is another variant of P, obtained by removing all the annotated edges (that is, only nodes are kept). Thus, Systems 6-7 did not get any information from the annotated edges in P, and any improvement came from bootstrapping alone. Specifically, System 6 is the standard bootstrapping and System 7 is the constrained bootstrapping.
Built on top of Systems 6-7, Systems 8-9 further took advantage of the annotations of P, which resulted in additional improvements. Compared to System 1 (trained on F only) and System 5 (simply adding P into F), the proposed System 9 achieved much better performance, and the improvement is statistically significant with p<0.005 (McNemar's test). While System 7 can be regarded as a reproduction of Ning et al. (2017), the original paper of Ning et al. (2017) achieved an overall score of P=43.0, R=46.4, F=44.7 and an awareness score of P=42.6, R=44.0, and F=43.3, and the proposed System 9 is also better than Ning et al. (2017) on all metrics.
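The two variants of P used in Table 2 are simple transformations of a partially annotated document, as the following sketch illustrates. It assumes a document is represented by a list of event identifiers and a dictionary from event pairs to labels, and it enumerates all event pairs for simplicity (whereas only pairs within a two-sentence window are considered in this work); the function names and the data are hypothetical.

```python
from itertools import combinations


def make_p_full(events, annotated):
    """Variant of P in which every unannotated event pair is labeled 'vague'."""
    return {pair: annotated.get(pair, "vague") for pair in combinations(events, 2)}


def make_p_empty(events):
    """Variant of P in which all annotations are removed (None marks a missing label)."""
    return {pair: None for pair in combinations(events, 2)}


events = ["e1", "e2", "e3"]
annotated = {("e1", "e2"): "before"}  # made-up partial annotation
p_full = make_p_full(events, annotated)
p_empty = make_p_empty(events)
print(p_full)
print(p_empty)
```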

Discussion
While incorporating transitivity constraints in inference is widely used, Ning et al. (2017) proposed to incorporate these constraints in the learning phase as well. One of the algorithms proposed in Ning et al. (2017) is based on Chang et al. (2012)'s constraint-driven learning (CoDL), which is the same as our intermediate System 7 in Table 2; the fact that System 7 is better than System 1 can thus be considered as a reproduction of Ning et al. (2017). Despite the technical similarity, this work is motivated differently and is set to achieve a different goal: Ning et al. (2017) tried to enforce the transitivity structure, while the current work attempts to use imperfect signals (e.g., partially annotated) taken from additional data, and learn in the incidental supervision framework.
The P used in this work is TBAQ, where only 12% of the edges are annotated. In practice, every annotation comes at a cost, either time or the expenses paid to annotators, and as more edges are annotated, the marginal "benefit" of one edge decreases (an extreme case is that an edge is of no value if it can be inferred from existing edges). Therefore, a more general question is to find out the optimal ratio of graph annotations.
Moreover, partial annotation is only one type of annotation imperfection. If the annotation is noisy, we can replace the hard constraints derived from P with soft regularization terms; if the annotation is for a different but relevant task, we can formulate corresponding constraints to connect that different task to the task at hand. Being able to learn from these "indirect" signals is appealing because indirect signals are usually orders of magnitude larger than datasets dedicated to a single task.

Conclusion
Temporal relation (TempRel) extraction is important but TempRel annotation is labor intensive. While fully annotated datasets (F) are relatively small, there exist more datasets with partial annotations (P). This work provides the first investigation of learning from both types of datasets, and this preliminary study already shows promise. Two bootstrapping algorithms (standard and constrained) are analyzed and the benefit of P, although with missing annotations, is shown on a benchmark dataset. This work may be a good starting point for further investigations of incidental supervision and data collection schemes of the TempRel extraction task.
Table 2: Performance of various usages of the partially annotated data in training. F: Fully annotated data. P: Partially annotated data. $P_{Full}$: P with missing annotations filled by vague. $P_{Empty}$: P with all annotations removed. Bootstrap: referring to specific implementations of Line 6 in Algorithm 1, i.e., local or global. Same/nearby sentence: edges whose nodes appear in the same/nearby sentences in text. Overall: all edges. Awareness: the temporal awareness metric used in the TempEval3 workshop, measuring how useful the predicted graphs are (UzZaman et al., 2013). System 7 can also be considered as a reproduction of Ning et al. (2017) (see the discussion in Sec. 5 for details).