Classifying Temporal Relations by Bidirectional LSTM over Dependency Paths

Temporal relation classification is becoming an active research field. Many methods have been proposed, but most of them focus on extracting features from external resources. Less attention has been paid to a significant advance in a closely related task: relation extraction. In this work, we borrow a state-of-the-art method from relation extraction by applying bidirectional long short-term memory (Bi-LSTM) along dependency paths (DP). We make a “common root” assumption to extend DP representations to cross-sentence links. In the final comparison with two state-of-the-art systems on TimeBank-Dense, our model achieves comparable performance without using external knowledge or manually annotated entity attributes (class, tense, polarity, etc.).


Introduction
Recently, the need to extract temporal information from text has grown rapidly, driven by NLP tasks such as question answering (QA) and information extraction (IE). Along with TimeBank (Pustejovsky et al., 2003; https://catalog.ldc.upenn.edu/LDC2006T08) and other corpora annotated with temporal information, a series of temporal evaluation challenges (TempEval-1, 2, 3) (Verhagen et al., 2009, 2010; UzZaman et al., 2012) is attracting growing research effort.
Temporal relation classification is the task of identifying pairs of temporal entities (events or temporal expressions) that have a temporal link and classifying the temporal relation between them. For instance, we show an event-event (E-E) link of type 'DURING' in (i), an event-time (E-T) link of type 'INCLUDES' in (ii), and an event-DCT (document creation time, E-D) link of type 'BEFORE' in (iii).
(i) There was no hint of trouble in the last conversation between controllers and TWA pilot Steven Snyder.
(ii) In Washington today, the Federal Aviation Administration released air traffic control tapes.
(iii) The U.S. Navy has 27 ships in the maritime barricade of Iraq.
Marcu and Echihabi (2002) propose an approach that treats word-based pairs as useful features. Subsequent researchers (Laokulrat et al., 2013; Mani et al., 2006; D'Souza and Ng, 2013) focus on extracting lexical, syntactic or semantic information from external knowledge bases such as WordNet (Miller, 1995) and VerbOcean (Chklovski and Pantel, 2004). However, these feature-based methods rely on hand-crafted effort and external resources. In addition, to achieve high performance they require entity attributes (class, tense, polarity, etc.) as features, and these attributes are manually annotated. Consequently, they are hard to obtain in practical application scenarios.
In relation extraction, a large body of work builds on dependency path (DP) based methods, which apply various models along dependency paths (Bunescu and Mooney, 2005; Plank and Moschitti, 2013). In recent years, DP-based neural networks (Socher et al., 2011; Xu et al., 2015a,b) have shown state-of-the-art performance with fewer requirements on explicit features. Intuitively, DP-based approaches have the potential to classify temporal relations as well.
Both relation extraction and temporal relation classification require identifying the relationship between entities in text. However, temporal relation classification is more challenging, since it involves three different types of entities: 'event', 'time expression' and DCT. Cross-sentence links add further complexity to the task. Given the outstanding performance of DP-based neural networks in relation extraction, we borrow this state-of-the-art approach for temporal relation classification.
In Section 2 of this paper, we review related work and introduce TimeBank-Dense. In Section 3, we discuss the cross-sentence link problem and the architectures of our E-E, E-T and E-D classifiers. In Section 4, we run experiments on TimeBank-Dense and compare our model to the baseline and two state-of-the-art systems. Section 5 concludes.

Related Work
Current state-of-the-art temporal relation classifiers exploit a variety of features. Laokulrat et al. (2013) extract lexical and morphological features derived from WordNet synsets. Mani et al. (2006) and D'Souza and Ng (2013) incorporate semantic relations between verbs from VerbOcean as features. In addition, most systems include the entity attributes (Figure 1) specified in TimeML (http://timeml.org/) as basic features, which require heavy human annotation.
In this work, we move toward a more practical setting by using only word, part-of-speech (POS) and dependency parsing information, without entity attributes or any other external resources.
In relation extraction, Bunescu and Mooney (2005) observe that a relation can be captured by the shortest dependency path (SDP) between the two entities in the entire dependency graph. Plank and Moschitti (2013) extract syntactic and semantic information in a tree kernel. Following this line, researchers (Socher et al., 2011; Xu et al., 2015a,b) achieve state-of-the-art performance by building various neural networks over dependency paths.
Figure 2: An example of the DP representation of a cross-sentence link between the two sentences in Figure 1.
Our system is similar to that of Xu et al. (2015b), who apply an LSTM with max pooling separately to each feature channel along the dependency path. In contrast, our system applies a bidirectional LSTM to the concatenation of the feature embeddings.

TimeBank-Dense
In the original TimeBank, temporal links were created only for pairs with semantic connections, which led to a sparse annotation style. TimeBank-Dense instead relies on a mechanism that forces annotators to create complete graphs over the entities in neighboring sentences. Compared to 6,418 links in 183 TimeBank documents, TimeBank-Dense achieves greater density with 12,715 links in 36 documents.
We follow an experimental setting similar to that of the two systems we compare against (Mirza and Tonelli, 2016, and CAEVO), with the same 9 documents as test data and the others as training data (15% of the training data is held out as validation data for early stopping).

Cross-sentence Dependency Paths
Intuitively, the dependency-path idea can be carried over to temporal relation classification. However, around 64% of the E-E and E-T links in TimeBank-Dense have their two ends in neighboring sentences; we call these cross-sentence links.
A crucial obstacle is how to represent the dependency path of a cross-sentence link. In this work, we make the naive assumption that two neighboring sentences share a "common root". A cross-sentence dependency path can then be represented as two shortest dependency path branches, one from each end to the "common root", as shown in Figure 2.
Stanford CoreNLP is used to parse the syntactic structure of sentences in this work.
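The "common root" representation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the sentence dictionaries, the `heads` arrays, the function names and the `<ROOT>` symbol are all hypothetical, and the two toy parses are hand-written rather than produced by Stanford CoreNLP.

```python
COMMON_ROOT = "<ROOT>"  # hypothetical symbol for the root shared by two sentences


def branch_to_root(heads, start):
    """Return token indices on the path from `start` up to the sentence root.

    `heads[i]` is the index of token i's syntactic head, or -1 for the root.
    """
    path = [start]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def cross_sentence_branches(src_sent, src_idx, tgt_sent, tgt_idx):
    """Build the two SDP branches of a cross-sentence link.

    Each branch runs from one entity to the root of its own sentence and
    then to the assumed common root that joins the two sentences.
    """
    src = [src_sent["words"][i] for i in branch_to_root(src_sent["heads"], src_idx)]
    tgt = [tgt_sent["words"][i] for i in branch_to_root(tgt_sent["heads"], tgt_idx)]
    return src + [COMMON_ROOT], tgt + [COMMON_ROOT]


# Hand-written toy parses (assumptions, not parser output).
sent_a = {"words": ["The", "navy", "sent", "ships"], "heads": [1, 2, -1, 2]}
sent_b = {"words": ["Officials", "said", "the", "blockade", "continued"],
          "heads": [1, -1, 3, 4, 1]}

src_branch, tgt_branch = cross_sentence_branches(sent_a, 3, sent_b, 3)
print(src_branch)  # ['ships', 'sent', '<ROOT>']
print(tgt_branch)  # ['blockade', 'continued', 'said', '<ROOT>']
```

In this sketch the two branches meet only at the artificial root, so no parse spanning both sentences is needed; each sentence is parsed on its own.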

Temporal Relation Classifiers
Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a natural choice for processing sequential dependency paths. As the reversed order also carries useful information, a backward representation can be obtained by feeding the LSTM the same input in reverse. We adopt the concatenation of the forward and backward LSTM outputs, referred to as bidirectional LSTM (Graves and Schmidhuber, 2005).
Figure 3a shows the neural network architecture of our E-E and E-T classifiers. Given an E-E or E-T temporal link, our system first generates two SDP branches: 1) from the source entity to the common root, and 2) from the target entity to the common root. For each word along an SDP branch, the concatenation of its word, POS and dependency relation (DEP) embeddings is fed into the Bi-LSTM. The forward and backward outputs of both the source and target branches are concatenated and fed into a fully connected hidden layer. The final softmax layer generates multi-class predictions. Since an E-D link contains only a single event SDP branch, our system applies a similar architecture with a single-branch Bi-LSTM whose outputs are fed into the penultimate hidden layer, as shown in Figure 3b.
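The data flow of the two-branch classifier can be sketched in plain Python. The sketch below is only an illustration of the forward pass under several assumptions that the text does not confirm: the final hidden state of each direction is taken as the LSTM "output", each branch has its own forward and backward weights, the hidden layer uses tanh, and biases, dropout and training are omitted. All names, the toy dimensions, the random weights and the choice of six output classes are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def lstm_last_state(weights, dim, inputs):
    """Run a standard LSTM (no biases) over `inputs`; return the final hidden state."""
    h, c = [0.0] * dim, [0.0] * dim
    for x in inputs:
        z = matvec(weights, x + h)                      # gate pre-activations
        i = [sigmoid(v) for v in z[:dim]]               # input gate
        f = [sigmoid(v) for v in z[dim:2 * dim]]        # forget gate
        o = [sigmoid(v) for v in z[2 * dim:3 * dim]]    # output gate
        g = [math.tanh(v) for v in z[3 * dim:]]         # candidate cell values
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def bilstm_encode(fwd_w, bwd_w, dim, inputs):
    """Concatenate the final states of a forward pass and a reversed-input pass."""
    return lstm_last_state(fwd_w, dim, inputs) + lstm_last_state(bwd_w, dim, inputs[::-1])


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LinkClassifier:
    """Two-branch classifier for E-E / E-T links with untrained toy weights."""

    def __init__(self, input_dim, lstm_dim, hidden_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.lstm_dim = lstm_dim
        # Assumption: separate forward/backward weights for each branch.
        self.lstm_w = {name: rand_matrix(4 * lstm_dim, input_dim + lstm_dim, rng)
                       for name in ("src_fwd", "src_bwd", "tgt_fwd", "tgt_bwd")}
        self.hidden_w = rand_matrix(hidden_dim, 4 * lstm_dim, rng)
        self.out_w = rand_matrix(n_classes, hidden_dim, rng)

    def predict(self, src_branch, tgt_branch):
        w, d = self.lstm_w, self.lstm_dim
        src_enc = bilstm_encode(w["src_fwd"], w["src_bwd"], d, src_branch)
        tgt_enc = bilstm_encode(w["tgt_fwd"], w["tgt_bwd"], d, tgt_branch)
        hidden = [math.tanh(v) for v in matvec(self.hidden_w, src_enc + tgt_enc)]
        return softmax(matvec(self.out_w, hidden))


# Each token vector stands for concatenated word (4-d), POS (2-d) and DEP (2-d)
# embeddings; the values are random placeholders.
rng = random.Random(1)
src = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
tgt = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
model = LinkClassifier(input_dim=8, lstm_dim=8, hidden_dim=5, n_classes=6)
probs = model.predict(src, tgt)
print(len(probs), round(sum(probs), 6))
```

As in the text, the LSTM output dimension here equals the dimension of the concatenated token embeddings. An E-D variant of this sketch would encode a single branch and feed its forward and backward states to the hidden layer.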
In this work, we use word2vec 5 (Mikolov et

Hyper-parameters and Cross-validation
A grid search over the full hyper-parameter space would be time-consuming for three classifiers (E-E, E-T and E-D). Instead, we empirically set the output of each single LSTM to the same dimension (300) as the concatenation of the word, POS and DEP embeddings. The hidden layer has 200 dimensions.
Our system takes dependency paths as input, which means that links between entities in the same sentence have heavily overlapping word sequences as input. Simple cross-validation (CV) over links therefore cannot correctly reflect the generalization ability of our model. We use a grouped 5-fold CV based on the source entity ids (document id + sentence id) of links. This scheme can reduce the bias in either the source SDP or the target SDP. Although document-level CV would avoid this issue, it is not feasible for TimeBank-Dense because it contains only 27 training documents.
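A grouped split of this kind can be sketched as follows. This is an illustrative sketch, not the splitting code used in this work: the link dictionaries, the field names `doc_id` and `src_sent_id`, and the greedy fold-balancing strategy are assumptions. The only property taken from the text is that links sharing a source key (document id + sentence id) never appear on both sides of a split.

```python
from collections import defaultdict


def source_key(link):
    """Group key of a link: document id plus source sentence id."""
    return (link["doc_id"], link["src_sent_id"])


def grouped_kfold(links, k=5):
    """Yield (train, test) lists in which no source key is shared."""
    groups = defaultdict(list)
    for link in links:
        groups[source_key(link)].append(link)

    # Greedy balancing (an assumption): place the largest groups first,
    # always into the fold that currently holds the fewest links.
    folds = [[] for _ in range(k)]
    for key in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[key])

    for i in range(k):
        train = [link for j, fold in enumerate(folds) if j != i for link in fold]
        yield train, folds[i]


# Toy links with hypothetical ids.
links = [{"doc_id": "doc%d" % (n % 3), "src_sent_id": n % 4} for n in range(24)]
for train, test in grouped_kfold(links, k=5):
    shared = {source_key(l) for l in train} & {source_key(l) for l in test}
    print(len(train), len(test), len(shared))
```

The last number printed on each line is the count of source keys shared between the training and test sides, which should be zero for every fold.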
Early stopping is used to save the best model based on the validation data. In each run of the 5-fold cross-validation, we split off 80% of the 'original training' data as 'tentative training' and 20% as 'tentative test'. 85% of 'tentative training' is used for learning and 15% for validation. We also apply early stopping in the final system, on the validation data (15% of 'original training'). The patience is set to 10.
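The role of the patience value can be illustrated with a short sketch. It assumes that a higher validation score is better and that training stops once the score has failed to improve for `patience` consecutive epochs; the function name and the example scores are hypothetical.

```python
def best_epoch_with_patience(val_scores, patience=10):
    """Return (epoch, score) of the best model seen before early stopping.

    `val_scores[e]` is the validation score after epoch e. Training stops
    once `patience` epochs have passed without a new best score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch  # new best model is saved here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score


scores = [0.40, 0.45, 0.44, 0.43, 0.46]  # made-up validation scores
print(best_epoch_with_patience(scores, patience=2))   # (1, 0.45)
print(best_epoch_with_patience(scores, patience=10))  # (4, 0.46)
```

With a patience of 2 the run stops before the late improvement at the last epoch, while a patience of 10 lets it be found; a larger patience trades training time for a lower risk of stopping too early.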
Dropout (Srivastava et al., 2014) has proved to be a useful approach for preventing neural networks from over-fitting. We apply dropout separately after the following layers: embeddings, LSTM, and hidden layer, to investigate its impact on performance. Table 1 shows the best CV results recorded while tuning dropout. The hyper-parameter setting with the best CV performance is adopted in the final system.

Overall Performance
Recently, Mirza and Tonelli (2016) reported state-of-the-art performance on TimeBank-Dense. They make a new attempt to mine the value of low-dimensional word embeddings by concatenating them with sparse traditional features. Their traditional features include entity attributes, temporal signals, semantic information from WordNet, etc., which makes their performance hard to challenge. In Tables 2 and 3, 'Mirza' denotes their system.
The final comparison is shown in Table 3. We provide a baseline with one fully connected hidden layer (200 dimensions) that takes word and POS embeddings as input (without any dependency information). The significant improvement of our proposed model over the baseline indicates the effectiveness of the dependency path information and our Bi-LSTM in classifying temporal links.
As a hybrid system, 'CAEVO' includes hand-crafted rules in its E-T and E-D classifiers. For instance, the temporal prepositions in, on, over, during, and within indicate 'IS_INCLUDED' relations. This system is superior in E-T and E-D. 'Mirza' takes a purely feature-based approach and performs slightly better than 'CAEVO' in E-E and overall. Our system shows the highest scores in E-E and overall among the four systems. In general, our system achieves performance comparable to the two state-of-the-art systems without using any hand-crafted features, rules, or external resources.

Conclusion
We bring the idea of dependency path based neural networks into temporal relation classification. A "common root" assumption adapts our model to cross-sentence links. Our model adopts bidirectional LSTM to capture information from both the forward and backward orders. We observe a significant benefit of the DP-based Bi-LSTM model by comparing it to the baseline. Our model achieves performance comparable to two state-of-the-art systems without using any explicit features (class, tense, polarity, etc.) or external resources, which indicates that our model can capture such information automatically.