A Sequential Model for Classifying Temporal Relations between Intra-Sentence Events

We present a sequential model for temporal relation classification between intra-sentence events. The key observation is that the overall syntactic structure and compositional meaning of the multi-word context between events are important for distinguishing among fine-grained temporal relations. Specifically, our approach first extracts a sequence of context words that indicates the temporal relation between two events and that aligns well with the dependency path between the two event mentions. The context word sequence, together with a part-of-speech tag sequence and a dependency relation sequence generated in correspondence with the word sequence, is then provided as input to bidirectional recurrent neural network (LSTM) models. The neural nets learn compositional syntactic and semantic representations of the contexts surrounding the two events and predict the temporal relation between them. Evaluation on the TimeBank corpus shows that sequential modeling is capable of accurately recognizing temporal relations between events and outperforms a neural net model that takes various discrete features as input, imitating previous feature-based models.


Introduction
Identifying temporal relations between events is crucial to constructing event timelines. It has direct applications in tasks such as question answering, event timeline generation and document summarization.
Bush said he saw little reason to be optimistic about a settlement of the dispute, which stems from Iraq's invasion of oil-wealthy Kuwait and its subsequent military buildup on the border of Saudi Arabia.
Relations: rel 1: dispute after invasion; rel 2: invasion ibefore buildup; rel 3: dispute after buildup.
Figure 1: Example sentence to illustrate the temporal context for event pairs.
Previous works studied this task as a classification problem based on discrete features defined over lexico-syntactic, semantic and discourse information. However, these features are often derived from the local contexts of the two events and are only capable of capturing direct evidence of the temporal relation. Specifically, when two events are distantly located or are separated by other events, feature-based approaches often fail to utilize compositional evidence, which is hard to encode using discrete features.
Consider the example sentence in Figure 1. Here, the first two temporal relations, dispute after invasion (rel 1) and invasion ibefore buildup (rel 2), involve events that are close to each other, and discrete features such as dependency relations and bag-of-words extracted from the local contexts of the two events might be sufficient to detect their relations correctly. However, for the temporal relation dispute after buildup (rel 3), the context between the two events is long and complex and involves another event (invasion) as well, which makes it challenging for any individual feature or feature combination to capture the temporal relation. We propose that the overall syntactic structure of the in-between context, including the linear order of words, as well as the compositional semantics of multi-word contexts are critical for predicting the temporal relation between two events. Furthermore, the most important syntactic and semantic structures are derived along the dependency path between the two event mentions. This aligns well with the observation that semantic composition relates to grammatical dependency relations (Monroe and Wang, 2014; Reddy et al., 2016).
Our approach defines rules on dependency parse trees to extract contexts that indicate temporal relations. First, we extract the dependency path between two event mentions. Then we apply two heuristic rules to enrich the extracted dependency paths and to deal with complex syntactic structures such as punctuation. Empirically, we found that the part-of-speech (POS) tag and dependency relation sequences generated along the dependency path provide evidence for predicting the temporal relation as well.
We use neural net sequence models to capture structural and semantic compositionality in describing temporal relations between events. Specifically, we generate three sequences for each dependency path: the word sequence, the POS tag sequence and the dependency relation sequence. Using the three types of sequences as input, we train bi-directional LSTM models that consume each of the three sequences and model compositional structural information, both syntactically and semantically.
The evaluation shows that each type of sequence is useful for temporal relation classification between events. Our complete neural net model, which takes all three types of sequences as input, performs the best and clearly outperforms feature-based models.

Related Works
Most previous works on temporal relation classification rely on feature-based classifiers. Mani et al. (2006) built a MaxEnt classifier on hand-tagged features in the corpus, including tense, aspect, modality, polarity and event class, for classifying temporal relations. Later, Chambers et al. (2007) used a two-stage classifier that first learned imperfect event attributes and then combined them with other linguistic features in the second stage to perform the classification.
Subsequent works mostly expanded the feature sets (Cheng et al., 2007; Bethard and Martin, 2007; UzZaman et al., 2012; Bethard, 2013; Kolomiyets et al., 2012; Chambers, 2013; Laokulrat et al., 2013). Specifically, Chambers (2013) used the direct dependency path between event pairs to capture syntactic context. Laokulrat et al. (2013) used 3-grams of paths between two event mentions in a dependency tree as features instead of full paths, as the latter are too sparse. We found that modeling the entire path as one sequence provides greater compositional evidence on the temporal relation. In addition, modifiers attached to the words in a path with specific dependency relations such as nmod:tmod are also informative.
Ng (2013) proposed a hybrid system for temporal relation classification that combines a learned classifier with 437 hand-coded rules. Their system first applied high-accuracy rules and then used the learned classifier, trained on rich features including those high-accuracy rules, to classify the cases that were not handled by the rules.  also showed the effectiveness of different discourse analysis frameworks for this task. Later, Mirza and Tonelli (2014) showed that a simpler approach based on lexico-syntactic features achieved results comparable to Ng (2013). They also reported that the dependency order between events, either governor-dependent or dependent-governor, was not useful in their experiments. However, we show that dependency relations, when modeled as a sequence, contribute significantly to this task.

Temporal Link Labeling
In this section, we describe the task of temporal relation classification, the dataset, the context word sequence extraction model and the recurrent neural net based classifier.

Task description
Early works on temporal relation classification (Mani et al., 2006; Chambers et al., 2007) and the first two editions of TempEval (Verhagen et al., 2007, 2010) simplified the task by considering only six relation types. They merged each pair of mutually inverse relation types and ignored the relations during and during inv. TempEval-3 (UzZaman et al., 2013) then extended the task to the complete 14-class classification problem, and all later works have considered all 14 relations. Following these recent works, our model performs 14-class classification, which is arguably more challenging (Ng, 2013). We also consider gold annotated event pairs, mainly because the corpus is small and the distribution of relations is very skewed. All previous works focusing on classifying temporal relation types likewise assumed gold annotation.
We used TimeBank corpus v1.2 for training and evaluating our model. The corpus contains 2308 event pairs that occur within the same sentence, annotated with 14 temporal relations. These relations (Saurí et al., 2006) are simultaneous, before, after, ibefore, iafter, begins, begun by, ends, ended by, includes, is included, during, during inv and identity. Twelve of them form six pairs of mutually inverse relations, and the other two are commutative ($e_1 R e_2 \equiv e_2 R e_1$, $R \in \{\text{identity}, \text{simultaneous}\}$). Our sequential model requires that a relation always holds between $e_1$ and $e_2$, where $e_1$ occurs before $e_2$ in the sentence. Therefore, before extracting the sequence, we inverted the relation type in cases where it was annotated in the opposite order. The final distribution of the dataset is given in Table 1.
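The relation inversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `normalize_pair`, the use of token positions to decide which event comes first, and the string spelling of the relation names are all assumptions made for the example.

```python
# Hypothetical sketch: reorder an annotated event pair so that the first
# event is the one that appears earlier in the sentence, inverting the
# relation type when the annotation was in the opposite order.

# Six pairs of mutually inverse relations (names follow the list above).
_INVERSE_PAIRS = [
    ("before", "after"),
    ("ibefore", "iafter"),
    ("begins", "begun by"),
    ("ends", "ended by"),
    ("includes", "is included"),
    ("during", "during inv"),
]
# Two commutative relations, which are their own inverse.
_SYMMETRIC = ["identity", "simultaneous"]

INVERSE = {}
for rel, inv in _INVERSE_PAIRS:
    INVERSE[rel] = inv
    INVERSE[inv] = rel
for rel in _SYMMETRIC:
    INVERSE[rel] = rel


def normalize_pair(source_pos, target_pos, relation):
    """Return (e1_pos, e2_pos, relation) with e1 preceding e2 in the sentence.

    source_pos and target_pos are token positions of the two event
    mentions (an assumed representation of the annotation).
    """
    if source_pos <= target_pos:
        return source_pos, target_pos, relation
    return target_pos, source_pos, INVERSE[relation]
```

For instance, an annotation stating that the event at position 12 is after the event at position 5 would become the pair (5, 12) with the relation before, while a simultaneous relation keeps its label regardless of order.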

Extracting Context Word Sequence
First, we extract the words that lie on the dependency path between the two event mentions. However, the two events can be far apart in a sentence and involved in complex syntactic structures. Therefore, we also apply two heuristic rules to deal with such structures, e.g., when the two event mentions are in separate clauses and have a punctuation mark in their context. We describe our specific rules below. We used the Stanford parser (Chen and Manning, 2014) to generate dependency relations and part-of-speech tags, and all notations follow enhanced universal dependencies (De Marneffe and Manning, 2008).
Rule 1 (punctuation): A comma directly influences the meaning of text, and omitting it may alter the meaning of a phrase. Therefore, we include a comma if it precedes or follows $e_1$, $e_2$ or their modifiers.
Rule 2 (children): Modifiers such as now, then, will, yesterday, subsequent, when and was contain information on the temporal order of events and help ground events to the timeline. These modifiers are often related to event mentions through a specific class of dependency relations. Therefore, we include all children of $e_1$, $e_2$ and the other words on the path between them that are connected by one of the dependency relations nmod:tmod, mark, case, aux, conj, expl, cc, cop, amod, advmod, punct or ref.
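The extraction procedure above (dependency path plus the two rules) can be sketched on a toy parse. Everything in this example is an assumption made for illustration: the `Token` structure, the hand-built parse of a made-up sentence (not Stanford parser output), and in particular the reading of Rule 1 as "keep a comma that is adjacent in surface order to an event mention or to one of its selected modifiers".

```python
from collections import namedtuple

# Hypothetical token record: position, word form, POS tag, head position
# (-1 for the root) and dependency relation to the head.
Token = namedtuple("Token", "idx word pos head deprel")

# Dependency relations listed in Rule 2.
RULE2_DEPRELS = {"nmod:tmod", "mark", "case", "aux", "conj", "expl",
                 "cc", "cop", "amod", "advmod", "punct", "ref"}


def ancestors(tokens, i):
    """Return the chain [i, head(i), ..., root]."""
    chain = []
    while i != -1:
        chain.append(i)
        i = tokens[i].head
    return chain


def dependency_path(tokens, e1, e2):
    """Positions on the tree path between e1 and e2 (both included)."""
    up1, up2 = ancestors(tokens, e1), ancestors(tokens, e2)
    common = next(n for n in up1 if n in up2)  # lowest common ancestor
    return set(up1[:up1.index(common) + 1]) | set(up2[:up2.index(common)])


def extract_context(tokens, e1, e2):
    """Return the word, POS and dependency sequences for an event pair."""
    path = dependency_path(tokens, e1, e2)
    # Rule 2: children of path words attached by one of the listed relations.
    children = {t.idx for t in tokens
                if t.head in path and t.deprel in RULE2_DEPRELS}
    # Rule 1 (simplified reading, see lead-in): commas next to an event
    # mention or to one of its selected modifiers.
    anchors = {e1, e2} | {c for c in children if tokens[c].head in (e1, e2)}
    commas = {t.idx for t in tokens if t.word == ","
              and (t.idx - 1 in anchors or t.idx + 1 in anchors)}
    keep = sorted(path | children | commas)  # restore surface order
    return ([tokens[i].word for i in keep],
            [tokens[i].pos for i in keep],
            [tokens[i].deprel for i in keep])


# Hand-built parse of "John left , then Mary arrived yesterday ."
TOKENS = [
    Token(0, "John", "NNP", 1, "nsubj"),
    Token(1, "left", "VBD", -1, "root"),
    Token(2, ",", ",", 1, "punct"),
    Token(3, "then", "RB", 5, "advmod"),
    Token(4, "Mary", "NNP", 5, "nsubj"),
    Token(5, "arrived", "VBD", 1, "parataxis"),
    Token(6, "yesterday", "NN", 5, "nmod:tmod"),
    Token(7, ".", ".", 1, "punct"),
]

words, pos_tags, deprels = extract_context(TOKENS, 1, 5)
```

On this toy input the subjects John and Mary are dropped, because nsubj is not among the Rule 2 relations, while then and yesterday are kept as temporal modifiers of the second event.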

Sequences and Classifier
We form three sequences on the extracted context words (with $t$ words), which are based on (i) part-of-speech tags: $P_T = p_1, p_2, \ldots, p_t$, (ii) dependency relations: $D_T = d_1, d_2, \ldots, d_n$, and (iii) word forms: $W_T = w_1, w_2, \ldots, w_t$.
We transform each $p_i$ and $d_i$ to a one-hot vector and each $w_i$ to a pre-trained embedding vector (Pennington et al., 2014). Each sequence of vectors is then encoded using its corresponding forward ($LSTM_f$) and backward ($LSTM_b$) LSTM layers.
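The one-hot step for the POS tag and dependency relation sequences can be sketched as below. This is only an illustration under stated assumptions: the helper names `build_index` and `one_hot` are hypothetical, the symbol inventory is built from a single toy sequence rather than from the training data, and the pre-trained word embeddings and the LSTM layers themselves are not shown.

```python
def build_index(sequences):
    """Map every symbol seen in the given sequences to a vector position."""
    symbols = sorted({s for seq in sequences for s in seq})
    return {s: i for i, s in enumerate(symbols)}


def one_hot(seq, index):
    """Encode a symbol sequence as a list of one-hot vectors."""
    vectors = []
    for symbol in seq:
        vec = [0] * len(index)
        vec[index[symbol]] = 1
        vectors.append(vec)
    return vectors


# Toy POS tag sequence for one extracted context (assumed input).
pos_seq = ["VBD", ",", "RB", "VBD", "NN"]
pos_index = build_index([pos_seq])

# The forward layer reads the sequence in its original order; the
# backward layer reads the same vectors in reverse order.
forward_input = one_hot(pos_seq, pos_index)
backward_input = forward_input[::-1]
```

The same two helpers apply unchanged to the dependency relation sequence; only the word sequence takes a different route, through the embedding lookup.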
Classifier: Figure 2 shows an overview of our model. It consists of six LSTM (Hochreiter and

Evaluation
We evaluate our model using accuracy, which has been used in previous research on temporal relation classification. We also compare model performance using per-class F-score and macro F-score. We briefly describe all the systems we have used for evaluation.
Majority Class: assigns "after" relation to all event pairs.
Unidirectional LSTMs: use a single LSTM layer to encode each sequence (POS tags, dependency relations and word forms) individually for the extracted phrase in forward order.
Bidirectional LSTMs: use two LSTM layers to encode each sequence individually, taken from the POS tag, dependency relation and word form sequences. The first layer encodes the sequence in forward order and the second in reverse order.
2 Sequences: bi-directional LSTM based models considering all combinations of two sequences taken from POS tags, dependency and word forms sequences.
Full model: our complete sequential model considering POS, dependency and word forms sequences.
Direct dependency path: the same as Full model except that the two heuristic rules were not applied in extracting sequences.
Baseline I: a neural network classifier using the discrete features described in Mirza and Tonelli (2014) and Ng (2013). The features used are: POS tag, dependency relation, token and lemma of $e_1$ ($e_2$); dependency relations between $e_1$ ($e_2$) and their children; binary features indicating if $e_1$ and $e_2$ are related by the 'happens-before' or the 'similar' relation according to VerbOcean (Chklovski and Pantel, 2004), if $e_1$ and $e_2$ have the same POS tag, or if $e_1$ ($e_2$) is the root and $e_1$ modifies (or governs) $e_2$; the dependency relation between $e_1$ and $e_2$ if they are directly connected in the dependency parse tree; prepositions that modify (or govern) $e_1$ ($e_2$); signal words (Derczynski and Gaizauskas, 2012); and the entity distance between $e_1$ and $e_2$. These features are concatenated and fed into an output neural layer with 14 neurons.
Baseline II: a neural network classifier using POS tags and word forms of words in the surface path as input. The surface path consists of words that lie in between two event mentions based on the original sentence. The classifier uses four LSTM layers to encode both POS tag and word sequences in forward and backward order. The output neural layer and parameters for all LSTM layers are kept the same as the Full model.
Baseline III: a neural network classifier based on event embeddings for both event mentions that were learned using bidirectional LSTMs (Kiperwasser and Goldberg, 2016). The learning uses two LSTM layers, each with 150 neurons and a dropout of 0.2, to embed the forward and backward representations of each event mention. The inputs to the LSTM layers are sequences of concatenated word embeddings and POS tags, each sequence corresponding to the 19 context words to the left or to the right of an event mention for the forward or the backward LSTM layer respectively. The event embeddings are then concatenated and fed into an output neural layer with 14 neurons.
All baselines are trained using the rmsprop optimizer on a categorical cross-entropy objective, and their output layers use the softmax activation function.
Table 3: Per-class results of our best system and Baseline I.
Table 2 reports accuracy scores for all the systems. We see that simple sequential models outperform the strong feature-based system, Baseline I, which used various discrete features. Note that the dependency relation and POS tag sequences alone achieve reasonably high accuracies. This implies that an important aspect of a temporal relation is contained in the syntactic context of the event mentions. Moreover, Mirza and Tonelli (2014) observed that discrete features based on the dependency parse tree did not contribute to improving their classifier's accuracy. In contrast, using the sequence of dependency relations yields a high accuracy in our setting, which shows the advantage of sequential representations for this task. Our Full Model achieves a performance gain of 11.35% over Baseline I.
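The evaluation measures used in this section (accuracy, per-class F-score and macro F-score) can be sketched as follows, together with the Majority Class system. This is a generic illustration, not the authors' evaluation script: the function names are hypothetical, the gold labels are made up, and averaging the macro F-score over every label that appears in the gold or predicted labels is an assumption.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of event pairs whose predicted relation matches the gold one."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold, pred):
    """Return {relation: F1} for every relation seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted relation p was wrong
            fn[g] += 1  # gold relation g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Made-up gold labels; the Majority Class system predicts "after" everywhere.
gold = ["after", "after", "before", "ibefore"]
majority_pred = ["after"] * len(gold)

majority_accuracy = accuracy(gold, majority_pred)
majority_scores = per_class_f1(gold, majority_pred)
majority_macro = macro_f1(gold, majority_pred)
```

On a skewed label distribution the majority system can reach a non-trivial accuracy while its macro F-score stays low, which is why the per-class and macro scores are reported alongside accuracy.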

Results and Discussion
We developed two more baselines (Baseline II and Baseline III) that do not require syntactic information, as well as the Direct dependency path model, which uses no rules. The Full Model outperformed them by 9.42%, 8.57% and 4.07% respectively. This affirms that the most useful syntactic and semantic structures are derived along dependency paths, and that additional context words indirectly attached to event words, including prepositions, signal words and punctuation, carry evidence on temporal relations as well. Table 3 compares the precision, recall and F1 scores of our Full Model with those of Baseline I. Our model performs reasonably well compared to the baseline system for most of the classes. In addition, it is able to identify relations that are present in small proportions, such as begun by, ibefore and iafter, which the baseline system could not identify. A similar observation was reported by Mirza and Tonelli (2014): the relation types begins, ibefore, ends and during are difficult to identify using feature-based systems, which often generate false positives for the before and after relations.

Conclusion and Future work
In this paper, we have focused on modeling syntactic structural information and the compositional semantics of contexts in predicting temporal relations between events in the same sentence. Our approach extracts lexical and syntactic sequences from the contexts between two events and feeds them to recurrent neural nets. The evaluation shows that our sequential models are promising in distinguishing among fine-grained temporal relations.
In the future, we will extend our sequential models to predict temporal relations for event pairs spanning across multiple sentences, for instance by incorporating discourse relations between sentences in a sequence.