Deep Structured Neural Network for Event Temporal Relation Extraction

We propose a novel deep structured learning framework for event temporal relation extraction. The model consists of 1) a recurrent neural network (RNN) to learn scoring functions for pair-wise relations, and 2) a structured support vector machine (SSVM) to make joint predictions. The neural network automatically learns representations that account for long-term contexts to provide robust features for the structured model, while the SSVM incorporates domain knowledge such as transitive closure of temporal relations as constraints to make better globally consistent decisions. By jointly training the two components, our model combines the benefits of both data-driven learning and knowledge exploitation. Experimental results on three high-quality event temporal relation datasets (TCR, MATRES, and TB-Dense) demonstrate that, when combined with pre-trained contextualized embeddings, the proposed model achieves significantly better performance than the state-of-the-art methods on all three datasets. We also provide thorough ablation studies to investigate our model.


Introduction
Event temporal relation extraction aims at building a graph where nodes correspond to events within a given text, and edges reflect temporal relations between the events. Figure 1a illustrates an example of such a graph for the paragraph shown in the figure. Different types of edges specify different temporal relations: the event filed is SIMULTANEOUS with claiming, overruled is BEFORE claiming, and overruled is also BEFORE filed. Temporal relation extraction is beneficial for many downstream tasks such as question answering, information retrieval, and natural language generation. An event graph can potentially be leveraged to help time-series forecasting and provide guidance for natural language generation. The CaTeRs dataset (Mostafazadeh et al., 2016), which annotates temporal and causal relations, is constructed for this purpose.

Figure 1: An illustration of a paragraph with its partial temporal graph. (a) shows the ground-truth temporal graph, in which the temporal relations SIMULTANEOUS and BEFORE are presented. (b) and (c) are the local and structured predictions, respectively, for three of the event pairs in (a). Local predictions are incompatible with the temporal transitivity rule: overruled has to be BEFORE claiming if overruled is BEFORE filed and filed is SIMULTANEOUS with claiming. The structured model achieves coherence by reversing the temporal order between claiming and overruled.

* The authors contribute equally, alphabetical order.
A major challenge in temporal relation extraction stems from its nature of being a structured prediction problem. Although a relation graph can be decomposed into individual relations on each event pair, any local model that is not informed by the whole event graph will usually fail to make globally consistent predictions, thus degrading the overall performance. Figure 1b gives an example where the local model classifies the relation between overruled and claiming incorrectly because it only considers pairwise predictions: the temporal transitivity constraint of the graph is violated, given that the relation between filed and claiming is SIMULTANEOUS. In Figure 1c, the structured model changes the prediction of the relation between overruled and claiming from AFTER to BEFORE to ensure compatibility of all predicted edge types.
Prior works on event temporal relation extraction mostly formulate it as a pairwise classification problem (Bethard, 2013; Laokulrat et al., 2013; Chambers, 2013), disregarding the global structures. Bramsen et al. (2006a); Chambers and Jurafsky (2008); Do et al. (2012); Ning et al. (2018b,a) explore leveraging global inference to ensure consistency for all pairwise predictions. There are a few prior works that directly model global structure in the training process (Yoshikawa et al., 2009; Ning et al., 2017; Leeuwenberg and Moens, 2017). However, these structured models rely on hand-crafted features using linguistic rules and local context, which cannot adequately capture potential long-term dependencies between events. In the example shown in Figure 1, filed occurs in a much earlier context than overruled. Thus, incorporating long-term contextual information can be critical for correctly predicting temporal relations.
In this paper, we propose a novel deep structured learning model to address the shortcomings of the previous methods. Specifically, we adapt the structured support vector machine (SSVM) (Finley and Joachims, 2008) to incorporate linguistic constraints and domain knowledge for making joint predictions on event temporal relations. Furthermore, we augment this framework with recurrent neural networks (RNNs) to learn long-term contexts. Despite the recent success of employing neural network models for event temporal relation extraction (Tourille et al., 2017a; Cheng and Miyao, 2017; Meng et al., 2017; Meng and Rumshisky, 2018), these systems make pairwise predictions and do not take advantage of problem structures.
We develop a joint end-to-end training scheme that enables the feedback from global structure to directly guide neural networks to learn representations, and hence allows our deep structured model to combine the benefits of both data-driven learning and knowledge exploitation. In the ablation study, we further demonstrate the importance of each global constraint, the influence of linguistic features, as well as the usage of contextualized word representations in the local model.
To summarize, our main contributions are:
• We propose a deep SSVM model for event temporal relation extraction.
• We show strong empirical results and establish a new state of the art on three event relation benchmark datasets.
• Extensive ablation studies and thorough error analysis are conducted to understand the capacity and limitations of the proposed model, which provide insights for future research on temporal relation extraction.

Related Work
Temporal Relation Data. Temporal relation corpora such as TimeBank (Pustejovsky et al., 2003) and RED (O'Gorman et al., 2016) facilitate the research in temporal relation extraction. The common issue in these corpora is missing annotation. Collecting densely annotated temporal relation corpora with all event pairs fully annotated has been reported to be a challenging task, as annotators could easily overlook some pairs (Bethard et al., 2007). The TB-Dense dataset mitigates this issue by forcing annotators to examine all pairs of events within the same or neighboring sentences. Recent data construction efforts such as MATRES (Ning et al., 2018b) and TCR (Ning et al., 2018a) further enhance the data quality by using a multi-axis annotation scheme and adopting the start-point of events to improve inter-annotator agreement. However, densely annotated datasets are relatively small both in terms of number of documents and event pairs, which restricts the complexity of machine learning models used in previous research.
Later efforts, such as ClearTK (Bethard, 2013), UTTime (Laokulrat et al., 2013), and NavyTime (Chambers, 2013), improve earlier work by feature engineering with linguistic and syntactic rules. A noteworthy work, CAEVO, builds a pipeline with ordered sieves. Each sieve is either a rule-based classifier or a machine learning model; sieves are sorted by precision, i.e. decisions from a lower-precision classifier cannot contradict those from a higher-precision model. More recently, neural network-based methods have been employed for event temporal relation extraction (Tourille et al., 2017a; Cheng and Miyao, 2017; Meng et al., 2017; Han et al., 2019a) and have achieved impressive results. However, they all treat the task as a pairwise classification problem. Meng and Rumshisky (2018) considered incorporating global context for pairwise relation predictions, but they do not explicitly model the output graph structure for event temporal relations.
There are a few prior works exploring structured learning for temporal relation extraction (Yoshikawa et al., 2009;Ning et al., 2017;Leeuwenberg and Moens, 2017). However, their local models use hand-engineered linguistic features. Despite the effectiveness of hand-crafted features in previous research, the design of features usually fails to capture long-term context in the discourse. Therefore, we propose to enhance the hand-crafted features with contextual representations learned through RNN models and develop an integrated joint training process.

Methods
We adopt the notation of Ning et al. (2018a), where R denotes the set of all possible relations and E denotes the set of all event entities.

Deep SSVM
Our deep SSVM model adapts the SSVM loss as

$$L = \sum_{n=1}^{l} \frac{1}{M^n} \max\left(0,\; \Delta(y^n, \hat{y}^n) + S(\hat{y}^n; x^n) - S(y^n; x^n)\right) + C\,\lVert \Phi \rVert^2 \qquad (1)$$

where $\Phi$ denotes model parameters, $n$ indexes instances, and $M^n$ is the number of event pairs in instance $n$. $y^n$ and $\hat{y}^n$ denote the gold and predicted global assignments for instance $n$, each of which consists of $M^n$ one-hot vectors $y^n_{i,j}, \hat{y}^n_{i,j} \in \{0, 1\}^{|R|}$ representing the true and predicted relation labels for event pair $(i, j)$ respectively. $\Delta(y^n, \hat{y}^n)$ is a distance measurement between the gold and the predicted assignments; we simply use the Hamming distance. $C$ is a hyper-parameter to balance the loss and the regularizer, and $S(y^n; x^n)$ is a pairwise scoring function to be learned.

Figure 2: An overview of the proposed deep structured event relation extraction framework. The input representations consist of BERT representations ($v_{w,k}$) and POS tag embeddings ($v_{p,k}$). They are concatenated and passed through BiLSTM layers and classification layers to get pairwise local scores. An incompatible local pairwise prediction (denoted by red lines) is corrected by the SSVM layer. Edge notation follows Figure 1, and $t_1, \ldots, t_N$ denote tokens in the input sentence.
The intuition behind the SSVM loss is that it requires the score of the gold output structure $y^n$ to be greater than the score of the best output structure under the current model, $\hat{y}^n$, by a margin of $\Delta(y^n, \hat{y}^n)$; otherwise there will be some loss.
The major difference between our deep SSVM and the traditional SSVM model is the scoring function. A traditional SSVM uses a linear function over hand-crafted features to compute the scores, whereas we propose to use an RNN for estimation.
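The hinge term of the loss can be illustrated with a small sketch. This is not the authors' implementation: it assumes local scores are stored in plain dictionaries, that the score of a structure is the sum of the local scores of its chosen labels, and it leaves out the regularizer term. The names `hamming`, `structure_score`, and `ssvm_instance_loss` are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]
Scores = Dict[Pair, Dict[str, float]]  # assumed layout: local score per (pair, label)
Assignment = Dict[Pair, str]           # one relation label per event pair


def hamming(gold: Assignment, pred: Assignment) -> int:
    """Distance between assignments: number of pairs whose labels differ."""
    return sum(1 for pair in gold if gold[pair] != pred[pair])


def structure_score(scores: Scores, assignment: Assignment) -> float:
    """Score of a whole assignment: sum of the local scores of its labels."""
    return sum(scores[pair][label] for pair, label in assignment.items())


def ssvm_instance_loss(scores: Scores, gold: Assignment, pred: Assignment) -> float:
    """Hinge term for one instance, normalised by its number of event pairs.

    The regularizer is omitted here (assumption: it is handled separately,
    for example through weight decay in the optimizer).
    """
    violation = (
        hamming(gold, pred)
        + structure_score(scores, pred)
        - structure_score(scores, gold)
    )
    return max(0.0, violation) / len(gold)
```

When the predicted assignment equals the gold one, the distance and the score difference are both zero, so the term vanishes. A wrong structure that outscores the gold structure is penalised in proportion to how many pairs it gets wrong.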

RNN-Based Scoring Function
We introduce an RNN-based pair-wise scoring function to learn features in a data-driven way and capture long-term context in the input. The local neural architecture is inspired by prior work in entity relation extraction such as Tourille et al. (2017b). As shown in Figure 2, the input layer consists of word representations and part-of-speech (POS) tag embeddings of each token in the input sentence, denoted as $v_{w,k}$ and $v_{p,k}$ respectively. The word representations are obtained via a pre-trained BERT model (Devlin et al., 2018) and are fixed throughout training, while the POS tag embeddings are tuned. The word and POS tag embeddings are concatenated to represent an input token, and then fed into a Bi-LSTM layer to get contextualized representations.
We assume the events are labeled in the text and use indices $i$ and $j$ to denote the tokens associated with an event pair $(i, j) \in EE$ in the input sentences of length $N$. For each event pair $(i, j)$, we take the forward and backward hidden vectors corresponding to each event, namely $f_i, b_i, f_j, b_j$, to encode the event tokens. These hidden vectors are then concatenated to form the input to the final linear layer, which produces a softmax distribution over all possible pair-wise relations; we refer to this as the RNN-based scoring function.
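The last step of the scoring function (concatenate the four hidden vectors, apply a linear layer, take a softmax) can be sketched in plain Python. This is a minimal illustration only: the Bi-LSTM that would produce the hidden vectors is not included, and `pair_scores`, `weight`, and `bias` are hypothetical names for a single linear layer with one row per relation label.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pair_scores(
    f_i: Sequence[float],
    b_i: Sequence[float],
    f_j: Sequence[float],
    b_j: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Distribution over relations for the event pair (i, j).

    The forward/backward hidden vectors of both event tokens are
    concatenated and passed through one linear layer (assumed shape:
    len(bias) rows, each as long as the concatenated feature vector).
    """
    features = list(f_i) + list(b_i) + list(f_j) + list(b_j)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)
```

In a real model the same computation would normally be expressed with a deep learning library so that gradients can flow back into the recurrent layer.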

Inference
The inference is needed both during training, to obtain $\hat{y}^n$ in the loss function (Equation 1), and at test time, to get globally compatible assignments. The inference framework is established by constructing a global objective function using scores from the local model and imposing several global constraints: symmetry and transitivity as in Bramsen et al. (2006b); Han et al. (2019b), as well as linguistic rules and temporal-causal constraints proposed by Ning et al. (2018a) to ensure global consistency. In this work, we incorporate the symmetry, transitivity, and temporal-causal constraints.
Objective Function. The objective function of the global inference maximizes the score of global assignments as specified in Equation 2:

$$\hat{y} = \arg\max_{y} \sum_{(i,j) \in EE} \sum_{r \in R} y^r_{i,j}\, S(y^r_{i,j}, x) \quad \text{s.t.} \quad y^r_{i,j} \in \{0, 1\},\;\; \sum_{r \in R} y^r_{i,j} = 1 \qquad (2)$$

where $y^r_{i,j}$ is a binary indicator specifying if the global prediction is equal to a certain label $r \in R$, and $S(y^r_{i,j}, x)$, $\forall r \in R$, is the scoring function obtained from the local model. The output of the global inference $\hat{y}$ is a collection of optimal label assignments for all event pairs in a fixed context. The constraint following immediately from the objective function is that the global inference should only assign one label to each event pair.
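Symmetry and Transitivity Constraints. The symmetry and transitivity constraints are used across all models and experiments in the paper. They can be specified as follows:

$$\text{(symmetry)} \quad y^r_{i,j} = y^{\bar{r}}_{j,i}, \qquad \text{(transitivity)} \quad y^{r_1}_{i,j} + y^{r_2}_{j,k} - \sum_{r_3 \in \mathrm{Trans}(r_1, r_2)} y^{r_3}_{i,k} \le 1$$

where $\bar{r}$ denotes the reverse of relation $r$, and $\mathrm{Trans}(r_1, r_2)$ denotes the set of relations allowed for the pair $(i, k)$ given $r_1$ on $(i, j)$ and $r_2$ on $(j, k)$. Intuitively, the symmetry constraint forces two pairs with opposite order to have reversed relations. For example, if $r_{i,j}$ = BEFORE, then $r_{j,i}$ = AFTER. The transitivity constraint requires that if the pairs $(i, j)$, $(j, k)$ and $(i, k)$ exist in the graph, the label (relation) prediction of the $(i, k)$ pair has to fall into the transitivity set specified by the $(i, j)$ and $(j, k)$ pairs. The full transitivity table can be found in Ning et al. (2018a).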
Temporal-causal Constraint. The temporal-causal constraint is used for the TCR dataset, which is the only dataset in our experiments that contains causal pairs. It can be written as:

$$y^{c}_{i,j} = y^{\bar{c}}_{j,i} \le y^{b}_{i,j}$$

where $c$ and $\bar{c}$ correspond to the labels CAUSES and CAUSED_BY, and $b$ represents the label BEFORE. This constraint specifies that if event $i$ causes event $j$, then $i$ has to occur before $j$. Note that this constraint only has 91.9% accuracy in TCR data (Ning et al., 2018a), but it can help improve model performance based on our experiments.
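To make the effect of the transitivity constraint concrete, the sketch below redoes the Figure 1 example with exhaustive search instead of an ILP solver. This is a deliberate simplification for illustration, and exhaustive search is exponential in the number of pairs. The scores in `EXAMPLE_SCORES` are invented, and `allowed` encodes only a small three-label composition rule, not the full transitivity table of Ning et al. (2018a). Symmetry is not needed here because only forward pairs are present.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
LABELS = ("BEFORE", "AFTER", "SIMULTANEOUS")


def allowed(r1: str, r2: str) -> Set[str]:
    """Labels permitted on (i, k) given r1 on (i, j) and r2 on (j, k).

    Illustrative subset: composing with SIMULTANEOUS keeps the other
    relation, equal relations compose to themselves, and mixing BEFORE
    with AFTER leaves (i, k) unconstrained.
    """
    if r1 == "SIMULTANEOUS":
        return {r2}
    if r2 == "SIMULTANEOUS" or r1 == r2:
        return {r1}
    return set(LABELS)


def global_inference(
    scores: Dict[Pair, Dict[str, float]],
    triples: List[Tuple[Pair, Pair, Pair]],
) -> Dict[Pair, str]:
    """Highest-scoring assignment with exactly one label per pair that
    satisfies transitivity on every ((i, j), (j, k), (i, k)) triple."""
    pairs = list(scores)
    best, best_score = {}, float("-inf")
    for labels in product(LABELS, repeat=len(pairs)):
        y = dict(zip(pairs, labels))
        if any(y[ik] not in allowed(y[ij], y[jk]) for ij, jk, ik in triples):
            continue  # violates transitivity
        total = sum(scores[p][y[p]] for p in pairs)
        if total > best_score:
            best, best_score = y, total
    return best


# Invented local scores mimicking Figure 1b: the local model prefers
# AFTER for (overruled, claiming), which contradicts the other two edges.
EXAMPLE_SCORES = {
    ("overruled", "filed"): {"BEFORE": 0.8, "AFTER": 0.1, "SIMULTANEOUS": 0.1},
    ("filed", "claiming"): {"BEFORE": 0.1, "AFTER": 0.1, "SIMULTANEOUS": 0.8},
    ("overruled", "claiming"): {"BEFORE": 0.4, "AFTER": 0.5, "SIMULTANEOUS": 0.1},
}
EXAMPLE_TRIPLES = [
    (("overruled", "filed"), ("filed", "claiming"), ("overruled", "claiming"))
]
```

With these scores, `global_inference(EXAMPLE_SCORES, EXAMPLE_TRIPLES)` keeps the two confident edges and switches the (overruled, claiming) edge to BEFORE, as the structured model does in Figure 1c.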

Learning
We develop a two-stage learning approach to optimize the neural SSVM. We first train the local scoring function without feedback from global constraints. In other words, the local neural network model is optimized using only pair-wise relations in the first stage by minimizing cross-entropy loss. During the second stage, we switch to the global objective function in Equation 1 and

Experimental Setup
In this section, we describe the three datasets that are used in the paper. Then we define the evaluation metrics. Finally, we provide details regarding our model implementation and experiments.

Data
Experiments are conducted on the TB-Dense, MATRES and TCR datasets, and an overview of data statistics is shown in Table 1. We focus on event relations; thus, all numbers refer to EE pairs. Note that in all three datasets, event pairs are always annotated by their appearance order in text, i.e. given a labeled event pair (i, j), event i always appears prior to event j in the text. Following Meng et al. (2017), we augment event pairs with flipped-order pairs. That is, if a pair (i, j) exists, pair (j, i) is also added to our dataset with the opposite label. The augmentation is applied to the training and development splits, but the test set remains unaugmented.
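The flipped-order augmentation can be sketched as follows. The `REVERSE` mapping is an assumption covering four common labels; each dataset's actual label set would need its own mapping. The function name `augment_with_flipped_pairs` is hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str, str]  # (event_i, event_j, relation label)

# Assumed reverse-label mapping for illustration only.
REVERSE: Dict[str, str] = {
    "BEFORE": "AFTER",
    "AFTER": "BEFORE",
    "SIMULTANEOUS": "SIMULTANEOUS",
    "VAGUE": "VAGUE",
}


def augment_with_flipped_pairs(examples: List[Example]) -> List[Example]:
    """Return the original pairs followed by their flipped-order copies,
    each flipped copy carrying the reversed label."""
    flipped = [(j, i, REVERSE[label]) for i, j, label in examples]
    return list(examples) + flipped
```

Applied to training and development data only, this doubles the number of pairs while leaving the test pairs in their original order.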
TB-Dense is based on the TimeBank Corpus but addresses the sparse-annotation issue in the original data by introducing the VAGUE label and requiring annotators to label all pairs of events/times in a given window.
MATRES (Ning et al., 2018b) is based on TB-Dense data, but filters out non-verbal events. The authors project events on multiple axes and only keep those on the main axis. These two factors explain the large decrease of event pairs in Table 1. A start-point temporal scheme is adopted when outsourcing the annotation task, which contributes to the performance improvement of machine learning models built on this dataset.
TCR (Ning et al., 2018a) follows the same annotation scheme for temporal pairs in MATRES. It is also annotated with causal pairs. To get causal pairs, the authors select candidates based on EventCausality dataset (Do et al., 2011).

Evaluation Metrics
To be consistent with the evaluation metrics used in baseline models, we adopt two slightly different calculations of metrics.
Micro-average. For all datasets, we compute the micro-average scores. For densely annotated data, the micro-average metric should share the same precision, recall and F1 scores. However, since VAGUE pairs are excluded in the micro-average calculations of TCR and MATRES for fair comparisons with the baseline models, the micro-average precision, recall and F1 scores are different when reporting results for the two datasets.
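The sketch below shows why excluding VAGUE makes precision and recall diverge. It assumes one common convention: a pair counts as correct only if the labels match and are not VAGUE, precision is taken over non-VAGUE predictions, and recall over non-VAGUE gold labels. The exact convention of the baseline evaluation scripts may differ, and `micro_prf` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def micro_prf(
    gold: Sequence[str],
    pred: Sequence[str],
    excluded: Optional[str] = "VAGUE",
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1, ignoring `excluded`.

    With excluded=None every pair counts on both sides, so precision,
    recall and F1 all reduce to plain accuracy.
    """
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != excluded)
    n_pred = sum(1 for p in pred if p != excluded)
    n_gold = sum(1 for g in gold if g != excluded)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For gold labels BEFORE, AFTER, VAGUE, VAGUE and predictions BEFORE, VAGUE, BEFORE, AFTER, there is one correct non-VAGUE pair, three non-VAGUE predictions and two non-VAGUE gold labels, so precision is 1/3 and recall is 1/2.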
Temporal Awareness (TE3). For the TB-Dense dataset, the TE3 evaluation scheme (UzZaman et al., 2013) is also adopted in previous research (Ning et al., 2017, 2018a). The TE3 score not only takes into account the number of correct pairs but also captures how "useful" a temporal graph is. We report this score for TB-Dense results only. For more details of this metric, please refer to the original paper (UzZaman et al., 2013).

Implementation Details
Since our work focuses on event relations, we build our models to predict relations between EE pairs only when conducting experiments. Thus, all micro-average F1 scores only consider EE pairs. Note that there are also time entities labeled in TB-Dense, denoted as T. ET and TT pairs are generally easier to predict using rule-based classifiers or date normalization techniques (Do et al., 2012; Mirza and Tonelli, 2016). To be consistent with the baseline models (Ning et al., 2018a,b) for TB-Dense data, we add ET and TT pairs for the TE3 evaluation metric (footnote 8). In the two-stage learning procedure, the local model is trained by minimizing cross-entropy loss with the Adam optimizer. We use pre-trained BERT embeddings with 768 dimensions as the input word representations and a one-layer MLP as the classification layer. As for the structured learning stage, we observe a performance boost by switching from the Adam optimizer to the SGD optimizer with decay and momentum (footnote 9). To solve the ILP in the inference process specified in Section 3.3, we leverage the off-the-shelf solver provided by the Gurobi optimizer, i.e. the best solutions from the Gurobi optimizer are inputs to the global training.

Figure 3: Model Performance (F1 Score) Overview. Our local and global models' performances are averaged over 3 different random seeds for robustness.
The hyper-parameters are chosen by the performance on the development set (footnote 10), and the best combination of hyper-parameters can be found in Table 2. We run experiments on 3 different random seeds and report the average results.
Note that for TCR data, we need a separate classifier for causal relations. Because of the small number of causal pairs, we simply build an independent final linear layer apart from the original linear layer in Figure 2. In other words, there are two final linear layers: only one of them is active when training temporal or causal pairs.

Footnote 8: We rely on annotated data to distinguish different pair types, i.e. EE, ET and TT are assumed to be given.
Footnote 9: The weight decay in SGD is exactly the value C in Equation 1. We set the momentum in SGD as 0.9 in all datasets.
Footnote 10: We randomly select 4 documents from the training set as development set for TCR.

Figure 3 shows an overview of our model performance on three different datasets. As the chart illustrates, our RNN-based local models outperform state-of-the-art (SOTA) results, and the global models further improve the performance over local models across all three datasets.

TCR
Detailed model performances for the TCR dataset are shown in Table 3. We only report model performance on temporal pairs. Both our local and global models outperform the baseline. Our global model is able to improve overall model performance by more than 1.2% over our local model; per McNemar's test, this improvement is statistically significant (with p-value < 0.01).
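For readers unfamiliar with McNemar's test, the sketch below computes an exact two-sided p-value from the pairs on which two models disagree. The text does not state which variant of the test was used, so the exact binomial form here is an assumption, and `mcnemar_exact` is a hypothetical helper rather than the authors' evaluation code.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    gold: Sequence[str], pred_a: Sequence[str], pred_b: Sequence[str]
) -> float:
    """Exact two-sided McNemar p-value for two classifiers on the same items.

    Only discordant items matter: those that exactly one model labels
    correctly. Under the null hypothesis each discordant item is equally
    likely to favour either model.
    """
    a_only = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    b_only = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value indicates that the disagreements favour one model more often than chance would explain.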

MATRES
Detailed model performances for the MATRES dataset can be found in Table 4. Similar to TCR, both our local and structured models outperform the baseline, and the global model is able to improve overall model performance by 1.4%; per McNemar's test, this improvement is statistically significant (with p-value < 0.05).

TB-Dense
Per McNemar's test, the improvement from the local to the global model only has a p-value of 0.126, so we are not able to conclude that the improvement is statistically significant. We think one of the reasons is the large share of VAGUE pairs (42.6%). VAGUE pairs make our transitivity rules less conclusive. For example, if R(e1, e2) = BEFORE and R(e2, e3) = VAGUE, R(e1, e3) can be any relation type. Moreover, this impact is magnified by our local model's prediction bias towards VAGUE pairs. As we can see in Table 5, the recall score for VAGUE pairs is much higher than for other relation types, whereas the precision score is moderate. Our global model leverages local output structure to enforce global prediction consistency, but when local predictions contain many VAGUE pairs, it introduces a lot of noise too.
To make a fair comparison between our model and the best reported TE3 F1 score from Ning et al. (2018a), we follow their strategy and add the CAEVO system's predictions on TT and ET pairs in the evaluation. The scores are shown in Table 5. Our overall system outperforms the baseline by over 10% for both micro-average and TE3 F1 scores.

Error Analysis
To understand why both the local and structured models make mistakes, we randomly sample 50 pairs from 345 cases where both models' predictions are incorrect among all 3 random seeds. We analyze these pairs qualitatively and categorize them into four cases as shown in Table 6, with each case (except other) paired with an example.
The first case illustrates that correct prediction requires broader contextual knowledge. For example, the gold label for transition and discuss is BEFORE, where the nominal event transition refers to a specific era in history that ends before discuss in the second sentence. Human annotators can easily infer this relation based on their knowledge of history, but it is difficult for machines without prior knowledge. We observe this as a very common mistake, especially for pairs with nominal events. The second case shows that negation can completely change the temporal order. The gold label for the event pair planned and plans is AFTER because the negation token no postpones the event planned indefinitely. Our models do not pick up this signal and hence predict the relation as VAGUE.
Finally, "intention" events could make temporal relation prediction difficult (Ning et al., 2018b). Case 3 demonstrates that our models could ignore the "intention" tokens such as aimed at in the example and hence make an incorrect prediction VAGUE between doubling and signed, whereas the true label is AFTER because doubling is an intention that has not occurred.

Ablation Studies
Although we have presented strong empirical results, the isolated contribution of each component of our model has not been investigated. In this section, we perform a thorough ablation study to understand the importance of structured constraints, linguistic features, and the BERT representations.

Effect of the structured constraints
One of our core claims is that our learning benefits from modeling the structural constraints of event temporal graph. To study the contribution of structured constraints, we provide an ablation study on two constraints that are applied to all three datasets: Symmetry and Transitivity.
Table 6: Error Categories and Examples in TB-Dense

Case 1 (32%): Connection with broader context
The program also calls for coordination of economic reforms and joint improvement of social programs in the two countries, where many people have become impoverished during the chaotic post-Soviet transition to capitalism. Kuchma also planned to visit Russian gas giant Gazprom, most likely to discuss Ukraine's DLRS 1.2 billion debt to the company.

Case 2 (20%): Negation
Annan has no trip planned so far. Meanwhile, Secretary of State Madeleine Albright, Berger and Defense Secretary William Cohen announced plans to travel to an unnamed city in the us heartland next week, to explain to the American people just why military force will be necessary if diplomacy fails.

Case 3 (14%): Intention Axis
A major goal of Kuchma's four-day state visit was the signing of a 10-year economic program aimed at doubling the two nations' trade turnover, which fell to DLRS 14 billion last year, down DLRS 2.5 billion from 1996. The two presidents on Friday signed the plan, which calls for cooperation in the metallurgy, fuel, energy, aircraft building, missile, space and chemical industries.

Case 4 (34%): Other

A straightforward ablation study on the symmetry constraint is to remove it from our global inference step. However, even though we eliminate the symmetry constraint explicitly in global inference, it is utilized implicitly in our data augmentation steps (Section 4.1). To better understand the benefits of the symmetry constraint, we study both the contribution of explicitly applying the symmetry constraint in our SSVM and its implicit impact in data augmentation.
Hence, in this section, we view a pair with original order and flipped order as different instances for learning and evaluation. We denote the pairs with original order as "forward" data, their flipped-order counterparts as "backward" data, and their combinations as "both-way" data.
We train four additional models to study the impacts of symmetry and transitivity constraints: 1) local model trained on forward data; 2) global model with transitivity constraint trained on forward data; 3) local model trained on both-way data; 4) global model with transitivity constraint trained on both-way data, denoted as $M_1$, $M_2$, $M_3$, $M_4$ respectively. $M_1$ and $M_2$ are models that do not apply any symmetric property; $M_3$ and $M_4$ are models that utilize the symmetric property implicitly.
Additionally, the evaluation setup should be re-examined if we remove the symmetry constraints. In the standard evaluation setup of prior works, evaluation is only performed on the pairs with their original order (forward data) in text. This evaluation assumes a model will work equally well for both forward and backward data, which certainly holds when we explicitly impose symmetry constraints. However, as we can observe in the later analysis, this assumption fails when we remove symmetry constraints. To demonstrate the improvement of model robustness over backward data, we propose to test the model on both forward and both-way data. If a model is robust, it should perform well in both scenarios.
We summarize our analysis of the results in Table 7 (F1 scores) as follows:
• Impact of Transitivity: By comparing $M_1$ with $M_2$ and $M_3$ with $M_4$, the consistent improvements across all three datasets demonstrate the effectiveness of global transitivity constraints.
• Impact of Implicit Symmetry (data augmentation): Examining the contrast between $M_1$ and $M_3$ as well as $M_2$ and $M_4$, we can see significant improvements in both-way evaluation despite slight performance drops in forward evaluation. These comparisons imply that data augmentation can help improve model robustness. Note that Meng and Rumshisky (2018) leveraged this data augmentation trick in their model.
significantly in the both-way evaluation. In contrast, the proposed model achieves strong performances in both test scenarios (best F1 scores except for one), and hence proves the robustness of our proposed method.

Effect of linguistic features
Previous research establishes the success of leveraging linguistic features in event relation prediction. One advantage of leveraging contextualized word embeddings is that they provide rich semantic representations and could potentially avoid the use of extra linguistic features. Here, we study the impact of incorporating linguistic features into our model by using simple features provided in the original datasets: token distance, tense and polarity of event entities. These features are concatenated with the Bi-LSTM hidden states before the linear layer (i.e. $f_i, b_i, f_j, b_j$ in Figure 2). Table 8 shows the F1 scores of our local and global models with and without linguistic features. These additional features likely cause over-fitting and hence do not improve model performance on any of the three datasets we test. This set of experiments shows that linguistic features do not improve the predictive power of our current framework.

Effect of BERT representations
In this section, we explore the impact of contextualized BERT representations under our deep SSVM framework. We replace BERT representations with GloVe (Pennington et al., 2014) word embeddings. With GloVe, our models still outperform (MATRES and TCR) or are comparable with (TB-Dense) the current SOTA. These results confirm the improvements of our method.

Table 9: Ablation over word representation: BERT vs GloVe. Although the BERT representation largely contributes to the performance boost, our proposed framework remains strong and outperforms current SOTA approaches when GloVe is used.

Conclusion
In this paper, we propose a novel deep structured model based on SSVM that combines the ability of structured models to encode structural knowledge with the ability of data-driven deep neural architectures to learn long-range features. Our experimental results demonstrate the effectiveness of this approach for event temporal relation extraction. One interesting future direction is further leveraging commonsense knowledge, domain knowledge in temporal relations, and linguistic information to create more robust and comprehensive global constraints for structured learning. Another direction is to improve feature representations by designing novel neural architectures that better capture negation and hypothetical phrases, as discussed in the error analysis. We also plan to leverage large amounts of unannotated corpora to help event temporal relation extraction.