Neural Ranking Models for Temporal Dependency Structure Parsing

We design and build the first neural temporal dependency parser. It utilizes a neural ranking model with minimal feature engineering, and parses time expressions and events in a text into a temporal dependency tree structure. We evaluate our parser on two domains: news reports and narrative stories. In a parsing-only evaluation setup where gold time expressions and events are provided, our parser reaches 0.81 and 0.70 f-score on unlabeled and labeled parsing respectively, a result that is very competitive against alternative approaches. In an end-to-end evaluation setup where time expressions and events are automatically recognized, our parser beats two strong baselines on both data domains. Our experimental results and discussions shed light on the nature of temporal dependency structures in different domains and provide insights that we believe will be valuable to future research in this area.


Introduction
Temporal relation classification is important for a range of NLP applications, including story timeline construction, question answering, and summarization. Most work on temporal information extraction models the task as a pair-wise classification problem (Bethard et al., 2007; Chambers et al., 2007; Chambers and Jurafsky, 2008; Ning et al., 2018a): given an individual pair of time expressions and/or events, the system predicts whether they are temporally related and which specific relation holds between them. An alternative approach is to model the temporal relations in a text as a temporal dependency structure (TDS) for the entire text. Such a temporal dependency structure has the advantage that (1) it can be easily used to infer additional temporal relations between time expressions and/or events that are not directly connected, via the transitivity properties of temporal relations, (2) it is computationally more efficient because a model does not need to consider all pairs of time expressions and events in a text, and (3) it is easier to use for downstream applications such as timeline construction.
However, most existing automatic systems are pair-wise models trained with traditional statistical classifiers using a large number of manually crafted features . The few exceptions include the work of , which describes a temporal dependency parser based on traditional feature-based classifiers, and Dligach et al. (2017), which describes a system using neural network based models to classify individual temporal relations. More recently, a semi-structured approach has also been proposed (Ning et al., 2018b).
In this work, taking advantage of a newly available data set annotated with temporal dependency structures, the Temporal Dependency Tree (TDT) Corpus (Zhang and Xue, 2018), we develop a neural temporal dependency structure parser using minimal hand-crafted linguistic features. One of the advantages of neural network based models is that they are easily adaptable to new domains, and we demonstrate this advantage by evaluating our temporal dependency parser on data from two domains: news reports and narrative stories. Our results show that our model beats a strong logistic regression baseline. Direct comparison with existing models is impossible because the only similar dataset used in previous work that we are aware of is not available to us, but we show that our models are competitive against similar systems reported in the literature.
The main contributions of this work are:
• We design and build the first end-to-end neural temporal dependency parser. The parser is based on a novel neural ranking model that takes a raw text as input, extracts events and time expressions, and arranges them in a temporal dependency structure.
• We evaluate the parser by performing experiments on data from two domains: news reports and narrative stories, and show that our parser is competitive against similar parsers. We also show the two domains have very different temporal structural patterns, an observation that we believe will be very valuable to future temporal parser development.
The rest of the paper is organized as follows. Since temporal structure parsing is a relatively new task, we give a brief problem description in §2. We describe our end-to-end pipeline system in §3. Details of the neural ranking model are discussed in §4. In §5 we present and discuss our experimental results, and error analysis is presented in §6. In §7 we discuss related work and situate our work in the broader context, and we conclude our paper in §8.

Problem Description
In this section we give a brief description of the temporal dependency parsing task (more details in Zhang and Xue (2018)). In a temporal structure parsing task, a text is parsed into a dependency tree structure that represents the inherent temporal relations among time expressions and events in the text. The nodes in this tree are mostly time expressions and events, which are represented as contiguous spans of words in the text. They can also be pre-defined meta nodes, which serve as reference times for other time expressions and events and constitute the top-most part of the tree. Unlike syntactic dependency parsing, where each word in a sentence is a node in the dependency structure, in a temporal dependency structure only some of the words in a text are nodes in the structure. Therefore, this process naturally falls into two stages: first time expression and event recognition, and then temporal relation parsing. Figure 1 shows an example temporal dependency tree for a news report paragraph. Because different types of time expressions and events behave differently in terms of what can be their antecedents, and recognition of these types can be helpful for determining temporal relations, finer classifications of time expressions and events are also defined. Time expressions are further classified into Vague Time, Absolute Concrete Time, and Relative Concrete Time, according to whether or not the time expression can be located on the timeline, and whether or not the interpretation of its temporal location depends on another time expression. Events are further classified into Eventive Event, State, Habitual Event, Completed Event, Ongoing Event, Modalized Event, Generic Habitual, and Generic State, according to the eventuality type of the event. Our experiments will show that these fine-grained classifications are very helpful for the overall temporal structure parsing accuracy.

A Pipeline System
We build a two-stage pipeline system to tackle this temporal structure parsing problem. The first stage performs event and time expression identification. In this stage, given a text as input, spans of words that indicate events or time expressions are identified and categorized. We model this stage as a sequence labeling process. A standard Bi-LSTM sequence model coupled with BIO labels is applied here. Word representations are the concatenation of word and POS tag embeddings.
The second stage performs the actual temporal structure parsing by identifying the antecedent for each time expression and event, and identifying the temporal relation between them. In this stage, given events and time expressions identified in the first stage as input, the model outputs a temporal dependency tree in which each child node is an event or time expression that is temporally related to another event or time expression or pre-defined meta node as its parent node. This stage is modeled as a ranking process: for each node, a finite set of neighboring nodes is first selected as its candidate parents. These candidates are then ranked with a neural network model and the highest ranking candidate is selected as its parent. We use a ranking model because it is simple, more intuitive and easier to train than a traditional transition-based or graph-based model, and the learned model rarely makes mistakes that violate the structural constraint of a tree.
Since the model we use for Stage 1 is a very standard model with few modifications, we do not describe it in detail in this paper for lack of space. Our neural ranking model for Stage 2 is described in detail in the next section.

Figure 1: Example text and its temporal dependency tree. DCT is Document Creation Time. The example text is from a news report in The Telegraph.

Model Description
We use a neural ranking model for the parsing stage. For each time expression or event node i in a text, a group of candidate parent nodes (time expressions, events, or pre-defined meta nodes) is selected. In practice, we select a window from the beginning of the text to two sentences after node i, and select all nodes in this window and all pre-defined meta nodes as the candidate parents if node i is an event. Since the parent of a time expression can only be a pre-defined meta node or another time expression, as described in Zhang and Xue (2018), we select all time expressions in the same window and all pre-defined meta nodes as the candidate parents if node i is a time expression. Let $y_i$ be a candidate parent of node i. A score is then computed for each pair $(i, y_i)$. Through ranking, the candidate with the highest score is selected as the final parent for node i.
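The candidate selection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` record, its field names, the `window` parameter, and the choice to exclude the node itself from its own candidates are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int   # position among the nodes, in textual order (hypothetical field)
    sent_id: int   # index of the sentence containing the node (hypothetical field)
    kind: str      # "time", "event", or "meta"

def candidate_parents(child, nodes, meta_nodes, window=2):
    """Return candidate parents for `child`: every other node from the start of
    the text up to `window` sentences after it, plus all meta nodes.
    A time expression only accepts time expressions or meta nodes as parents."""
    in_window = [n for n in nodes
                 if n.node_id != child.node_id
                 and n.sent_id <= child.sent_id + window]
    if child.kind == "time":
        in_window = [n for n in in_window if n.kind == "time"]
    return list(meta_nodes) + in_window

# Toy document: one time expression and three events.
nodes = [Node(0, 0, "time"), Node(1, 0, "event"),
         Node(2, 1, "event"), Node(3, 4, "event")]
meta = [Node(-1, -1, "meta")]
event_candidates = [n.node_id for n in candidate_parents(nodes[1], nodes, meta)]
time_candidates = [n.node_id for n in candidate_parents(nodes[0], nodes, meta)]
```

In the toy document, the event in sentence 0 can attach to the meta node, the time expression, or the event in sentence 1, while the event in sentence 4 falls outside the two-sentence window. The time expression has no other time expression available and can only attach to the meta node.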
The model architecture is shown in Figure 2. Word embeddings are used as word representations (e.g. $w_k$). A Bi-LSTM sequence layer is run over the words of the entire text, computing a Bi-LSTM output vector for each word (e.g. $w^*_k$). The node representation for each time expression or event is the summation of the Bi-LSTM output vectors of all words in its text span (e.g. $x_i$). The pair representation for node i and one of its candidates $y_i$ is the concatenation of the representations of these two nodes, $g_{i,y_i} = [x_i, x_{y_i}]$, which is then sent through a Multi-Layer Perceptron to compute a score $s_{i,y_i}$ for this pair. Finally, all pair scores of the current node i are concatenated into a vector $c_i$, and applying softmax to it generates the final distribution $o_i$, which is the probability distribution of each candidate being the parent of node i.
Formally, the Forward Computation is:
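The computation described in the preceding paragraph can be written out step by step. This is a sketch reconstructed from the prose: the span notation $\mathrm{span}(i)$, the candidate set $\mathcal{C}(i)$, and the single-hidden-layer $\tanh$ form of the MLP with parameters $W_1, b_1, W_2, b_2$ are assumptions, not details stated in the text.

```latex
\begin{align*}
w^*_k &= \mathrm{BiLSTM}(w_k) \\
x_i &= \sum_{k \in \mathrm{span}(i)} w^*_k \\
g_{i,y_i} &= [x_i, x_{y_i}] \\
h_{i,y_i} &= \tanh(W_1 \, g_{i,y_i} + b_1) \\
s_{i,y_i} &= W_2 \, h_{i,y_i} + b_2 \\
c_i &= [\, s_{i,y} : y \in \mathcal{C}(i) \,] \\
o_i &= \mathrm{softmax}(c_i)
\end{align*}
```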

Learning
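Let D be the training data set of K texts, $N_k$ the number of nodes in text $D_k$, and $y_i$ the gold parent for node i. Our neural model is trained to maximize $P(y_1, ..., y_{N_k} | D_k)$ over the whole training set. More specifically, the cost function is the negative log-likelihood of the gold parents:
\[ C = -\log \prod_{k=1}^{K} \prod_{i=1}^{N_k} P(y_i | D_k) = \sum_{k=1}^{K} \sum_{i=1}^{N_k} -\log P(y_i | D_k) \]
For each training example, cross-entropy loss is minimized:
\[ L = -\log P(y_i | D_k) = -\log \frac{\exp(s_{i,y_i})}{\sum_{y'} \exp(s_{i,y'})} \]
where $s_{i,y'}$ is the score for child-candidate pair $(i, y')$ as described in §4.1, and $y'$ ranges over the candidate parents of node i.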
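For a single node, the training objective amounts to a softmax over its candidate scores followed by the negative log-probability of the gold parent. The sketch below shows this in plain Python; the function names and the example scores are hypothetical, and a real implementation would compute the scores with the neural model and backpropagate through them.

```python
import math

def parent_distribution(scores):
    """Softmax over the candidate scores of one node (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def ranking_loss(scores, gold_index):
    """Cross-entropy loss: negative log-probability of the gold parent."""
    return -math.log(parent_distribution(scores)[gold_index])

# Hypothetical scores for three candidate parents of one node;
# the gold parent is the first candidate.
scores = [2.0, 0.5, -1.0]
probs = parent_distribution(scores)
loss = ranking_loss(scores, gold_index=0)
```

The loss shrinks as the gold candidate's score grows relative to the others, which is what pushes the gold parent to the top of the ranking.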

Decoding
During decoding, the parser constructs the temporal dependency tree incrementally by identifying the parent node for each event or time expression in textual order. To ensure the output parse is a valid dependency tree, two constraints are applied in the decoding process: (i) there can only be one parent for each node, and (ii) descendants of a node cannot be its parent to avoid cycles. Candidates violating these constraints are omitted from the ranking process. 2
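The greedy decoding procedure with its two constraints can be sketched as follows. This is an illustration under stated assumptions: `candidates` and `score` are hypothetical stand-ins for the candidate selection step and the trained ranking model, and every node is assumed to have at least one valid candidate (for example a meta node).

```python
def is_descendant(node, ancestor, parent_of):
    """True if `ancestor` lies on the parent chain that starts at `node`."""
    while node in parent_of:
        node = parent_of[node]
        if node == ancestor:
            return True
    return False

def greedy_decode(nodes, candidates, score):
    """Attach each node, in textual order, to its best-scoring valid candidate.

    nodes:      node ids in textual order
    candidates: dict mapping a node id to its candidate parent ids
    score:      function (child, candidate) -> float (the ranking model)
    """
    parent_of = {}  # one entry per node, so each node has exactly one parent
    for child in nodes:
        # Skip candidates that are descendants of the child (would form a cycle).
        valid = [c for c in candidates[child]
                 if c != child and not is_descendant(c, child, parent_of)]
        parent_of[child] = max(valid, key=lambda c: score(child, c))
    return parent_of

# Toy example in which the scores favour a cycle between e1 and e2.
toy_scores = {("e1", "e2"): 5.0, ("e1", "ROOT"): 1.0,
              ("e2", "e1"): 5.0, ("e2", "ROOT"): 1.0}
tree = greedy_decode(["e1", "e2"],
                     {"e1": ["ROOT", "e2"], "e2": ["ROOT", "e1"]},
                     lambda child, cand: toy_scores[(child, cand)])
```

In the toy example, e1 is attached to e2 first; when e2 is processed, e1 is already its descendant and is removed from the ranking, so e2 falls back to ROOT and the output remains a tree.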

Temporal Relation Labeling
The neural model described above generates an unlabeled temporal dependency tree, with each parent being the most salient reference time for the child. However, it does not model the specific temporal relation (e.g. "before", "overlap") between a parent and a child. We extend this basic architecture to both identify parent-child pairs and predict their temporal relations. In this new model, instead of ranking child-candidate pairs $(i, y_i)$, we rank child-candidate-relation tuples $(i, y_i, l_k)$, where $l_k$ is the $k$th relation in the pre-defined set of possible temporal relation labels $L$. We compute this ranking by re-defining the pair score $s_{i,y_i}$: it is no longer a scalar but a vector of size $|L|$, where $s_{i,y_i}[k]$ is the scalar score for $y_i$ being the parent of i with temporal relation $l_k$. Accordingly, the lengths of $c_i$ and $o_i$ are the number of candidates times $|L|$. Finally, the tuple $(i, y_i, l_k)$ associated with the highest score in $o_i$ predicts that $y_i$ is the parent of i with temporal relation label $l_k$.
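Selecting the best tuple from the flattened score vector can be illustrated with a short sketch. The names and the example scores are hypothetical; because softmax is monotonic, taking the argmax over raw scores gives the same tuple as taking it over the resulting distribution.

```python
def best_parent_and_label(score_rows, candidates, labels):
    """Pick the (candidate, label) tuple with the highest score.

    score_rows[j][k] is the score of candidates[j] being the parent with
    relation labels[k]; flattening the rows gives one vector per child.
    """
    flat = [s for row in score_rows for s in row]  # len(candidates) * len(labels)
    best = max(range(len(flat)), key=flat.__getitem__)
    j, k = divmod(best, len(labels))
    return candidates[j], labels[k]

# Two candidate parents and three relation labels for one child node.
labels = ["before", "overlap", "include"]
candidates = ["DCT", "e1"]
score_rows = [[0.1, 0.3, 0.2],
              [0.9, 1.4, 0.0]]
prediction = best_parent_and_label(score_rows, candidates, labels)
```

Here the highest score sits in the second row and second column, so the child is attached to e1 with the relation overlap.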

Linguistically Enriched Models
A variation of the basic neural model is a model that takes a few linguistic features as input explicitly. In this model, we extend the pair representation $g_{i,y_i}$ with local features:
Time and event type feature: Stage 1 of the pipeline not only extracts text spans that are time expressions or events, but also labels them with pre-defined categories of different types of time expressions and events. Readers are referred to Zhang and Xue (2018) for the full category list. Through a careful examination of the data, we notice that time expressions and events are selective as to what types of time expressions or events can be their parent. In other words, the category of the child time expression or event strongly indicates which candidate can be its parent. For example, a time expression's parent can only be another time expression or a pre-defined meta node, and can never be an event; and an eventive event's parent is almost certainly another eventive event, and is highly unlikely to be a stative event. Therefore, we include the time expression and event type information predicted by Stage 1 in this model as a feature. More formally, we represent a time/event type as a fixed-length embedding $t$, and concatenate it to the pair representation.
Distance features: Distance information can be useful for predicting the parent of a child. Intuitively, candidates that are closer to the child are more likely to be the actual parent. Through data examination, we also find that a high percentage of nodes have parents in close proximity. Therefore, we include two distance features in this model: the node distance between a candidate and the child, $nd_{i,y_i}$, and whether they are in the same sentence, $ss_{i,y_i}$. One-hot representations are used for both features, according to the conditions listed in Table 1.

Table 1:
conditions for feature $nd_{i,y_i}$:
  i.node_id − y_i.node_id = 1
  i.node_id − y_i.node_id > 1 and i.sent_id = y_i.sent_id
  i.node_id − y_i.node_id > 1 and i.sent_id ≠ y_i.sent_id
  i.node_id − y_i.node_id < 1
conditions for feature $ss_{i,y_i}$:
  i.sent_id = y_i.sent_id
  i.sent_id ≠ y_i.sent_id

The final pair representation for our linguistically enriched model concatenates the node representations, the type embeddings of the child and the candidate, and the two distance features:
\[ g_{i,y_i} = [x_i, x_{y_i}, t_i, t_{y_i}, nd_{i,y_i}, ss_{i,y_i}] \]

2 An alternative decoding approach would be to perform a global search for a Maximum Spanning Tree. However, due to the nature of temporal structures, our greedy decoding process rarely hits the constraints.
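The two distance features can be sketched as small one-hot encoders. This is an illustrative reading of Table 1, not the authors' code: the function names are hypothetical, and the sketch assumes that the third node-distance condition and the second same-sentence condition are the "different sentence" cases.

```python
def node_distance_feature(i_node_id, i_sent_id, y_node_id, y_sent_id):
    """One-hot vector over the four node-distance (nd) conditions."""
    diff = i_node_id - y_node_id
    same_sentence = i_sent_id == y_sent_id
    conditions = [
        diff == 1,                        # candidate is the immediately previous node
        diff > 1 and same_sentence,       # further back, same sentence
        diff > 1 and not same_sentence,   # further back, different sentence
        diff < 1,                         # candidate follows the child
    ]
    return [int(c) for c in conditions]

def same_sentence_feature(i_sent_id, y_sent_id):
    """One-hot vector over the two same-sentence (ss) conditions."""
    same = i_sent_id == y_sent_id
    return [int(same), int(not same)]

# Child is node 5 in sentence 2; candidate is node 2 in sentence 1.
nd = node_distance_feature(5, 2, 2, 1)
ss = same_sentence_feature(2, 1)
distance_part = nd + ss  # appended to the node and type representations
```

Exactly one condition fires in each feature, so the concatenated distance part always contains two ones.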

Attention Model on Time and Event Representation
In the basic neural model, straightforward sum-pooling is used as the representation of multi-word time expressions and events. However, multi-word event expressions usually have meaning-bearing head words. For example, in the event "took a trip", "trip" is more representative than "took" and "a". Therefore, we add an attention mechanism (Bahdanau et al., 2014) over the Bi-LSTM output vectors in each multi-word expression to learn a task-specific notion of headedness (Lee et al., 2017). The attention-based node representation $\hat{x}_i$ is a weighted sum of the Bi-LSTM output vectors in span i:
\[ \hat{x}_i = \sum_{t \in \mathrm{span}(i)} w_{i,t} \, w^*_t \]
The weights $w_{i,t}$ are automatically learned. The pair representation for our attention model uses $\hat{x}_i$ and $\hat{x}_{y_i}$ in place of the sum-pooled node representations.
This model variation is also beneficial in an end-to-end system, where time expression and event spans are automatically extracted in Stage 1. When extracted spans are not guaranteed to be correct time expressions and events, an attention layer over a slightly larger context of an extracted span has a better chance of finding representative head words than a sum-pooling layer strictly over the words within an event or time expression span.
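The difference between sum-pooling and attention pooling can be illustrated with a small sketch. It assumes softmax-normalized weights computed from a per-word relevance score, in the spirit of Lee et al. (2017); the `scorer` argument is a hypothetical stand-in for the learned scoring layer, and the vectors are made up for the example.

```python
import math

def attention_pool(vectors, scorer):
    """Weighted sum of span vectors with softmax-normalized weights.

    vectors: list of equal-length lists of floats (Bi-LSTM outputs of a span)
    scorer:  function mapping one vector to a scalar relevance score; in a
             trained model this would be a learned layer (assumption)
    """
    raw = [scorer(v) for v in vectors]
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# Made-up output vectors for the three words of "took a trip".
span = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled, weights = attention_pool(span, scorer=sum)
uniform_pooled, uniform_weights = attention_pool(span, scorer=lambda v: 0.0)
```

With the toy scorer, the third word receives most of the weight and dominates the pooled vector. With a constant scorer, all weights are equal and the result is the average of the span vectors, that is, sum-pooling up to a scaling factor.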

Data
All of our experiments are conducted on the datasets described in Zhang and Xue (2018). This is a temporal dependency structure corpus in Chinese. It covers two domains: news reports and narrative fairy tales. It consists of 115 news articles sampled from the Chinese TempEval2 datasets (Verhagen et al., 2010) and Chinese Wikipedia News, and 120 fairy tale stories sampled from Grimm Fairy Tales. 20% of this corpus, distributed evenly over both domains, is double annotated with high inter-annotator agreement. We use this part of the data as our development and test datasets (10% of the documents for development and 10% for testing), and the remaining 80% as our training dataset.

Baseline Systems
We build two baseline systems to compare with our neural model. The first is a simple baseline which links every time expression or event to the immediately preceding time expression or event. According to our data, if only position information is considered, the most likely parent for a child is the immediately preceding time expression or event. This baseline uses the most common temporal relation edge label in the training datasets, i.e. "overlap" for news data and "before" for Grimm data.
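The simple baseline is easy to state in code. In the sketch below, the function name, the tuple layout, and the use of a `root` parent for the first node (which has no predecessor) are assumptions made for illustration.

```python
def simple_baseline(nodes, root, label):
    """Link every node to the immediately preceding node with a fixed label.

    nodes: node ids in textual order
    root:  parent used for the first node, which has no predecessor (assumption)
    label: the most common relation label in the training data for the domain
    """
    edges = []
    previous = root
    for node in nodes:
        edges.append((node, label, previous))  # (child, relation, parent)
        previous = node
    return edges

news_edges = simple_baseline(["t1", "e1", "e2"], root="ROOT", label="overlap")
```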
The second baseline is a more competitive baseline for Stage 2 in the pipeline. It takes the output of the first stage as input, and uses a similar ranking architecture but with logistic regression classifiers instead of neural classifiers. The purpose of this baseline is to compare our neural models against a traditional statistical model under otherwise similar settings. We conduct thorough feature engineering on this logistic regression model to make sure it is a strong benchmark to compete against. Table 2 lists the features and feature combinations used in this model.

Table 2:
time type and event type features:
  i.type and y_i.type
  if i.type = absolute time and y_i.type = root
  if i.type = time and y_i.type = root
  are i.type and y_i.type time, eventive, or stative
  are i.type and y_i.type root, time, or event
  are i.type and y_i.type root, time, eventive, or stative
  if i.type = y_i.type = event and ŷ.type = state, for all ŷ between i and y_i
distance features:
  if i.sent_id = y_i.sent_id
  i.node_id − y_i.node_id
  if i.node_id − y_i.node_id = 1
combination features:
  if i.type = state and i.sent_id = y_i.sent_id
  if i.type = state and i.node_id − y_i.node_id = 1
  if i.type = y_i.type = event and i.node_id − y_i.node_id = 1
  if i.type = state and y_i.type = event and i.node_id − y_i.node_id = 1 and i.node_id_in_sent = 1 and i.sent_id = 1
other features:
  if i and y_i are in quotation marks

Evaluation
We perform two types of evaluations for our systems. First, we evaluate the stages of the pipeline and the entire pipeline, i.e. end-to-end systems where both time expression and event recognition, as well as temporal dependency structures are automatically predicted. Our models are compared against the two strong baselines described in §5.2. These evaluations are described in §5.3.1.
The second evaluation focuses only on the temporal relation structure parsing part of our pipeline (i.e. Stage 2), using gold standard time expression and event spans and labels. Since most previous work on temporal relation identification use gold standard time expression and event spans, this evaluation gives us some sense of how our models perform against models reported in previous work even though a strict comparison is impossible because different data sets are used. These evaluations are described in §5.3.2.
All neural networks in this work are implemented in Python with the DyNet library (Neubig et al., 2017). [...] 32, 256, and 256 respectively. POS tags in Stage 1 are acquired using the joint POS tagger from Wang and Xue (2014). The tagger is trained on Chinese Treebank 7.0 (Xue et al., 2010). For Stage 2, the dimensions of word embeddings, time/event type embeddings, Bi-LSTM output vectors, and MLP hidden layers are tuned on the dev set to 32, 16, 32, and 32 respectively. The optimizer is Adam with early stopping and learning rate 0.001.

End-to-End System Evaluation
Stage 1: Time and Event Recognition
For Stage 1 in the pipeline, we perform BIO tagging with the full set of time expression and event types (i.e. an 11-way classification on all extracted spans). Extracted spans will be nodes in the final dependency tree, and time/event types will support features in the next stage. We evaluate Stage 1 performance using 10-fold cross-validation on the entire data set. We use the "exact match" evaluation metrics for BIO sequence labeling tasks, and compute precision, recall, and f-score for each label type. We first ignore fine-grained time/event types and only evaluate unlabeled span detection and time/event binary classification, to show how well our system identifies events and time expressions, and how well it distinguishes time expressions from events. [...] performance on both news and narrative domains. Time expressions have a higher recognition rate than events in news data, which is consistent with the observation that time expressions usually have a more limited vocabulary and stricter lexical patterns. On the other hand, due to the scarcity of time expressions in the Grimm data, time expression recognition in this domain has a very high precision but low recall, which results in a much lower f-score than in news. Labeled full set evaluation results on time/event type classification are reported in Table 4. Time expressions have higher recognition rates than events in both domains, and dominant event types ("event", "state", etc.) have higher and more stable recognition rates than other types. Event types with very few training instances, such as "modalized event" (<7%), achieve lower and more unstable recognition rates. Other types with less than 2% of the instances achieve f-scores close to 0, and are not reported in this table.
Stage 2: Temporal Dependency Parsing
For Stage 2 in the pipeline, we conduct experiments on the five systems described above: a simple baseline, a logistic regression baseline, a basic neural model, a linguistically enriched neural model, and an attention neural model. All models are trained on automatically predicted spans of time expressions and events, and time/event types generated by Stage 1 using 10-fold cross-validation, with gold standard edges (and edge labels) mapped onto the automatic spans. Evaluations in Stage 2 are against gold standard spans and edges, and the evaluation metrics are precision, recall, and f-score on (child, parent) tuples for unlabeled trees, and on (child, relation, parent) triples for labeled trees.
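The tuple-based metrics can be sketched as set comparisons between predicted and gold edges. The function name and the toy edges below are hypothetical; in an end-to-end setting each node would be identified by its text span, so an edge only counts as correct if both spans match the gold standard.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and f-score over two collections of tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy (child, relation, parent) triples; one node was missed by Stage 1.
gold = [("e1", "before", "t1"), ("e2", "overlap", "e1"), ("e3", "before", "e2")]
pred = [("e1", "before", "t1"), ("e2", "before", "e1")]

labeled = precision_recall_f1(pred, gold)
unlabeled = precision_recall_f1([(c, p) for c, _, p in pred],
                                [(c, p) for c, _, p in gold])
```

In the toy data both predicted edges have the right parent but one has the wrong label, so the unlabeled scores are higher than the labeled ones, and the node missed in Stage 1 lowers recall in both settings.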
The bottom rows in Table 5 report the end-to-end performance of our five systems on both domains. On both labeled and unlabeled parsing, our basic neural model with only lexical input performs comparably to the logistic regression model. Our enriched neural model, with only three simple linguistic features, outperforms both the logistic regression model and the basic neural model on news, improving the performance by more than 10%. However, our models only slightly improve unlabeled parsing over the simple baseline on the narrative Grimm data. This is probably because (1) linking every node to its immediately preceding node is a very strong baseline, since linear temporal sequences are very common in narrative discourse; and (2) most events breaking the temporal linearity in a narrative discourse are implicit stative descriptions, which are harder to model with only lexical and distance features. Finally, the attention mechanism improves temporal relation labeling on both domains.

Temporal Relation Evaluation
To facilitate comparison with previous work where gold events are used as parser input, we report our results on temporal dependency parsing with gold time expression and event spans in Table 5 (top rows). These results are in the same ballpark as what is reported in previous work on temporal relation extraction. The best results reported there are 0.84 and 0.65 f-scores for unlabeled and labeled parses, achieved by temporal structure parsers trained and evaluated on narrative children's stories. Our best performing model (Neural-attention) reaches 0.81 and 0.70 f-scores on unlabeled and labeled parses respectively, showing similar performance. It is important to note, however, that these two works use different data sets, and are not directly comparable. Finally, parsing accuracy with gold time/event spans as input is substantially higher than that with predicted spans, showing the effects of error propagation.

Error Analysis
We perform error analysis on the output of our best model (Neural-attention) on the development data sets. We focus on analyzing our neural ranking model (i.e. Stage 2), with gold time expression and event spans and labels as input.
First, we look at errors by the types of antecedents. Most events in both news and Grimm data depend on the immediately preceding event or time expression as their reference time parent. 71% of the events in the news data set and 78% of the events in the Grimm data have the immediately preceding node as their antecedent. The confusion matrix in Table 6 illustrates how strongly this bias affects our models. Our model learns the bias and incorrectly links around half of the events (47% in news and 46% in Grimm) to the immediately preceding node when the correct temporal dependency is further back in the text.

Table 6: Parent node confusion matrix. Rows are gold parents and columns are automatically parsed parents. "pre" means the parent is the immediate previous node of the child event, "far" means the parent is further back from the child event.
Second, we look at errors in temporal relation labels. Considering only correctly recognized parent-child pairs, we draw a confusion matrix in Table 7. Our data has very few after relations in both domains, which explains why the model has difficulty identifying this relation. There are also very few include and depend-on relations in the Grimm data; however, they are identified with relatively high accuracy. This is probably because, according to the temporal dependency structure design (Zhang and Xue, 2018), these relations hold only between restricted pairs of parent and child: include requires a time expression parent and an event child, and depend-on requires that the parent be the root. The main confusion among temporal relations is between before and overlap.
In news data, with a high occurrence of overlap relations (60% overlap and 5% before), most before parents are wrongly recognized as overlap.
Grimm data has a more balanced distribution of these two temporal relations (46% overlap and 50% before); however, 13% of before and 17% of overlap relations are wrongly labeled as the other.

Related Work

Related Work on Temporal Relation Modeling
There is a significant amount of research on temporal relation extraction (Bethard et al., 2007; Bethard, 2013; Chambers and Jurafsky, 2008; Chambers et al., 2014; Ning et al., 2018a). Most of the previous work models temporal relation extraction as pair-wise classification between individual pairs of events and/or time expressions. Some of the models also add a global reasoning step to local pair-wise classification, typically using Integer Linear Programming, to exploit the transitivity property of temporal relations (Chambers and Jurafsky, 2008). Such a pair-wise classification approach is often dictated by the way the data is annotated. In most of the widely used temporal data sets, temporal relations between individual pairs of events and/or time expressions are annotated independently of one another (Pustejovsky et al., 2003; Chambers et al., 2014; Styler IV et al., 2014; O'Gorman et al., 2016; Mostafazadeh et al., 2016). Our work is most closely related to that of , which also treats temporal relation modeling as temporal dependency structure parsing. However, their dependency structure, as described in , is only over events, excluding time expressions which are an important source of temporal information, and it also excludes states (stative events), which makes the temporal dependency structure incomplete. Moreover, their corpus only consists of data in the narrative stories domain. We instead choose to develop our model based on the data set described in Zhang and Xue (2018), which introduces a more comprehensive and linguistically grounded annotation scheme for temporal dependency structures. This structure includes both events and time expressions, and uses the linguistic notion of temporal anaphora to guide the annotation of the temporal dependency structure. Since in this temporal dependency structure each parent-child pair is considered to be an instance of temporal anaphora, the parent is also called the antecedent and the child is also referred to as the anaphor. Their corpus consists of data from two domains: news reports and narrative stories.
More recently, Ning et al. (2018b) proposed a semi-structured approach to model temporal relations in a text. Based on the observation that not all pairs of events have well-defined temporal relations, they propose a multi-axis representation in which well-defined temporal relations only hold between events on the same axis. The temporal relations between events in a text form multiple disconnected subgraphs. Like other work before them, their annotation scheme only covers events, to the exclusion of time expressions.

Related Work on Neural Dependency Parsing
Most prior work on neural dependency parsing is aimed at syntactic dependency parsing, i.e. parsing a sentence into a dependency tree that represents the syntactic relations among the words.
Recent work on dependency parsing typically uses transition-based or graph-based architectures combined with contextual vector representations learned with recurrent neural networks (e.g. Bi-LSTMs) (Kiperwasser and Goldberg, 2016). Temporal dependency parsing is, however, different from syntactic dependency parsing. In temporal dependency parsing, for each event or time expression, there is more than one other event or time expression that can serve as its reference time, while the most closely related one is selected as the gold standard reference time parent. This naturally falls into a ranking process where all possible reference times are ranked and the best is selected. In this sense our neural ranking model for temporal dependency parsing is closely related to the neural ranking model for coreference resolution described in Lee et al. (2017), both of which extract related spans of words (entity mentions for coreference resolution, and events or time expressions for temporal dependency parsing). However, our temporal dependency parsing model differs from Lee et al.'s coreference model in that the ranking model for coreference only needs to output the best candidate for each individual pairing and cluster all pairs that are coreferent to each other. In contrast, our ranking model for temporal dependency parsing needs to rank not only the candidate antecedents but also the temporal relations between the antecedent and the anaphor. In addition, the model also adds connectivity and acyclic constraints in the decoding process to guarantee a tree-structured output.

Conclusion and Future Work
In this paper, we present the first end-to-end neural temporal dependency parser. We evaluate the parser with both gold standard and automatically recognized time expressions and events. In both experimental settings, the parser outperforms two strong baselines and shows competitive results against prior temporal systems.
Our experimental results show that the model performance drops significantly when automatically predicted event and time expressions are used as input instead of gold standard ones, indicating an error propagation problem. Therefore, in future work we plan to develop joint models that simultaneously extract events and time expressions, and parse their temporal dependency structure.