Exploring Contextualized Neural Language Models for Temporal Dependency Parsing

Extracting temporal relations between events and time expressions has many applications, such as constructing event timelines and time-related question answering. It is a challenging problem that requires syntactic and semantic information at the sentence and discourse levels, which may be captured by deep language models such as BERT (Devlin et al., 2019). In this paper, we develop several variants of BERT-based temporal dependency parsers and show that BERT significantly improves temporal dependency parsing (Zhang and Xue, 2018a). Source code and trained models will be made available at github.com.


Introduction
Temporal relation extraction has many applications, including constructing event timelines for news articles or narratives and time-related question answering. Recently, Zhang and Xue (2018b) presented Temporal Dependency Parsing (TDP), which organizes the time expressions and events in a document into a Temporal Dependency Tree (TDT). Consider the following example:

Example 1: Kuchma and Yeltsin signed a cooperation plan on February 27, 1998. Russia and Ukraine share similar cultures, and Ukraine was ruled from Moscow for centuries. Yeltsin and Kuchma called for the ratification of the treaty, saying it would create a "strong legal foundation".

Figure 1 shows the corresponding TDT, where DCT is the Document Creation Time (March 1, 1998). Compared to previous pairwise approaches to temporal relation extraction such as Cassidy et al. (2014), a TDT is much more concise but preserves the same (if not more) information. However, TDP is challenging because it requires syntactic and semantic information at the sentence and discourse levels.

Recently, deep language models such as BERT (Devlin et al., 2019) have been shown to be successful at many NLP tasks, because (1) they provide contextualized word embeddings that are pre-trained on very large corpora, and (2) BERT in particular has been shown to capture syntactic and semantic information (Tenney et al., 2019; Clark et al., 2019), which may include but is not limited to tense and temporal connectives. Such information is relevant for temporal dependency parsing. In this paper, we investigate the potential of applying BERT to this task. We developed two models that incorporate BERT into TDP, ranging from a straightforward use of pre-trained BERT word embeddings to using BERT as an encoder and training it within an end-to-end system.
Experiments showed that BERT improves TDP performance in all models, with the best model achieving a 13-point absolute F1 improvement over our re-implementation of the neural model in Zhang and Xue (2019). We present technical details, experiments, and analysis in the rest of this paper.

Related Work
Much previous work has been devoted to the classification of relations between events and time expressions, notably TimeML (Pustejovsky et al., 2003a), TimeBank (Pustejovsky et al., 2003b), and recently TimeBank-Dense (Cassidy et al., 2014), which annotates all O(n^2) pairs of relations. Pairwise annotation has two problems: O(n^2) complexity, and the possibility of inconsistent predictions such as A before B, B before C, C before A. To address these issues, Zhang and Xue (2018b) present a tree structure of relations between time expressions and events. There, all time expressions are children of the root (if they are absolute), of the special time expression node Document Creation Time (DCT), or of other time expressions. All events are children of either a time expression or another event. Each edge is labelled with before, after, overlap, or depends on. Organizing time expressions and events into a tree reduces the annotation complexity to O(n) and avoids cyclic inconsistencies.
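The well-formedness constraints of this tree schema can be made concrete with a minimal sketch. The node names, dictionary layout, and function name below are our own illustration, not part of the released corpus or its tooling:

```python
# Illustrative sketch of TDT well-formedness: time expressions attach to
# root, DCT, or another time expression; events attach to a time
# expression (including DCT) or another event; no cycles are allowed.
EDGE_LABELS = {"before", "after", "overlap", "depends on"}

def is_valid_tdt(parent_of, kind_of):
    """parent_of: child -> (parent, label); kind_of: node -> 'timex' or 'event'.

    'root' and 'DCT' are reserved names; DCT is a special time expression.
    """
    kinds = dict(kind_of, DCT="timex")
    for child, (parent, label) in parent_of.items():
        if label not in EDGE_LABELS:
            return False
        if kinds[child] == "timex":
            # Time expressions attach to root, DCT, or another timex.
            if parent not in ("root", "DCT") and kinds.get(parent) != "timex":
                return False
        else:
            # Events may not attach directly to the root.
            if parent == "root":
                return False
        # Walking up the parent links must reach the root without cycling.
        seen, node = set(), child
        while node != "root":
            if node in seen:
                return False
            seen.add(node)
            node = parent_of.get(node, ("root", None))[0]
    return True
```

A tree for Example 1 would, for instance, attach February 27, 1998 to DCT and signed to that time expression, and `is_valid_tdt` would accept it, while an event attached directly to the root or a cycle among events would be rejected.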
This paper builds on the chain of work by Zhang and Xue (2018b), Zhang and Xue (2018a) and Zhang and Xue (2019), which presents an English corpus annotated with this schema as well as a first neural architecture. Zhang and Xue (2018a) use a BiLSTM model with simple attention and randomly initialized word embeddings. This paper capitalizes on recent advances in pre-trained, contextualized word embeddings such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2019). Besides offering richer contextual information, BERT in particular has been shown to capture syntactic and semantic properties (Tenney et al., 2019; Clark et al., 2019) relevant to TDP, which we show yield improvements over the original model.

BERT-based Models
Following Zhang and Xue (2018a), we transformed temporal dependency parsing (TDP) into a ranking problem: given a child mention (event or time expression) x_i, the problem is to select the most appropriate parent mention, along with a relation label (before, after, overlap, depends on), from among the root node, DCT, and the events and time expressions in the window x_{i-k}, ..., x_i, ..., x_{i+m} around x_i (we set k = 10 and m = 3 in all experiments). A Temporal Dependency Tree (TDT) is assembled by selecting the highest-ranked (parent, relation type) prediction for each event and time expression in a document, while avoiding cycles.

Figure 2: Model architecture for TDP with three different encoders (orange, blue, green boxes), shown with the (parent, child) input pairs for a given child (event or time expression) x_i. For simplicity, <x_i, root> and <x_i, DCT>, which are included as candidate pairs for all x_i, are not shown.
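The windowed candidate generation described above can be sketched as follows; the function name and list layout are ours, for illustration, not the authors' implementation:

```python
def candidate_parents(mentions, i, k=10, m=3):
    """Candidate parents for the child mentions[i]: the root, DCT, and
    every mention in the window [i - k, i + m] around it, excluding the
    child itself. We set k = 10 and m = 3 in all experiments."""
    window = mentions[max(0, i - k):i] + mentions[i + 1:i + m + 1]
    return ["root", "DCT"] + window
```

Each (candidate parent, child) pair is then scored jointly with each relation label, and the highest-scoring combination is kept.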
As shown in Figure 2, we developed three models that share a similar overall architecture: the model takes a pair of mentions (child and parent) as input and passes each pair through an encoder, which embeds the nodes and their surrounding context into a dense representation. Hand-crafted features are concatenated onto the dense representation, which is then passed to one or two feed-forward layers and a softmax function to generate scores for each relation label for each pair. We tested three types of encoder:
• BiLSTM with non-contextualized embeddings feeds the document's word embeddings (one per word) to a BiLSTM to encode the pair as well as the surrounding context. The word embeddings can be either randomly initialized (identical to Zhang and Xue (2018a)) or pre-trained on a large corpus; we used GloVe (Pennington et al., 2014).
• BiLSTM with frozen BERT embeddings replaces the above word embeddings with frozen (pre-trained) BERT contextualized word embeddings. We used the BERT-base uncased model, which was trained on English Wikipedia and the BookCorpus.
• BERT as encoder: BERT's encoder architecture (with pre-trained weights) is used directly to encode the pairs. Its weights are fine-tuned in the end-to-end TDP training process.
All models use the same loss function and scoring as in Zhang and Xue (2018a).

Model 1: BiLSTM with Frozen BERT
The first model adjusts the model architecture from Zhang and Xue (2018a) to replace its word embeddings with frozen BERT embeddings. That is, word embeddings are computed via BERT for every sentence in the document; then, these word embeddings are processed as in the original model by a BiLSTM. The BiLSTM output is passed to an attention mechanism (which handles events / time expressions with multiple words), then combined with the hand-crafted features (listed in Table 2) and passed to a feed-forward network with one hidden layer, which ranks each relation label for each (possible) parent / child pair.

Model 2: BERT as Encoder
This model takes advantage of BERT's encoding and classification capabilities since BERT uses the Transformer architecture (Vaswani et al., 2017). The embedding of the first token [CLS] is interpreted as a classification output and fine-tuned.
To represent a child-parent pair with context, BERT as encoder constructs a "sentence" for the (potential) parent node and a "sentence" for the child node. These are passed to BERT in that order, separated by BERT's [SEP] token. Each "sentence" consists of the word(s) of the node, the node's label (TIMEX or EVENT), a separator token ':', and the sentence containing the node, as shown in Table 1.
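The input construction can be sketched as plain string assembly; the function names below are our own illustration, and an actual tokenizer would additionally wrap the two segments with [CLS] and [SEP]:

```python
def node_sentence(node_words, node_label, containing_sentence):
    """One pseudo-'sentence': the node's word(s), its label (TIMEX or
    EVENT), the ':' separator, then the sentence containing the node."""
    return f"{node_words} {node_label} : {containing_sentence}"

def pair_input(parent, child):
    """The parent's pseudo-'sentence' first, then the child's, forming
    the two segments of BERT's sentence-pair input."""
    return node_sentence(*parent), node_sentence(*child)
```

For Example 1, the pair (February 27, 1998; signed) would yield two such segments, each pairing the node and its label with the shared containing sentence.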

Additional Features
We used several hand-crafted binary and scalar features (Table 2) in all models, expanding on the features in Zhang and Xue (2018a).
Table 2: Hand-crafted features used in all models.

Node distance features:
• parent is previous node in document
• parent is before child, in same sentence
• parent is before child, more than one sentence away
• parent is after child
• parent and child are in same sentence
• scaled distance between nodes
• scaled distance between sentences

Time expression / event label features:
• child is time expression and parent is root
• child and parent are both time expressions
• child is event and parent is DCT
• parent is padding node
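The features above can be computed from mention positions alone. The following is a minimal sketch under our own conventions; the dictionary keys, mention layout, and the MAX_DIST scaling constant are illustrative choices, not the authors' exact implementation:

```python
def pair_features(child, parent):
    """Sketch of the Table 2 features. Each mention is a dict with
    illustrative keys: 'index' (position among mentions), 'sent'
    (sentence index), 'kind' ('timex', 'event', 'root', 'dct', 'pad').
    MAX_DIST (our choice) scales distances roughly into [0, 1]."""
    MAX_DIST = 14.0  # window size k + m + 1 in our setup
    d_node = parent["index"] - child["index"]
    d_sent = parent["sent"] - child["sent"]
    return {
        "parent_is_previous_node": d_node == -1,
        "parent_before_child_same_sentence": d_node < 0 and d_sent == 0,
        "parent_before_child_far": d_sent < -1,
        "parent_after_child": d_node > 0,
        "same_sentence": d_sent == 0,
        "scaled_node_distance": abs(d_node) / MAX_DIST,
        "scaled_sentence_distance": abs(d_sent) / MAX_DIST,
        "child_timex_parent_root": child["kind"] == "timex" and parent["kind"] == "root",
        "both_timex": child["kind"] == "timex" and parent["kind"] in ("timex", "dct"),
        "child_event_parent_dct": child["kind"] == "event" and parent["kind"] == "dct",
        "parent_is_padding": parent["kind"] == "pad",
    }
```

These binary and scalar values are what get concatenated onto the encoder's dense representation before the feed-forward layers.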

Experiments
We use the training, development and test data from Zhang and Xue (2019) for all experiments. We evaluated four configurations of the encoders above. First, BiLSTM (re-implemented) re-implements the model of Zhang and Xue (2018a) in TensorFlow (Abadi et al., 2016) for a fair comparison. Replacing its randomly initialized embeddings with GloVe (Pennington et al., 2014) yields BiLSTM with GloVe. We also test the models BiLSTM with frozen BERT and BERT as encoder as described in Section 3. We used Adam (Kingma and Ba, 2014) as the optimizer and performed a coarse-to-fine grid search over key parameters such as the learning rate and the number of epochs using the dev set. We observed that when fine-tuning BERT in the BERT as encoder model, a lower learning rate (0.0001) paired with more epochs (75) achieves higher performance, compared to a learning rate of 0.001 with 50 epochs for the BiLSTM models.
Table 3: F1 scores of our models on the test set.

Model                            F1
Baseline (Zhang and Xue, 2019)   0.18
BiLSTM (re-implemented)          0.55
BiLSTM (Zhang and Xue, 2019)     0.60
BiLSTM with GloVe                0.58
BiLSTM with frozen BERT          0.61
BERT as encoder                  0.68

Table 3 summarizes the F1 scores of our models. We also include the rule-based baseline and the performance reported in Zhang and Xue (2019) as baselines.
BiLSTM with frozen BERT outperforms the re-implemented baseline BiLSTM model by 6 points and BiLSTM with GloVe by 3 points in F1 score, respectively. This indicates that the frozen, pre-trained BERT embeddings improve temporal relation extraction compared to either kind of non-contextualized embedding. Fine-tuning the BERT-based encoder (BERT as encoder) resulted in an improvement of 13 absolute F1 points over the BiLSTM re-implementation, and 8 F1 points over the results reported in Zhang and Xue (2019). This demonstrates that contextualized word embeddings and the BERT architecture, pre-trained on large corpora and fine-tuned for this task, can significantly improve TDP.
We also calculated accuracies for each model on time expressions and events subdivided by their type of parent: DCT, a time expression other than DCT, or another event. The difficult categories are children of DCT and children of events. By this breakdown, the main difference between the BiLSTM and BiLSTM with frozen BERT is performance on children of DCT: with BERT, the model scores 0.48 instead of 0.38. Conversely, BERT as encoder sees improvements across the board, with a 0.21 increase over the BiLSTM on children of DCT, a 0.14 increase for children of other time expressions, and a 0.11 increase for children of events.
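This per-parent-type breakdown amounts to grouping each child by its gold parent's type and averaging correctness within each group; a minimal sketch (function name and category labels are our own):

```python
from collections import defaultdict

def accuracy_by_parent_type(examples):
    """Per-category accuracy, grouping children by their gold parent's
    type ('dct', 'timex', or 'event'). examples: iterable of
    (gold_parent_type, correct) pairs, where correct marks whether the
    predicted (parent, label) matched the gold annotation."""
    hits, totals = defaultdict(int), defaultdict(int)
    for parent_type, correct in examples:
        totals[parent_type] += 1
        hits[parent_type] += int(correct)
    return {t: hits[t] / totals[t] for t in totals}
```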

Analysis
Why BERT helps: Comparing the temporal dependency trees produced by the models for the test set, we see that these improvements correspond to the phenomena below.
Firstly, unlike the original BiLSTM, BERT as encoder is able to properly relate time expressions occurring syntactically after the event, such as Kuchma and Yeltsin signed a cooperation plan on February 27, 1998 in Example 1. (The BiLSTM incorrectly relates signed to the "previous" time expression, DCT.) This shows BERT's ability to "look forward", attending to information indicating a parent that appears after the child.
Secondly, BERT as encoder is able to capture verb tense and use it to determine the correct label in almost all cases, both for DCT and for chains of events. It knows that present-tense sentences (share similar cultures) overlap DCT, while past-tense events (was ruled from Moscow) happen either before DCT or before the event immediately adjacent (salient) to them. Similarly, a present participle (saying) may indicate overlapping events.
Thirdly, BERT as encoder captures syntax related to time. It is particularly adept at progressive and reported-speech constructions such as Yeltsin and Kuchma called for the ratification of the treaty, saying [that] it would create . . . , where it identifies that called and saying overlap and that create is after saying. Similarly, BERT's ability to handle syntactic properties (Tenney et al., 2019; Clark et al., 2019) may allow it to detect in which direction adverbs such as since should be applied to the events. This means that while all models may identify the correct parent in these cases, BERT as encoder is much more likely to choose the correct label, whereas the non-contextualized BiLSTM models almost always choose either before for children of DCT or after for children of events.
Lastly, both BERT as encoder and BiLSTM with frozen BERT are much better than the BiLSTM at identifying context changes (new "sections") and linking these events to DCT rather than to a time expression in the previous sections, as evidenced by the scores reported above on children of DCT. Because BERT's word embeddings use the sentence as context, the models using BERT may be able to "compare" the sentences and judge that they are unrelated despite being adjacent.
Equivalent TDP trees: We note that in cases where BERT as encoder is incorrect, it sometimes produces an equivalent or very similar tree (since relations such as overlap are transitive, there may be multiple equivalent ways of arranging the tree). Future work could involve developing a more flexible scoring function to account for this.
Limitations: There are also limitations to BERT as encoder. For example, it is still fooled by syntactic ambiguity. Consider: Example 2: Foreign ministers agreed to set up a panel to investigate who shot down the Rwandan president's plane on April 6, 1994.
A human reading this sentence will infer based on world knowledge that April 6, 1994 should be attached to the subclause who shot down . . . , not to the matrix clause (agreed), but a syntactic parser would produce both parses. BERT as encoder incorrectly attaches agreed to April 6, 1994: even BERT's contextualized embeddings are not sufficient to identify the correct parse.

Conclusion and Future Work
We present two models that incorporate BERT into temporal dependency parsers, and observe significant gains compared to previous approaches. We present an analysis of where and how BERT helps with this challenging task.
For future research, we plan to explore the interaction between the representation learnt by BERT and the hand-crafted features added at the final layer, as well as develop a more flexible scoring function which can handle equivalent trees.