Severing the Edge Between Before and After: Neural Architectures for Temporal Ordering of Events

In this paper, we propose a neural architecture and a set of training methods for ordering events by predicting temporal relations. Our proposed models receive a pair of events within a span of text as input and identify the temporal relation (Before, After, Equal, or Vague) between them. Given that a key challenge for this task is the scarcity of annotated data, our models rely on pretrained representations (i.e., RoBERTa, BERT, or ELMo), transfer and multi-task learning (by leveraging complementary datasets), and self-training techniques. Experiments on the MATRES dataset of English documents establish a new state-of-the-art on this task.


Introduction
The task of temporal ordering of events involves predicting the temporal relation between a pair of input events in a span of text (Figure 1). This task is challenging because it requires a deep understanding of the temporal aspects of language, and annotated data is scarce.
In this paper, we present a set of neural architectures for temporal ordering of events. Our main model (Section 2) is similar to the temporal ordering models designed by Goyal and Durrett (2019), Liu et al. (2019a) and Ning et al. (2019).
Our main contributions are: (1) a neural architecture that can flexibly combine different encoders and pretrained word embedders to form a contextual pairwise argument representation; (2) given the scarcity of training data, an exploration of an existing framework for Scheduled Multi-Task Learning (henceforth SMTL) (Kiperwasser and Ballesteros, 2018) that leverages complementary (temporal and non-temporal) information for our models; this imitates pretraining followed by fine-tuning, and consumes timex information in a different way than Goyal and Durrett (2019); and (3) a self-training method that incorporates the predictions of our model and learns from them, which we test jointly with the SMTL method.
Our baseline model that uses RoBERTa (Liu et al., 2019b) already surpasses the state-of-the-art by 2 F1 points. Applying SMTL techniques affords further improvements with at least one of our auxiliary tasks. Finally, our self-training experiments, also explored via SMTL, establish yet another state-of-the-art, yielding a total improvement of almost 4 F1 points over results from past work.

Our Baseline Model
Our pairwise temporal ordering model receives as input a sequence X_[0,n) of n tokens (or subword units for BERT-like models), i.e. {x_0, x_1, ..., x_{n-1}}, representing the input text. A subsequence span_i is defined by start_i, end_i ∈ [0, n). Subsequences span_1 and span_2 represent the input pair of argument events e1 and e2, respectively. The goal of the model is to predict the temporal relation between e1 and e2. First, the model embeds the input sequence into a vector representation using either static wang2vec representations (Ling et al., 2015) or contextualized representations from ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), or RoBERTa (Liu et al., 2019b). These embedded sequences are then optionally encoded with either LSTMs or Transformers. When BERT or RoBERTa is used to embed the input, we do not use any sequence encoders. The final sequence representation H_[0,n) comprises the individual token representations {h_0, h_1, ..., h_{n-1}}.
While the goal is to predict the temporal relation between span_1 and span_2, the context around these two spans also carries linguistic signals that connect the two arguments. To use this contextual information, we extract five constituent subsequences from the sequence representation H_[0,n): (1) S_1, the subsequence before span_1, i.e. H_[0,start_1); (2) S_2, the subsequence corresponding to span_1, i.e. H_[start_1,end_1); (3) S_3, the subsequence between span_1 and span_2, i.e. H_[end_1,start_2); (4) S_4, the subsequence corresponding to span_2, i.e. H_[start_2,end_2); and (5) S_5, the subsequence after span_2, i.e. H_[end_2,n). Each of these subsequences S_i has a variable number of tokens, which are pooled to yield a fixed-size representation s_i = pool(S_i), where pool concatenates the output of an attention mechanism (we use the word-attention pooling method of Yang et al. (2016) over all tokens in a given span) with mean pooling. The final contextual pair representation c is formed by concatenating the five span representations s_i with a sequence representation r: c = s_1 ⊕ s_2 ⊕ s_3 ⊕ s_4 ⊕ s_5 ⊕ r, where ⊕ denotes concatenation. For models with BERT and RoBERTa, r is the CLS and <s> token representation, respectively, while for other models r = pool(H_[0,n)). This final contextual pair representation c is then projected with a fully connected layer followed by a softmax function to get a distribution over the output classes. The entire model is trained end-to-end using the cross-entropy loss.
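For concreteness, the following is a minimal PyTorch sketch of this contextual pair representation, not the authors' actual implementation; the class names (SpanPooler, PairwiseOrderingHead) and the choice to share one attention scorer across the five spans are our own illustrative assumptions.

import torch
import torch.nn as nn

class SpanPooler(nn.Module):
    """Pools a variable-length span by concatenating word-attention pooling and mean pooling."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.attn_scorer = nn.Linear(hidden_size, 1)

    def forward(self, span: torch.Tensor) -> torch.Tensor:
        # span: (span_len, hidden_size); empty spans would need a fallback (omitted here)
        weights = torch.softmax(self.attn_scorer(span), dim=0)  # (span_len, 1)
        attn_pooled = (weights * span).sum(dim=0)               # (hidden_size,)
        mean_pooled = span.mean(dim=0)                          # (hidden_size,)
        return torch.cat([attn_pooled, mean_pooled], dim=-1)    # (2 * hidden_size,)

class PairwiseOrderingHead(nn.Module):
    """Builds c = s_1 + s_2 + s_3 + s_4 + s_5 + r (concatenation) and projects it to the label set."""
    def __init__(self, hidden_size: int, num_labels: int = 4):
        super().__init__()
        self.pooler = SpanPooler(hidden_size)
        # five pooled spans (2h each) plus the sequence representation r (h)
        self.classifier = nn.Linear(5 * 2 * hidden_size + hidden_size, num_labels)

    def forward(self, H: torch.Tensor, span1, span2, r: torch.Tensor) -> torch.Tensor:
        # H: (n, hidden_size); span1, span2: (start, end) token offsets; r: (hidden_size,)
        (start1, end1), (start2, end2) = span1, span2
        segments = [
            H[:start1],        # S_1: before the first argument
            H[start1:end1],    # S_2: the first argument span
            H[end1:start2],    # S_3: between the two arguments
            H[start2:end2],    # S_4: the second argument span
            H[end2:],          # S_5: after the second argument
        ]
        pooled = [self.pooler(segment) for segment in segments]
        c = torch.cat(pooled + [r], dim=-1)
        return self.classifier(c)  # logits over Before / After / Equal / Vague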

Multi-task Learning
While the model described in the previous section can be directly trained using labeled training data, the amount of annotated training data for this task (in the MATRES dataset) is limited. We enrich our model with useful information from other complementary tasks via SMTL.

Method
We adapt the framework of Kiperwasser and Ballesteros (2018), which uses three schedulers. Each follows a constant, sigmoid, or exponential curve p(t), where p(t) is the probability of picking a batch from the main task, t is the amount of data visited so far during training, and α is a hyperparameter. The constant scheduler splits the batches randomly: at any time step, the model is trained on batches from either the main task or the auxiliary task (p_const(t) = α, 0 ≤ α ≤ 1). The sigmoid scheduler lets the model visit batches from both the auxiliary and the main task at the beginning, while the latest updates are always with batches from the main task (p_sig(t) = 1 / (1 + e^{-αt})). The exponential scheduler starts by visiting only batches from the auxiliary task, while the latest updates are always from the main task (p_exp(t) = 1 − e^{-αt}).
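As a concrete illustration, the following short Python sketch implements the three scheduling curves and the batch-sampling step they control; the function names and the pick_batch helper are illustrative, not part of the original framework's API.

import math
import random

def p_constant(t: float, alpha: float) -> float:
    return alpha                               # fixed chance of a main-task batch

def p_sigmoid(t: float, alpha: float) -> float:
    return 1.0 / (1.0 + math.exp(-alpha * t))  # mixes both tasks early, main task late

def p_exponential(t: float, alpha: float) -> float:
    return 1.0 - math.exp(-alpha * t)          # starts on the auxiliary task only

def pick_batch(t, alpha, main_batches, aux_batches, scheduler=p_sigmoid):
    """Sample the next training batch: main task with probability p(t), auxiliary otherwise."""
    if random.random() < scheduler(t, alpha):
        return random.choice(main_batches)
    return random.choice(aux_batches)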
Following past work, we prepend a trained task vector to the encoder to help the model differentiate between the main and auxiliary tasks (Ammar et al., 2016; Johnson et al., 2017; Kiperwasser and Ballesteros, 2018, inter alia).
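A hedged sketch of this task-vector mechanism, assuming the task vector is a learned embedding prepended to the embedded token sequence before encoding (the module name TaskPrefix is our own):

import torch
import torch.nn as nn

class TaskPrefix(nn.Module):
    """Prepends a learned per-task vector so the encoder can tell the tasks apart."""
    def __init__(self, hidden_size: int, num_tasks: int = 2):
        super().__init__()
        self.task_embeddings = nn.Embedding(num_tasks, hidden_size)

    def forward(self, embedded_tokens: torch.Tensor, task_id: int) -> torch.Tensor:
        # embedded_tokens: (n, hidden_size) -> returns (n + 1, hidden_size)
        task_vector = self.task_embeddings(torch.tensor([task_id]))
        return torch.cat([task_vector, embedded_tokens], dim=0)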

Auxiliary Datasets
We use three different auxiliary datasets in our SMTL setup. The first two have a different taxonomy and label set than MATRES but have gold annotations. The last one is a silver dataset with predicted labels and the same taxonomy as MATRES.
Our first dataset is the ACE relation extraction task. We hypothesize that this task can add knowledge of different domains and of the concept of linking two spans in text given a taxonomy of relations. While this is not directly related to events and is our farthest task in terms of similarity, the pairwise span classification is the reason we include it.

Figure 2: Robert F. Angelo, who (event, left) Phoenix at (timex, the beginning of October).
We also use a closer and complementary temporal annotation dataset: the TimeBank and AQUAINT annotations involving timex relations (timex-event, event-timex, timex-timex) (Ning et al., 2018; Goyal and Durrett, 2019); see http://www.timeml.org/publications/timeMLdocs/timeml_1.2.1.html for the annotation scheme. We expect the model to benefit greatly from being exposed to timex relations in an MTL framework, learning about temporality in general while adding the specificity of the event-event temporal relations from the MATRES annotations. Figure 2 shows an example of the data annotated with an event-timex relation.
We use self-training (Scudder, 1965) to generate our third, silver dataset. This requires unlabeled text, a tagger to extract events from that text, and a classifier to predict temporal relations between pairs of extracted events. As our unlabeled text, we use 6,000 random documents from the CNN / Daily Mail dataset, a collection of news articles collected between 2007 and 2015 (Hermann et al., 2015). We pick 85K segments of text within these documents that contain between 10 and 40 tokens after tokenization. We train a RoBERTa-based tagger (a dense layer on top of the RoBERTa representations) and use it to tag events in these segments, resulting in about 65K events; evaluated by tagging events in the MATRES development set, the tagger reaches an F1 score of 89.5. We consider all 285K pairs of events that lie within a segment as candidates for temporal ordering. Finally, we use our baseline RoBERTa temporal model to classify the temporal relation between these candidate pairs and keep the top two-thirds most confident classifications based on softmax scores, yielding about 190K instances of silver relations.
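The silver-data construction can be summarized by the following sketch; event_tagger and relation_model stand in for the trained RoBERTa-based tagger and the baseline temporal model, and their call signatures are assumptions for illustration only.

from itertools import combinations

def build_silver_data(segments, event_tagger, relation_model, keep_fraction=2 / 3):
    """Tag events, score all within-segment pairs, and keep the most confident two-thirds."""
    scored = []
    for segment in segments:
        events = event_tagger(segment)                           # predicted event spans
        for e1, e2 in combinations(events, 2):                   # all within-segment pairs
            label, confidence = relation_model(segment, e1, e2)  # label and max softmax score
            scored.append((confidence, segment, e1, e2, label))
    scored.sort(key=lambda item: item[0], reverse=True)          # most confident first
    kept = scored[: int(len(scored) * keep_fraction)]
    return [(segment, e1, e2, label) for _, segment, e1, e2, label in kept]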

Experiments and Results
The MATRES dataset is our primary dataset for training and validation. As in previous work, we use TimeBank (https://catalog.ldc.upenn.edu/LDC2006T08) and AQUAINT (https://catalog.ldc.upenn.edu/LDC2002T31), 256 articles in total, for training, 25 of which are selected at random for validation, and Platinum (20 articles) as a held-out test set (Ning et al., 2018; Goyal and Durrett, 2019; Ning et al., 2019). Full-length articles from TimeBank and AQUAINT are about 400 tokens long on average. We believe that the document in its entirety is not required to infer the temporality between a given pair of events; moreover, BERT-style models are often pretrained on shorter inputs than this. For these reasons, we truncate the input text to a window of sentences starting one sentence before the first event argument and ending one sentence after the second event argument, inclusive.
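A small sketch of this truncation heuristic, assuming sentence-segmented input and known sentence indices for the two event arguments (the function name and arguments are illustrative):

def truncate_window(sentences, sentence_of_e1: int, sentence_of_e2: int):
    """Keep the window from one sentence before the first argument to one after the second."""
    first, second = sorted((sentence_of_e1, sentence_of_e2))
    start = max(0, first - 1)
    end = min(len(sentences), second + 2)  # slice end is exclusive, so +2 keeps one extra sentence
    window = sentences[start:end]
    return [token for sentence in window for token in sentence]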
We use one set of hyperparameters for all LSTM models and another set for all Transformer models (both with and without the ELMo embedder). BERT and RoBERTa weights are loaded in place of the Transformer parameters, so they serve as both embedders and encoders. We run our SMTL and self-training experiments with our best baseline model on the development data: the RoBERTa model.
For the SMTL experiments, we explore the α hyperparameter and pick the value that produces the highest scores on our development data.
Finally, we take our best SMTL model on the development data (the constant scheduler with silver data; see Table 1) and continue training its parameters on the gold data only, reducing the learning rate to 10^-6. This is because the model trained in the first step is already in a good state and we want to avoid distorting it with aggressive updates.
Table 1: Results highlighted in bold are the best in each metric. We report the average (and standard deviation) of accuracy and F1 over 5 runs with different random seeds. Since the relation VAGUE does not carry temporal information, we treat it as no relation for the F1 results, as in Ning et al. (2019). For the SMTL experiments, the selected α value is shown in parentheses.

We compare our results (Table 1) with other top-performing systems. First, we observe that among models without contextualized representations, the LSTM encoder is 2.5 F1 points better than the Transformer encoder. Replacing static word representations with ELMo representations leads to significantly worse F1 with the LSTM encoder, but marginally improves the F1 of the Transformer encoder. We attribute this difference to the non-complementary nature of LSTM and ELMo representations: since ELMo is itself LSTM-based, the ELMo+LSTM combination might need more training data to extract meaningful signals. Importantly, however, our base model that uses pretrained RoBERTa surpasses the previous state-of-the-art (Ning et al., 2019), which uses BERT. Our BERT models yield results very similar to theirs. The main differences are that they do not fine-tune BERT along with the updates to the model while we do, and that we explicitly model the context around the argument spans as part of S_1, S_3, and S_5. RoBERTa is likely better than BERT in this case because it has been trained longer, over more data, and over longer sequences; this matters because our temporal ordering model usually takes into account a long span in which both events occur.
The SMTL experiments show that the auxiliary task with timex annotations provides non-negligible improvements of almost 1 F1 point on top of our RoBERTa model. Learning from the timex annotations makes our model more aware of time relations and thus better at ordering events in time. The sigmoid and exponential schedulers perform better than the constant scheduler, suggesting that the model benefits from first learning about temporality in general and then specializing in predicting temporal ordering relations later in training. We believe this timex multi-tasking setup is an implicit yet effective way to teach our model about timexes without the explicit timex embeddings used by Goyal and Durrett (2019). When we use the ACE relation extraction dataset as an auxiliary task, none of the schedulers produce improvements, and the sigmoid and exponential schedulers fare significantly worse. This result suggests that if the tasks differ too much, SMTL might not be a helpful strategy.
The self-training experiments (including SMTL with silver data) show that the silver data helps reach better performance, with the constant scheduler performing best. Furthermore, fine-tuning the best model (according to the development-set score, which in this case coincides with the test-set score) on the gold data gives another boost in performance, establishing a new state of the art on the task that is 2.7 F1 points better than our RoBERTa baseline and almost 4 points better than previously published results.

Conclusions and Future Work
This paper presents neural architectures for ordering events in time. It establishes a new state-of-the-art on the task through pretrained representations, complementary tasks leveraged via SMTL, and self-training techniques.
In the future, instead of using the RoBERTa baseline model for the self-training experiments, we could run several iterations by retraining on the data produced by our best self-trained model(s); this could be a good avenue for further improvements. In addition, we plan to extend our work by moving to languages beyond English (which we have not yet tried due to lack of data) using cross-lingual models (Subburathinam et al., 2019), applying other architectures such as CNNs (Nguyen and Grishman, 2015), incorporating tree structure into our models (Miwa and Bansal, 2016), and/or jointly performing event recognition and temporal ordering (Li and Ji, 2014; Katiyar and Cardie, 2017).